ChEMBL 11 Schema

Since I got quite few positive remarks & comments (thank you) about the visualization of the ChEMBL 09 database schema, here is a new version showing the ChEMBL 11 schema. The schema has been created the same way as in the previous post.

So, feel free to find the differences ūüôā – I am going to do it myself in order to update pychembl to ChEMBL version 11 (I also hope that the ChEMBL team doesn’t put version 12 out to soon, they create a lot of pressure there :-)). Here is the ChEMBL 11 schema also as pdf.

More on Chemical Name Resolving

First, we’d like to announce that we have updated OPSIN to version 1.1.0. Secondly, there is a new resolver module available in CIR: ChemSpider provides a name index of excellent quality which you can use now from CIR:

http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles?resolver=name_by_chemspider

Internally, this request is passed through directly to ChemSpider. As we don’t want to forward our entire traffic through ChemSpider’s service, the URL parameter “?resolver=name_by_chemspider” has to be added explicitly to the URL sent to the CIR. If this parameter is not given, the provided name is resolved as previously: first by OSPIN module in CIR, if this fails, by a lookup in the local name index of CIR.

If you want to change the order of this procedure and/or add the lookup at ChemSpider, you can do the following:

http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles?resolver=name_by_chemspider,name_by_opsin,name_by_cir

This attempts to resolver the name “L-alanin” first by chemSpider resolver module, then by the OPSIN resolver module and finally with the database name index of CIR. As the lookup is already successful using the ChemSpider module, CIR stops there and doesn’t apply the other two modules.

If you like to see what all three name resolving modules reply, you have to use the xml representation of CIR:

http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir

If you like to compare whether all three modules return the same structure for a name, you can “hash” the resolved structures using the HASHISY function available in CACTVS:

http://cactus.nci.nih.gov/chemical/structure/L-alanin/hashisy/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir

Fortunately, we get the same hashcode value from each module, but that is not generally true. For instance, the ChemSpider name resolver module returns both forms for¬†“fructose”while the other two modules return only the open-chain form of fructose (and of course, other reasons could be some nasty nasty bug):

http://cactus.nci.nih.gov/chemical/structure/fructose/smiles/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir

Markus

Slides of my Talk at the “9th International Conference on Chemical Structures”

Here are my slides for the talk I gave on the 9th ICCS Conference in Noordwijkerhout:

Other than that, I apologize for being so quiet regarding pychembl, but I am working on continuing this blog post series (the interesting stuff is yet to come) as well as preparing pychembl for working with chembl_10.

Markus

Using pychembl (3) – Active & Parent Molecules


A quite interesting table in ChEMBLdb, also linked to table molecule_dictionary by the mutual primary key molregno, is table molecule_hierarchy. As the name suggests, it stores hierarchical relationships between row entries in table molecule_dictionary and provides a linkage to the parent and active form of a molecule if available in ChEMBLdb.

But first of all, let us load an example molecule from the database again:

> molecule = chembldb.query(MoleculeDictionary).filter(MoleculeDictionary.molregno==47340).one()

Like shown in previous posts, this delivers a MoleculeDictionary object:

> print molecule
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374d750>

The following two command lines first walk to table molecule_hierarchy using the pre-defined table relationship hierarchy. From there an immediate jump back to table molecule_hierarchy is performed either using the named relationship parent or active. Both calls again provide a MoleculeDictionary object, however, this time the corresponding object represent either the parent structure or the active form of the original molecule.

> print molecule.hierarchy.parent
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374dbd0>
> print molecule.hierarchy.active
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374de90>

A really cool feature of SQLAlchemy is, that it allows to pre-define relationships which can walk over more than one actual table-to-table relationship (more examples of this will come in future posts). In the example shown here, this allows us to eliminate the explicit call of the hierarchy relationship of the¬†MoleculeDictionary object (“hierarchy” is a hard word to type anyway ūüôā ). Internally, these new relationships follow the walk over the same relationship paths as just shown, but provide the attributes “parent” and “active” as direct attributes of the¬†object stored in variable molecule:

> print molecule.parent
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374dbd0>
> print molecule.active
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374de90>

And in the same fashion as described in the pychembl (2) post, we can ask now either for attributes of the original molecule, the parent molecule or the active form of the original molecule:

> print molecule.pref_name
% TAMOXIFEN CITRATE
> print molecule.parent.pref_name
% TAMOXIFEN
> print molecule.active.pref_name
% 4-HYDROXYTAMOXIFEN

…, or follow the structure relationship to structural information of each of the three molecules:

> print molecule.structure.canonical_smiles
% CC\C(=C(/c1ccccc1)\c2ccc(OCCN(C)C)cc2)\c3ccccc3.OC(=O)CC(O)(CC(=O)O)C(=O)O
> print molecule.parent.structure.canonical_smiles
% CC\C(=C(/c1ccccc1)\c2ccc(OCCN(C)C)cc2)\c3ccccc3
> print molecule.active.structure.canonical_smiles
% CC\C(=C(/c1ccc(O)cc1)\c2ccc(OCCN(C)C)cc2)\c3ccccc3

…, or ask for properties of the corresponding¬†CompoundProperty object:

> print molecule.property.hba
% 2
> print molecule.parent.property.hba
% 2
> print molecule.active.property.hba
% 3

Makes walking through ChEMBLdb pretty easy, doesn’t it?
Markus

Note: In case you already had installed pychem earlier, please pull/download it again from GitHub since I added the new relationships for a MoleculeDictionary object.

Using pychembl (2) – Table Relationships

Let’s continue the example from the pychembl(1) post where we loaded a molecule from ChEMBL’s molecule dictionary table:

molecule = chembldb.query(MoleculeDictionary).filter(MoleculeDictionary.molregno==675049).one()

As you can see from partial ChEMBL database schema above, table molecule_dictionary uses molregno as primary key. On the right side of this table you find two other tables which have the same primary key: table compound_structures and table compound_properties. Hence, the relationships between these two table pairs can each be considered as a so-called one-to-one relationships (although they have been visualized as one-to-many relationship by the software I used to create the image of the ChEMBL database schema). This means, each entry (or row) in table molecule_dictionary either has at most a single corresponding entry in each of these two other tables – it can not occur that more than one row in each table corresponds to the same molregno as this would violate the primary key constraints there.

If you look at the row counts in all three tables, it shows that the majority of rows in table molecule_dictionary have corresponding entries both in table compound_structures and compound_properties:

> print chembldb.query(MoleculeDictionary).count()
% 658075
> print chembldb.query(CompoundStructures).count()
% 657736
> print chembldb.query(CompoundProperties).count()
% 657736

Generally, in order to bring together rows in a database table with corresponding rows in related tables, a join between all involved tables as well as the specification of appropriate join conditions is needed if “regular” sql is used. SQLAlchemy allows you to pre-define and name frequently used relationships between tables. Any pre-defined relationships are available in addition to the column attributes of a table row (e.g. molregno, pref_name, chembl_id, max_phase which have been auto-loaded from the database as discussed in the previous pychembl (1) post).

Pychembl provides the pre-defined relationship attributes property and structure for table molecule_dictionary allowing to retrieve related row objects from the corresponding tables (as the above database schema cutout shows, there are presumably more relationships to other tables, but for today’s example I will restrict myself to these two):

> p = molecule.property
> print p
% <pychembl.db.auto_schema.CompoundProperties object>
> s = molecule.structure
> print s
% <pychembl.db.auto_schema.CompoundStructures object>

These are the same row objects you would gather by querying the database in the following way:

> chembldb.query(CompoundProperties).filter(CompoundProperties.molregno==675049).one()
% <pychembl.db.auto_schema.CompoundProperties object>
> chembldb.query(CompoundStructures).filter(CompoundStructures.molregno==675049).one()
% <pychembl.db.auto_schema.CompoundStructures object>

Like for MoleculeDictionary objects, all column attributes for CompoundProperties objects and CompoundStructure objects have been auto-loaded from the database and are accessible as python object attributes:

> print p.alogp
% -0.383
> print p.hba
% 15
> print p.acd_most_apka
% 1.692
> print s.standard_inchi
% InChI=1S/C17H17N7O8S4.2Na/c1-23-16(20-21-22-23)34....
> print s.canonical_smiles
% [Na+].[Na+].CO[C@]1(NC(=O)C2SC(=C(C(=O)N)C(=O)[O-])S2)[C@H]3SCC(=C(N3C1=O)C(=O)[O-])CSc4nnnn4C
> print s.acd_most_apka
% C17 H15 N7 O8 S4 . 2 Na

And since python is a greatly designed language, it allows you to directly access this attributes also from the MoleculeDictionary object which we had previously stored in variable molecule, i.e. there is no need to intermediately create a CompoundProperties or CompoundStructures object:

> print molecule.property.alogp
% -0.383
> print molecule.property.hba
% 15
> print molecule.property.acd_most_apka
% 1.692
> print molecule.structure.standard_inchi
% InChI=1S/C17H17N7O8S4.2Na/c1-23-16(20-21-22-23)34....
> print molecule.structure.canonical_smiles
% [Na+].[Na+].CO[C@]1(NC(=O)C2SC(=C(C(=O)N)C(=O)[O-])S2)[C@H]3SCC(=C(N3C1=O)C(=O)[O-])CSc4nnnn4C
> print molecule.structure.acd_most_apka
% C17 H15 N7 O8 S4 . 2 Na

Markus

Using ChemSpider ID as Chemical Structure Identifier

ChemSpider IDs are definitely an important identifier to specify or name a chemical structure. Starting with the beta 4 version of the Chemical Identifier Resolver (CIR), ChemSpider IDs are now accepted both as input identifier as well as output representation. If it is used as input format, it has to be “classified” as ChemSpider ID (as we plan to enable the lookup of more database identifiers using the following format):

http://cactus.nci.nih.gov/chemical/structure/chemspider_id=1234567/smiles

The clue is, that for the conversion step “ChemSpider ID to structure” no local lookup in our databases is performed, but it is converted by connectiong to the ChemSpider’s InChI Resolver. If you want, you can combine this with other methods provided by CIR, for instance, the generation of all tautomers for a ChemSpider ID:

http://cactus.nci.nih.gov/chemical/structure/tautomers:chemspider_id=1234567/smiles
http://cactus.nci.nih.gov/chemical/structure/tautomers:chemspider_id=1234567/image

Or you can “twirl” a ChemSpider ID:

http://cactus.nci.nih.gov/chemical/structure/chemspider_id=1234567/twirl

Of course, it also works in the other direction: the following example starts from an IUPAC name, which internally is converted into a chemical structure by OPSIN, and then is resolved into a ChemSpider ID (which again uses ChemSpider’s InChI Resolver):

http://cactus.nci.nih.gov/chemical/structure/(3ő≤)-cholest-5-en-3-ol/chemspider_id

And some final example: resolve a set of Warfarin tautomers into ChemSpider IDs (unfortunately the ChemSpider InChI Resolver returns also deprecated ChemSpider records):

http://cactus.nci.nih.gov/chemical/structure/tautomers:warfarin/chemspider_id

You can do the same thing by starting from a ChemSpider ID, generate the tautomers and resolve them into a set of ChemSpider IDs again:

http://cactus.nci.nih.gov/chemical/structure/tautomers:chemspider_id=52602/chemspider_id

Markus

Using pychembl (1)

Today, I want to start with some simple examples how to use pychembl. For this, let us walk through the molecule_dictionary table available in ChEMBL.

Well, like in the previous post we first have to import pychembl:

from pychembl.settings import *
from pychembl.db.auto_schema import *

To access the row entries in this table, pychembl’s table mapper class MoleculeDictionary has to be passed to the SQLAlchemy query object available in pychembl (or chembldb, respectively):

> molecules = chembldb.query(MoleculeDictionary)
> print type(molecules)
% <class 'sqlalchemt.orm.query.Query>

Let us count how many are present:

> print molecules.count()
% 658075

If you like to access a specific molecule identified by its ChEMBL molregno (the primary key in this table), one of following filter statements can be used (these are all alternative ways to do it):

> molecule = molecules.filter(MoleculeDictionary.molregno==675049).all()
> molecule = molecules.filter(MoleculeDictionary.molregno==675049).one()
> molecule = molecules.filter(MoleculeDictionary.molregno==675049).first()
> molecule = molecules.get((675049,))

The .all() method returns a python list object with all matching row objects. As it is already clear, that the statement will return only a single object, the .one() method retrieves only this object without generating a list. However, a request using .one() will generate an error message in case the filter criterion would return more than one object. This can be avoided by using the .first() method, which definiteltly returns only the first object regardless of how many rows in the table were matching the filter criterion. Finally, the .get() method can be used if a row is identified by its primary key – it expects a python tuple object as input (if a multi-column primary key is used in a table, the tuple has to contain the corresponding number of elements).

Of course, you can also do the creation of the query object and the definition of the filter criterion as a single statement, e.g. like this:

molecule = chembldb.query(MoleculeDictionary).filter(MoleculeDictionary.molregno==675049).one()

Accessing the attributes of a row object, i.e. the attributes of the molecule we just fetched from the database, is simple:

print molecule.molregno
% 658075
print molecule.pref_name
% CEFOTETAN DISODIUM
print molecule.chembl_id
% CHEMBL1201098
print molecule.first_approval
% 1984
print molecule.natural_product
% 1

All attribute names available for an object (e.g. molregno, chembl_id, first_approval, etc., see table molecule_dictionary in the schema) are auto-loaded from the database, hence are not changed by pychembl in any form. The python datatype of a returned attribute is according the column datatype as specified in the database (these are also auto-loaded).

With the statement shown earlier

> molecules = chembldb.query(MoleculeDictionary)

you make each query fetching a MoleculeDictionary object for each matching database row. In order to retrieve only certain attributes of a molecule, you can name the attributes:

> molecules = chembldb.query(MoleculeDictionary.chembl_id, MoleculeDictionary.chebi_id)

From this, for instance, you can very easily generate a python dictionary associating the ChEMBL ID with its corresponding ChEBI ID (we restrict it here to the first five):

> molecules = chembldb.query(MoleculeDictionary.chembl_id, MoleculeDictionary.chebi_id)
> chembl_to_chebi_id_dictionary = dict(molecules.limit(5).all())
> print chembl_to_chebi_id
% {'CHEMBL6328': 100002L, 'CHEMBL6329': 100001L, 'CHEMBL267864': 100005L, 'CHEMBL6362': 100004L, 'CHEMBL265667': 100003L}

Markus

pychembl: ChEMBLdb pythonified using SQLAlchemy

I’d like to present a new (still little) side project: pychembl (@GitHub)! Pychembl connects two great things with each other: the ChEMBL database (developed by the ChEMBL team) and python. Pychembl is implemented using SQLAlchemy, which is a very powerful library for linking python with a SQL database. For this, SQLAlchemy follows a very python-like syntax and, on other hand, supports many different SQL engines and dialects;¬†hence, it allows you to design and access a database in/from python while you don’t have to care about the underlying SQL engine.

For the installation of pychembl, please follow the instructions given in the README file.

If you have finished the installation, you can import pychembl into python in the following way:

from pychembl.settings import *
from pychembl.db.auto_schema import *

This gives you access to the full database schema which I had generated earlier from ChEMBL’s 09 unmodified sql dump (see here).

To give you an idea what you can do with pychembl, here are two quick examples (I will post more in the following days … to weeks).

The first, very simple example fetches the first 1000 assay records from ChEMBLdb, containing the word “human” in their description. They are delivered in chunks of 25 records; as output just the description field is printed:

assays = chembldb.query(Assays).filter(Assays.description.like('%human%'))
for assay in assays.limit(1000).yield_per(25):
    print "- %s" % (assay.description)

A more complex example selects a biological target from ChEMBLdb’s target dictionary table (I chose ‘kallikrein 14’ to avoid a too overwhelming result at the end of the example). A great advantage of SQLAlchemy’s ORM is that you can predefine and name¬†relationships¬†between tables and rows, respectively. A relationship can be based on foreign key relationship already present in the database (see the¬†connecting lines between the tables in the ChEMBL schema), or can also be added at the python/SQLAlchemy level. Any of the relationships is then available as¬†additional¬†attribute of a row object fetched from the database. In the example below, “assays” is such an attribute available for each “target” object and retrieves all assays¬†related¬†to specific biological target available in the database. Likewise, for each assay all activities and their corresponding structures and their properties (e.g. canonical SMILES) can be fetched from the database:

targets = chembldb.query(TargetDictionary)\
    .filter(TargetDictionary.pref_name=='kallikrein 14')
for target in targets.all():
    for assay in target.assays:
        for activity in assay.activities:
           print "%s : activity %s %s %s : %s" % (
               target.description
               activity.relation,
               activity.published_value,
               activity.published_units,
               activity.molecule.structure.canonical_smiles
           )

Here is the output it generates (it is also available as part of the GitHub repository):

Kallikrein-14: activitiy = 240.0 nM : NC(=N)NCCCC(NC(=O)CN1CCN(CC1=O)S(=O)(=O)c2cccc3cccnc23)C(=O)c4nccs4
Kallikrein-14: activitiy = 677.0 nM : NC(=N)NCCC[C@@H](NC(=O)CN1CCN(CC1=O)S(=O)(=O)Cc2ccccc2)C(=O)c3nccs3
Kallikrein-14: activitiy = 27.0 nM : NC(=N)NCCC[C@@H](NC(=O)CNC(=O)[C@@H](CCCNC(=N)N)NS(=O)(=O)Cc1ccccc1)C(=O)c2nccs2
Kallikrein-14: activitiy = 9.4 nM : Cc1c(sc2ccc(Cl)cc12)S(=O)(=O)N3CCN(CC(=O)NC(CCCN=C(N)N)C(=O)c4nccs4)C(=O)C3
Kallikrein-14: activitiy = 390.0 nM : NC(=N)NCCCC(NC(=O)CN1CCN(CC1=O)S(=O)(=O)c2ccc3ccccc3c2)C(=O)c4nccs4
Kallikrein-14: activitiy = 406.0 nM : NC(=N)NCCC[C@@H](NC(=O)CN1CCN(CC1=O)S(=O)(=O)c2ccc(Cl)cc2)C(=O)c3nccs3
Kallikrein-14: activitiy = 34.0 nM : NC(=N)NCCCC(NC(=O)CN1CCN(CC1=O)S(=O)(=O)c2ccc3c(Cl)cccc3c2)C(=O)c4nccs4

The (more or less) corresponding SQL command to this python/pychembl walk-trough is here (and although it does not look too horrific, from my experience, it can get so easily and is less intuitive than the python script above):

select t.pref_name, ac.relation, ac.published_value, ac.published_units,
    s.canonical_smiles
from target_dictionary t
join assay2target a2t on t.tid = a2t.tid
join assays a on a2t.assay_id = a.assay_id
join activities ac on ac.assay_id = a.assay_id
join molecule_dictionary m on ac.molregno = m.molregno
join compound_structures s on m.molregno = s.molregno
where t.pref_name = "kallikrein 14";

Here is a visual representation of how the python/pychembl script walked through ChEMBL (the schema is a cutout of the relevant parts of the full ChEMBL schema):

So, this was a first announcement of this project – further posts will follow. Please regard it as an early beta release and, please, give feedback if you find bugs or have suggestions, etc.

Thanks,
Markus

OPSIN & Chemical Identifier Resolver: Resolving IUPAC Names

Daniel M. Lowe at the Unilever Centre for Molecular Science Informatics (University of Cambridge) and I have collaborated to integrate his very nice OPSIN software package into the Resolver (alternatively to their own web service).¬†OPSIN was initially started by¬†Peter Corbett in Peter Murray Rust’s group, however, Daniel is responsible for the development of version 1.0.0 released recently and published in JCIM.

OPSIN is an Open Source Java library that allows parsing of systematic IUPAC names and converting them into a full structure representation. Our Resolver so far attempts the same thing by a simple lookup in a large name index stored in its database (and admittedly, some parts the quality of this name index is mediocre). The lookup of names in a database, of course, works less systematically than OPSIN (as only those names available in the database can be retrieved), however, it has the advantage that also trivial names that do not follow a systematic nomenclature can be converted into a full structure representation if they are present in the database. So Daniel and I thought, combining both things would generate a very powerful tool for name-to-structure conversion.

How it works

The IUPAC name “spiro[1,2-benzodithiole-3,2′-[1,3]benzodithiole]” can only be resolved by OPSIN and is not available in the Resolver name index. Starting with the beta 4 version (to which we switched over yesterday), the Resolver automatically uses now also OPSIN, e.g.:

http://cactus.nci.nih.gov/chemical/structure/spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]/image
http://cactus.nci.nih.gov/chemical/structure/spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]/smiles

A name example only resolvable by the Resolver’s name index is “Warfarin“:

http://cactus.nci.nih.gov/chemical/structure/warfarin/image

As you can see from these URLs, no explicit specification is required, whether OPSIN or the database lookup should be used.

However, if you want to make sure that a specific method is applied, you need to specify the corresponding resolver module¬†explicitly¬†(see “?resolver” query parameter “name_by_opsin” or “name_by_database“):

http://cactus.nci.nih.gov/chemical/structure/hex-1-yne/image?resolver=name_by_opsin
http://cactus.nci.nih.gov/chemical/structure/hex-1-yne/image?resolver=name_by_database

Alternatively, if you like the Resolver to tell you which one of two name resolving modules has worked for a specific name, you can use the xml format (it returns the applied resolver module as one of the XML tag attributes):

http://cactus.nci.nih.gov/chemical/structure/spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]/smiles/xml
http://cactus.nci.nih.gov/chemical/structure/warfarin/smiles/xml
http://cactus.nci.nih.gov/chemical/structure/hex-1-yne/smiles/xml

As Daniel’s web page of name examples shows, OPSIN accepts also greek (unicode) characters – hence, we enhanced the Resolver to do the same thing:

http://cactus.nci.nih.gov/chemical/structure/(3ő≤)-cholest-5-en-3-ol/image

Also more complex names (e.g. “pentacyclo[13.7.4.33,8.018,20.113,28]triacontane”) should be URL-encoded as Daniel’s examples show (see “von Baeyer systems”):

http://cactus.nci.nih.gov/chemical/structure/pentacyclo%5B13.7.4.3^3%2C8.0^18%2C20.1^13%2C28%5Dtriacontane/image

Well, and finally – to get some graphics in here – let’s twirl around “L-alanyl-L-glutaminyl-L-arginyl-O-phosphono-L-seryl-L-alanyl-L-proline” converted by OPSIN into a structure (3D coordinates are calculated by CORINA):

I hope you find it helpful,
Markus