Category Archives: Databases

ChEMBL 11 Schema

Since I got quite few positive remarks & comments (thank you) about the visualization of the ChEMBL 09 database schema, here is a new version showing the ChEMBL 11 schema. The schema has been created the same way as in the previous post.

So, feel free to find the differences :-) – I am going to do it myself in order to update pychembl to ChEMBL version 11 (I also hope that the ChEMBL team doesn’t put version 12 out to soon, they create a lot of pressure there :-) ). Here is the ChEMBL 11 schema also as pdf.

Using pychembl (3) – Active & Parent Molecules


A quite interesting table in ChEMBLdb, also linked to table molecule_dictionary by the mutual primary key molregno, is table molecule_hierarchy. As the name suggests, it stores hierarchical relationships between row entries in table molecule_dictionary and provides a linkage to the parent and active form of a molecule if available in ChEMBLdb.

But first of all, let us load an example molecule from the database again:

> molecule = chembldb.query(MoleculeDictionary).filter(MoleculeDictionary.molregno==47340).one()

Like shown in previous posts, this delivers a MoleculeDictionary object:

> print molecule
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374d750>

The following two command lines first walk to table molecule_hierarchy using the pre-defined table relationship hierarchy. From there an immediate jump back to table molecule_hierarchy is performed either using the named relationship parent or active. Both calls again provide a MoleculeDictionary object, however, this time the corresponding object represent either the parent structure or the active form of the original molecule.

> print molecule.hierarchy.parent
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374dbd0>
> print molecule.hierarchy.active
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374de90>

A really cool feature of SQLAlchemy is, that it allows to pre-define relationships which can walk over more than one actual table-to-table relationship (more examples of this will come in future posts). In the example shown here, this allows us to eliminate the explicit call of the hierarchy relationship of the MoleculeDictionary object (“hierarchy” is a hard word to type anyway :-) ). Internally, these new relationships follow the walk over the same relationship paths as just shown, but provide the attributes “parent” and “active” as direct attributes of the object stored in variable molecule:

> print molecule.parent
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374dbd0>
> print molecule.active
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374de90>

And in the same fashion as described in the pychembl (2) post, we can ask now either for attributes of the original molecule, the parent molecule or the active form of the original molecule:

> print molecule.pref_name
% TAMOXIFEN CITRATE
> print molecule.parent.pref_name
% TAMOXIFEN
> print molecule.active.pref_name
% 4-HYDROXYTAMOXIFEN

…, or follow the structure relationship to structural information of each of the three molecules:

> print molecule.structure.canonical_smiles
% CC\C(=C(/c1ccccc1)\c2ccc(OCCN(C)C)cc2)\c3ccccc3.OC(=O)CC(O)(CC(=O)O)C(=O)O
> print molecule.parent.structure.canonical_smiles
% CC\C(=C(/c1ccccc1)\c2ccc(OCCN(C)C)cc2)\c3ccccc3
> print molecule.active.structure.canonical_smiles
% CC\C(=C(/c1ccc(O)cc1)\c2ccc(OCCN(C)C)cc2)\c3ccccc3

…, or ask for properties of the corresponding CompoundProperty object:

> print molecule.property.hba
% 2
> print molecule.parent.property.hba
% 2
> print molecule.active.property.hba
% 3

Makes walking through ChEMBLdb pretty easy, doesn’t it?
Markus

Note: In case you already had installed pychem earlier, please pull/download it again from GitHub since I added the new relationships for a MoleculeDictionary object.

Using pychembl (2) – Table Relationships

Let’s continue the example from the pychembl(1) post where we loaded a molecule from ChEMBL’s molecule dictionary table:

molecule = chembldb.query(MoleculeDictionary).filter(MoleculeDictionary.molregno==675049).one()

As you can see from partial ChEMBL database schema above, table molecule_dictionary uses molregno as primary key. On the right side of this table you find two other tables which have the same primary key: table compound_structures and table compound_properties. Hence, the relationships between these two table pairs can each be considered as a so-called one-to-one relationships (although they have been visualized as one-to-many relationship by the software I used to create the image of the ChEMBL database schema). This means, each entry (or row) in table molecule_dictionary either has at most a single corresponding entry in each of these two other tables – it can not occur that more than one row in each table corresponds to the same molregno as this would violate the primary key constraints there.

If you look at the row counts in all three tables, it shows that the majority of rows in table molecule_dictionary have corresponding entries both in table compound_structures and compound_properties:

> print chembldb.query(MoleculeDictionary).count()
% 658075
> print chembldb.query(CompoundStructures).count()
% 657736
> print chembldb.query(CompoundProperties).count()
% 657736

Generally, in order to bring together rows in a database table with corresponding rows in related tables, a join between all involved tables as well as the specification of appropriate join conditions is needed if “regular” sql is used. SQLAlchemy allows you to pre-define and name frequently used relationships between tables. Any pre-defined relationships are available in addition to the column attributes of a table row (e.g. molregno, pref_name, chembl_id, max_phase which have been auto-loaded from the database as discussed in the previous pychembl (1) post).

Pychembl provides the pre-defined relationship attributes property and structure for table molecule_dictionary allowing to retrieve related row objects from the corresponding tables (as the above database schema cutout shows, there are presumably more relationships to other tables, but for today’s example I will restrict myself to these two):

> p = molecule.property
> print p
% <pychembl.db.auto_schema.CompoundProperties object>
> s = molecule.structure
> print s
% <pychembl.db.auto_schema.CompoundStructures object>

These are the same row objects you would gather by querying the database in the following way:

> chembldb.query(CompoundProperties).filter(CompoundProperties.molregno==675049).one()
% <pychembl.db.auto_schema.CompoundProperties object>
> chembldb.query(CompoundStructures).filter(CompoundStructures.molregno==675049).one()
% <pychembl.db.auto_schema.CompoundStructures object>

Like for MoleculeDictionary objects, all column attributes for CompoundProperties objects and CompoundStructure objects have been auto-loaded from the database and are accessible as python object attributes:

> print p.alogp
% -0.383
> print p.hba
% 15
> print p.acd_most_apka
% 1.692
> print s.standard_inchi
% InChI=1S/C17H17N7O8S4.2Na/c1-23-16(20-21-22-23)34....
> print s.canonical_smiles
% [Na+].[Na+].CO[C@]1(NC(=O)C2SC(=C(C(=O)N)C(=O)[O-])S2)[C@H]3SCC(=C(N3C1=O)C(=O)[O-])CSc4nnnn4C
> print s.acd_most_apka
% C17 H15 N7 O8 S4 . 2 Na

And since python is a greatly designed language, it allows you to directly access this attributes also from the MoleculeDictionary object which we had previously stored in variable molecule, i.e. there is no need to intermediately create a CompoundProperties or CompoundStructures object:

> print molecule.property.alogp
% -0.383
> print molecule.property.hba
% 15
> print molecule.property.acd_most_apka
% 1.692
> print molecule.structure.standard_inchi
% InChI=1S/C17H17N7O8S4.2Na/c1-23-16(20-21-22-23)34....
> print molecule.structure.canonical_smiles
% [Na+].[Na+].CO[C@]1(NC(=O)C2SC(=C(C(=O)N)C(=O)[O-])S2)[C@H]3SCC(=C(N3C1=O)C(=O)[O-])CSc4nnnn4C
> print molecule.structure.acd_most_apka
% C17 H15 N7 O8 S4 . 2 Na

Markus

ChEMBL 09 Schema Visualized

ChEMBL_09 database schema

Today I took a look at the improvements and changes of ChEMBL_09 – to get a better view, I visualized ChEMBL’s database schema (using the mysql version). And after tweaking it for a while and pushing tables around (graphically – not in the database itself, of course) I even managed to organize it in a way that no crossings of all of the database relationships occur. As it might be useful to other people, I publish my “art” work here (well, the real art is what the ChEMBL team has done by putting together this nice database – and the new version looks really nice). Click here for a large image or here for a pdf of the schema.

Well,  there are two things I am not sure about (but they are represented in the schema above as they are in the database): it looks like, table protein_therapeutics and table molecule_dictionary could be linked by their primary keys, and table chembl_id_lookup might be linkable both to table target_dictionary and table assays (by column chembl_id) – but maybe it gets clear after diving into the data … which I will do now.

Markus