OPSIN 1.2

We have updated the OPSIN library used by CIR to version 1.2. Daniel lists the following improvements in his release notes:

  • Basic support for cyclised carbohydrates e.g. alpha-D-glucopyranose
  • Basic support for systematic carbohydrate stems e.g. D-glycero-D-gluco-Heptose
  • Added heuristic for correcting esters with omitted spaces
  • Added support for xanthates/xanthic acid
  • Minor vocabulary improvements
  • Fixed a few minor bugs/limitations in the Cahn-Ingold-Prelog rules implementation and made more memory efficient
  • Many minor improvements and bug fixes

Nice Work!

Markus

Posted in Chemical Identifier Resolver, OPSIN

Chemical Name Pattern Searching (or even more on chemical Name Searching)

There is a new resolver module available at CIR (Chemical Identifier Resolver): “name patterns”. It allows Google-like searches on our name index, which currently holds approx. 70 million names. Internally, the module regards a chemical name, for instance “(p-Nitrobenzoyl)acetone”, as a sentence consisting of the three words “p”, “nitrobenzoyl”, and “acetone”. The module is still experimental, i.e. it might not work at times, but we would be happy to hear how useful it is. The search for chemical names is performed by the Sphinx SQL full-text search engine.
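To make the “name as a sentence” idea a bit more concrete, here is a minimal Python sketch that splits a name into lowercase words at every non-alphanumeric character. It is only an approximation for illustration: the real tokenization is done by Sphinx and depends on its configuration, and the function name split_name_into_words is made up for this example.

```python
import re

def split_name_into_words(name):
    """Split a chemical name into lowercase 'words' (illustrative only)."""
    # Every run of letters/digits counts as one word; brackets, dashes and
    # other punctuation act as word separators.
    return re.findall(r"[a-z0-9]+", name.lower())

words = split_name_into_words("(p-Nitrobenzoyl)acetone")
print(words)  # ['p', 'nitrobenzoyl', 'acetone']
```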

Before I show some examples, please note the following: for performance reasons, CIR does not use the name pattern module by default; it has to be requested explicitly with the resolver URL parameter “resolver”, e.g.

http://cactus.nci.nih.gov/chemical/structure/morphine/smiles/xml?resolver=name_pattern

This is the same behavior we described in this earlier post for the “name by chemspider” resolver module. The URL shown above requests the SMILES strings for all structures containing the word “morphine” in their name. The result is always restricted to the 100 most relevant entries (the relevance of a name is determined by Sphinx).

The usage of the name pattern module can also be combined with any other CIR module, e.g. the following request includes the “name by chemspider” module (for a full name match) and the “name pattern” module:

http://cactus.nci.nih.gov/chemical/structure/morphine/smiles/xml?resolver=name_by_chemspider,name_pattern
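If you assemble such URLs in a script, a small helper keeps the pieces apart. The following is a minimal sketch using only the Python standard library; the helper name build_cir_url and its arguments are my own invention and not part of CIR, and I assume here that the server accepts percent-encoded identifiers in the URL path.

```python
from urllib.parse import quote, urlencode

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"

def build_cir_url(identifier, representation, resolvers=(), xml=False):
    """Assemble a CIR request URL (hypothetical helper, not part of CIR)."""
    # Percent-encode the identifier so that spaces, quotes or '+' signs
    # end up as literal characters in the URL path.
    url = "/".join([CIR_BASE, quote(identifier, safe=""), representation])
    if xml:
        url += "/xml"
    if resolvers:
        # Several resolver modules are given as one comma-separated list.
        url += "?" + urlencode({"resolver": ",".join(resolvers)}, safe=",")
    return url

url = build_cir_url("morphine", "smiles",
                    resolvers=["name_by_chemspider", "name_pattern"], xml=True)
print(url)
```

For the arguments shown, this prints exactly the combined request URL from above.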

If you look at the XML returned by CIR, the name to which the name pattern successfully has been matched is available from the notation attribute of a data item:

<data id="3" resolver="name_pattern" string_class="Chemical Name Pattern" notation="Morphine N-oxide">
   <item id="1">[C@@]125C3=C4C[C@H]([C@@H]1C=C[C@@H]([C@@H]2OC3=C(C=C4)O)O)[N+](C)([O-])CC5</item>
</data>
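To pull the matched name and the result strings out of such a response, the standard library XML parser is enough. This is a minimal sketch that works on the data element shown above as a stand-alone fragment; the function name matched_names is made up, and I only rely on the attributes visible in the snippet (resolver, notation) rather than on the full layout of the CIR response.

```python
import xml.etree.ElementTree as ET

# The <data> element from the post, used here as a stand-alone fragment.
fragment = """
<data id="3" resolver="name_pattern" string_class="Chemical Name Pattern" notation="Morphine N-oxide">
   <item id="1">[C@@]125C3=C4C[C@H]([C@@H]1C=C[C@@H]([C@@H]2OC3=C(C=C4)O)O)[N+](C)([O-])CC5</item>
</data>
"""

def matched_names(xml_text):
    """Return (matched name, list of result strings) for each name_pattern hit."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    # iter() also yields the root itself, so this handles a single <data>
    # element as well as a document containing many of them.
    for data in root.iter("data"):
        if data.get("resolver") == "name_pattern":
            items = [item.text for item in data.findall("item")]
            hits.append((data.get("notation"), items))
    return hits

hits = matched_names(fragment)
for name, smiles in hits:
    print(name, smiles)
```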

Examples

Now the cool things – all following examples request the SMILES strings for all structures whose names contain:

  • the words “morphine” and “methyl” (name pattern: ‘+morphine +methyl‘):
http://cactus.nci.nih.gov/chemical/structure/+morphine +methyl/smiles/xml?resolver=name_pattern
  • the words “morphine” and “methyl” but not “ester” (name pattern: ‘+morphine +methyl -ester‘):
http://cactus.nci.nih.gov/chemical/structure/+morphine +methyl -ester/smiles/xml?resolver=name_pattern
  • the substring “morphine” as a word end (name pattern: ‘*morphine‘):
http://cactus.nci.nih.gov/chemical/structure/*morphine/smiles/xml?resolver=name_pattern
  • the substring “morphine” somewhere in the name (name pattern: ‘*morphine*‘) :
http://cactus.nci.nih.gov/chemical/structure/*morphine*/smiles/xml?resolver=name_pattern
  • the substring “morphine” and the literal string “3-methyl ether” (name pattern: ‘+*morphine* +"3-methyl ether"‘):
http://cactus.nci.nih.gov/chemical/structure/+*morphine* +"3-methyl ether"/smiles/xml?resolver=name_pattern

Fancy things

  • a single character “m” and the word “benzene” in a maximum distance of 3 words (nice to find smaller aromatic ring systems, name pattern: ‘“m benzene”~3‘):
http://cactus.nci.nih.gov/chemical/structure/"m benzene"~3/smiles/xml?resolver=name_pattern
  • the words “magnesium” or “sodium” and the word “chloride” (name pattern: ‘(magnesium|sodium) +chloride‘)
http://cactus.nci.nih.gov/chemical/structure/(magnesium|sodium) +chloride/smiles/xml?resolver=name_pattern
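The operators used in these examples (+ for a required word, - for an excluded word, double quotes for a literal phrase) are easy to put together programmatically. Here is a minimal sketch; the function name_pattern and its arguments are hypothetical and only cover the three operators mentioned, not the proximity or OR syntax.

```python
def name_pattern(required=(), excluded=(), phrases=()):
    """Compose a search pattern from word lists (hypothetical helper)."""
    parts = ["+" + word for word in required]              # word must occur
    parts += ['+"' + phrase + '"' for phrase in phrases]   # literal phrase
    parts += ["-" + word for word in excluded]             # word must not occur
    return " ".join(parts)

print(name_pattern(required=["morphine", "methyl"], excluded=["ester"]))
print(name_pattern(required=["*morphine*"], phrases=["3-methyl ether"]))
```

The resulting string still has to be placed in the URL path; spaces and quotes in it should be percent-encoded by whatever HTTP client you use.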

(Long) Chemical names

The name pattern module can also be used to match full chemical names, however performance is not so good. Additionally, if you want to search, e.g.,  for ‘[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]-[4-(difluoromethoxy)benzyl]-methyl-ammonium‘ you have to quote it, otherwise dashes are seen as minus operators (excluding the next word). So here is the correct way for a full name request:

http://cactus.nci.nih.gov/chemical/structure/"[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]-[4-(difluoromethoxy)benzyl]-methyl-ammonium"/smiles/xml?resolver=name_pattern

However, for better performance, and because the exact spelling used in the database is uncertain, it is probably smarter to ignore the punctuation and try it this way (which only includes the important chemical syllables of the name above, linked by + operators; name pattern: ‘+"(1R)-1-benzyl" +keto +propyl +amino +ethyl +difluoromethoxy +ammonium‘):

http://cactus.nci.nih.gov/chemical/structure/"(1R)-1-benzyl+" +keto +propyl +amino +ethyl +difluoromethoxy +ammonium/smiles/xml?resolver=name_pattern
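Reducing a long name to its important syllables can be roughly automated. The sketch below is a crude heuristic of my own, not what CIR does internally: it keeps every run of letters with at least four characters, drops duplicates, and links the rest with + operators. Unlike the hand-made pattern above, it throws away locants and stereo descriptors such as (1R) and keeps “methyl”.

```python
import re

def keyword_pattern(name, min_length=4):
    """Turn a long name into a '+word +word ...' pattern (rough sketch)."""
    seen = []
    # Keep only runs of letters; locants, brackets and dashes are dropped.
    for fragment in re.findall(r"[a-z]+", name.lower()):
        if len(fragment) >= min_length and fragment not in seen:
            seen.append(fragment)
    return " ".join("+" + fragment for fragment in seen)

long_name = ("[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]-"
             "[4-(difluoromethoxy)benzyl]-methyl-ammonium")
pattern = keyword_pattern(long_name)
print(pattern)
# +benzyl +keto +propyl +amino +ethyl +difluoromethoxy +methyl +ammonium
```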

I hope it is useful,
Markus

Posted in Chemical Identifier Resolver

ChEMBL 11 Schema

Since I got quite a few positive remarks and comments (thank you) about the visualization of the ChEMBL 09 database schema, here is a new version showing the ChEMBL 11 schema. The schema has been created the same way as in the previous post.

So, feel free to find the differences :-) – I am going to do it myself in order to update pychembl to ChEMBL version 11 (I also hope that the ChEMBL team doesn’t put version 12 out too soon, they create a lot of pressure there :-) ). Here is the ChEMBL 11 schema also as PDF.

Posted in Databases

More on Chemical Name Resolving

First, we’d like to announce that we have updated OPSIN to version 1.1.0. Secondly, there is a new resolver module available in CIR: ChemSpider provides a name index of excellent quality which you can use now from CIR:

http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles?resolver=name_by_chemspider

Internally, this request is passed directly through to ChemSpider. As we don’t want to forward our entire traffic through ChemSpider’s service, the URL parameter “?resolver=name_by_chemspider” has to be added explicitly to the URL sent to CIR. If this parameter is not given, the provided name is resolved as before: first by the OPSIN module in CIR and, if this fails, by a lookup in the local name index of CIR.

If you want to change the order of this procedure and/or add the lookup at ChemSpider, you can do the following:

http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles?resolver=name_by_chemspider,name_by_opsin,name_by_cir

This attempts to resolve the name “L-alanin” first with the ChemSpider resolver module, then with the OPSIN resolver module, and finally with the database name index of CIR. As the lookup is already successful using the ChemSpider module, CIR stops there and doesn’t apply the other two modules.
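The “try the modules in order and stop at the first success” behavior can be sketched in a few lines of Python. Everything here is a stand-in: the three lookup functions return toy placeholder strings instead of querying ChemSpider, OPSIN or the CIR name index, and resolve is not the actual CIR implementation.

```python
def resolve(name, resolvers):
    """Try each resolver in order and stop at the first one that succeeds."""
    for module_name, lookup in resolvers:
        result = lookup(name)
        if result:  # a non-empty result ends the chain
            return module_name, result
    return None, []

# Stand-in lookup functions with toy data (assumptions for illustration).
def by_chemspider(name):
    return ["<structure from ChemSpider>"] if name == "L-alanin" else []

def by_opsin(name):
    return ["<structure from OPSIN>"]

def by_cir(name):
    return ["<structure from CIR index>"]

chain = [("name_by_chemspider", by_chemspider),
         ("name_by_opsin", by_opsin),
         ("name_by_cir", by_cir)]

print(resolve("L-alanin", chain))        # answered by the first module
print(resolve("something else", chain))  # falls through to the second one
```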

If you would like to see what all three name resolving modules reply, you have to use the XML representation of CIR:

http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir

If you would like to compare whether all three modules return the same structure for a name, you can “hash” the resolved structures using the HASHISY function available in CACTVS:

http://cactus.nci.nih.gov/chemical/structure/L-alanin/hashisy/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir

Fortunately, we get the same hashcode value from each module, but that is not generally true. For instance, the ChemSpider name resolver module returns both forms for “fructose” while the other two modules return only the open-chain form of fructose (and of course, another reason could be some nasty bug):

http://cactus.nci.nih.gov/chemical/structure/fructose/smiles/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir

Markus

Posted in Chemical Identifier Resolver

New Hardware for cactus.nci.nih.gov

We have switched cactus.nci.nih.gov to new hardware today. We hope that quite a few things that were causing trouble during recent weeks (in particular for the Resolver) are gone now. If you still experience any problems or run into new ones, please report them to us.

Thanks,
Markus

Posted in Chemical Identifier Resolver, web service

Slides of my Talk at the “9th International Conference on Chemical Structures”

Here are my slides for the talk I gave at the 9th ICCS conference in Noordwijkerhout:

Other than that, I apologize for being so quiet regarding pychembl, but I am working on continuing this blog post series (the interesting stuff is yet to come) as well as preparing pychembl for working with chembl_10.

Markus

Posted in Chemoinformatics, Talks&Presentations

Using pychembl (3) – Active & Parent Molecules


A quite interesting table in ChEMBLdb, also linked to table molecule_dictionary by the mutual primary key molregno, is table molecule_hierarchy. As the name suggests, it stores hierarchical relationships between row entries in table molecule_dictionary and provides a linkage to the parent and active form of a molecule if available in ChEMBLdb.

But first of all, let us load an example molecule from the database again:

> molecule = chembldb.query(MoleculeDictionary).filter(MoleculeDictionary.molregno==47340).one()

Like shown in previous posts, this delivers a MoleculeDictionary object:

> print molecule
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374d750>

The following two commands first walk to table molecule_hierarchy using the pre-defined table relationship hierarchy. From there, an immediate jump back to table molecule_dictionary is performed using either the named relationship parent or active. Both calls again provide a MoleculeDictionary object; this time, however, the object represents either the parent structure or the active form of the original molecule.

> print molecule.hierarchy.parent
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374dbd0>
> print molecule.hierarchy.active
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374de90>

A really cool feature of SQLAlchemy is that it allows you to pre-define relationships which walk over more than one actual table-to-table relationship (more examples of this will come in future posts). In the example shown here, this allows us to eliminate the explicit call of the hierarchy relationship of the MoleculeDictionary object (“hierarchy” is a hard word to type anyway :-) ). Internally, these new relationships follow the same relationship paths as just shown, but provide “parent” and “active” as direct attributes of the object stored in variable molecule:

> print molecule.parent
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374dbd0>
> print molecule.active
% <pychembl.db.auto_schema.MoleculeDictionary object at 0x374de90>
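If you wonder what such a shortcut amounts to, here is a plain-Python analogy without any database. It is not how pychembl or SQLAlchemy implement it; the two classes are simplified stand-ins, and the three molecule names are the ones used in this example. The point is only that molecule.parent is a thin wrapper around molecule.hierarchy.parent.

```python
class Hierarchy:
    """Stands in for a row of table molecule_hierarchy."""
    def __init__(self, parent, active):
        self.parent = parent
        self.active = active

class Molecule:
    """Stands in for a MoleculeDictionary row object."""
    def __init__(self, pref_name, hierarchy=None):
        self.pref_name = pref_name
        self.hierarchy = hierarchy

    @property
    def parent(self):
        # Shortcut for molecule.hierarchy.parent
        return self.hierarchy.parent if self.hierarchy else None

    @property
    def active(self):
        # Shortcut for molecule.hierarchy.active
        return self.hierarchy.active if self.hierarchy else None

tamoxifen = Molecule("TAMOXIFEN")
hydroxy = Molecule("4-HYDROXYTAMOXIFEN")
citrate = Molecule("TAMOXIFEN CITRATE", Hierarchy(tamoxifen, hydroxy))

# Both access paths lead to the very same object.
print(citrate.parent is citrate.hierarchy.parent)
print(citrate.parent.pref_name, "/", citrate.active.pref_name)
```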

And in the same fashion as described in the pychembl (2) post, we can ask now either for attributes of the original molecule, the parent molecule or the active form of the original molecule:

> print molecule.pref_name
% TAMOXIFEN CITRATE
> print molecule.parent.pref_name
% TAMOXIFEN
> print molecule.active.pref_name
% 4-HYDROXYTAMOXIFEN

…, or follow the structure relationship to structural information of each of the three molecules:

> print molecule.structure.canonical_smiles
% CC\C(=C(/c1ccccc1)\c2ccc(OCCN(C)C)cc2)\c3ccccc3.OC(=O)CC(O)(CC(=O)O)C(=O)O
> print molecule.parent.structure.canonical_smiles
% CC\C(=C(/c1ccccc1)\c2ccc(OCCN(C)C)cc2)\c3ccccc3
> print molecule.active.structure.canonical_smiles
% CC\C(=C(/c1ccc(O)cc1)\c2ccc(OCCN(C)C)cc2)\c3ccccc3

…, or ask for properties of the corresponding CompoundProperty object:

> print molecule.property.hba
% 2
> print molecule.parent.property.hba
% 2
> print molecule.active.property.hba
% 3

Makes walking through ChEMBLdb pretty easy, doesn’t it?
Markus

Note: In case you had already installed pychembl earlier, please pull/download it again from GitHub since I added the new relationships for a MoleculeDictionary object.

Posted in Databases, pychembl

Using pychembl (2) – Table Relationships

Let’s continue the example from the pychembl(1) post where we loaded a molecule from ChEMBL’s molecule dictionary table:

molecule = chembldb.query(MoleculeDictionary).filter(MoleculeDictionary.molregno==675049).one()

As you can see from the partial ChEMBL database schema above, table molecule_dictionary uses molregno as primary key. On the right side of this table you find two other tables which have the same primary key: table compound_structures and table compound_properties. Hence, the relationship within each of these two table pairs can be considered a so-called one-to-one relationship (although they have been visualized as one-to-many relationships by the software I used to create the image of the ChEMBL database schema). This means that each entry (or row) in table molecule_dictionary has at most a single corresponding entry in each of these two other tables. It cannot occur that more than one row in one of these tables corresponds to the same molregno, as this would violate the primary key constraint there.

If you look at the row counts in all three tables, it shows that the majority of rows in table molecule_dictionary have corresponding entries both in table compound_structures and compound_properties:

> print chembldb.query(MoleculeDictionary).count()
% 658075
> print chembldb.query(CompoundStructures).count()
% 657736
> print chembldb.query(CompoundProperties).count()
% 657736

Generally, in order to bring together rows in a database table with corresponding rows in related tables, a join between all involved tables as well as the specification of appropriate join conditions is needed if “regular” sql is used. SQLAlchemy allows you to pre-define and name frequently used relationships between tables. Any pre-defined relationships are available in addition to the column attributes of a table row (e.g. molregno, pref_name, chembl_id, max_phase which have been auto-loaded from the database as discussed in the previous pychembl (1) post).
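For comparison, this is roughly what the “regular” SQL route looks like. The sketch uses Python's built-in sqlite3 module with a tiny in-memory database instead of a real ChEMBL installation; the two tables are cut down to a few columns, and the single row uses the values of our example molecule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE molecule_dictionary (molregno INTEGER PRIMARY KEY, pref_name TEXT);
    CREATE TABLE compound_properties (molregno INTEGER PRIMARY KEY, alogp REAL, hba INTEGER);
    INSERT INTO molecule_dictionary VALUES (675049, 'CEFOTETAN DISODIUM');
    INSERT INTO compound_properties VALUES (675049, -0.383, 15);
""")

# Explicit join with its join condition, as required in plain SQL.
row = conn.execute("""
    SELECT m.pref_name, p.alogp, p.hba
    FROM molecule_dictionary AS m
    JOIN compound_properties AS p ON p.molregno = m.molregno
    WHERE m.molregno = ?
""", (675049,)).fetchone()
print(row)
conn.close()
```

With a pre-defined relationship, the join and its condition are written once in the mapping and do not have to be repeated in every query.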

Pychembl provides the pre-defined relationship attributes property and structure for table molecule_dictionary, which allow you to retrieve related row objects from the corresponding tables (as the database schema cutout above shows, there are presumably more relationships to other tables, but for today’s example I will restrict myself to these two):

> p = molecule.property
> print p
% <pychembl.db.auto_schema.CompoundProperties object>
> s = molecule.structure
> print s
% <pychembl.db.auto_schema.CompoundStructures object>

These are the same row objects you would gather by querying the database in the following way:

> chembldb.query(CompoundProperties).filter(CompoundProperties.molregno==675049).one()
% <pychembl.db.auto_schema.CompoundProperties object>
> chembldb.query(CompoundStructures).filter(CompoundStructures.molregno==675049).one()
% <pychembl.db.auto_schema.CompoundStructures object>

As for MoleculeDictionary objects, all column attributes of CompoundProperties objects and CompoundStructures objects have been auto-loaded from the database and are accessible as Python object attributes:

> print p.alogp
% -0.383
> print p.hba
% 15
> print p.acd_most_apka
% 1.692
> print s.standard_inchi
% InChI=1S/C17H17N7O8S4.2Na/c1-23-16(20-21-22-23)34....
> print s.canonical_smiles
% [Na+].[Na+].CO[C@]1(NC(=O)C2SC(=C(C(=O)N)C(=O)[O-])S2)[C@H]3SCC(=C(N3C1=O)C(=O)[O-])CSc4nnnn4C
> print s.molformula
% C17 H15 N7 O8 S4 . 2 Na

And since Python is a well-designed language, it allows you to access these attributes directly from the MoleculeDictionary object which we had previously stored in variable molecule, i.e. there is no need to create an intermediate CompoundProperties or CompoundStructures object:

> print molecule.property.alogp
% -0.383
> print molecule.property.hba
% 15
> print molecule.property.acd_most_apka
% 1.692
> print molecule.structure.standard_inchi
% InChI=1S/C17H17N7O8S4.2Na/c1-23-16(20-21-22-23)34....
> print molecule.structure.canonical_smiles
% [Na+].[Na+].CO[C@]1(NC(=O)C2SC(=C(C(=O)N)C(=O)[O-])S2)[C@H]3SCC(=C(N3C1=O)C(=O)[O-])CSc4nnnn4C
> print molecule.structure.molformula
% C17 H15 N7 O8 S4 . 2 Na

Markus

Posted in Databases, pychembl

Using ChemSpider ID as Chemical Structure Identifier

ChemSpider IDs are definitely an important identifier to specify or name a chemical structure. Starting with the beta 4 version of the Chemical Identifier Resolver (CIR), ChemSpider IDs are accepted both as input identifier and as output representation. If a ChemSpider ID is used as input, it has to be “classified” as ChemSpider ID (as we plan to enable the lookup of more database identifiers using the following format):

http://cactus.nci.nih.gov/chemical/structure/chemspider_id=1234567/smiles

The trick is that for the conversion step “ChemSpider ID to structure” no local lookup in our databases is performed; instead, the ID is converted by connecting to ChemSpider’s InChI Resolver. If you want, you can combine this with other methods provided by CIR, for instance, the generation of all tautomers for a ChemSpider ID:

http://cactus.nci.nih.gov/chemical/structure/tautomers:chemspider_id=1234567/smiles
http://cactus.nci.nih.gov/chemical/structure/tautomers:chemspider_id=1234567/image

Or you can “twirl” a ChemSpider ID:

http://cactus.nci.nih.gov/chemical/structure/chemspider_id=1234567/twirl

Of course, it also works in the other direction: the following example starts from an IUPAC name, which internally is converted into a chemical structure by OPSIN, and then is resolved into a ChemSpider ID (which again uses ChemSpider’s InChI Resolver):

http://cactus.nci.nih.gov/chemical/structure/(3β)-cholest-5-en-3-ol/chemspider_id

And a final example: resolve a set of Warfarin tautomers into ChemSpider IDs (unfortunately, the ChemSpider InChI Resolver also returns deprecated ChemSpider records):

http://cactus.nci.nih.gov/chemical/structure/tautomers:warfarin/chemspider_id

You can do the same thing by starting from a ChemSpider ID, generating the tautomers and resolving them into a set of ChemSpider IDs again:

http://cactus.nci.nih.gov/chemical/structure/tautomers:chemspider_id=52602/chemspider_id
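All URLs in this post follow the same scheme: an optional method prefix such as tautomers:, an optional identifier class such as chemspider_id=, the identifier itself, and the requested representation. A minimal Python sketch of that scheme could look as follows; the helper cir_url and its argument names are made up, and I assume that percent-encoded non-ASCII characters (like the Greek beta in the cholesterol name) are accepted by the server.

```python
from urllib.parse import quote

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"

def cir_url(identifier, representation, id_class=None, method=None):
    """Compose a CIR URL from its parts (hypothetical helper)."""
    if id_class:
        identifier = id_class + "=" + identifier   # e.g. chemspider_id=52602
    if method:
        identifier = method + ":" + identifier     # e.g. tautomers:...
    # Keep ':' and '=' readable; all other characters outside the unreserved
    # set (e.g. the Greek beta) are percent-encoded as UTF-8.
    return "/".join([CIR_BASE, quote(identifier, safe=":="), representation])

print(cir_url("52602", "chemspider_id", id_class="chemspider_id", method="tautomers"))
print(cir_url("(3β)-cholest-5-en-3-ol", "chemspider_id"))
```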

Markus

Posted in Chemical Identifier Resolver

Using pychembl (1)

Today, I want to start with some simple examples of how to use pychembl. For this, let us walk through the molecule_dictionary table available in ChEMBL.

Well, like in the previous post we first have to import pychembl:

from pychembl.settings import *
from pychembl.db.auto_schema import *

To access the row entries in this table, pychembl’s table mapper class MoleculeDictionary has to be passed to the SQLAlchemy query object available in pychembl (or chembldb, respectively):

> molecules = chembldb.query(MoleculeDictionary)
> print type(molecules)
% <class 'sqlalchemy.orm.query.Query'>

Let us count how many are present:

> print molecules.count()
% 658075

If you would like to access a specific molecule identified by its ChEMBL molregno (the primary key in this table), one of the following filter statements can be used (these are all alternative ways to do it):

> molecule = molecules.filter(MoleculeDictionary.molregno==675049).all()
> molecule = molecules.filter(MoleculeDictionary.molregno==675049).one()
> molecule = molecules.filter(MoleculeDictionary.molregno==675049).first()
> molecule = molecules.get((675049,))

The .all() method returns a Python list object with all matching row objects. As it is already clear that the statement will return only a single object, the .one() method retrieves only this object without generating a list. However, a request using .one() raises an error if the filter criterion matches more than one row (or no row at all). This can be avoided by using the .first() method, which returns only the first object regardless of how many rows in the table match the filter criterion. Finally, the .get() method can be used if a row is identified by its primary key; it expects a Python tuple object as input (if a multi-column primary key is used in a table, the tuple has to contain the corresponding number of elements).
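The difference between .one() and .first() is easiest to see on a plain list. The two functions below only mimic the behavior described above and are not SQLAlchemy code: SQLAlchemy raises its own exception types, for which ValueError is a stand-in here, and the row value is a placeholder string.

```python
def one(rows):
    """Mimic .one(): exactly one row is expected, anything else is an error."""
    if len(rows) != 1:
        raise ValueError("expected exactly one row, got %d" % len(rows))
    return rows[0]

def first(rows):
    """Mimic .first(): the first row, or None if nothing matched."""
    return rows[0] if rows else None

matches = ["row for molregno 675049"]   # placeholder for a row object
print(one(matches))
print(first(matches))
print(first([]))   # None: no error, even though nothing matched
```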

Of course, you can also do the creation of the query object and the definition of the filter criterion as a single statement, e.g. like this:

molecule = chembldb.query(MoleculeDictionary).filter(MoleculeDictionary.molregno==675049).one()

Accessing the attributes of a row object, i.e. the attributes of the molecule we just fetched from the database, is simple:

> print molecule.molregno
% 675049
> print molecule.pref_name
% CEFOTETAN DISODIUM
> print molecule.chembl_id
% CHEMBL1201098
> print molecule.first_approval
% 1984
> print molecule.natural_product
% 1

All attribute names available for an object (e.g. molregno, chembl_id, first_approval, etc., see table molecule_dictionary in the schema) are auto-loaded from the database and hence are not changed by pychembl in any form. The Python datatype of a returned attribute corresponds to the column datatype specified in the database (these are also auto-loaded).

With the statement shown earlier

> molecules = chembldb.query(MoleculeDictionary)

every query fetches a complete MoleculeDictionary object for each matching database row. In order to retrieve only certain attributes of a molecule, you can name the attributes:

> molecules = chembldb.query(MoleculeDictionary.chembl_id, MoleculeDictionary.chebi_id)

From this, for instance, you can very easily generate a python dictionary associating the ChEMBL ID with its corresponding ChEBI ID (we restrict it here to the first five):

> molecules = chembldb.query(MoleculeDictionary.chembl_id, MoleculeDictionary.chebi_id)
> chembl_to_chebi_id = dict(molecules.limit(5).all())
> print chembl_to_chebi_id
% {'CHEMBL6328': 100002L, 'CHEMBL6329': 100001L, 'CHEMBL267864': 100005L, 'CHEMBL6362': 100004L, 'CHEMBL265667': 100003L}
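The reason this works is that dict() accepts any sequence of two-element rows and uses the first element as key and the second as value. Here is a small stand-alone sketch without a database, written for Python 3 (which no longer prints the L suffix for long integers); the five pairs are copied from the output above, in arbitrary order.

```python
# Five (chembl_id, chebi_id) pairs, as a query restricted to these two
# columns would return them (values copied from the output shown above).
rows = [
    ("CHEMBL6328", 100002),
    ("CHEMBL6329", 100001),
    ("CHEMBL267864", 100005),
    ("CHEMBL6362", 100004),
    ("CHEMBL265667", 100003),
]

# dict() turns each two-element row into one key/value entry.
chembl_to_chebi = dict(rows)
print(chembl_to_chebi["CHEMBL6329"])  # 100001
```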

Markus

Posted in pychembl