Matt Swain has written a nice Python wrapper for the Chemical Identifier Resolver (CIR). Since he blogged about it, my life is easy and I just link to his blog here. The project is available on GitHub.
Ian Powley, a member of the MRC Toxicology Unit of the University of Leicester, UK, wrote a nice Windows Phone app that is available at the Windows Marketplace (follow the link). It is a little lab helper for calculating molarities and stock dilutions with ease – and it connects to the Chemical Identifier Resolver for chemical structure lookups.
We have updated the OPSIN library used by CIR to version 1.2. Daniel lists the following improvements in his release notes:
- Basic support for cyclised carbohydrates e.g. alpha-D-glucopyranose
- Basic support for systematic carbohydrate stems e.g. D-glycero-D-gluco-Heptose
- Added heuristic for correcting esters with omitted spaces
- Added support for xanthates/xanthic acid
- Minor vocabulary improvements
- Fixed a few minor bugs/limitations in the Cahn-Ingold-Prelog rules implementation and made more memory efficient
- Many minor improvements and bug fixes
There is a new resolver module available at CIR (Chemical Identifier Resolver): “name patterns“. It allows for Google-like searches on our name index, which currently holds approx. 70 million names. Internally, the module regards a chemical name, for instance “(p-Nitrobenzoyl)acetone“, as a sentence consisting of the three words “p“, “nitrobenzoyl“, and “acetone“. It is still kind of experimental, i.e. it might not work at times, but we would be happy to hear about how useful it is. The search for chemical names is performed by the Sphinx SQL full-text search engine.
Before I show some examples, please note the following: for performance reasons, CIR does not use the name pattern module by default; it has to be named explicitly by the resolver URL parameter “resolver“, e.g.
This is identical to the behavior we described in this earlier post for the “name by chemspider” resolver module. The URL shown above requests the SMILES strings for all structures containing the word “morphine” in their name. The result is always restricted to the 100 most relevant entries (the relevance of a name is determined by Sphinx).
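If you prefer to build such request URLs from a script, a minimal Python sketch could look like the following. The base URL and the “resolver“ parameter are taken from the example URLs in this post; the helper `cir_url` itself is a hypothetical convenience function and not part of CIR, and percent-encoding the identifier is my assumption about the safest way to pass spaces and operators.

```python
from urllib.parse import quote, urlencode

# Base URL as it appears in the example URLs of this post.
CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier, representation, resolver=None, xml=False):
    """Build a CIR request URL (hypothetical helper, not part of CIR)."""
    # Percent-encode the identifier so spaces, quotes and operators survive.
    parts = [CIR_BASE, quote(identifier, safe=""), representation]
    if xml:
        parts.append("xml")
    url = "/".join(parts)
    if resolver:
        # The module has to be named explicitly, e.g. "name_pattern".
        url += "?" + urlencode({"resolver": resolver})
    return url


print(cir_url("morphine", "smiles", resolver="name_pattern"))
```

The function only assembles the URL; fetching it (for example with `urllib.request.urlopen`) is left out on purpose so the sketch runs without network access.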
The usage of the name pattern module can also be combined with any other CIR module, e.g. the following request includes the “name by chemspider” module (for a full name match) and the “name pattern” module:
If you look at the XML returned by CIR, the name that the name pattern has successfully matched is available from the notation attribute of a data item:
<data id="3" resolver="name_pattern" string_class="Chemical Name Pattern" notation="Morphine N-oxide"> <item id="1">[C@@]125C3=C4C[C@H]([C@@H]1C=C[C@@H]([C@@H]2OC3=C(C=C4)O)O)[N+](C)([O-])CC5</item> </data>
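To pull the matched names out of such a response programmatically, something along these lines should work. The `<data>` element is copied from the snippet above; the enclosing `<request>` root element is an assumption I added so that the fragment parses on its own, and `matched_names` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The <data> element is copied from the response shown above; the enclosing
# <request> root is an assumption made so the snippet parses on its own.
SAMPLE = """<request>
<data id="3" resolver="name_pattern" string_class="Chemical Name Pattern" notation="Morphine N-oxide">
<item id="1">[C@@]125C3=C4C[C@H]([C@@H]1C=C[C@@H]([C@@H]2OC3=C(C=C4)O)O)[N+](C)([O-])CC5</item>
</data>
</request>"""


def matched_names(xml_text):
    """Return (matched name, SMILES) pairs for every name_pattern hit."""
    root = ET.fromstring(xml_text)
    pairs = []
    for data in root.iter("data"):
        # Skip results that came from other resolver modules.
        if data.get("resolver") != "name_pattern":
            continue
        for item in data.iter("item"):
            pairs.append((data.get("notation"), (item.text or "").strip()))
    return pairs


for name, smiles in matched_names(SAMPLE):
    print(name, smiles)
```

Using `root.iter("data")` rather than a fixed path keeps the sketch independent of how deeply the data items are nested in the real response.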
Now the cool things – all following examples request the SMILES strings for all structures whose names contain:
- the words “morphine” and “methyl” (name pattern: ‘+morphine +methyl‘):
- the words “morphine” and “methyl” but not “ester” (name pattern: ‘+morphine +methyl -ester‘):
http://cactus.nci.nih.gov/chemical/structure/+morphine +methyl -ester/smiles/xml?resolver=name_pattern
- the substring “morphine” as a word end (name pattern: ‘*morphine‘):
- the substring “morphine” somewhere in the name (name pattern: ‘*morphine*‘) :
- the substring “morphine” and the literal string “3-methyl ether” (name pattern: ‘+*morphine* +”3-methyl ether“‘):
http://cactus.nci.nih.gov/chemical/structure/+*morphine* +"3-methyl ether"/smiles/xml?resolver=name_pattern
- a single character “m” and the word “benzene” in a maximum distance of 3 words (nice to find smaller aromatic ring systems, name pattern: ‘“m benzene”~3‘):
- the words “magnesium” or “sodium” and the word “chloride” (name pattern: ‘(magnesium|sodium) +chloride‘)
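The operators used in the patterns above (+ for required words, - for excluded words, double quotes for literal phrases) are easy to assemble in code. Here is a small Python sketch; `name_pattern` is a hypothetical helper of my own, and it covers only these three operators, not wildcards placement, proximity (~) or alternatives (|).

```python
def name_pattern(required=(), excluded=(), phrases=()):
    """Compose a name pattern from words and quoted phrases (hypothetical helper)."""
    # Required words get a leading "+".
    terms = ["+" + word for word in required]
    # Quoted phrases are matched literally, so dashes inside them
    # are not read as minus operators.
    terms += ['+"' + phrase + '"' for phrase in phrases]
    # Excluded words get a leading "-".
    terms += ["-" + word for word in excluded]
    return " ".join(terms)


print(name_pattern(required=["morphine", "methyl"]))
print(name_pattern(required=["morphine", "methyl"], excluded=["ester"]))
print(name_pattern(required=["*morphine*"], phrases=["3-methyl ether"]))
```

The resulting string still has to be placed into the request URL together with the “resolver=name_pattern“ parameter.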
(Long) Chemical names
The name pattern module can also be used to match full chemical names, although performance is not so good. Additionally, if you want to search, e.g., for ‘[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]-[4-(difluoromethoxy)benzyl]-methyl-ammonium‘ you have to quote it, otherwise dashes are seen as minus operators (excluding the next word). So here is the correct way for a full name request:
However, for better performance and because of uncertainties about the spelling used in the database, it is probably smarter to ignore punctuation and try it this way (which only includes the important chemical syllables of the name above linked by + operators, name pattern: ‘+”(1R)-1-benzyl“ +keto +propyl +amino +ethyl +difluoromethoxy +ammonium‘):
http://cactus.nci.nih.gov/chemical/structure/"(1R)-1-benzyl+" +keto +propyl +amino +ethyl +difluoromethoxy +ammonium/smiles/xml?resolver=name_pattern
I hope it is useful,
First, we’d like to announce that we have updated OPSIN to version 1.1.0. Secondly, there is a new resolver module available in CIR: ChemSpider provides a name index of excellent quality which you can use now from CIR:
Internally, this request is passed through directly to ChemSpider. As we don’t want to forward our entire traffic through ChemSpider’s service, the URL parameter “?resolver=name_by_chemspider” has to be added explicitly to the URL sent to CIR. If this parameter is not given, the provided name is resolved as before: first by the OPSIN module in CIR and, if this fails, by a lookup in the local name index of CIR.
If you want to change the order of this procedure and/or add the lookup at ChemSpider, you can do the following:
This attempts to resolve the name “L-alanin” first by the ChemSpider resolver module, then by the OPSIN resolver module and finally with the database name index of CIR. As the lookup is already successful using the ChemSpider module, CIR stops there and doesn’t apply the other two modules.
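The “try each module in order and stop at the first hit” behavior can be pictured with a few lines of Python. This is only an illustration of the logic described above, not CIR’s actual implementation: `resolve_first` and the three stub lookups are hypothetical, and the returned strings are placeholders rather than real results.

```python
def resolve_first(name, modules):
    """Try each (module name, lookup) pair in order; stop at the first hit."""
    for module_name, lookup in modules:
        structures = lookup(name)
        if structures:  # a non-empty result ends the chain
            return module_name, structures
    return None, []


# Stand-in lookups for illustration only; the real modules query ChemSpider,
# OPSIN and the CIR name index. The returned strings are placeholders.
def chemspider_stub(name):
    return ["<structure from ChemSpider>"] if name == "L-alanin" else []


def opsin_stub(name):
    return []


def database_stub(name):
    return ["<structure from name index>"]


chain = [
    ("name_by_chemspider", chemspider_stub),
    ("name_by_opsin", opsin_stub),
    ("name_by_database", database_stub),
]

print(resolve_first("L-alanin", chain))
```

Reordering the entries in `chain` corresponds to changing the order of the modules in the request.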
If you would like to see what all three name resolving modules reply, you have to use the xml representation of CIR:
If you would like to check whether all three modules return the same structure for a name, you can “hash” the resolved structures using the HASHISY function available in CACTVS:
Fortunately, we get the same hashcode value from each module, but that is not generally true. For instance, the ChemSpider name resolver module returns both forms for “fructose” while the other two modules return only the open-chain form of fructose (and of course, another reason could be some nasty, nasty bug):
We have switched cactus.nci.nih.gov to new hardware today. With this, we hope that quite a few things that were causing trouble during recent weeks (in particular for the Resolver) are gone now. If you still experience any problems or have new ones, please report them to us.
ChemSpider IDs are definitely an important identifier to specify or name a chemical structure. Starting with the beta 4 version of the Chemical Identifier Resolver (CIR), ChemSpider IDs are now accepted both as input identifier and as output representation. If one is used as input, it has to be “classified” as a ChemSpider ID (as we plan to enable the lookup of more database identifiers using the following format):
The trick is that for the conversion step “ChemSpider ID to structure” no local lookup in our databases is performed; instead, the ID is converted by connecting to ChemSpider’s InChI Resolver. If you want, you can combine this with other methods provided by CIR, for instance, the generation of all tautomers for a ChemSpider ID:
Or you can “twirl” a ChemSpider ID:
Of course, it also works in the other direction: the following example starts from an IUPAC name, which internally is converted into a chemical structure by OPSIN, and then is resolved into a ChemSpider ID (which again uses ChemSpider’s InChI Resolver):
And a final example: resolve a set of Warfarin tautomers into ChemSpider IDs (unfortunately the ChemSpider InChI Resolver also returns deprecated ChemSpider records):
You can do the same thing by starting from a ChemSpider ID, generating the tautomers and resolving them into a set of ChemSpider IDs again:
Daniel M. Lowe at the Unilever Centre for Molecular Science Informatics (University of Cambridge) and I have collaborated to integrate his very nice OPSIN software package into the Resolver (as an alternative to their own web service). OPSIN was initially started by Peter Corbett in Peter Murray-Rust’s group; however, Daniel is responsible for the development of version 1.0.0, released recently and published in JCIM.
OPSIN is an Open Source Java library that parses systematic IUPAC names and converts them into a full structure representation. Our Resolver so far attempts the same thing by a simple lookup in a large name index stored in its database (and admittedly, in some parts the quality of this name index is mediocre). The lookup of names in a database, of course, works less systematically than OPSIN (as only those names available in the database can be retrieved); however, it has the advantage that trivial names that do not follow a systematic nomenclature can also be converted into a full structure representation if they are present in the database. So Daniel and I thought that combining both things would generate a very powerful tool for name-to-structure conversion.
How it works
The IUPAC name “spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]” can only be resolved by OPSIN and is not available in the Resolver name index. Starting with the beta 4 version (to which we switched over yesterday), the Resolver now automatically uses OPSIN as well, e.g.:
A name example only resolvable by the Resolver’s name index is “Warfarin“:
As you can see from these URLs, no explicit specification is required, whether OPSIN or the database lookup should be used.
However, if you want to make sure that a specific method is applied, you need to specify the corresponding resolver module explicitly (see “?resolver” query parameter “name_by_opsin” or “name_by_database“):
Alternatively, if you would like the Resolver to tell you which one of the two name resolving modules has worked for a specific name, you can use the xml format (it returns the applied resolver module as one of the XML tag attributes):
http://cactus.nci.nih.gov/chemical/structure/spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]/smiles/xml
http://cactus.nci.nih.gov/chemical/structure/warfarin/smiles/xml
http://cactus.nci.nih.gov/chemical/structure/hex-1-yne/smiles/xml
As Daniel’s web page of name examples shows, OPSIN also accepts Greek (Unicode) characters – hence, we enhanced the Resolver to do the same thing:
More complex names (e.g. “pentacyclo[13.7.4.33,8.018,20.113,28]triacontane”) should also be URL-encoded, as Daniel’s examples show (see “von Baeyer systems”):
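For what it’s worth, URL-encoding a name is a one-liner in Python’s standard library. The sketch below uses `urllib.parse.quote` with `safe=""` so that brackets, commas, apostrophes and non-ASCII characters are all percent-encoded (as UTF-8); the function name `encode_name` and the Greek-letter example name are my own choices for illustration.

```python
from urllib.parse import quote


def encode_name(name):
    """Percent-encode a chemical name for use in a URL path segment."""
    # safe="" also encodes "/" so the name stays a single path segment.
    return quote(name, safe="")


print(encode_name("spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]"))
# → spiro%5B1%2C2-benzodithiole-3%2C2%27-%5B1%2C3%5Dbenzodithiole%5D
print(encode_name("α-D-glucopyranose"))
# → %CE%B1-D-glucopyranose
```

Letters, digits and the characters `_.-~` are left untouched, so simple names such as “warfarin” come out unchanged.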
Well, and finally – to get some graphics in here – let’s twirl around “L-alanyl-L-glutaminyl-L-arginyl-O-phosphono-L-seryl-L-alanyl-L-proline” converted by OPSIN into a structure (3D coordinates are calculated by CORINA):
I hope you find it helpful,
We have switched the Resolver to its beta 4 release. It brings quite a few new features, which I will post about later. If you find any problems with the new version, please report them to us.