/chemical/structure Blog

There is a new resolver module available at CIR (Chemical Identifier Resolver): “name pattern”. It allows Google-like searches on our name index, which currently holds approx. 70 million names. Internally, the module treats a chemical name, for instance “(p-Nitrobenzoyl)acetone”, as a sentence consisting of the three words “p”, “nitrobenzoyl”, and “acetone”. It is still somewhat experimental, i.e. it might not work at times, but we would be happy to hear how useful it is. The search for chemical names is performed by the Sphinx SQL full-text search engine.

Before I show some examples, please note the following: for performance reasons, CIR does not use the name pattern module by default. It has to be requested explicitly with the resolver URL parameter “resolver”, e.g.

https://cactus.nci.nih.gov/chemical/structure/morphine/smiles/xml?resolver=name_pattern

This is the same behavior we described in an earlier post for the “name by chemspider” resolver module. The URL shown above requests the SMILES strings for all structures containing the word “morphine” in their name. The result is always restricted to the 100 most relevant entries (the relevance of a name is determined by Sphinx).

The name pattern module can also be combined with any other CIR module, e.g. the following request includes the “name by chemspider” module (for a full name match) and the “name pattern” module:

https://cactus.nci.nih.gov/chemical/structure/morphine/smiles/xml?resolver=name_by_chemspider,name_pattern
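The request URLs above can also be put together programmatically. The following is a minimal Python sketch, not part of CIR itself: the function name `build_cir_url` and its parameters are hypothetical, and percent-encoding the identifier (so that spaces, `+`, and quotes survive inside the URL path) is an assumption about what the server accepts.

```python
from urllib.parse import quote, urlencode

CIR_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def build_cir_url(identifier, representation="smiles",
                  resolvers=("name_pattern",), xml=True):
    """Build a CIR request URL (hypothetical helper, not an official API).

    identifier     -- a chemical name or name pattern
    representation -- the requested output, e.g. "smiles"
    resolvers      -- resolver modules to name in the "resolver" parameter
    xml            -- append "/xml" to request the XML response format
    """
    # Assumption: the identifier is percent-encoded for use in the URL path.
    path = f"{CIR_BASE}/{quote(identifier, safe='')}/{representation}"
    if xml:
        path += "/xml"
    # Resolver modules are passed as one comma-separated value.
    query = urlencode({"resolver": ",".join(resolvers)}, safe=",")
    return f"{path}?{query}"


print(build_cir_url("morphine"))
print(build_cir_url("morphine", resolvers=("name_by_chemspider", "name_pattern")))
```

The two printed URLs correspond to the two requests shown above.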

If you look at the XML returned by CIR, the name that the pattern matched is available in the notation attribute of each data element:

<data id="3" resolver="name_pattern" string_class="Chemical Name Pattern" notation="Morphine N-oxide">
   <item id="1">[C@@]125C3=C4C[C@H]([C@@H]1C=C[C@@H]([C@@H]2OC3=C(C=C4)O)O)[N+](C)([O-])CC5</item>
</data>
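To pull the matched names and their structures out of such a response, something like the following Python sketch could be used. It relies only on the `data` and `item` elements and the `resolver` and `notation` attributes shown above; the enclosing `<request>` root element in the sample and the function name `matched_names` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SMILES = "[C@@]125C3=C4C[C@H]([C@@H]1C=C[C@@H]([C@@H]2OC3=C(C=C4)O)O)[N+](C)([O-])CC5"

# Sample response; the <request> root element is an assumption.
SAMPLE = f"""<request>
<data id="3" resolver="name_pattern" string_class="Chemical Name Pattern" notation="Morphine N-oxide">
   <item id="1">{SMILES}</item>
</data>
</request>"""


def matched_names(xml_text):
    """Return (matched name, item text) pairs for name_pattern results."""
    root = ET.fromstring(xml_text)
    results = []
    for data in root.iter("data"):
        # Skip results produced by other resolver modules.
        if data.get("resolver") != "name_pattern":
            continue
        name = data.get("notation")
        for item in data.iter("item"):
            results.append((name, (item.text or "").strip()))
    return results


for name, smiles in matched_names(SAMPLE):
    print(name, smiles)
```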

Examples

Now for the cool things: all of the following examples request the SMILES strings for all structures whose names contain:

the words “morphine” and “methyl” (name pattern: ‘+morphine +methyl’):

https://cactus.nci.nih.gov/chemical/structure/+morphine +methyl/smiles/xml?resolver=name_pattern

the words “morphine” and “methyl” but not “ester” (name pattern: ‘+morphine +methyl -ester’):

https://cactus.nci.nih.gov/chemical/structure/+morphine +methyl -ester/smiles/xml?resolver=name_pattern

the substring “morphine” at the end of a word (name pattern: ‘*morphine’):

https://cactus.nci.nih.gov/chemical/structure/*morphine/smiles/xml?resolver=name_pattern

the substring “morphine” somewhere in the name (name pattern: ‘*morphine*’):

https://cactus.nci.nih.gov/chemical/structure/*morphine*/smiles/xml?resolver=name_pattern

the substring “morphine” somewhere in the name and the literal string “3-methyl ether” (name pattern: ‘+*morphine* +"3-methyl ether"’):

https://cactus.nci.nih.gov/chemical/structure/+*morphine* +"3-methyl ether"/smiles/xml?resolver=name_pattern
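The patterns in this section all follow the same scheme: `+` marks a required word, `-` an excluded word, `*` a wildcard, and double quotes a literal phrase. A small Python sketch for composing such patterns might look like this; the function `name_pattern` and its argument names are hypothetical and only mirror the operators used in the examples above.

```python
def name_pattern(required=(), excluded=(), phrases=()):
    """Compose a name pattern string (hypothetical helper).

    required -- words (optionally with * wildcards) that must occur
    excluded -- words that must not occur
    phrases  -- literal strings that must occur, emitted in double quotes
    """
    parts = [f"+{word}" for word in required]
    parts += [f'+"{phrase}"' for phrase in phrases]
    parts += [f"-{word}" for word in excluded]
    return " ".join(parts)


print(name_pattern(required=["morphine", "methyl"]))
print(name_pattern(required=["morphine", "methyl"], excluded=["ester"]))
print(name_pattern(required=["*morphine*"], phrases=["3-methyl ether"]))
```

The resulting string takes the place of the chemical name in the request URL.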

Fancy things

a single character “m” and the word “benzene” within a maximum distance of 3 words (nice for finding smaller aromatic ring systems, name pattern: ‘"m benzene"~3’):

https://cactus.nci.nih.gov/chemical/structure/"m benzene"~3/smiles/xml?resolver=name_pattern

the words “magnesium” or “sodium” and the word “chloride” (name pattern: ‘(magnesium|sodium) +chloride’):

https://cactus.nci.nih.gov/chemical/structure/(magnesium|sodium) +chloride/smiles/xml?resolver=name_pattern

(Long) Chemical names

The name pattern module can also be used to match full chemical names, although performance is not as good. Additionally, if you want to search for, e.g., ‘[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]-[4-(difluoromethoxy)benzyl]-methyl-ammonium’, you have to quote it; otherwise dashes are interpreted as minus operators (which exclude the next word). So here is the correct way to request a full name:

https://cactus.nci.nih.gov/chemical/structure/"[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]-[4-(difluoromethoxy)benzyl]-methyl-ammonium"/smiles/xml?resolver=name_pattern

However, for better performance, and because the spelling used in the database is uncertain, it is probably smarter to ignore punctuation and try it this way, including only the important chemical syllables of the name above, linked by + operators (name pattern: ‘+"(1R)-1-benzyl" +keto +propyl +amino +ethyl +difluoromethoxy +ammonium’):

https://cactus.nci.nih.gov/chemical/structure/+"(1R)-1-benzyl" +keto +propyl +amino +ethyl +difluoromethoxy +ammonium/smiles/xml?resolver=name_pattern
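Both approaches to long names can be roughly automated. The Python sketch below is only an approximation under stated assumptions: `quoted_full_name` wraps a name in double quotes so dashes are not read as minus operators, and `fragment_pattern` keeps every alphabetic fragment of at least three letters as a required word. Both function names and the three-letter cutoff are hypothetical choices, and the generated pattern is cruder than the hand-picked one above (it does not keep a phrase such as "(1R)-1-benzyl", and it includes every fragment, e.g. “methyl”).

```python
import re


def quoted_full_name(name):
    """Quote a full name so that dashes are not treated as minus operators."""
    return f'"{name}"'


def fragment_pattern(name, min_length=3):
    """Turn a long name into '+word' terms (hypothetical heuristic).

    Keeps each distinct alphabetic fragment with at least min_length
    letters, in order of first appearance; digits, brackets, and
    dashes are dropped.
    """
    seen = []
    for word in re.findall(r"[A-Za-z]+", name):
        word = word.lower()
        if len(word) >= min_length and word not in seen:
            seen.append(word)
    return " ".join(f"+{word}" for word in seen)


NAME = ("[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]"
        "-[4-(difluoromethoxy)benzyl]-methyl-ammonium")

print(quoted_full_name(NAME))
print(fragment_pattern(NAME))
```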

I hope it is useful,
Markus
