There is a new resolver module available at CIR (Chemical Identifier Resolver): “name patterns”. It allows Google-like searches on our name index, which currently holds approx. 70 million names. Internally, the module treats a chemical name, for instance “(p-Nitrobenzoyl)acetone”, as a sentence consisting of the three words “p”, “nitrobenzoyl”, and “acetone”. The module is still experimental, i.e. it might not work at times, but we would be happy to hear how useful it is. The search for chemical names is performed by the Sphinx SQL full-text search engine.
Before I show some examples, please note the following: for performance reasons, CIR does not use the name pattern module by default; it has to be requested explicitly with the resolver URL parameter “resolver”, e.g.
https://cactus.nci.nih.gov/chemical/structure/morphine/smiles/xml?resolver=name_pattern
This is the same behavior we described in this earlier post for the “name by chemspider” resolver module. The URL shown above requests the SMILES strings for all structures containing the word “morphine” in their name. The result is always restricted to the 100 most relevant entries (the relevance of a name is determined by Sphinx).
The usage of the name pattern module can also be combined with any other CIR module, e.g. the following request includes the “name by chemspider” module (for a full name match) and the “name pattern” module:
https://cactus.nci.nih.gov/chemical/structure/morphine/smiles/xml?resolver=name_by_chemspider,name_pattern
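For scripted access, request URLs like the two above can be assembled programmatically. The following is only a minimal Python sketch and not part of CIR: the helper name `build_cir_url` and its parameters are made up for illustration, and it assumes the server decodes percent-encoded URL paths in the usual way.

```python
from urllib.parse import quote

BASE_URL = "https://cactus.nci.nih.gov/chemical/structure"


def build_cir_url(identifier, representation="smiles", fmt="xml",
                  resolvers=("name_pattern",)):
    """Assemble a CIR request URL (hypothetical helper, not a CIR API)."""
    # Percent-encode the whole identifier, including "/", "+" and spaces,
    # so that special characters survive as part of the URL path.
    encoded = quote(identifier, safe="")
    url = f"{BASE_URL}/{encoded}/{representation}/{fmt}"
    if resolvers:
        # Several resolver modules are passed as a comma-separated list.
        url += "?resolver=" + ",".join(resolvers)
    return url


print(build_cir_url("morphine"))
print(build_cir_url("morphine",
                    resolvers=("name_by_chemspider", "name_pattern")))
```

The two printed URLs correspond to the two requests shown above.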
If you look at the XML returned by CIR, the name that the name pattern matched is available from the notation attribute of a data item:
<data id="3" resolver="name_pattern" string_class="Chemical Name Pattern" notation="Morphine N-oxide">
<item id="1">[C@@]125C3=C4C[C@H]([C@@H]1C=C[C@@H]([C@@H]2OC3=C(C=C4)O)O)[N+](C)([O-])CC5</item>
</data>
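To pull the matched names out of such a response, something along the following lines might work. This is a sketch using Python's standard `xml.etree.ElementTree`; the function name `matched_names` is hypothetical, and the `<request>` wrapper element in the sample is an assumption added only to make the fragment a well-formed document.

```python
import xml.etree.ElementTree as ET

# Sample fragment from above. The <request> wrapper is an assumption,
# added only so that the fragment parses as a complete XML document.
SAMPLE = """<request>
<data id="3" resolver="name_pattern" string_class="Chemical Name Pattern" notation="Morphine N-oxide">
<item id="1">[C@@]125C3=C4C[C@H]([C@@H]1C=C[C@@H]([C@@H]2OC3=C(C=C4)O)O)[N+](C)([O-])CC5</item>
</data>
</request>"""


def matched_names(xml_text):
    """Return (matched name, [SMILES, ...]) pairs for name_pattern hits."""
    root = ET.fromstring(xml_text)
    hits = []
    # iter() finds <data> elements at any depth, whatever the root is called.
    for data in root.iter("data"):
        if data.get("resolver") != "name_pattern":
            continue
        smiles = [item.text for item in data.findall("item")]
        hits.append((data.get("notation"), smiles))
    return hits


for name, smiles in matched_names(SAMPLE):
    print(name, smiles)
```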
Examples
Now the cool things – all following examples request the SMILES strings for all structures whose names contain:
- the words “morphine” and “methyl” (name pattern: ‘+morphine +methyl’):
https://cactus.nci.nih.gov/chemical/structure/+morphine +methyl/smiles/xml?resolver=name_pattern
- the words “morphine” and “methyl” but not “ester” (name pattern: ‘+morphine +methyl -ester’):
https://cactus.nci.nih.gov/chemical/structure/+morphine +methyl -ester/smiles/xml?resolver=name_pattern
- the substring “morphine” as a word end (name pattern: ‘*morphine’):
https://cactus.nci.nih.gov/chemical/structure/*morphine/smiles/xml?resolver=name_pattern
- the substring “morphine” somewhere in the name (name pattern: ‘*morphine*’):
https://cactus.nci.nih.gov/chemical/structure/*morphine*/smiles/xml?resolver=name_pattern
- the substring “morphine” and the literal string “3-methyl ether” (name pattern: ‘+*morphine* +"3-methyl ether"’):
https://cactus.nci.nih.gov/chemical/structure/+*morphine* +"3-methyl ether"/smiles/xml?resolver=name_pattern
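The patterns in this list follow a simple scheme: a leading + for words that must occur, a leading - for words that must not occur, * as a wildcard, and double quotes around a literal phrase. As a small illustration, a pattern string could be composed like this in Python; `make_pattern` is a made-up helper for this post and not something CIR or Sphinx provides.

```python
def make_pattern(required=(), excluded=()):
    """Join search terms into a name pattern string (illustrative helper)."""

    def term(text):
        # A term containing a space is wrapped in double quotes so that
        # it is searched as a literal phrase.
        return f'"{text}"' if " " in text else text

    parts = [f"+{term(t)}" for t in required]
    parts += [f"-{term(t)}" for t in excluded]
    return " ".join(parts)


print(make_pattern(["morphine", "methyl"]))
print(make_pattern(["morphine", "methyl"], excluded=["ester"]))
print(make_pattern(["*morphine*", "3-methyl ether"]))
```

When such a pattern is placed into a request URL from a script, it is safer to percent-encode it (e.g. with `urllib.parse.quote`), since it contains spaces, plus signs, and quotes.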
Fancy things
- a single character “m” and the word “benzene” within a maximum distance of 3 words (nice for finding smaller aromatic ring systems; name pattern: ‘"m benzene"~3’):
https://cactus.nci.nih.gov/chemical/structure/"m benzene"~3/smiles/xml?resolver=name_pattern
- the word “magnesium” or “sodium”, and the word “chloride” (name pattern: ‘(magnesium|sodium) +chloride’):
https://cactus.nci.nih.gov/chemical/structure/(magnesium|sodium) +chloride/smiles/xml?resolver=name_pattern
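The proximity and OR patterns above can be written down the same way. The helpers `near` and `any_of` below are hypothetical names used only for this sketch; the last line shows what the OR pattern looks like once it is percent-encoded for use in a URL path.

```python
from urllib.parse import quote


def near(words, distance):
    """Proximity pattern: all words within `distance` words of each other."""
    return '"{}"~{}'.format(" ".join(words), distance)


def any_of(words):
    """OR group: at least one of the given words must occur."""
    return "(" + "|".join(words) + ")"


proximity = near(["m", "benzene"], 3)
salts = any_of(["magnesium", "sodium"]) + " +chloride"

print(proximity)
print(salts)
# Parentheses, the pipe, the space and the plus sign are all escaped here.
print(quote(salts, safe=""))
```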
(Long) Chemical names
The name pattern module can also be used to match full chemical names; however, performance is not as good. Additionally, if you want to search for, e.g., ‘[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]-[4-(difluoromethoxy)benzyl]-methyl-ammonium’, you have to quote it; otherwise dashes are interpreted as minus operators (excluding the next word). So here is the correct way to request a full name:
https://cactus.nci.nih.gov/chemical/structure/"[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]-[4-(difluoromethoxy)benzyl]-methyl-ammonium"/smiles/xml?resolver=name_pattern
However, for better performance, and because the spelling used in the database is uncertain, it is probably smarter to ignore the punctuation and try it this way (which includes only the important chemical syllables of the name above, linked by + operators; name pattern: ‘+"(1R)-1-benzyl" +keto +propyl +amino +ethyl +difluoromethoxy +ammonium’):
https://cactus.nci.nih.gov/chemical/structure/+"(1R)-1-benzyl" +keto +propyl +amino +ethyl +difluoromethoxy +ammonium/smiles/xml?resolver=name_pattern
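Reducing a long name to its syllables can be roughly automated. The Python sketch below is a crude heuristic of my own for illustration, not what CIR does internally: it keeps runs of letters, drops very short fragments and duplicates, and links the rest with + operators. Unlike the hand-picked pattern above, it does not keep the quoted “(1R)-1-benzyl” phrase and it retains “methyl”.

```python
import re


def syllable_pattern(name, min_length=3):
    """Build a '+word +word ...' pattern from a long chemical name."""
    # Keep runs of letters only; digits, brackets and dashes are dropped.
    words = re.findall(r"[A-Za-z]+", name)
    kept = []
    for word in words:
        word = word.lower()
        # Skip short fragments such as the "R" of "(1R)" and any repeats.
        if len(word) >= min_length and word not in kept:
            kept.append(word)
    return " ".join("+" + word for word in kept)


name = ("[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]-"
        "[4-(difluoromethoxy)benzyl]-methyl-ammonium")
print(syllable_pattern(name))
```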
I hope it is useful,
Markus