KNIME node for CIR by Talete

Sorry for being so quiet recently; the preparation of new services at cactus has kept us busy (we will write more about it soon).

Andrea Mauri of Talete pointed me to the KNIME node they have implemented, which is available for download at their web site. If you work a lot with KNIME and need a lookup or conversion of different chemical structure representations, please take a look at their nice work (sorry, Andrea, for delaying this post for so long).

Hurricane Sandy

Because of Hurricane Sandy, we are (probably) going to shut down our web server, cactus.nci.nih.gov, on Monday, Oct-29, 6pm (EST) and expect the shutdown to last at least through Tuesday, Oct-30.

Sorry for the inconvenience,

Markus

 

OPSIN 1.2

We have updated the OPSIN library used by CIR to version 1.2. Daniel lists the following improvements in his release notes:

  • Basic support for cyclised carbohydrates e.g. alpha-D-glucopyranose
  • Basic support for systematic carbohydrate stems e.g. D-glycero-D-gluco-Heptose
  • Added heuristic for correcting esters with omitted spaces
  • Added support for xanthates/xanthic acid
  • Minor vocabulary improvements
  • Fixed a few minor bugs/limitations in the Cahn-Ingold-Prelog rules implementation and made more memory efficient
  • Many minor improvements and bug fixes

Nice work!

Markus

Chemical Name Pattern Searching (or even more on chemical Name Searching)

There is a new resolver module available at CIR (Chemical Identifier Resolver): “name patterns”. It allows for Google-like searches on our name index, which currently holds approx. 70 million names. Internally, the module regards a chemical name, for instance “(p-Nitrobenzoyl)acetone”, as a sentence consisting of the three words “p”, “nitrobenzoyl”, and “acetone”. The module is still somewhat experimental, i.e. it might not work at times, but we would be happy to hear how useful it is. The search for chemical names is performed by the Sphinx SQL full-text search engine.
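To make the “name as a sentence” idea concrete, here is a minimal Python sketch of how such a split could look. It assumes that every character that is not a letter or digit acts as a word separator; the actual tokenizer settings of the CIR name index are not described in this post, and the function name split_name_into_words is made up for illustration.

```python
import re
from typing import List


def split_name_into_words(name: str) -> List[str]:
    """Split a chemical name into lowercase alphanumeric 'words'.

    Assumption: anything that is not a letter or digit separates words.
    This only approximates how a full-text index might see the name.
    """
    return re.findall(r"[a-z0-9]+", name.lower())


print(split_name_into_words("(p-Nitrobenzoyl)acetone"))
# ['p', 'nitrobenzoyl', 'acetone']
```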

Before I show some examples, please note the following: for performance reasons, CIR does not use the name pattern module by default. It has to be requested explicitly with the resolver URL parameter “resolver”, e.g.

https://cactus.nci.nih.gov/chemical/structure/morphine/smiles/xml?resolver=name_pattern

This is identical to the behavior we described in an earlier post for the “name by chemspider” resolver module. The URL shown above requests the SMILES strings for all structures containing the word “morphine” in their name. The result is always restricted to the 100 most relevant entries (the relevance of a name is determined by Sphinx).

The name pattern module can also be combined with any other CIR module, e.g. the following request includes the “name by chemspider” module (for a full name match) and the “name pattern” module:

https://cactus.nci.nih.gov/chemical/structure/morphine/smiles/xml?resolver=name_by_chemspider,name_pattern
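If you assemble such URLs in a script, a small helper keeps the pieces apart. The following Python sketch only builds the URL string and does not contact the server. The function name build_cir_url and its parameters are made up for illustration, and percent-encoding the identifier assumes that the server decodes escapes such as %20 and %2B in the usual way.

```python
from urllib.parse import quote

BASE_URL = "https://cactus.nci.nih.gov/chemical/structure"


def build_cir_url(identifier, representation="smiles", resolvers=(), xml=True):
    """Build a CIR request URL of the form shown in this post.

    identifier     -- name or name pattern (percent-encoded for the path)
    representation -- requested output, e.g. "smiles"
    resolvers      -- resolver module names, joined by commas
    xml            -- append "/xml" to request the XML response
    """
    # safe="" also encodes "/" so a name cannot break the path layout.
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    if xml:
        url += "/xml"
    if resolvers:
        url += "?resolver=" + ",".join(resolvers)
    return url


print(build_cir_url("morphine", resolvers=["name_by_chemspider", "name_pattern"]))
```

Leaving resolvers empty yields a URL without the “resolver” parameter, i.e. a request that uses the CIR defaults.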

If you look at the XML returned by CIR, the name that the name pattern matched is available from the notation attribute of a data item:

<data id="3" resolver="name_pattern" string_class="Chemical Name Pattern" notation="Morphine N-oxide">
   <item id="1">[C@@]125C3=C4C[C@H]([C@@H]1C=C[C@@H]([C@@H]2OC3=C(C=C4)O)O)[N+](C)([O-])CC5</item>
</data>
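As a rough sketch, the matched name and the SMILES string can be pulled out of such a data element with Python's standard library. The sample below is just the fragment shown above; the element that wraps the data items in a full response is not shown in this post, so the code searches for data elements at any depth. The function name matched_structures is made up for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<data id="3" resolver="name_pattern" string_class="Chemical Name Pattern" notation="Morphine N-oxide">
   <item id="1">[C@@]125C3=C4C[C@H]([C@@H]1C=C[C@@H]([C@@H]2OC3=C(C=C4)O)O)[N+](C)([O-])CC5</item>
</data>"""


def matched_structures(xml_text):
    """Return (matched name, SMILES) pairs from name_pattern data items."""
    root = ET.fromstring(xml_text)
    pairs = []
    # iter() also yields the root itself, so a bare <data> fragment works too.
    for data in root.iter("data"):
        if data.get("resolver") != "name_pattern":
            continue
        name = data.get("notation")
        for item in data.iter("item"):
            pairs.append((name, (item.text or "").strip()))
    return pairs


for name, smiles in matched_structures(SAMPLE):
    print(name, smiles)
```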

Examples

Now the cool things – all following examples request the SMILES strings for all structures whose names contain:

  • the words “morphine” and “methyl” (name pattern: ‘+morphine +methyl‘):
https://cactus.nci.nih.gov/chemical/structure/+morphine +methyl/smiles/xml?resolver=name_pattern
  • the words “morphine” and “methyl” but not “ester” (name pattern: ‘+morphine +methyl -ester‘):
https://cactus.nci.nih.gov/chemical/structure/+morphine +methyl -ester/smiles/xml?resolver=name_pattern
  • the substring “morphine” at the end of a word (name pattern: ‘*morphine‘):
https://cactus.nci.nih.gov/chemical/structure/*morphine/smiles/xml?resolver=name_pattern
  • the substring “morphine” somewhere in the name (name pattern: ‘*morphine*‘):
https://cactus.nci.nih.gov/chemical/structure/*morphine*/smiles/xml?resolver=name_pattern
  • the substring “morphine” and the literal phrase “3-methyl ether” (name pattern: ‘+*morphine* +"3-methyl ether"‘):
https://cactus.nci.nih.gov/chemical/structure/+*morphine* +"3-methyl ether"/smiles/xml?resolver=name_pattern
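The patterns in the list above all follow the same scheme: a leading + for terms that must occur, a leading - for terms that must not occur, and double quotes around literal phrases. A minimal Python sketch that assembles such pattern strings could look like this (build_name_pattern and its parameter names are made up for illustration; the function only concatenates strings and does not validate them against the search engine):

```python
def build_name_pattern(required=(), excluded=(), phrases=()):
    """Assemble a name pattern from its parts.

    required -- words (wildcards like "*morphine*" allowed) prefixed with "+"
    phrases  -- literal phrases, wrapped in double quotes and prefixed with "+"
    excluded -- words prefixed with "-"
    """
    parts = [f"+{word}" for word in required]
    parts += [f'+"{phrase}"' for phrase in phrases]
    parts += [f"-{word}" for word in excluded]
    return " ".join(parts)


print(build_name_pattern(required=["morphine", "methyl"], excluded=["ester"]))
# +morphine +methyl -ester
print(build_name_pattern(required=["*morphine*"], phrases=["3-methyl ether"]))
# +*morphine* +"3-methyl ether"
```

The resulting string goes into the URL in place of the name; spaces and quotes should be percent-encoded if the URL is built in a script rather than typed into a browser.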

Fancy things

  • a single character “m” and the word “benzene” within a maximum distance of 3 words (nice for finding smaller aromatic ring systems, name pattern: ‘"m benzene"~3‘):
https://cactus.nci.nih.gov/chemical/structure/"m benzene"~3/smiles/xml?resolver=name_pattern
  • the words “magnesium” or “sodium” and the word “chloride” (name pattern: ‘(magnesium|sodium) +chloride‘)
https://cactus.nci.nih.gov/chemical/structure/(magnesium|sodium) +chloride/smiles/xml?resolver=name_pattern
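The two operators used above, proximity (~N after a quoted phrase) and alternatives (words joined by | inside parentheses), can be wrapped in small helpers as well. This is only a string-building sketch in Python; the helper names proximity and any_of are made up for illustration.

```python
def proximity(words, max_distance):
    """Quote the words and append ~N, e.g. '"m benzene"~3'."""
    return '"' + " ".join(words) + f'"~{max_distance}'


def any_of(words):
    """Join alternatives with | inside parentheses, e.g. '(magnesium|sodium)'."""
    return "(" + "|".join(words) + ")"


print(proximity(["m", "benzene"], 3))
# "m benzene"~3
print(any_of(["magnesium", "sodium"]) + " +chloride")
# (magnesium|sodium) +chloride
```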

(Long) Chemical names

The name pattern module can also be used to match full chemical names, although performance is not as good. Additionally, if you want to search for, e.g., ‘[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]-[4-(difluoromethoxy)benzyl]-methyl-ammonium‘, you have to quote it; otherwise dashes are interpreted as minus operators (excluding the next word). So here is the correct way to request a full name:

https://cactus.nci.nih.gov/chemical/structure/"[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]-[4-(difluoromethoxy)benzyl]-methyl-ammonium"/smiles/xml?resolver=name_pattern

However, for better performance, and because of uncertainty about the spelling used in the database, it is probably smarter to ignore punctuation and try it this way, which only includes the important chemical syllables of the name above, linked by + operators (name pattern: ‘+"(1R)-1-benzyl" +keto +propyl +amino +ethyl +difluoromethoxy +ammonium‘):

https://cactus.nci.nih.gov/chemical/structure/"(1R)-1-benzyl+" +keto +propyl +amino +ethyl +difluoromethoxy +ammonium/smiles/xml?resolver=name_pattern

I hope it is useful,
Markus