Tag Archives: chemical identifier resolver options

Chemical Name Pattern Searching (or even more on chemical Name Searching)

There is a new resolver module available at CIR (Chemical Identifier Resolver): “name patterns”. It allows for Google-like searches on our name index, which currently holds approx. 70 million names. Internally, the module regards a chemical name, for instance “(p-Nitrobenzoyl)acetone”, as a sentence consisting of the three words “p”, “nitrobenzoyl”, and “acetone”. The module is still somewhat experimental, i.e. it might not work at times, but we would be happy to hear about how useful it is. The search for chemical names is performed by the Sphinx SQL full-text search engine.
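
To make the “name as a sentence” idea concrete, here is a minimal Python sketch of such a word split. The regular expression and the helper name name_words are assumptions for illustration only; the real tokenization is done by Sphinx and may differ in detail.

```python
import re

def name_words(name):
    """Split a chemical name into lowercase words.

    Illustrative assumption, not the actual Sphinx tokenizer:
    every run of letters or digits counts as one word.
    """
    return re.findall(r"[a-z0-9]+", name.lower())

print(name_words("(p-Nitrobenzoyl)acetone"))
```

With this split, brackets and dashes act only as word separators, which is why a pattern search can address “nitrobenzoyl” and “acetone” independently.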

Before I show some examples, please note the following: for performance reasons, CIR does not use the name pattern module by default. It has to be requested explicitly with the resolver URL parameter “resolver”, e.g.

http://cactus.nci.nih.gov/chemical/structure/morphine/smiles/xml?resolver=name_pattern

This is the same behavior as we described in this earlier post for the “name by chemspider” resolver module. The URL shown above requests the SMILES strings for all structures containing the word “morphine” in their name. The result is always restricted to the 100 most relevant entries (the relevance of a name is determined by Sphinx).

The usage of the name pattern module can also be combined with any other CIR module, e.g. the following request includes the “name by chemspider” module (for a full name match) and the “name pattern” module:

http://cactus.nci.nih.gov/chemical/structure/morphine/smiles/xml?resolver=name_by_chemspider,name_pattern
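
If you build such requests from a script, a small helper keeps the resolver list in one place. This is only a sketch: the function name resolver_url and its signature are hypothetical, and it does nothing more than assemble the URL pattern shown above.

```python
from urllib.parse import quote

BASE = "http://cactus.nci.nih.gov/chemical/structure"

def resolver_url(identifier, representation, resolvers=(), xml=False):
    """Assemble a CIR request URL (hypothetical helper).

    `resolvers` lists the module names that go into the `resolver`
    query parameter, in the order they should appear.
    """
    url = f"{BASE}/{quote(identifier, safe='')}/{representation}"
    if xml:
        url += "/xml"
    if resolvers:
        url += "?resolver=" + ",".join(resolvers)
    return url

print(resolver_url("morphine", "smiles",
                   ["name_by_chemspider", "name_pattern"], xml=True))
```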

If you look at the XML returned by CIR, the name that the name pattern matched is available from the notation attribute of a data item:

<data id="3" resolver="name_pattern" string_class="Chemical Name Pattern" notation="Morphine N-oxide">
   <item id="1">[C@@]125C3=C4C[C@H]([C@@H]1C=C[C@@H]([C@@H]2OC3=C(C=C4)O)O)[N+](C)([O-])CC5</item>
</data>
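
Reading the matched name back out of such a response could look like the following sketch. The data element is the one shown above; the surrounding request root element and the helper name matched_names are assumptions made so that the snippet parses on its own.

```python
import xml.etree.ElementTree as ET

# The <data> element is copied from the response above; the <request>
# wrapper is an assumption standing in for the real root element.
SAMPLE = """<request>
<data id="3" resolver="name_pattern" string_class="Chemical Name Pattern" notation="Morphine N-oxide">
   <item id="1">[C@@]125C3=C4C[C@H]([C@@H]1C=C[C@@H]([C@@H]2OC3=C(C=C4)O)O)[N+](C)([O-])CC5</item>
</data>
</request>"""

def matched_names(xml_text):
    """Return one (matched name, resolver module, item texts) tuple per <data> element."""
    root = ET.fromstring(xml_text)
    return [
        (data.get("notation"), data.get("resolver"),
         [item.text for item in data.iter("item")])
        for data in root.iter("data")
    ]

for name, resolver, items in matched_names(SAMPLE):
    print(name, resolver, items)
```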

Examples

Now the cool things – all following examples request the SMILES strings for all structures whose names contain:

  • the words “morphine” and “methyl” (name pattern: ‘+morphine +methyl‘):
http://cactus.nci.nih.gov/chemical/structure/+morphine +methyl/smiles/xml?resolver=name_pattern
  • the words “morphine” and “methyl” but not “ester” (name pattern: ‘+morphine +methyl -ester’):
http://cactus.nci.nih.gov/chemical/structure/+morphine +methyl -ester/smiles/xml?resolver=name_pattern
  • the substring “morphine” as a word end (name pattern: ‘*morphine‘):
http://cactus.nci.nih.gov/chemical/structure/*morphine/smiles/xml?resolver=name_pattern
  • the substring “morphine” somewhere in the name (name pattern: ‘*morphine*‘) :
http://cactus.nci.nih.gov/chemical/structure/*morphine*/smiles/xml?resolver=name_pattern
  • the substring “morphine” and the literal string “3-methyl ether” (name pattern: ‘+*morphine* +"3-methyl ether"’):
http://cactus.nci.nih.gov/chemical/structure/+*morphine* +"3-methyl ether"/smiles/xml?resolver=name_pattern

Fancy things

  • a single character “m” and the word “benzene” in a maximum distance of 3 words (nice to find smaller aromatic ring systems, name pattern: ‘“m benzene”~3‘):
http://cactus.nci.nih.gov/chemical/structure/"m benzene"~3/smiles/xml?resolver=name_pattern
  • the words “magnesium” or “sodium” and the word “chloride” (name pattern: ‘(magnesium|sodium) +chloride‘)
http://cactus.nci.nih.gov/chemical/structure/(magnesium|sodium) +chloride/smiles/xml?resolver=name_pattern
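
A browser usually takes care of the spaces and quotes in these URLs. From a script it is safer to percent-encode the pattern yourself. The sketch below does that with Python's standard library; the helper name pattern_url is hypothetical, and I am assuming the server decodes percent-escapes in the path in the standard way.

```python
from urllib.parse import quote

BASE = "http://cactus.nci.nih.gov/chemical/structure"

def pattern_url(pattern, representation="smiles"):
    """Build a name pattern request with the pattern percent-encoded.

    safe="" makes quote() escape spaces, quotes, '+', '|', '(' and ')'
    instead of passing them through unchanged.
    """
    encoded = quote(pattern, safe="")
    return f"{BASE}/{encoded}/{representation}/xml?resolver=name_pattern"

for pattern in ['+morphine +methyl',
                '"m benzene"~3',
                '(magnesium|sodium) +chloride']:
    print(pattern_url(pattern))
```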

(Long) Chemical names

The name pattern module can also be used to match full chemical names, although performance is not as good. Additionally, if you want to search, e.g., for ‘[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]-[4-(difluoromethoxy)benzyl]-methyl-ammonium’ you have to quote it; otherwise dashes are read as minus operators (excluding the next word). So here is the correct way to request a full name:

http://cactus.nci.nih.gov/chemical/structure/"[2-[[(1R)-1-(benzyl)-2-keto-propyl]amino]-2-keto-ethyl]-[4-(difluoromethoxy)benzyl]-methyl-ammonium"/smiles/xml?resolver=name_pattern

However, for better performance, and because the spelling used in the database is uncertain, it is probably smarter to ignore punctuation and try it this way (which only includes the important chemical syllables of the name above, linked by + operators; name pattern: ‘+"(1R)-1-benzyl" +keto +propyl +amino +ethyl +difluoromethoxy +ammonium’):

http://cactus.nci.nih.gov/chemical/structure/"(1R)-1-benzyl+" +keto +propyl +amino +ethyl +difluoromethoxy +ammonium/smiles/xml?resolver=name_pattern
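
Assembling such an “all of these parts” pattern by hand is error-prone, so here is a minimal sketch of a helper that does it. The name all_of and its arguments are hypothetical; it only produces the pattern string and leaves the choice of syllables to you.

```python
def all_of(words, phrase=None):
    """Join parts into a pattern in which every part is required.

    Each word gets a '+' prefix. An optional phrase is wrapped in double
    quotes so that its dashes are not read as minus operators.
    """
    parts = []
    if phrase:
        parts.append(f'+"{phrase}"')
    parts.extend(f"+{word}" for word in words)
    return " ".join(parts)

print(all_of(["keto", "propyl", "amino", "ethyl", "difluoromethoxy", "ammonium"],
             phrase="(1R)-1-benzyl"))
```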

I hope it is useful,
Markus

OPSIN & Chemical Identifier Resolver: Resolving IUPAC Names

Daniel M. Lowe at the Unilever Centre for Molecular Science Informatics (University of Cambridge) and I have collaborated to integrate his very nice OPSIN software package into the Resolver (as an alternative to their own web service). OPSIN was initially started by Peter Corbett in Peter Murray-Rust’s group; Daniel, however, is responsible for the development of version 1.0.0, which was released recently and published in JCIM.

OPSIN is an Open Source Java library that parses systematic IUPAC names and converts them into a full structure representation. Our Resolver has so far attempted the same thing by a simple lookup in a large name index stored in its database (and admittedly, in some parts the quality of this name index is mediocre). The lookup of names in a database, of course, works less systematically than OPSIN (as only those names available in the database can be retrieved). However, it has the advantage that trivial names that do not follow a systematic nomenclature can also be converted into a full structure representation if they are present in the database. So Daniel and I thought that combining both approaches would produce a very powerful tool for name-to-structure conversion.

How it works

The IUPAC name “spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]” can only be resolved by OPSIN and is not available in the Resolver name index. Starting with the beta 4 version (to which we switched over yesterday), the Resolver now automatically uses OPSIN as well, e.g.:

http://cactus.nci.nih.gov/chemical/structure/spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]/image
http://cactus.nci.nih.gov/chemical/structure/spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]/smiles

A name example only resolvable by the Resolver’s name index is “Warfarin“:

http://cactus.nci.nih.gov/chemical/structure/warfarin/image

As you can see from these URLs, you do not have to specify whether OPSIN or the database lookup should be used.

However, if you want to make sure that a specific method is applied, you need to specify the corresponding resolver module explicitly (see “?resolver” query parameter “name_by_opsin” or “name_by_database“):

http://cactus.nci.nih.gov/chemical/structure/hex-1-yne/image?resolver=name_by_opsin
http://cactus.nci.nih.gov/chemical/structure/hex-1-yne/image?resolver=name_by_database

Alternatively, if you would like the Resolver to tell you which of the two name resolving modules has worked for a specific name, you can use the xml format (it returns the applied resolver module as one of the XML tag attributes):

http://cactus.nci.nih.gov/chemical/structure/spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]/smiles/xml
http://cactus.nci.nih.gov/chemical/structure/warfarin/smiles/xml
http://cactus.nci.nih.gov/chemical/structure/hex-1-yne/smiles/xml

As Daniel’s web page of name examples shows, OPSIN also accepts Greek (Unicode) characters. Hence, we enhanced the Resolver to do the same thing:

http://cactus.nci.nih.gov/chemical/structure/(3β)-cholest-5-en-3-ol/image

Also, more complex names (e.g. “pentacyclo[13.7.4.3^3,8.0^18,20.1^13,28]triacontane”, with the superscripts written using “^”) should be URL-encoded as Daniel’s examples show (see “von Baeyer systems”):

http://cactus.nci.nih.gov/chemical/structure/pentacyclo%5B13.7.4.3^3%2C8.0^18%2C20.1^13%2C28%5Dtriacontane/image
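
If you generate these URLs from a script, the encoding can be left to the standard library. The following is a sketch under the assumption that the Resolver expects non-ASCII characters as percent-encoded UTF-8; the helper name name_url is hypothetical, and “^” is kept unescaped only to mirror the URL above.

```python
from urllib.parse import quote

BASE = "http://cactus.nci.nih.gov/chemical/structure"

def name_url(name, representation="image"):
    """Percent-encode a chemical name for use in the URL path.

    '^' is left as-is to mirror the example URL; brackets, commas and
    non-ASCII letters such as Greek characters are escaped (as UTF-8).
    """
    return f"{BASE}/{quote(name, safe='^')}/{representation}"

print(name_url("pentacyclo[13.7.4.3^3,8.0^18,20.1^13,28]triacontane"))
print(name_url("(3β)-cholest-5-en-3-ol"))
```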

Well, and finally – to get some graphics in here – let’s twirl around “L-alanyl-L-glutaminyl-L-arginyl-O-phosphono-L-seryl-L-alanyl-L-proline” converted by OPSIN into a structure (3D coordinates are calculated by CORINA):

I hope you find it helpful,
Markus

Standard InChIKey Lookup

In the previous version of the Chemical Identifier Resolver, only full-length Standard InChIKeys were accepted as the identifier part of the requested URL, and any successful request returned the representation of a single structure record. However, this neglects a characteristic of Standard InChIKeys that was specifically implemented for interlinking highly related (but not always exactly identical) chemical compounds. For instance, the full-length Standard InChIKey ADVPTQAUNPRNPO-UHFFFAOYSA-N represents 3-sulfino-alanine as well as its zwitterionic form (in other words, both are regarded as the same chemical compound by Standard InChIKey).

Starting with the Beta 2 version of the Resolver, we have changed how a Standard InChIKey is looked up in the database. A request by Standard InChIKey now returns all structure records that have this key as their Standard InChIKey (previously only the first structure record for a Standard InChIKey was returned). For the request

http://cactus.nci.nih.gov/chemical/structure/InChIKey=ADVPTQAUNPRNPO-UHFFFAOYSA-N/smiles

the Resolver now returns the SMILES strings for 3-sulfino-alanine and for its zwitterion:

NC(C[S](O)=O)C(O)=O
[NH3+]C(C[S]([O-])=O)C(O)=O

To access a specific structure record use the URL option structure_index:

http://cactus.nci.nih.gov/chemical/structure/InChIKey=ADVPTQAUNPRNPO-UHFFFAOYSA-N/smiles?structure_index=0

NC(C[S](O)=O)C(O)=O

http://cactus.nci.nih.gov/chemical/structure/InChIKey=ADVPTQAUNPRNPO-UHFFFAOYSA-N/smiles?structure_index=1

[NH3+]C(C[S]([O-])=O)C(O)=O

Likewise, this works for any other structure representation available from the Resolver, i.e. if the option structure_index is not used, the representations of all matching structure records are returned.
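
For scripted access, the two cases (all records versus one record) can be sketched as follows. The helper names record_url and split_records are hypothetical, and the assumption that a plain-text response carries one record per line is taken from the SMILES output shown above, which is reused here so that no network access is needed.

```python
BASE = "http://cactus.nci.nih.gov/chemical/structure"

def record_url(inchikey, representation, structure_index=None):
    """URL for all records of a Standard InChIKey, or for a single
    record if structure_index is given."""
    url = f"{BASE}/InChIKey={inchikey}/{representation}"
    if structure_index is not None:
        url += f"?structure_index={structure_index}"
    return url

def split_records(response_text):
    """Split a plain-text response into one entry per non-empty line."""
    return [line.strip() for line in response_text.splitlines() if line.strip()]

# Response text copied from the example above.
response = "NC(C[S](O)=O)C(O)=O\n[NH3+]C(C[S]([O-])=O)C(O)=O\n"

print(record_url("ADVPTQAUNPRNPO-UHFFFAOYSA-N", "smiles", structure_index=1))
print(split_records(response))
```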

Note: be careful with the request for names. Both forms of 3-sulfino-alanine have more than one name, so the request for names returns a joined list of both name lists:

http://cactus.nci.nih.gov/chemical/structure/InChIKey=ADVPTQAUNPRNPO-UHFFFAOYSA-N/names

If you want to separate both, please use the structure_index option again. For TwirlyMol always use the structure_index option (otherwise only the first structure is returned):

http://cactus.nci.nih.gov/chemical/structure/InChIKey=ADVPTQAUNPRNPO-UHFFFAOYSA-N/twirl?structure_index=0&div_id=NAME0

http://cactus.nci.nih.gov/chemical/structure/InChIKey=ADVPTQAUNPRNPO-UHFFFAOYSA-N/twirl?structure_index=1&div_id=NAME1

Create Structure Images from Standard InChIKeys

As you might already have found out, the Chemical Identifier Resolver lets you create a GIF image from a Standard InChIKey very easily:

http://cactus.nci.nih.gov/chemical/structure/InChIKey=BSYNRYMUTXBXSQ-UHFFFAOYSA-N/image

The same can be done for any chemical structure identifier accepted by the Resolver:

http://cactus.nci.nih.gov/chemical/structure/morphine/image
http://cactus.nci.nih.gov/chemical/structure/InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H/image
http://cactus.nci.nih.gov/chemical/structure/CC(=O)Oc1ccccc1C(O)=O/image

The images are all created by CACTVS. So far, the service has always returned a 250×250 GIF image, but you might of course want more control over how the structure image is created. So we added a few (URL) options to the image method of the Resolver. For instance, the following image has just been created from the URL shown in the caption:

http://cactus.nci.nih.gov/chemical/structure/InChIKey=BSYNRYMUTXBXSQ-UHFFFAOYSA-N/image?footer=BSYNRYMUTXBXSQ-UHFFFAOYSA-N&width=500


More options are:

Create a PNG image instead of GIF:

http://cactus.nci.nih.gov/chemical/structure/aspirin/image?format=png

Change width, height, linewidth and fontsize:

http://cactus.nci.nih.gov/chemical/structure/aspirin/image?width=500&height=500&linewidth=2&symbolfontsize=16

Add some background color:

http://cactus.nci.nih.gov/chemical/structure/aspirin/image?bgcolor=yellow

You can also use HTML hex color codes (the ‘#’ character has to be URL-escaped as ‘%23’ in this case):

http://cactus.nci.nih.gov/chemical/structure/aspirin/image?bgcolor=%23AADDEE

For an image with transparent background use ‘transparent’ as color name and switch off antialiasing:

http://cactus.nci.nih.gov/chemical/structure/aspirin/image?bgcolor=transparent&antialiasing=0

Show black atom labels instead of the default color scheme for the different atom element types:

http://cactus.nci.nih.gov/chemical/structure/aspirin/image?atomcolor=black

Control which hydrogen atoms are shown:

The default value is special, i.e. only hydrogen atoms in functional groups or defining stereochemistry are shown.

http://cactus.nci.nih.gov/chemical/structure/aspirin/image?hsymbol=special
http://cactus.nci.nih.gov/chemical/structure/aspirin/image?hsymbol=all

Control how carbon atoms are shown:

The default value is special; if all is used, all carbon atoms are shown with their atom symbol:

http://cactus.nci.nih.gov/chemical/structure/aspirin/image?csymbol=special
http://cactus.nci.nih.gov/chemical/structure/aspirin/image?csymbol=all

Change the color of hydrogen atoms:

http://cactus.nci.nih.gov/chemical/structure/aspirin/image?hcolor=gray

Use another color for bonds:

http://cactus.nci.nih.gov/chemical/structure/aspirin/image?bondcolor=red

Show R/S stereo labels:

http://cactus.nci.nih.gov/chemical/structure/taxol/image?showstereo=0
http://cactus.nci.nih.gov/chemical/structure/taxol/image?showstereo=1

Add some text to the image:

http://cactus.nci.nih.gov/chemical/structure/aspirin/image?header="Aspirin on the top"
http://cactus.nci.nih.gov/chemical/structure/aspirin/image?footer="Aspirin on the bottom"

Add a frame:

http://cactus.nci.nih.gov/chemical/structure/aspirin/image?frame=1

There are more options and we will document them more exhaustively later. If you are familiar with all options CACTVS has available for controlling the GIF/PNG generation, try them – chances are good that they might work. Please also visit our GIF Generator at http://cactus.nci.nih.gov.
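
When several options are combined, it is easy to forget the escaping of characters like ‘#’. A minimal Python sketch that lets the standard library handle this is shown below. The helper name image_url is hypothetical; the option names are the ones from the examples above, and the identifier is inserted into the path as-is.

```python
from urllib.parse import urlencode

BASE = "http://cactus.nci.nih.gov/chemical/structure"

def image_url(identifier, **options):
    """Append image options as a query string.

    urlencode() escapes reserved characters in the values,
    e.g. '#' becomes '%23'.
    """
    url = f"{BASE}/{identifier}/image"
    if options:
        url += "?" + urlencode(options)
    return url

print(image_url("aspirin", bgcolor="#AADDEE"))
print(image_url("aspirin", width=500, height=500, linewidth=2, symbolfontsize=16))
```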

Resolve a structure identifier as SDF, CML, MRV, PDB …

We’d like to present a new feature of the Chemical Identifier Resolver: in addition to the already available SD file format representation

http://cactus.nci.nih.gov/chemical/structure/aspirin/sdf

the service can now represent a structure (identifier) also in many different text-based structure (file) formats. The general URL format is:

http://cactus.nci.nih.gov/chemical/structure/"identifier"/file?format="format"

The different chemical structure representations are generated by the chemoinformatic toolkit CACTVS. Although CACTVS can offer a whole lot more formats (including binary ones) we make the following (few) available here:

alc (Alchemy format)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=alc

cdxml (CambridgeSoft ChemDraw XML format)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=cdxml

cerius (MSI Cerius II format)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=cerius

charmm (Chemistry at HARvard Macromolecular Mechanics file format)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=charmm

cif (Crystallographic Information File)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=cif

cml (Chemical Markup Language)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=cml

ctx (Gasteiger Clear Text format)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=ctx

gjf (Gaussian input data file)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=gjf

gromacs (GROMACS file format)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=gromacs

hyperchem (HyperChem file format)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=hyperchem

jme (Java Molecule Editor format)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=jme

maestro (Schroedinger MacroModel structure file format)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=maestro

mol (Symyx molecule file)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=mol

mol2 (Tripos Sybyl MOL2 format)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=sybyl2

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=mol2

mrv (ChemAxon MRV format)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=mrv

pdb (Protein Data Bank)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=pdb

sdf (Symyx Structure Data Format)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=sdf

sdf3000 (Symyx Structure Data Format 3000)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=sdf3000

sln (SYBYL Line Notation)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=sln

smiles (SMILES)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=smiles

xyz (xyz file format)

http://cactus.nci.nih.gov/chemical/structure/aspirin/file?format=xyz

Of course, all of these also work with a Standard InChIKey, SMILES or NCI/CADD Identifier as structure identifier:

http://cactus.nci.nih.gov/chemical/structure/InChIKey=BSYNRYMUTXBXSQ-UHFFFAOYSA-N/file?format=pdb
http://cactus.nci.nih.gov/chemical/structure/CC(=O)Oc1ccccc1C(O)=O/file?format=mrv
http://cactus.nci.nih.gov/chemical/structure/045DA3288E1A0233-FICuS-01-39/file?format=cml
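
To request several formats from a script, a small helper that checks the format name before building the URL may be handy. This is only a sketch: the function name file_url is hypothetical, the set of names is copied from the list above (including the sybyl2 alias), and the identifier is inserted into the path as-is.

```python
BASE = "http://cactus.nci.nih.gov/chemical/structure"

# Format names as listed above (an assumption that the list is complete
# for this endpoint).
FORMATS = {
    "alc", "cdxml", "cerius", "charmm", "cif", "cml", "ctx", "gjf",
    "gromacs", "hyperchem", "jme", "maestro", "mol", "mol2", "sybyl2",
    "mrv", "pdb", "sdf", "sdf3000", "sln", "smiles", "xyz",
}

def file_url(identifier, fmt):
    """Build a /file request URL, rejecting format names not listed above."""
    if fmt not in FORMATS:
        raise ValueError(f"format not in the list above: {fmt!r}")
    return f"{BASE}/{identifier}/file?format={fmt}"

for fmt in ("pdb", "mrv", "cml"):
    print(file_url("aspirin", fmt))
```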