
OPSIN 1.2

We have updated the OPSIN library used by CIR to version 1.2. Daniel lists the following improvements in his release notes:

  • Basic support for cyclised carbohydrates e.g. alpha-D-glucopyranose
  • Basic support for systematic carbohydrate stems e.g. D-glycero-D-gluco-Heptose
  • Added heuristic for correcting esters with omitted spaces
  • Added support for xanthates/xanthic acid
  • Minor vocabulary improvements
  • Fixed a few minor bugs/limitations in the Cahn-Ingold-Prelog rules implementation and made it more memory efficient
  • Many minor improvements and bug fixes

Nice Work!

Markus

OPSIN & Chemical Identifier Resolver: Resolving IUPAC Names

Daniel M. Lowe at the Unilever Centre for Molecular Science Informatics (University of Cambridge) and I have collaborated to integrate his very nice OPSIN software package into the Resolver (as an alternative to their own web service). OPSIN was originally started by Peter Corbett in Peter Murray-Rust’s group; however, Daniel is responsible for the development of version 1.0.0, which was released recently and published in JCIM.

OPSIN is an open-source Java library that parses systematic IUPAC names and converts them into a full structure representation. So far, our Resolver has attempted the same thing by a simple lookup in a large name index stored in its database (and admittedly, in some parts the quality of this name index is mediocre). Looking up names in a database is, of course, less systematic than OPSIN, as only names available in the database can be retrieved. However, it has the advantage that trivial names that do not follow a systematic nomenclature can also be converted into a full structure representation if they are present in the database. So Daniel and I thought that combining both approaches would produce a very powerful tool for name-to-structure conversion.

How it works

The IUPAC name “spiro[1,2-benzodithiole-3,2′-[1,3]benzodithiole]” can only be resolved by OPSIN and is not available in the Resolver name index. Starting with the beta 4 version (to which we switched over yesterday), the Resolver now automatically uses OPSIN as well, e.g.:

http://cactus.nci.nih.gov/chemical/structure/spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]/image
http://cactus.nci.nih.gov/chemical/structure/spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]/smiles

An example of a name that can only be resolved by the Resolver’s name index is “Warfarin”:

http://cactus.nci.nih.gov/chemical/structure/warfarin/image

As you can see from these URLs, you do not need to specify explicitly whether OPSIN or the database lookup should be used.
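For scripted access, the URL scheme above can be wrapped in a small helper. This is only a minimal Python sketch using the standard library: the function names build_url and resolve are made up for illustration, and treating an HTTP 404 response as "name could not be resolved" is an assumption about the service's behaviour, not something stated in this post.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_url(identifier, representation):
    """Build <base>/<identifier>/<representation> (hypothetical helper)."""
    # Percent-encode the identifier so it is safe to use as one path segment.
    return f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"


def resolve(identifier, representation="smiles", timeout=30):
    """Fetch a representation as text; no resolver module is specified.

    Assumption: an unresolvable identifier is answered with HTTP 404,
    which is mapped to None here. Other HTTP errors are re-raised.
    """
    try:
        with urlopen(build_url(identifier, representation), timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as error:
        if error.code == 404:
            return None
        raise
```

Calling resolve("warfarin") would then request the SMILES string for warfarin without saying which of the two name modules should do the work.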

However, if you want to make sure that a specific method is applied, you need to specify the corresponding resolver module explicitly (see the “resolver” query parameter with the value “name_by_opsin” or “name_by_database”):

http://cactus.nci.nih.gov/chemical/structure/hex-1-yne/image?resolver=name_by_opsin
http://cactus.nci.nih.gov/chemical/structure/hex-1-yne/image?resolver=name_by_database
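If you want to compare both modules for the same name from a script, the two URLs can be generated like this. It is a small Python sketch; image_url and RESOLVER_MODULES are hypothetical names, and only the two parameter values mentioned above are used.

```python
from urllib.parse import urlencode

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"

# The two name modules named in this post.
RESOLVER_MODULES = ("name_by_opsin", "name_by_database")


def image_url(name, resolver=None):
    """Return the image URL for a name, optionally pinned to one module."""
    url = f"{BASE_URL}/{name}/image"
    if resolver is not None:
        # Appends "?resolver=<module>" as the query string.
        url += "?" + urlencode({"resolver": resolver})
    return url


# One URL per module, e.g. to check which of them can handle "hex-1-yne".
urls = {module: image_url("hex-1-yne", module) for module in RESOLVER_MODULES}
```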

Alternatively, if you would like the Resolver to tell you which of the two name-resolving modules worked for a specific name, you can use the XML format (it returns the applied resolver module as one of the XML tag attributes):

http://cactus.nci.nih.gov/chemical/structure/spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]/smiles/xml
http://cactus.nci.nih.gov/chemical/structure/warfarin/smiles/xml
http://cactus.nci.nih.gov/chemical/structure/hex-1-yne/smiles/xml
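Reading that attribute from a script could look roughly like the following Python sketch. Please note the assumptions: the attribute is taken to be called "resolver", and SAMPLE_XML is a hand-written stand-in, not a captured response, so the element names in it are purely illustrative. The function therefore searches every element rather than relying on a particular document layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body, written by hand for illustration only.
SAMPLE_XML = """<request string="hex-1-yne" representation="smiles">
  <data id="1" resolver="name_by_opsin">
    <item id="1">C#CCCCC</item>
  </data>
</request>"""


def applied_resolvers(xml_text):
    """Return the value of every 'resolver' attribute found in the document.

    Assumption: the applied module is exposed in an attribute named
    'resolver'; all elements are scanned, whatever they are called.
    """
    root = ET.fromstring(xml_text)
    return [
        element.attrib["resolver"]
        for element in root.iter()
        if "resolver" in element.attrib
    ]
```

With the sample above, applied_resolvers(SAMPLE_XML) yields a list containing "name_by_opsin".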

As Daniel’s web page of name examples shows, OPSIN also accepts Greek (Unicode) characters; hence, we enhanced the Resolver to do the same thing:

http://cactus.nci.nih.gov/chemical/structure/(3β)-cholest-5-en-3-ol/image

More complex names (e.g. “pentacyclo[13.7.4.3^3,8.0^18,20.1^13,28]triacontane”, with superscripts written using “^”) should also be URL-encoded, as Daniel’s examples show (see “von Baeyer systems”):

http://cactus.nci.nih.gov/chemical/structure/pentacyclo%5B13.7.4.3^3%2C8.0^18%2C20.1^13%2C28%5Dtriacontane/image
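The encoding itself does not have to be done by hand. Here is a minimal Python sketch with urllib.parse.quote; encode_name is a hypothetical helper. Keeping "^" unescaped via the safe argument is a choice made here only so that the result looks like the URL above; with safe="" the caret would become %5E instead.

```python
from urllib.parse import quote


def encode_name(name):
    """Percent-encode a chemical name for use as a URL path segment.

    Brackets, commas, parentheses and non-ASCII characters (as UTF-8)
    are escaped; "^" is deliberately left as it is.
    """
    return quote(name, safe="^")


von_baeyer = encode_name("pentacyclo[13.7.4.3^3,8.0^18,20.1^13,28]triacontane")
# -> "pentacyclo%5B13.7.4.3^3%2C8.0^18%2C20.1^13%2C28%5Dtriacontane"

greek = encode_name("(3β)-cholest-5-en-3-ol")
# -> "%283%CE%B2%29-cholest-5-en-3-ol"
```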

Well, and finally – to get some graphics in here – let’s twirl around “L-alanyl-L-glutaminyl-L-arginyl-O-phosphono-L-seryl-L-alanyl-L-proline” converted by OPSIN into a structure (3D coordinates are calculated by CORINA):

I hope you find it helpful,
Markus

Molecular Weight, Formula and IUPAC Name

We added a few more structure representations that can be calculated from a structure identifier by the Chemical Identifier Resolver:

molecular weight and monoisotopic mass

http://cactus.nci.nih.gov/chemical/structure/aspirin/mw
http://cactus.nci.nih.gov/chemical/structure/InChIKey=RCINICONZNJXQF-MZXODVADSA-N/monoisotopic_mass

chemical formula

http://cactus.nci.nih.gov/chemical/structure/50-00-0/formula
http://cactus.nci.nih.gov/chemical/structure/nsc740/formula

IUPAC name

http://cactus.nci.nih.gov/chemical/structure/aspirin/iupac_name
http://cactus.nci.nih.gov/chemical/structure/acetone/iupac_name
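To collect several of these representations for one identifier, the URLs can be generated in a loop. This Python sketch only builds the URLs and converts a numeric answer; representation_urls and parse_mass are hypothetical helpers, and parse_mass assumes that the mw and monoisotopic_mass responses are a single plain-text number.

```python
BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"

# The representation names used in the example URLs of this post.
REPRESENTATIONS = ("mw", "monoisotopic_mass", "formula", "iupac_name")


def representation_urls(identifier):
    """Map each representation name to its request URL for one identifier."""
    return {rep: f"{BASE_URL}/{identifier}/{rep}" for rep in REPRESENTATIONS}


def parse_mass(response_text):
    """Convert a response body to float (assumes a bare number as text)."""
    return float(response_text.strip())


aspirin_urls = representation_urls("aspirin")
```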

If you use the IUPAC name representation for a structure identifier, please read the following notice: we took great care during the implementation of the name index for this web service; however, we are aware that it is far from perfect and has quite a few errors in it. Unfortunately, these errors are not easy to find when you have to deal with millions of names and their proper assignment to the correct chemical structure. If you find any mistakes, please tell us. Our plan is to improve the name index over time, and we are of course happy about any contributions that help with this process. Thanks!