Finally, it is “rolling out” – it was a lot of hard work 😉 :
Here are my slides for the talk I gave at the 9th ICCS Conference in Noordwijkerhout:
Other than that, I apologize for being so quiet regarding pychembl, but I am working on continuing this blog post series (the interesting stuff is yet to come) as well as preparing pychembl for working with chembl_10.
Daniel M. Lowe at the Unilever Centre for Molecular Science Informatics (University of Cambridge) and I have collaborated to integrate his very nice OPSIN software package into the Resolver (as an alternative to their own web service). OPSIN was originally started by Peter Corbett in Peter Murray-Rust’s group; Daniel, however, is responsible for the development of version 1.0.0, which was released recently and published in JCIM.
OPSIN is an Open Source Java library that parses systematic IUPAC names and converts them into a full structure representation. So far, our Resolver has attempted the same thing by a simple lookup in a large name index stored in its database (and admittedly, in some parts the quality of this name index is mediocre). Looking up names in a database is, of course, less systematic than OPSIN, as only names present in the database can be retrieved. However, it has the advantage that trivial names that do not follow a systematic nomenclature can also be converted into a full structure representation if they are present in the database. So Daniel and I thought that combining both approaches would produce a very powerful tool for name-to-structure conversion.
How it works
The IUPAC name “spiro[1,2-benzodithiole-3,2′-[1,3]benzodithiole]” can only be resolved by OPSIN and is not available in the Resolver name index. Starting with the beta 4 version (to which we switched over yesterday), the Resolver now automatically uses OPSIN as well, e.g.:
A name example only resolvable by the Resolver’s name index is “Warfarin“:
As you can see from these URLs, you do not need to specify explicitly whether OPSIN or the database lookup should be used.
However, if you want to make sure that a specific method is applied, you need to specify the corresponding resolver module explicitly (see the “?resolver” query parameter with the values “name_by_opsin” or “name_by_database“):
Alternatively, if you would like the Resolver to tell you which of the two name resolving modules has worked for a specific name, you can use the xml format (it returns the applied resolver module as one of the XML tag attributes):
http://cactus.nci.nih.gov/chemical/structure/spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]/smiles/xml
http://cactus.nci.nih.gov/chemical/structure/warfarin/smiles/xml
http://cactus.nci.nih.gov/chemical/structure/hex-1-yne/smiles/xml
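To pick the applied resolver module out of such an xml response in a script, something like the following Python sketch could work. Note that the sample document below is only a guess at the layout (the element names `request`, `data` and `item` are assumptions, not copied from a real response); the helper `applied_resolvers` is a made-up name and simply collects every `resolver` attribute it finds, so it does not depend on the exact nesting.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response: the real Resolver output may be laid out
# differently. Only the existence of a "resolver" attribute is taken from
# the text above.
SAMPLE_XML = """<request string="warfarin" representation="smiles">
  <data id="1" resolver="name_by_database">
    <item id="1">CC(=O)CC(c1ccccc1)c1c(O)c2ccccc2oc1=O</item>
  </data>
</request>"""


def applied_resolvers(xml_text):
    """Return the value of every 'resolver' attribute in the document."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants
    return [
        element.get("resolver")
        for element in root.iter()
        if element.get("resolver") is not None
    ]


if __name__ == "__main__":
    print(applied_resolvers(SAMPLE_XML))
```

For a live request, the text of the HTTP response would be passed to `applied_resolvers` instead of `SAMPLE_XML`.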
As Daniel’s web page of name examples shows, OPSIN also accepts Greek (Unicode) characters – hence, we enhanced the Resolver to do the same thing:
More complex names (e.g. “pentacyclo[13.7.4.3^3,8.0^18,20.1^13,28]triacontane”) should also be URL-encoded, as Daniel’s examples show (see “von Baeyer systems”):
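If you build such URLs in a script, the encoding can be left to a library. Here is a minimal Python sketch; the helper name `resolver_url` and its parameters are my own invention for illustration, and it assumes the path layout `/chemical/structure/<name>/<representation>` seen in the URLs of this post plus the “?resolver” query parameter mentioned above. The Greek-letter name in the usage part is just an illustration and not taken from Daniel’s page.

```python
from urllib.parse import quote, urlencode

BASE = "http://cactus.nci.nih.gov/chemical/structure"


def resolver_url(name, representation="smiles", resolver=None, xml=False):
    """Build a request URL for a chemical name (hypothetical helper)."""
    # safe="" percent-encodes brackets, commas, apostrophes and "/" as well;
    # non-ASCII characters such as Greek letters are encoded as UTF-8 bytes.
    parts = [BASE, quote(name, safe=""), representation]
    if xml:
        parts.append("xml")
    url = "/".join(parts)
    if resolver is not None:
        # e.g. resolver="name_by_opsin" or resolver="name_by_database"
        url += "?" + urlencode({"resolver": resolver})
    return url


if __name__ == "__main__":
    print(resolver_url("spiro[1,2-benzodithiole-3,2'-[1,3]benzodithiole]"))
    print(resolver_url("α-D-glucopyranose", resolver="name_by_opsin"))
    print(resolver_url("warfarin", xml=True))
```

Encoding the whole name with `safe=""` is the cautious choice here: it avoids any ambiguity about which characters the server might treat as path separators.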
Well, and finally – to get some graphics in here – let’s twirl around “L-alanyl-L-glutaminyl-L-arginyl-O-phosphono-L-seryl-L-alanyl-L-proline” converted by OPSIN into a structure (3D coordinates are calculated by CORINA):
I hope you find it helpful,
Today I took a look at the improvements and changes in ChEMBL_09 – to get a better overview, I visualized ChEMBL’s database schema (using the MySQL version). After tweaking it for a while and pushing tables around (graphically – not in the database itself, of course), I even managed to arrange it so that none of the relationship lines cross. As it might be useful to other people, I am publishing my “art” work here (well, the real art is what the ChEMBL team has done by putting together this nice database – and the new version looks really nice). Click here for a large image or here for a pdf of the schema.
Well, there are two things I am not sure about (they are shown in the schema above exactly as they are in the database): it looks like table protein_therapeutics and table molecule_dictionary could be linked by their primary keys, and table chembl_id_lookup might be linkable to both table target_dictionary and table assays (by column chembl_id) – but maybe it will become clear after diving into the data … which I will do now.
Well, we added a new concept to the Resolver: “chemical operators”. They are placed before the identifier string, separated by a colon, and manipulate the structure specified by the identifier before the new representation is calculated. The general scheme is
A few quick examples show how to generate tautomers with the “tautomers:” operator:
Currently, “tautomers” is the only chemical operator available, but we will add more of them soon.
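As a rough illustration of the operator syntax, the following Python sketch puts an operator in front of an identifier and builds the request URL. The function name `operator_url` is made up, and the path layout `/chemical/structure/<operator>:<identifier>/<representation>` is an assumption based on the description above and the URL pattern used elsewhere in this blog.

```python
from urllib.parse import quote

BASE = "http://cactus.nci.nih.gov/chemical/structure"


def operator_url(operator, identifier, representation="smiles"):
    """Prefix an identifier with a chemical operator (hypothetical helper)."""
    # The operator comes first and is separated from the identifier by a colon
    target = f"{operator}:{identifier}"
    # Keep the colon readable; everything else unusual is percent-encoded
    return f"{BASE}/{quote(target, safe=':')}/{representation}"


if __name__ == "__main__":
    print(operator_url("tautomers", "warfarin"))
```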
We have used our Chemical Structure DataBase (CSDB == the database working behind the Chemical Identifier Resolver), an aggregated collection of over 150 small-molecule databases totaling 103.5 million structure records, to conduct a comprehensive analysis of tautomerism in small-molecule databases. On the basis of our rules for the enumeration of all formal tautomers of a chemical structure, we systematically generated a set of 680 million tautomer structures (including the original structure record set).
You can read more about the results of our analysis here:
The article also gives quite a bit of information on how we have built the database working behind the Resolver.
I just stumbled over this article – and although it is six years old, it deals with a problem I am quite passionate about: data curation. In defense of Microsoft (gasp), it clearly describes a misuse of Excel:
If you are about to create, curate, implement, or design a new chemical structure dataset, chemical structure database, or chemoinformatics toolkit, the following IUPAC document about stereochemistry is definitely an interesting read:
Most importantly, this document lists eight different levels at which stereochemical configuration information can be given for a single stereocenter (on pages 68/69):
In the case of relative configuration, it is critical not to conflate unknown, unspecified, and racemic in any way. For a given center, there are at least the following values:
• Known absolute
• Unknown absolute (i.e., a single stereoisomer)
• Racemic (known to be a 50:50 mixture of enantiomers, at least at some stage in the experiment)
• Enantiomerically enriched (enantioenriched, or scalemic) (known to be a mixture of enantiomers, but not in a 50:50 ratio)
• Unknown (no information known experimentally)
• Unspecified (no knowledge whatever)
• Known relative (there is a known relationship to some/all of the other centers, but the absolute configuration of the centers is unknown)
• Unknown relative [it is known that the configuration depends on another center, but it is not known what the relationship is. For example, a new reagent is known to provide stereospecificity, but the nature of that (cis or trans, for example) has not yet been determined]
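For a database or toolkit, the eight values listed above could be modelled as an enumeration, so that “unknown”, “unspecified”, and “racemic” cannot be mixed up by accident. This is only a sketch in Python; the class name `StereoCenterStatus`, the member names, and the helper `is_enantiomer_mixture` are my own choices and not taken from the IUPAC document.

```python
from enum import Enum


class StereoCenterStatus(Enum):
    """Configuration information for one stereocenter (names are illustrative)."""

    KNOWN_ABSOLUTE = "known absolute"
    UNKNOWN_ABSOLUTE = "unknown absolute"        # a single stereoisomer
    RACEMIC = "racemic"                          # 50:50 mixture of enantiomers
    ENANTIOENRICHED = "enantiomerically enriched"  # mixture, but not 50:50
    UNKNOWN = "unknown"                          # no information known experimentally
    UNSPECIFIED = "unspecified"                  # no knowledge whatever
    KNOWN_RELATIVE = "known relative"
    UNKNOWN_RELATIVE = "unknown relative"


def is_enantiomer_mixture(status):
    """True for the two values that describe a mixture of enantiomers."""
    return status in (StereoCenterStatus.RACEMIC, StereoCenterStatus.ENANTIOENRICHED)


if __name__ == "__main__":
    for status in StereoCenterStatus:
        print(status.name, "->", status.value, "| mixture:", is_enantiomer_mixture(status))
```

Storing one such value per stereocenter (instead of a single “stereo defined: yes/no” flag) keeps the distinctions the document asks for.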