
pychembl: ChEMBLdb pythonified using SQLAlchemy

I’d like to present a new (still small) side project: pychembl (@GitHub)! Pychembl connects two great things: the ChEMBL database (developed by the ChEMBL team) and Python. It is implemented using SQLAlchemy, a very powerful library for linking Python with an SQL database. SQLAlchemy offers a very Python-like syntax and, at the same time, supports many different SQL engines and dialects; hence, it lets you design and access a database from Python without having to care about the underlying SQL engine.

For the installation of pychembl, please follow the instructions given in the README file.

If you have finished the installation, you can import pychembl into python in the following way:

from pychembl.settings import *
from pychembl.db.auto_schema import *

This gives you access to the full database schema, which I generated earlier from the unmodified SQL dump of ChEMBL 09 (see here).

To give you an idea of what you can do with pychembl, here are two quick examples (I will post more in the following days … to weeks).

The first, very simple example fetches the first 1000 assay records from ChEMBLdb whose description contains the string “human”. The records are delivered in chunks of 25, and only the description field is printed:

assays = chembldb.query(Assays).filter(Assays.description.like('%human%'))
for assay in assays.limit(1000).yield_per(25):
    print("- %s" % assay.description)
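
For readers without a ChEMBL installation at hand, the same query pattern can be tried against a throwaway in-memory SQLite database. This is only a sketch: the `Assay` class, its two columns and the sample rows are made up for illustration and are not pychembl's generated schema, and SQLAlchemy 1.4 or newer is assumed.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Assay(Base):
    # Hypothetical stand-in for pychembl's Assays class.
    __tablename__ = "assays"
    assay_id = Column(Integer, primary_key=True)
    description = Column(String)


# In-memory SQLite database, so nothing needs to be installed or cleaned up.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

session = Session(engine)
session.add_all([
    Assay(description="Inhibition of human kallikrein 14"),
    Assay(description="Binding affinity at rat receptor"),
    Assay(description="Cytotoxicity against human cell line"),
])
session.commit()

# The query is built lazily; SQL is only sent once we iterate over it.
assays = session.query(Assay).filter(Assay.description.like("%human%"))

# limit() caps the result set, yield_per() fetches the rows in batches.
descriptions = [
    assay.description
    for assay in assays.order_by(Assay.assay_id).limit(1000).yield_per(25)
]
for description in descriptions:
    print("- %s" % description)
```

Only the two toy records mentioning “human” are printed. Swapping the SQLite URL for another database URL is, in principle, all that is needed to point the same code at a different engine.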

A more complex example selects a biological target from ChEMBLdb’s target dictionary table (I chose ‘kallikrein 14’ to keep the result at the end of the example from becoming overwhelming). A great advantage of SQLAlchemy’s ORM is that you can predefine and name relationships between tables and, consequently, between their rows. A relationship can be based on a foreign key relationship already present in the database (see the connecting lines between the tables in the ChEMBL schema), or it can be added at the Python/SQLAlchemy level. Every relationship is then available as an additional attribute of a row object fetched from the database. In the example below, “assays” is such an attribute: it is available on each “target” object and retrieves all assays in the database that are related to that biological target. Likewise, for each assay, all activities and their corresponding structures with their properties (e.g. canonical SMILES) can be fetched from the database:

targets = chembldb.query(TargetDictionary)\
    .filter(TargetDictionary.pref_name=='kallikrein 14')
for target in targets.all():
    for assay in target.assays:
        for activity in assay.activities:
            print("%s : activity %s %s %s : %s" % (
                target.description,
                activity.relation,
                activity.published_value,
                activity.published_units,
                activity.molecule.structure.canonical_smiles
            ))

Here is the output it generates (it is also available as part of the GitHub repository):

Kallikrein-14: activitiy = 240.0 nM : NC(=N)NCCCC(NC(=O)CN1CCN(CC1=O)S(=O)(=O)c2cccc3cccnc23)C(=O)c4nccs4
Kallikrein-14: activitiy = 677.0 nM : NC(=N)NCCC[C@@H](NC(=O)CN1CCN(CC1=O)S(=O)(=O)Cc2ccccc2)C(=O)c3nccs3
Kallikrein-14: activitiy = 27.0 nM : NC(=N)NCCC[C@@H](NC(=O)CNC(=O)[C@@H](CCCNC(=N)N)NS(=O)(=O)Cc1ccccc1)C(=O)c2nccs2
Kallikrein-14: activitiy = 9.4 nM : Cc1c(sc2ccc(Cl)cc12)S(=O)(=O)N3CCN(CC(=O)NC(CCCN=C(N)N)C(=O)c4nccs4)C(=O)C3
Kallikrein-14: activitiy = 390.0 nM : NC(=N)NCCCC(NC(=O)CN1CCN(CC1=O)S(=O)(=O)c2ccc3ccccc3c2)C(=O)c4nccs4
Kallikrein-14: activitiy = 406.0 nM : NC(=N)NCCC[C@@H](NC(=O)CN1CCN(CC1=O)S(=O)(=O)c2ccc(Cl)cc2)C(=O)c3nccs3
Kallikrein-14: activitiy = 34.0 nM : NC(=N)NCCCC(NC(=O)CN1CCN(CC1=O)S(=O)(=O)c2ccc3c(Cl)cccc3c2)C(=O)c4nccs4
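
To make the mechanism behind attributes such as `target.assays` more concrete, here is a minimal sketch of how named relationships can be declared with SQLAlchemy on top of foreign keys. The `Target`, `Assay` and `Activity` classes, their columns and the sample values are hypothetical and far simpler than the real ChEMBL schema (for instance, the assay2target link table is left out); SQLAlchemy 1.4 or newer is assumed.

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Target(Base):
    __tablename__ = "target"
    tid = Column(Integer, primary_key=True)
    pref_name = Column(String)
    # Named relationship; the join condition is derived from the
    # foreign key declared on Assay.tid.
    assays = relationship("Assay", back_populates="target")


class Assay(Base):
    __tablename__ = "assay"
    assay_id = Column(Integer, primary_key=True)
    tid = Column(Integer, ForeignKey("target.tid"))
    target = relationship("Target", back_populates="assays")
    activities = relationship("Activity", back_populates="assay")


class Activity(Base):
    __tablename__ = "activity"
    activity_id = Column(Integer, primary_key=True)
    assay_id = Column(Integer, ForeignKey("assay.assay_id"))
    value = Column(Float)
    units = Column(String)
    assay = relationship("Assay", back_populates="activities")


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
session = Session(engine)

# Toy data: one target with one assay and two made-up activities.
toy_assay = Assay(activities=[Activity(value=1.5, units="nM"),
                              Activity(value=20.0, units="nM")])
session.add(Target(pref_name="example target", assays=[toy_assay]))
session.commit()

# Walk from target to assays to activities via the relationship attributes;
# each attribute access loads the related rows on demand.
rows = []
for target in session.query(Target).filter(Target.pref_name == "example target"):
    for assay in target.assays:
        for activity in assay.activities:
            rows.append((target.pref_name, activity.value, activity.units))

for row in rows:
    print("%s : %s %s" % row)
```

The nested loops never mention a join; SQLAlchemy issues the necessary SELECT statements whenever a relationship attribute is accessed.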

The (more or less) corresponding SQL command for the pychembl kallikrein 14 walk-through is shown here (although it does not look too horrific, in my experience such statements easily become so, and they are less intuitive than the Python script):

select t.pref_name, ac.relation, ac.published_value, ac.published_units,
    s.canonical_smiles
from target_dictionary t
join assay2target a2t on t.tid = a2t.tid
join assays a on a2t.assay_id = a.assay_id
join activities ac on ac.assay_id = a.assay_id
join molecule_dictionary m on ac.molregno = m.molregno
join compound_structures s on m.molregno = s.molregno
where t.pref_name = 'kallikrein 14';
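
The join chain can also be tried without a ChEMBL installation. The sketch below uses Python's built-in `sqlite3` module with a drastically reduced toy schema: the table and column names follow the statement above, but everything else (the column types, the made-up rows, the placeholder SMILES strings) is an assumption for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy versions of the six tables touched by the join; only the columns
# used in the statement are created.
conn.executescript("""
create table target_dictionary (tid integer primary key, pref_name text);
create table assay2target (assay_id integer, tid integer);
create table assays (assay_id integer primary key);
create table activities (
    activity_id integer primary key, assay_id integer, molregno integer,
    relation text, published_value real, published_units text);
create table molecule_dictionary (molregno integer primary key);
create table compound_structures (
    molregno integer primary key, canonical_smiles text);

insert into target_dictionary values (1, 'kallikrein 14');
insert into target_dictionary values (2, 'other target');
insert into assays values (10);
insert into assays values (20);
insert into assay2target values (10, 1);
insert into assay2target values (20, 2);
insert into molecule_dictionary values (100);
insert into molecule_dictionary values (200);
insert into compound_structures values (100, 'CCO');
insert into compound_structures values (200, 'CCN');
insert into activities values (1, 10, 100, '=', 5.0, 'nM');
insert into activities values (2, 20, 200, '=', 7.0, 'nM');
""")

# Same join chain as above; the target name is passed as a bound parameter.
query = """
select t.pref_name, ac.relation, ac.published_value, ac.published_units,
    s.canonical_smiles
from target_dictionary t
join assay2target a2t on t.tid = a2t.tid
join assays a on a2t.assay_id = a.assay_id
join activities ac on ac.assay_id = a.assay_id
join molecule_dictionary m on ac.molregno = m.molregno
join compound_structures s on m.molregno = s.molregno
where t.pref_name = ?
"""
rows = conn.execute(query, ("kallikrein 14",)).fetchall()
for row in rows:
    print(row)
```

Only the activity attached to the toy ‘kallikrein 14’ target comes back; the row belonging to the other target is filtered out by the where clause.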

Here is a visual representation of how the python/pychembl script walked through ChEMBL (the schema is a cutout of the relevant parts of the full ChEMBL schema):

So, this was a first announcement of this project; further posts will follow. Please regard it as an early beta release, and please give feedback if you find bugs or have suggestions.

Thanks,
Markus