Synthetically Accessible Virtual Inventory (SAVI) Database

First Complete Beta Release

Complete First Beta File Series - November 2016

283,194,312 SAVI proposed reactions generated in the first complete enumeration of the beta phase of the SAVI project

The SAVI project is an international collaboration aimed at computationally generating a very large database of reliably and inexpensively synthesizable screening sample structures that have desirable properties for the drug development process.

It utilizes:
(a) a set of transforms with rich chemical context annotation including functional group reactivity data (LHASA, LLC, U.S.; and Lhasa Limited, UK)
(b) a set of highly annotated building blocks (Sigma-Aldrich, Global Strategic Services)
(c) the chemoinformatics toolkit CACTVS with custom development (Xemistry GmbH, Germany)

The transforms are a set of more than 2,300 rules described in the CHMTRN/PATRAN language for encoding chemical transformations with chemical context and quality criteria added, based ultimately on the pioneering work of E. J. Corey.

These rules, in contrast to simple SMIRKS transforms, allow/provide:
- Computation of whether a reaction, depending on the overall structural features of the target, will work at all.
- Scoring: If the reaction works, how robust it is, taking into account overall structural features.
- Whether protection of interfering groups is required. This information can then be integrated into the final starting material queries to prioritize pre-protected starting materials.
- Proposal of suitable context-dependent reaction conditions.
- Textual warnings in specific circumstances, such as potential of multiple products, borderline conditions, etc.

Ancillary information to the rules is a set of functional group reactivity data, i.e. a table describing whether any of the standard functional groups in the rule set is unstable under any of the standard conditions.

The building blocks are a set of several hundred thousand compounds available in gram quantities, and with high reliability, from, or through, Sigma-Aldrich (now MilliporeSigma). This set has been annotated with pricing information and other business intelligence type data useful for this project.

The chemoinformatics toolkit CACTVS has been expanded in various ways, e.g. with the capability to read the CHMTRN/PATRAN transforms. An important feature that needed to be implemented was the handling of the reversal of the original LHASA transform direction, without re-writing rules, for the strictly forward-synthetic SAVI project. Another important capability was the initial and final starting material (SM) query handling, i.e. the 4 steps: initial SM query extraction from the 2D patterns in the rules; forward reaction from the 2D patterns; scoring (which is the only original LHASA functionality); final SM query expansion (R-groups, protecting groups, etc.).

For the goal of filtering out structures with less-than-desirable attributes in the drug development context, several additional computed properties regarded as important in current drug design have been implemented, such as the demerit scores based on 275 rules for identifying potentially reactive or promiscuous compounds, published by Bruns and Watson (J. Med. Chem. 2012, 55, 9763-9772); dx.doi.org/10.1021/jm301008n.

For generation of this first beta set of SAVI products, 14 transforms were used, applied to approx. 377,000 building blocks in single-step reactions. The resulting products have been annotated but not yet filtered with any of the computed or associated molecular properties. A set of very schematic graphical representations of the transforms implemented so far (two of them were not used for product generation) can be downloaded here.

We are ultimately aiming at creating a database of one billion high-quality screening samples that should be easily and cheaply synthesizable. These novel molecules will all be annotated with a proposed simple and high-yield synthetic route, and will have been filtered by all the molecular properties generally recognized as important in cutting-edge drug design that we will have implemented by then. A web GUI is planned that will allow users free access to this database via searches by various criteria including substructure searches. It will also present links to pages where users can place requests for having the molecule(s) synthesized by commercial entities.

Individuals from the following organizations have so far been contributing to this project:

Lhasa Limited, Leeds, UK
LHASA, LLC, Newton, MA, U.S.
Sigma-Aldrich
Xemistry GmbH, Königstein, Germany
Novartis
NCI CADD Group

Downloadable Files

283,194,312 SAVI-generated products and reactions in two formats:
1. TAB-delimited SMILES tables in text format.
2. SD files (gzipped and combined in .tar files).

The files contain the following information about each SAVI reaction (in the order that they appear in the TAB files or the properties block of the SD files):

Identifiers
NAME - Unique SAVI identifier in the form <hashcode of product>_<hashcode of reactants>_<transform id>
SMILES - SMILES of the product
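
As an illustration of the identifier layout, the three parts of a NAME value could be separated as in the following minimal Python sketch. It assumes that neither hashcode contains an underscore, so that the first two underscores delimit the fields. The class and function names are hypothetical and are not part of the SAVI files or of CACTVS.

```python
from typing import NamedTuple


class SaviName(NamedTuple):
    """Components of a SAVI NAME identifier (hypothetical helper type)."""
    product_hash: str
    reactants_hash: str
    transform_id: str


def parse_savi_name(name: str) -> SaviName:
    """Split '<product hash>_<reactants hash>_<transform id>' into its parts.

    Assumption: the two hashcodes contain no underscore, so splitting on
    the first two underscores is enough. Anything after the second
    underscore is kept as the transform id.
    """
    parts = name.strip().split("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a SAVI-style NAME: {name!r}")
    return SaviName(*parts)
```

Grouping records by `transform_id` or by `reactants_hash` is then a matter of using the corresponding attribute as a dictionary key.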

Properties referring to starting materials
SAVI_BUILDING_BLOCK_A_SIGMA_STRID - Sigma-Aldrich catalog ID of building block A
SAVI_BUILDING_BLOCK_A_SMILES - SMILES of building block A
SAVI_BUILDING_BLOCK_A_INCHI - InChI of building block A
SAVI_BUILDING_BLOCK_A_INCHIKEY - InChIKey of building block A
SAVI_BUILDING_BLOCK_A_ORDER_LINK - URL of the Sigma-Aldrich catalog web page for building block A
SAVI_BUILDING_BLOCK_B_SIGMA_STRID - Sigma-Aldrich catalog ID of building block B
SAVI_BUILDING_BLOCK_B_SMILES - SMILES of building block B
SAVI_BUILDING_BLOCK_B_INCHI - InChI of building block B
SAVI_BUILDING_BLOCK_B_INCHIKEY - InChIKey of building block B
SAVI_BUILDING_BLOCK_B_ORDER_LINK - URL of the Sigma-Aldrich catalog web page for building block B
SAVI_BUILDING_BLOCK_A_PROTECTION_NEEDED - Indicates whether protection of reagent A is required in this reaction
SAVI_BUILDING_BLOCK_A_PROTECTED - Indicates whether building block A used in this reaction is already a protected version of the required reagent
E_SAVI_BUILDING_BLOCK_B_PROTECTION_NEEDED - Indicates whether protection of reagent B is required in this reaction
SAVI_BUILDING_BLOCK_B_PROTECTED - Indicates whether building block B used in this reaction is already a protected version of the required reagent
E_SAVI_PROPOSED_REACTION - Name of the reaction
SAVI_PROPOSED_REACTION_ID - ID of the reaction (LHASA ID of the transform that describes this reaction)
SAVI_REACTION_HASHCODE - Hashcode of the reaction
SAVI_REACTION_CONDITIONS - Reaction conditions according to LHASA transform rules
SAVI_REACTION_WARNINGS - Reaction warnings according to LHASA transform rules
SAVI_BUILDING_BLOCK_A_COST_GRAM - Cost per gram of building block A
SAVI_BUILDING_BLOCK_B_COST_GRAM - Cost per gram of building block B
SAVI_ESTIMATED_BB_COST_GRAM - Total cost per gram of building block A and B
SAVI_BUILDING_BLOCK_A_COST_MOL - Cost per mole of building block A
SAVI_BUILDING_BLOCK_B_COST_MOL - Cost per mole of building block B
SAVI_ESTIMATED_BB_COST_MOL - Total cost per mole of building block A and B
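
The relation between the per-gram and per-mole cost fields can be sketched as follows. This is a minimal Python illustration, not the SAVI implementation. It assumes that a per-mole cost is the per-gram cost multiplied by the molecular weight of the building block in g/mol, and that the "total" fields are the plain sum of the values for A and B. The function names are hypothetical.

```python
def cost_per_mol(cost_per_gram: float, mol_weight: float) -> float:
    """Convert a price per gram into a price per mole.

    mol_weight is the molecular weight of the building block in g/mol.
    """
    if cost_per_gram < 0 or mol_weight <= 0:
        raise ValueError("cost must be >= 0 and molecular weight > 0")
    return cost_per_gram * mol_weight


def estimated_bb_cost(cost_a: float, cost_b: float) -> float:
    """Combined cost of building blocks A and B.

    Assumption: 'total cost' means the simple sum of the two values,
    whether both are given per gram or both per mole.
    """
    return cost_a + cost_b
```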

Properties referring to the reaction
SAVI_LHASA_SCORE - "Quality" score of this reaction according to LHASA transform scoring scheme
SAVI_PREDICTED_YIELD - Qualitative description of estimated yield of this reaction according to LHASA transform (if available)

Properties referring to the product
BRUNS_WATSON_DEMERIT_SCORE - Bruns and Watson Demerit score (see Bruns, R.F., Watson, I.A. Rules for identifying potentially reactive or promiscuous compounds. J. Med. Chem. 2012, 55, 9763-9772)
BRUNS_WATSON_DEMERIT_COMPONENTS - Breakdown of Bruns and Watson Demerits
SAVI_PAINS_FILTER - PAINS Filter Match
SAVI_PAINS_FILTER_MATCH_NAME - Name of the PAINS filter match
RULE_OF_5_VIOLATIONS - Number of Lipinski "rule of 5" violations
RULE_OF_3_VIOLATIONS - Number of "rule of 3" violations
NHDONORS - Number of H-bond donors
NHACCEPTORS - Number of H-bond acceptors
WEIGHT - Molecular weight
HEAVY_ATOM_COUNT - Number of heavy atoms
NROTBONDS - Number of rotatable bonds
XLOGP2 - Xlogp2
XLOGP - Xlogp
FSP3 - Fraction of sp3-hybridized carbons
STEREO_COUNT - Number of stereo centers
TPSA - Total polar surface area
BENZENOID_INDEX - Benzenoid index
FORMULA - Molecular formula
GENOTOXIC_ALERTS - Genotoxic alerts (see: CE&N Sep 27, 2010, p16ff)
COMPLEXITY - Complexity (see: W. D. Ihlenfeldt, Ph.D. Thesis, TU Munich, Germany, 1991)
INCHI - InChI
INCHIKEY - InChIKey
SAVI_PUBCHEM_STEREO_CID_MATCH - PubChem CID of the compound matching the product (if found using stereo-sensitive lookup)
SAVI_PUBCHEM_TAUTO_CID_MATCH - PubChem CID of the compound matching the product (if found using tautomer-sensitive lookup; not currently implemented)
SAVI_AMS_SID_MATCH - Sigma-Aldrich "Aldrich Market Select" SID of the compound matching the product (if found using stereo-sensitive lookup)
CHARGED_GROUP_COUNTS - Number of charged groups
HYDROGEN_BOND_CENTER_COUNT - Number of hydrogen bond centers
SAVI_SSSR_COUNT - Number of rings in the smallest set of smallest rings (SSSR)
SAVI_SSSR_COUNT_CHANGE - SSSR count in product minus combined SSSR count in building blocks
SAVI_ALIRING_COUNT - Number of aliphatic rings in product
SAVI_ALIRING_COUNT_CHANGE - Number of aliphatic rings in product minus combined number of aliphatic rings in building blocks
SAVI_ARORING_COUNT - Number of aromatic rings in product
SAVI_ARORING_COUNT_CHANGE - Number of aromatic rings in product minus combined number of aromatic rings in building blocks
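
Because the products have been annotated but not yet filtered, users may want to apply their own cutoffs to the TAB-delimited tables. The following Python sketch shows one way this might be done with the standard library. It assumes that each table starts with a header row carrying the field names listed above, which should be checked against the actual files. The function names and the default thresholds are hypothetical.

```python
import csv
from typing import Dict, Iterable, Iterator, Optional


def _to_float(value: Optional[str]) -> Optional[float]:
    """Return value as a float, or None if it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def select_products(
    lines: Iterable[str],
    max_ro5_violations: int = 0,
    max_cost_per_gram: float = 100.0,
) -> Iterator[Dict[str, str]]:
    """Yield rows that pass a rule-of-5 cutoff and a building block cost cutoff.

    Assumption: the first line is a TAB-separated header with the
    field names. Rows lacking either value are skipped.
    """
    # QUOTE_NONE: treat quote characters as ordinary data.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        ro5 = _to_float(row.get("RULE_OF_5_VIOLATIONS"))
        cost = _to_float(row.get("SAVI_ESTIMATED_BB_COST_GRAM"))
        if ro5 is None or cost is None:
            continue
        if ro5 <= max_ro5_violations and cost <= max_cost_per_gram:
            yield row
```

The function accepts any iterable of text lines, so an open file handle can be passed directly and the table is processed one row at a time rather than loaded into memory.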

Notes:
1. All hashcodes used in the file are tautomer-invariant CACTVS cheminformatics toolkit hashcodes.
2. Prices reflect Sigma-Aldrich catalog prices at the time of compilation of the building block set and may vary from current prices. Please use the link to the Sigma-Aldrich catalog page for each building block to get the most up-to-date price.
3. Please note that the InChI and InChIKey values are not in general Standard InChI[Key] identifiers but contain the FixedH layer (representing a specific tautomer). This will be corrected in subsequent runs.
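
Users who need to know which kind of identifier they are holding can inspect the InChI string itself. The sketch below relies on two general InChI conventions: Standard InChI strings start with "InChI=1S/", and the FixedH layer is introduced by "/f". The function name is hypothetical, and the acetic acid string in the usage note is an illustrative non-standard InChI, not a value taken from the SAVI files.

```python
from typing import Dict


def describe_inchi(inchi: str) -> Dict[str, bool]:
    """Report whether an InChI is Standard and whether it has a FixedH layer."""
    if not inchi.startswith("InChI="):
        raise ValueError(f"not an InChI string: {inchi!r}")
    layers = inchi.split("/")
    # layers[0] is the version prefix, layers[1] the chemical formula.
    is_standard = layers[0] == "InChI=1S"
    # Of the InChI layer prefixes, only the FixedH layer starts with 'f'.
    has_fixed_h = any(layer.startswith("f") for layer in layers[2:])
    return {"standard": is_standard, "fixed_h": has_fixed_h}
```

For example, `describe_inchi("InChI=1/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/f/h3H")` reports a non-standard identifier with a FixedH layer.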

Older Releases

Alpha 1 File Series - July 2015

In this very early alpha stage of the project, and for the file downloadable below, only 11 transforms were used, applied to approx. 230,000 building blocks in one-step reactions only. The ~610,000 resulting products have been annotated but not yet filtered with any of the computed or associated molecular properties. To limit the file size, only on the order of one percent of the theoretically possible products (of one-step reactions) have been sampled. A set of very schematic graphical representations of the transforms implemented so far (two of them were not used for product generation) can be downloaded here.

610,492 SAVI-generated products in SD format. This is a 374 MB .gz file that uncompresses to 4.4 GB.

Download
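
Since the file uncompresses to several gigabytes, it may be convenient to read it record by record without unpacking it first. The following Python sketch streams the data items (the properties block) of each SD record using only the standard library. It assumes the usual SD file conventions: data headers are lines starting with ">" that carry the field name in angle brackets, values end at a blank line, and records end with "$$$$". The function names are hypothetical, and a record without a closing "$$$$" line is dropped.

```python
import gzip
import re
from typing import Dict, Iterable, Iterator

# Data header line of an SD record, e.g. "> <NAME>" or ">  <WEIGHT>".
_DATA_HEADER = re.compile(r"^>.*<([^<>]+)>")


def iter_sd_properties(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {field name: value} dict per SD record; molblocks are skipped."""
    props: Dict[str, str] = {}
    field = None
    values = []

    def flush() -> None:
        """Store the value lines collected for the current field, if any."""
        nonlocal field, values
        if field is not None:
            props[field] = "\n".join(values)
        field, values = None, []

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$$"):  # end of record
            flush()
            yield props
            props = {}
            continue
        header = _DATA_HEADER.match(line)
        if header:
            flush()
            field = header.group(1)
        elif field is not None:
            if line.strip():
                values.append(line)
            else:  # a blank line terminates the data item
                flush()


def iter_gzipped_sd(path: str) -> Iterator[Dict[str, str]]:
    """Stream records from a gzipped SD file without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from iter_sd_properties(handle)
```

A loop such as `for props in iter_gzipped_sd(path): ...` then visits each product in turn, with `path` pointing to the downloaded .gz file.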

The downloadable SD file is a very early alpha version of the set of generated products. The structures in this file may or may not be part of the final SAVI database. They are meant to be looked at, and commented on, by early users. Any feedback about individual structures or the entire set, and the data associated with them, is welcome.

If you have any questions regarding potential availability of the generated molecules including access to the synthetic starting materials, please contact Bret Daniel.

Disclaimer

All structures ("SAVI Products") and associated information downloadable from here are placed in the public domain. They may be freely used for any purpose without restrictions by any individual or organization. At the same time, we, i.e. the U.S. Government, NIH, NCI, and their employees and contractors do not make any warranty, express or implied, including the warranties of merchantability and fitness for a particular purpose with respect to any of the SAVI Products and associated information, nor assume any legal liability for the accuracy, completeness, or usefulness of any information disclosed herein and do not represent that use of such information would not infringe on privately owned rights. See also our general Disclaimer.

M. C. Nicklaus

Last Update: 2017-09-14