Downloadable Structure Files of NCI Open Database Compounds


New: Release 4 File Series - May 2012

DTP Releases (December 2010), 2D/3D, with GUSAR Human Liver Microsomal Stability Prediction Data Added

This is the first in a series of files that will be released over the next few months. These files will contain a successively curated structure set of all records of the Open NCI Database. The basis of this first file is the version of the Open NCI Database as provided by DTP in December 2010 (2D Coordinates SD File with 266,151 records). The file was processed in the following way:

This succeeded for 265,242 of the 266,151 original structure records.

GUSAR QSAR Model Application for the prediction of human liver microsomal stability. Thirty-five QSAR models created by GUSAR were used to generate a consensus prediction of the microsomal stability of the chemical structures contained in this file. Each compound in the file is classified as stable or unstable (data field "GUSAR Human Liver Microsomal Stability Prediction"). The prediction output also includes an assessment of the applicability domain as provided by GUSAR (data field "GUSAR Human Liver Microsomal Stability Prediction AD"). This succeeded for 196,460 of the 265,242 structure records.
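As a rough illustration of how the prediction field could be read back out of the SD file, the following Python sketch tallies the values of the data field "GUSAR Human Liver Microsomal Stability Prediction" across all records. It relies only on the general SD file layout (data headers of the form "> <Name>" after "M  END", records terminated by "$$$$"). The function names, the file name in the usage comment, and the label "(no prediction)" are our own assumptions for illustration, and the exact spelling of the stored values has not been checked against the file.

```python
import gzip
from collections import Counter

FIELD = "GUSAR Human Liver Microsomal Stability Prediction"

def iter_sdf_data(lines):
    """Yield one {field name: [value lines]} dict per SDF record."""
    fields, name, in_data = {}, None, False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$$"):            # record terminator
            yield fields
            fields, name, in_data = {}, None, False
        elif line.startswith("M  END"):        # end of the molfile block
            in_data = True
        elif in_data and line.startswith(">"):  # data header, e.g. "> <NSC>"
            start, end = line.find("<"), line.rfind(">")
            name = line[start + 1:end] if 0 <= start < end else None
            if name is not None:
                fields[name] = []
        elif in_data and name is not None:
            if line:
                fields[name].append(line)
            else:                               # blank line closes the data item
                name = None

def tally_field(lines, field=FIELD):
    """Count how often each value of `field` occurs; missing fields get a label."""
    counts = Counter()
    for record in iter_sdf_data(lines):
        value = " ".join(record.get(field, [])).strip() or "(no prediction)"
        counts[value] += 1
    return counts

def tally_file(path, field=FIELD):
    """Same tally, read directly from a gzipped SD file (path is hypothetical)."""
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        return tally_field(handle, field)

# Usage (hypothetical file name):
# print(tally_file("nci_release4.sdf.gz"))
```

Records for which the prediction did not succeed would be expected to show up under "(no prediction)", assuming the field is simply absent in those records.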

This version of the NCI Open Database, which adds ~15,000 new structures, is not included in our Enhanced NCI Database Browser web service. We are also aware that, beyond that, the PubChem version of the NCI database contains ~15,000 additional structure records. We are currently analyzing the overlap between the two sources.

265,242 structures in SDF format. This is a 198 MB gzipped file that uncompresses to about 1.2 GB.
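Because the uncompressed file is large, it may be convenient to work on the gzipped download directly. The sketch below counts records by streaming through the compressed file and counting "$$$$" terminator lines, so the uncompressed data never has to be written to disk. The function name and the file name in the usage comment are assumptions for illustration, and latin-1 decoding is chosen only because it cannot fail on arbitrary bytes.

```python
import gzip

def count_sdf_records(path):
    """Count '$$$$' record terminators in a gzipped SD file, read as a stream."""
    count = 0
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        for line in handle:
            if line.startswith("$$$$"):
                count += 1
    return count

# Usage (hypothetical file name):
# print(count_sdf_records("nci_release4.sdf.gz"))
```

For a complete download of this file, the count should match the 265,242 structures stated above.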

Download

Release 3 Files - September 2003

September 2003 SD File of Combined DTP Releases, 2D/3D, with Canonical Properties Added

The most complete collection of Open NCI Database compounds as of September 2003 that we are aware of. These are 260,071 structures, combined from DTP releases from Oct. 1999, Aug. 2000, Feb. 2003, and Sep. 2003. All the identifier-type information that we were able to associate with the structures is included in this file: NSC numbers; DTP names for ~53,000 records (including some WLN strings); Unique SMILES, calculated by CACTVS according to Daylight's original (1989) canonicalization rules; the new IUPAC/NIST InChI chemical identifier (calculated with [beta] version 0.932 of NIST's program); IUPAC names, calculated with ACD/Labs' program ACD/Name Batch; eight different CACTVS hash codes, including a tautomer-invariant but stereochemistry-, multifragment-, charge- and isotope-sensitive hash code that is essentially a unique, calculable identifier for any (small-molecule) chemical. Additional properties, some of them helpful to categorize structures when dealing with several databases simultaneously, are explained in the Technical Notes.

The 2003 DTP releases now have many structures with at least some, if not full, stereochemistry specification. This allowed 3D coordinates of reliable stereoisomers to be calculated in many cases. Where such 3D structures would have potentially shown the wrong chemical, or would otherwise have been doubtful, 2D coordinates were kept. See the Technical Notes for more details. Also be aware of the fact that for a very large number of entries (on the order of 100,000), the structure shown in the 2003 DTP releases is slightly different from that shown in previous releases. In the vast majority of those cases, the structure is now represented as a different tautomer.

This version of the NCI Open Database, which adds ~10,000 new structures, is not included in our Enhanced NCI Database Browser web service.

260,071 structures in SDF format. This is a 214 MB gzipped file that uncompresses to about 1.6 GB.

Download

Release 2 Files - August 2000

August 2000 2D File

The "raw" structure data that were used to build Release 2 of the Enhanced NCI Database Browser. These are 250,251 2D structures calculated with CACTVS. Attention: Stereochemistry assigned by CACTVS according to default rules due to lack of stereochemical information in the original NCI data. The SMILES string and the CAS RN (where available) are also included for each structure.

250,251 structures in SDF format. This is a 90 MB gzipped file that uncompresses to about 982 MB.

Download

New in August 2006: A 3D version of the 0D file with some properties added. Their values are the same as those shown in the Enhanced NCI Database Browser. This file contains 250,250 structures as of August 2000 (one missing because of technical reasons). 3D coordinates have been calculated by Corina 3.0 and are available for 248,574 structures. The following properties are included:

250,250 structures in SDF format. This is a 145 MB gzipped file that uncompresses to about 1005 MB.

Release 1 Files - October 1999

"0D"

The "raw" structure data that were used to build the previous version of the Enhanced NCI Database Browser, plus about 2,900 new structures. These are 249,081 "0D" structures (i.e. all coordinates set to 0.0) as of October 1999 in SDF format, in one file compressed with the widely available program gzip.
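To make the 0D/2D/3D distinction concrete, here is a small heuristic sketch that inspects the atom block of a single V2000 molfile record and reports whether its coordinates look like 0D (all zero), 2D (all z values zero), or 3D. It assumes the standard V2000 layout (counts line on line 4, x, y and z in three 10-character columns at the start of each atom line). The function name is our own, and a genuinely planar 3D structure lying in the z = 0 plane would be reported as 2D.

```python
def coordinate_dimension(molblock):
    """Return 0, 2 or 3 for a V2000 molfile block, judged from its coordinates."""
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])                 # atom count: columns 1-3 of the counts line
    coords = [
        tuple(float(atom_line[i:i + 10]) for i in (0, 10, 20))
        for atom_line in lines[4:4 + n_atoms]
    ]
    if all(c == (0.0, 0.0, 0.0) for c in coords):
        return 0                                  # "0D": every coordinate is 0.0
    if all(z == 0.0 for _, _, z in coords):
        return 2                                  # flat drawing: only x and y are used
    return 3
```

Applied to records from the "0D" file, every structure should come out as 0 under these assumptions.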

249,081 structures in SDF format. This is a 16.5 MB gzipped file that uncompresses to about 380 MB.

Download

SMILES

A SMILES version of the structures (i.e. the above "0D" dataset) that were used to build this service, plus about 2,900 new structures. These are 249,081 structures as of October 1999 in SMILES format, in one file compressed with the widely available program gzip. SMILES strings were generated with the help of CACTVS. (This is a newly generated dataset and therefore not guaranteed to contain SMILES strings identical, for each compound, with those in previous SMILES string files, such as downloadable data from DTP.)

249,081 structures in SMILES format. This is a 3.2 MB gzipped file that uncompresses to about 18.5 MB.

Download

2D

2D version of NCI Open Database compounds as of October 1999. 2D coordinates (essentially structure drawings) calculated with CACTVS. Attention: Stereochemistry assigned by CACTVS according to default rules due to lack of stereochemical information in the original NCI data. (See also the 3D section.)

249,081 structures in SDF format. This is a 40 MB gzipped file that uncompresses to about 527 MB.

Download

2D + Biological Data

2D versions of NCI Open Database compounds as of October 1999, with biological test data added. These data are publicly available from the DTP Human Tumor Cell Line Screen and/or the DTP AIDS Antiviral Screen. 2D coordinates (essentially structure drawings) calculated with CACTVS. Attention: Stereochemistry assigned by CACTVS according to default rules due to lack of stereochemical information in the original NCI data. (See also the 3D section.)

249,081 structures in SDF format. Cancer data are as of August 1999, AIDS data and structures are as of October 1999. This is a 56 MB gzipped file that uncompresses to about 723 MB.

Download

32,577 structures with cancer test data in SDF format. Cancer data are as of August 1999. This is a 20 MB gzipped file that uncompresses to about 273 MB.

Download

42,689 structures in SDF format. AIDS test data are as of October 1999. This is a 9.1 MB gzipped file that uncompresses to about 114 MB.

Download

23,031 structures in SDF format for which both cancer and AIDS data are available. Cancer data are as of August 1999, AIDS data and structures are as of October 1999. This is a 13.5 MB gzipped file that uncompresses to about 195 MB.

Download

3D

A 3D version of the 0D file, containing 249,071 structures as of October 1999. The program CORINA v. 1.7 was used to generate the 3D coordinates. Please note that, just as with the 3D results provided by the Enhanced NCI Database Browser, stereochemistry of chiral compounds is not guaranteed to be correct due to the lack of stereochemical information in the original data. This is not a shortcoming of CORINA. Please also note that, as of now, the 3D structures in this bulk file were not generated with the same version of CORINA as is used in the Browser, the latter being somewhat newer. This file is the result of a one-time conversion; no efforts have been undertaken to compare the conformations in it with those you obtain from the Browser (although we don't necessarily expect huge differences).

249,071 structures in SDF format. This is a 127 MB gzipped file that uncompresses to about 574 MB.

Download

For more information on all the files in this release, please see the Technical Notes.

M. C. Nicklaus

Last Update: 2012-05-22
