Downloadable Structure Files of NCI Open Database Compounds

 
 

Release 4 Files - Coming in 2007

(A new major update is under preparation. It will contain some additional structures, but mostly more, and updated, calculated properties. A new release of our Enhanced NCI Database Browser web interface, incorporating all these new data, is also under preparation.)

 


Release 3 Files - September 2003

September 2003 SD File of Combined DTP Releases, 2D/3D, with Canonical Properties Added

The most complete collection of Open NCI Database compounds as of September 2003 that we are aware of. These are 260,071 structures, combined from DTP releases from Oct. 1999, Aug. 2000, Feb. 2003, and Sep. 2003. All the identifier-type information that we were able to associate with the structures is included in this file: NSC numbers; DTP names for ~53,000 records (including some WLN strings); Unique SMILES, calculated by CACTVS according to Daylight's original (1989) canonicalization rules; the new IUPAC/NIST ICHI chemical identifier, calculated with [beta] version 0.932 of NIST's program (see Note 3); IUPAC names, calculated with ACD/Labs' program ACD/Name Batch; eight different CACTVS hash codes, including a tautomer-invariant but stereochemistry-, multifragment-, charge- and isotope-sensitive hash code that is essentially a unique, calculable identifier for any (small-molecule) chemical. Additional properties, some of them helpful for categorizing structures when dealing with several databases simultaneously, are explained in the Technical Notes.

The 2003 DTP releases now have many structures with at least some, if not full, stereochemistry specification. This allowed 3D coordinates of reliable stereoisomers to be calculated in many cases. Where such 3D structures would have potentially shown the wrong chemical, or would otherwise have been doubtful, 2D coordinates were kept. See the Technical Notes for more details. Also be aware that for a very large number of entries (on the order of 100,000), the structure shown in the 2003 DTP releases is slightly different from that shown in previous releases. In the vast majority of those cases, the structure is now represented as a different tautomer.

Notes as of April 2007:

  1. Many of the calculated values in this file will be superseded in the next release, planned for mid-2007. In particular, some of the CACTVS hashcode-based identifiers and InChI strings may change. We recommend waiting for that release unless you have an urgent need for this file.
  2. This version of the NCI Open Database, which adds ~10,000 new structures, is not (yet) included in our Enhanced NCI Database Browser web service.  An update of that service is underway.
  3. This identifier has been renamed several times. It has gone from ICHI to INChI and is now called InChI.

260,071 structures in SDF format, 2D or 3D (see Technical Notes for more explanations). Beta version/update no. 2, 25-Nov-03. WARNING: This is a 214 MB gzip'ed file that uncompresses to about 1.6 GB! Use the "Save Link As..." (Netscape/Firefox) or "Save Target As..." (IE) option of your web browser to download the file.
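Because the uncompressed file is so large, it can be convenient to read it record by record directly from the compressed download instead of unpacking it first. The following is only a minimal sketch of that idea using the Python standard library; the function names and the file path passed to count_records are hypothetical, and the Latin-1 decoding is an assumption chosen so that no byte sequence can cause a decoding error.

```python
import gzip
from typing import Iterator, TextIO


def iter_sd_records(handle: TextIO) -> Iterator[str]:
    """Yield one SD record (molfile plus data fields) at a time."""
    lines = []
    for line in handle:
        if line.rstrip("\r\n") == "$$$$":  # SD record terminator line
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    if any(part.strip() for part in lines):  # tolerate a missing final "$$$$"
        yield "".join(lines)


def count_records(path: str) -> int:
    """Count the records in a gzip-compressed SD file without unpacking it to disk."""
    # Assumption: Latin-1 maps every byte to a character, so decoding cannot fail.
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        return sum(1 for _ in iter_sd_records(handle))
```

Only one record is held in memory at a time, so the same loop can be used to filter or extract records from the full file on a machine with little free disk space.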

Release 2 Files - August 2000

August 2000 2D File

The "raw" structure data that were used to build Release 2 of the Enhanced NCI Database Browser. These are 250,251 2D structures calculated with CACTVS. Attention: Stereochemistry was assigned by CACTVS according to default rules because the original NCI data lack stereochemical information. The SMILES string and the CAS RN (where available) are also included for each structure.

August 2000 SMILES Strings

A SMILES version of the 250,251 August 2000 structures. These are Unique SMILES (USMILES) strings, calculated according to Daylight's original (1989) canonicalization rules. (These rules have been changed in the meantime, but are not published.)

New Structures Only

These are 1,170 structures that were not in the previous (October 1999) release. This file may be most interesting for those who have already downloaded the previous structure file(s) and only need the difference set. It contains 3D coordinates calculated by the program CORINA. Please note the same warning regarding stereochemistry as for the large 3D file (see below).

 

New in August 2006: A 3D version of the 0D file with some properties added. Their values are the same as those shown in the Enhanced NCI Database Browser. This file contains 250,250 structures as of August 2000 (one is missing for technical reasons). 3D coordinates have been calculated by Corina 3.0 and are available for 248,574 structures. The following properties are included:


Release 1 Files - October 1999

"0D"

The "raw" structure data that were used to build the previous version of the Enhanced NCI Database Browser, plus about 2,900 new structures. These are 249,081 "0D" structures (i.e. all coordinates set to 0.0) as of October 1999 in SDF format, in one file compressed with the widely available program gzip.
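To tell a "0D" record apart from a 2D or 3D one, it is enough to look at the atom coordinates in the molfile part of the record. The sketch below assumes the fixed-column V2000 molfile layout (atom count in the first three columns of the fourth line, then x, y and z in three 10-character fields per atom line); the function name is hypothetical. It is a heuristic only: a genuinely planar 3D structure with all z = 0 would be reported as 2D.

```python
def coordinate_dimension(molblock: str) -> int:
    """Return 0, 2 or 3 depending on which atom coordinates are non-zero.

    Assumes a V2000 molfile: three header lines, a counts line whose first
    three columns hold the atom count, then one fixed-width line per atom.
    """
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])
    has_xy = False
    has_z = False
    for atom_line in lines[4:4 + n_atoms]:
        # x, y and z occupy columns 1-10, 11-20 and 21-30 of each atom line.
        x, y, z = (float(atom_line[i:i + 10]) for i in (0, 10, 20))
        if x != 0.0 or y != 0.0:
            has_xy = True
        if z != 0.0:
            has_z = True
    if has_z:
        return 3
    if has_xy:
        return 2
    return 0  # all coordinates are 0.0, i.e. a "0D" record
```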

SMILES

A SMILES version of the structures (i.e. the above "0D" dataset) that were used to build this service, plus about 2,900 new structures. These are 249,081 structures as of October 1999 in SMILES format, in one file compressed with the widely available program gzip. SMILES strings were generated with the help of CACTVS. (This is a newly generated dataset and therefore not guaranteed to contain SMILES strings identical, for each compound, with those in previous SMILES string files, such as downloadable data from DTP.)

2D

2D version of NCI Open Database compounds as of October 1999. 2D coordinates (essentially structure drawings) calculated with CACTVS. Attention: Stereochemistry was assigned by CACTVS according to default rules because the original NCI data lack stereochemical information. (See also the 3D section.)

2D + Biological Data

2D versions of NCI Open Database compounds as of October 1999, with biological test data added. These data are publicly available from the DTP Human Tumor Cell Line Screen and/or the DTP AIDS Antiviral Screen. 2D coordinates (essentially structure drawings) calculated with CACTVS. Attention: Stereochemistry was assigned by CACTVS according to default rules because the original NCI data lack stereochemical information. (See also the 3D section.)

3D

A 3D version of the 0D file, containing 249,071 structures as of October 1999. The program CORINA v. 1.7 was used to generate the 3D coordinates. Please note that, just as with the 3D results provided by the Enhanced NCI Database Browser, stereochemistry of chiral compounds is not guaranteed to be correct due to the lack of stereochemical information in the original data. This is not a shortcoming of CORINA. Please also note that, as of now, the 3D structures in this bulk file were not generated with the same version of CORINA as is used in the Browser, the latter being somewhat newer. This file is the result of a one-time conversion; no effort has been made to compare the conformations in it with those you obtain from the Browser (although we don't necessarily expect huge differences).

 

Notes:


All these files are based on the publicly and freely available data from NCI's Developmental Therapeutics Program (DTP). We collected the structures and biological data from DTP, combined them where applicable, and generated SMILES and MDL SD files from this information.

These files were compressed with the program gzip. This program is available for many platforms, and comes preloaded on most of the recent versions of many major varieties of Unix. In order to prevent possible problems with web browsers trying to uncompress "on the fly", and display on your screen (!), a file with the extension ".gz", the names of the downloadable files were changed to NCInDA99.sdz (n = 0, 2, 3 for the 0D, 2D, 3D file, respectively; "A99" stands for October 1999 [with hexadecimal notation for the month]), CAN2DA99.sdz, AID2DA99.sdz etc.
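The naming scheme described above can be decoded mechanically. The following sketch is an illustration only: the function and type names are hypothetical, the three-letter prefix is simply returned as found, and the year is left as the two digits that appear in the name rather than expanded to a full year.

```python
import re
from typing import NamedTuple


class ReleaseName(NamedTuple):
    prefix: str      # e.g. "NCI", "CAN" or "AID"
    dimension: int   # 0, 2 or 3 for the 0D, 2D and 3D files
    month: int       # 1-12, decoded from one hexadecimal digit
    year2: int       # two-digit year exactly as written in the name


_NAME = re.compile(r"^([A-Z]{3})([023])D([1-9A-C])(\d{2})\.sdz$", re.IGNORECASE)


def parse_release_name(filename: str) -> ReleaseName:
    """Split a name such as 'NCI0DA99.sdz' into its components."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"not a recognised release file name: {filename!r}")
    prefix, dim, month_hex, year = match.groups()
    # The month is a single hexadecimal digit, so "A" means 10 (October).
    return ReleaseName(prefix.upper(), int(dim), int(month_hex, 16), int(year))
```

For example, parse_release_name("NCI0DA99.sdz") gives dimension 0, month 10 and year 99, i.e. the 0D file of October 1999.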

You may have to rename them to NCInDA99.sdf.gz etc. before gunzip'ing them. If you (have to) rename them to NCInDA99.gz, gunzip will uncompress them to a file named NCInDA99, unless you use the gunzip option "-N", which will restore the name NCInDA99.sdf. (These file names were chosen to conform to the 8.3 file name convention for those users who may download, e.g., to DOS-type FAT16 file systems. This practice may be discontinued in the future.)
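If no gunzip program is at hand, or renaming is inconvenient, the files can also be unpacked with a few lines of Python, because Python's gzip module does not care about the file extension. This is a minimal sketch, not a replacement for gunzip: the function name is hypothetical, and the output name is derived by swapping the extension for ".sdf" rather than by reading the original name stored inside the archive (which is what "gunzip -N" does).

```python
import gzip
import shutil
from pathlib import Path


def unpack_sdz(src: str) -> Path:
    """Decompress e.g. 'NCI0DA99.sdz' to 'NCI0DA99.sdf' in the same directory."""
    src_path = Path(src)
    dst_path = src_path.with_suffix(".sdf")
    # Copy in chunks so that even very large files never have to fit in memory.
    with gzip.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst_path
```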

All files (after decompression) are in MDL's SDFile format with two identification fields:

In the 2D files with biological data, you'll find the following additional fields (not necessarily present in all files for all compounds):

For more explanation on these data, in particular the meaning of the column headings, please see the Web pages of the DTP Human Tumor Cell Line Screen and/or the DTP AIDS Antiviral Screen.

Please also note that no editing of the biological test data has been performed. This means that all DTP results for which the chemical structure is available have been included. This includes data from "non-production" cell lines, i.e. cell lines that were used only a short time during test phases, as well as data from those ten cell lines that were replaced by a new block of ten around 1992. It is up to the user to do their own evaluation, statistics, and, if necessary, (pre-)processing, of these data before using them for any purpose.

In the 3D file, hydrogens were added by CORINA, whereas they are not present in the 0D and 2D files. In the 2D files, the stereochemistry shown is in fact meaningless, since it was decided upon at random. This is not easily changeable.

In previous versions of this page, the 0D information was called "2D". This has been changed to avoid confusion with the new 2D information added. The file that was previously called NCI2D397.sdz is therefore mostly identical with the new file NCI0DA99.sdz with the exception of the newly added compounds.

The sizes listed for the uncompressed files are in "real" MB, i.e. 1024 x 1024 bytes.
 

Our 249,081 structure set is a combination of three sets:

1) the March 1997 set, still downloadable here as NCI3D397.sdz
2) 689 supplemental structures selected from the DTP Human Tumor Cell Line Screen 3D SD files as of August 1999
3) 2,212 supplemental structures selected from DTP AIDS Antiviral Screen 3D SD files as of October 1999.

Our 2D files with AIDS data contain two more structures than the file available at the DTP AIDS Antiviral Screen.

The DTP Human Tumor Cell Line Screen biological data file contains cancer screen data for 370 more entries for which we don't have the structure (these structures are not available on the DTP site).

For 10 out of the 249,081 structures, the 3D generation process failed.
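The structure counts quoted on this page can be cross-checked against each other with simple arithmetic. The short script below is only a bookkeeping aid; the variable names are hypothetical, and every number in it is taken from the text of this page.

```python
# Counts as stated on this page.
TOTAL_OCT_1999 = 249_081      # October 1999 structure set
FROM_CANCER_SCREEN = 689      # supplemental structures, Human Tumor Cell Line Screen
FROM_AIDS_SCREEN = 2_212      # supplemental structures, AIDS Antiviral Screen
FAILED_3D = 10                # structures for which 3D generation failed
TOTAL_AUG_2000 = 250_251      # August 2000 structure set

# Structures added in October 1999 ("about 2,900" in the text).
new_in_1999 = FROM_CANCER_SCREEN + FROM_AIDS_SCREEN

# Structures in the October 1999 3D file.
structures_3d = TOTAL_OCT_1999 - FAILED_3D

# Structures new in August 2000 (the "New Structures Only" file).
new_in_2000 = TOTAL_AUG_2000 - TOTAL_OCT_1999

print(new_in_1999, structures_3d, new_in_2000)  # prints: 2901 249071 1170
```

The three results agree with the figures given for the October 1999 additions, the 3D file and the August 2000 difference set.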

Acknowledgments

All the SD files were prepared with the help of the SDF_toolkit. Thanks to Bruno Bienfait for both the toolkit and this work.

We gratefully acknowledge Prof. Gasteiger's group at the Computer Chemistry Center (CCC), Institute of Organic Chemistry, University of Erlangen-Nuremberg, Germany, for providing us with their program CORINA, and help with the database conversion. 

 


Note to Windows users: While downloading with Netscape on Unix platforms usually works flawlessly, we've received reports (and have confirmed in very limited tests) that in Windows, using Netscape with the "Save Link As..." option may produce corrupted binary files. In these cases, you may want to try newer versions of Mozilla/Firefox, or Internet Explorer with the option "Save Target As..." for downloading. Because of the file name extension used for some of the files (.sdz), you may have to either rename the downloaded binary file or open it manually with a program such as WinZip.
 



Last change: M. C. Nicklaus, 2007-04-27