• First line in header block ("MDL name") was consistently set to "NSC x" (some prior releases had just the NSC number x, others "NSC x".)
  • Hydrogens were fully added to all structures; except: (a) to atoms that participate in a bond of type "complex" (in 5,160 entries); (b) if an atom had a non-standard valence higher than the lowest standard valence (examples: --N+ will become --NH3+; --N- would become --NH-; but (CH2R)3N- [a rather dubious case anyway], showing N with valence 4, would be left alone and not made a pentavalent nitrogen); (c) if total number of atoms would have exceeded 999, which is the maximum syntactically allowed in SD files (note however that hash codes [see below] were calculated with the hydrogens present); (d) to metal atoms [but see next bullet point].
  • A very few cases were "manually" treated with regard to hydrogen addition: NSC 158935, coded as "Ce" in the input file, but identified as "Cerium hydride" by DTP name, was made into "CeH3"; NSC 403664, coded as "S", previously made into H2S -- incorrectly so, since it is identified by one of its 58 DTP names as "Atomic sulfur" -- was forced to stay "S".
  • A number of other single-atom entries were identified as mis-parsed DTP names that accidentally had yielded an element symbol (of mostly heavy metals) already in the original input file. For example, a number of names that begin with "Monoclonal..." (e.g. NSC 600660 "Monoclonal Antibody Fab Fragments to Lipoprotein Melanoma Antigen (2B2) (IgG3-Fab)") had been transmutated into Mo (molybdenum); or NSC 612825 "Amatrisan, a commercial preparation of urea, dissolved in distilled water" had become Am (americium).  All such cases that we could identify, 18 in total, were re-coded as "element" 'E' (which stands for, or can be interpreted as, "Enzyme," "Exception," "Emergency" or whatever you like).  This affected the NSC numbers: 600660 600661 600662 600665 603570 603571 603572 605606 605850 605851 605852 606640 607637 607770 607771 607772 609472 612825.  All such cases have been marked in the canonical property (see below) E_COMPOUND_TYPE with the value "WARNING: Pseudo-structure! See DTP names."
  • Bonds with stereochemistry unspecified in the input structure were set to "wiggly", i.e. bond attribute "3" set. This applies even to double bonds in structures with unspecified E/Z double bonds. [Note: Attributes "3" for double bonds have been removed, and will have to be recalculated with an improved algorithm.]
  • Bond type was explicitly set to "complex" for bonds to metal ions for which valences of the metal ion and ligand were correct by coincidence (and thus had been set by CACTVS to a regular bond type by default).
  • Nitro groups mis-coded in the input file were fixed and coded in a standard way: Groups coded as either OH-N-OH (327 cases) or HO-N=O (227 cases) were changed to O=N+-O-.
  • Stereochemistry perception was performed by CACTVS by evaluating the appropriate structural features in the input file, such as presence of wedge bonds encoded in the input structure.  Both chiral atoms (R/S) and double bonds (E/Z) were evaluated. Four different cases can, and do, occur:
    1. No stereogenic center present in molecule (E_STEREO_SPECIFIED set to "no_stereocenter"); 143,418 structures.
    2. Stereogenic center(s) present, all unspecified (E_STEREO_SPECIFIED set to "stereo_unknown"); 86,980 structures.
    3. Stereogenic center(s) present, some specified (E_STEREO_SPECIFIED set to "partial_stereo"); 9,931 structures.
    4. Stereogenic center(s) present, all specified (E_STEREO_SPECIFIED set to "full_stereo"); 19,742 structures.
    [Note: These values *may* change slightly after the E/Z bond perception algorithm has been improved.]
  • 3D coordinates were calculated with CORINA 2.6 for all structures; except (a) structures possessing a bond of type "complex"; (b) structures with unspecified stereocenters (unless the structure had only one chiral atom, in which case the calculated 3D structure can at worst be the wrong enantiomer; such cases can be identified by the canonical property E_THREED_SOURCE value "CORINA 2.6 - default enantiomer!" [33,852 structures]); (c) structures for which the CORINA calculation simply failed (6 structures). Note: (a), (b) and (c) are not mutually exclusive. A total of 192,310 structures now have 3D coordinates.
  • If 3D coordinates were not calculated, then 2D coordinates were re-calculated by CACTVS (usually producing a better-looking 2D drawing); except for structures possessing a bond of type "complex". These were found to be often rather complicated structures (e.g. ferrocenes) for which the original 2D drawing gave some indication as to the geometry of the complex, which we did not want to overwrite. [Note: Will be implemented only in next update; currently all 2D structures show original coordinates.]
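
The four E_STEREO_SPECIFIED cases described in the stereochemistry bullet above amount to a simple decision on two counts. The Python sketch below only illustrates that decision; it is not the CACTVS perception algorithm, and the function name and its two count arguments are hypothetical.

```python
def classify_stereo(n_centers: int, n_specified: int) -> str:
    """Map stereocenter counts to an E_STEREO_SPECIFIED value.

    n_centers   -- hypothetical count of stereogenic centers
                   (chiral atoms plus E/Z double bonds)
    n_specified -- how many of those have defined stereochemistry
    """
    if not 0 <= n_specified <= n_centers:
        raise ValueError("n_specified must be between 0 and n_centers")
    if n_centers == 0:
        return "no_stereocenter"   # case 1
    if n_specified == 0:
        return "stereo_unknown"    # case 2
    if n_specified < n_centers:
        return "partial_stereo"    # case 3
    return "full_stereo"           # case 4


print(classify_stereo(3, 1))  # partial_stereo
```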

Properties

The original properties from the various input files we used were left untouched. These are NSC, DTP_NAMES, CAS_RN, and ORIGIN. The first three should be self-explanatory. If no name is available in the DTP records, DTP_NAMES was set to "(none)". If no CAS RN is available in the DTP records, CAS_RN was set to 999-99-9. ORIGIN denotes the DTP release which was the source of this structure. If the same NSC number occurred in several releases (which was the case for most of the structures), the newer release overwrote the older one(s). Statistics for this combined file:
DTP OCT 1999: 12279
DTP AUG 2000: 7
DTP FEB 2003: 247138
DTP SEP 2003: 647
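
As a quick cross-check, these four ORIGIN counts should add up to the same total as the four E_STEREO_SPECIFIED categories listed earlier, presumably because every entry carries exactly one value of each. A minimal Python sketch of that arithmetic, using only numbers quoted on this page (the variable names are ours):

```python
# Entry counts per DTP release, as listed above.
origin_counts = {
    "DTP OCT 1999": 12279,
    "DTP AUG 2000": 7,
    "DTP FEB 2003": 247138,
    "DTP SEP 2003": 647,
}

# Entry counts per E_STEREO_SPECIFIED value, from the stereochemistry notes.
stereo_counts = {
    "no_stereocenter": 143418,
    "stereo_unknown": 86980,
    "partial_stereo": 9931,
    "full_stereo": 19742,
}

total_by_origin = sum(origin_counts.values())
total_by_stereo = sum(stereo_counts.values())
print(total_by_origin, total_by_stereo)  # 260071 260071
```

Both sums come to 260,071 entries.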

Canonical Properties

The following "Canonical Properties" were added, which are intended to facilitate processing and identifying compounds, as well as to allow mixing and matching of multiple databases from different sources. Some of these properties are specific to each compound; others are constant across any single database. Their names were chosen to (a) reflect CACTVS nomenclature ("E_..." standing for "ensemble" properties, an "ensemble" being the data structure created from, e.g., one, possibly multi-fragment, record in a multi-structure SD file); (b) be as unlikely as possible to conflict with pre-existing property names in existing databases, which the "E_..." naming convention has fulfilled so far for all databases we have encountered.

  • E_UNIQUE_ID : Identifier that is meant (but not guaranteed!) to be unique across all databases we are handling and will be posting. Typically a concatenation of database source/name, database version/date, and ID number used within that database. Example: NCI-Open_09-03_NSC_123456.
  • E_NSC : DTP/NCI's accession number for each sample. Unique; current range: 1 - 722245; non-contiguous (i.e. gaps do occur); no unique relationship with a chemical (i.e. the same chemical can occur multiple times under different NSC numbers). [This canonical property occurs only in databases of NCI compounds. It is added because "E_NSC" is a predefined property in CACTVS.]
  • E_CAS : CAS (Chemical Abstracts Service) Registry Number. Repeated even if the database has a database-specific CAS RN field (since the property names for such fields vary wildly across databases). If the CAS RN is unknown, 999-99-9 is used (which is a non-existing entry). This does not necessarily mean that this structure does not possess a valid CAS RN, only that it is not recorded in the database. (For the NCI Database, however, we estimate that several tens of thousands of compounds truly do not possess a CAS RN.)
  • E_NAME : Name of this entry in the SD file. This will typically be the name used in the input entry's first line of the header block (also called "MDL Name"). However, depending on the database, this can also be taken from some other name-type field in the property block. E_NAME often is, but doesn't have to be, a chemical name. For the NCI Database, this was set to "NSC x" with x denoting the actual NSC number.
  • E_STEREO_SPECIFIED : Indicates whether stereogenic centers are present in the structure, and if yes, whether they are fully, partially, or not at all defined. Possible values: no_stereocenter, full_stereo, partial_stereo, stereo_unknown. See also above.
  • E_COMPOUND_TYPE : Indicates type of compound. Main currently possible values, listed in the order of priority: complex (typically used for organometallic complexes; must have explicit bond of type "complex"; 5,160 entries [only 12 entries were identified as complexes without metal]); metal-containing (typically used for salts, possessing a disconnected metal ion; 14,025 entries including complexes); normal (everything else). Additional values may be added in the future.
  • E_THREED_SOURCE : Source of the coordinates for this structure. Currently possible values: original_DB (coordinates as they came in the input file, whether 2D or 3D); CACTVS_2D (2D coordinates recalculated by CACTVS, typically used when 3D calculation failed or would produce doubtful results); "CORINA 2.6" (3D coordinates calculated by CORINA 2.6; the standard case); "CORINA 2.6 - default enantiomer!" (3D structure calculated by CORINA for compound with one chiral center with unknown stereochemistry according to default rules - this could be the wrong enantiomer); experimental (experimentally determined coordinates, such as by X-ray crystallography [not applicable to the NCI Database]).
  • E_SMILES : Unique SMILES string, calculated by CACTVS according to Daylight's original (1989) published canonicalization rules. [Note: not yet implemented; the SMILES in the current file are still non-Unique (though valid) SMILES strings.]
  • E_FORMULA : Molecular formula calculated by CACTVS (including hydrogens; and with indication of charge state by +n or -n appended [for 10,531 entries]). This field is included even if the database has a pre-existing formula field (which may have different value than E_FORMULA because of hydrogen addition etc.).
  • Eight different variants of CACTVS hash codes are included. They are 64 bit unsigned integer numbers, represented as 16-digit hexadecimal strings.

    Property Name         Tautomer-invariant  Stereo-sensitive  Isotope-sensitive, charge-sensitive  Fragment handling
    E_HASHY               no                  no                yes                                  entire ensemble
    E_HASHSY              no                  yes               yes                                  entire ensemble
    E_TAUTO_HASH          yes                 no                yes                                  entire ensemble
    E_STEREO_TAUTO_HASH   yes                 yes               yes                                  entire ensemble
    E_MAXFRAG_HASHY       no                  no                no                                   largest fragment only
    E_MAXFRAG_HASHSY      no                  yes               no                                   largest fragment only
    E_MAXFRAG_HASHTY      yes                 no                no                                   largest fragment only
    E_MAXFRAG_HASHSTY     yes                 yes               no                                   largest fragment only

    They span the range from very lenient to very strict in terms of distinguishing between chemicals. E_MAXFRAG_HASHTY is the most lenient one, and will equate structures that may contain different isotopes, be the salts of different metals (since only the ensemble's largest fragment is evaluated), be charged or uncharged, be different stereoisomers, and be represented in different tautomeric forms. E_HASHSY, conversely, is the strictest one, distinguishing between all those cases. However, it will interpret different tautomers of the same compound as different chemicals, which is not usually desirable from a chemical point of view when, e.g., comparing databases. E_STEREO_TAUTO_HASH, being tautomer-invariant, but sensitive to everything else, is probably the most useful of the hash codes included, since it is essentially a calculable, unique identifier for any small-molecule chemical. Note that this hash code will, e.g., distinguish between O, O-, O2-, OH., OH-, H2O, D2O, H3O+ etc., species that many other systems and databases - especially if they don't include hydrogens - will project onto the same entity "O". [Note: A bug was found in the charge sensitivity setting of the E_MAXFRAG_... series of hash codes. We recommend not using these hash codes at this time.]
  • E_MULTIFRAG : Indicates whether the entry consists of one or more disconnected fragments. Possible values: single_fragment; multi_fragment. The most common reasons for multi_fragment are the presence of a counter ion in a salt and presence of a solvent molecule.
  • E_ORIGIN : Origin of the input data. This is usually a brief sentence giving the database name, the agency/company/organization it came from, its date, and possibly other brief explanation. [Constant for entire database.] Note for NCI Database: This is not the same as the property ORIGIN.
  • E_DB_VERSION : Explicit version number of the input database, if existent. Otherwise NULL. [Constant for entire database.]
  • E_DB_YEAR : Year of release of this database version/update. [Constant for entire database.]
  • E_DB_TYPE : Type of the database. Currently used values: "government public", "government non-public", "commercial free", "commercial licensed", "unknown". [Constant for entire database.]
  • E_IS_PUBLIC : Indicates if compounds in database can be publicly used. Possible values: public, non_public, unknown. [Constant for entire database.]
  • E_SOURCE : Source of the database.  Usually the name of the entire agency, company or organization. E.g. NCI, NIST, EPA. [Constant for entire database.]
  • E_CONTEXT : Context, or reason for inclusion, of compounds in the database. Various values are possible: "compound screened in anti-cancer and/or anti-HIV assays" (for NCI Database); "environmentally relevant compound"; "compound in reference database" etc. [Constant for entire database.]
  • E_SAMPLES_AVAILABLE : Indicates if samples are, or may be, available for compounds in the database. Currently possible values: yes, no, on-demand, is_drug, unknown. This is derived from the nature of the database as a whole. It is not modified to reflect, e.g., the fact that some of the NCI compounds may be commonly available drugs, or otherwise commercially available chemicals. Specific to the NCI Database, it also doesn't take into account the fact that many samples in the NCI repository have been depleted, but is simply set to yes for all compounds in the NCI Database. [Constant for entire database.]
  • E_SUPPLIER_TYPE : If E_SAMPLES_AVAILABLE was neither no nor unknown, this property will attempt to indicate what the nature of the database issuer is with respect to the possibility of obtaining samples of the compounds in the database. Currently used values: broker, directory, manufacturer, repository. For the NCI Database, this property was set to repository. [Constant for entire database.]
  • E_CACTVS_VERSION : Version of the CACTVS toolkit that was used to process the database and generate this file.
  • E_ICHI : The new IUPAC/NIST Chemical Identifier, a multi-line, XML-based, unique chemical identifier. At this time, the beta version 0.932 was used to calculate the IChI values. [Note: Next update of this file will contain IChI's calculated with version 1.0 or newer of the IUPAC/NIST program.]
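
To illustrate how the canonical properties can be used, the Python sketch below reads the data items of an SD file and groups entries by E_STEREO_TAUTO_HASH, so that entries sharing a hash code (i.e. the same chemical by that criterion) end up together. It is a minimal reader written for this example, not part of CACTVS: it assumes the usual SD layout ("> <NAME>" header line, value lines, blank line, "$$$$" record terminator), and the sample records and their hash values are made up.

```python
import re
from collections import defaultdict

# Data item header line, e.g. "> <E_NSC>" or ">  <E_NSC> (1)".
TAG = re.compile(r"^>.*?<([^>]+)>")


def iter_sd_properties(lines):
    """Yield one {property name: value} dict per SD record.

    The structure block is skipped; only data items are collected.
    """
    props, name, value = {}, None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "$$$$":            # end of record
            if name is not None:
                props[name] = "\n".join(value)
            yield props
            props, name, value = {}, None, []
        elif name is not None:                # inside a data item
            if line.strip() == "":
                props[name] = "\n".join(value)
                name, value = None, []
            else:
                value.append(line)
        else:                                 # looking for the next header
            m = TAG.match(line)
            if m:
                name, value = m.group(1), []


def group_by_hash(records, key="E_STEREO_TAUTO_HASH"):
    """Group E_UNIQUE_ID values by the integer value of a hash property."""
    groups = defaultdict(list)
    for rec in records:
        if key in rec:
            # Hash codes are 16-digit hexadecimal strings (64 bit unsigned).
            groups[int(rec[key], 16)].append(rec.get("E_UNIQUE_ID", "?"))
    return groups


# Made-up records: structure blocks omitted, hash values invented.
sample = """\
NSC 1
M  END
> <E_UNIQUE_ID>
NCI-Open_09-03_NSC_1

> <E_STEREO_TAUTO_HASH>
00000000DEADBEEF

$$$$
NSC 2
M  END
> <E_UNIQUE_ID>
NCI-Open_09-03_NSC_2

> <E_STEREO_TAUTO_HASH>
00000000DEADBEEF

$$$$
NSC 3
M  END
> <E_UNIQUE_ID>
NCI-Open_09-03_NSC_3

> <E_STEREO_TAUTO_HASH>
0123456789ABCDEF

$$$$
"""

groups = group_by_hash(iter_sd_properties(sample.splitlines()))
duplicates = {h: ids for h, ids in groups.items() if len(ids) > 1}
for h, ids in duplicates.items():
    print(f"{h:016X}", ids)
```

For a real file, pass an open file object instead of sample.splitlines(). Any of the other hash properties can be chosen via the key argument, bearing in mind the note above about the E_MAXFRAG_... series.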

All these files are based on the publicly and freely available data from NCI's Developmental Therapeutics Program (DTP). We collected the structures and biological data from DTP, combined them where applicable, and generated SMILES and MDL SD files from this information.

These files were compressed with the program gzip. This program is available for many platforms, and comes preloaded on most of the recent versions of many major varieties of Unix. To prevent web browsers from trying to uncompress a file with the extension ".gz" "on the fly" and display it on your screen (!), the names of the downloadable files were changed to NCInDA99.sdz (n = 0, 2, 3 for the 0D, 2D, 3D file, respectively; "A99" stands for October 1999 [with hexadecimal notation for the month]), CAN2DA99.sdz, AID2DA99.sdz etc.
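
The naming scheme can be decoded mechanically. The Python sketch below assumes that every file name follows the pattern of the examples given here (three-letter prefix, dimensionality digit, "D", hexadecimal month, two-digit year); the function name and the returned keys are hypothetical, and the two-digit year is returned as-is rather than guessing a century.

```python
import re

# e.g. NCI3DA99.sdz -> prefix NCI, 3D, month A (hex) = 10, year 99
NAME = re.compile(r"^([A-Z]{3})([023])D([1-9A-C])(\d{2})\.sdz$")


def decode_name(filename):
    """Split a download file name into its encoded parts."""
    m = NAME.match(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    prefix, dim, month_hex, year = m.groups()
    return {
        "prefix": prefix,
        "dimension": f"{dim}D",
        "month": int(month_hex, 16),  # hexadecimal month, A = 10 = October
        "year": year,                 # two digits, as in the file name
    }


print(decode_name("NCI3DA99.sdz"))
# {'prefix': 'NCI', 'dimension': '3D', 'month': 10, 'year': '99'}
```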

You may have to rename them to NCInDA99.sdf.gz etc. before gunzip'ing them. If you (have to) rename them to NCInDA99.gz, gunzip will uncompress them to a file named NCInDA99, unless you use the gunzip option "-N", which will restore the name NCInDA99.sdf. (These file names were chosen to conform to the 8.3 file name convention for those users that may download, e.g., to DOS-type FAT 16 file systems. This practice may be discontinued in the future.)
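
If gunzip is not at hand, or renaming is inconvenient, the files can also be unpacked without renaming them, since the ".sdz" files are ordinary gzip files. A minimal sketch using Python's standard gzip module (the function name unpack_sdz is ours):

```python
import gzip
import shutil
from pathlib import Path


def unpack_sdz(src, dest=None):
    """Decompress a gzip-compressed .sdz file to an .sdf file.

    If dest is not given, the output is written next to the input
    with the extension changed to .sdf.
    """
    src = Path(src)
    dest = Path(dest) if dest is not None else src.with_suffix(".sdf")
    # Stream the data so that large files need not fit in memory.
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Usage: unpack_sdz("NCI3DA99.sdz") writes NCI3DA99.sdf in the same directory and returns its path.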

All files (after decompression) are in MDL's SDFile format with two identification fields:

  • NSC - the NCI's internal identification number of the database entry
  • CAS_RN - the CAS Registry Number. Present with a value other than 999-99-9 (dummy value) only for those compounds for which it was entered in the NCI database. (A CAS_RN of 999-99-9 does not necessarily mean that the compound has no CAS Registry Number - it just was not entered in the NCI database.)
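
When processing these fields, the dummy value should be treated as "missing" rather than as a real registry number. The Python sketch below shows one way to do this, and additionally applies the publicly documented CAS check-digit rule (sum of the digits weighted 1, 2, 3, ... from the right, modulo 10), under which 999-99-9 does not validate. The helper names are hypothetical, and the check-digit test is our addition, not something the database files themselves rely on.

```python
import re

DUMMY_CAS = "999-99-9"  # placeholder used when no CAS RN was recorded
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def known_cas(cas_rn):
    """Return the CAS RN, or None if it is the dummy value."""
    cas_rn = cas_rn.strip()
    return None if cas_rn == DUMMY_CAS else cas_rn


def cas_checksum_ok(cas_rn):
    """Check the format and check digit of a CAS Registry Number."""
    m = CAS_PATTERN.match(cas_rn.strip())
    if m is None:
        return False
    digits = m.group(1) + m.group(2)
    # Rightmost digit gets weight 1, the next one weight 2, and so on.
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(m.group(3))


print(known_cas("999-99-9"), cas_checksum_ok("999-99-9"))    # None False
print(known_cas("7732-18-5"), cas_checksum_ok("7732-18-5"))  # 7732-18-5 True
```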

In the 2D files with biological data, you'll find the following additional fields (not necessarily present in all files for all compounds):

  • NLOGGI50 - Log GI50 data, comprising the following columns:
    CONCUNIT, LCONC, PANEL, CELL, PANELNBR, CELLNBR, NLOGGI50, INDN, TOTN
  • NLOGTGI - Log TGI data, comprising the following columns:
    CONCUNIT, LCONC, PANEL, CELL, PANELNBR, CELLNBR, NLOGTGI, INDN, TOTN
  • NLOGLC50 - Log LC50 data, comprising the following columns:
    CONCUNIT, LCONC, PANEL, CELL, PANELNBR, CELLNBR, NLOGLC50, INDN, TOTN
  • NCI_AIDS_Antiviral_Screen_Conclusion - AIDS Screening result (CI = Confirmed Inactive, CM = Confirmed Moderate[ly active], CA = Confirmed Active)
  • NCI_AIDS_Antiviral_Screen_EC50 - AIDS EC50 result with five columns:
    HiConc, ConcUnit, Flag, EC50, NumExp.
    Note that for some compounds, the EC50 has been measured more than once.
  • NCI_AIDS_Antiviral_Screen_IC50 - AIDS IC50 result with five columns:
    HiConc, ConcUnit, Flag, IC50, NumExp.
    Note that for some compounds, the IC50 has been measured more than once.
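
The multi-column fields above can be turned into named records once the column order is known. The Python sketch below does this for NLOGGI50. Note the assumptions: the page does not state how columns are delimited inside the field, so the separator is a parameter (tab by default) that must be checked against the actual files; the function name is hypothetical; and the example row is made up purely to show the mechanics.

```python
GI50_COLUMNS = [
    "CONCUNIT", "LCONC", "PANEL", "CELL", "PANELNBR",
    "CELLNBR", "NLOGGI50", "INDN", "TOTN",
]


def parse_rows(field_text, columns, sep="\t"):
    """Split each non-empty line of a data field into a dict of columns.

    sep is an ASSUMPTION about the file layout; adjust as needed.
    """
    rows = []
    for line in field_text.splitlines():
        if not line.strip():
            continue
        parts = [p.strip() for p in line.split(sep)]
        if len(parts) != len(columns):
            raise ValueError(
                f"expected {len(columns)} columns, got {len(parts)}: {line!r}"
            )
        rows.append(dict(zip(columns, parts)))
    return rows


# Made-up row, for illustration only (not real screening data).
example = "M\t-4.0\tPANEL-X\tCELL-Y\t1\t2\t5.0\t1\t1"
row = parse_rows(example, GI50_COLUMNS)[0]
print(row["CELL"], row["NLOGGI50"])  # CELL-Y 5.0
```

The same function applies to NLOGTGI and NLOGLC50 by swapping the seventh column name. Values are kept as strings so that no assumption is made about their numeric format.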

For more explanation on these data, in particular the meaning of the column headings, please see the Web pages of the DTP Human Tumor Cell Line Screen and/or the DTP AIDS Antiviral Screen.

Please also note that no editing of the biological test data has been performed. This means that all DTP results for which the chemical structure is available have been included. This includes data from "non-production" cell lines, i.e. cell lines that were used only a short time during test phases, as well as data from those ten cell lines that were replaced by a new block of ten around 1992. It is up to the user to do their own evaluation, statistics, and, if necessary, (pre-)processing, of these data before using them for any purpose.

In the 3D file, hydrogens were added by CORINA, whereas they are not present in the 0D and 2D files. In the 2D files, the stereochemistry shown is in fact meaningless, since it was decided upon at random. This is not easily changeable.

In previous versions of this page, the 0D information was called "2D". This has been changed to avoid confusion with the new 2D information added. The file that was previously called NCI2D397.sdz is therefore mostly identical with the new file NCI0DA99.sdz with the exception of the newly added compounds.

The sizes listed for the uncompressed files are in "real" MB, i.e. 1024 x 1024 bytes.
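
For comparison with a byte count reported by your own system, the conversion is a single division. A small Python sketch (the byte count is an arbitrary example, not the size of any of the files):

```python
BYTES_PER_MB = 1024 * 1024  # "real" MB as used on this page


def size_in_mb(n_bytes):
    """Convert a size in bytes to MB of 1024 x 1024 bytes."""
    return n_bytes / BYTES_PER_MB


n_bytes = 52_428_800            # arbitrary example value
print(size_in_mb(n_bytes))      # 50.0
print(n_bytes / 1_000_000)      # 52.4288 (decimal megabytes, for contrast)
```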

Our 249,081 structure set is a combination of three sets:

Our 2D files with AIDS data contain data for 2 more structures than the file available at the DTP AIDS Antiviral Screen.

The DTP Human Tumor Cell Line Screen biological data file contains cancer screen data for 370 more entries for which we don't have the structure (these structures are not available on the DTP site).

For 10 out of the 249,081 structures, the 3D generation process failed.

M. C. Nicklaus

Last Update: 2015-12-30