The original properties from the various input files we used were left
untouched. These are NSC, DTP_NAMES, CAS_RN,
and ORIGIN. The first three should be self-explanatory.
If no name is available in the DTP records, DTP_NAMES was set to
"(none)". If no CAS RN is available in the DTP records, CAS_RN
was set to 999-99-9. ORIGIN denotes the DTP release which
was the source of this structure. If the same NSC number occurred
in several releases (which was the case for most of the structures), the
newer release overwrote the older one(s). Statistics for this combined
DTP OCT 1999: 12279
DTP AUG 2000: 7
DTP FEB 2003: 247138
DTP SEP 2003: 647
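The overwrite rule described above (a newer release replaces an older entry with the same NSC number) can be sketched in Python as follows. This is a minimal illustration and not the procedure we actually ran; the record layout (plain dictionaries) and the function names merge_releases and origin_counts are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]  # hypothetical layout: property name -> value


def merge_releases(releases: Iterable[Tuple[str, List[Record]]]) -> Dict[str, Record]:
    """Merge releases listed oldest first; a later release overwrites an earlier one."""
    merged: Dict[str, Record] = {}
    for origin, records in releases:
        for rec in records:
            entry = dict(rec)
            entry["ORIGIN"] = origin  # release that supplied this structure
            # Placeholders documented above for missing values.
            entry.setdefault("DTP_NAMES", "(none)")
            entry.setdefault("CAS_RN", "999-99-9")
            merged[entry["NSC"]] = entry  # same NSC: newer replaces older
    return merged


def origin_counts(merged: Dict[str, Record]) -> Dict[str, int]:
    """Count how many merged entries come from each release."""
    return dict(Counter(rec["ORIGIN"] for rec in merged.values()))
```

Because each NSC number keeps only its most recent record, the per-release counts produced this way add up to the number of distinct NSC numbers.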
The following "Canonical Properties" were added, which are intended
to facilitate processing and identifying compounds, as well as to allow
mixing and matching of multiple databases from different sources.
Some of these properties are specific to each compound; others are constant
across any single database. Their names were chosen to (a) reflect
CACTVS nomenclature ("E_..." standing for "ensemble" properties, an "ensemble"
being the data structure created from, e.g., one, possibly multi-fragment,
record in a multi-structure SD file); (b) be as unlikely as possible to
conflict with pre-existing property names in existing databases, which
the "E_..." naming convention has fulfilled so far for all databases we
- E_UNIQUE_ID : Identifier that is meant (but not guaranteed!) to
be unique across all databases we are handling and will be posting.
Typically a concatenation of database source/name, database version/date,
and ID number used within that database. Example: NCI-Open_09-03_NSC_123456.
- E_NSC : DTP/NCI's accession number for each sample. Unique; current
range: 1 - 722245; non-contiguous (i.e. gaps do occur); no unique relationship
with a chemical (i.e. the same chemical can occur multiple times under
different NSC numbers). [This canonical property occurs only in databases
of NCI compounds. It is added because "E_NSC" is a predefined property
- E_CAS : CAS (Chemical Abstract Service) Registry Number. Repeated
even if the database has a database-specific CAS RN field (since the property
names for such fields vary wildly across databases). If the CAS RN is unknown,
999-99-9 is used (which is a non-existing entry). This does not necessarily
mean that this structure does not possess a valid CAS RN, only that
it is not recorded in the database. (For the NCI Database, however, we estimate
that several tens of thousands of compounds truly do not possess a CAS
- E_NAME : Name of this entry in the SD file. This will typically
be the name used in the input entry's first line of the header block (also
called "MDL Name"). However, depending on the database, this can
also be taken from some other name-type field in the property block.
E_NAME often is, but doesn't have to be, a chemical name. For the
NCI Database, this was set to "NSC x" with x denoting
the actual NSC number.
- E_STEREO_SPECIFIED : Indicates whether stereogenic centers are present
in the structure, and if yes, whether they are fully, partially, or not
at all defined. Possible values: no_stereocenter, full_stereo,
See also above.
- E_COMPOUND_TYPE : Indicates the type of compound. Main currently possible
values, listed in the order of priority: complex (typically used
for organometallic complexes; must have explicit bond of type "complex";
5,160 entries [only 12 entries were identified as complexes without metal]);
(typically used for salts, possessing a disconnected metal ion; 14,025
entries including complexes); normal (everything else).
Additional values may be added in the future.
- E_THREED_SOURCE : Source of the coordinates for this structure.
Currently possible values: original_DB (coordinates as they came
in the input file, whether 2D or 3D); CACTVS_2D (2D coordinates
recalculated by CACTVS, typically used when 3D calculation failed or would
produce doubtful results); "CORINA 2.6" (3D coordinates
calculated by CORINA 2.6; the standard case); "CORINA 2.6 - default
enantiomer!" (3D structure calculated by CORINA for compound with
one chiral center with unknown stereochemistry according to default rules
- this could be the wrong enantiomer); experimental (experimentally
determined coordinates, such as by X-ray crystallography [not applicable
to the NCI Database]).
- E_SMILES : Unique SMILES string, calculated by CACTVS according
to Daylight's original (1989) published canonicalization rules.
[Note: not yet implemented; the SMILES in the current
file are still non-Unique (though valid) SMILES strings.]
- E_FORMULA : Molecular formula calculated by CACTVS (including hydrogens;
and with indication of charge state by +n or -n appended
[for 10,531 entries]). This field is included even if the database
has a pre-existing formula field (which may have a different value than E_FORMULA
because of hydrogen addition etc.).
- Eight different variants of CACTVS hash codes are included. They are 64
bit unsigned integer numbers, represented as 16-digit hexadecimal strings.
[Table of the eight hash code variants; four of them evaluate the largest fragment only.]
They span the range from very lenient to very strict in terms of distinguishing
between chemicals. E_MAXFRAG_HASHTY is the most lenient one,
and will equate structures that may contain different isotopes, be the
salts of different metals (since only the ensemble's largest fragment is
evaluated), be charged or uncharged, be different stereoisomers, and be
represented in different tautomeric forms. E_HASHSY, conversely,
is the strictest one, distinguishing between all those cases. However,
it will interpret different tautomers of the same compound as different
chemicals, which is not usually desirable from a chemical point of view
when, e.g., comparing databases. E_STEREO_TAUTO_HASH, being tautomer-invariant,
but sensitive to everything else, is probably the most useful of the hash
codes included, since it is essentially a calculable, unique identifier
for any small-molecule chemical. Note that this hash code will, e.g., distinguish
between O, O-, O2-, OH., OH-,
H2O, D2O, H3O+ etc., species
that many other systems and databases - especially if they don't include
hydrogens - will project onto the same entity "O".
[Note: A bug was found in the charge sensitivity setting
of the E_MAXFRAG_... series of hash codes. We recommend not using these hash
codes at this time.]
- E_MULTIFRAG : Indicates whether the entry consists of one or more
disconnected fragments. Possible values: single_fragment;
multi_fragment. The most common reasons for multi_fragment are
the presence of a counter ion in a salt and presence of a solvent molecule.
- E_ORIGIN : Origin of the input data. This is usually a brief sentence
giving the database name, the agency/company/organization it came from,
its date, and possibly other brief explanation. [Constant for entire database.]
Note for NCI Database: This is not the same as the property ORIGIN.
- E_DB_VERSION : Explicit version number of the input database, if
existent. Otherwise NULL. [Constant for entire database.]
- E_DB_YEAR : Year of release of this database version/update. [Constant
for entire database.]
- E_DB_TYPE : Type of the database. Currently used values: "government
public", "government non-public", "commercial free",
"commercial licensed", "unknown". [Constant for entire
- E_IS_PUBLIC : Indicates if compounds in database can be publicly
used. Possible values: public, non_public, unknown.
[Constant for entire database.]
- E_SOURCE : Source of the database. Usually the name of the
entire agency, company or organization. E.g. NCI, NIST,
EPA. [Constant for entire database.]
- E_CONTEXT : Context, or reason for inclusion, of compounds in the
database. Various values are possible: "compound screened in anti-cancer
and/or anti-HIV assays" (for NCI Database); "environmentally relevant
compound"; "compound in reference database" etc. [Constant
for entire database.]
- E_SAMPLES_AVAILABLE : Indicates if samples are, or may be, available
for compounds in the database. Currently possible values: yes,
no, on-demand, is_drug, unknown. This
is derived from the nature of the database as a whole. It is not
modified to reflect, e.g., the fact that some of the NCI compounds may
be commonly available drugs, or otherwise commercially available chemicals.
Specific to the NCI Database, it also doesn't take into account the fact
that many samples in the NCI repository have been depleted, but is simply
set to yes for all compounds in the NCI Database. [Constant for
- E_SUPPLIER_TYPE : If E_SAMPLES_AVAILABLE was neither no
nor unknown, this property will attempt to indicate what the nature
of the database issuer is with respect to the possibility of obtaining
samples of the compounds in the database. Currently used values:
broker, directory, manufacturer, repository.
For the NCI Database, this property was set to repository. [Constant
for entire database.]
- E_CACTVS_VERSION : Version of the CACTVS toolkit that was used to
process the database and generate this file.
- E_ICHI : The new IUPAC/NIST Chemical
Identifier, a multi-line, XML-based, unique
chemical identifier. At this time, the beta version 0.932 was used
to calculate the IChI values.
[Note: Next update of this file will contain IChI's calculated
with version 1.0 or newer of the IUPAC/NIST program.]
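As an illustration of how the canonical properties listed above can be used to match compounds across databases, the following Python sketch reads the data items of an SD file and groups entries by E_STEREO_TAUTO_HASH. It is a minimal example under stated assumptions: the function names iter_sd_properties and group_by_hash are ours, the reader ignores the connection table entirely, and it assumes that no data value line itself starts with ">".

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

# Data header lines in an SD file look like "> <E_NSC>" (extra text may follow).
_FIELD = re.compile(r"^>.*?<([^>]+)>")


def iter_sd_properties(handle: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {property name: value} dict per SD record ("$$$$"-terminated)."""
    props: Dict[str, str] = {}
    name = None
    lines: List[str] = []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line == "$$$$":  # end of record
            if name is not None:
                props[name] = "\n".join(lines).strip()
            yield props
            props, name, lines = {}, None, []
            continue
        match = _FIELD.match(line)
        if match:  # a new data item starts; store the previous one
            if name is not None:
                props[name] = "\n".join(lines).strip()
            name, lines = match.group(1), []
        elif name is not None:
            lines.append(line)


def group_by_hash(records: Iterable[Dict[str, str]],
                  key: str = "E_STEREO_TAUTO_HASH",
                  id_field: str = "E_UNIQUE_ID") -> Dict[int, List[str]]:
    """Group entry identifiers by a hash code (16-digit hex string -> integer)."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for props in records:
        value = props.get(key)
        if value is not None:
            groups[int(value, 16)].append(props.get(id_field, "?"))
    return dict(groups)
```

Groups with more than one identifier are candidates for being the same chemical under the chosen hash code. Given the bug note above, the E_MAXFRAG_... hash codes should not be passed as key for now.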
All these files are based on the publicly and freely available data
from NCI's Developmental Therapeutics
Program (DTP). We collected the structures and biological data from
DTP, combined them where applicable, and generated SMILES and MDL SD files
from this information.
These files were compressed with the program gzip. This program is available
for many platforms, and comes preloaded on most of the recent versions
of many major varieties of Unix. To keep web browsers from trying to
uncompress a file with the extension ".gz" "on the fly" and display it on
your screen (!), the names of the downloadable
files were changed to NCInDA99.sdz (n = 0, 2, 3 for the 0D, 2D, 3D file,
respectively; "A99" stands for October 1999 [with hexadecimal notation
for the month]), CAN2DA99.sdz, AID2DA99.sdz etc.
You may have to rename them to NCInDA99.sdf.gz etc. before gunzip'ing
them. If you (have to) rename them to NCInDA99.gz, gunzip will uncompress
them to a file named NCInDA99, unless you use the gunzip option "-N", which
will restore the name NCInDA99.sdf. (These file names were chosen to conform
to the 8.3 file name convention for those users that may download, e.g.,
to DOS-type FAT 16 file systems. This practice may be discontinued in the future.)
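If gunzip is not at hand, or renaming is inconvenient, the files can also be unpacked with a short script. The Python sketch below uses the standard gzip module, which does not depend on the file extension, and also decodes the release tag at the end of the file name (hexadecimal month followed by a two-digit year). The function names unpack_sdz and decode_release are our own choices for this example, and the assumption that the tag is always the last three characters of the name is taken from the file names mentioned above.

```python
import gzip
import shutil
from pathlib import Path
from typing import Optional, Tuple


def decode_release(stem: str) -> Tuple[int, int]:
    """Decode the trailing tag of a name such as 'NCI3DA99' into (month, two-digit year)."""
    tag = stem[-3:]
    month = int(tag[0], 16)  # hexadecimal month: 'A' -> 10 (October)
    if not 1 <= month <= 12:
        raise ValueError(f"not a valid month digit in {stem!r}")
    return month, int(tag[1:])


def unpack_sdz(src, dest_dir: Optional[str] = None) -> Path:
    """Decompress a gzip-compressed .sdz file to an .sdf file and return its path."""
    src = Path(src)
    dest = Path(dest_dir or src.parent) / (src.stem + ".sdf")
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streamed copy; file is not loaded into memory
    return dest
```

For example, decode_release("NCI3DA99") yields month 10 and year 99, and unpack_sdz("NCI3DA99.sdz") writes NCI3DA99.sdf next to the downloaded file.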
All files (after decompression) are in MDL's SDFile format with two
- NSC - the NCI's internal identification number of the database entry
- CAS_RN - the CAS Registry Number. Present with a value other than 999-99-9
(dummy value) only for those compounds for which it was entered in the
NCI database. (A CAS_RN of 999-99-9 does not necessarily mean that the
compound has no CAS Registry Number; it just was not recorded in the NCI database.)
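When processing the CAS_RN field, the dummy value has to be filtered out. A minimal Python sketch is shown below; it tests for the dummy value explicitly and additionally applies the standard CAS check-digit rule (the digits before the check digit are weighted 1, 2, 3, ... from right to left, and the sum modulo 10 must equal the check digit). The function names are assumptions made for this example. Incidentally, 999-99-9 fails the check-digit test, so it cannot collide with a real Registry Number.

```python
import re

DUMMY_CAS = "999-99-9"  # placeholder used when no CAS RN was recorded
_CAS = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_checksum_ok(cas: str) -> bool:
    """Return True if the string has CAS RN syntax and a correct check digit."""
    match = _CAS.match(cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(i * int(d) for i, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def has_recorded_cas(cas: str) -> bool:
    """Return True if the field holds a plausible CAS RN rather than the dummy value."""
    return cas != DUMMY_CAS and cas_checksum_ok(cas)
```

A passing check digit only shows that the number is well formed; it does not prove that the number belongs to the structure in the record.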
In the 2D files with biological data, you'll find the following additional
fields (not necessarily present in all files for all compounds):
- NLOGGI50 - Log GI50 data, comprising the following columns:
CONCUNIT, LCONC, PANEL, CELL, PANELNBR, CELLNBR, NLOGGI50, INDN, TOTN
- NLOGTGI - Log TGI data, comprising the following columns:
CONCUNIT, LCONC, PANEL, CELL, PANELNBR, CELLNBR, NLOGTGI, INDN, TOTN
- NLOGLC50 - Log LC50 data, comprising the following columns:
CONCUNIT, LCONC, PANEL, CELL, PANELNBR, CELLNBR, NLOGLC50, INDN, TOTN
- NCI_AIDS_Antiviral_Screen_Conclusion - AIDS Screening result (CI = Confirmed
Inactive, CM = Confirmed Moderate[ly active], CA = Confirmed Active)
- NCI_AIDS_Antiviral_Screen_EC50 - AIDS EC50 result with five columns:
HiConc, ConcUnit, Flag, EC50, NumExp.
Note that for some compounds, the EC50 has been measured more than once.
- NCI_AIDS_Antiviral_Screen_IC50 - AIDS IC50 result with five columns:
HiConc, ConcUnit, Flag, IC50, NumExp.
Note that for some compounds, the IC50 has been measured more than once.
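The column lists above can be turned into named records with a few lines of Python. The sketch below is only an illustration: the column names are taken from the lists above, but the delimiter between columns is an assumption (a comma by default, passed as the sep parameter), as is the layout of one measurement per line. Please check an actual record before relying on it. The function name parse_rows is ours.

```python
from typing import Dict, List, Sequence

GI50_COLUMNS = ("CONCUNIT", "LCONC", "PANEL", "CELL", "PANELNBR",
                "CELLNBR", "NLOGGI50", "INDN", "TOTN")
EC50_COLUMNS = ("HiConc", "ConcUnit", "Flag", "EC50", "NumExp")


def parse_rows(value: str, columns: Sequence[str], sep: str = ",") -> List[Dict[str, str]]:
    """Split a multi-line field value into one dict per line, keyed by column name.

    Assumes one measurement per line and `sep` between columns (both unverified).
    """
    rows = []
    for line in value.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cells = [cell.strip() for cell in line.split(sep)]
        if len(cells) != len(columns):
            raise ValueError(f"expected {len(columns)} columns, got {len(cells)}: {line!r}")
        rows.append(dict(zip(columns, cells)))
    return rows
```

Raising an error on a column-count mismatch, rather than silently padding or truncating, makes a wrong delimiter guess visible immediately. Since a value can be measured more than once per compound, the result is a list rather than a single record.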
For more explanation on these data, in particular the meaning of the column
headings, please see the Web pages of the DTP
Human Tumor Cell Line Screen and/or the DTP
AIDS Antiviral Screen.
Please also note that no editing of the biological test data has been
performed. This means that all DTP results for which the chemical
structure is available have been included. This includes data from
"non-production" cell lines, i.e. cell lines that were used only a short
time during test phases, as well as data from those ten cell lines that
were replaced by a new block of ten around 1992. It is up to the user to
do their own evaluation, statistics, and, if necessary, (pre-)processing,
of these data before using them for any purpose.
In the 3D file, hydrogens were added by CORINA, whereas they are not
present in the 0D and 2D files. In the 2D files, the stereochemistry
shown is in fact meaningless, since it was decided upon at random. This is not
In previous versions of this page, the 0D information was called "2D".
This has been changed to avoid confusion with the new 2D information added.
The file that was previously called NCI2D397.sdz is therefore mostly identical
to the new file NCI0DA99.sdz, except for the newly added compounds.
The sizes listed for the uncompressed files are in "real" MB, i.e. 1024
x 1024 bytes.
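To compare a downloaded and uncompressed file against the listed sizes, the byte count has to be divided by 1024 x 1024 rather than by 1,000,000. A small Python sketch (the function names are our own):

```python
import os

BYTES_PER_MB = 1024 * 1024  # "real" MB as used for the sizes listed here


def size_in_mb(n_bytes: int) -> float:
    """Convert a byte count to MB of 1024 x 1024 bytes."""
    return n_bytes / BYTES_PER_MB


def file_size_in_mb(path: str) -> float:
    """Return the size of a file on disk in MB of 1024 x 1024 bytes."""
    return size_in_mb(os.path.getsize(path))
```

Note that a file of 1,000,000 bytes comes out at about 0.954 MB in these units.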
Our 249,081 structure set is a combination of three sets:
Our 2D files with AIDS data contain structure data for two more entries than the
file available at the DTP
AIDS Antiviral Screen.
The DTP Human Tumor Cell Line Screen biological data file contains cancer screen
data for 370 more entries for which we don't have the structure (these
structures are not available on the DTP site).
For 10 out of the 249,081 structures, the 3D generation process failed.