This file contains the structures downloaded from the PubChem FTP site that have at least one assay result associated with them that was obtained in the context of the NIH Common Fund (previously: NIH Roadmap) Molecular Libraries Probe Production Centers Network (previously: Molecular Libraries Screening Center Network), part of the Common Fund’s Molecular Libraries and Imaging program. It is organized by unique chemical structures (“Compounds” in PubChem parlance), i.e. assay results for possibly multiple different samples (“Substances” in PubChem parlance) have been combined into the one record representing the unique chemical structure. Placeholder assays (assays containing a single record only) have been filtered out.

Explanation of the property data fields in the SD file (note: properties present in the original PubChem files have been copied unchanged; for an explanation of those properties, see the appropriate PubChem document):
  • PUBCHEM_ASSAYID_nnn_NAME - Name of the assay with PubChem assay ID (AID) nnn. For example, the assay named "qHTS Assay for Inhibitors and Substrates of Cytochrome P450 3A4" has AID 884, thus the property name for this assay would be PUBCHEM_ASSAYID_884_NAME.
  • PUBCHEM_ASSAYID_nnn_SUBMITTER - Organization that submitted the assay data to PubChem.
  • PUBCHEM_ASSAYID_nnn_RESULT - Assay result (active/inactive/inconclusive/unspecified) for this compound.
  • PUBCHEM_ASSAYID_nnn_SID_RESULT - List of the assay results of the individual Substances corresponding to this CID.
  • PUBCHEM_ASSAYID_nnn_CURVE_CLASS - Indication of the quality of the titration curve obtained in qHTS assays. Used in particular for results from the NCGC Screening Center. See here for an explanation of the curve class concept.
  • PUBCHEM_ASSAYID_nnn_LOGAC50 - Log of the concentration at which 50% of the activity was observed. The activity can be either inhibition or activation. Note that this property does not exist for some structures/assay results.
  • PUBCHEM_ASSAY_ACTIVITIES - Boolean indication of the assay result for all assays performed with this compound. A “!” in front of the assay name indicates inactivity in this assay; an assay name listed without “!” indicates activity, as per the result call made by the original submitter and deposited in PubChem. A “?” means an inconclusive or unspecified result.
  • PUBCHEM_SID_ASSOCIATIONS - List of PubChem Substance IDs (SID), i.e. in the widest sense, samples, that have the unique chemical structure of this Compound entry.
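
The property fields described above can be read without a cheminformatics toolkit. The following is a minimal sketch, not the code used to build the file. It assumes standard SD file layout (data items introduced by a "> <NAME>" header line and ended by a blank line, records ended by "$$$$"). It also assumes that PUBCHEM_ASSAY_ACTIVITIES lists one assay per line and that "?" is a prefix like "!"; neither detail is stated above, so check them against the actual file. The function names are hypothetical.

```python
import re


def iter_sdf_properties(lines):
    """Yield one {property name: value} dict per SD file record.

    `lines` is any iterable of text lines, e.g. an open file handle.
    The structure (molfile) part of each record is skipped.
    """
    props, name, value, in_data = {}, None, [], False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$$"):            # end of the current record
            if name is not None:
                props[name] = "\n".join(value)
            yield props
            props, name, value, in_data = {}, None, [], False
        elif line.startswith("M  END"):        # molfile block is finished
            in_data = True
        elif in_data and line.startswith(">"):  # data item header
            if name is not None:
                props[name] = "\n".join(value)
            match = re.search(r"<([^>]+)>", line)
            name, value = (match.group(1) if match else None), []
        elif name is not None:
            if line.strip() == "":             # blank line ends the item
                props[name] = "\n".join(value)
                name = None
            else:
                value.append(line)


def parse_activities(value):
    """Split a PUBCHEM_ASSAY_ACTIVITIES value into (assay name, call) pairs.

    Assumption: one assay per line, with "!" or "?" as a prefix.
    """
    calls = []
    for entry in value.splitlines():
        entry = entry.strip()
        if not entry:
            continue
        if entry.startswith("!"):
            calls.append((entry[1:].strip(), "inactive"))
        elif entry.startswith("?"):
            calls.append((entry[1:].strip(), "inconclusive or unspecified"))
        else:
            calls.append((entry, "active"))
    return calls
```

Because the download is gzipped, the file can be streamed record by record with gzip.open(path, "rt") from the Python standard library and passed directly to iter_sdf_properties, which avoids unpacking the whole file to disk.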

461,937 structures in SDF format. WARNING: This is a 2.1 GB gzipped file. Use the "Save Link As..." (Firefox) or "Save Target As..." (IE) option of your web browser to download the file.
1475 structures with at least one contradictory assay result.

This file was constructed in the following way:

  • Download SDF files for "Substances" from the PubChem FTP site
  • Combine them all into one big SD file
  • Split that file according to the PUBCHEM_EXT_DATASOURCE_NAME property
  • Combine the following files into one:
    • Emory_University_Molecular_Libraries_Screening_Center
    • NCGC
    • NMMLSC
    • PCMD
    • MLSMR
    • SRMLSC
    • The_Scripps_Research_Institute_Molecular_Screening_Center
    • Columbia_University_Molecular_Screening_Center
    • NIH_Clinical_Collection
    • University_of_Pittsburgh_Molecular_Library_Screening_Center
    • Vanderbilt_Screening_Center_for_GPCRs__Ion_Channels_and_Transporters
    • Burnham_Center_for_Chemical_Genomics
    • Johns_Hopkins_Ion_Channel_Center
    • Southern_Research_Institute
  • Download all assays (*.descr.xml and *.data.xml)
  • Keep only the assays whose submitter is in the following list and remove the rest. The submitter is extracted from the .descr.xml file, from the key: PC-AssaySubmit -> PC-AssaySubmit_assay -> PC-AssaySubmit_assay_descr -> PC-AssayDescription -> PC-AssayDescription_aid-source -> PC-Source -> PC-Source_db -> PC-DBTracking -> PC-DBTracking_name
    • Columbia University Molecular Screening Center
    • Emory University Molecular Libraries Screening Center
    • NCGC
    • NMMLSC
    • PCMD
    • SRMLSC
    • The Scripps Research Institute Molecular Screening Center
    • University of Pittsburgh Molecular Library Screening Center
    • Vanderbilt University Molecular Libraries Screening Center (VUMLSC)
    • Burnham Center for Chemical Genomics
    • Johns Hopkins Ion Channel Center
    • Southern Research Specialized Biocontainment Screening Center
  • Extract the following properties from the .descr.xml and .data.xml files and put them into the SD file:
    • PUBCHEM_ASSAYID_".$aid."_LOGAC50
  • Construct property PUBCHEM_ASSAY_ACTIVITIES as a list of active/inactive assays
  • Remove structures without PUBCHEM_ASSAY_ACTIVITIES property
  • Remove LOGAC50 property for "Inactive", "Inconclusive" and "CURVE_CLASS==4" cases
  • Create property PUBCHEM_ASSAYID_".$aid."_SID_RESULTS which contains results for a particular SID
  • Fold all substances with the same compound ID (CID) into one record. The properties are folded according to the following rules:
    • "Inconclusive" results are deleted, unless they are the only results present.
    • LOGAC50 is averaged, unless both "Increasing" and "Decreasing" results are present.
    • PUBCHEM_ASSAYID_".$aid."_RESULT is set to "Discrepant" if more than one result type is present, not counting "Inconclusive".
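
The folding rules in the last step can be illustrated with a short sketch. This is not the original processing script, only one reading of the rules for a single assay and a single CID. The input layout (a list of dicts with "result", "logac50" and "direction" keys) and the function name are hypothetical, and leaving LOGAC50 unset when both "Increasing" and "Decreasing" results are present is an assumption, since the description does not say what is stored in that case.

```python
from statistics import mean


def fold_assay_results(sid_results):
    """Fold per-Substance results for one assay into one per-Compound result.

    Each item is a dict such as
        {"result": "Active", "logac50": -5.2, "direction": "Decreasing"}
    where "logac50" and "direction" may be None or missing.
    """
    if not sid_results:
        raise ValueError("at least one Substance result is required")

    # Rule 1: drop "Inconclusive" results unless nothing else is present.
    conclusive = [r for r in sid_results if r["result"] != "Inconclusive"]
    kept = conclusive or list(sid_results)

    # Rule 3: more than one remaining result type means "Discrepant".
    types = {r["result"] for r in kept}
    result = "Discrepant" if len(types) > 1 else next(iter(types))

    # Rule 2: average LOGAC50 unless both curve directions occur.
    directions = {r.get("direction") for r in kept}
    values = [r["logac50"] for r in kept if r.get("logac50") is not None]
    if values and not {"Increasing", "Decreasing"} <= directions:
        logac50 = mean(values)
    else:
        logac50 = None  # assumption: no value is stored in this case

    return {"result": result, "logac50": logac50}
```

Applied to two "Active" Substances with LOGAC50 values of -5.0 and -6.0 plus one "Inconclusive" Substance, this returns an "Active" call with a LOGAC50 of -5.5. An "Active" and an "Inactive" Substance for the same CID give "Discrepant".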

473,965 structures in SDF format. WARNING: This is a 1.6 GB gzipped file. Use the "Save Link As..." (Firefox) or "Save Target As..." (IE) option of your web browser to download the file.

466,537 structures in SDF format. WARNING: This is a 642 MB gzipped file. Use the "Save Link As..." (Firefox) or "Save Target As..." (IE) option of your web browser to download the file.

Igor Filippov

Last Update: 2015-08-05