This document describes some of the operations used to generate the downloadable bulk files from the NCI Open Database structures and biological test data (cancer and AIDS; see here for more information). It aims to show how to combine the tools of the SDF Toolkit and to provide tricks and recipes through real examples. All these examples should be run on a Unix system.

Input files

All input files are based on the publicly and freely available data from NCI's Developmental Therapeutics Program (DTP). We collected the structures and biological data from DTP (cancer data as of August 1999, AIDS data as of October 1999), combined them where applicable, and generated MDL SD files from this information.

SDF Toolkit

The SDF_Toolkit can be downloaded here. You'll need version 1.06 or later.

Merge and remove duplicates from two SD files

  • Objective: Merge two SD files and remove duplicates. Duplicates are recognized by a shared identifier (a non-chemical data entry in the SD files). The identifier (here the NSC number) must be present in both input files.
  • Input files:
    • nciopen_LMCH_aug99_0D.sdf - August 1999 SD file without 3D/2D and stereo information
    • aids_o99_chemical_structs.sdf - chemical structures from the DTP site for which AIDS data is available
  • Command:

    append_sdf -prop NSC nciopen_LMCH_aug99_0D.sdf aids_o99_chemical_structs.sdf > new.sdf
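
A quick way to sanity-check a merged file is to count its records with standard Unix tools, since every record in an SD file ends with a line that starts with $$$$. The following sketch is not part of the SDF Toolkit; the file toy.sdf and its two records are made up for illustration only.

```shell
# Build a toy SD file with two records (contents are made up for illustration).
workdir=$(mktemp -d)
cat > "$workdir/toy.sdf" <<'EOF'
toy molecule 1
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <NSC>
1

$$$$
toy molecule 2
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <NSC>
2

$$$$
EOF

# Each SD record is terminated by a line starting with "$$$$",
# so counting those lines gives the number of records.
record_count=$(grep -c '^\$\$\$\$' "$workdir/toy.sdf")
echo "records: $record_count"
```

Run against new.sdf, the same grep command should report the total number of entries after the merge.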

Merge and remove duplicates from two SD files and make a list of the new entries

  • Objective: Merge two SD files and remove duplicates. Duplicates are recognized by a shared identifier (a non-chemical data entry in the SD files). The identifier must be present in both input files. Make a list of the new entries.
  • Input files:
    • nciopen_LMCH_aug99_0D.sdf - August 1999 SD file without 3D/2D and stereo information
    • aids_o99_chemical_structs.sdf - chemical structures from the DTP site for which AIDS data is available
  • Commands:

    append_sdf -prop NSC nciopen_LMCH_aug99_0D.sdf aids_o99_chemical_structs.sdf | extract_prop_sdf -prop NSC > temp.list

    > 42687 entries read and 2212 entries added from aids_o99_chemical_structs.sdf

    tail -n 2212 temp.list > 2212_oct99.list
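
The count 2212 is copied by hand from the message printed by append_sdf. A possible alternative, sketched below, is to skip as many identifiers as there are records in the first input file. This assumes that the merged output lists every entry of the first file first, followed by the new entries, and that the list holds one identifier per line; the toy identifiers and the value of first_count are hypothetical.

```shell
workdir=$(mktemp -d)

# Toy stand-in for temp.list: identifiers of the merged file, one per line.
printf '%s\n' 1 2 3 101 102 > "$workdir/temp.list"

# Hypothetical number of records in the first input file.
# For a real SD file it could be obtained with: grep -c '^\$\$\$\$' first.sdf
first_count=3

# Skip the identifiers that came from the first file and keep the rest.
tail -n +"$((first_count + 1))" "$workdir/temp.list" > "$workdir/new.list"

new_count=$(wc -l < "$workdir/new.list" | tr -d ' ')
echo "new entries: $new_count"
```

If the result disagrees with the number reported by append_sdf, the assumptions above do not hold and the reported number should be trusted instead.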

Select entries from an SD file using a file containing a list of identifiers

  • Objective: Select a subset of an SD file using the NSC number as the identifier.
  • Input files:
    • aids_o99_chemical_structs.sdf - chemical structures from the DTP site for which AIDS data is available
    • 2212_oct99.list file created above
  • Command:

    select_sdf -labelfile 2212_oct99.list -property_name NSC < aids_o99_chemical_structs.sdf > 2212_oct99_3D.sdf

Remove hydrogens, charges, stereo information and 3D coordinates

  • Objective: See title.
  • Input files:
    • 2212_oct99_3D.sdf - chemical structures file created above
    • 689_aug99.list - a list of new entries for the August 1999 release
    • cancer_screened_a99_chemical_structs.sdf - file downloaded from the DTP WWW site. This file contains structures for which cancer cell data is available.
  • Commands:

    remove_h_sdf < 2212_oct99_3D.sdf | remove_charge_sdf | tee 2212_oct99_3D_no_H.sdf | zero_sdf > 2212_oct99_0D.sdf
    cactus_2d_nci 2212_oct99_0D.sdf | remove_stereo_sdf > 2212_oct99_2D.sdf

    Repeat the same steps for the 689 new entries of the August 1999 release:

    select_sdf -labelfile 689_aug99.list -property_name NSC < cancer_screened_a99_chemical_structs.sdf > 689_aug99_3D.sdf
    remove_h_sdf < 689_aug99_3D.sdf | remove_charge_sdf | tee 689_aug99_3D_no_H.sdf | zero_sdf > 689_aug99_0D.sdf
    cactus_2d_nci 689_aug99_0D.sdf | remove_stereo_sdf > 689_aug99_2D.sdf

  • Notes: tee is a standard Unix command that reads from standard input, writes to standard output, and also saves a copy to a file. cactus_2d_nci is a Tcl script (not part of the SDF_Toolkit) that calculates 2D coordinates using the CACTVS system.
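
The role of tee in the pipelines above can be tried out with ordinary text files. In this small sketch, sort merely stands in for a later processing step such as zero_sdf, and the file names are arbitrary.

```shell
workdir=$(mktemp -d)

# tee saves the intermediate stream to a file and passes it on unchanged
# to the next command in the pipeline.
printf 'c\na\nb\n' | tee "$workdir/intermediate.txt" | sort > "$workdir/final.txt"

cat "$workdir/intermediate.txt"   # copy of the stream as it passed through tee
cat "$workdir/final.txt"          # result of the last step
```

In the same way, 2212_oct99_3D_no_H.sdf keeps the hydrogen-free and charge-free structures before zero_sdf processes them further.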

Remove entries with a special filter

  • Objective: Remove entries from the NCI files that have an NSC number greater than or equal to 900,000 (these are combinatorial library entries).
  • Input files:
    • open_397.mol - NCI data file released in March 1997
    • 689_aug99_0D.sdf - supplemental structures from the August 99 release
    • 2212_oct99_0D.sdf - supplemental structures from the October 99 release
    • remove_900000.pm - a Perl module for the tool select_sdf
    • Source code of remove_900000.pm:
      ##################################
      # Filter callback: returns true for records that should be kept.
      sub is_sdf_record_kept
      {
              my $sdf_entry = shift;
              my $record_number = shift;      # not used by this filter
              defined $sdf_entry || die "Assertion failed";
              my $value = $sdf_entry->data_for_field_name("NSC");
              defined $value || die "Assertion failed: undefined property";
      #       print STDERR $value, "\n";      # uncomment for debugging
              return $value < 900000;         # keep NSC numbers below 900000
      }
      1;
      ##################################

  • Command:

    cat open_397.mol 689_aug99_0D.sdf 2212_oct99_0D.sdf | select_sdf -perlfile remove_900000.pm > temp.sdf

  • Notes: The special filter is loaded and compiled at run time.
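
To check the result of such a filter independently of the toolkit, the NSC values can be pulled out of an SD file with awk. The sketch below assumes that each value sits on the line directly after its NSC data header; the helper toy_record and the two NSC numbers are made up for illustration.

```shell
workdir=$(mktemp -d)

# Helper that prints a minimal, made-up SD record for a given NSC number.
toy_record() {
    printf 'toy %s\n  sketch\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n>  <NSC>\n%s\n\n$$$$\n' "$1" "$1"
}

{ toy_record 123; toy_record 900001; } > "$workdir/toy.sdf"

# Count the records whose NSC value is 900000 or greater.
too_large=$(awk '/^>.*<NSC>/ { getline v; if (v + 0 >= 900000) n++ }
                 END { print n + 0 }' "$workdir/toy.sdf")
echo "records with NSC >= 900000: $too_large"
```

For a correctly filtered file such as temp.sdf, the reported count is expected to be 0.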

Sort an SD file using a numerical property

  • Objective: Sort NCI files by NSC number.
  • Input files:
    • open_397.sdf - NCI data file released in March 1997 (includes 3D)
    • 689_aug99_3D.sdf - supplemental structures from the August 99 release
    • 2212_oct99_3D.sdf - supplemental structures from the October 99 release
  • Command:

    cat open_397.sdf 689_aug99_3D.sdf 2212_oct99_3D.sdf | sort_sdf -prop NSC > nciopen_LMCH_oct99_3D.sdf

  • Notes: sort_sdf can require a lot of memory because the whole input file is held in memory. For example, sorting the entire NCI database (about 250,000 entries with biological data added, an SD file of about 800 MB) by NSC number required 1.5 GB of memory and about 20 minutes of computer time. This was done on galaxy.nih.gov, an SGI computer with 32 250 MHz R10000 processors and 8 GB of RAM; only one CPU was used.
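
Whether a file is really in numerical NSC order can be verified without loading it into memory, by streaming the NSC values through sort -n -c, which only checks the order. This is a sketch with standard Unix tools, not a toolkit feature; it assumes that each NSC value follows its data header on the next line, and the helper toy_record and its NSC numbers are hypothetical.

```shell
workdir=$(mktemp -d)

# Helper that prints a minimal, made-up SD record for a given NSC number.
toy_record() {
    printf 'toy %s\n  sketch\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n>  <NSC>\n%s\n\n$$$$\n' "$1" "$1"
}

# 5, 12, 100 is in numerical order, although not in alphabetical order.
{ toy_record 5; toy_record 12; toy_record 100; } > "$workdir/toy.sdf"

# sort -c only checks the order; its exit status is zero if the input is sorted.
if awk '/^>.*<NSC>/ { getline v; print v }' "$workdir/toy.sdf" | sort -n -c; then
    order_status=sorted
else
    order_status=unsorted
fi
echo "$order_status"
```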

Prepare biological data file

  • Objective: The NCI cancer screen data are distributed as comma-separated value files which, unfortunately, cannot be used directly by the add_prop_sdf tool, because the data for one molecule (NSC number) are split over several lines. The solution is to combine all the data belonging to one entry on a single line. The Perl script nciscreen2csv was written for that purpose.
  • Input files:
    • nciopen_LMCH_oct99_2D.sdf - NCI data file (includes 2D information)
    • cancer_screened_gi50_a99 - cancer screen data from the DTP WWW site (August 1999 release)
    • cancer_screened_lc50_a99
    • cancer_screened_tgi_a99
  • Commands:

    nciscreen2csv < cancer_screened_gi50_a99 > cancer_screened_gi50_a99.csv
    nciscreen2csv < cancer_screened_lc50_a99 > cancer_screened_lc50_a99.csv
    nciscreen2csv < cancer_screened_tgi_a99 > cancer_screened_tgi_a99.csv

  • Notes: See the file nciscreen2csv in the toolkit.

Add biological data to an SD file

  • Objective: Add AIDS and cancer cell data to an SD file in one operation.
  • Input files:
    • nciopen_LMCH_oct99_2D.sdf - NCI data file (includes 2D information)
    • cancer_screened_gi50_a99.csv - comma-separated value file with a special format which matches the NCI_screen format. Each line contains all the data for one NSC entry.
    • cancer_screened_lc50_a99.csv
    • cancer_screened_tgi_a99.csv
    • aids_ec50_oct99.csv
    • aids_ic50_oct99.csv
    • aids_conc_oct99.csv
  • Command:

    add_prop_sdf < nciopen_LMCH_oct99_2D.sdf -match NSC -table cancer_screened_gi50_a99.csv -noskip -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table cancer_screened_lc50_a99.csv -noskip -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table cancer_screened_tgi_a99.csv -noskip -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table aids_ec50_oct99.csv -noskip -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table aids_ic50_oct99.csv -noskip -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table aids_conc_oct99.csv -noskip -silent > nciopen_LMCH_oct99_2D_AIDS_cancer.sdf

  • Notes: The command shown above is one single line.

    -perlclass is a special option of the tool add_prop_sdf. Its argument, NCI_screen, is the name of a customized Perl class derived from the class that processes standard CSV (comma-separated value) table files. Its purpose is to reformat the biological data. See the file NCI_screen.pm in the toolkit (this will probably interest only Perl 5 programmers).

    -noskip instructs add_prop_sdf to keep all entries, even those for which no biological data is available.

Add biological data to an SD file and filter out entries for which biological data is not available

  • Objective: Add AIDS and cancer cell data to an SD file in one operation. Same as before, but keep only the structures for which all biological data (AIDS and cancer cells) are available.
  • Input files:
    • nciopen_LMCH_oct99_2D.sdf - NCI data file (includes 2D information)
    • cancer_screened_gi50_a99.csv - comma-separated value file with a special format which matches the NCI_screen format. Each line contains all the data for one NSC entry.
    • cancer_screened_lc50_a99.csv
    • cancer_screened_tgi_a99.csv
    • aids_ec50_oct99.csv
    • aids_ic50_oct99.csv
    • aids_conc_oct99.csv
  • Command:

    add_prop_sdf < nciopen_LMCH_oct99_2D.sdf -match NSC -table cancer_screened_gi50_a99.csv -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table cancer_screened_lc50_a99.csv -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table cancer_screened_tgi_a99.csv -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table aids_ec50_oct99.csv -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table aids_ic50_oct99.csv -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table aids_conc_oct99.csv -silent > nciopen_LMCH_oct99_2D_all_have_AIDS_cancer.sdf

  • Notes: The command shown above is one single line.

    -perlclass is a special option of the tool add_prop_sdf. Its argument, NCI_screen, is the name of a customized Perl class derived from the class that processes standard CSV (comma-separated value) table files. Its purpose is to reformat the biological data. See the file NCI_screen.pm in the toolkit (this will probably interest only Perl 5 programmers).

    The -noskip option is not used, so entries without biological data are dropped at each step.
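
The difference between running add_prop_sdf with and without -noskip resembles the difference between an outer and an inner join. The analogy can be tried out with the standard Unix join command on two small text files; the identifiers and values below are invented, and join is used only as an illustration, not as a replacement for add_prop_sdf.

```shell
workdir=$(mktemp -d)

# Invented toy data: three entries, but values for only two of them.
printf '%s\n' 1 2 3 > "$workdir/ids.txt"
printf '%s\n' 1,5.2 3,6.1 > "$workdir/table.csv"

# Like omitting -noskip: only entries that have data are kept.
join -t , "$workdir/ids.txt" "$workdir/table.csv" > "$workdir/with_data_only.txt"

# Like -noskip: every entry is kept, with or without data.
join -t , -a 1 "$workdir/ids.txt" "$workdir/table.csv" > "$workdir/all_entries.txt"

cat "$workdir/with_data_only.txt"
cat "$workdir/all_entries.txt"
```

Entry 2 has no value in the table, so it appears only in all_entries.txt.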

Bruno Bienfait 1-11-2000

Last Update: 2017-10-16