SUBSET

A fast program for computing a representative subset of a large dataset.



 

Disclaimer

This program is provided free of charge to anyone on an "as is" basis, and without warranty of any kind, including but not limited to any implied warranty of merchantability or fitness for a particular purpose.  In no event shall the author or the National Institutes of Health be liable for any direct, indirect, incidental, special, or consequential damages arising from use or distribution of this software.
 
 

Introduction

SUBSET is a clustering program for selecting a set of input vectors that is evenly scattered over the entire input space.  So far it has been applied to chemo-informatics.  Although SUBSET was developed to cluster chemical databases, it contains no algorithm for handling molecular structures.  Its input consists of the usual bitstrings, e.g., "1100011001...", which are widely used to represent the presence or absence of molecular fragments and/or structural features.  The only similarity metric available so far is the Tanimoto coefficient.
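
For two bitstrings, the Tanimoto coefficient is commonly defined as c / (a + b - c), where a and b are the numbers of bits set in each string and c is the number of bits set in both.  The following is a minimal sketch of that definition working directly on '0'/'1' character strings.  The function name tanimoto_str is hypothetical and is not taken from the SUBSET sources, and the handling of two all-zero strings is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from the SUBSET sources.
 * Computes the Tanimoto coefficient c / (a + b - c) of two strings
 * made of '0' and '1' characters.  The strings are expected to have
 * the same length; the loop stops at the end of the shorter one. */
double tanimoto_str(const char *x, const char *y)
{
    size_t a = 0;   /* bits set in x */
    size_t b = 0;   /* bits set in y */
    size_t c = 0;   /* bits set in both x and y */
    size_t i;

    assert(x != NULL && y != NULL);

    for (i = 0; x[i] != '\0' && y[i] != '\0'; i++) {
        if (x[i] == '1')
            a++;
        if (y[i] == '1')
            b++;
        if (x[i] == '1' && y[i] == '1')
            c++;
    }

    /* Assumption: two all-zero vectors are treated as identical. */
    if (a + b - c == 0)
        return 1.0;

    return (double)c / (double)(a + b - c);
}
```

For example, tanimoto_str("1100", "1010") has a = 2, b = 2 and c = 1, and therefore returns 1/3.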

SUBSET works as a batch program and should be very simple to use.  There is only one parameter to choose (the Tanimoto similarity threshold).  The input data are read in a simple text format with one entry per line.  The program output is a selection of the input lines.

SUBSET has been designed for performance.  The algorithm that computes the Tanimoto coefficient is heavily optimized: the coefficient between two 431-bit vectors can be calculated about 1,900,000 times per second on a 500 MHz Intel Pentium II computer running Linux.  The clustering algorithm is based on Stochastic Cluster Analysis (SCA), described by Reynolds et al. [Lead Discovery Using Stochastic Cluster Analysis (SCA): A New Method for Clustering Structurally Similar Compounds, Journal of Chemical Information and Computer Sciences; 1998; 38(2); 305-312].  Unlike the published algorithm, SUBSET does not randomize the input order of the database vectors.  With SUBSET, calculating a subset of 34,471 molecules from the NCI database (~250K entries), using 431-bit vectors and a typical Tanimoto coefficient of 80%, takes about 30 minutes and requires 4 MB of memory.
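
One plausible reading of this kind of selection is sketched below.  Entries are visited in input order, and an entry is kept only if its Tanimoto coefficient with every entry kept so far is below the threshold.  This is a simplified illustration and not the SUBSET implementation.  The names popcount_word, tanimoto_packed and select_subset are hypothetical, the bit vectors are assumed to be packed into unsigned long words, and using ">=" as the rejection test is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Count the bits set in one word by clearing the lowest set bit
 * until the word is zero. */
static unsigned popcount_word(unsigned long w)
{
    unsigned n = 0;

    while (w != 0) {
        w &= w - 1;
        n++;
    }
    return n;
}

/* Tanimoto coefficient of two bit vectors packed into nwords words:
 * (bits set in both) / (bits set in either). */
static double tanimoto_packed(const unsigned long *x,
                              const unsigned long *y, size_t nwords)
{
    unsigned long both = 0, either = 0;
    size_t i;

    for (i = 0; i < nwords; i++) {
        both += popcount_word(x[i] & y[i]);
        either += popcount_word(x[i] | y[i]);
    }
    /* Assumption: two all-zero vectors are treated as identical. */
    return either == 0 ? 1.0 : (double)both / (double)either;
}

/* Hypothetical selection loop, not the SUBSET source code.
 * vectors holds n bit vectors of nwords words each, stored one after
 * another.  The indices of the kept entries are written to selected,
 * which must have room for n values.  Returns the number kept. */
size_t select_subset(const unsigned long *vectors, size_t n,
                     size_t nwords, double sim, size_t *selected)
{
    size_t count = 0, i, j;

    assert(vectors != NULL && selected != NULL);

    for (i = 0; i < n; i++) {
        const unsigned long *cand = vectors + i * nwords;
        int keep = 1;

        for (j = 0; j < count; j++) {
            const unsigned long *rep = vectors + selected[j] * nwords;

            /* Assumption: reject when similarity reaches the threshold. */
            if (tanimoto_packed(cand, rep, nwords) >= sim) {
                keep = 0;
                break;
            }
        }
        if (keep)
            selected[count++] = i;
    }
    return count;
}
```

With the four one-word vectors 0xF, 0xF, 0x3 and 0xF0, a threshold of 0.5 keeps entries 0 and 3, while a threshold of 0.8 also keeps entry 2.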

SUBSET is written in ANSI C.  It should be easy to recompile the source code on any platform.  Unlike many other scientific programs, SUBSET places no arbitrary limits on the number of entries, the subset size, or the length of the bit strings.  All data structures grow dynamically when needed.  Great care has been taken to ensure that the C code behaves correctly at run time.
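
A common way to obtain data structures without fixed limits in C is an array that is enlarged with realloc when it fills up.  The sketch below shows that general technique.  The type index_list and its functions are hypothetical names chosen for illustration, and the actual data structures in SUBSET may look different.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable array of entry indices; not SUBSET's own type. */
typedef struct {
    size_t *items;     /* storage, NULL while empty */
    size_t  count;     /* number of values stored */
    size_t  capacity;  /* number of values the storage can hold */
} index_list;

/* Append value, doubling the capacity when the array is full.
 * Returns 0 on success, or -1 if memory could not be allocated,
 * in which case the list is left unchanged. */
int index_list_push(index_list *list, size_t value)
{
    assert(list != NULL);

    if (list->count == list->capacity) {
        size_t new_cap = list->capacity ? list->capacity * 2 : 16;
        size_t *p = realloc(list->items, new_cap * sizeof *p);

        if (p == NULL)
            return -1;
        list->items = p;
        list->capacity = new_cap;
    }
    list->items[list->count++] = value;
    return 0;
}

/* Release the storage and reset the list to the empty state. */
void index_list_free(index_list *list)
{
    assert(list != NULL);

    free(list->items);
    list->items = NULL;
    list->count = 0;
    list->capacity = 0;
}
```

A list starts out as index_list list = {NULL, 0, 0}; no size has to be chosen in advance, because realloc with a NULL pointer behaves like malloc.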

SUBSET has been applied to an evaluation of the diversity of chemical databases (see Johannes H. Voigt, Bruno Bienfait, Shaomeng Wang, and Marc C. Nicklaus, "Comparison of the NCI Open Database with Seven Large Chemical Structural Databases", J. Chem. Inf. Comput. Sci., 2001; 41(3): 702-712).

The latest version of SUBSET is available at  http://cactus.nci.nih.gov/SUBSET/.
 

Downloading

The latest release of the program (file: subset_1.0.tgz, ~64 kB) can be downloaded from the SUBSET web page.
 

Installation

SUBSET is delivered as source code only, in the form of a Unix tar archive compressed with gzip.  To extract the archive, use the following standard command:

        gunzip < 'ARCHIVE_NAME' | tar xfv -

where ARCHIVE_NAME is the name of the downloaded file.  (Note: MS Windows users can use WinZip 8 to open the archive.)

Using your Unix shell, change the working directory to the installation directory and type the command:

        make

This command starts the compilation and linking phases.  To check that the SUBSET program has been built correctly, the make command also runs a small test suite.
 

Input file format

Input files for SUBSET are simple text files.  Each line consists of a label followed by a bitstring, separated by one or more blank characters.  Example:

        Mol 0101010101010
        Mol2 0101010010010
 

More examples can be found in the Test directory.  These example files were generated with the help of the CACTVS toolkit (see http://www2.ccc.uni-erlangen.de/software/cactvs/).  The CACTVS subdirectory contains a Tcl script for generating SUBSET input files from SMILES, MDL, or any other chemistry file format supported by CACTVS.
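
To illustrate the format, the following sketch splits one input line into its label and its bitstring.  The function parse_entry is a hypothetical name and is not the parser used by SUBSET.  It assumes that the label itself contains no blank characters and that the bitstring consists only of the characters '0' and '1'.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical parser, not taken from the SUBSET sources.
 * Splits a line of the form "label <blanks> bitstring" in place:
 * the label and the bitstring are each terminated with '\0', and
 * *label and *bits are set to point at them.
 * Returns the number of bits, or -1 if the line is malformed. */
long parse_entry(char *line, char **label, char **bits)
{
    char *p = line;
    char *end;
    long nbits = 0;

    assert(line != NULL && label != NULL && bits != NULL);

    /* Skip leading white space; an empty line is malformed. */
    while (isspace((unsigned char)*p))
        p++;
    if (*p == '\0')
        return -1;

    /* The label runs up to the first white-space character. */
    *label = p;
    while (*p != '\0' && !isspace((unsigned char)*p))
        p++;
    if (*p == '\0')
        return -1;              /* label without a bitstring */
    *p++ = '\0';

    /* Skip the blanks that separate label and bitstring. */
    while (*p == ' ' || *p == '\t')
        p++;

    /* Count the bits. */
    *bits = p;
    while (*p == '0' || *p == '1') {
        p++;
        nbits++;
    }
    end = p;

    /* Only white space (such as the newline) may follow the bits. */
    while (isspace((unsigned char)*p))
        p++;
    if (*p != '\0' || nbits == 0)
        return -1;

    *end = '\0';
    return nbits;
}
```

Applied to the line "Mol2 0101010010010", the function returns 13 and yields the label "Mol2".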
 
 

Usage

Here is a Unix command line example:

        subset -sim 0.5 < Test/nci_1000.tab  > temp.tab

The argument to the -sim option is the similarity threshold (Tanimoto coefficient), which is a number in the range 0.0 to 1.0.  A small value yields a small subset, i.e. few selected entries.
 
 

Problems and limitations

The *reported* number of distance comparisons might be wrong when using large datasets because of an integer overflow.  This does not affect the results saved in the output file.
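
The sketch below gives a rough idea of when such an overflow can occur.  It assumes, without confirmation from the SUBSET sources, that the counter is a 32-bit unsigned integer and that each entry is compared with at most every entry of the final subset.  The names MAX_U32, max_comparisons and fits_in_u32 are hypothetical.

```c
#include <assert.h>

/* 2^32 - 1, the largest value of a 32-bit unsigned counter
 * (the 32-bit width is an assumption). */
#define MAX_U32 4294967295.0

/* Upper bound on the number of comparisons, assuming that every entry
 * is compared with at most every selected entry.  The product is
 * computed in double, which is exact for integers up to 2^53. */
double max_comparisons(unsigned long n_entries, unsigned long subset_size)
{
    return (double)n_entries * (double)subset_size;
}

/* Returns 1 if count fits in a 32-bit unsigned counter, 0 otherwise. */
int fits_in_u32(double count)
{
    assert(count >= 0.0);
    return count <= MAX_U32;
}
```

For the NCI run quoted in the introduction (about 250,000 entries and a subset of 34,471), this bound is 250,000 x 34,471 = 8,617,750,000, roughly twice the largest 32-bit value.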
 
 

Author

SUBSET was written by Bruno Bienfait while on a Visiting Fellowship at the National Cancer Institute, National Institutes of Health.

Present address:

Dr. Bruno Bienfait
ChemCodes Inc.
1300 Englert Dr. Suite G
Durham, NC 27713

E-mail: bruno@brunob.org
 



This page was last changed 6-Jul-2001.