SUBSET runs as a batch program and is meant to be very simple to use. There is only one parameter to choose (the Tanimoto coefficient). The input data are read in a simple text format with one entry per line. The program output is a selection of the input lines.
SUBSET has been designed for performance. The algorithm that computes the Tanimoto coefficient is heavily optimized: the coefficient between two 431-bit vectors can be computed about 1,900,000 times per second on a 500 MHz Intel Pentium II computer running Linux. The clustering algorithm is based on Stochastic Cluster Analysis (SCA), described by Reynolds et al. [Lead Discovery Using Stochastic Cluster Analysis (SCA): A New Method for Clustering Structurally Similar Compounds, Journal of Chemical Information and Computer Sciences; 1998; 38(2); 305-312]. Unlike the published algorithm, SUBSET does not randomize the input order of the database vectors. With SUBSET, selecting a subset of 34,471 molecules from the NCI database (~250K entries), using 431-bit vectors and a typical Tanimoto coefficient of 0.8 (80%), takes about 30 minutes and requires 4 MB of memory.
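For readers unfamiliar with the Tanimoto coefficient, the following is a minimal sketch of how it can be computed on packed bit vectors: the number of bits set in both vectors divided by the number of bits set in either. It is not SUBSET's optimized routine. The function names (popcount32, tanimoto), the 32-bit word packing, and the convention of returning 0.0 for two empty vectors are assumptions made for illustration, and the sketch uses the C99 header stdint.h rather than strict ANSI C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one 32-bit word (portable bit-twiddling popcount). */
static unsigned popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (unsigned)((x * 0x01010101u) >> 24);
}

/*
 * Hypothetical helper: Tanimoto coefficient of two fingerprints, each
 * packed into nwords 32-bit words. Unused high bits must be zero.
 * Returns 0.0 when neither vector has any bit set (an assumed convention).
 */
double tanimoto(const uint32_t *a, const uint32_t *b, size_t nwords)
{
    unsigned long both = 0;   /* bits set in a AND b */
    unsigned long either = 0; /* bits set in a OR b  */
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < nwords; i++) {
        both += popcount32(a[i] & b[i]);
        either += popcount32(a[i] | b[i]);
    }
    return either == 0 ? 0.0 : (double)both / (double)either;
}
```

A 431-bit vector would occupy 14 such words under this packing, so one comparison costs 14 AND, 14 OR, and 28 bit-count operations.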
SUBSET is written in ANSI C. It should be easy to recompile the source code on any platform. Unlike many other scientific programs, SUBSET imposes no arbitrary limits on the number of entries, the subset size, or the length of the bit strings. All data structures grow dynamically when needed. Great care has been taken to ensure C code correctness at run time.
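As an illustration of what "grow dynamically" can mean in C, here is a minimal sketch of an array that doubles its capacity on demand. It is not taken from the SUBSET source; the type EntryList, the function entry_list_push, and the initial capacity of 16 are hypothetical choices.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable list of entry pointers. Start with {NULL, 0, 0}. */
typedef struct {
    void **items;     /* heap-allocated array of entries    */
    size_t count;     /* number of entries currently stored */
    size_t capacity;  /* number of slots allocated          */
} EntryList;

/* Append one item, doubling the storage when it is full.
 * Returns 0 on success and -1 if memory could not be obtained. */
int entry_list_push(EntryList *list, void *item)
{
    assert(list != NULL);
    if (list->count == list->capacity) {
        size_t new_cap = list->capacity ? list->capacity * 2 : 16;
        void **grown;

        /* Refuse sizes whose byte count would overflow size_t. */
        if (new_cap > (size_t)-1 / sizeof *grown)
            return -1;
        grown = realloc(list->items, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1; /* the old array is still valid */
        list->items = grown;
        list->capacity = new_cap;
    }
    list->items[list->count++] = item;
    return 0;
}
```

Doubling keeps the amortized cost of each append constant, and the only limit on the number of entries is the memory available.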
SUBSET has been applied to an evaluation of the diversity of chemical databases (see Johannes H. Voigt, Bruno Bienfait, Shaomeng Wang, and Marc C. Nicklaus, "Comparison of the NCI Open Database with Seven Large Chemical Structural Databases", J. Chem. Inf. Comput. Sci., 2001; 41(3): 702-712).
The latest version of SUBSET is available at http://cactus.nci.nih.gov/SUBSET/.
To unpack the downloaded archive, type the command:
gunzip < 'ARCHIVE_NAME' | tar xfv -
where ARCHIVE_NAME is the name of the downloaded file. (Note: MS
Windows users can open the archive with WinZip 8.)
Using your Unix shell, change the working directory to the installation
directory and type the command:
make
This command starts the compilation and linking phases.
To check that the SUBSET program has been built correctly, the make command
also runs a small test suite.
Each line of the input file holds one entry: a name followed by its bit string, for example:
Mol 0101010101010
Mol2 0101010010010
More examples can be found in the Test directory. These example
files were generated with the help of the CACTVS toolkit (see http://www2.ccc.uni-erlangen.de/software/cactvs/).
The CACTVS subdirectory contains a Tcl script for generating SUBSET
input files from SMILES, MDL, or any other chemistry file format supported
by CACTVS.
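The example entries above (a name followed by a string of 0 and 1 characters) could be read along the following lines. This is only a sketch, not SUBSET's actual parser. The function name parse_entry, the choice of blanks or tabs as the separator, and the storage of one bit per byte are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split one input line into an entry name and its bits.
 * Assumes the name and the bit string are separated by blanks or tabs.
 * Each bit is stored as a 0 or 1 byte in bits[], and the number of bits read
 * is written to *nbits. Returns 0 on success and -1 on a malformed line or
 * when a buffer is too small.
 */
int parse_entry(const char *line, char *name, size_t name_cap,
                unsigned char *bits, size_t bits_cap, size_t *nbits)
{
    size_t n = 0, len;
    const char *p = line;

    assert(line != NULL && name != NULL && bits != NULL && nbits != NULL);

    /* The name is everything up to the first blank or tab. */
    len = strcspn(p, " \t");
    if (len == 0 || len >= name_cap)
        return -1;
    memcpy(name, p, len);
    name[len] = '\0';
    p += len;

    /* Skip the separator between the name and the bit string. */
    p += strspn(p, " \t");

    /* Read the bit string up to the end of the line. */
    for (; *p != '\0' && *p != '\n' && *p != '\r'; p++) {
        if ((*p != '0' && *p != '1') || n >= bits_cap)
            return -1;
        bits[n++] = (unsigned char)(*p - '0');
    }
    if (n == 0)
        return -1;
    *nbits = n;
    return 0;
}
```

With fixed-size buffers as shown, over-long lines are rejected; a parser without arbitrary limits would grow its buffers instead.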
To run SUBSET on one of the test files, type:
subset -sim 0.5 < Test/nci_1000.tab > temp.tab
The argument to the -sim option is the similarity factor (Tanimoto coefficient),
which is a number in the range 0.0 to 1.0. A small value yields
a small subset.
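To see why the similarity factor controls the size of the output, consider the following sketch of a greedy, sphere-exclusion style selection: entries are visited in input order, and an entry is kept only if its Tanimoto coefficient to every entry already kept is below the threshold. This is an assumed simplification for illustration. It is not SUBSET's source code and may differ in detail from the published SCA method. The names tanimoto_bytes and select_subset, the one-bit-per-byte storage, and the use of ">=" as the rejection test are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Tanimoto coefficient of two fingerprints stored as one 0/1 value per byte. */
static double tanimoto_bytes(const unsigned char *a, const unsigned char *b,
                             size_t nbits)
{
    size_t i, both = 0, either = 0;

    for (i = 0; i < nbits; i++) {
        both += (size_t)(a[i] & b[i]);
        either += (size_t)(a[i] | b[i]);
    }
    return either == 0 ? 0.0 : (double)both / (double)either;
}

/*
 * Hypothetical greedy selection. fps holds n fingerprints of nbits bytes
 * each, stored one after another. Indices of the kept entries are written
 * to selected[] (capacity n). Returns the number of entries kept.
 */
size_t select_subset(const unsigned char *fps, size_t n, size_t nbits,
                     double sim, size_t *selected)
{
    size_t i, j, count = 0;

    assert(fps != NULL && selected != NULL);
    for (i = 0; i < n; i++) {
        int keep = 1;

        /* Reject entry i if it is too similar to any entry already kept. */
        for (j = 0; j < count; j++) {
            if (tanimoto_bytes(fps + i * nbits, fps + selected[j] * nbits,
                               nbits) >= sim) {
                keep = 0;
                break;
            }
        }
        if (keep)
            selected[count++] = i;
    }
    return count;
}
```

Under these assumptions, lowering the threshold makes more entries count as "too similar" to one already kept, so fewer entries survive.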
Present address:
Dr. Bruno Bienfait
ChemCodes Inc.
1300 Englert Dr. Suite G
Durham, NC 27713
E-mail: bruno@brunob.org