Please note that parts of this text are outdated and describe functionality that has been removed from the service (e.g. because it represented a security risk or required external services that are no longer available).

NCI Database Browser Help File and Information


This is a summary description of the functions available in the Enhanced CACTVS-based Browser of NCI Open Database structures at this Web site. If you do not find the answer to your question here, you should also consult the FAQ. The FAQ contains answers to questions from users and it will be updated periodically. It is possible that somebody has already encountered the same situation you are facing.

Examples

These application examples are intended to show how this service can be used to help with drug development projects, to provide useful datasets for further user-side processing, and to answer scientific questions of various natures.  The results mentioned below are those obtained at the time of this writing (September 2001).  As contents and functionalities of the Enhanced NCI Database Browser evolve, the answers to specific queries may also slightly change. (It should also be noted that these are examples given to illustrate the capabilities of the system, and would not necessarily be optimized searches for a real drug development project.)

Example 1: Similarity Searching on an Active Drug Molecule
As an example, we choose Taxol as the drug molecule for which we want to find similar compounds in the NCI database that may have potential drug activity, and want to learn more about them if literature is available.

  1. On the Query Form pane, choose Query Type "Name Search..."  Select "Name fragment" as the suboption [this is usually a safer suboption than "Exact name" since it will also find names with additional characters in them, such as, in this case, "Taxol.RTM. (Registered Trademark)"].
  2. Type "taxol" as the Query Data Value (search strings are not case-sensitive).
  3. Click on either one of the Start Search buttons.
  4. For the current dataset, 20 hits should come up on the Hitlist pane.
  5. Select the first one of the hits (NSC# 125973) by clicking on the NSC number, which is a live hyperlink.
  6. This will bring up Taxol in the Detail pane.
  7. Underneath the structure drawing of the molecule, click on the "Transfer to Java Editor" button.
  8. In the Editor pane that should have come up, with the molecule shown in the JME Editor, click on the "Transfer to Query Form" button.
  9. The Query Form pane should come up, with the SMILES string for Taxol in the topmost Query Data value field.
  10. Change the Query Type to "Similarity Search..." and select, e.g., "Tanimoto 95%" in the suboption popup menu as the specific similarity search you want to conduct.
  11. Click on either one of the Start buttons.
  12. For the current dataset, this should produce a hit set of 47 compounds listed on the Hitlist pane.
  13. One of these hits was cephalomannine (NSC# 318735); click on this NSC number to call up cephalomannine in the Detail pane.
  14. In the External Services field, select Format "Medline CAS Number Search" and click on the "Contact" button to the right.
  15. A PubMed page should show up in the Display pane with approximately 20 references that mention cephalomannine (keyed by its CAS number).
  16. Click on any of the individual article links to see the abstracts (in PubMed).
  17. You can at any time go back to, e.g., the Hitlist pane by clicking on the appropriate button in the button bar at the top of the page, and continue with further, and different, analyses.
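
The "Tanimoto 95%" suboption in step 10 is a threshold on the Tanimoto similarity coefficient. The following is only a minimal sketch of how such a coefficient is commonly computed from two fingerprints, represented here as sets of "on" bit positions. The fingerprints actually used by the service are not described in this text, and the function names and bit sets below are invented for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / len(union)


def is_similar(bits_a, bits_b, threshold=0.95):
    """True if the pair reaches the similarity threshold (e.g. 'Tanimoto 95%')."""
    return tanimoto(bits_a, bits_b) >= threshold


# Invented fingerprints, not taken from the database.
query = {1, 2, 3}
candidate = {2, 3, 4}
print(tanimoto(query, candidate))  # 2 shared bits / 4 bits in the union
```

A threshold of 95% therefore admits only structures whose fingerprints are nearly identical to that of the query, which is why the hit set in step 12 is small.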


Example 2: Download all structures in the database that have CAS numbers associated with them and that have no stereogenic centers, and include names and SMILES strings for all of them.
This is a dataset that may be useful for further processing in various contexts.  Excluding all compounds that possess one or more chiral centers and/or one or more E/Z double bonds ensures that this dataset contains no possibly wrong stereoisomers.
It is a good idea - for this search as well as in general - to first gain an impression of how large a hit list you may get from your search:

  1. In the first (or any other) query row, choose "Data Availability Constraints..." as the Query Type and "CAS number" as the suboption; leave the Query Data Value at "yes."
  2. In the second (or any of the remaining) query rows, choose "Stereocenters" as the Query Type, select "Potential for any stereochemistry?" as the suboption, and change the Query Data Value to "no" (or, equivalently, leave it at "yes" and click the "Negate" box on).
  3. As the Output Format, choose "Simply Count Hits (entire DB)".
  4. Click on either one of the "Start Search" buttons.
  5. In the Status frame at the top, you should see the text displayed "Counted 81058 hits."
Now we proceed to the actual downloading of these structures.  Because this dataset is larger than the hardcoded limit for interactive searches of 10,000 structures, you have to do this as a background search.  You need to have established a user account and be logged in in order to be able to conduct background searches (see also step 11).
  1. Assuming that you still have the query types and values from steps 1 and 2 in the Query Form, click the check box "Background job named" on, and enter a name (e.g., here, CAS_NOSTEREO) in the field next to it.
  2. Change the Output Format to "HTML Table with Samples" or "Plain HTML Table."
  3. Change the maximum number of hits field from its default of 100 (or whatever it is now) to either a value equal to or greater than the number of hits you expect, or to a blank, which stands for "infinite number."
  4. The maximum time setting is disregarded for background searches.
  5. Click on either one of the "Start Search" buttons.
  6. If you are not yet logged in, or have not even established a User Id yet, the system will prompt you to do so at this point.  Follow the instructions on the screen and in the Help page.  Cookies are not mandatory but will make your work with stored hit lists more convenient.  If you were not logged in yet but are now, click the Start Search button again on the Query Form.
  7. In the Status frame, you should now see the message "Executing query as background job. Check your archives for results after a few minutes."
  8. Click on the "List Mgr" button in the top button bar.   The new search should appear at the bottom of the list of stored searches you may already have.  (For this search, the results should appear within approximately a minute.  For more complex searches, you may have to wait several minutes.)
  9. Click on the radio button (either Selection 1 or Selection 2) for this search, and then click on the Retrieve button.
  10. The Hitlist pane should come up, with the title "Operations with this Dataset of 81058 Structures:"
  11. In the Data Retrieval section, set the Format to "SDFile" (default), assuming this is the format you desire.
  12. In the Fields selector, make sure that you have selected the following fields to be included in the SD file (use the Ctrl key, or whatever your web browser requires, to add to the selection):
  13. If you are interested in (calculated) 3D coordinates or in having the hydrogens being removed from the file, check the appropriate check boxes.
  14. Click on the Retrieve button.
  15. After a short while, a standard web browser dialog box will appear; choose Save File with a file name and location of your choice.
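
For the "further user-side processing" mentioned above, the data fields of the downloaded SD file can be read with a short script. This is a minimal sketch that assumes the standard SD file layout (each record ends with a line containing `$$$$`, and each data item starts with a header line of the form `> <FIELDNAME>` followed by its value and a blank line). The field names `NSC` and `CAS` and the sample record are invented for illustration; the actual field names depend on what you selected in the Fields selector.

```python
def read_sd_fields(text):
    """Return one dict of data fields per SD record; the structure block is skipped."""
    records, record, field = [], {}, None
    for line in text.splitlines():
        if line.strip() == "$$$$":               # end of the current record
            records.append(record)
            record, field = {}, None
        elif line.startswith(">") and "<" in line:
            # Data header line, e.g. "> <NSC>": take the text between < and >.
            field = line[line.index("<") + 1:line.rindex(">")]
            record[field] = ""
        elif field is not None:
            if line.strip():
                # Multi-line values are joined with newlines.
                record[field] = record[field] + "\n" + line if record[field] else line
            else:
                field = None                     # a blank line ends the data item
    return records


# Invented one-record sample; the values are placeholders, not database content.
sample = """
  demo

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <NSC>
12345

> <CAS>
50-00-0

$$$$
"""
print(read_sd_fields(sample))
```

For large downloads such as the one in this example, reading the file line by line instead of loading it into one string would be the more memory-friendly variant of the same idea.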


Example 3: Screening the NCI database for potential new structural motifs, by querying with predicted activities and other criteria.
As an example, we want to obtain some new structural ideas for potential HIV protease inhibitors.  We want to limit ourselves to molecules that have not yet been tested in the NCI anti-HIV screen, are of drug-like size, and avoid certain very common classes of compounds against this target.

  1. On the Query Form pane, choose "PASS Prediction Range..." as the first Query Type.  A separate PASS Selector window will pop up.
  2. Select the activity type "Protease inhibitor" (this is the broadest applicable class), make sure that in the Query Probability Field, "Activity" is selected (default), and click on the "Transfer to Form" button.  This should bring you back to the Query Form pane.
  3. In the suboption field, you will see the internally encoded PASS activity type [something like E_PASS_DATA_PA(55)] - don't change this.  The default value range for PASS activities is 0.7-1.0.  To allow for structures that are more dissimilar from the PASS training set structures, we lower the lower bound from 0.7 to 0.5.
  4. To exclude compounds that have already been through the NCI anti-HIV screen, choose "Data Availability Constraints..." as the Query Type in the second query row, select "AIDS screening" as the suboption, and click on the Negate check box for this query row (or, equivalently, change the default Query Data Value from "yes" to "no").
  5. In the third query row, choose "Molecular Weight Range" as the Query Type, and type "200-800" in the Query Data Value field.
  6. Many of the known HIV protease inhibitors are peptide-like molecules.  To reduce the number of such molecules in the hit set, we exclude a specific substructural motif: two consecutive amides. (Other strategies could be devised.)  Click on the "Editor" button in the fourth query row.  In the Java Molecular Editor, draw the appropriate structure - the backbone of a dipeptide - and click on the "Transfer to Query Form" button.
  7. In the Query Data Value field, a SMILES string such as CC(NCC(N)=O)=O should appear.  Click the Negate check box on for this query row.
  8. To make sure that this search can retrieve all possible hits in the entire database, increase the maximum number of hits to 1,000.
  9. Leave the output format at, or set it to, its default value "HTML Table with Samples."
  10. Click on either one of the "Start Search" buttons.
  11. With the current dataset, this search yields 410 hits, tabulated in the Hitlist pane. To prevent very large result sets overwhelming the web browser, the Hitlist pane displays only sets of 250 structures at a time.  Additional portions can be requested by clicking on the "Display Next Hits" button at the bottom of the page.
  12. To get a better overview over the structures we obtained, select "GIF Image Gallery" as the option in the Visualization field, and click on the "Display" button to the right of it.
  13. In the Display pane, a list of the structures, including 2D structure drawings and some additional data, should appear.
  14. Scroll down this list to see what structures came out of this search.  Each NSC number is, again, a live link that will bring up the Detail pane with the complete data available for each compound.  Where available, display of screening data can be requested as a separate table in the Detail pane via another live link. (In this case, this can only be cancer screening data since we excluded anti-HIV screening in the search.)
  15. Return to the Hitlist pane, and click on the "Display Next Hits" button to see the remaining 160 structures.
  16. Repeat steps 12 - 14 for this portion of the hit list.



Query Specification

System Requirements
The new database interface is a rather complicated project, and will require you to use a reasonably recent browser (Netscape or IE, on any platform). A few advanced features will only work on Netscape. Both JavaScript and Java must be enabled on your browser. As long as you do not want to store information on the server, no login is required and no Cookies are set.

Multiple Panes
This tool provides a completely revamped navigation interface which allows you to switch between different result windows ("panes") at will. Anytime you click on one of the buttons on the top navigation bar, the current content of that pane will display. However, some of those panes are not accessible at all times - for example, you will not be able to open the Hitlist and Detail display panes if you have not obtained any query results. Most of the panes will be opened in the same browser window to save the user from a proliferation of newly created windows (as could happen in the old service). The Help text pane (the file you're reading right now) is an exception: it opens its own window so that you can read the help text while looking at the part of the service it describes.

The field below the navigation bar is a status area. After you have started an operation, information about the status of that operation will be displayed in that area. After a few seconds, the window will automatically switch back to display the global database status.

The actual content (i.e. input forms, query results, visualizations) will appear in the third and largest area in the lower part of the browser window.

Caution: Resizing the browser window may reset all entries and even lead to subsequent JavaScript errors. This does not apply to minimization of the window via the upper right hand button and subsequent restoring of the window to its previous size, but to any other resizing such as (accidentally) dragging the window border. We therefore recommend starting the session in a fully maximized window. If you encounter the JavaScript error, simply go back to the welcome page (http://cactus.nci.nih.gov/) and start a new session.

Also, do not use the Back and Forward buttons of your browser instead of the navigation bar buttons. This will work in many instances, but can have unexpected consequences, such as resetting the entire session (especially if you accidentally back or forward out of this site).

Basic Query Specification Procedure
A database query is built by selecting a query type from the popup menu to the left of the four rows below and specifying a parameter in the entry field to the right of the same line. Rows where no data is input in the entry field are completely ignored, regardless of the selected query type. Many of the principal query methods have additional parameters. The option menu below the data input field is automatically updated to reflect the available options whenever you change the principal query method. Some of the more advanced query methods will pop up separate input forms, which will write some gibberish in the data input field when they are closed. You are not supposed to edit such content.

After filling in all relevant form elements, press one of the buttons labeled Start Search. Depending on the selected output format and whether any records meeting your criteria were found, an answer page or a structure file is generated. If you selected the output format Simply Count Hits (Entire DB), you will receive the resulting count in the status window.

Negate
If this field is checked, the query result of the associated input field is inverted, i.e. only records which do not contain the specified substructure, or do not contain data of the selected type or range, are considered hits.

Boolean Operations
You can fill any subset of the four basic query specification data entry rows. On the Connect query fields by line, you can specify the boolean operator with which you want to connect the Query Type rows if you specified more than one. By default, this is the logical AND. You cannot select different operators to connect different subsets of rows; however, since you can specify lists of query values (and value ranges), and the implicit connection between list items is a logical OR, you can build complex queries quite easily. These mechanisms should provide for the vast majority of the searches most users will typically conduct. If you need to perform more complex searches, you can use the Hitlist Manager mechanism. The XOR mode lets records pass where an odd number of criteria are fulfilled. This is mostly useful in connection with Hitlist Management.

The order of the queries is not important. The database will optimize it and use, for example, fast queries for the presence of a data field to filter those records which are submitted to a more demanding substructure match procedure.
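
The way the Negate boxes and the row-connecting operator interact can be summarized in a few lines. This is only an illustrative sketch of the logic as described above, not the server's implementation; the function name and its arguments are invented.

```python
def record_is_hit(row_results, operator="AND", negate=None):
    """Combine the per-row outcomes for one record.

    row_results: one boolean per filled query row (True = criterion fulfilled).
    negate:      one boolean per row (True = the Negate box is checked).
    """
    if negate is None:
        negate = [False] * len(row_results)
    # A checked Negate box inverts the outcome of its own row.
    values = [result != flag for result, flag in zip(row_results, negate)]
    if operator == "AND":
        return all(values)
    if operator == "OR":
        return any(values)
    if operator == "XOR":
        # XOR passes records for which an odd number of criteria are fulfilled.
        return sum(values) % 2 == 1
    raise ValueError("unknown operator: " + operator)


print(record_is_hit([True, False], "AND", negate=[False, True]))
print(record_is_hit([True, True, True], "XOR"))
```

Note that with three rows in XOR mode, a record fulfilling all three criteria passes, while a record fulfilling exactly two does not.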

Limits to Hits and Execution Time
The maximum number of hits which can be retrieved or tabulated can be specified in the Max. Number of Hits field. A lower limit of 1 and a maximum of 10,000 structures are silently enforced. Additionally, you may change the maximum time spent searching. The default time is sufficient to perform about 3000 atom-by-atom matches in substructure search mode. In typical substructure queries, fast bitvector screening will reduce the number of candidate structures well below this threshold, so that the full database is scanned. All other search operations should take a maximum of five seconds to execute on the full database. In any case, you will not be allotted more than 5 minutes of computer time. If your time is used up, you will be presented with the number of hits found so far. However, you have the option to continue your search. This operation is available through the 'Retrieve Next Hits' menu entry on the 'Miscellaneous' section of the hitlist page.

Query Types

NSC Number Searches
You can type in lists of individual numbers or open (e.g. '-100') or closed (e.g. '120-130') number ranges. If more than one number range is given, hits are produced from records which match any of the numbers or number ranges.
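
The range notation can be illustrated with a small parser. This is a sketch under stated assumptions: the text above does not specify how list items are separated, so whitespace and commas are both accepted here, and the function names are invented.

```python
def parse_number_ranges(query):
    """Turn e.g. '-100 120-130 500' into (low, high) pairs; None marks an open end."""
    ranges = []
    for token in query.replace(",", " ").split():
        if "-" in token:
            low, high = token.split("-", 1)
            ranges.append((int(low) if low else None, int(high) if high else None))
        else:
            ranges.append((int(token), int(token)))   # a single number
    return ranges


def number_matches(number, ranges):
    """A record is a hit if its number falls into ANY of the ranges (implicit OR)."""
    return any(
        (low is None or number >= low) and (high is None or number <= high)
        for low, high in ranges
    )


ranges = parse_number_ranges("-100 120-130 500")
print(ranges)
print(number_matches(125, ranges), number_matches(110, ranges))
```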

The currently highest NSC number is just above 700,000. If you don't find an entry for a given NSC number within this range, this can have two reasons: First, you may have hit on one of the non-open ("discreet") compounds in the NCI Database; secondly, large stretches of NSC numbers were set aside in the past but then never really used. Particularly the range 400000-600000 is sparsely populated.

CAS Number Searches
One or more CAS numbers, with or without hyphens, are accepted for this type of search. If more than one number is specified, hits are produced from records which match any of the specified CAS numbers.
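
If you prepare lists of CAS numbers with a script, it can help to bring them into one form and to catch typing errors first. The sketch below uses the general CAS Registry Number layout (two to seven digits, two digits, one check digit) and the standard check-digit rule; whether the service itself validates check digits is not stated in this text, and the function names are invented.

```python
def normalize_cas(text):
    """Return the hyphenated form of a CAS number given with or without hyphens."""
    digits = text.strip().replace("-", "")
    if not digits.isdigit() or not 5 <= len(digits) <= 10:
        raise ValueError("not a CAS number: " + repr(text))
    return digits[:-3] + "-" + digits[-3:-1] + "-" + digits[-1]


def cas_check_digit_ok(text):
    """Check digit rule: weight the digits before the check digit 1, 2, 3, ...
    from right to left; the sum modulo 10 must equal the check digit."""
    digits = normalize_cas(text).replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check


print(normalize_cas("7732185"))          # water, written without hyphens
print(cas_check_digit_ok("7732-18-5"))
```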

Please be aware that only about half of the compounds in the database have a CAS number associated with them. This does not necessarily mean that they do not possess a CAS number; it just means that none was entered when the compound was originally keyed in. On the other hand, there definitely are compounds in the database that truly do not have any CAS number. Many of the samples that NCI received were, e.g., from ongoing research projects, and these compounds were not necessarily published or patented - so they may never have entered the Chemical Abstracts Registry.

Molecular Formula Searches
The general formula range syntax is symbol<low count><-><high count>, repeated for every specified element and written in arbitrary order. So C7-8 is seven to eight carbons, C-7 is up to seven carbons, C7- is seven or more carbons, C or C1 are exactly one carbon, C- is any number of carbons, including none. There are two types of formula searches. If you allow other elements, any number of elements which were not mentioned in the query formula are allowed. The other type disallows any additional elements, so your formula must be fully specified, including hydrogen atoms. Two-letter elements must be written with the second letter in lowercase, otherwise Cu (copper) and CU (one carbon, one uranium) would not be distinguishable.

It is also possible to use sums and differences of elements. For example, the query C4(F+Cl+Br+I)2 will retrieve all C4-compounds with any combination of exactly two halogens.
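
The element range notation from the first paragraph of this section can be made concrete with a small parser. This is a sketch only: it handles plain element ranges, not the parenthesized sums and differences just described, and the function name is invented.

```python
import re

# One element entry: symbol, optional low count, optional '-', optional high count.
ENTRY = re.compile(r"([A-Z][a-z]?)(\d*)(-?)(\d*)")


def parse_formula_ranges(formula):
    """Map each element symbol to (low, high); high is None if unbounded."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*-?\d*)+", formula):
        raise ValueError("cannot parse formula query: " + repr(formula))
    ranges = {}
    for symbol, low, dash, high in ENTRY.findall(formula):
        if dash:
            # 'C7-8', 'C-7', 'C7-' and 'C-': a missing bound is open.
            ranges[symbol] = (int(low) if low else 0, int(high) if high else None)
        else:
            # 'C' and 'C1' both mean exactly one atom.
            count = int(low) if low else 1
            ranges[symbol] = (count, count)
    return ranges


for query in ("C7-8", "C-7", "C7-", "C", "C-", "Cu", "CU"):
    print(query, parse_formula_ranges(query))
```

The last two queries show why the lowercase rule matters: Cu is read as one element, CU as two.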

Molecular Weight Searches
This type of query accepts one or more molecular weights or weight ranges in g/mol. Ranges are processed with full precision, but single weights are compared with rounded weight numbers.

Atom and Ring Counts, Donor/Acceptor Counts, etc.
Once more, ranges or single numbers are permitted with these search options. The atom count is the total number of atoms, including hydrogen. The ring count is the number of ESSR rings in the structure. An ESSR ring is any ring which does not share three consecutive atoms with any other ring in the structure. This filter is also applied to fused rings such as naphthalene - according to this convention, three rings (two phenyl fragments and the 10-membered envelope) result. Biphenyl will yield only a count of two rings.

The definitions of donor and acceptor atoms and rotatable bonds are somewhat flexible, but should match common practice. If in doubt, extend the range and see whether you get extra hits which are interesting. The rotatable bond count excludes all bonds where the rotation is possible, but does not have a major impact on the shape of the molecule. For example, all terminal or linear bonds are excluded.

Please be aware that the definition of flexibility that underlies the rotatable bonds count here, and the definition of flexibility that was used by the program Catalyst (MSI) in calculating the conformers whose number is reported in the Detail window (and which can be searched for with the 3D pharmacophore search) have nothing to do with each other. Issues like terminal groups, hydrogens, large ring flexibilities play a role here. You may therefore encounter cases where the number of rotatable bonds (CACTVS) vs. the number of conformers (Catalyst) seems non-intuitive if not inconsistent.

Complexity
The complexity rating of the compounds is a rough estimate of how complicated a structure is, seen from both the point of view of the elements contained and the displayed structural features including symmetry. The value is computed using the Bertz/Hendrickson/Ihlenfeldt formula (1). It is a floating point value, ranging from 0 (simple ions) to several thousand (complex natural products). The most complex compound in this database is NSC 277816 (C128H164BrN25O86P12) with a complexity rating of about 10,515. The average complexity of the structures in this database is about 402.

Name Fragment Searches
About 45,000 compound names are associated with the structures from the original NCI database, and for most other compounds, an IUPAC name was computed by the ACD/Name (V4.0) program from ACD/Labs. Generally, because of the usual problems with structure naming conventions, name search is of somewhat limited value. Only the original NCI name set contained common names and sometimes trade names. If there happens to be an NCI name for a structure, often it has not one but multiple names. Name searches are automatically performed on the full name set, and for a hit it is sufficient for any single name to yield a positive result. Search is always case-insensitive and ignores whitespace.

We support six different kinds of searches. A full name search must match the name, either with or without numbers and punctuation. Simple substring search is as simple as it sounds. The default name substring search will ignore all punctuation and numbers in the name. The second variety of substring search will also ignore punctuation, but preserve and match digits in the compound names.

Shell syntax works like the command line in a Unix Bourne shell. The special characters and character sequences '*' (zero or more arbitrary characters), '?' (single arbitrary character) and '[]' (character range) are recognized. Note that this search (in contrast to substring and regular expression search) is anchored, i.e. if your query value string does not start with '*', the first character of a structure name must match the first query character.

Regular expression search is even more powerful, but also rather complicated. With this search, you can, for example, specify that there should be either 'fluoro' or 'chloro' to the right of another fragment. Refer to a Unix manpage (for example, for the commands sed or egrep) to get more information about this topic if you are not familiar with it.
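
The difference between the substring, shell syntax and regular expression modes can be tried out locally. The sketch below imitates them with Python's standard `fnmatch` and `re` modules; it is not the server's matching code, the function names are invented, and the short name list is made up for illustration. Note that Python's `fnmatch` anchors the pattern at both ends of the name, which may be stricter than the service at the end of the name.

```python
import fnmatch
import re

# Made-up sample names, not a dump of the database.
NAMES = ["Taxol", "7-epi-Taxol", "2-chloroethanol", "2-fluoroethanol"]


def _letters_only(text):
    """Lowercase and drop whitespace, punctuation and digits."""
    return re.sub(r"[^a-z]", "", text.lower())


def substring_search(fragment, names):
    """Default substring mode: punctuation and numbers are ignored."""
    key = _letters_only(fragment)
    return [n for n in names if key in _letters_only(n)]


def shell_search(pattern, names):
    """Shell syntax: '*', '?' and '[]' wildcards, anchored at the start."""
    return [n for n in names if fnmatch.fnmatchcase(n.lower(), pattern.lower())]


def regex_search(pattern, names):
    """Regular expression mode: unanchored, case-insensitive."""
    return [n for n in names if re.search(pattern, n, re.IGNORECASE)]


print(substring_search("epi taxol", NAMES))
print(shell_search("taxol", NAMES))    # anchored: no leading '*'
print(shell_search("*taxol", NAMES))
print(regex_search(r"(fluoro|chloro)ethanol", NAMES))
```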

AIDS Screen Result
For this query, the normal data field is ignored. Please select from the menu the desired test outcome, which can be any combination of active, moderately active and inactive. These are the categorizations that the DTP AIDS Antiviral Screen has assigned to the compounds according to the combined cell-protective (against HIV) and cell-toxic (without HIV) activities.

Data Availability Fields
These are options to restrict the retrieved structures to those for which specific data, such as tumor cell screening data (GI50, LC50, TGI, IC50, EC50) or anti-viral AIDS activity data is available. Enter 'y' (yes) or 'n' (no) into the value field to select records with/without such data. Currently, you cannot search on the actual values of the tumor cell screening result data.

In combination with the Simply Count Hits (entire DB) output format, you can use this feature to quickly derive all kinds of statistics on our dataset.

LIQCRYST Data
The LIQCRYST database is the first open WWW structure database which is fully and bidirectionally crossindexed with the NCI database. LIQCRYST is a database which contains extensive information about properties of liquid crystals. If LIQCRYST data is available for a compound, a link to the LIQCRYST record will appear on the detail page. You can directly query the NCI database for LIQCRYST registry IDs, or use a data availability constraint query to select only records with or without LIQCRYST data. The LIQCRYST concordance list was kindly provided by Volkmar Will.

PASS Searches -- Predictions of Biological Activities
You can search for specific ranges in predictions for a very large number of biological activities. The program PASS (Prediction of Activity Spectra for Substances) was used to calculate predictions for up to 565 different activities for nearly all the structures in the database. PASS calculates the probability for both activity and inactivity of the compound for a given mechanism. These comprise specific enzymatic inhibitory potencies, therapeutic uses for various diseases, toxicities, and others. Counting the activity and inactivity predictions separately (they can be searched for separately), a total of 64,188,212 predicted values are offered on this site. Because the training set that underlies PASS is large but still limited (on the order of 35,000 compounds), the program cannot reliably predict each activity for every compound in the database. Here is the list of the activities, together with the number of compounds for which each activity was predicted. You can also try to assess the quality of each predicted activity's SAR model by having a look at the SAR Base Leave-One-Out Cross-Validation results. This file lists both the number of compounds in the training set for each activity and the Mean Error of Prediction (MEP) for this activity. Finally, you can view the distribution of the numbers of accepted activity predictions across the entire compound set (acceptance criteria: probability of activity > probability of inactivity).

If you select the Query Type PASS Prediction Range..., a popup window will appear that will allow you to select the activity for which you want to search. There should be a scroll bar at the right side of the list. If your browser doesn't show it, enlarge the popup window manually. At the top of the window, you can select the Query Probability type to search for, Activity or Inactivity. You can only select one activity or inactivity at a time. If you want to conduct combined searches, such as "activity [probability] > 0.8" AND "inactivity [probability] < 0.2", you have to use separate query input lines in the Query Form. Since the predictions are calculated as probabilities, you have to use number ranges between 0.0 and 1.0.
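
The combined search mentioned above (one query row for the activity probability, one for the inactivity probability, connected by AND) corresponds to a simple range filter. The following sketch applies such a filter to exported predictions; the record layout with the keys `pa` and `pi`, the function name and the values are all invented for illustration.

```python
def pass_hits(records, pa_range=(0.8, 1.0), pi_range=(0.0, 0.2)):
    """Keep records whose activity (pa) and inactivity (pi) probabilities
    both fall inside the given ranges, i.e. the two rows connected by AND."""
    pa_low, pa_high = pa_range
    pi_low, pi_high = pi_range
    return [
        r for r in records
        if pa_low <= r["pa"] <= pa_high and pi_low <= r["pi"] <= pi_high
    ]


# Invented predictions for one activity type; not taken from the database.
records = [
    {"id": "A", "pa": 0.91, "pi": 0.03},
    {"id": "B", "pa": 0.85, "pi": 0.40},
    {"id": "C", "pa": 0.55, "pi": 0.10},
]
print([r["id"] for r in pass_hits(records)])
```

Only record A passes: B fails the inactivity range, and C fails the activity range.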

Individual PASS activity and inactivity predictions can be selected as fields to be exported through the Data Retrieval functionality of the Hitlist pane.

We have observed the possibility, under certain circumstances, that the PASS selection popup window will not come up any more, even if you reset the Query Form. If this happens to you, simply re-enter the server URL (e.g. http://cactus.nci.nih.gov for the U.S. mirror) in your web browser window, and start the search session from there anew.

It is obviously totally impossible for us to test even a small subset of these predictions for all the NCI compounds ourselves. If you use this feature of our service, we would therefore be interested in hearing about success (and also not-so-great-success) stories, which we would compile and post, e.g., on this server. This need not include disclosure of individual compounds tested, but merely of the success rate of the predictions for the activity analyzed. You can e-mail Marc Nicklaus and/or Prof. Vladimir Poroikov with results and questions.

In this vein, we want to emphasize that these values are predictions, to which all the usual caveats pertinent to QSAR-type calculations should be applied. The user should never make the mistake of assuming that a specific prediction for a single compound means that this molecule has this activity. The PASS predictions can only be responsibly used in a statistical manner for sets of compounds, and should be treated as scientific "food for thought."

Structure Input
There are several possibilities to input a structure for full-structure, substructure or similarity search. The upper four input fields accept SMILES strings as structure specifications. If you are familiar with the syntax, you can type in simple queries manually. However, most of the time you will want to use some graphical structure editor. If your favorite desktop molecule editor supports Copy&Paste of SMILES strings, you can simply use this editor, put the structure on the clipboard as a SMILES string and paste it into the entry field. Editors which support this operation include ChemWindow and ChemDraw.

As a third option, you can start a Java editor by clicking on the Start Editor button below any input field. You must use a WWW browser with Java support (Netscape, Internet Explorer) for this to work, and you must have Java enabled, which is an option in the browser configuration panel. The input frame will switch to the editor panel. Read the editor instructions to learn how to use the program. Structures are exported as SMILES strings from the editor by clicking on the Transfer to Form button on the editor panel, or by using the navigation bar to switch back to the query input panel. The editor remains associated with the last input field where you pressed the Start Editor or Transfer to Form buttons. If your current search option is not structure-based, the query method will automatically change to substructure search upon structure import.

Sometimes the best combination of methods to input a query structure may be to first draw the general structure in the Editor, and then edit the resulting SMILES string manually in the Query Data Value field, especially if you want to add SMARTS extensions etc. (see Supported SMILES Features below).

We thank Peter Ertl from Novartis Crop Protection AG for kindly allowing us to use this remarkable applet.

Java Editor Comments
We are now using the 2000/10 version of the JME Java editor. It has much enhanced capabilities - for example, now you can input disconnected structures (use the 'New' button), and you can number the atoms (use the '123' button). Numbering is very helpful for the input of 3D query constraints. Just draw your structure any way you like it, and then number the atoms which participate in 3D constraints. These can be specified on the input fields to the right. The 'Qry' button pops up a window which allows you to input many more atom and bond properties. Please read the editor documentation.

Supported SMILES Features
All standard SMILES features, including stereochemistry and isotope labeling, are supported. However, since there are neither stereochemical descriptors nor isotope labeling in the database, these search features are disabled and stereo descriptors or isotope specifications will be ignored. Essentially all of the SMARTS extensions (including Recursive SMARTS) are also recognized, most notably the R, a, A, X, V and H descriptors for bracketed atoms. Boolean attribute link logic (indicated by the characters , ; &) is supported. For example, to specify that a nitrogen atom should not be part of a ring, you could use a SMILES descriptor '[N;R0]'. The special bond symbol '~' forces the bond to match an aromatic bond. Otherwise, aromatic bonds without any additional search attributes will match single and double bonds from both the substructure and structure side. The exclamation mark '!' used as a bond symbol is a 'non-bond' which must not be present in the database structure for the substructure to match. Example: 'C!C' - there may not be any bond between the two carbons, which can be useful to search for close atom contacts without bond. 'C!-C' means something else: There must be a bond, but it may not be a single bond. To search for hydrogen bond donors and acceptors as atom "types", you can use the CACTVS-specific SMARTS extensions [HD] and [HA], respectively. Edit the SMILES search string in the Query Data Value field of the Query Form pane manually, or use 'HD' or 'HA' (without brackets) in the Editor Atom/Bond Query popup window (click on the Editor's QRY button to open this).

Please note that throughout the service, Unique SMILES (USMILES) is used for SMILES output. However, this is the original USMILES definition by Daylight Chemical Information Systems, Inc. of 1989. The canonicalization rules have been changed by Daylight in the meantime, but these changes, to our knowledge, have not been published. Internally, USMILES may or may not be used. For example, the JME Editor does not create USMILES when you draw and transfer a structure. Of course, if you compose a SMILES search string by hand, or edit a SMILES string, you are not required to use USMILES - any valid SMILES string will be recognized and accepted.

Structure Import from File
The bottommost input field of the search specification panel is somewhat different from those above. To the right, you have a button labeled Browse. You can either directly enter a filename into the input field, or use the file selection dialog which pops up when you press the button. Select a structure file which you have saved on your local disk as structure data input. The database supports most of the standard ISIS structure search functionality (plus some custom extensions), so you can use atom attributes and other features which cannot be expressed in a simple SMILES string. The file format is detected automatically. Most standard exchange formats, such as an MDL Molfile, will work, even files without connectivity such as an XYZ file. However, NONE of the program-specific binary drawing formats such as ChemWindow .cw2 files can currently be read.

Most of the ISIS query syntax is available - including all 3D query methods. R-group search is only partially supported - simple positional variations of substituents do work, but not complex R-group logic.

Search List Uploads
You can use the same bottommost input field of the search specification panel to upload a file with lists of compound identifiers to be used as one (or the only) criterion for your search. Currently, you can upload files with NSC numbers, CAS Registry Numbers, and the internal CACTVS record numbers (option Merge/Upload Hitlist).

Structure Match Options
A set of check boxes is available on the editor panel to globally modify the structure search parameters.

First, you have a choice whether matched substructures should be highlighted in the displayed result structures or not. Highlighting applies both to 2D plots and 3D displays in Chime or as VRML file. Note that, if multiple substructures are combined by an OR statement, only the first successful substructure match is actually performed on that record and subsequently displayed, even if additional fragments would also match. Highlighting is activated by default.

If you allow multi-fragment overlap, substructures which consist of disconnected fragments may overlap when matching the target structures. By default they will not, so that if you specify two nitro groups as substructure, only compounds with two or more nitro groups are found. Note that this feature applies only to substructures which were entered in a single input field as an entity. If you specify two substructures on two different fields, their match relationship is not influenced by the setting of this switch.

The third option is whether to suppress the matching of aromatic bonds on plain single or double bonds with no auxiliary attributes. By default, aromatic bonds will match such bonds, provided that no other attributes (such as 'not in a ring') prevent the match. If you desire the behavior of NCI's older DIS system, which will match aromatic bonds in the database structures only on aromatic bonds in your query, you should activate this switch.

Finally, the option for the enforcement of ring embedding equality means that the ring count of the bonds of a substructure must match the ring count of the database structures. If this switch is on, a simple phenyl fragment will not match naphthalene (only benzene, or biphenyl). Also, it implies that all bonds in your substructure which are not in a closed ring can only match non-ring bonds in the database molecules. The same effect could be achieved by explicitly specifying for each bond that it must not be in a ring, but this global option is often more convenient.

Structure Search Types
The basic structure query types in this database are full-structure search, substructure search and similarity search. Full-structure search is fastest; substructure searches can take up to a few minutes depending on the character of your query structure. Hydrogens will be added automatically for all searches except substructure search, where you will have to specify them explicitly. You should know that adding explicit hydrogens to all sites where you do not want any substituents will both focus your search and speed it up. The similarity searches operate on the Tanimoto distance of the substructure filtering screen bitvectors. For full-structure search, you have the choice between looking for the complete structure (e.g. salts plus specific counterion) or any isolated molecule in the record. If the record contains only one molecule, which is true for the large majority of the database entries, these two search types deliver identical results.
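As an illustration of the similarity measure mentioned above, the following is a minimal sketch of a Tanimoto comparison between two screen bit vectors. It is not the service's implementation: the bit patterns are made up, the function names are hypothetical, and the definition of the actual screen bits is not described in this help file. The distance is taken here as one minus the Tanimoto coefficient, which is an assumption about what "Tanimoto distance" means in this context.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two bit vectors stored as Python integers."""
    common = bin(a & b).count("1")   # bits set in both vectors
    total = bin(a | b).count("1")    # bits set in at least one vector
    # Two empty vectors are treated as identical (an arbitrary convention).
    return common / total if total else 1.0


def tanimoto_distance(a: int, b: int) -> float:
    """Distance taken as 1 - coefficient (assumed definition)."""
    return 1.0 - tanimoto(a, b)


# Hypothetical 6-bit screens for a query and a database structure.
query = 0b101101
candidate = 0b100111

# 3 shared bits out of 5 bits set in either vector
print(tanimoto(query, candidate))
print(tanimoto_distance(query, candidate))
```

A coefficient of 1.0 means the two screens are identical; this does not necessarily mean the two structures are identical, since different structures can set the same screen bits.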

Tautomer-Tolerant Searches
This database now also supports tautomer-tolerant queries for substructure and full-structure search. You can draw any tautomeric form of your query structure, and if the button is checked, the database will retrieve all compounds which are tautomers of your input form, regardless of internal coding. Note that in the case of substructures, you have to draw tautomeric hydrogen atoms explicitly. Inputting, for example, the enol of acetone without a hydrogen at the oxygen, or the keto form without any hydrogen at one of the carbons will not yield the expected results, since the open valences could be occupied by ligands which lock the form (say, some silicone group). If there are no potential tautomeric atoms, the search will proceed as if the box had not been checked. If there are such systems, screening will be performed less aggressively, and the match procedure adapted to allow positional variations of the hydrogens in the tauto systems. This will cost some 30% of extra computer time.

3D Pharmacophore Searches
You can conduct a 3D pharmacophore search in this database. Using the program Catalyst by MSI, up to 25 conformations were calculated for those compounds in the open NCI database that Catalyst could handle. Catalyst conformers have been included for 211,857 compounds.

To prepare a query for a 3D pharmacophore search, you can either create a query file externally and submit it to this service, or you can use the Local Query Parameters area of the Editor pane. The first possibility is probably the somewhat easier way at this time to enter more complex queries.

To create a query file, you can use programs such as Catalyst or ISIS/Draw etc. and generate a file in .mol format. Most of the additional features in query files are supported, such as exclusion spheres, centroids, points on lines, angles, planes... Once you have this file available on the machine from which you started the Browser, go to the bottommost query line, select the option Substructure and/or 3D Search..., click on the Browse button to the right of it, and select the query file on your machine. Then start the search.

To generate a query, proceed along the lines of the following examples. From any of the query input lines, call up the Editor pane. To generate a query that consists of a triangle of oxygen atoms,
1. select O from the list of elements, place it on the drawing area;
2. click on the NEW button at the top of the JME Structure Editor;
3. place another O atom;
4. repeat steps 2 and 3;
5. click on the 123 button;
6. click on the three placed O atoms: this will generate atom numbers;
7. in the Local Query Parameters area, enter "1 2" in the topmost Atoms field;
8. in the Value Range field below it, enter (e.g.) "2.5-3.5";
9. repeat steps 7 and 8 with the values (e.g.) "2 3", "3.5-4.5" and "1 3", "4.5-5.5";
10. click on the button (below the Editor area) Transfer to Query Form.
Now, in the query line you used, you should see, in the Query Data Value field, the entry "[OH2:1].[OH2:2].[OH2:3]". This would search for three water molecules -- which is probably not what you want. (The Editor automatically adds hydrogens to all unfilled valences.) Go into this field, and manually edit out the hydrogens, so that you have the string "[O:1].[O:2].[O:3]". Now start the search (after possibly adding other search criteria). The constraints you specified are transferred to the search engine behind the scenes.
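If you prefer to prepare such strings outside the browser, the manual edit described above can be sketched in a few lines of Python. This is only an illustration of the text transformation; the function name is hypothetical, and the pattern assumes bracketed atoms of the form produced in this example (element symbol, hydrogen count, atom number).

```python
import re

# Matches a bracketed, numbered atom with a hydrogen count,
# e.g. "[OH2:1]" -> element "O", atom number "1".
_ATOM_WITH_H = re.compile(r"\[([A-Z][a-z]?)H\d*:(\d+)\]")


def strip_mapped_hydrogens(smiles: str) -> str:
    """Remove hydrogen counts from numbered bracket atoms, keeping the numbers."""
    return _ATOM_WITH_H.sub(r"[\1:\2]", smiles)


print(strip_mapped_hydrogens("[OH2:1].[OH2:2].[OH2:3]"))
# prints [O:1].[O:2].[O:3]
```

Atoms without a hydrogen count are left unchanged, so the function can be applied to a string that has already been edited.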

You should make sure that the ensemble of constraint values you're entering amounts to a meaningful 3D arrangement of atoms. For example, the values used above are a triangle with side lengths of 3, 4, and 5 Angstroms, resp., with a 0.5 Angstrom tolerance for each side. Values of 3, 4, and 10 Angstrom, on the other hand, do not produce a valid triangle, and thus do not result in any hits.
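For three atoms, the consistency check described above can be sketched as a triangle inequality test on the entered ranges. This is a hedged illustration, not part of the service: the function names are hypothetical, the range strings follow the "2.5-3.5" style used in the Value Range field in the example, and the second call assumes the same 0.5 Angstrom tolerance around 3, 4, and 10 Angstroms.

```python
def parse_range(text: str) -> tuple[float, float]:
    """Parse a range such as "2.5-3.5" into (low, high)."""
    low, high = text.split("-")
    return float(low), float(high)


def triangle_feasible(r12: str, r23: str, r13: str) -> bool:
    """True if some distances within the three ranges can form a triangle.

    For each side, its smallest allowed length must not exceed the sum
    of the largest allowed lengths of the other two sides.
    """
    ranges = [parse_range(r) for r in (r12, r23, r13)]
    for i, (low, _) in enumerate(ranges):
        others_high = sum(high for j, (_, high) in enumerate(ranges) if j != i)
        if low > others_high:
            return False
    return True


print(triangle_feasible("2.5-3.5", "3.5-4.5", "4.5-5.5"))   # True
print(triangle_feasible("2.5-3.5", "3.5-4.5", "9.5-10.5"))  # False
```

This test covers only three points. Queries with more constrained atoms, or with angle and plane constraints, need a more general geometric check.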

Once you have obtained hits from your search, the best way to view the results is probably to choose, from the Detail pane, the Visualization option Chime Display/All Conformers. This will show you all conformations calculated by Catalyst, with the one that was found to match the 3D query highlighted by a light red background. (Once the search algorithm has found one match for a molecule, it will not look for additional conformers that could potentially also match the query.) Superimposition of the query onto the displayed conformers is planned for the future but not yet implemented.

This capability is not a replacement for full-fledged, dedicated 3D pharmacophore search programs. One of its main limitations is obviously that it doesn't allow one to conduct any conformational search on-the-fly -- there is only a fixed set of pre-calculated conformers available. On the other hand, this allows for very rapid searching -- few of the more sophisticated programs will return a hit set from a 250,000-compound database within a few seconds.

Background Searches (Very Large Hit Sets)
Background searches can be performed for any query, but are particularly useful for searching for, and downloading, very large hit sets. The hard-coded limit for the maximum number of hits for an interactive search is 10,000, so if you want to go beyond this limit, you have to do this through a background search. (See also Example 2 above.)

To conduct a background search, click on the check box to the left of the option Background job named on the Query pane, and enter a name so that you will be able to retrieve your hit set after the search has finished. You do this by going to the Hitlist Management pane (click on the List Mgr button on the Navigation Bar). If your search is complicated, it may take a while for the results to become available on the Hitlist Management pane. If you go there too early, and don't see your results yet, use the Reload button of the Hitlist Management pane after a short while. (Do not use the Reload button of your web browser - this will cause you to lose all your previously entered Query data.) Remember that you have to be logged in in order to use the Hitlist Manager, and in order to log in, you need to have registered - so it's a good idea to complete these steps before you start a background search, which otherwise will be denied.

For background searches, you can enter maximum numbers of hits greater than 10,000, all the way up to 99,999; or you can blank out this field, which means that you set the maximum number to "infinite", i.e. the entire database.

Since you can specify, in the Data Retrieval section of the Hitlist pane, additional fields to be added to the output file (for those formats that allow this, e.g. SDF), background searches are actually a more flexible way of downloading the entire database with exactly the data you want than downloading the bulk files from our Download Page.

The following is an example of how to download the entire database in chunks of 25,000 compounds. Set the Max. number of hits accordingly, i.e. to at least 25000. Select Record Number(s) as the Query Type, and enter 1-25000 as the Query Data Value. Wait a few seconds for the search to complete. (The Output Format does not matter at this point.) The resulting hitlist is now stored on our server, so you have to log in by clicking the List Mgr button on top. In the Hitlist Manager, select the Selection (hit set) you just created, and click on Retrieve. This will automatically get you to the Hitlist pane. Only 250 structures will actually be displayed at a time (you can request more blocks of 250 to be displayed), but they're all there. In the Data Retrieval section, select the appropriate fields you want to be included in the SD file (such as CAS Number, SMILES String etc.). Then click on the Retrieve button, and save the file with a name of your choice on your computer. Repeat these steps for Record Numbers 25001-50000 etc., until you have downloaded the entire database.
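The list of Record Number ranges to enter for such a chunked download can be generated with a short script. This is a sketch only: the function name is hypothetical, and the total of 250,250 records is the database size stated elsewhere in this help file. The script just produces the strings to type into the Query Data Value field; it does not contact the service.

```python
def record_ranges(total_records: int, chunk_size: int) -> list[str]:
    """Return range strings such as "1-25000" covering 1..total_records."""
    return [
        f"{start}-{min(start + chunk_size - 1, total_records)}"
        for start in range(1, total_records + 1, chunk_size)
    ]


ranges = record_ranges(250_250, 25_000)
for value in ranges:
    print(value)
# The first line printed is 1-25000, the last is 250001-250250.
```

With these numbers, eleven searches are needed; the last one covers only the final 250 records.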


Hitlist Management

Besides continuing a search, or using a result set as basis for further refinement (both options are available through the 'Miscellaneous' section of the hitlist pane), you can store any result set by selecting the 'Store Hitlist' menu command from the 'Miscellaneous' section of the hitlist page. In order to use this feature, you will have to register. Creating an account is free and simple. You can create a new account, or identify yourself as the owner of an existing account on the 'List Mgr' pane. If you allow the system to set a Cookie, you will be automatically identified next time you contact the database. By means of the list manager, you can retrieve existing result sets, annotate them, perform operations such as unions and intersections, and various other operations. Result sets may be deleted by the owner, and you can get rid of an account altogether by selecting the appropriate option on the list manager pane. If you are logged in, and want to change the account, or delete the account, you have to log out first by pressing the 'Change Login' button on the manager pane.

Output Options

Tabular Output
The standard tabular output, displayed in the Hitlist pane, includes the NSC number, formula, CAS number, number of names available for the structure, and one sample name. Note that the NSC number is a live hyperlink, which will lead you to the display of the complete information available for this structure. For more information, go here.

Output Formats
Various choices are available to control the style of output. These choices can be made in several places: in the Output Format field of the Query Form; in the Data Retrieval and the Visualization fields of the Hitlist pane; and in the Structure Retrieval and the Visualization fields of the Detail pane.

Many structure file formats can be directly written to file, if desired. For statistical purposes, simply counting the matches is an option selectable from the Query Form (Simply Count Hits (entire DB)). It is also often a good idea to conduct your search first with this output format selected if you suspect that your query will result in a very large number of hits.

However, for most searches you will probably initially want to use the default output format, which is an HTML table (see above) with a few random sample structures from the result set, displayed in the Hitlist pane. If you are confident that the number of hits will not be too high, you can also select complete 2D structure rendering as GIF image gallery or Chime display. In order not to overload your web browser's cache, the maximum number of structures displayed at a time, in both the Hitlist pane and the Display pane (when displaying, e.g., a GIF Image Gallery), is limited to 250. You can go through larger hit sets in chunks of 250 structures at a time.

Most output format names in the list should be self-explanatory. They are either widely known (such as PDB), or should be familiar to people working in specific fields (such as JCAMP/CS, a format used in spectroscopy). If in doubt, just try it out.

Please be aware that the format Cactvs Hitlist (records) is not a list of NSC numbers but will produce a list of the internal record numbers in the CACTVS database. This is rarely useful for external users of the database. The one major exception is if you want to exchange hit lists with other users. Conduct your search as desired. Export your hit list in the format Cactvs Hitlist (records). Send the saved file to the person you want to share it with. They can then use the Query Type Merge/Upload Hitlist in the bottommost Query row to re-import this hit list. If no other search fields are used, this list will simply be loaded; otherwise, the search will be conducted on this subset of structures.

Sorting the Hit Lists
At the bottom of the Query Form, a menu lets you select the sorting order of hitlists. It is only used when more than one result record is produced. The default sort order is by NSC registry number in ascending order, but you can also select atom counts, structural complexity, molecular weight (all in ascending order) and similarity to the query structure (in descending order). Note that similarity sorting can only be used in conjunction with a similarity query. Otherwise the default NSC ordering is used. The sort order is preserved when exporting multiple structures (for example, as SD file) or expanding a hit list to an image gallery. If a list is sorted by anything besides the NSC numbers, the values used for sorting are included on the hitlist page.

2D vs. 3D Structures
If the selected export file format offers a choice (e.g. Molfiles), the file will contain 2D plot coordinates by default. The file can be forced to contain exclusively (Molfiles) or additionally (CACTVS/Binary etc.) 3D information by checking the 3D preference box. Formats like PDB, XYZ, VRML etc. are always 3D and the checking of the box is implied and not required. The 3D structures are not experimentally determined coordinates but have been calculated by the program CORINA.

Hydrogen Stripping
You can request that hydrogens be stripped from the exported structures by clicking the check box Strip H on the Hitlist pane. This can be required by certain modeling programs, or be useful for display purposes. Hydrogens on hetero-atoms (e.g. in -OH groups) and hydrogens necessary to specify stereochemistry will not be removed. Be forewarned, however: Removing a hydrogen from a carbon atom (without introducing a formal charge) will, in many cases, make this carbon a radical center. This may not be what you want. It is generally better to tell your program not to display unwanted hydrogens than to truly remove them from the structure.

Stereochemistry - Warning!
The original structure records in the DTP data files (connection tables) do not contain any stereo information. 3D structures generated from this data therefore contain a default selection of stereodescriptors, usually those resulting in low overall energies (i.e. trans double bonds). This is not a limitation of the CORINA 3D structure generator (as has been erroneously written in reviews) but simply unavoidable given the current NCI Database contents.

The Hitlist Pane
If no direct output to a structure file is requested, an overview Hitlist pane is presented first. This page contains the NSC number, CAS number (if available), molecular formula, number of names and one representative name, if any names are available. Clicking on the NSC numbers retrieves this record and displays it in greater detail with all available data on the Detail pane. Additionally, you can choose the export of the complete list or a selected subset thereof in a multi-record file format (SMILES, CACTVS/Binary, Hitlist or MDL SDfile). Selected data fields can be added to the exported file by highlighting the various choices in the Fields list in the Data Retrieval block. Select multiple choices by using the usual Ctrl-Leftclick and Shift-Leftclick mechanism (or whatever your web browser requires). Individual PASS activity and inactivity predictions can now also be selected as fields to be exported. Selected data columns can also be exported in various spreadsheet files.

A more detailed overview page, called the image gallery, is also accessible from this page. It contains, in addition to the information listed on this page, a structure plot and information about AIDS antiviral screening and tumor cell line screening results. It is displayed in the Display pane.

The leftmost column of the compound listing contains checkboxes. These checkboxes control the structure export from this page in multi-record formats (SD-File, CACTVS/Binary, SMILES, GIF and Chime galleries). Only structures which are checked are transferred or depicted. You can toggle the current selection by clicking on the Invert Selection button at the bottom of the page. By default, all structures are selected. The maximum size of a hitlist page is 1000 structures, that of an image gallery 100 structures. On very slow computers, the display of the 1000-row table can require over ten seconds. Having one hundred images on a page is even more demanding for antique hardware. Do not attempt this from a computer with 8 MB of RAM (or less... :-)).

The Detail Pane
The Detail pane is generated when you select a structure via clicking on its NSC number from the Hitlist pane or an image gallery. This pane displays all information of the selected compound which is contained in the database, including a structure plot and screening results. From this page, you can transfer your structure to a variety of database and computational services on the Internet in order to obtain additional information, or export the structure in many chemical exchange formats. Also, you can transfer a selected structure to the Java editor to use as a template for further queries by clicking on the Transfer to Java Editor button underneath the structure drawing.

Substructure Highlighting
If you are interested in how a substructure is embedded in the result molecules, you should leave the checkbutton Highlight matched SS enabled. If this option is active, the bonds corresponding to the substructure are plotted in red on the Detail pane, GIF image gallery and 3D displays as VRML scene or via the Chime plug-in. If you have multiple substructures, all embeddings which were used to establish the matched substructures are highlighted. If your query connection is AND, all embedded substructures will be plotted. If the connection mode is OR, only the first one will be plotted, since the database will not execute the second match if the first one was sufficient to let the record pass.

Structure Export
Both from the Hitlist pane and the Detail pane you can request the export of structure data in a variety of chemical exchange formats. Most of the file formats are limited to a single structure record, so a hitlist page will be automatically presented if more than one match is found and the chosen file format does not support multi-record output. The returned data is prefixed with the appropriate chemical MIME type; thus, if your browser is configured correctly, the structure data will be displayed, e.g., as a rotatable 3D model, rendered by a suitable helper application or a plug-in. Recommended plug-ins for this database include MDL Chemscape Chime for most structure formats and SGI CosmoPlayer for VRML visualizations, as well as the external RasMol molecular graphics program. Another VRML plug-in that has been successfully used is Cortona from ParallelGraphics, and many more choices can be found, e.g., at the VRML Repository.

Links to External Databases and Services
From the detail page, you can directly transfer your structure to a variety of other databases and computational services. Important options on the page include links to the CambridgeSoft ChemFinder database, the ACD/Labs ILAB Property Computation Services and the TeleSpek IR prediction program. Just choose the appropriate menu entry to have your structure transferred as full-structure query to the ChemFinder Internet chemical information reference database or submitted to an IR spectra prediction. ChemFinder is especially valuable because it is a meta-database of other freely accessible databases on the Web. It will even lead you back to other NCI information pages containing additional compound information such as solubility not contained in this database. Numerous other links are also provided, for example for the computation of molecular orbitals, or various custom 2D and 3D visualization services. A Medline CAS Number Search is available for those compounds that do have a CAS number associated with them. Using that as the search criterion, a search in the MEDLINE database via the PubMed service is conducted for any publication in which the compound was mentioned (and its CAS number given). This will obviously not be the case for all of the 127,000 compounds with a CAS number in the Open NCI Database. You can also request, both from the Hitlist and the Detail pane, that the DTP Ordering Form for Samples be called up for you and filled with the NSC number(s) your search produced.


Types of Information Available

Some of the types of information available for the compounds, such as [Molecular] Weight, [Molecular] Formula, Composition etc., need no explanation, at least not to the chemist. Others have already been explained in the Query Types section. Explanations are therefore mostly given for the remaining ones. The most complete set of information for each compound is displayed on the Detail pane.

Available Screening Data
This lists which, if any, of the various anti-cancer and anti-HIV screening assays of NCI have been performed for this compound.

# Catalyst Conformers
The number of conformers that the program Catalyst (MSI) was able to calculate within the specified energy range. This may give some indication of the flexibility of the compound, but is mostly important for the 3D pharmacophore searches.

Commercial Availability; Commercial Database Keys
We have cross-checked the Open NCI Database with a number of other large chemical databases (of small molecules), most of them being vendor catalogs or compilations of such catalogs. If we found a match, Commercial Availability is set to Yes; if we know of usable keys in these databases, we list them here. Obviously, to actually obtain the information in the third-party databases, you need to have a license (or other official access) for each one of them yourself.

Please be aware that this is not a continuously updated set of data. It may or may not be updated in the future. It is included here as a courtesy to the user without any further guarantee. (Please note that "ACD" in this field stands for "Available Chemicals Directory" [MDL], whereas in the Names field, "ACD[/Name 4.0]" stands for "Advanced Chemistry Development [Inc.]"; likewise, "WDI" here stands for "World Drug Index" [Derwent] and not "Wolf-Dietrich Ihlenfeldt" :-) )

Complexity
See above in the Query Types section.

Druglikeness
The drug likeness results were calculated with a program developed by Markus Wagener and Vincent J. van Geerestein at the Department of Molecular Design and Informatics, NV Organon, The Netherlands. It is described in "Potential Drugs and Nondrugs: Prediction and Identification of Important Structural Features" J. Chem. Inf. Comput. Sci. 2000, 40, 280-292 (abstract). We gratefully acknowledge this program being made available to us for free.

These are the results of predictions (based on decision trees) whether a compound is likely to be a drug (i.e. have a pharmacological effect). The algorithm was run with two parameter settings: 'std' is the standard setting, which is the optimum overall parameter set obtained from selectivity training. The 'neg' results were obtained with another parameter set which was optimized to avoid false negatives - i.e. the underlying assumption in this case is that it is comparatively cheap to test some extra compounds, but can cost millions if one misses an active compound.

File Record
The internal CACTVS database record number. This goes contiguously from 1 to currently 250,250.

Matched Conformer
This will be different from None only for 3D pharmacophore searches. This is the number of the conformer among the (up to) 25 Catalyst conformers that matched the 3D pharmacophore. Only one conformer will be listed here - if the search routine finds one match, it will not check whether additional conformers also match the pharmacophore.

Names
It is important to realize that names listed for compounds in this database come from two very different sources.

For over 220,000 structures, IUPAC names were calculated by the program ACD/Name from ACD Labs (Toronto, Canada), version 4.0. (We are aware of the fact that flawed names were generated for some structures in the program runs that generated the names for the present data set. They will be corrected in the next major update of the service.) The ACD names always show up at the top of the name list, and are marked by the addition of "(ACD/Name 4.0)". These names are strictly calculated from the computer structures (connection tables) that are publicly available.

For about 45,000 compounds, at least one name ("NCI name") was present in the original DTP files. In very many cases, there is more than one NCI name, sometimes more than a hundred, for the same compound. These NCI names comprise variants of the chemical name, common names, brand names, foreign names, even various catalog numbers etc. Have a look, e.g., at NSC # 2100 for an example of a compound with more than 100 names.

There are only a handful of structures that have an NCI name but no ACD name. But there are still about 24,000 structures that have no name at all.

Keeping these different name origins in mind, it is important to be aware of the fact that they may lead to apparent discrepancies within the name set for one compound. The reason is that the ACD names are derived from the computer structure, whereas the NCI name(s) are supposed to reflect the original sample that was submitted to NCI. As an example, NSC # 257454 will display as the chemical structure Daunorubicin, and list as the ACD name the correct, but lengthy (and little used) IUPAC name; one of the NCI names, however, is "Daunorubicin mixture with salmon sperm DNA." Obviously, the second part of this mixture didn't make it into the chemical structure description. This also happened for a lot of metal ions of organometallic complexes and salts (but not for all!). In a similar vein, a textual indication of stereochemistry in the NCI name(s) should usually describe the stereochemistry of the sample, whereas any stereochemistry expressed in the ACD name is based on the computer structure, which may or may not show the correct stereochemistry when compared with the sample. -- So which names are "better"? You have to decide this for yourself, by asking if, for your application, being closer to the sample or the computer structure is the more important aspect for you.

NSC Number
The NCI's sample accession number. All compounds in the NCI database have an NSC number. However, the same compound may be in, or part of, several different samples, and thus show up under several NSC numbers. See the FAQ for some more historical information about the term "NSC" [number].

Not all numbers in the currently spanned range, reaching from 1 to just over 700,000, are present in this data set. The "discreet" (non-open) compounds of the NCI Database are obviously not available here. Furthermore, large stretches of NSC numbers were set aside in the past but then never really used. Particularly the range 400000-600000 is sparsely populated.

WLN
Wiswesser Line Notation. One of the earliest chemical line notations, developed by William J. Wiswesser in the late 1940s and early 1950s. This linear notation system was designed to allow generation of unique and unambiguous representations of chemical structures by using 40 or so standard typewriter keys. It saw its most widespread use from the mid 1960s into the 1980s. It is not widely used any more nowadays. WLN strings have been incorporated for those compounds for which this information is available in the original DTP files. No new WLN strings are being, or will be, generated.

About the Database

Data Origin
The database contains 250,250 structures, which corresponds to the open part of the NCI database up to and including the latest release of the DTP cancer screen results of August 2000. The structures used to build the Enhanced NCI Database Browser are also available for bulk download from this site in various formats, including structure files with screening data added. The original structure data and screening results are all maintained by NCI's Developmental Therapeutics Program. Additional information and downloadable files (such as the Standard Agent Database and the Mechanism of Action Database) can be obtained from that site.

Database Size and Content
At the time of this writing, the database contains 250,250 open records. Every record contains at least the NSC number and the chemical structure. Records without a chemical structure, which exist in the NCI DIS system, have not been included.

The database in its first release contained 216,089 names (of 45,229 compounds) coming from the original DTP tables, 44,804 AIDS antiviral screening results, 1,886,719 GI50 tumor cell screen data rows, 1,889,077 LC50 tumor cell screen data rows, 1,890,137 TGI tumor cell screen data rows, 11,384 Level I yeast data rows, 45,900 Level II yeast data rows, and 122,631 CAS numbers from the original DTP sources.

The second release of the database was enhanced by the results of many additional computational procedures, for example by computing CORINA and MSI Catalyst 3D coordinates, CACTVS structural complexity values, Organon drug likeness descriptors, two different logP calculation schemes, and others. We have also added IUPAC names calculated by ACD/Labs for 220,292 structures, and 64,188,212 (!) PASS pharmacological activity predictions. We intend to develop this database into a benchmark of structural descriptors and data mining algorithms.


About the Software and Hardware

Required Browser Software
To access this server, you must use a JavaScript-capable browser. If you want to use the Java structure editor, your browser must also support Java. Both Java and JavaScript may have to be enabled manually from the browser configuration panel. Apart from that, any reasonably modern browser (Netscape and IE in version 4 or higher) should be able to use this service. Some display forms, especially the image gallery of hitlists, require the browser to display up to 100 GIF images in a table. You should have at least 32 MB of main memory to avoid the risk of the browser crashing with this particular display style. Also, some plug-ins (Chime, CosmoPlayer) have memory demands of their own.

Currently, PC-based Netscape versions are the only browsers that can use the live link into the ACD/Labs ILAB computation services.

Server Software Environment
This database was implemented exclusively using software of the CACTVS chemical structure processing toolkit. Secondary, derived information (GIF images etc.) is dynamically computed when the query is run. The CACTVS toolkit has extensive scripting capabilities, employing Tcl as its language core with sophisticated chemical command enhancements. All response pages are generated by a single, compact CACTVS/Tcl CGI script of about 3550 lines.

References

1. J.B. Hendrickson, P. Huang, A.G. Toczko, Molecular Complexity - A Simplified Formula Adapted to Individual Atoms. J. Chem. Inf. Comput. Sci. 27, 63-67 (1987); and
W.D. Ihlenfeldt, Computergestützte Syntheseplanung durch Erkennung synthetisch nutzbarer Möglichkeit von Molekülen. Dissertation, TU Munich 1991.

2. D. Weininger, A. Weininger, and J.L. Weininger, SMILES. 2. Algorithm for Generation of Unique SMILES Notation. J. Chem. Inf. Comput. Sci. 29, 97-101 (1989).

3. Wolf Dietrich Ihlenfeldt, Yoshimasa Takahashi, Hidetsugu Abe, S. Sasaki. Computation and management of chemical properties in CACTVS: An extensible networked approach toward modularity and compatibility. J. Chem. Inf. Comput. Sci. 34(1), 109-116 (1994).

4. Voigt JH, Bienfait B, Wang S, Nicklaus MC. Comparison of the NCI open database with seven large chemical structural databases. J. Chem. Inf. Comput. Sci. 41(3), 702-712 (2001).

5. Ihlenfeldt WD, Voigt JH, Bienfait B, Oellien F, Nicklaus MC. Enhanced CACTVS browser of the Open NCI Database. J. Chem. Inf. Comput. Sci. 42(1), 46-57 (2002).

6. Poroikov VV, Filimonov DA, Ihlenfeldt WD, Gloriozova TA, Lagunin AA, Borodina YV, Stepanchikova AV, Nicklaus MC. PASS biological activity spectrum predictions in the enhanced open NCI database browser. J. Chem. Inf. Comput. Sci. 43(1), 228-236 (2003).


Contact

This service was implemented by Wolf-Dietrich Ihlenfeldt in the course of a continuing collaboration with the CADD Group of the Laboratory of Medicinal Chemistry, Center for Cancer Research, NCI-Frederick, NIH, Frederick, USA, headed by Marc C. Nicklaus. The support of many collaborators is gratefully acknowledged.

You are welcome to mail me (WDI) and/or Marc Nicklaus with comments, questions, suggestions and bug reports.

Last change: 2004-09-09