Example 1: Similarity Searching on an Active
As an example, we choose Taxol as the drug molecule for which we want to find similar compounds in the NCI database that may have potential drug activity, and want to learn more about them if literature is available.
Example 2: Download all structures in the database that have CAS numbers associated with them and that have no stereogenic centers, and include names and SMILES strings for all of them.
This is a dataset that may be useful for further processing in various contexts. Exclusion of all compounds that possess one or more chiral center and/or one or more E/Z double bond will ensure that this dataset contains no possibly wrong stereoisomers.
It is a good idea - for this search as well as in general - to first gain an impression of how large a hit list you may get from your search.
Example 3: Screening the NCI database for potential new structural motifs, by querying with predicted activities and other criteria.
As an example, we want to obtain some new structural ideas for potential HIV protease inhibitors. We want to limit ourselves to molecules that have not yet been tested in the NCI anti-HIV screen, are of drug-like size, and avoid certain very common classes of compounds against this target.
The field below the navigation bar is a status area. After you have started an operation, information about the status of that operation will be displayed in that area. After a few seconds, the window will automatically switch back to display the global database status.
The actual content (i.e. input forms, query results, visualizations) will appear in the third and largest area in the lower part of the browser window.
Also, do not use the Back and Forward buttons of your browser instead of the navigation bar buttons. This will work in many instances, but can have unexpected consequences, such as resetting the entire session (especially if you accidentally back or forward out of this site).
After filling in all relevant form elements, press one of the buttons labeled Start Search. Depending on the selected output format and whether any records meeting your criteria were found, an answer page or a structure file is generated. If you selected the output format Simply Count Hits (Entire DB), you will receive the resulting count in the status window.
The order of the queries is not important. The database will optimize it and use, for example, fast queries for the presence of a data field to filter those records which are submitted to a more demanding substructure match procedure.
The currently highest NSC number is just above 700,000. If you don't find an entry for a given NSC number within this range, this can have two reasons: First, you may have hit on one of the non-open ("discreet") compounds in the NCI Database; secondly, large stretches of NSC numbers were set aside in the past but then never really used. Particularly the range 400000-600000 is sparsely populated.
Please be aware that only about half of the compounds in the database have a CAS number associated with them. This does not necessarily mean that they do not possess a CAS number; it just means that none was entered when the compound was originally keyed in. On the other hand, there definitely are compounds in the database that truly do not have any CAS number. Many of the samples that NCI received were, e.g., from ongoing research projects, and these compounds were not necessarily published or patented - so they may never have entered the Chemical Abstract Registry.
It is also possible to use sums and differences of elements. For example, the query C4(F+Cl+Br+I)2 will retrieve all C4-compounds with any combination of exactly two halogens.
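To make the meaning of such a sum term concrete, the following minimal Python sketch checks an element composition against the two conditions stated for C4(F+Cl+Br+I)2: exactly four carbon atoms, and exactly two halogen atoms in any combination. The function name and the dictionary input are hypothetical, and the sketch assumes that elements not named in the query (such as H) are unconstrained; it illustrates the logic only and is not the server's query parser.

```python
from collections import Counter

# The four elements summed in the example query C4(F+Cl+Br+I)2.
HALOGENS = ("F", "Cl", "Br", "I")


def matches_c4_two_halogens(element_counts):
    """Return True for exactly 4 C atoms and exactly 2 halogen atoms in total.

    Assumption: elements not named in the query (e.g. H) are ignored.
    """
    counts = Counter(element_counts)  # missing elements count as 0
    halogen_total = sum(counts[symbol] for symbol in HALOGENS)
    return counts["C"] == 4 and halogen_total == 2


# One Cl plus one Br satisfies the sum term just as two Cl atoms do.
print(matches_c4_two_halogens({"C": 4, "H": 8, "Cl": 1, "Br": 1}))
print(matches_c4_two_halogens({"C": 4, "H": 7, "Cl": 3}))
```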
The definitions of donor and acceptor atoms and rotatable bonds are somewhat flexible, but should match common practice. If in doubt, extend the range and see whether you get extra hits which are interesting. The rotatable bond count excludes all bonds around which rotation is possible but has no major impact on the shape of the molecule. For example, all terminal or linear bonds are excluded.
Please be aware that the definition of flexibility that underlies the rotatable bonds count here, and the definition of flexibility that was used by the program Catalyst (MSI) in calculating the conformers whose number is reported in the Detail window (and which can be searched for with the 3D pharmacophore search) have nothing to do with each other. Issues like terminal groups, hydrogens, large ring flexibilities play a role here. You may therefore encounter cases where the number of rotatable bonds (CACTVS) vs. the number of conformers (Catalyst) seems non-intuitive if not inconsistent.
In combination with the Simply Count Hits (entire DB) output format, you can use this feature to quickly derive all kinds of statistics on our dataset.
If you select the Query Type PASS Prediction Range..., a popup window will appear that will allow you to select the activity for which you want to search. There should be a scroll bar at the right side of the list. If your browser doesn't show it, enlarge the popup window manually. At the top of the window, you can select the Query Probability type to search for, Activity or Inactivity. You can only select one activity or inactivity at a time. If you want to conduct combined searches, such as "activity [probability] > 0.8" AND "inactivity [probability] < 0.2", you have to use separate query input lines in the Query Form. Since the predictions are calculated as probabilities, you have to use number ranges between 0.0 and 1.0.
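If you have exported PASS predictions and want to apply the same kind of combined criterion offline, the logic of two query lines joined by AND can be sketched as follows. This is only an illustration under assumptions: the function name, the parameter names, and the default thresholds (taken from the example above) are hypothetical and do not correspond to any interface of this service.

```python
def passes_combined_filter(p_activity, p_inactivity, min_activity=0.8, max_inactivity=0.2):
    """Mimic two query lines joined by AND: activity > 0.8 AND inactivity < 0.2.

    Both inputs are probabilities and must lie between 0.0 and 1.0.
    """
    for value in (p_activity, p_inactivity):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"probability out of range: {value}")
    return p_activity > min_activity and p_inactivity < max_inactivity


# A compound predicted active with high probability and inactive with low probability.
print(passes_combined_filter(0.91, 0.05))
```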
Individual PASS activity and inactivity predictions can be selected as fields to be exported through the Data Retrieval functionality of the Hitlist pane.
We have observed the possibility, under certain circumstances, that the PASS selection popup window will not come up any more, even if you reset the Query Form. If this happens to you, simply re-enter the server URL (e.g. http://cactus.nci.nih.gov for the U.S. mirror) in your web browser window, and start the search session from there anew.
It is obviously totally impossible for us to test even a small subset of these predictions for all the NCI compounds ourselves. If you use this feature of our service, we would therefore be interested in hearing about success (and also not-so-great-success) stories, which we would compile and post, e.g., on this server. This need not include disclosure of individual compounds tested, but merely of the success rate of the predictions for the activity analyzed. You can e-mail Marc Nicklaus and/or Prof. Vladimir Poroikov with results and questions.
In this vein, we want to emphasize that these values are predictions, to which all the usual caveats pertinent to QSAR-type calculations should be applied. The user should never make the mistake of assuming that a specific prediction for a single compound means that this molecule has this activity. The PASS predictions can only be responsibly used in a statistical manner for sets of compounds, and should be treated as scientific "food for thought."
As a third option, you can start a Java editor by clicking on the Start Editor button below any input field. You must use a WWW browser with Java support (Netscape, Internet Explorer) for this to work, and you must have Java enabled, which is an option in the browser configuration panel. The input frame will switch to the editor panel. Read the editor instructions to learn how to use the program. Structures are exported as SMILES strings from the editor by clicking on the Transfer to Form button on the editor panel, or by using the navigation bar to switch back to the query input panel. The editor remains associated with the last input field where you pressed the Start Editor or Transfer to Form buttons. If your current search option is not structure-based, the query method will automatically change to substructure search upon structure import.
Sometimes the best combination of methods to input a query structure may be to first draw the general structure in the Editor, and then edit the resulting SMILES string manually in the Query Data Value field, especially if you want to add SMARTS extensions etc. (see Supported SMILES Features below).
We thank Peter Ertl from Novartis Crop Protection AG for kindly allowing us to use this remarkable applet.
Please note that throughout the service, Unique SMILES (USMILES) is used for SMILES output. However, this is the original USMILES definition by Daylight Chemical Information Systems, Inc. of 1989. The canonicalization rules have been changed by Daylight in the meantime, but these changes, to our knowledge, have not been published. Internally, USMILES may or may not be used. For example, the JME Editor does not create USMILES when you draw and transfer a structure. Of course, if you compose a SMILES search string by hand, or edit a SMILES string, you are not required to use USMILES - any valid SMILES string will be recognized and accepted.
Most of the ISIS query syntax is available - including all 3D query methods. R-group search is only partially supported - simple positional variations of substituents do work, but not complex R-group logic.
First, you have a choice whether matched substructures should be highlighted in the displayed result structures or not. Highlighting applies both to 2D plots and 3D displays in Chime or as VRML file. Note that, if multiple substructures are combined by an OR statement, only the first successful substructure match is actually performed on that record and subsequently displayed, even if additional fragments would also match. Highlighting is activated by default.
If you allow multi-fragment overlap, substructures which consist of disconnected fragments may overlap when matching the target structures. By default they will not, so that if you specify two nitro groups as substructure, only compounds with two or more nitro groups are found. Note that this feature applies only to substructures which were entered in a single input field as an entity. If you specify two substructures on two different fields, their match relationship is not influenced by the setting of this switch.
The third option is whether to suppress the matching of aromatic bonds on plain single or double bonds with no auxiliary attributes. By default, aromatic bonds will match such bonds, provided that no other attributes (such as 'not in a ring') prevent the match. If you desire the behavior of NCI's older DIS system, which will match aromatic bonds in the database structures only on aromatic bonds in your query, you should activate this switch.
Finally, the option for the enforcement of ring embedding equality means that the ring count of the bonds of a substructure must match the ring count of the database structures. If this switch is on, a simple phenyl fragment will not match naphthalene (only benzene, or biphenyl). Also, it implies that all bonds in your substructure which are not in a closed ring can only match non-ring bonds in the database molecules. The same effect could be achieved by explicitly specifying for each bond that it must not be in a ring, but this global option is often more convenient.
To prepare a query for a 3D pharmacophore search, you can either create a query file externally and submit it to this service, or you can use the Local Query Parameters area of the Editor pane. The first possibility is probably the somewhat easier way at this time to enter more complex queries.
To create a query file, you can use programs such as Catalyst or ISIS/Draw etc. and generate a file in .mol format. Most of the additional features in query files are supported, such as exclusion spheres, centroids, points on lines, angles, planes... Once you have this file available on the machine from which you started the Browser, go to the bottommost query line, select the option Substructure and/or 3D Search..., click on the Browse button to the right of it, and select the query file on your machine. Then start the search.
To generate a query, proceed along the lines of the following examples.
From any of the query input lines, call up the Editor pane. To generate a query that consists of a triangle of oxygen atoms,
1. select O from the list of elements, place it on the drawing area;
2. click on the NEW button at the top of the JME Structure Editor;
3. place another O atom;
4. repeat steps 2 and 3;
5. click on the 123 button;
6. click on the three placed O atoms: this will generate atom numbers;
7. in the Local Query Parameters area, enter "1 2" in the topmost Atoms field;
8. in the Value Range field below it, enter (e.g.) "2.5-3.5";
9. repeat steps 7 and 8 with the values (e.g.) "2 3", "3.5-4.5" and "1 3", "4.5-5.5";
10. click on the button (below the Editor area) Transfer to Query Form.
Now, in the query line you used, you should see, in the Query Data Value field, the entry "[OH2:1].[OH2:2].[OH2:3]". This would search for three water molecules -- which is probably not what you want. (The Editor automatically adds hydrogens to all unfilled valences.) Go into this field, and manually edit out the hydrogens, so that you have the string "[O:1].[O:2].[O:3]". Now start the search (after possibly adding other search criteria). The constraints you specified are transferred to the search engine behind the scenes.
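If you prepare such query strings outside the browser, the manual edit described above can also be done with a small script. The following Python sketch strips the hydrogen count that precedes the atom number inside each bracket atom. The function name is hypothetical, and the regular expression only handles the simple pattern shown here (element symbol, hydrogen count, atom number); bracket atoms with charges or other attributes are left unchanged.

```python
import re

# Inside a bracket atom, match an element symbol followed by a hydrogen count
# (H, H2, H3, ...) that sits directly before the ':<atom number>]' part.
_HYDROGEN_COUNT = re.compile(r"(\[[A-Z][a-z]?)H\d*(?=:\d+\])")


def strip_editor_hydrogens(query):
    """Remove editor-added hydrogen counts from numbered bracket atoms."""
    return _HYDROGEN_COUNT.sub(r"\1", query)


print(strip_editor_hydrogens("[OH2:1].[OH2:2].[OH2:3]"))  # [O:1].[O:2].[O:3]
```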
You should make sure that the ensemble of constraint values you're entering amounts to a meaningful 3D arrangement of atoms. For example, the values used above are a triangle with side lengths of 3, 4, and 5 Angstroms, resp., with a 0.5 Angstrom tolerance for each side. Values of 3, 4, and 10 Angstrom, on the other hand, do not produce a valid triangle, and thus do not result in any hits.
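For a three-point query, this consistency requirement is the triangle inequality. A quick check can be sketched in Python as follows: three distance ranges can describe a triangle only if the lower bound of each side is smaller than the sum of the upper bounds of the other two sides. The function names are hypothetical, and the sketch is a plausibility test you can run before submitting a query, not a description of how the search engine evaluates constraints.

```python
def parse_range(text):
    """Parse a Value Range entry such as '2.5-3.5' into (low, high) floats."""
    low, high = (float(part) for part in text.split("-"))
    if low > high:
        raise ValueError(f"lower bound exceeds upper bound: {text}")
    return low, high


def ranges_allow_triangle(range_a, range_b, range_c):
    """Return True if some choice of side lengths within the ranges forms a triangle."""
    sides = [parse_range(r) for r in (range_a, range_b, range_c)]
    for i, (low, _) in enumerate(sides):
        # Largest possible sum of the two other sides.
        others_max = sum(high for j, (_, high) in enumerate(sides) if j != i)
        if low >= others_max:
            return False
    return True


# The 3-4-5 example from the text, with a 0.5 Angstrom tolerance per side.
print(ranges_allow_triangle("2.5-3.5", "3.5-4.5", "4.5-5.5"))   # True
# Sides of roughly 3, 4, and 10 Angstroms cannot close a triangle.
print(ranges_allow_triangle("2.5-3.5", "3.5-4.5", "9.5-10.5"))  # False
```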
Once you have obtained hits from your search, the best way to view the results is probably to choose, from the Detail pane, the Visualization option Chime Display/All Conformers. This will show you all conformations calculated by Catalyst, with the one that was found to match the 3D query highlighted by a light red background. (Once the search algorithm has found one match for a molecule, it will not look for additional conformers that could potentially also match the query.) Superimposition of the query onto the displayed conformers is planned for the future but not yet implemented.
This capability is not a replacement for full-fledged, dedicated 3D pharmacophore search programs. One of its main limitations is obviously that it doesn't allow one to conduct any conformational search on-the-fly -- there is only a fixed set of pre-calculated conformers available. On the other hand, this allows for very rapid searching -- few of the more sophisticated programs will return a hit set from a 250,000-compound database within a few seconds.
To conduct a background search, click on the check box to the left of the option Background job named on the Query pane, and enter a name so that you will be able to retrieve your hit set after the search has finished. You do this by going to the Hitlist Management pane (click on the List Mgr button on the Navigation Bar). If your search is complicated, it may take a while for the results to become available on the Hitlist Management pane. If you go there too early, and don't see your results yet, use the Reload button of the Hitlist Management pane after a short while. (Do not use the Reload button of your web browser - this will cause you to lose all your previously entered Query data.) Remember that you have to be logged in in order to use the Hitlist Manager, and in order to log in, you need to have registered - so it's a good idea to complete these steps before you start a background search, which otherwise will be denied.
For background searches, you can enter maximum numbers of hits greater than 10,000, all the way up to 99,999; or you can blank out this field, which means that you set the maximum number to "infinite", i.e. the entire database.
Since you can specify, in the Data Retrieval section of the Hitlist pane, additional fields to be added to the output file (for those formats that allow this, e.g. SDF), background searches are actually a more flexible way of downloading the entire database with exactly the data you want than downloading the bulk files from our Download Page.
The following is an example of how to download the entire database in chunks of 25,000 compounds. Set the Max. number of hits accordingly, i.e. to at least 25000. Select Record Number(s) as the Query Type, and enter 1-25000 as the Query Data Value. Wait a few seconds for the search to complete. (The Output Format does not matter at this point.) The resulting hitlist is now stored on our server, so you have to log in by clicking the List Mgr button on top. In the Hitlist Manager, select the Selection (hit set) you just created, and click on Retrieve. This will automatically get you to the Hitlist pane. Only 250 structures will actually be displayed at a time (you can request more blocks of 250 to be displayed), but they're all there. In the Data Retrieval section, select the appropriate fields you want to be included in the SD file (such as CAS Number, SMILES String etc.). Then click on the Retrieve button, and save the file with a name of your choice on your computer. Repeat these steps for Record Numbers 25001-50000 etc., until you have downloaded the entire database.
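To keep track of which Record Number ranges you still have to enter, you can generate the list of Query Data Values beforehand. The following Python sketch does this for a given chunk size; the function name is hypothetical, and the total record count passed in is an assumption you have to supply yourself (for example from a Simply Count Hits query), since it is not fixed by this example.

```python
def record_ranges(total_records, chunk_size=25000):
    """Return Query Data Values such as '1-25000', '25001-50000', ... covering all records."""
    if total_records < 1 or chunk_size < 1:
        raise ValueError("total_records and chunk_size must be positive")
    ranges = []
    for start in range(1, total_records + 1, chunk_size):
        end = min(start + chunk_size - 1, total_records)  # last chunk may be shorter
        ranges.append(f"{start}-{end}")
    return ranges


# Hypothetical total of 60,000 records, for illustration only.
for value in record_ranges(60000):
    print(value)
```

Each printed value corresponds to one pass through the steps above: one search, one retrieval from the Hitlist Manager, and one saved file.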
Many structure file formats can be directly written to file, if desired. For statistical purposes, simply counting the matches is an option selectable from the Query Form (Simply Count Hits (entire DB)). It is also often a good idea to conduct your search first with this output format selected if you suspect that your query will result in a very large number of hits.
However, for most searches you will probably initially want to use the default output format, which is an HTML table (see above) with a few random sample structures from the result set, displayed in the Hitlist pane. If you are confident that the number of hits will not be too high, you can also select complete 2D structure rendering as GIF image gallery or Chime display. In order not to overload your web browser's cache, the maximum number of structures displayed at a time, in both the Hitlist pane and the Display pane (when displaying, e.g., a GIF Image Gallery), is limited to 250. You can go through larger hit sets in chunks of 250 structures at a time.
Most output format names in the list should be self-explanatory. They are either widely known (such as PDB), or should be familiar to people working in specific fields (such as JCAMP/CS, a format used in spectroscopy). If in doubt, just try it out.
Please be aware that the format Cactvs Hitlist (records) is not a list of NSC numbers but will produce a list of the internal record numbers in the CACTVS database. This is rarely useful for external users of the database. The one major exception is if you want to exchange hit lists with other users. Conduct your search as desired. Export your hit list in the format Cactvs Hitlist (records). Send the saved file to the person you want to share it with. They can then use the Query Type Merge/Upload Hitlist in the bottommost Query row to re-import this hit list. If no other search fields are used, this list will simply be loaded; otherwise, the search will be conducted on this subset of structures.
A more detailed overview page, called the image gallery, is also accessible from this page. It contains, in addition to the information listed on this page, a structure plot and information about AIDS antiviral screening and tumor cell line screening results. It is displayed in the Display pane.
The leftmost column of the compound listing contains checkboxes. These checkboxes control the structure export from this page in multi-record formats (SD-File, CACTVS/Binary, SMILES, GIF and Chime galleries). Only structures which are checked are transferred or depicted. You can toggle the current selection by clicking on the Invert Selection button at the bottom of the page. By default, all structures are selected. The maximum size of a hitlist page is 1000 structures, that of an image gallery 100 structures. On very slow computers, the display of the 1000-row table can require over ten seconds. Having one hundred images on a page is even more demanding for antique hardware. Do not attempt this from a computer with 8 MB of RAM (or less... :-)).
Please be aware that this is not a continuously updated set of data. It may or may not be updated in the future. It is included here as a courtesy to the user without any further guarantee. (Please note that "ACD" in this field stands for "Available Chemicals Directory" [MDL], whereas in the Names field, "ACD[/Name 4.0]" stands for "Advanced Chemistry Development [Inc.]"; likewise, "WDI" here stands for "World Drug Index" [Derwent] and not "Wolf-Dietrich Ihlenfeldt" :-) )
These are the results of predictions (based on decision trees) whether a compound is likely to be a drug (i.e. have a pharmacological effect). The algorithm was run with two parameter settings: 'std' is the standard setting, which is the optimum overall parameter set obtained from selectivity training. The 'neg' results were obtained with another parameter set which was optimized to avoid false negatives - i.e. the underlying assumption in this case is that it is comparatively cheap to test some extra compounds, whereas missing an active compound can cost millions.
For over 220,000 structures, IUPAC names were calculated by the program ACD/Name from ACD Labs (Toronto, Canada), version 4.0. (We are aware of the fact that flawed names were generated for some structures in the program runs that generated the names for the present data set. They will be corrected in the next major update of the service.) The ACD names always show up at the top of the name list, and are marked by the addition of "(ACD/Name 4.0)". These names are strictly calculated from the computer structures (connection tables) that are publicly available.
For about 45,000 compounds, at least one name ("NCI name") was present in the original DTP files. In very many cases, there is more than one NCI name, sometimes more than a hundred, for the same compound. These NCI names comprise variants of the chemical name, common names, brand names, foreign names, even various catalog numbers etc. Have a look, e.g., at NSC # 2100 for an example of a compound with more than 100 names.
There are only a handful of structures that have an NCI name but no ACD name. But there are still about 24,000 structures that have no name at all.
Keeping these different name origins in mind, it is important to be aware of the fact that they may lead to apparent discrepancies within the name set for one compound. The reason is that the ACD names are derived from the computer structure, whereas the NCI name(s) are supposed to reflect the original sample that was submitted to NCI. As an example, NSC # 257454 will display as the chemical structure Daunorubicin, and list as the ACD name the correct, but lengthy (and little used) IUPAC name; one of the NCI names, however, is "Daunorubicin mixture with salmon sperm DNA." Obviously, the second part of this mixture didn't make it into the chemical structure description. This also happened for a lot of metal ions of organometallic complexes and salts (but not for all!). In a similar vein, a textual indication of stereochemistry in the NCI name(s) should usually describe the stereochemistry of the sample, whereas any stereochemistry expressed in the ACD name is based on the computer structure, which may or may not show the correct stereochemistry when compared with the sample. -- So which names are "better"? You have to decide this for yourself, by asking if, for your application, being closer to the sample or the computer structure is the more important aspect for you.
Not all numbers in the currently spanned range, reaching from 1 to just over 700,000, are present in this data set. The "discreet" (non-open) compounds of the NCI Database are obviously not available here. Furthermore, large stretches of NSC numbers were set aside in the past but then never really used. Particularly the range 400000-600000 is sparsely populated.
The database in its first release contained 216,089 names (of 45,229 compounds) coming from the original DTP tables, 44,804 AIDS antiviral screening results, 1,886,719 GI50 tumor cell screen data rows, 1,889,077 LC50 tumor cell screen data rows, 1,890,137 TGI tumor cell screen data rows, 11,384 Level I yeast data rows, 45,900 Level II yeast data rows, and 122,631 CAS numbers from the original DTP sources.
The second release of the database was enhanced by the results of many additional computational procedures, for example by computing CORINA and MSI Catalyst 3D coordinates, CACTVS structural complexity values, Organon drug likeness descriptors, two different logP calculation schemes, and others. We have also added IUPAC names calculated by ACD/Labs for 220,292 structures, and 64,188,212 (!) PASS pharmacological activity predictions. We intend to develop this database into a benchmark of structural descriptors and data mining algorithms.
Currently, PC-based Netscape versions are the only browsers which can use the live link into the ACD/Labs ILAB computation services.
2. D. Weininger, A. Weininger, and J.L. Weininger, SMILES. 2. Algorithm for Generation of Unique SMILES Notation. J. Chem. Inf. Comput. Sci. 29, 97-101 (1989).
3. Wolf Dietrich Ihlenfeldt, Yoshimasa Takahashi, Hidetsugu Abe, S. Sasaki. Computation and management of chemical properties in CACTVS: An extensible networked approach toward modularity and compatibility. J. Chem. Inf. Comput. Sci. 34(1), 109-116 (1994).
4. Voigt JH, Bienfait B, Wang S, Nicklaus MC. Comparison of the NCI open database with seven large chemical structural databases. J. Chem. Inf. Comput. Sci. 41(3), 702-712 (2001).
5. Ihlenfeldt WD, Voigt JH, Bienfait B, Oellien F, Nicklaus MC. Enhanced CACTVS browser of the Open NCI Database. J. Chem. Inf. Comput. Sci. 42(1), 46-57 (2002).
6. Poroikov VV, Filimonov DA, Ihlenfeldt WD, Gloriozova TA, Lagunin AA, Borodina YV, Stepanchikova AV, Nicklaus MC. PASS biological activity spectrum predictions in the enhanced open NCI database browser. J. Chem. Inf. Comput. Sci. 43(1), 228-236 (2003).
You are welcome to mail me (WDI) and/or Marc Nicklaus for comments, questions, suggestions and bug reports.
Last change: 2004-09-09