OSRA: Optical Structure Recognition

1. Description
2. Dependencies
3. Other acknowledgements
4. Compilation
5. Usage
6. License
7. Download
8. Web Interface
9. Author

Description:
OSRA is a utility designed to convert graphical representations of chemical structures, as they appear in journal articles, patent documents, textbooks, trade magazines etc., into SMILES (Simplified Molecular Input Line Entry Specification, see http://en.wikipedia.org/wiki/SMILES), a computer-readable molecular structure format. OSRA can read a document in any of the more than 90 graphical formats parseable by ImageMagick (including GIF, JPEG, PNG, TIFF, PDF, PS etc.) and generate the SMILES representation of the molecular structure images encountered within that document.
Note that any software designed for optical recognition is unlikely to be perfect, and the output produced might, and probably will, contain errors, so curation by a human knowledgeable in chemical structures is highly recommended.


Dependencies:
OSRA needs the following Open Source libraries installed:

Other acknowledgements:
OSRA also makes use of the following software (you do not need to install it separately; it is included in the distribution):

Compilation:
Edit the included Makefile to make sure you have the correct locations for ImageMagick, potrace, gocr, openbabel, and tclap. Running make should then generate the executable, osra.

Usage:
OSRA can process the following types of images:

Some common abbreviations, heteroatoms, fused and merged atomic labels, hash and wedge bonds, and bridge bonds are currently recognized. Formal charges, isotopes, and some element symbols are not; iodine, for example, is missed because "I" looks too much like a straight line, which is read as a single bond.

Command-line options:
./osra --help
will give you a list of available options with short descriptions.

Most common use: ./osra [-r <resolution>] <filename>
The resolution is given in dpi and defaults to 300 (unless it's a PS or PDF file as mentioned above); filename is the name of your image file (or PS/PDF document).

Other options:
-t, --threshold: Gray level threshold, default is 0.2 for black-and-white images,
-n, --negate: Inverts colors (for white on black images),
-o, --output: Sets a prefix for writing recognized images to files, e.g. "-o tmp" will create files tmp0.png, tmp1.png... for each of the structures,
-s, --size: Resize images on output - can be useful for running OSRA as a backend for a webservice. Example: "-s 300x400".
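
When OSRA is driven from a script (for instance as a backend for a webservice), the options above can be assembled programmatically. The following Python sketch is only an illustration and not part of the OSRA distribution: the function names build_osra_command and find_output_images are hypothetical, the executable is assumed to be ./osra in the current directory, -n is assumed to be a plain switch that takes no value, and the output prefix is assumed to be a bare file name without a directory part.

```python
import re
from pathlib import Path
from typing import List, Optional


def build_osra_command(
    filename: str,
    resolution: Optional[int] = None,
    threshold: Optional[float] = None,
    negate: bool = False,
    output_prefix: Optional[str] = None,
    size: Optional[str] = None,
    executable: str = "./osra",  # assumption: osra was built in the current directory
) -> List[str]:
    """Assemble an argument list using only the options listed in this README."""
    cmd = [executable]
    if resolution is not None:
        cmd += ["-r", str(resolution)]  # resolution in dpi
    if threshold is not None:
        cmd += ["-t", str(threshold)]   # gray level threshold
    if negate:
        cmd.append("-n")                # assumption: a switch with no value
    if output_prefix is not None:
        cmd += ["-o", output_prefix]    # prefix for the recognized-structure images
    if size is not None:
        cmd += ["-s", size]             # e.g. "300x400"
    cmd.append(filename)                # the image or PS/PDF document comes last
    return cmd


def find_output_images(prefix: str, directory: str = ".") -> List[Path]:
    """Return files named <prefix><N>.png in numeric order (tmp0.png, tmp1.png, ...)."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)\.png")
    numbered = []
    for path in Path(directory).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    # Sort by the numeric suffix so that tmp10.png follows tmp2.png.
    return [path for _, path in sorted(numbered)]


if __name__ == "__main__":
    # Build the command for the sample image at 300 dpi and show it.
    print(build_osra_command("patent.gif", resolution=300, output_prefix="tmp"))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to invoke OSRA; afterwards find_output_images("tmp") lists whatever image files the -o option wrote. Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual file names.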


License:
This program is free software; the part of the software that was written at the National Cancer Institute is in the public domain. This does not preclude, however, that components such as specific libraries used in the software may be covered by specific licenses, including but not limited to the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version; which may impose specific terms for redistribution or modification.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. See also http://www.gnu.org/.

See the file COPYING for details.


Download:
OSRA is Free and Open Source Software. You are welcome to download and use it, provided that you understand the terms described above. Participation in the development is highly encouraged!
    osra-0.9.9.tgz - Build system upgraded to allow linking to the newer versions of gocr (0.45). A Windows executable is also available for download.
    osra-0.9.8.tgz - Added recognition of old-style aromatic rings with heteroatoms.
    osra-0.9.7.tgz - Improved recognition of color and low-res images.
    osra-0.9.6.tgz - Introduced automatic resolution detection.
    osra-0.9.5.tgz - Source code modified to facilitate compiling with MinGW for the Windows platform.
    osra-0.9.4.tgz - Added old-style benzene ring recognition.
    osra-0.9.3.tgz - Added rudimentary formal charge recognition.
    osra-0.9.2.tgz - Improved handling of hash and wedge bonds.
    osra-0.9.1.tgz - Slightly improved handling of 72 dpi color images.
    osra-0.9.tgz - Original public release.
We also welcome your feedback: send your comments, suggestions, criticism, or praise to the contact email address below.

Web Interface:
To demonstrate the capabilities (and limitations) of OSRA we have created the following web interface:
OSRA Web Interface
Try this sample image from the US Patent Office website first: patent.gif. Use a resolution of 300 dpi.

Author:
Igor Filippov, igorf(AT)helix.nih.gov
2007, SAIC-Frederick, NCI-Frederick, NIH, DHHS, Frederick, MD