OSRA: Optical Structure Recognition Application

Description

OSRA is a utility designed to convert graphical representations of chemical structures, as they appear in journal articles, patent documents, textbooks, trade magazines etc., into SMILES (Simplified Molecular Input Line Entry Specification - see http://en.wikipedia.org/wiki/SMILES) or SD files - a computer recognizable molecular structure format. OSRA can read a document in any of the over 90 graphical formats parseable by ImageMagick - including GIF, JPEG, PNG, TIFF, PDF, PS etc., and generate the SMILES or SDF representation of the molecular structure images encountered within that document.

Note that any software designed for optical recognition is unlikely to be perfect, and the output produced might, and probably will, contain errors, so curation by a human knowledgeable in chemical structures is highly recommended.

News

Note: The last version of OSRA developed at NCI was OSRA 1.4.0. Development has continued, and a major rewrite, OSRA II, is available at https://sourceforge.net/projects/osra/ . This information and link is provided purely as a service to the user; however, OSRA II is neither supported at, nor endorsed by, the CADD Group and NCI.

An updated USPTO validation test set is available courtesy of Aniko Valko and Keymodule Ltd., UK. The ground truth molfiles have been corrected and invalid images have been removed. It is available here.
OSRA 1.4.0 is out. The new features include reaction recognition. Accelrys Draw plugin has been updated to work with reactions as well.
OSRA 1.3.9 has been released. Added single molecule MOL format output, improved Java (JNI) interface, significantly improved Accelrys Draw plug-in (James Jack). Added osra-pdf script for more thorough PDF processing.
OSRA 1.3.8 is out! New capabilities include adaptive thresholding for low-light, low-contrast images, output of structure box coordinates, output of InChI, InChI-key or SMILES as SD file properties, new version of Symyx Draw plugin (now requires Symyx Draw 4.0), and the library version of OSRA which is possible to operate from Java through JNI.
OSRA 1.3.7 is released - new build system courtesy of Dmitry Katsubo and other enhancements.
A subset of 450 images from the Japanese Patent Office Chem-Infty dataset containing only organic molecules can be downloaded here: images and ground truth. This subset is distributed by permission from the original Chem-Infty dataset authors Koji Nakagawa, Akio Fujiyoshi, and Masakazu Suzuki. This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.1 Japan License.
You can find my short paper (not peer-reviewed) written for the Document Analysis Systems workshop (June 9th-11th, Boston, MA) here.
A large validation set consisting of images and associated MOL files is now available for download. This set was produced from the US Patent Office Complex Work Units and contains one structure per image, ground truth MOL files and a simple script to benchmark the results of your chemical structure recognition software. The benchmark script takes two arguments: first the folder with the ground truth files ("molfiles"), and second the folder with your generated files. The filenames of individual structures should be identical in both folders. The script compares the structures based on standard InChI. This validation set was made possible courtesy of collaboration with Dr. Steve Boyer and Dr. John Kinney.
OSRA 1.3.6 - Updated superatom dictionary, added an option to print out average bond length as measured from the image, modified table detection routine, added an adaptive filter for images with too long or too short bonds.
OSRA 1.3.5 - Improved memory handling and file read/write error handling. Added command-line option to save output to a file. Added call to InitializeMagick() (required for GraphicsMagick 1.3.8). Added multi-threading support (OpenMP) for processing many-page documents (Linux/Unix only, does not yet function on Windows).
OSRA 1.3.4 - Improved table recognition. Fixed a bug affecting JPEG image processing on Windows. Fixed a bug affecting double bond detection on Windows. Added optional pre-processing with Unpaper routine (-u command line option). Added support for libocrad-0.19.
OSRA 1.3.3 - the tables (boxes) around the structures are detected and removed prior to processing. Added -R (--rotate) command line switch to rotate the image. Modified debug output (-d option) to show the output from superatom dictionary.
OSRA 1.3.2 is out. The speed is further improved (by a factor of 2-3x) by replacing ImageMagick libraries with GraphicsMagick. Also fixed spontaneous slowdowns on the Windows platform.
OSRA 1.3.1 is out. Improved speed and various bug fixes. PDF processing now honors "--resolution" and "-r" command line options.
You can get higher quality results (at the expense of slower speed) running with the following command line: osra -r 300 -f sdf file.pdf
Also, you can see the page number for structures from PDF documents with "-e" or "--page" option.
OSRA 1.3.0 is now available. New features include:
- OS X install package - one click to install OSRA on a Mac,
- A plugin for ChemBioDraw from ChemBioOffice 2010 (requires ChemScript 12.0) - converts images from the clipboard into ChemBioDraw editor molecular objects,
- Better recognition of high resolution images (above 300 dpi),
- Improved Symyx Draw plugin.
I have presented the new algorithm used by OSRA for text/graphics separation at GREC 2009. It is the first paper in session 4; you can find it here in "Proceedings".
Version 1.2.2 is out. Superatom labels can now be edited by users - superatom.txt contains the SMILES strings for each recognized label and spelling.txt contains spelling variants of every label for cases where the OCR engine is not reliable. Please note that the dependencies have changed - ocrad-0.18 is now required and RDKit support is temporarily suspended.
Starting with version 1.2.1 there is a Windows installer which automatically installs a plugin if Symyx Draw is present. It also detects and, if necessary, automatically installs the Ghostscript and GflAx libraries.
OSRA manuscript has been published: "Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution" J. Chem. Inf. Model., 2009, 49 (3), pp 740–743.
Starting with version 1.2.0, plugins for BKChem, MolSketch, Symyx Draw, and Scitegic PipelinePilot are included with the Windows zip archive. Plugins allow for integration of OSRA functionality with chemical structure editors and other chemoinformatics software.

Dependencies

OSRA needs the following Open Source libraries installed:

GraphicsMagick, image manipulation library, version 1.3.7 or later. However, please note that there seems to be a regression with PNG processing in GraphicsMagick 1.3.13 and 1.3.14, so we recommend using version 1.3.12 as the most compatible version. If installing from RPM, make sure you have the following packages:
GraphicsMagick
GraphicsMagick-devel
GraphicsMagick-c++-devel
GraphicsMagick-c++
http://www.graphicsmagick.org/
POTRACE, vector tracing library, version 1.9 or later,
http://potrace.sourceforge.net/
GOCR/JOCR, optical character recognition library; please use the patched version available in Downloads below, not the version from the GOCR website,
http://jocr.sourceforge.net/
OCRAD, optical character recognition program, version 0.21 or later,
http://www.gnu.org/software/ocrad/ocrad.html
TCLAP, Templatized C++ Command Line Parser Library, version 1.2.0,
http://tclap.sourceforge.net/
OpenBabel, open source chemistry toolbox, version 2.3.0 or later; if installing from RPM make sure you have the following packages:
openbabel
openbabel-devel
http://openbabel.sourceforge.net/wiki/Main_Page

Optional Libraries

OSRA can also use the following optional libraries:

Tesseract (version 3.01)
Cuneiform (version 1.1.0)

Other acknowledgements:

OSRA also makes use of the following software (you do not need to install it separately, it's included in the distribution):

ThinImage, C code from the article "Efficient Binary Image Thinning using Neighborhood Maps" by Joseph M. Cychosz, 3ksnn64@ecn.purdue.edu in "Graphics Gems IV", Academic Press, 1994
http://www.acm.org/pubs/tog/GraphicsGems/gemsiv/thin_image.c
GREYCstoration, Anisotropic smoothing plugin,
http://cimg.eu/greycstoration/index.shtml
CImg, The C++ Template Image Processing Library,
http://cimg.sourceforge.net
MCDL utility from Sergei Trepalin and Andrei Gakh for 2D coordinate generation
Unpaper, a post-processing tool for scanned sheets of paper, especially for book pages that have been scanned from previously created photocopies. Its main purpose is to make scanned book pages more readable on screen after conversion to PDF. Additionally, unpaper might be useful to enhance the quality of scanned pages before performing optical character recognition (OCR).
https://sourceforge.net/projects/unpaper/

Compilation

Compile and/or install all the necessary dependencies. Check that GraphicsMagick++-config script is in your PATH. Potrace version 1.9 can now install libpotrace.a and potracelib.h - make sure you run configure with the following options:
./configure --with-libpotrace (also --disable-shared if you want static library only).
Use the patched GOCR available from the Download section below. It should be compiled and installed with
./configure; make libs; make install
Depending on your installation you might also have to add /usr/local/lib to your LD_LIBRARY_PATH. Unpack OSRA package. Starting with version 1.3.7 the standard process should do the job:
./configure
make
make install
For more details for specific platforms (such as Mac OS X and Windows MinGW) see the README file.
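
The build sequence above can be collected into a small shell script. This is a minimal sketch and is not part of the OSRA distribution: the function names (have_tool, build_potrace, build_gocr, build_osra) are made up for illustration, and the plain "make" and "make install" steps for Potrace are assumed to follow the usual configure convention. Each function is meant to be run from the corresponding unpacked source directory.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of OSRA.

# Succeed if the named program can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Potrace 1.9 or later: also build libpotrace.a and install potracelib.h.
# Run from the unpacked Potrace source directory.
# (make && make install is an assumption based on the usual convention.)
build_potrace() {
    ./configure --with-libpotrace && make && make install
}

# Patched GOCR from the Download section.
# Run from the unpacked GOCR source directory.
build_gocr() {
    ./configure && make libs && make install
}

# OSRA 1.3.7 or later. Run from the unpacked OSRA source directory.
build_osra() {
    if ! have_tool GraphicsMagick++-config; then
        echo "GraphicsMagick++-config not found in PATH" >&2
        return 1
    fi
    ./configure && make && make install
}
```

Using && instead of ; stops the sequence at the first failing step, which tends to make a broken dependency easier to spot.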

Usage

OSRA can process the following types of images:

Computer-generated 2D structures, such as found on the PubChem website, http://pubchem.ncbi.nlm.nih.gov/, black-and-white and color (use a resolution of 72 dpi),
Black-and-white PDF and PostScript files, including multi-page ones. Please note that you need Ghostscript installed for GraphicsMagick to be able to parse these kinds of files. OSRA internally renders PS and PDF at a resolution of 150 dpi; a higher rendering resolution can be set with the "-r" option,
Scanned images - black-and-white, a resolution of 300 dpi is recommended, though 150 dpi can also produce fair results. Please make sure the scanned image is of reasonable quality - an input that's too noisy will only generate garbage output.

Some common abbreviations, hetero atoms, fused and merged atomic labels, hash and wedge bonds, and bridge bonds are currently recognized. Formal charges, isotopes and some element symbols, e.g. iodine ("I" looks too much like a straight line, i.e. a single bond), are not.

Command-line options:
./osra --help
will give you a list of available options with short descriptions.

Most common use: ./osra [-r <resolution>] <filename>
Resolution in dpi, default is 300 (unless it's a PS or PDF file as mentioned above), filename is the name of your image file (or PS/PDF document).
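
As an illustration of the common invocation, the sketch below wraps it in a shell function that passes -r explicitly for PS and PDF input, so that the rendering resolution does not depend on the built-in default. The function name run_osra and the OSRA variable are hypothetical helpers for this example; only the -r option and the 300 dpi value come from the text above.

```shell
#!/bin/sh
# Sketch only: run_osra and the OSRA variable are illustrative names.
# OSRA may be set to the path of the osra binary; it defaults to ./osra.

run_osra() {
    case "$1" in
        *.pdf|*.PDF|*.ps|*.PS)
            # Render PS/PDF pages at 300 dpi instead of relying on the default.
            "${OSRA:-./osra}" -r 300 "$1"
            ;;
        *)
            # Other images: let osra use its default resolution.
            "${OSRA:-./osra}" "$1"
            ;;
    esac
}
```

For example, "run_osra article.pdf" would call "./osra -r 300 article.pdf".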

Other options:
-t, --threshold: Gray level threshold, default is 0.2 for black-and-white images,
-n, --negate: Inverts colors (for white on black images),
-o, --output: Sets a prefix for writing recognized images to files - e.g. "-o tmp" will create files tmp0.png, tmp1.png... for each of the structures,
-s, --size: Resize images on output - can be useful for running OSRA as a backend for a webservice. Example: "-s 300x400".
-g, --guess: Prints out the resolution guess when you choose automatic resolution estimation.
-p, --print: Prints out the value of confidence function estimate.
-f, --format: Output format (either smi for SMILES or sdf for SD file format)
-d, --debug: Print out debug information on spelling corrections. First column - output from the OCR engine, second - result of spelling correction, last - SMILES from the superatom dictionary, if any.
-a configfile, --superatom configfile: Superatom label map to SMILES (superatom.txt by default)
-l configfile, --spelling configfile: Spelling correction dictionary (spelling.txt by default)
-e, --page: Show page number for structures from multi-page PDF and PostScript documents
-R, --rotate: Rotate image clockwise by the number of degrees, e.g. -R 90
-u, --unpaper: Pre-process image with the unpaper algorithm for the given number of rounds (default: 0, i.e. no pre-processing), e.g. -u 2
-w, --write: Write output to a file
-b, --bond: Print out average bond length in pixels
-j, --jaggy: Additional thinning/scaling down of low quality documents
-i, --adaptive: Adaptive thresholding pre-processing, useful for low light/low contrast images
-c, --coordinates: Show surrounding box coordinates (only for SDF/SMI/CAN output format)
--embedded-format "format": Allows the user to have InChI or SMILES included in an SDF file as a molecular property
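
These options can be combined in scripts. As one hedged example, the function below converts every file in a directory to an SD file using -f sdf and shell redirection. The names convert_dir and OSRA are hypothetical helpers for this sketch, and writing one output file per input file is just one possible layout.

```shell
#!/bin/sh
# Sketch only: convert_dir and the OSRA variable are illustrative names.

# convert_dir INPUT_DIR OUTPUT_DIR
# Runs osra with SD file output on every regular file in INPUT_DIR and
# writes the result to OUTPUT_DIR/<input name>.sdf.
convert_dir() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for img in "$in_dir"/*; do
        [ -f "$img" ] || continue
        name=$(basename "$img")
        # -f sdf selects SD file output; stdout is redirected to a file.
        "${OSRA:-./osra}" -f sdf "$img" > "$out_dir/$name.sdf" ||
            echo "osra failed on $img" >&2
    done
}
```

Calling "convert_dir scans results" would then leave results/page1.png.sdf for an input file scans/page1.png.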

License

This program is free software; the part of the software that was written at the National Cancer Institute is in the public domain. This does not preclude, however, that components such as specific libraries used in the software may be covered by specific licenses, including but not limited to the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version; which may impose specific terms for redistribution or modification.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. See also http://www.gnu.org/.

See the file COPYING for details.

If there are more questions regarding the GPL software and the legal limitations on its redistribution, modifications, and usage we recommend reading GPL FAQ. In particular, regarding the output of the program we do not and cannot impose any limitations on the use of the output generated by OSRA which the input data did not previously have:

"...copyright law does not give you any say in the use of the output people make from their data using your program. If the user uses your program to enter or convert his own data, the copyright on the output belongs to him, not you. More generally, when a program translates its input into some other form, the copyright status of the output inherits that of the input it was generated from."

Download

OSRA is Free and Open Source Software. You are welcome to download and use it, provided that you understand the terms described above. Participation in the development is highly encouraged!

Note: The last version of OSRA developed at NCI was 1.4.0. Development has continued, and a major rewrite, OSRA II, is available at https://sourceforge.net/projects/osra/. This information and link is provided purely as a service to the user; however, OSRA II is neither supported at, nor endorsed by, the CADD Group and NCI.

OSRA 1.4.0 - Added reaction recognition. Use one of the three reaction output formats - rxn, rsmi, or cmlr (i.e. -f rxn) and OSRA will automatically extract reactions from a document. Change the default rendering resolution for PDF document processing to 300 dpi.
- osra-1.4.0.tgz - source code
- osra-setup-1-4-0.exe - Windows installer
- osra-1-4-0.pkg - OS X package
- gocr-0.50pre-patched.tgz - patched GOCR source code
OSRA 1.3.9 - bug fix and code cleanup update. Added single molecule MOL format output, improved Java (JNI) interface, significantly improved Accelrys Draw plug-in (James Jack). Added osra-pdf script for more thorough PDF processing.
OSRA 1.3.8 - enhancements include structure box coordinates output, adaptive thresholding, output of InChI, SMILES, and InChI-key as properties of SD file records, libosra with JNI bridge, compatibility with updated versions of OCRAD, GOCR, and Tesseract. Newer versions of Potrace and TCLAP are also required.
OSRA 1.3.7 - The build process has changed. Now you can use the standard "./configure; make; make install" combination. Recognized Markush labels (R1, R2, R3...) are now added as atomic aliases in SD files. Added optional support for Tesseract and Cuneiform OCR libraries. Added additional processing for low scan quality documents (-j command line option).
OSRA 1.3.6 - Updated superatom dictionary, added an option to print out average bond length as measured from the image, modified table detection routine, added an adaptive filter for images with too long or too short bonds.
OSRA 1.3.5 - Better memory handling. Better error reporting for file read/write operation. Added compatibility with GraphicsMagick-1.3.8. Added OpenMP support (Linux/Unix only) for multi-page document processing. Added command-line option for saving output to a file.
OSRA 1.3.4 - Better table recognition, fixed bugs affecting Windows executable only (double bond detection and JPEG file processing), added support for libocrad-0.19, added optional Unpaper pre-processing.
OSRA 1.3.3 - Tables are now detected and removed prior to processing, improving accuracy and speed. Added an option to rotate the image (-R or --rotate). Made debug output more consistent.
OSRA 1.3.2 - Significant speed-up achieved through the use of the GraphicsMagick library. Fixed some issues with slowdowns occurring when using the Windows executable.
osra-1.3.1.tgz - Various bug fixes and speed up. Better handling of PDF files.
osra-1.3.0.tgz - Better automatic recognition of high resolution images (higher than 300 dpi), OS X installer, ChemBioDraw plugin for ChemOffice 2010.
osra-1.2.2.tgz - User can now add and modify recognized superatom labels without recompiling the program. Simply edit superatom.txt and spelling.txt text files.
osra-1.2.1.tgz - Improved speed (up to 30% increase) and double and triple bond detection. Windows instal
osra-1.2.0.tgz - Page layout analysis algorithm completely re-written, added plugins for integration with several popular molecular editors.
osra-1.1.0.tgz - Added SD file format output, improved wedge bond detection.
osra-1.0.1.tgz - Minor bug fixes. OpenBabel-2.2.0 or svn snapshot of RDKit are recommended with this version.
osra-1.0.0.tgz - Significant update of the recognition engine. Simplified build instructions. Please note that the dependencies have changed since the previous version.
osra-0.9.9.tgz - Build system upgraded to allow linking to the newer versions of gocr (0.45).
osra-0.9.8.tgz - Added recognition of old-style aromatic rings with heteroatoms.
osra-0.9.7.tgz - Improved recognition of color and low-res images.
osra-0.9.6.tgz - Introduced automatic resolution detection.
osra-0.9.5.tgz - Source code modified to facilitate compiling with MinGW for Windows platform.
osra-0.9.4.tgz - added old-style benzene ring recognition
osra-0.9.3.tgz - added rudimentary formal charge recognition
osra-0.9.2.tgz - improved handling of hash and wedge bonds
osra-0.9.1.tgz - slightly improved handling of 72dpi color images
osra-0.9.tgz - original public release
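
For the reaction recognition added in OSRA 1.4.0, a minimal wrapper might check the requested format against the three reaction formats named in the release note before calling osra. The function name extract_reactions and the OSRA variable are hypothetical; only -f and the values rxn, rsmi, and cmlr come from the release note.

```shell
#!/bin/sh
# Sketch only: extract_reactions and the OSRA variable are illustrative names.

# extract_reactions FILE [FORMAT]
# FORMAT must be one of rxn, rsmi, or cmlr; it defaults to rxn.
extract_reactions() {
    file=$1
    format=${2:-rxn}
    case "$format" in
        rxn|rsmi|cmlr)
            "${OSRA:-./osra}" -f "$format" "$file"
            ;;
        *)
            echo "unsupported reaction format: $format" >&2
            return 2
            ;;
    esac
}
```

For example, "extract_reactions patent.pdf rsmi" would call "./osra -f rsmi patent.pdf".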

We also welcome your feedback - send us your comments, suggestions, criticism, or praise to the contact email address below.

Web Interface

To demonstrate the capabilities (and limitations) of OSRA we have created an OSRA Web Interface. We thank Igor Filippov for allowing us to use OSRA 2.1.0 in the newest version of the Web Interface.

Try this sample image from the US Patent Office website first: patent.gif.

Validation

A large validation set consisting of 5719 chemical structure images and associated MOL files is available for download. This set was produced from the US Patent Office Complex Work Units and contains one structure per image, ground truth MOL files and a simple Perl script to benchmark the results of your chemical structure recognition software. The benchmark script takes two arguments: first the folder with the ground truth files ("molfiles"), and second the folder with your generated files. The filenames of individual structures should be identical in both folders. The script compares the structures based on standard InChI. This validation set was made possible courtesy of collaboration with Dr. Steve Boyer and Dr. John Kinney.
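
Because the benchmark script matches structures by filename, it can help to confirm that every ground truth file has a counterpart among the generated files before running it. The check below is a sketch and is not part of the validation set; the function name missing_results and the directory name "generated" in the usage note are made up for illustration.

```shell
#!/bin/sh
# Sketch only: missing_results is an illustrative helper,
# not part of the validation set.

# missing_results TRUTH_DIR GENERATED_DIR
# Prints the name of every file in TRUTH_DIR that has no identically
# named file in GENERATED_DIR. Returns 1 if any file is missing.
missing_results() {
    truth_dir=$1
    gen_dir=$2
    status=0
    for truth in "$truth_dir"/*; do
        [ -f "$truth" ] || continue
        name=$(basename "$truth")
        if [ ! -f "$gen_dir/$name" ]; then
            echo "$name"
            status=1
        fi
    done
    return "$status"
}
```

Running "missing_results molfiles generated" lists the structures for which no output was produced, so they can be regenerated or counted as failures.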

This file has been updated courtesy of Aniko Valko and Keymodule Ltd., UK. The ground truth molfiles have been corrected and invalid images have been removed.
Download zip archive here.
A subset of 450 images from the Japanese Patent Office Chem-Infty dataset containing only organic molecules can be downloaded here:
images and ground truth.
This subset is distributed by permission from the original Chem-Infty dataset authors Koji Nakagawa, Akio Fujiyoshi, and Masakazu Suzuki. This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.1 Japan License.

Igor Filippov - 2007, SAIC-Frederick, Frederick National Laboratory for Cancer Research, NIH, DHHS, Frederick, MD

Last Update: Marc C. Nicklaus, 2016-08-16, 2024-02-09