OSRA is a utility designed to convert graphical representations of
chemical structures, as they appear in journal articles, patent documents,
textbooks, trade magazines etc., into SMILES (Simplified Molecular
Input Line Entry System, see
http://en.wikipedia.org/wiki/SMILES) or SD files, both computer-readable molecular structure formats.
OSRA can read a document in any of the over 90 graphical formats parseable by GraphicsMagick, including
GIF, JPEG, PNG, TIFF, PDF, PS etc., and generate the SMILES or SDF representation of
the molecular structure images encountered within that document.
Note that any software designed for optical recognition is unlikely to
be perfect, and the output produced might, and probably will, contain
errors, so curation by a human knowledgeable in chemical structures
is highly recommended.
OSRA needs the following Open Source libraries installed:
- GraphicsMagick, image manipulation library, version 1.3.7 or later.
  Please note that there seems to be a regression in PNG processing in
  GraphicsMagick 1.3.13 and 1.3.14, so we recommend version 1.3.12 as
  the most compatible version.
  If installing from RPM make sure you have the following packages:
- POTRACE, vector tracing library, version 1.9 or later,
- GOCR/JOCR, optical character recognition library; please use the
  patched version available in Downloads below, not the version from
  the GOCR website,
- OCRAD, optical character recognition program, version 0.21 or later,
- TCLAP, Templatized C++ Command Line Parser Library, version 1.2.0,
- OpenBabel, open source chemistry toolbox, version 2.3.0 or later.
  If installing from RPM make sure you have the following packages:
OSRA can also use the following optional libraries:
OSRA also makes use of the following software (you do not need to
install it separately, it's included in the distribution):
- ThinImage, C code from the article
"Efficient Binary Image Thinning using Neighborhood Maps"
by Joseph M. Cychosz, email@example.com
in "Graphics Gems IV", Academic Press, 1994
- GREYCstoration, Anisotropic smoothing plugin,
- CImg, The C++ Template Image Processing Library,
- MCDL utility from Sergei Trepalin and Andrei Gakh for 2D coordinate
- Unpaper, a post-processing tool for scanned sheets of
paper, especially for book pages that have been scanned from previously created
photocopies. Its main purpose is to make scanned book pages more readable on
screen after conversion to PDF. Additionally, unpaper might be useful to enhance
the quality of scanned pages before performing optical character recognition.
Compile and/or install all the necessary dependencies.
Check that GraphicsMagick++-config script is in your PATH.
Potrace version 1.9 can now install libpotrace.a and potracelib.h -
make sure you run configure with the following options:
./configure --with-libpotrace (also --disable-shared if you want
static library only).
Use patched GOCR available from Downloads section below.
It should be compiled and installed with
./configure; make libs; make install
Depending on your installation
you might also have to add /usr/local/lib to your LD_LIBRARY_PATH.
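The two environment checks above can be sketched as a short shell script. This is a minimal sketch and not part of the OSRA distribution: the helper name add_lib_dir is hypothetical, and /usr/local/lib is simply the directory mentioned above, which may differ on your system.

```shell
#!/bin/sh
# Hypothetical helper: print a search path with directory $1 prepended to
# the existing path $2, unless $1 is already present in it.
add_lib_dir() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;      # already present, keep unchanged
        "::")     printf '%s\n' "$1" ;;      # existing path was empty
        *)        printf '%s\n' "$1:$2" ;;   # prepend the new directory
    esac
}

# Check that the GraphicsMagick++ configuration script is in PATH.
if command -v GraphicsMagick++-config >/dev/null 2>&1; then
    echo "GraphicsMagick++-config found"
else
    echo "GraphicsMagick++-config not found in PATH" >&2
fi

# Make libraries installed under /usr/local/lib visible to the dynamic loader.
LD_LIBRARY_PATH=$(add_lib_dir /usr/local/lib "${LD_LIBRARY_PATH:-}")
export LD_LIBRARY_PATH
```

The script only reports whether the configuration script was found; it does not check library versions.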
Unpack the OSRA package. Starting with version 1.3.7 the standard
process should do the job:
./configure; make; make install
For more details for specific platforms
(such as Mac OS X and Windows MinGW) see the README file.
OSRA can process the following types of images:
- Computer-generated 2D structures, such as found on the PubChem website,
black-and-white and color (use a resolution of 72 dpi),
- Black-and-white PDF and PostScript files, including multi-page ones. Please
note that you need Ghostscript installed for GraphicsMagick to be able to
parse these kinds of files. OSRA internally renders PS and PDF at a resolution
of 150 dpi (the OSRA 1.4.0 release notes below change the PDF default to
300 dpi); a higher rendering resolution can be achieved with the "-r" option,
- Scanned images - black-and-white, a resolution of 300 dpi is recommended,
though 150 dpi can also produce fair results. Please make sure the
scanned image is of reasonable quality - an input that's too noisy will
only generate garbage output.
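The resolution guidance in the list above can be captured in a small lookup. This is only a sketch: the function name suggested_dpi and the category names are invented for illustration, and page.png is a placeholder file name. Only the "-r" option and the dpi values come from the text above.

```shell
#!/bin/sh
# Hypothetical helper: map an input category to the resolution (in dpi)
# suggested in the list above. The category names are made up for this sketch.
suggested_dpi() {
    case "$1" in
        computer-generated) echo 72 ;;   # e.g. PubChem-style 2D depictions
        pdf|ps)             echo 150 ;;  # internal rendering resolution stated above
        scanned)            echo 300 ;;  # recommended for scanned images
        *) echo "unknown input category: $1" >&2; return 1 ;;
    esac
}

# Run OSRA on a scanned page, but only if osra and the placeholder file exist.
if command -v osra >/dev/null 2>&1 && [ -f page.png ]; then
    osra -r "$(suggested_dpi scanned)" page.png
fi
```

Passing the resolution explicitly is a design choice for this sketch; OSRA also has defaults, so "-r" can be omitted.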
Some common abbreviations, hetero atoms, fused and merged atomic
labels, hash and wedge bonds, and bridge bonds are currently
recognized. Formal charges, isotopes and some element
symbols, e.g. iodine ("I" looks too much like a straight line, i.e. a single
bond), are not.
will give you a list of available options with short descriptions.
Most common use: ./osra [-r <resolution>] <filename>
Resolution in dpi, default is 300 (unless it's a PS or PDF file as
mentioned above), filename is the name of your image file (or
-t, --threshold: Gray level threshold, default is 0.2
for black-and-white images,
-n, --negate: Inverts colors (for white on black images),
-o, --output: Sets a prefix for writing recognized images to files, e.g.
"-o tmp" will create files tmp0.png, tmp1.png... for
each of the structures,
-s, --size: Resize images on output - can be useful for running OSRA
as a backend for a webservice. Example: "-s 300x400".
-g, --guess: Prints out resolution guess when you choose to have automatic resolution detection
-p, --print: Prints out the value of confidence function estimate.
-f, --format: Output format (either smi for SMILES or sdf for SD file format)
-d, --debug: Print out debug information on spelling corrections.
First column - output from the OCR engine, second - result of spelling
correction, last - SMILES from the superatom dictionary, if any.
-a configfile, --superatom configfile: Superatom label map to SMILES (superatom.txt by default)
-l configfile, --spelling configfile: Spelling correction dictionary (spelling.txt by default)
-e, --page: Show page number for structures from multi-page PDF and
-R, --rotate: Rotate image clockwise by the number of degrees, e.g. -R 90
-u, --unpaper: Pre-process image with unpaper algorithm; the argument is the number of rounds (default: 0,
or no pre-processing), e.g. -u 2
-w, --write: Write output to a file
-b, --bond: Print out average bond length in pixels
-j, --jaggy: Additional thinning/scaling down of low quality documents
-i, --adaptive: Adaptive thresholding pre-processing, useful for low light/low
-c, --coordinates: Show surrounding box coordinates (only for SDF/SMI/CAN output formats)
--embedded-format "format": Allows the user to have InChI or SMILES included in
an SDF file as a molecular property
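Several of the options above can be combined in one call. The following is a sketch under stated assumptions: document.pdf and structures.sdf are placeholder names, expected_images is a hypothetical helper that only spells out the file naming described for "-o", and the sketch assumes OSRA prints its results to standard output when "-w" is not given.

```shell
#!/bin/sh
# Hypothetical helper: list the image files that "-o PREFIX" is described
# as creating for COUNT recognized structures (PREFIX0.png, PREFIX1.png, ...).
expected_images() {
    prefix=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "${prefix}${i}.png"
        i=$((i + 1))
    done
}

# Placeholder input file; replace with your own document.
input=document.pdf

if command -v osra >/dev/null 2>&1 && [ -f "$input" ]; then
    # SD file output (-f sdf) with confidence estimates (-p) and page
    # numbers (-e), plus one PNG per structure with the prefix "tmp" (-o).
    # Assumption: results go to standard output, so they are redirected here.
    osra -f sdf -p -e -o tmp "$input" > structures.sdf
fi
```

Calling expected_images tmp 3 prints tmp0.png, tmp1.png and tmp2.png, one per line, which matches the naming described for "-o tmp".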
This program is free software; the part of the software that was written
at the National Cancer Institute is in the public domain. This does not
preclude, however, that components such as specific libraries used in the
software may be covered by specific licenses, including but not limited
to the GNU General Public License as published by the Free Software Foundation;
either version 2 of the License, or (at your option) any later version;
which may impose specific terms for redistribution or modification.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307,
USA. See also http://www.gnu.org/.
See the file COPYING for details.
If there are more questions regarding the GPL software and the legal
limitations on its redistribution, modifications, and usage we recommend
the GNU GPL FAQ. In particular, regarding the output of the program,
we do not and cannot impose any limitations on the use of the output
generated by OSRA which the input data did not previously have:
"...copyright law does not give you any say in the use of the output
people make from their data using your program. If the user uses
your program to enter or convert his own data, the copyright on the
output belongs to him, not you. More generally, when a program
translates its input into some other form, the copyright status of
the output inherits that of the input it was generated from."
OSRA is Free and Open Source Software. You are welcome to download
and use it, provided that you understand the terms described above.
Participation in the development is highly encouraged!
- For the most recent version of OSRA visit osra.sf.net.
- OSRA 1.4.0 - Added reaction recognition. Use one of the
three reaction output formats - rxn, rsmi, or cmlr (e.g. -f rxn)
and OSRA will automatically extract reactions from a document.
Changed the default rendering resolution for PDF document processing
to 300 dpi.
- OSRA 1.3.9 - bug fix and code cleanup update. Added single
molecule MOL format output, improved Java (JNI) interface,
significantly improved Accelrys Draw plug-in (James Jack).
Added osra-pdf script for more thorough PDF processing.
- OSRA 1.3.8 - enhancements include structure box coordinates
output, adaptive thresholding, output of InChI, SMILES, and
InChI-key as properties of SD file records, libosra with JNI bridge,
compatibility with updated versions of OCRAD, GOCR, and Tesseract.
Newer versions of Potrace and TCLAP are also required.
- OSRA 1.3.7 - The build process has changed. Now you
can use the standard "./configure; make; make install" combination.
Recognized Markush labels (R1, R2, R3...) are now added as atomic
aliases in SD files. Added optional support for
OCR libraries. Added additional processing for low scan quality
documents (-j command line option).
- OSRA 1.3.6 - Updated superatom dictionary, added
an option to print out average bond length as measured from the
image, modified table detection routine, added an adaptive
filter for images with too long or too short bonds.
- OSRA 1.3.5 - Better memory handling. Better error reporting
for file read/write operation. Added compatibility with
GraphicsMagick-1.3.8. Added OpenMP support (Linux/Unix only) for
multi-page document processing. Added command-line option for
saving output to a file.
- OSRA 1.3.4 - Better table recognition, fixed bugs affecting
Windows executable only (double bond detection and JPEG file
processing), added support for libocrad-0.19, added optional Unpaper
- OSRA 1.3.3 - Tables are now detected and removed prior to processing,
improving accuracy and speed. Added an option to rotate the image
(-R or --rotate). Made debug output more consistent.
- OSRA 1.3.2
- Significant speed up achieved through the use of GraphicsMagick library.
Fixed some issues with slowdowns occurring when using Windows executable.
- Various bug fixes and speed up. Better handling of PDF files.
- Better automatic recognition of high resolution images (higher than 300
dpi), OS X installer, ChemBioDraw plugin for ChemOffice 2010.
- Users can now add and modify recognized superatom labels without
recompiling the program. Simply edit the superatom.txt and spelling.txt text files.
- Improved speed (up to 30% increase) and double and triple bond detection.
- osra-1.2.0.tgz -
Page layout analysis algorithm completely re-written, added plugins
for integration with several popular molecular editors.
- osra-1.1.0.tgz -
Added SD file format output, improved wedge bond detection.
- osra-1.0.1.tgz - Minor bug
fixes. OpenBabel-2.2.0 or svn snapshot of RDKit are recommended with this
- osra-1.0.0.tgz - Significant
update of the recognition engine. Simplified build instructions.
Please note that the dependencies have changed since the previous
- osra-0.9.9.tgz - Build
system upgraded to allow linking to the newer versions of gocr (0.45).
- osra-0.9.8.tgz - Added recognition of old-style aromatic rings with heteroatoms.
- osra-0.9.7.tgz - Improved recognition of color and low-res images.
- osra-0.9.6.tgz - Introduced automatic resolution detection.
- osra-0.9.5.tgz - Source code modified to facilitate compiling with MinGW for Windows platform.
- osra-0.9.4.tgz - added old-style benzene ring recognition
- osra-0.9.3.tgz - added rudimentary formal charge recognition
- osra-0.9.2.tgz - improved handling of hash and wedge bonds
- osra-0.9.1.tgz - slightly improved handling of 72dpi color images
- osra-0.9.tgz - original public release
We also welcome your feedback - send us your comments, suggestions,
criticism, or praise to the contact email address below.
To demonstrate the capabilities (and limitations) of OSRA we have created
an OSRA Web Interface.
Try this sample image from the US Patent Office website first:
Igor Filippov - 2007,
SAIC-Frederick, Frederick National Laboratory for Cancer Research, NIH, DHHS, Frederick, MD
Last Update: 2012-09-12