OSRA is a utility designed to convert graphical representations of chemical structures, as they appear in journal articles, patent documents, textbooks, trade magazines etc., into SMILES (Simplified Molecular Input Line Entry Specification - see http://en.wikipedia.org/wiki/SMILES) or SD files - a computer recognizable molecular structure format. OSRA can read a document in any of the over 90 graphical formats parseable by ImageMagick - including GIF, JPEG, PNG, TIFF, PDF, PS etc., and generate the SMILES or SDF representation of the molecular structure images encountered within that document.
Note that any software designed for optical recognition is unlikely to be perfect, and the output produced might, and probably will, contain errors, so curation by a human knowledgeable in chemical structures is highly recommended.
OSRA needs the following Open Source libraries installed:
OSRA can also use the following optional libraries:
OSRA also makes use of the following software (you do not need to install it separately, it's included in the distribution):
Compile and/or install all the necessary dependencies.
Check that GraphicsMagick++-config script is in your PATH.
Potrace version 1.9 can now install libpotrace.a and potracelib.h -
make sure you run configure with the following options:
./configure --with-libpotrace (also --disable-shared if you want static library only).
Use the patched GOCR available from the Downloads section below. It should be compiled and installed with
./configure; make libs; make install
Depending on your installation you might also have to add /usr/local/lib to your LD_LIBRARY_PATH. Unpack the OSRA package. Starting with version 1.3.7 the standard process should do the job:
For more details for specific platforms (such as Mac OS X and Windows MinGW) see the README file.
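The two environment checks mentioned above (the GraphicsMagick++-config script being in PATH, and /usr/local/lib being listed in LD_LIBRARY_PATH) can be sketched in Python. This is only an illustrative helper, not part of the OSRA distribution; the function name check_build_environment is hypothetical, and whether you actually need /usr/local/lib depends on your installation.

```python
import os
import shutil


def check_build_environment(environ=None, which=shutil.which):
    """Return a list of warnings about the build environment (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    warnings = []
    # The script name is taken from the installation notes above.
    if which("GraphicsMagick++-config") is None:
        warnings.append("GraphicsMagick++-config not found in PATH")
    # LD_LIBRARY_PATH is a colon-separated list of directories.
    lib_dirs = [d.rstrip("/") for d in environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    if "/usr/local/lib" not in lib_dirs:
        warnings.append("/usr/local/lib is not in LD_LIBRARY_PATH")
    return warnings


if __name__ == "__main__":
    for message in check_build_environment():
        print("warning:", message)
```

An empty result means both checks passed; a warning about LD_LIBRARY_PATH may be harmless if the libraries were installed elsewhere.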
OSRA can process the following types of images:
Some common abbreviations, hetero atoms, fused and merged atomic labels, hash and wedge bonds, and bridge bonds are currently recognized. Formal charges, isotopes and some element symbols, e.g. iodine ("I" looks too much like a straight line, i.e. a single bond), are not.
will give you a list of available options with short descriptions.
Most common use: ./osra [-r <resolution>] <filename>
Resolution in dpi, default is 300 (unless it's a PS or PDF file as mentioned above), filename is the name of your image file (or PS/PDF document).
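For scripted use, the basic invocation above could be wrapped as follows. This is a minimal sketch, not an official interface: the function names osra_command and run_osra are hypothetical, the executable path ./osra simply mirrors the usage line, and it assumes that OSRA prints one recognized structure per line on standard output.

```python
import subprocess


def osra_command(filename, resolution=None, executable="./osra"):
    """Build the argument list for the basic invocation shown above."""
    command = [executable]
    if resolution is not None:
        # -r takes the resolution in dpi; omit it to use the default.
        command += ["-r", str(int(resolution))]
    command.append(filename)
    return command


def run_osra(filename, resolution=None, executable="./osra"):
    """Run OSRA and return the non-empty lines of its standard output.

    Assumption: results are written to standard output, one per line.
    """
    result = subprocess.run(
        osra_command(filename, resolution, executable),
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]
```

For example, osra_command("scan.png", resolution=150) yields the same arguments as typing ./osra -r 150 scan.png in a shell.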
-t, --threshold: Gray level threshold, default is 0.2 for black-and-white images,
-n, --negate: Inverts colors (for white on black images)
-o, --output: Sets a prefix for writing recognized images to files, e.g. "-o tmp" will create files tmp0.png, tmp1.png... for each of the structures
-s, --size: Resize images on output - can be useful for running OSRA as a backend for a webservice. Example: "-s 300x400".
-g, --guess: Prints out the resolution guess when you choose automatic resolution estimation.
-p, --print: Prints out the value of confidence function estimate.
-f, --format: Output format (either smi for SMILES or sdf for SD file format)
-d, --debug: Print out debug information on spelling corrections. First column - output from the OCR engine, second - result of spelling correction, last - SMILES from the superatom dictionary, if any.
-a configfile, --superatom configfile: Superatom label map to SMILES (superatom.txt by default)
-l configfile, --spelling configfile: Spelling correction dictionary (spelling.txt by default)
-e, --page: Show page number for structures from multi-page PDF and PostScript documents
-R, --rotate: Rotate image clockwise by the given number of degrees, e.g. -R 90
-u, --unpaper: Pre-process image with the unpaper algorithm for the given number of rounds (default: 0, i.e. no pre-processing), e.g. -u 2
-w, --write: Write output to a file
-b, --bond: Print out average bond length in pixels
-j, --jaggy: Additional thinning/scaling down of low quality documents
-i, --adaptive: Adaptive thresholding pre-processing, useful for low light/low contrast images
-c, --coordinates: Show surrounding box coordinates (only for SDF/SMI/CAN output format)
--embedded-format "format": Allows the user to have InChI or SMILES included in an SDF file as a molecular property
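When OSRA runs behind a script or a web service, the options listed above can be assembled programmatically. The sketch below covers only a few of them and is an illustration rather than a complete wrapper: the function name osra_options and its keyword names are hypothetical, and it assumes that -w takes the output file name as its argument and that -s expects a WIDTHxHEIGHT string as in the "-s 300x400" example.

```python
import re


def osra_options(output_format=None, write_to=None, rotate=None, size=None,
                 unpaper_rounds=0, negate=False, adaptive=False,
                 print_confidence=False, show_page=False):
    """Translate keyword settings into OSRA command-line flags (hypothetical helper)."""
    args = []
    if output_format is not None:
        args += ["-f", output_format]        # e.g. "smi" or "sdf"
    if write_to is not None:
        args += ["-w", write_to]             # assumption: -w takes a file name
    if rotate is not None:
        args += ["-R", str(int(rotate))]     # clockwise rotation in degrees
    if size is not None:
        if not re.fullmatch(r"\d+x\d+", size):
            raise ValueError("size must look like '300x400'")
        args += ["-s", size]
    if unpaper_rounds:
        args += ["-u", str(int(unpaper_rounds))]
    # Switches that take no value.
    switches = ((negate, "-n"), (adaptive, "-i"),
                (print_confidence, "-p"), (show_page, "-e"))
    args += [flag for enabled, flag in switches if enabled]
    return args
```

The returned list is meant to be placed between the executable name and the input file name, for example ["./osra"] + osra_options(output_format="sdf", write_to="out.sdf") + ["scan.png"].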
This program is free software; the part of the software that was written at the National Cancer Institute is in the public domain. This does not preclude, however, that components such as specific libraries used in the software may be covered by specific licenses, including but not limited to the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version; which may impose specific terms for redistribution or modification.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. See also http://www.gnu.org/.
If you have further questions regarding GPL software and the legal limitations on its redistribution, modification, and usage, we recommend reading the GPL FAQ. In particular, regarding the output of the program, we do not and cannot impose any limitations on the use of the output generated by OSRA which the input data did not previously have:
"...copyright law does not give you any say in the use of the output people make from their data using your program. If the user uses your program to enter or convert his own data, the copyright on the output belongs to him, not you. More generally, when a program translates its input into some other form, the copyright status of the output inherits that of the input it was generated from."
OSRA is Free and Open Source Software. You are welcome to download and use it, provided that you understand the terms described above. Participation in the development is highly encouraged!
We also welcome your feedback - send us your comments, suggestions, criticism, or praise to the contact email address below.
To demonstrate the capabilities (and limitations) of OSRA we have created an OSRA Web Interface.
Try this sample image from the US Patent Office website first: patent.gif.
Igor Filippov - 2007, SAIC-Frederick, Frederick National Laboratory for Cancer Research, NIH, DHHS, Frederick, MD
Last Update: 2016-06-09