
How to optimize and improve Optical Character Recognition (OCR) results

Automatic text recognition in images or scanned documents by Optical Character Recognition (OCR)

Text stored in image formats like JPG, PNG, TIFF or GIF (i.e. scans, photos or screenshots) cannot be found by standard full text search. This enhancer therefore enriches the metadata of images (such as filename, format and size) with the results of automatic text recognition, or optical character recognition (OCR), using free open source OCR software like Tesseract.

Much information is not searchable by full text search because it is stored in graphical formats embedded in PDF documents or PowerPoint presentations (i.e. screenshots instead of text). For that reason the enhancer also extracts images from PDF files for automatic text recognition (OCR).

Enable OCR

OCR is enabled by default in the virtual machine packages such as Open Semantic Desktop Search or Open Semantic Search Appliance.

Otherwise, enable OCR by installing the open source software Tesseract OCR.

If you have enabled OCR, you should enable OCR for images inside PDF files, too, since many PDF files are scans and contain much of their text only as graphics.
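As a rough illustration of the extraction step, the sketch below builds a command line for `pdfimages` from Poppler, which writes the images embedded in a PDF to files that an OCR engine can then read. Using `pdfimages` is an assumption made for this example and not necessarily how the enhancer works internally; the function name and the paths are hypothetical.

```python
from pathlib import Path


def pdfimages_command(pdf_path: str, output_dir: str) -> list[str]:
    """Build a Poppler `pdfimages` call that writes embedded images as PNG files."""
    # pdfimages expects an output prefix ("image root"), not a directory,
    # so the PDF's base name is reused as the prefix inside output_dir.
    prefix = Path(output_dir) / Path(pdf_path).stem
    return ["pdfimages", "-png", pdf_path, str(prefix)]


# Hypothetical paths. On a system with Poppler installed the command could be
# executed with subprocess.run(cmd, check=True).
cmd = pdfimages_command("scans/report.pdf", "/tmp/ocr")
```

Each extracted image can then be passed to the OCR engine like any other image file.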

How to optimize OCR settings to improve OCR results

You can optimize OCR in several ways so that more text is found. The following methods can be combined for the best OCR results with fewer OCR failures:

Scanning resolution

If scanning documents yourself, scan or store the images with a higher resolution, so the OCR can analyse more details of the characters.
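To get a feeling for what a higher resolution buys, the following sketch computes how many pixels a printed character occupies at a given scan resolution. It relies only on the definition of a typographic point as 1/72 inch; the function name is made up for this example.

```python
def font_height_px(font_size_pt: float, dpi: int) -> float:
    """Approximate height in pixels of a font's em box when scanned at `dpi`."""
    # One typographic point is 1/72 inch, so the size in inches is pt / 72.
    return font_size_pt / 72 * dpi


# Compare a 10 pt font at three common scan resolutions.
for dpi in (150, 300, 600):
    print(dpi, "dpi:", round(font_height_px(10, dpi), 1), "px")
```

A 10 pt font is roughly 21 pixels tall at 150 dpi and roughly 42 pixels at 300 dpi, so doubling the resolution doubles the detail available for each character.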

Language of dictionary

Since OCR uses a language-specific dictionary, set the OCR language to the language of your documents, or to multiple languages if your documents use more than one.
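For Tesseract, the language is chosen with the `-l` option, and several languages are joined with `+`. The sketch below only builds such a command line; the helper name and the file name are hypothetical, and each language code assumes that the matching Tesseract language data is installed.

```python
def tesseract_command(image_path: str, languages: list[str]) -> list[str]:
    """Build a Tesseract call that prints the recognized text to stdout."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract takes several languages joined by "+", for example "eng+deu".
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


# English and German documents in the same collection (hypothetical file name).
cmd = tesseract_command("scan.png", ["eng", "deu"])
```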

Additional OCR dictionary entries

Add unknown names or words to the OCR dictionary or word lists.
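A word list for Tesseract is a plain text file with one word per line, which can be passed with the `--user-words` option. The following is a minimal sketch of preparing such a list from a collection of names; the helper name is hypothetical, and how strongly the list influences recognition may depend on the Tesseract version and engine mode.

```python
def user_words_text(words: list[str]) -> str:
    """Return the content of a word list file: unique words, one per line."""
    # Strip whitespace, drop empty entries and remove duplicates.
    unique = sorted({w.strip() for w in words if w.strip()})
    return "".join(w + "\n" for w in unique)


names = ["Tesseract", "Scantailor", "Tesseract", "  "]
text = user_words_text(names)

# The text could then be written to a file and handed to Tesseract, e.g.:
#   tesseract scan.png stdout --user-words my_words.txt
# (my_words.txt is a hypothetical file name).
```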

Please donate so that additional domain vocabulary from your thesaurus entries or lists of names can be integrated automatically into the OCR dictionary.

Disable usage of dictionaries and word lists

For unknown names that are not in such managed lists or dictionaries, OCR results are better if the use of dictionaries and word lists is switched off.

If you have enough CPU resources while indexing, you can combine the results of both settings to index more names or words from scanned documents in their correct spelling, since the search engine will find the results of both of these technically opposing OCR strategies.
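With Tesseract, dictionaries can be switched off by setting the configuration variables `load_system_dawg` and `load_freq_dawg` to 0. The sketch below builds one command line per strategy so that both outputs can be indexed for the same image; the function name and the dictionary keys are hypothetical, and whether both variables take effect from the command line may depend on your Tesseract version.

```python
def tesseract_variants(image_path: str, language: str = "eng") -> dict[str, list[str]]:
    """Build two Tesseract calls for one image: with and without dictionaries."""
    base = ["tesseract", image_path, "stdout", "-l", language]
    # "-c name=value" sets a configuration variable; these two control
    # loading of the main word list and the frequent-word list.
    no_dict = base + ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return {"with_dictionary": base, "without_dictionary": no_dict}


variants = tesseract_variants("scan.png")
```

Running both variants roughly doubles the OCR time per image, which is why this is only advisable when enough CPU resources are available.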

Rotation and deskewing of low quality scans before OCR

Many documents are scanned at a skewed angle.

Automatic deskewing of such low quality scans by Scantailor before OCR can improve the OCR results.

Therefore enable the Scantailor plugin.

Training characters and fonts

You can train the OCR engine with the special fonts used in your documents to improve the machine learning model for recognizing the characters of these fonts.

How to manage OCR errors and failures

Despite all that optimization, automatic character recognition can fail, and OCR errors such as wrongly recognized words or names occur.

Therefore there are integrated tools for manual handling and management of OCR failures on document level for single documents or on meta level for all documents:

Handle OCR failures by collaborative tagging and annotation

For single documents with OCR failures, you can add annotations or tags containing the words that the OCR engine recognized wrongly. The search engine can then find these documents despite the OCR errors, because the tags or annotations are spelled correctly.

Manage OCR failures by thesaurus (Hidden labels)

Manage common OCR failures for all documents with thesaurus entries that list the misrecognized forms as hidden labels.

Since you handle such an OCR failure on the meta level, the correction is applied automatically to new documents containing the same wrongly recognized words or names.

The recommender can analyse the corpus for typos and OCR errors of a thesaurus entry and recommend such misspellings, which you can add to the thesaurus as hidden labels with one click.
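The idea behind such a recommendation can be sketched with fuzzy string matching from the Python standard library. This is not the recommender's actual algorithm, only a stand-in under simple assumptions: the vocabulary is a set of words seen in the corpus, and the function name, cutoff and sample words are made up for illustration.

```python
import difflib


def suggest_hidden_labels(label: str, vocabulary: set[str], cutoff: float = 0.8) -> list[str]:
    """Suggest corpus words that look like misspellings of a thesaurus label."""
    # The label itself is not a misspelling, so it is excluded up front.
    candidates = sorted(w for w in vocabulary if w.lower() != label.lower())
    # get_close_matches keeps words whose similarity ratio is at least `cutoff`.
    return difflib.get_close_matches(label, candidates, n=5, cutoff=cutoff)


# Hypothetical corpus vocabulary containing two OCR errors of "Tesseract".
vocabulary = {"Tesseract", "Tesscract", "Tcsseract", "Teacher", "search"}
suggestions = suggest_hidden_labels("Tesseract", vocabulary)
```

In this sample, only the two near-misses pass the cutoff, while unrelated words such as "Teacher" are left out. A human still confirms each suggestion before it becomes a hidden label.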

Combining OCR results of multiple OCR tools

No OCR engine is perfect.

In some projects we therefore used, for example, ABBYY FineReader to OCR the images in PDFs in addition to the integrated Tesseract OCR. Each of them recognized words or names on which the other OCR software failed. Because we combined and indexed both OCR results for the same document, we could find many more documents.

The Open Semantic ETL framework is able to combine or unify and index analysis results of multiple analysis or OCR tools or OCR parameters for the same document or image.
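Why indexing several OCR outputs helps can be shown with a small sketch that unions the words found by two engines for the same image. It does not reflect how Open Semantic ETL stores its results; the function name, engine names and sample texts are invented for illustration.

```python
import re


def merged_index_terms(ocr_results: dict[str, str]) -> set[str]:
    """Union of lowercased word tokens over the output of all OCR engines."""
    terms: set[str] = set()
    for text in ocr_results.values():
        terms.update(re.findall(r"\w+", text.lower()))
    return terms


# Each engine misreads a different word of the same scanned line.
results = {
    "engine_a": "Invoice for Mr. Srnith",
    "engine_b": "lnvoice for Mr. Smith",
}
terms = merged_index_terms(results)
```

Neither output alone contains both "invoice" and "smith", but the merged set does, so a search for either word finds the document.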

More information about improving OCR quality