Automatic text recognition (OCR)

Automatic text recognition in images or scanned documents by Optical Character Recognition (OCR)

Text stored in image formats such as JPG, PNG, TIFF or GIF (e.g. scans, photos or screenshots) cannot be found by standard full-text search. This enhancer therefore enriches the metadata of images (filename, format and size) with the results of automatic text recognition, also called optical character recognition (OCR), performed by free open source software such as Tesseract OCR.

Much information is not searchable by full-text search because it is stored in graphical formats embedded in PDF documents or PowerPoint presentations (e.g. screenshots instead of text). The enhancer therefore also extracts images from PDF files for automatic text recognition (OCR).

Enable OCR

(OCR is enabled by default in the virtual machine packages like Open Semantic Desktop Search or Open Semantic Search Appliance)

Install the package tesseract-ocr (included in your Linux distribution):

apt-get install tesseract-ocr

If you enable OCR, you should also enable OCR for images inside PDF files, since many PDF files are scans and contain much of their text only as graphics:

Add (uncomment) the PDF OCR Plugin:

#Enable OCR for images inside PDF files
config['plugins'].append('enhance_pdf_ocr')
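
Before indexing, it can help to check that the installed Tesseract actually recognizes text in one of your images. The following is a minimal sketch that calls the tesseract command line tool from Python; it is not part of Open Semantic ETL, and the function names and the example file name are assumptions made for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    # Passing "stdout" as the output base makes Tesseract print the text
    # instead of writing it to a file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found; install the package tesseract-ocr")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, ocr_image("scan.png") would return the text found in a (hypothetical) file scan.png in the current directory.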

How to optimize OCR settings to improve OCR results

You can improve OCR results, and so find more documents, in several ways that can be combined:

Scanning resolution

If you scan documents yourself, scan or store the images at a higher resolution, so that the OCR can analyse more details of the characters.

Language of dictionary

Since OCR uses a language-specific dictionary, set the OCR language to the language of your documents, or to multiple languages if your documents use more than one.

Setting the OCR language to a language other than English:

  1. Install the Tesseract language package (for German: tesseract-ocr-deu). See the list of available languages for Debian or Ubuntu.
  2. Set the option ocr_lang to the language of your documents. The default is eng for English (Tesseract uses eng, not en). For German set deu (Tesseract does not use de):

    # language for automatic text recognition (ocr)
    #config['ocr_lang'] = "eng"
    config['ocr_lang'] = "deu"

    Or set the OCR language to multiple languages that are used in your documents:

    # language for automatic text recognition (ocr)
    config['ocr_lang'] = "eng+deu"

This setting affects only the OCR of images inside PDF files.

For OCR of image files, you should additionally set the language in the Tika server config.
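
Because Tesseract expects three-letter codes joined by a plus sign, a small helper can catch a mistaken en or de before it reaches the config. This is only a sketch: the mapping table and the function name are hypothetical and cover just a few languages.

```python
import re

# Hypothetical helper table: two-letter codes mapped to Tesseract's
# three-letter language codes. Extend it for the languages you need.
ISO_TO_TESSERACT = {"en": "eng", "de": "deu", "fr": "fra"}


def build_ocr_lang(languages):
    """Return a Tesseract language string such as "eng+deu"."""
    codes = []
    for lang in languages:
        # Translate known two-letter codes, keep everything else unchanged
        code = ISO_TO_TESSERACT.get(lang, lang)
        # Accept three letters with an optional suffix (e.g. chi_sim)
        if not re.fullmatch(r"[a-z]{3}(_[a-z]+)?", code):
            raise ValueError("not a Tesseract language code: " + repr(lang))
        if code not in codes:
            codes.append(code)
    return "+".join(codes)
```

The result can then be assigned in the ETL config, for example config['ocr_lang'] = build_ocr_lang(["en", "de"]).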

Additional dictionary entries

Add unknown names or words to the OCR dictionary.

Please donate so that additional domain vocabulary from your thesaurus entries or lists of names can be integrated automatically into the OCR dictionary.

Rotation and deskewing low quality scans before OCR

Many documents are scanned at a skewed angle.

Additional deskewing of such low-quality scans with Scantailor before OCR can improve the OCR results.

Install Scantailor:

apt-get install scantailor

Enable additional optimization with Scantailor before OCR by adding (uncommenting) the deskewing plugin in your ETL config /etc/opensemanticsearch/etl:

config['plugins'].append('enhance_ocr_descew')

In the default configuration the deskewing plugin is disabled, because it needs more time and CPU resources when indexing documents that contain images.

Combining OCR results of multiple OCR tools

No OCR engine is perfect.

In some projects we therefore used, for example, ABBYY FineReader to OCR the images in PDFs in addition to the integrated open source OCR software Tesseract.

Each of them recognized words or names that the other missed. Because we combined and indexed both OCR results for the same document, we could find many more documents.

The Open Semantic ETL framework can combine (unify) and index the analysis results of multiple analysis or OCR tools, or of multiple OCR parameter settings, for the same document or image.
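
The idea of indexing the union of several OCR results can be illustrated with a short sketch. It is not the code used by Open Semantic ETL; the function names and the sample texts are assumptions chosen to show how a word misread by one engine can still be found through the other.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def combine_ocr_results(*ocr_texts):
    """Return the union of tokens from several OCR outputs, in first-seen order."""
    seen = set()
    combined = []
    for text in ocr_texts:
        for token in tokenize(text):
            if token not in seen:
                seen.add(token)
                combined.append(token)
    return combined


# Made-up outputs of two OCR engines for the same image
first_engine = "Invoice for Mr. Srnith"
second_engine = "lnvoice for Mr. Smith"
searchable_terms = combine_ocr_results(first_engine, second_engine)
```

Here searchable_terms contains both "invoice" and "smith", although each engine got one of the two words wrong.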

Train characters and fonts

You can train the OCR on the special fonts used in your documents to improve the machine learning model for recognizing characters in these fonts.

How to manage OCR failures

Handle OCR errors by collaborative tagging and annotation

For single documents with OCR errors, you can add annotations or tags containing the words that the OCR engine recognized incorrectly. Thanks to the correctly written tags or annotations, the search engine can find these documents despite the OCR errors.

Manage OCR errors in thesaurus (Hidden labels)

Manage common OCR errors for all existing and new documents with thesaurus entries for OCR errors (hidden labels).

The recommender can analyse the corpus for typos or OCR errors of a thesaurus entry and recommend such misspellings, which can then be added to the thesaurus as hidden labels with one click.
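
How such a recommendation could work in principle is sketched below with the fuzzy matching from Python's standard library. This is an assumption for illustration, not the recommender's actual implementation; the function name, the similarity cutoff and the sample words are hypothetical.

```python
import difflib


def suggest_hidden_labels(label, corpus_words, cutoff=0.7):
    """Return corpus words that look like misspellings of a thesaurus label."""
    label_lower = label.lower()
    # Compare against distinct corpus words, excluding the correct spelling
    candidates = sorted({word.lower() for word in corpus_words} - {label_lower})
    return difflib.get_close_matches(label_lower, candidates, n=10, cutoff=cutoff)


# Made-up corpus vocabulary containing two OCR misreadings of "Smith"
suggestions = suggest_hidden_labels("Smith", ["Smith", "Srnith", "Smlth", "Jones"])
```

With these sample words, suggestions contains "srnith" and "smlth" but not "jones". A lower cutoff returns more candidates at the cost of more false matches, so the suggestions still need a human decision before they become hidden labels.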

More information about improving OCR quality

Improving the quality of the output of Tesseract OCR