How to optimize and improve Optical Character Recognition (OCR) results
Automatic text recognition in images or scanned documents by Optical Character Recognition (OCR)
Text stored in image formats like JPG, PNG, TIFF or GIF (e.g. scans, photos or screenshots) cannot be found by standard full-text search. This enhancer therefore enriches the metadata of images (such as filename, format and size) with the results of automatic text recognition, or optical character recognition (OCR), performed by free open source OCR software like Tesseract.
Since a lot of information is not searchable by full-text search because it is stored in graphical formats embedded in PDF documents or PowerPoint presentations (e.g. screenshots instead of text), the enhancer also extracts images from PDF files for automatic text recognition (OCR).
Enable OCR
OCR is enabled by default in the Debian and Ubuntu packages and in the virtual machine packages like Open Semantic Desktop Search or Open Semantic Search Appliance. If you build your search engine from the open source code, enable OCR by installing the open source software Tesseract OCR.
If you disable or enable OCR, you should also disable or enable OCR for images within PDF files, since many PDF files are scans and contain much of their text only as graphics.
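If you built the search engine from source, it can help to check that Tesseract is actually installed and which language packs it finds. The following is a minimal sketch, not part of the enhancer itself: the function names are hypothetical, and it only relies on the `tesseract` executable and its `--list-langs` option.

```python
import shutil
import subprocess


def tesseract_available() -> bool:
    """Return True if a `tesseract` executable is found on the PATH."""
    return shutil.which("tesseract") is not None


def parse_languages(list_langs_output: str) -> list[str]:
    """Parse the output of `tesseract --list-langs`.

    The first line is a header; every further non-empty line is a language code.
    """
    lines = list_langs_output.splitlines()
    return [line.strip() for line in lines[1:] if line.strip()]


def installed_languages() -> list[str]:
    """Ask the local Tesseract installation for its language packs."""
    if not tesseract_available():
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Assumption: depending on the Tesseract version, the list is written
    # to stdout or to stderr, so fall back to stderr if stdout is empty.
    return parse_languages(result.stdout or result.stderr)


if __name__ == "__main__":
    print("Tesseract installed:", tesseract_available())
    print("Language packs:", installed_languages())
```

An empty language list usually means that either Tesseract or its language data packages are missing.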
How to optimize OCR settings to improve OCR results
You can optimize OCR to find more text in several ways, which you can combine for optimal OCR results with fewer OCR failures:
Scanning resolution
If you scan documents yourself, scan or store the images at a higher resolution, so the OCR can analyse more details of the characters.
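To see why resolution matters, it helps to estimate how many pixels a character gets at a given scan resolution. The sketch below uses only the definition of the typographic point (1/72 inch); the function name is hypothetical, and the result is the height of the font's em square, so individual letters are somewhat smaller.

```python
def glyph_height_px(point_size: float, dpi: int) -> float:
    """Approximate pixel height of a font's em square at a scan resolution.

    One typographic point is 1/72 inch, so height = points * dpi / 72.
    """
    return point_size * dpi / 72


if __name__ == "__main__":
    for dpi in (150, 300, 600):
        height = glyph_height_px(10, dpi)
        print(f"{dpi} dpi: 10 pt text is about {height:.1f} px tall")
```

Doubling the resolution doubles the height of each character in pixels, which gives the OCR engine four times as many pixels per character to analyse.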
Language of dictionary
Since OCR uses a language-specific dictionary, set the OCR language to your language or to the multiple languages that are used in your documents.
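How the language is set in the search engine's own configuration is not shown here. On the level of the Tesseract command line, languages are passed with the `-l` option and multiple languages are joined with `+`. A minimal sketch, assuming Tesseract is called directly (the function names and the file name `scan.png` are hypothetical):

```python
import shutil
import subprocess


def build_ocr_command(image_path: str, languages: list[str]) -> list[str]:
    """Build a Tesseract call that prints the recognized text to stdout.

    Multiple language codes are joined with "+", for example "deu+eng".
    """
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def ocr(image_path: str, languages: list[str]) -> str:
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed")
    result = subprocess.run(
        build_ocr_command(image_path, languages),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Example: documents that mix German and English text.
    print(build_ocr_command("scan.png", ["deu", "eng"]))
```

Each listed language needs its language data package installed, and each additional language tends to make recognition slower.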
Additional custom OCR dictionary entries from Thesaurus and Ontologies
Fully integrated out of the box in the next open source release:
Concepts, words and names of named entities (like organizations, places, locations or persons) are also recognized better by OCR of scanned documents if they are important enough to you that you added them to your thesaurus, or if they are included in lists of names or ontologies that you defined for faceted search / interactive filters and/or analytics / aggregated overviews (for example, lists of names of relevant persons from internal metadata sources or from open data sources like Wikidata).
Your additional domain knowledge / vocabulary from the thesaurus, lists and ontologies is therefore used in addition to the language-specific OCR dictionary of Tesseract OCR.
Since names are often written entirely in uppercase in scanned legacy paper files, this autogenerated custom OCR dictionary / OCR wordlist also includes the uppercase variant of each word.
So you should consider rebuilding your index or forcing a reindex of important files (so they are analyzed and indexed again, even if they are already in the index) after adding very important words or names to the thesaurus or after new or changed ontologies provide them.
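The generation of such a wordlist can be sketched as follows. This is an illustration of the idea, not the enhancer's actual code: the function names are hypothetical, and splitting multi-word labels into single words is an assumption made here because Tesseract user word files contain one word per line.

```python
def build_ocr_wordlist(labels):
    """Return the sorted unique words of all labels plus their uppercase variants."""
    words = set()
    for label in labels:
        # Assumption: multi-word labels are split into single words.
        for word in label.split():
            words.add(word)
            words.add(word.upper())
    return sorted(words)


def write_wordlist(labels, path):
    """Write the wordlist as a plain-text file with one word per line."""
    with open(path, "w", encoding="utf-8") as wordlist_file:
        for word in build_ocr_wordlist(labels):
            wordlist_file.write(word + "\n")


if __name__ == "__main__":
    # Hypothetical thesaurus labels.
    for word in build_ocr_wordlist(["Angela Merkel", "Berlin"]):
        print(word)
```

A file written this way can be handed to Tesseract with its `--user-words` option. How much a user wordlist influences the result depends on the Tesseract version and recognition engine in use.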
Disable usage of dictionaries and word lists
OCR results for unknown names that are not in such managed lists or dictionaries are better if the use of dictionaries and lists is switched off.
If you have enough CPU resources while indexing, you can combine the results of both settings to index more names or words from scanned documents in their correct spelling, since the search engine will find all results of both of these technically opposing OCR strategies.
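On the level of Tesseract, the dictionaries can be switched off with the configuration variables `load_system_dawg` and `load_freq_dawg`. The sketch below builds both command variants for the same image; it assumes Tesseract is called directly, the function name is hypothetical, and whether the enhancer uses exactly these variables is an assumption.

```python
# Tesseract configuration variables that disable the built-in word lists.
NO_DICTIONARY_OPTIONS = ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]


def build_ocr_commands(image_path: str, language: str) -> dict[str, list[str]]:
    """Build one Tesseract call with and one without dictionaries."""
    base = ["tesseract", image_path, "stdout", "-l", language]
    return {
        "with_dictionary": base,
        "without_dictionary": base + NO_DICTIONARY_OPTIONS,
    }


if __name__ == "__main__":
    for name, command in build_ocr_commands("scan.png", "eng").items():
        print(name, " ".join(command))
```

Running both commands doubles the OCR time per image, which is why this combination only pays off if enough CPU resources are available. The texts of both runs can then be indexed for the same document.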
Automatic deskewing and rotation of low quality scans before OCR
Many documents are scanned askew.
Automatic deskewing of such low quality scans by Scantailor before OCR can improve the OCR results.
Therefore the Scantailor plugin is enabled unless it has been disabled because of performance issues.
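To illustrate what deskewing involves, the following sketch estimates the skew angle of a page with a simple projection-profile search. This is not Scantailor's algorithm, only a generic textbook technique; the function name and the parameter defaults are hypothetical. For each candidate angle the dark pixels are sheared back (an approximation of rotation that is adequate for small angles) and counted per row. When the text lines become horizontal, the pixels fall into few, very full rows.

```python
import math


def estimate_skew(black_pixels, max_angle=5.0, step=0.5):
    """Estimate the skew angle in degrees with a projection-profile search.

    black_pixels: iterable of (x, y) coordinates of dark pixels.
    A positive angle means that text lines run downwards to the right
    in image coordinates (y grows downwards).
    """
    pixels = list(black_pixels)
    best_angle, best_score = 0.0, -1
    steps = int(round(max_angle / step))
    for i in range(-steps, steps + 1):
        angle = i * step
        slope = math.tan(math.radians(angle))
        rows = {}
        for x, y in pixels:
            # Shear the pixel back by the candidate angle.
            row = round(y - x * slope)
            rows[row] = rows.get(row, 0) + 1
        # Few, very full rows give a high sum of squared row counts.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


if __name__ == "__main__":
    # Synthetic page: three thin "text lines" skewed by 3 degrees.
    slope = math.tan(math.radians(3.0))
    page = [(x, round(y0 + x * slope)) for y0 in (20, 40, 60) for x in range(200)]
    print("Estimated skew:", estimate_skew(page), "degrees")
```

The search tests every candidate angle against every dark pixel, which hints at why deskewing costs noticeable CPU time on large scans.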
Training characters and fonts
You can train the OCR engine with the special fonts used in your documents to improve the machine learning model for the recognition of characters in these fonts.
How to manage OCR errors and failures
Despite all this optimization, automatic character recognition can fail, and OCR errors such as wrongly recognized words or names occur.
Therefore there are integrated tools for manual handling and management of OCR failures, either on document level for single documents or on meta level for all documents:
Handle OCR failures by collaborative tagging and annotation
For single documents with OCR failures you can add annotations or tags containing the words that were wrongly recognized by the OCR engine, so the search engine can find these documents despite the OCR errors, because the tags or annotations are spelled correctly.
Manage OCR failures by thesaurus (Hidden labels)
Manage common OCR failures for all documents with thesaurus entries for the management of OCR errors (hidden labels).
Since you handle such an OCR failure on meta level, the correction is applied automatically to new documents with the same wrongly recognized words or names.
The recommender can analyse the corpus for typos / OCR errors of a thesaurus entry and recommend such misspellings for adding to the thesaurus as hidden labels with one click.
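The idea behind such a recommender can be sketched with fuzzy string matching from the Python standard library. This is an illustration under assumptions, not the recommender's actual implementation: the function names, the similarity cutoff of 0.8 and the example words are hypothetical.

```python
import difflib


def recommend_hidden_labels(label, corpus_vocabulary, cutoff=0.8, limit=5):
    """Suggest words from the corpus that look like misspellings of a label."""
    candidates = [word for word in set(corpus_vocabulary) if word != label]
    return difflib.get_close_matches(label, candidates, n=limit, cutoff=cutoff)


def expand_query(term, hidden_labels):
    """Return the term plus its hidden labels, so a search also hits OCR variants."""
    return [term] + list(hidden_labels.get(term, []))


if __name__ == "__main__":
    vocabulary = ["Tesseract", "Tesseraet", "Tcsseract", "Table"]
    suggestions = recommend_hidden_labels("Tesseract", vocabulary)
    print("Candidates for hidden labels:", suggestions)
    print("Expanded query:", expand_query("Tesseract", {"Tesseract": suggestions}))
```

A higher cutoff yields fewer but safer suggestions. Since a suggestion can also be a different valid word, the final decision to accept it as a hidden label stays with the user.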
Combining OCR results of multiple OCR tools
No OCR engine is perfect.
So in some projects we used, for example, ABBYY FineReader to OCR the images in PDFs in addition to the integrated Tesseract OCR. Each of them recognized words or names on which the other OCR software failed. Because we combined and indexed both OCR results for the same document, we could find many more documents.
The Open Semantic ETL framework can combine (unify) and index the results of multiple analysis or OCR tools, or of different OCR parameters, for the same document or image.
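Combining the results can be pictured as storing the text of each OCR engine in the same index document. The following sketch is not the Open Semantic ETL API: the function names, the field names such as `ocr_all_t` and the sample texts are hypothetical and only illustrate why a search succeeds if at least one engine recognized the word.

```python
def merge_ocr_results(doc_id, results):
    """Merge the texts of several OCR engines into one index document.

    results: mapping of engine name to the text this engine recognized.
    """
    doc = {"id": doc_id}
    for engine, text in results.items():
        # Hypothetical field naming: one field per OCR engine.
        doc[f"ocr_{engine}_t"] = text
    # A combined field lets one full-text query match any engine's output.
    doc["ocr_all_t"] = "\n".join(results.values())
    return doc


def matches(doc, term):
    """Very simplified full-text search: case-insensitive substring match."""
    return term.lower() in doc["ocr_all_t"].lower()


if __name__ == "__main__":
    doc = merge_ocr_results("scan.pdf", {
        "tesseract": "Invoice for Dr Smiht",   # name misrecognized
        "finereader": "lnvoice for Dr Smith",  # first word misrecognized
    })
    print(matches(doc, "Smith"), matches(doc, "Invoice"))
```

In this example each engine fails on a different word, yet both search terms find the document because the outputs are indexed together.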