Improve OCR with a SKOS thesaurus or RDF ontology

The free open source tool Solr Ontology Tagger extracts the words of a Resource Description Framework (RDF) ontology or Simple Knowledge Organization System (SKOS) thesaurus and adds them to a custom dictionary (the "user words" list) for the open source optical character recognition (OCR) tool Tesseract OCR. This improves automatic text recognition of those words.

How to improve OCR of concepts or names in thesaurus or ontology

Thanks to the out-of-the-box integration with the Tesseract OCR user word list (custom dictionary), OCR of scanned documents also recognizes your concepts, words and names of named entities such as organizations, places or persons better. These are the terms that matter enough to you that you added them to your thesaurus, or that are included in the lists of names or ontologies you defined for faceted search (interactive filters) or analytics (aggregated overviews), for example lists of names of relevant persons from internal databases, metadata sources or open data sources like Wikidata.

Integrate SKOS Thesaurus or RDF ontology with Tesseract OCR dictionary

Your additional domain knowledge and vocabulary from thesauri, lists and ontologies is therefore used in addition to the default OCR dictionaries, via the Tesseract option --user-words /etc/opensemanticsearch/ocr/dictionary.txt.
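To show how that option reaches Tesseract, here is a small sketch that builds the command line in Python. The helper names, the image file name scan.png and the language eng are hypothetical; only the --user-words option and the dictionary path come from the text above.

```python
import subprocess

# Path of the autogenerated word list, as named in the text above.
USER_WORDS = "/etc/opensemanticsearch/ocr/dictionary.txt"


def build_tesseract_command(image_path, output_base,
                            user_words=USER_WORDS, lang="eng"):
    """Return the argument list for a Tesseract run with a custom word list."""
    return [
        "tesseract", image_path, output_base,
        "-l", lang,
        "--user-words", user_words,
    ]


def run_ocr(image_path, output_base):
    """Run Tesseract; requires Tesseract and the word list file to exist."""
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)


# Hypothetical input file; the command is only built here, not executed.
command = build_tesseract_command("scan.png", "scan")
```

Building the argument list separately from running it makes it easy to inspect or log the exact call before any OCR job starts.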

Names in paper documents are often in uppercase

In many scanned legacy paper files, names are written entirely in uppercase. Because to a computer, and therefore to OCR, an "A" is a different character than an "a", the autogenerated custom OCR dictionary (OCR word list) also includes the fully uppercase variant of each name or word.
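The idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool's actual implementation: the function name and the sample names are made up for the example.

```python
def with_uppercase_variants(words):
    """Return each entry followed by its uppercase variant, without duplicates."""
    entries = []
    seen = set()
    for word in words:
        for variant in (word, word.upper()):
            if variant and variant not in seen:
                seen.add(variant)
                entries.append(variant)
    return entries


# Hypothetical sample names; "NATO" is already uppercase, so it appears once.
entries = with_uppercase_variants(["Ada Lovelace", "Berlin", "NATO"])

# One entry per line, as a plain-text word list.
wordlist_text = "\n".join(entries) + "\n"
```

Skipping duplicates keeps the word list short when a name is already written in uppercase in the thesaurus.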

Open Standards for Thesauruses and Ontologies (RDF & SKOS)

Because the tool uses the open standards Resource Description Framework (RDF) for ontologies and Simple Knowledge Organization System (SKOS) for thesauri, knowledge bases, lists of entities, ontologies or taxonomies, you do not have to add or manage all important names yourself.

Open Data vocabularies like Wikidata

You can use linked open data sources and databases like Wikidata, the vocabularies of the European Union or the UNESCO Thesaurus.

Free Open Source Software (FOSS)

The tool and the libraries it uses are free open source software based on Python and rdflib. The full source code is included in the downloadable packages and hosted on GitHub.