How to integrate Open Data to enhance semantic search, data analysis, text mining & document analysis

How data enrichment with Linked Open Data (LOD) from Open Access databases helps you to improve search, find more results and analyse document sets and data sets, and how monitoring with Open Data watchlists generates leads for investigative research

Open Data from Open Access databases like Wikidata (structured data from Wikipedia) can be used to improve the analysis and navigation of your private data.

It can generate leads by pointing to documents in which important concepts or people occur.

Your private data or documents can also be enriched with additional data or structures from open data.

Integrating Open Data with Open Source research tools therefore enables easier search, more and better search results, analytics, interactive filters (faceted search), watchlists and other powerful features. These rely on a combination of text mining and data analysis methods such as entity matching, entity tagging, analytics and automatic classification by machine learning.

Because all of this is computed with open source software on your own computer or server, it is possible without handing sensitive information to a cloud service and without the privacy issues that would involve.

Sharing data and Open Data

If a list, dictionary, vocabulary or thesaurus already exists, or if the entries are already available in another structured format such as a Semantic Web ontology (e.g. from other tools or from open data sources), you do not need to enter each named entity (persons, organizations, locations, important words or concepts) manually into the named entities manager in order to use them for analytics, navigation, aggregated overviews and interactive filters.

You can import existing, shared or external lists of names, thesauri, dictionaries, vocabularies or ontologies.

Tagging named entities like people or organizations

Using this list of important names or this ontology, the entity tagger can tag the documents in which these names occur.
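As a rough illustration of the idea (not the project's actual entity tagger), dictionary-based tagging can be sketched as a lookup of word sequences from the text in an imported list of names. The function names and the sample watchlist below are made up for this example.

```python
import re


def normalize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_lookup(entities):
    """Map each name (as a tuple of tokens) to (original name, entity type)."""
    lookup = {}
    for entity_type, names in entities.items():
        for name in names:
            lookup[tuple(normalize(name))] = (name, entity_type)
    return lookup


def tag_entities(text, lookup):
    """Return the listed entities whose tokens occur consecutively in the text."""
    tokens = normalize(text)
    max_len = max((len(key) for key in lookup), default=0)
    found = set()
    for start in range(len(tokens)):
        # Try every window of 1..max_len tokens starting at this position.
        for length in range(1, max_len + 1):
            candidate = tuple(tokens[start:start + length])
            if candidate in lookup:
                found.add(lookup[candidate])
    return sorted(found)


# Hypothetical watchlist; in practice this would be an imported list or ontology.
watchlist = {
    "person": ["Jane Doe", "John Smith"],
    "organization": ["Example Corp"],
}
lookup = build_lookup(watchlist)
tags = tag_entities("Minutes: Jane Doe met the board of Example Corp.", lookup)
print(tags)
```

Because each window of tokens is looked up in a dictionary, the cost depends on the length of the text rather than on the number of names in the list. This sketch only finds exact matches after lowercasing.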

Leads for research by matching watchlists

This gives you leads to potentially relevant documents in which important concepts or people from such a watchlist or ontology occur, for example when analysing a large unknown document set or dataset, or after receiving new documents or news.

Analytics, aggregated overviews and exploration

The faceted search gives you an aggregated overview of the different facets, such as concepts, persons, locations or organizations, showing how many documents match each entity.

Interactive filters (faceted search)

You can use these overviews or named entities as interactive filters to narrow down the search results.
A click on a facet (e.g. an organization) drills the search results down to the fewer documents that also match this additional facet or filter.
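The interplay of facet counts and filters can be sketched in a few lines. This is a simplified in-memory illustration with made-up documents and field names, not how the search engine implements facets internally.

```python
from collections import Counter

# Hypothetical documents that have already been tagged with named entities.
documents = [
    {"id": 1, "person": ["Jane Doe"], "organization": ["Example Corp"]},
    {"id": 2, "person": ["Jane Doe", "John Smith"], "organization": []},
    {"id": 3, "person": [], "organization": ["Example Corp"]},
]


def facet_counts(docs, facet):
    """Count in how many documents each value of the given facet occurs."""
    counts = Counter()
    for doc in docs:
        # Use a set so a value is counted once per document.
        counts.update(set(doc.get(facet, [])))
    return counts


def apply_filter(docs, facet, value):
    """Keep only the documents that carry the given facet value."""
    return [doc for doc in docs if value in doc.get(facet, [])]


# Aggregated overview over all documents.
person_counts = facet_counts(documents, "person")

# "Clicking" the organization facet narrows the result set ...
filtered = apply_filter(documents, "organization", "Example Corp")

# ... and the facet counts are recomputed for the remaining documents.
remaining_person_counts = facet_counts(filtered, "person")

print(person_counts)
print([doc["id"] for doc in filtered])
print(remaining_person_counts)
```

Each additional filter is applied to the already filtered list, which is why every click can only keep or reduce the number of results.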

Free open data source (Linked Open Data from Wikidata)

One of the biggest and most universal Open Data sources is Wikidata, the structured database from Wikipedia.

There you can select and download lists of names of important people for your watchlists, data for data enrichment or data structures for machine learning.

Extracting a list of people or companies, such as all politicians of your parliament(s), from the Wikidata SPARQL interface (which provides Open Access to structured open data) and keeping it up to date automatically requires some familiarity with open Semantic Web technologies. If you have that, you can import the resulting list or ontology into the Named Entities Manager.

Please donate so that we can write additional tutorials on how to export such lists of names from Wikidata and keep them up to date, and so that we can make the open data export, import and update process easier with additional tools and user interfaces, e.g. for much easier data management and list updates. This would also allow such Wikidata lists or ontologies to be used not only for exact matching in faceted search and interactive filters, but more easily for our fuzzy search by lists, too.
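A minimal sketch of such an export, assuming Python's standard library and the public Wikidata SPARQL endpoint at https://query.wikidata.org/sparql. The query uses the Wikidata properties P31 (instance of), Q5 (human) and P39 (position held). The item ID of the position, the helper names and the sample response are placeholders for illustration, and the resulting list would still have to be imported into the Named Entities Manager separately.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_position_query(position_qid, language="en", limit=1000):
    """Build a SPARQL query for humans who hold or held the given position."""
    return f"""
SELECT ?person ?personLabel WHERE {{
  ?person wdt:P31 wd:Q5 ;
          wdt:P39 wd:{position_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}". }}
}}
LIMIT {limit}
"""


def fetch_results(query):
    """Send the query to the endpoint and return the parsed JSON response."""
    url = WIKIDATA_SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url,
        headers={
            # Identify the client; this name is only an example.
            "User-Agent": "watchlist-example/0.1 (illustrative sketch)",
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def extract_names(results):
    """Collect the distinct labels from a SPARQL JSON result."""
    names = set()
    for binding in results["results"]["bindings"]:
        label = binding.get("personLabel", {}).get("value")
        if label:
            names.add(label)
    return sorted(names)


# "Q_POSITION" is a placeholder: replace it with the Wikidata item ID
# of the position you are interested in before sending the query.
query = build_position_query("Q_POSITION")

# Made-up response in the SPARQL JSON results format, so the parsing
# step can be tried without network access.
sample_results = {
    "results": {
        "bindings": [
            {"personLabel": {"type": "literal", "value": "John Smith"}},
            {"personLabel": {"type": "literal", "value": "Jane Doe"}},
            {"personLabel": {"type": "literal", "value": "Jane Doe"}},
        ]
    }
}
names = extract_names(sample_results)
print(names)

# With a real item ID and network access:
# names = extract_names(fetch_results(build_position_query("Q...")))
```

Running the same script on a schedule and re-importing its output is one simple way to keep such a list up to date. Note that P39 also matches people who held the position in the past, so the query may need further conditions for a list of current office holders only.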

Find more with thesauri of synonyms and irregular verbs (Wiktionary)

Another use of Open Data is to extend search queries or indexes with a dictionary and thesaurus, such as Wiktionary or a LibreOffice or OpenOffice Hunspell dictionary, so that synonyms and irregular verb forms are found, too.
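Query expansion with a thesaurus can be sketched as follows. The thesaurus entries are made up, and the Boolean syntax with AND, OR and quoted phrases is an assumption modelled on common search engine query languages. A real setup might instead configure synonyms in the search index.

```python
def expand_query(query, thesaurus):
    """Replace each query term by an OR group of the term and its variants."""
    groups = []
    for term in query.lower().split():
        variants = [term] + [v for v in thesaurus.get(term, []) if v != term]
        # Multi-word variants are quoted so they are searched as phrases.
        quoted = ['"%s"' % v if " " in v else v for v in variants]
        if len(quoted) > 1:
            groups.append("(" + " OR ".join(quoted) + ")")
        else:
            groups.append(quoted[0])
    return " AND ".join(groups)


# Hypothetical entries as they might be derived from a dictionary or thesaurus:
# an irregular verb form and some synonyms.
thesaurus = {
    "buy": ["bought", "purchase"],
    "car": ["automobile", "motor vehicle"],
}

expanded = expand_query("buy car", thesaurus)
print(expanded)
# (buy OR bought OR purchase) AND (car OR automobile OR "motor vehicle")
```

Expanding the query at search time keeps the index unchanged, while expanding the index instead moves the cost to indexing time. Which of the two is preferable depends on the size of the thesaurus and of the document collection.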

Automatic classification or disambiguation (machine learning)

Another use of open data from Wikidata or Wikipedia is as good training data for classification and disambiguation.

For example, for names or words with more than one meaning (homonyms), this makes it possible to guess which meaning is intended: machine learning algorithms for automatic classification compare the context of the name with large open data sets like Wikipedia in order to disambiguate the named entity.
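To illustrate the principle of comparing context, the following sketch scores each candidate meaning by the word overlap (cosine similarity of simple word counts) between the text surrounding a name and a reference text for that meaning. This is a deliberately simple stand-in for a trained classifier. The reference texts are short made-up descriptions, where a real system would use large open texts such as Wikipedia articles.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count the lowercase word tokens of a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two word count vectors (0.0 if one is empty)."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def disambiguate(context, candidates):
    """Return the best matching meaning and the score of every candidate."""
    context_vector = bag_of_words(context)
    scores = {
        sense: cosine_similarity(context_vector, bag_of_words(description))
        for sense, description in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Made-up reference texts for two meanings of the same name.
candidates = {
    "Jaguar (animal)": "large cat species living in forests of the Americas predator",
    "Jaguar (car maker)": "British manufacturer of luxury cars and sports vehicles",
}

best, scores = disambiguate(
    "The jaguar is a predator that hunts in the forests", candidates
)
print(best)
```

With realistic amounts of reference text, common words like "the" would dominate such raw counts, so a practical approach would at least weight or remove them.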

Privacy and data protection: Investigative Research with Open Source Software

With our free Open Source software and research tools you can use Open Data from sources on the internet even for investigative research on sensitive documents or other sensitive data with heightened privacy needs, as required for example in investigative journalism.

Since there is no cloud, only other people's or companies' computers, we want to process our data ourselves with Free Software.

After you have downloaded and imported an open dataset, our open source search engine therefore computes all matching and monitoring for this Open Data list or ontology only on your own laptop or server. Neither document contents for analysis nor matching names are uploaded to a cloud service.

Big Data analysis by Open Data

This is possible even for a large number of named entities or list entries, for very large amounts of data and for huge document collections.

In some research, text mining and data analysis projects we cross-referenced documents with lists and ontologies containing hundreds of thousands of names, such as politicians, species, materials or other important concepts to look for.

In such cases you should import the list into the thesaurus editor or ontologies manager for entity tagging before indexing the documents, so that the entities are tagged during indexing rather than matched by later searches, which would take some time with lists of this size.