You are here

Fuzzy search

How to find results despite incomplete or imperfect search queries, misspelled names, typos and low text quality or low OCR quality

If you're sure that all search query terms are in the document(s) you search, you can switch the default operator to AND.

If you search for a exact sentence or phrase (all search terms in exact this form and order with no other words in between), you can switch to "Phrase search".

But if you're not sure about all terms, you can use and combine different search techniques like the OR operator and scoring, wildcards, stemming, synonyms and fuzziness:

Use OR operator and sort by scoring

If you use the OR operator (default, if you don't switch to AND), you will often find results without all search terms in the documents.

But even getting many irrelevant results in which only a small part or even a single word of your search query occurs, if using this search paradigma:

The more of your search terms occur in the document, the higher is the document is scored and you will see documents with most matches on top.

So you will often find documents even if one of your search terms is misspelled or not exact the same concept used in the document.

Find alternate word forms (Stemming)

Considering grammar rules to find other word forms, too (stemming), the search engine will find more.

For example if you search for company corruption you will find companies corrupted, too.

If you want only results matching exactly the spelling in your search query, you can disable stemming by disabling the option "Find other word forms, too".

If you disable stemming for the whole search query, you can enable it for single terms by using the operator stemmed:

If you enable stemming for the whole search query, you can disable it for single terms by using the operator exact:

Find unknown word parts with wildcards

Multiple character wildcard searches with * looks for 0 or more characters.

So
corrupt*
will find corrupt, corruption, corrupted ... but not anticorruption

*rupt
will find corrupt or interrupt but wont find corruption or interruption...

*rupt*
will find corrupt, corruption, anticorruption, interrupt and interruption ...

You can also use the wildcard searches in the middle of a term:
co*upt
will find corupt or corrupt

Find results with typos (or if not sure how to spell a name) by edit distance (Levensthein distance)

Fuzziness by edit distance is useful if you search for names you don't know how to spell exactly.

Or if you want to consider typos or misspelled names in the documents.

Or if some documents are images or scanned documents (like PDF documents containing scanned pages only in graphical formats instead of digital text), so you maybe find more documents for your query even if the results of automated text-recognition (OCR) are not perfect, i.e. because single chars were recognized wrong.

To do a fuzzy search use the tilde, "~", symbol at the end of a single word / term you want to search with fuzziness.

For example to search for a term similar in spelling to "roam" use the fuzzy search:

roam~

This search will find terms like foam and roams.

similar~ words~
would find not only similar words but similer works, too because both words are set to fuzzy search.

Setting maximal edit distance (number of chars, that may differ)

If you use the ~ operator for fuzziness, the default maximal edit distance is two chars. This is the maximal number of chars that can be changed, added or deleted to match a term that is searched with fuzziness.

To change the maximum number of changeable chars, set the number after the ~ operator.

Example:
searchterm~3

Find synonyms and hyponyms

If you don't switch that off for your query and don't use the operator "exact:" for the search term, the search engine can find synonyms (f.e. if you search for car you want to find automobile, too) and hyponyms (if you search for a car, you want to find truck, too).

You can setup synonyms in the config file synonyms.txt.

Please donate so we can integrate this (technically existing) features with open data from Wiktionary, so most synonyms and more forms of irregular verbs ("went" is not only "go" with a suffix) will be preconfigurated out of the box and we can develop and integrate an user interface for easier editing of custom or domain specific synonyms (SKOS thesaurus) instead of having to edit the config file with a text editor.

More like this (Term vector model)

Based on a relevant document find other documents with many same words ("more like this").

Coming soon (please donate)...

Find similarity by machine learning (Latent Dirichlet allocation (LDA), clustering, classification, co-ocurrences)

Find documents with related words (not only the same words or synonyms).

Coming soon (please donate)...