Find more with grammar rules (stemming)

By taking grammar rules into account, the search engine finds more results.

So if you search for company corruption, you will also find companies corrupted, even though to a computer these are different words (strings).

Reducing words to their root form (stemming)

This works because a stemming algorithm reduces words to their root form.

For example, the stemmer cuts off suffixes like -ing, -ed or the plural -s, so the search engine can find more forms of a word.

So the words corrupt, corrupted and corruption are all indexed and searched using only their root form corrupt.

Preconfigured stemming

The search engine packages are preconfigured for stemming with English grammar out of the box.

Change the language / grammar

If you use another language, you can switch stemming to that language and its grammar:

Edit the config file /var/solr/data/core1/conf/managed-schema

For the two fields content and stemmed, change the type to the field type for your language (the default is text_en for English).

For example, to use German (language code de) grammar rules, change the type attribute of the two fields content and stemmed from text_en to text_de:

Change

<field name="stemmed" type="text_en" indexed="true" stored="false" multiValued="true"/>

to

<field name="stemmed" type="text_de" indexed="true" stored="false" multiValued="true"/>
...

To also highlight search terms in the snippets when they appear in a different word form, change

...
<field name="content" type="text_en" indexed="false" stored="true" multiValued="true"/>
...

to

...
<field name="content" type="text_de" indexed="false" stored="true" multiValued="true"/>
...

Change the stemmer

There is no perfect stemming algorithm for all situations. Instead, there are different stemming algorithms and implementations (stemmers) that work more or less well.

To switch to a more or less aggressive stemmer, edit the field type definition for your language (for example text_en).
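
As a rough illustration of where the stemmer sits, a field type definition in managed-schema could look like the sketch below. The tokenizer and filter chain shown here is an assumption for illustration only; the text_en definition shipped with your installation may contain other or additional filters. The factory class names themselves are standard Solr analysis components.

```xml
<!-- Sketch only: the text_en definition in your own managed-schema may list other tokenizers and filters. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split the text into word tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase all tokens so matching is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The stemmer: swap this one filter to change stemming behaviour.
         Less aggressive alternatives shipped with Solr include
         solr.KStemFilterFactory and solr.EnglishMinimalStemFilterFactory. -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

After changing the stemmer, documents that are already indexed generally need to be reindexed, because the stems stored in the index were produced by the old filter.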

Learn more about stemming algorithms, free open source stemmers and their configuration:

Stemming algorithms

Choosing a stemmer

Stemming for Solr

Stemming for Elasticsearch

Thesaurus (dictionary of linked words) and other ontologies (Linked Data)

Another way to find more is to use a thesaurus (a connected dictionary or a network of linked words or concepts) or other ontologies (linked data structures).

This way the search engine can, for example, also handle irregular verbs (went is not go plus a suffix) and other irregular word forms.

In addition, you can find not only different forms of the same word but also related words such as synonyms or hyponyms.

For example, if you search for purple, you will also find violet.

An out-of-the-box integration of a dictionary and thesaurus based on open data from Wiktionary (the dictionary sister project of Wikipedia) could be released next, if there are enough donations for it.

Please donate with the subject Wiktionary to make the out-of-the-box integration happen sooner or, if you need it earlier, ask us for the config steps to set this up for your language (for now, the setup still requires a triplestore).