Elasticsearch plugin for Russian Phonetic Analysis

This plugin provides phonetic analysis for the Russian language by exposing the russian_phonetic token filter, which transforms Russian words into their phonetic representation, the so-called phonetic code. These codes are used to match words and names that sound similar. The transformation is also known as phonetic encoding. The plugin can encode millions of Russian words per second and has the lowest impact on GC among all encoders compared in the encoding throughput benchmarks.

📎	See the results for matching misspellings and typos, and the distribution and encoding throughput benchmarks.

The encoding algorithm makes extensive use of phonetic and orthographic rules to bridge the gap between spelling and pronunciation in Russian.

Examples of spelling and pronunciation inconsistency

вдры[зг]        ⟷    вдры[ск]
слове[тск]ий    ⟷    славе[цк]ий
ла[ндш]афт      ⟷    ла[нш]афт
п[я]так         ⟷    п[и]так
бу[хг]алтер     ⟷    бу[г]алтер
бю[стг]алтер    ⟷    бю[зд]галтер
ле[стн]ица      ⟷    ле[сн]ица
кислово[дск]    ⟷    кислово[цк]

You can find more information about the encoding process in the encoding rules and unit tests.
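To illustrate how phonetic codes are used for matching, here is a minimal Python sketch that groups words by their code. It does not implement the encoder: the word and code pairs are copied from the enable_stemmer examples further down in this README (encode_first mode), and the helper names group_by_code and same_code are hypothetical. These particular pairs share a code because stemming is enabled, but the lookup works the same way for any words that encode identically.

```python
from collections import defaultdict

# Word -> code pairs copied from the enable_stemmer examples in this README.
# The codes are NOT computed here; the real encoding is done by the plugin.
KNOWN_CODES = {
    "аннотируешь": "антрш",
    "аннотируешься": "антрш",
    "ящурным": "ящрн",
    "ящурные": "ящрн",
}


def group_by_code(codes):
    """Build a code -> [words] mapping, preserving insertion order."""
    groups = defaultdict(list)
    for word, code in codes.items():
        groups[code].append(word)
    return dict(groups)


def same_code(a, b, codes=KNOWN_CODES):
    """True when both words are known and map to the same phonetic code."""
    return a in codes and b in codes and codes[a] == codes[b]
```

Inside Elasticsearch the same idea applies at index and query time: tokens are compared by their codes rather than by their spelling.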

Installation

In order to install the plugin, choose a version and run:

$ bin/elasticsearch-plugin install URL

where URL points to the zip file of the release that corresponds to your Elasticsearch version.

❗	The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

For example, for Elasticsearch 7.6.2:

# install plugin on Elasticsearch 7.6.2
$ bin/elasticsearch-plugin install https://github.com/papahigh/elasticsearch-russian-phonetics/raw/7.6.2/esplugin/plugin-distributions/analysis-russian-phonetic-7.6.2.zip

After installation, the plugin exposes a new token filter named russian_phonetic.
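Because the plugin has to be present on every node, it can be handy to check the cluster after installing. The Python sketch below compares the node list with the output of the cat plugins API and reports nodes that do not list the plugin. It rests on two assumptions: that the plugin registers under the name analysis-russian-phonetic (inferred from the zip file name above, not confirmed by this README), and that the cluster is reachable at http://localhost:9200. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: plugin name inferred from the zip file name in the install command.
PLUGIN_NAME = "analysis-russian-phonetic"


def nodes_missing_plugin(node_names, plugin_rows, plugin_name=PLUGIN_NAME):
    """Return the sorted node names that do not report the plugin.

    plugin_rows is the parsed JSON from GET /_cat/plugins?format=json,
    a list of objects with "name" (node name) and "component" (plugin name).
    """
    have = {row["name"] for row in plugin_rows
            if row.get("component") == plugin_name}
    return sorted(set(node_names) - have)


def fetch_json(url):
    """GET a URL and parse the response body as JSON."""
    with urlopen(url) as resp:
        return json.load(resp)


def check_cluster(base_url="http://localhost:9200"):
    """Query a running cluster; an empty list means every node has the plugin."""
    nodes = [n["name"]
             for n in fetch_json(base_url + "/_cat/nodes?format=json&h=name")]
    plugins = fetch_json(base_url + "/_cat/plugins?format=json")
    return nodes_missing_plugin(nodes, plugins)
```

Calling check_cluster() requires a running cluster; nodes_missing_plugin can be exercised offline with hand-written rows.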

Getting started

You can start using the russian_phonetic token filter by providing an analysis configuration:

PUT /russian_phonetic_sample
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "russian_phonetic"
          ]
        }
      },
      "filter": {
        "russian_phonetic": {
          "type": "russian_phonetic",
          "replace": false
        }
      }
    }
  }
}

You can then test the analyzer with the russian_phonetic token filter using the analyze API:

POST /russian_phonetic_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "студентка комсомолка спортсменка"
}

Returns: стднк, студентка, кмсмлк, комсомолка, спрцмнк, спортсменка
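The same analyze call can be issued from a script. The following is a minimal sketch using only the Python standard library; the function names and the default base URL http://localhost:9200 are assumptions made for illustration. It builds the request shown above and extracts the token strings from the tokens array of the response.

```python
import json
from urllib.request import Request, urlopen


def analyze_request(index, analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) the POST /<index>/_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return Request(
        f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response):
    """Extract the token texts from a parsed _analyze response body."""
    return [t["token"] for t in response["tokens"]]


def analyze(index, analyzer, text, base_url="http://localhost:9200"):
    """Send the request to a running cluster and return the token texts."""
    with urlopen(analyze_request(index, analyzer, text, base_url)) as resp:
        return token_strings(json.load(resp))
```

With the index from this section in place, analyze("russian_phonetic_sample", "my_analyzer", "студентка комсомолка спортсменка") should yield the tokens listed above, provided a cluster with the plugin is running.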

Token filter settings

The russian_phonetic token filter provides several configuration options:

replace

Whether or not the original token should be replaced by the phonetic code. Accepts true (default) or false.

vowels

Defines encoding mode for vowels. Accepts encode_first (default) or encode_all.

encode_first: only the first vowel in the supplied word is encoded

упячка          → упчк
голландский     → глнскй
абсурд          → апсрт

encode_all: all vowels are encoded according to the encoding rules

упячка          → уп2чк1
голландский     → г1л1нск2й
абсурд          → апс3рт

max_code_len

The maximum length of the phonetic code. Defaults to 8.

enable_stemmer

Whether or not stemming should be applied. Accepts true or false (default). When this option is enabled, only the base (or root) form of the supplied word is encoded.

With vowels set to encode_first:

аннотируешь     → антрш
аннотируешься   → антрш
ящурным         → ящрн
ящурные         → ящрн

With vowels set to encode_all:

аннотируешь     → ан1т2р32ш
аннотируешься   → ан1т2р32ш
ящурным         → ящ3рн
ящурные         → ящ3рн

💡	Please take a look at the throughput and distribution benchmarks to understand the encoder's behaviour and performance under different option values.
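To show how the four options fit together, here is a hedged Python sketch that assembles a filter definition with the defaults documented above and wraps it in index settings. The helper names are hypothetical, and the validation is a client-side sanity check chosen for this example (in particular, requiring max_code_len to be a positive integer is an assumption); the plugin may apply its own rules.

```python
VOWEL_MODES = ("encode_first", "encode_all")


def russian_phonetic_filter(replace=True, vowels="encode_first",
                            max_code_len=8, enable_stemmer=False):
    """Return a filter definition; defaults mirror those documented above."""
    if vowels not in VOWEL_MODES:
        raise ValueError(f"vowels must be one of {VOWEL_MODES}, got {vowels!r}")
    # Assumption: a usable code length is a positive integer (bool excluded).
    if (isinstance(max_code_len, bool) or not isinstance(max_code_len, int)
            or max_code_len < 1):
        raise ValueError("max_code_len must be a positive integer")
    return {
        "type": "russian_phonetic",
        "replace": bool(replace),
        "vowels": vowels,
        "max_code_len": max_code_len,
        "enable_stemmer": bool(enable_stemmer),
    }


def analysis_settings(filter_name, analyzer_name, **options):
    """Wrap the filter in index settings with a standard-tokenizer analyzer."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "tokenizer": "standard",
                        "filter": [filter_name],
                    }
                },
                "filter": {filter_name: russian_phonetic_filter(**options)},
            }
        }
    }
```

For example, analysis_settings("russian_phonetic", "my_analyzer", replace=False, vowels="encode_all") produces a body that can be serialized with json.dumps and sent when creating an index.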

Credits

Blog post "Phonetic algorithms" by Nikita Smetanin
Apache Lucene full-featured text search engine library
Elasticsearch distributed search and analytics engine

Contribute

Use the issue tracker and/or open pull requests.

Licence

Both the encoder and esplugin projects are released under the Apache License, version 2.0.
