papahigh/elasticsearch-russian-phonetics

Elasticsearch plugin for Russian Phonetic Analysis

This plugin provides phonetic analysis for the Russian language. It exposes the russian_phonetic token filter, which transforms Russian words into their phonetic representation, the so-called phonetic code. These codes are used to match words and names that sound similar. The transformation is also known as phonetic encoding. The plugin can encode millions of Russian words per second, with the lowest impact on GC among all encoders compared in the encoding throughput benchmarks.

The encoding algorithm makes extensive use of phonetic and orthographic rules to bridge the gap between spelling and pronunciation in Russian.

Examples of spelling and pronunciation inconsistency
вдры[зг]        ⟷    вдры[ск]
слове[тск]ий    ⟷    славе[цк]ий
ла[ндш]афт      ⟷    ла[нш]афт
п[я]так         ⟷    п[и]так
бу[хг]алтер     ⟷    бу[г]алтер
бю[стг]альтер   ⟷    бю[зг]альтер
ле[стн]ица      ⟷    ле[сн]ица
кислово[дск]    ⟷    кислово[цк]

More information about the encoding process can be found in the encoding rules and the unit tests.

Installation

In order to install the plugin, choose a version and run:

$ bin/elasticsearch-plugin install URL

where URL points to the zip file of the release that matches your Elasticsearch version.

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

For example, for Elasticsearch 7.6.2:

# install plugin on Elasticsearch 7.6.2
$ bin/elasticsearch-plugin install https://github.com/papahigh/elasticsearch-russian-phonetics/raw/7.6.2/esplugin/plugin-distributions/analysis-russian-phonetic-7.6.2.zip

After installation, the plugin exposes a new token filter named russian_phonetic.

Getting started

To start using the russian_phonetic token filter, define it in the analysis settings of an index:

PUT /russian_phonetic_sample
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "russian_phonetic"
          ]
        }
      },
      "filter": {
        "russian_phonetic": {
          "type": "russian_phonetic",
          "replace": false
        }
      }
    }
  }
}

You can then test the analyzer, including the russian_phonetic token filter, with the analyze API:

POST /russian_phonetic_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "студентка комсомолка спортсменка"
}

Returns: стднк, студентка, кмсмлк, комсомолка, спрцмнк, спортсменка
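
The two console requests above can also be prepared from a script. The following is a minimal Python sketch using only the standard library. It builds the requests without sending them. The cluster address http://localhost:9200 and the helper names json_request and send are assumptions made for illustration and are not part of the plugin.

```python
import json
import urllib.request

# Assumption: a local single-node cluster; adjust to your environment.
ES_URL = "http://localhost:9200"
INDEX = "russian_phonetic_sample"


def json_request(method, path, body):
    """Build (but do not send) a JSON request for the Elasticsearch REST API."""
    return urllib.request.Request(
        url=f"{ES_URL}{path}",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


# Analysis settings: standard tokenizer followed by the russian_phonetic
# token filter, with replace disabled so original tokens are kept.
create_index = json_request("PUT", f"/{INDEX}", {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["russian_phonetic"],
                }
            },
            "filter": {
                "russian_phonetic": {"type": "russian_phonetic", "replace": False}
            },
        }
    }
})

# Request body for the analyze API, using the sample text from above.
analyze = json_request("POST", f"/{INDEX}/_analyze", {
    "analyzer": "my_analyzer",
    "text": "студентка комсомолка спортсменка",
})


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Against a running cluster with the plugin installed, calling send(create_index) and then send(analyze) should return a response whose "tokens" array lists the tokens shown above.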

Token filter settings

The russian_phonetic token filter supports the following configuration options:

replace

Whether or not the original token should be replaced by the phonetic code. Accepts true (default) or false.
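
To make the effect of replace concrete, here is a small Python sketch of the resulting token stream in both modes. It is only an illustration: emit_tokens and the CODES lookup table are hypothetical stand-ins for the real encoder, filled with the codes from the analyze example above, and the code-before-original ordering is simply copied from that example.

```python
# Hypothetical stand-in for the encoder: codes copied from the analyze example.
CODES = {
    "студентка": "стднк",
    "комсомолка": "кмсмлк",
    "спортсменка": "спрцмнк",
}


def emit_tokens(words, replace=True):
    """Return the tokens a phonetic filter would emit for the given words."""
    tokens = []
    for word in words:
        tokens.append(CODES[word])  # the phonetic code is always emitted
        if not replace:
            tokens.append(word)     # replace=false keeps the original token too
    return tokens


words = ["студентка", "комсомолка", "спортсменка"]
print(emit_tokens(words, replace=False))  # codes and original words
print(emit_tokens(words, replace=True))   # codes only
```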

vowels

Defines the encoding mode for vowels. Accepts encode_first (default) or encode_all.

encode_first: a vowel is encoded only when it is the first letter of the word; all other vowels are dropped
упячка          → упчк
голландский     → глнскй
абсурд          → апсрт

encode_all: all vowels are encoded according to the encoding rules
упячка          → уп2чк1
голландский     → г1л1нск2й
абсурд          → апс3рт

max_code_len

The maximum length of the phonetic code. Defaults to 8.

enable_stemmer

Whether or not stemming should be applied. Accepts true or false (default). When this option is enabled, only the base (or root) form of the supplied word is encoded, so different inflections of the same word produce the same code.

word            encode_first    encode_all
аннотируешь     антрш           ан1т2р32ш
аннотируешься   антрш           ан1т2р32ш
ящурным         ящрн            ящ3рн
ящурные         ящрн            ящ3рн
Tip: see the throughput and distribution benchmarks to understand how the encoder behaves and performs under different option values.
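
The options above can be combined in a single filter definition. The following Python sketch builds such a definition with the documented defaults and prints the resulting analysis settings as JSON. The helper russian_phonetic_filter, the filter name my_phonetic, and the validation rules (for instance, requiring max_code_len to be a positive integer) are assumptions made for illustration; the plugin itself may validate its settings differently.

```python
import json

VOWEL_MODES = ("encode_first", "encode_all")


def russian_phonetic_filter(replace=True, vowels="encode_first",
                            max_code_len=8, enable_stemmer=False):
    """Build a russian_phonetic filter definition using the documented defaults."""
    if vowels not in VOWEL_MODES:
        raise ValueError(f"vowels must be one of {VOWEL_MODES}, got {vowels!r}")
    # Assumption: only positive integer lengths make sense for a code.
    if not isinstance(max_code_len, int) or max_code_len < 1:
        raise ValueError("max_code_len must be a positive integer")
    return {
        "type": "russian_phonetic",
        "replace": bool(replace),
        "vowels": vowels,
        "max_code_len": max_code_len,
        "enable_stemmer": bool(enable_stemmer),
    }


# Example: keep original tokens, encode every vowel, stem before encoding.
# "my_phonetic" is a hypothetical filter name.
settings = {
    "analysis": {
        "filter": {
            "my_phonetic": russian_phonetic_filter(
                replace=False, vowels="encode_all", enable_stemmer=True
            )
        }
    }
}
print(json.dumps(settings, indent=2))
```

The printed object can be used as the "settings" part of an index creation request, with my_phonetic referenced from an analyzer's filter list.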

Credits

Contribute

Use the issue tracker and/or open pull requests.

Licence

Both the encoder and esplugin projects are released under version 2.0 of the Apache License.
