Bulgarian spaCy natural language processing pipeline

Paper: An Improved Bulgarian Natural Language Processing Pipeline, in the proceedings of the International Conference on Information Systems, Embedded Systems and Intelligent Applications (ISESIA) 2023.

Usage

First, the pretrained models need to be downloaded from HuggingFace into the repository folder.

To use the pipeline, install it as a local Python package:

python -m spacy package ./models_v3.3/model-best/ packages --name bg --version 1.0.0 --code language_components/custom_bg_lang.py
pip install packages/bg_bg-1.0.0/dist/bg_bg-1.0.0.tar.gz

You can check whether the pipeline was installed correctly with the pip list command.

After a successful installation, the pipeline can be loaded in Python as a spaCy language model. The custom tokenizer needs to be attached manually:

import spacy
from language_components.custom_tokenizer import custom_tokenizer

nlp = spacy.load("bg_bg")
nlp.tokenizer = custom_tokenizer(nlp)

For more details on how to use the pipeline, please take a look at the Model loading and usage notebook and the official spaCy documentation.
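As a quick smoke test, here is a minimal usage sketch; the sample sentence and the choice of token attributes are illustrative, using the standard spaCy attributes that should be available once the pipeline is installed:

doc = nlp("Това е примерно изречение.")  # "This is an example sentence."
for token in doc:
    # text, lemma, part-of-speech tag, and dependency label per token
    print(token.text, token.lemma_, token.pos_, token.dep_)
for sent in doc.sents:  # sentence boundaries come from the sentence splitter
    print(sent.text)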

Project structure and details

Pipeline components

The pipeline consists of the following steps:

  • Tokenization
  • Sentence Splitting
  • Lemmatization
  • Part-of-speech Tagging
  • Dependency Parsing
  • Word Sense Disambiguation (available upon request)
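The exact component names registered by the trained model depend on its config; a quick way to inspect them after loading (the names in the comment are typical spaCy 3.x names, not a guarantee of what this model registers):

print(nlp.pipe_names)
# e.g. something like ['tok2vec', 'tagger', 'parser', 'lemmatizer']
# in a typical spaCy 3.x pipeline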

Pretrained vectors

Pretrained fastText vectors for the Bulgarian language can be downloaded from the fastText website and placed in the vectors/ folder.
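The raw fastText file can then be converted into spaCy's vector format with the standard spacy init vectors command. The file name below follows fastText's usual naming for Bulgarian (cc.bg.300.vec.gz) and is an assumption; adjust it to the file you actually downloaded:

python -m spacy init vectors bg vectors/cc.bg.300.vec.gz vectors/bg_vectors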

spaCy project structure

After downloading the pretrained word vectors and the pretrained models, the project should consist of the following folders:

  • configs/ - configuration files,
  • corpus/ - train/dev/test dataset in .spacy format,
  • language_components/ - files for the custom language components (tokenizer, sentencizer, and connected files),
  • models_v3.3/ - trained pipeline models in spaCy 3.3,
  • models_v3.4/ - trained pipeline models in spaCy 3.4,
  • tests/ - unittests for the custom components,
  • vectors/ - pretrained word embeddings (fastText),
  • visualiations/ - dependency parsing visualizations on the test set.

Tokenization

Tokenization is the first step of the pipeline. The Bulgarian tokenizer consists of custom rules, exceptions and stopwords. It can be used separately from the rest of the pipeline.
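For standalone use, a minimal sketch, assuming spaCy's built-in blank Bulgarian pipeline is an acceptable host for the custom tokenizer:

import spacy
from language_components.custom_tokenizer import custom_tokenizer

nlp = spacy.blank("bg")                # empty pipeline, no trained components
nlp.tokenizer = custom_tokenizer(nlp)
print([t.text for t in nlp("Пример за токенизация.")])  # "An example of tokenization."

For the full trained model, load bg_bg as shown above instead of a blank pipeline.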

Rules

The rules for the rule-based tokenizer are in the file language_components/custom_tokenizer.py. They are defined by the following regular expressions:

import re

prefix_re = re.compile(r'''^[\[\("'“„]''')             # opening brackets and quotes
suffix_re = re.compile(r'''[\]\)"'\.\?\!,:%$€“„]$''')  # closing punctuation, quotes, percent and currency signs
infix_re = re.compile(r'''[~]''')                      # characters that split a token internally
simple_url_re = re.compile(r'''^https?://''')          # strings starting with http(s):// are treated as URLs
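For context, this is how such expressions are typically wired into a spaCy Tokenizer. A minimal sketch, not necessarily the exact code of custom_tokenizer; the function name make_tokenizer is hypothetical:

from spacy.tokenizer import Tokenizer

def make_tokenizer(nlp):
    # hypothetical sketch: spaCy's Tokenizer takes the compiled patterns'
    # search/finditer/match callables
    return Tokenizer(
        nlp.vocab,
        prefix_search=prefix_re.search,
        suffix_search=suffix_re.search,
        infix_finditer=infix_re.finditer,
        url_match=simple_url_re.match,
    )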

Exceptions

Tokenizer exceptions are in the file language_components/token_exceptions.py. They are grouped into the following variables (a sketch of how such exceptions plug into spaCy follows the list):

  • METRICS_NO_DOT_EXC - units of measure
  • DASH_ABBR_EXC - abbreviations with an inner dash
  • DASH_ABBR_TITLE_EXC - abbreviations with an inner dash, capitalized
  • ABBR_DOT_MIDDLE_EXC - abbreviations with a dot that cannot end a sentence
  • ABBR_DOT_MIDDLE_TITLE_EXC - the same abbreviations, capitalized
  • ABBR_DOT_END_EXC - abbreviations with a dot that may end a sentence
  • ABBR_UPPERCASE_EXC - uppercase abbreviations
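Such exception tables are typically registered with the tokenizer as spaCy special cases, so that, for example, an abbreviation keeps its trailing dot as part of a single token. A minimal sketch under that assumption; the abbreviation "г." ("year") is illustrative:

from spacy.symbols import ORTH

# map the raw string to the single token it should produce
nlp.tokenizer.add_special_case("г.", [{ORTH: "г."}])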

Stopwords

Stopwords are defined in the file language_components/stopwords.py; they are taken from the BulTreeBank website.
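Once the pipeline is loaded, stopword membership is exposed through the usual spaCy token attribute; a short sketch with an illustrative sentence:

doc = nlp("Това е само един тест.")            # "This is just a test."
print([t.text for t in doc if not t.is_stop])  # tokens not in the stopword list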

Other components

Please refer to the paper for details about the rest of the components in the pipeline.

Reference

If you use the pipeline in your academic project, please cite as:

@article{berbatova2023improved,
  title={An improved Bulgarian natural language processing pipeline},
  author={Berbatova, Melania and Ivanov, Filip},
  journal={Annual of Sofia University St. Kliment Ohridski. Faculty of Mathematics and Informatics},
  volume={110},
  pages={37--50},
  year={2023}
}

MIT License
Copyright (c) 2023 Melania Berbatova