Bulgarian spaCy natural language processing pipeline

Paper: An Improved Bulgarian Natural Language Processing Pipeline, in the proceedings of the International Conference on Information Systems, Embedded Systems and Intelligent Applications (ISESIA) 2023.

Usage

First, the pretrained models need to be downloaded from HuggingFace into the repository folder.

To use the pipeline, install it as a local Python package:

python -m spacy package ./models_v3.3/model-best/ packages --name bg --version 1.0.0 --code language_components/custom_bg_lang.py
pip install packages/bg_bg-1.0.0/dist/bg_bg-1.0.0.tar.gz

You can check whether the pipeline was installed correctly with the pip list command.

After a successful installation, the pipeline can be loaded in Python as a spaCy language model. The custom tokenizer needs to be attached manually:

import spacy
from language_components.custom_tokenizer import custom_tokenizer

nlp = spacy.load("bg_bg")
nlp.tokenizer = custom_tokenizer(nlp)

For more details on how to use the pipeline, please take a look at the Model loading and usage notebook and the official spaCy documentation.
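As a quick smoke test, here is a minimal usage sketch; the sample sentence and the choice of token attributes are illustrative, using the standard spaCy attributes that should be available once the pipeline is installed:

doc = nlp("Това е примерно изречение.")  # "This is an example sentence."
for token in doc:
    # text, lemma, part-of-speech tag, and dependency label per token
    print(token.text, token.lemma_, token.pos_, token.dep_)
for sent in doc.sents:  # sentence boundaries come from the sentence splitter
    print(sent.text)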

Project structure and details

Pipeline components

The pipeline consists of the following steps:

  • Tokenization
  • Sentence Splitting
  • Lemmatization
  • Part-of-speech Tagging
  • Dependency Parsing
  • Word Sense Disambiguation (available upon request)
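The exact component names registered by the trained model depend on its config; a quick way to inspect them after loading (the names in the comment are typical spaCy 3.x names, not a guarantee of what this model registers):

print(nlp.pipe_names)
# e.g. something like ['tok2vec', 'tagger', 'parser', 'lemmatizer']
# in a typical spaCy 3.x pipeline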

Pretrained vectors

Pretrained fastText vectors for the Bulgarian language can be downloaded from the fastText website and placed in the vectors/ folder.
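The raw fastText file can then be converted into spaCy's vector format with the standard spacy init vectors command. The file name below follows fastText's usual naming for Bulgarian (cc.bg.300.vec.gz) and is an assumption; adjust it to the file you actually downloaded:

python -m spacy init vectors bg vectors/cc.bg.300.vec.gz vectors/bg_vectors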

spaCy project structure

After downloading the pretrained word vectors and the pretrained models, the project should consist of the following folders:

  • configs/ - configuration files,
  • corpus/ - train/dev/test dataset in .spacy format,
  • language_components/ - files for the custom language components (tokenizer, sentencizer, and connected files),
  • models_v3.3/ - trained pipeline models in spaCy 3.3,
  • models_v3.4/ - trained pipeline models in spaCy 3.4,
  • tests/ - unittests for the custom components,
  • vectors/ - pretrained word embeddings (fastText),
  • visualiations/ - dependency parsing visualizations on the test set.

Tokenization

Tokenization is the first step of the pipeline. The Bulgarian tokenizer consists of custom rules, exceptions and stopwords. It can be used separately from the rest of the pipeline.
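For standalone use, a minimal sketch, assuming spaCy's built-in blank Bulgarian pipeline is an acceptable host for the custom tokenizer:

import spacy
from language_components.custom_tokenizer import custom_tokenizer

nlp = spacy.blank("bg")                # empty pipeline, no trained components
nlp.tokenizer = custom_tokenizer(nlp)
print([t.text for t in nlp("Пример за токенизация.")])  # "An example of tokenization."

For the full trained model, load bg_bg as shown above instead of a blank pipeline.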

Rules

The rules for the rule-based tokenizer are in the file language_components/custom_tokenizer.py. They are defined by the following regular expressions:

import re

prefix_re = re.compile(r'''^[\[\("'“„]''')             # opening brackets and quotes
suffix_re = re.compile(r'''[\]\)"'\.\?\!,:%$€“„]$''')  # closing punctuation, quotes, percent and currency signs
infix_re = re.compile(r'''[~]''')                      # characters that split a token internally
simple_url_re = re.compile(r'''^https?://''')          # strings starting with http(s):// are treated as URLs
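For context, this is how such expressions are typically wired into a spaCy Tokenizer. A minimal sketch, not necessarily the exact code of custom_tokenizer; the function name make_tokenizer is hypothetical:

from spacy.tokenizer import Tokenizer

def make_tokenizer(nlp):
    # hypothetical sketch: spaCy's Tokenizer takes the compiled patterns'
    # search/finditer/match callables
    return Tokenizer(
        nlp.vocab,
        prefix_search=prefix_re.search,
        suffix_search=suffix_re.search,
        infix_finditer=infix_re.finditer,
        url_match=simple_url_re.match,
    )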

Exceptions

Tokenizer exceptions are in the file language_components/token_exceptions.py. They are grouped into the following variables (a sketch of how such exceptions plug into spaCy follows the list):

  • METRICS_NO_DOT_EXC - units of measure
  • DASH_ABBR_EXC - abbreviations with an inner dash
  • DASH_ABBR_TITLE_EXC - abbreviations with an inner dash, capitalized
  • ABBR_DOT_MIDDLE_EXC - abbreviations with a dot that cannot end a sentence
  • ABBR_DOT_MIDDLE_TITLE_EXC - the same abbreviations, capitalized
  • ABBR_DOT_END_EXC - abbreviations with a dot that may end a sentence
  • ABBR_UPPERCASE_EXC - uppercase abbreviations
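Such exception tables are typically registered with the tokenizer as spaCy special cases, so that, for example, an abbreviation keeps its trailing dot as part of a single token. A minimal sketch under that assumption; the abbreviation "г." ("year") is illustrative:

from spacy.symbols import ORTH

# map the raw string to the single token it should produce
nlp.tokenizer.add_special_case("г.", [{ORTH: "г."}])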

Stopwords

Stopwords are defined in the file language_components/stopwords.py; they are taken from the BulTreeBank website.
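Once the pipeline is loaded, stopword membership is exposed through the usual spaCy token attribute; a short sketch with an illustrative sentence:

doc = nlp("Това е само един тест.")            # "This is just a test."
print([t.text for t in doc if not t.is_stop])  # tokens not in the stopword list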

Other components

Please refer to the paper for details about the rest of the components in the pipeline.

Reference

If you use the pipeline in your academic project, please cite as:

@article{berbatova2023improved,
  title={An improved Bulgarian natural language processing pipeline},
  author={Berbatova, Melania and Ivanov, Filip},
  journal={Annual of Sofia University St. Kliment Ohridski. Faculty of Mathematics and Informatics},
  volume={110},
  pages={37--50},
  year={2023}
}

MIT License
Copyright (c) 2023 Melania Berbatova