
Pipeline for creating a Part of Speech Tagger and Entity Recognizer for the Greek Language in spaCy.


datascouting/spacy-greek-model-pipeline


This repository provides everything needed to support a Greek model in spaCy whose Part of Speech tags are classes with morphological features. The tag map can be found on this page. The dataset consists of news articles from the newspaper “Makedonia” and is part of the clarin project. The dataset is under the CC BY-NC-SA licence.

Additional work has been done to support Named Entity Recognition in the Greek model. The same annotated source was used to support 4 types of named entities (person, organization, location, facility). The dataset annotated with named entities is licensed under the CC BY-NC-SA licence.

To create the model, the train and dev data are provided in the proper JSON format. To recreate the dataset from source, however, the following steps have to be followed.

  • Step 1: Download and unzip the dataset with the pos tags from this link.
  • Step 2: python parsing_sentences.py path_of_extracted_folder: Extracts the sentences from the dataset. The path of the extracted folder must be passed as an argument. The sentences are saved as JSON objects in sentences.json.
  • Step 3: python parsing_tags.py path_of_extracted_folder: Extracts the POS tags of the sentences. The path of the extracted folder must be passed as an argument. tags.json will contain the part of speech tag of every token, with each record matching the index of the corresponding record in sentences.json.
  • Step 4: Download and unzip the dataset with the named entities from this link.
  • Step 5: python making_entity_list.py path_of_extracted_folder_1 path_of_extracted_folder_2: Creates the entity list from the dataset. The paths of the two previously extracted folders must be passed as arguments.
  • Step 6: python edit_entity_list.py: Removes insufficient records from the list (entities with fewer than 4 characters) and sorts the list by the length of the named entity.
  • Step 7: python parsing_entities.py: Extracts the annotated entities from the sentences. entities.json will contain the position and the class of each named entity, with each record matching the index of the corresponding record in sentences.json.
  • Step 8: python convert_to_biluo.py: Converts the entities to BILUO format.
  • Step 9: python convert_to_json_format_and_split_to_train_dev.py: Creates the train and dev data, using only the records that are properly tokenized and whose sentences contain entities.
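To make Steps 6 and 8 more concrete, here is a minimal sketch in plain Python. It is not the code of edit_entity_list.py or convert_to_biluo.py; the function names, the longest-first sort order, the whitespace tokenization, the label strings and the sample sentence are all assumptions made for illustration.

```python
from typing import List, Tuple

# (start_char, end_char, label), with end_char exclusive. Hypothetical layout.
Span = Tuple[int, int, str]


def filter_and_sort_entities(entities: List[str], min_chars: int = 4) -> List[str]:
    """Drop entities shorter than min_chars, then sort longest first.

    The sort direction is an assumption; the README only says the list
    is sorted by entity length.
    """
    kept = [entity for entity in entities if len(entity) >= min_chars]
    return sorted(kept, key=len, reverse=True)


def token_offsets(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Find the character span of each token, scanning left to right."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


def to_biluo(text: str, tokens: List[str], spans: List[Span]) -> List[str]:
    """Assign B-/I-/L-/U- tags to entity tokens and O to all other tokens."""
    offsets = token_offsets(text, tokens)
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        # Tokens that lie fully inside the entity span.
        covered = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end]
        if not covered:
            continue  # no whole token falls inside this span, so skip it
        if len(covered) == 1:
            tags[covered[0]] = "U-" + label  # Unit: single-token entity
        else:
            tags[covered[0]] = "B-" + label  # Begin
            for i in covered[1:-1]:
                tags[i] = "I-" + label  # In
            tags[covered[-1]] = "L-" + label  # Last
    return tags


# Made-up example sentence, tokenized on whitespace for simplicity.
text = "Ο Γιώργος Παπαδόπουλος ζει στη Θεσσαλονίκη ."
tokens = text.split()
person = "Γιώργος Παπαδόπουλος"
place = "Θεσσαλονίκη"
spans = [
    (text.index(person), text.index(person) + len(person), "PERSON"),
    (text.index(place), text.index(place) + len(place), "LOCATION"),
]
print(to_biluo(text, tokens, spans))
# ['O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'U-LOCATION', 'O']
```

Sorting longest first is one plausible reason for the length sort in Step 6: when entities are matched against sentences, a longer name is tried before any shorter name it contains.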

Note that the init file of lang/el has to be configured so that the proper tag_map is used. The model was trained with the Greek FastText Common Crawl vectors from this link as pretrained vectors. The model provides both the POS Tagger and the Entity Recognizer.
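As a rough illustration of what a tag map with morphological features looks like, the sketch below maps each fine-grained tag to a coarse part of speech plus features, in the dictionary shape used by spaCy v2 tag maps. The tag names, the feature values and the two helper functions are hypothetical; the real entries are the ones in the project's tag map.

```python
# Hypothetical entries for illustration only; not the project's actual tags.
TAG_MAP = {
    "NOUN_Masc_Sing_Nom": {"pos": "NOUN", "Gender": "Masc", "Number": "Sing", "Case": "Nom"},
    "VERB_Sing_3_Pres": {"pos": "VERB", "Number": "Sing", "Person": "3", "Tense": "Pres"},
    "PUNCT": {"pos": "PUNCT"},
}


def coarse_pos(tag, tag_map=TAG_MAP):
    """Return the coarse part of speech for a tag, or "X" if the tag is unmapped."""
    return tag_map.get(tag, {}).get("pos", "X")


def morph_features(tag, tag_map=TAG_MAP):
    """Return the morphological features of a tag, without the coarse POS."""
    return {key: value for key, value in tag_map.get(tag, {}).items() if key != "pos"}


print(coarse_pos("NOUN_Masc_Sing_Nom"))
print(morph_features("VERB_Sing_3_Pres"))
```

If the language defaults do not point at the intended tag map, fine-grained tags predicted by the tagger cannot be resolved to their coarse POS and features, which is why the lang/el configuration matters.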
