Implementation of text preprocessing impact analysis on named entity recognition (NER) based on conditional random field (CRF) in Indonesian text.

exemuel/preprocessing-crf-ner

preprocessing-crf-ner

Description

This work assesses how preprocessing tasks affect named entity recognition (NER) performance on Indonesian text across various feature dimensions, and examines possible interactions among those tasks.

Figure: Flowchart of the experimental method for text preprocessing in CRF-based Indonesian NER.

Preprocessing Procedures

  1. Contractions Expansion
  2. Lowercase Conversion
  3. Stemming
  4. Number to Words Conversion
  5. Hyphen and Comma Splitting
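The procedures above can be sketched as a small token-level pipeline. This is a minimal illustration, not the repository's code: the contraction table, the function names, the 0 to 99 range of the number converter, and the choice to keep hyphens and commas as separate tokens are all assumptions, and the steps run in a convenient order rather than the listed one. Stemming is left as an optional callable (`stem`) because the stemmer used for this step is not described here.

```python
import re

# Hypothetical contraction table; the repository's actual list is not shown.
CONTRACTIONS = {"yg": "yang", "dgn": "dengan", "tdk": "tidak"}

UNITS = ["nol", "satu", "dua", "tiga", "empat",
         "lima", "enam", "tujuh", "delapan", "sembilan"]


def number_to_words(n):
    """Spell out an integer in the range 0..99 in Indonesian."""
    if n < 10:
        return UNITS[n]
    if n == 10:
        return "sepuluh"
    if n == 11:
        return "sebelas"
    if n < 20:
        return UNITS[n - 10] + " belas"
    tens, unit = divmod(n, 10)
    words = UNITS[tens] + " puluh"
    return words if unit == 0 else words + " " + UNITS[unit]


def split_hyphen_comma(token):
    """Split on hyphens and commas, keeping each separator as its own token."""
    return [part for part in re.split(r"([-,])", token) if part]


def preprocess(tokens, lowercase=True, stem=None):
    """Apply the preprocessing steps to a list of tokens.

    `stem` is an optional function str -> str (assumed interface).
    """
    out = []
    for token in tokens:
        for part in split_hyphen_comma(token):
            # Contraction expansion (lookup is case-insensitive).
            part = CONTRACTIONS.get(part.lower(), part)
            # Number-to-words conversion for small ASCII integers only.
            if part.isascii() and part.isdigit() and int(part) < 100:
                part = number_to_words(int(part))
            if lowercase:
                part = part.lower()
            # A spelled-out number may expand into several tokens.
            for piece in part.split():
                out.append(stem(piece) if stem else piece)
    return out
```

Each step can be switched on or off independently (for example `preprocess(tokens, lowercase=False)`), which is the kind of toggling needed to study interactions among the tasks.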

Feature Extraction

  1. The word
  2. The length of the word or number of characters
  3. Prefixes and suffixes of the word of varying lengths
  4. The word in lowercase
  5. A stemmed version of the word: trailing vowels and the letters g, y, and n are stripped from the end of the word, leaving a stem of at least two characters
  6. If the word is a punctuation mark
  7. If the word is a digit
  8. Features mentioned above for the previous word, the following word, and the words two places before and after
  9. Word POS tag
  10. If the word is at the beginning of the sentence (BOS) or the end of the sentence (EOS) or neither
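The feature list above maps naturally onto the dictionary-per-token format that sklearn-crfsuite accepts. The following is a sketch under assumptions: the function names, feature keys, affix lengths (1 to 3), and the `"MID"` label for tokens that are neither BOS nor EOS are illustrative choices, and a sentence is assumed to be a list of `(word, pos_tag)` pairs.

```python
import string


def simple_stem(word):
    """Strip trailing vowels and g, y, n, keeping at least two characters."""
    stem = word
    while len(stem) > 2 and stem[-1].lower() in "aeiougyn":
        stem = stem[:-1]
    return stem


def token_features(word, prefix=""):
    """Features that depend only on the word itself (items 1 to 7)."""
    feats = {
        prefix + "word": word,
        prefix + "len": len(word),
        prefix + "lower": word.lower(),
        prefix + "stem": simple_stem(word),
        prefix + "is_punct": bool(word) and all(ch in string.punctuation for ch in word),
        prefix + "is_digit": word.isdigit(),
    }
    # Prefixes and suffixes of varying lengths (lengths are an assumption).
    for n in (1, 2, 3):
        feats[f"{prefix}prefix{n}"] = word[:n]
        feats[f"{prefix}suffix{n}"] = word[-n:]
    return feats


def word2features(sent, i):
    """Build the feature dict for token i of a sentence of (word, pos) pairs."""
    word, pos = sent[i]
    feats = token_features(word)
    feats["pos"] = pos
    # Sentence position; a one-token sentence is labeled BOS here.
    if i == 0:
        feats["position"] = "BOS"
    elif i == len(sent) - 1:
        feats["position"] = "EOS"
    else:
        feats["position"] = "MID"
    # Same word-level features for neighbours up to two places away.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(sent):
            feats.update(token_features(sent[j][0], prefix=f"{offset:+d}:"))
    return feats


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]
```

The list returned by `sent2features` for each sentence is the shape of input that a `sklearn_crfsuite.CRF` model expects in `fit` and `predict`.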

Requirements

  • Both Linux and Windows are supported. Linux is recommended for performance and compatibility reasons.
  • 64-bit Python 3.7 installation.
  • I recommend sklearn-crfsuite 0.3.6, which I used for all experiments.
  • Download singgalang.tsv and store it in the data directory.
  • Download all_indo_man_tag_corpus_model.crf.tagger and store it in the pre-trained-model directory.
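Before running the experiments, it can help to confirm that both downloaded files are where the code expects them. The check below is an optional sketch, not part of the repository; the helper name `missing_files` is hypothetical, and the paths are assumed to be relative to the repository root, following the directory names given above.

```python
from pathlib import Path

# Paths taken from the requirements list, relative to the repository root.
REQUIRED_FILES = [
    Path("data") / "singgalang.tsv",
    Path("pre-trained-model") / "all_indo_man_tag_corpus_model.crf.tagger",
]


def missing_files(root="."):
    """Return the required files that do not exist under `root`."""
    root = Path(root)
    return [p.as_posix() for p in REQUIRED_FILES if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing required files:")
        for path in missing:
            print("  " + path)
    else:
        print("All required files are in place.")
```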

Usage

python main.py
