LUKE-NER

This repository provides code for NER training and inference using LUKE.

Features:

  • Our implementation relies on the Trainer class of huggingface/transformers (whereas the official repository provides examples based on AllenNLP).
  • This repository improves preprocessing for non-space-delimited languages.
  • The code is compatible with the fine-tuned LUKE NER models available on the Hugging Face Hub.

Usage

Installation

$ git clone https://github.com/naist-nlp/luke-ner.git
$ cd luke-ner
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt

Dataset preparation

Datasets must be in the JSON Lines format, where each line represents a document consisting of examples, as shown below:

{
  "id": "doc-001",
  "examples": [
    {
      "id": "s1",
      "text": "She graduated from NAIST.",
      "entities": [
        {
          "start": 19,
          "end": 24,
          "label": "ORG"
        }
      ],
      "word_positions": [[0, 3], [4, 13], [14, 18], [19, 24], [24, 25]]
    }
  ]
}

For each example, the surrounding examples in the same document are used to extend the context. The word_positions field is optional and may be null; when provided, it is used to enforce word boundaries during tokenization.
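
As a minimal illustration of writing data in this format (a sketch, not part of the repository; the output path and the regex-based word_positions are only one possible choice, since the field may also be omitted or set to null), the following Python snippet builds the document above and saves it as a .jsonl file:

import json
import re

def build_example(example_id, text, entities):
    # Word boundaries from a simple word/punctuation regex; this reproduces the
    # word_positions of the example above, but any tokenizer can be used.
    word_positions = [[m.start(), m.end()] for m in re.finditer(r"\w+|[^\w\s]", text)]
    return {"id": example_id, "text": text, "entities": entities, "word_positions": word_positions}

document = {
    "id": "doc-001",
    "examples": [
        build_example("s1", "She graduated from NAIST.", [{"start": 19, "end": 24, "label": "ORG"}])
    ],
}

# JSON Lines: one JSON document per line.
with open("data/example.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(document, ensure_ascii=False) + "\n")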

For CoNLL '03 datasets, you can use data/convert_conll2003_to_jsonl.py:

$ python data/convert_conll2003_to_jsonl.py eng.train eng.train.jsonl
$ python data/convert_conll2003_to_jsonl.py eng.testa eng.testa.jsonl
$ python data/convert_conll2003_to_jsonl.py eng.testb eng.testb.jsonl
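
For other IOB2-tagged corpora without such a converter, the mapping to this format boils down to turning tag sequences into character spans. A rough sketch (not the repository's script; joining tokens with single spaces and ignoring orphan I- tags are simplifications):

def iob2_to_example(example_id, tokens, tags):
    """Convert IOB2-tagged tokens into the span-based example format (simplified sketch)."""
    text, word_positions, entities = "", [], []
    prev_tag = "O"
    for token, tag in zip(tokens, tags):
        if text:
            text += " "  # join tokens with single spaces (a simplification)
        start = len(text)
        text += token
        word_positions.append([start, len(text)])
        if tag.startswith("B-"):
            entities.append({"start": start, "end": len(text), "label": tag[2:]})
        elif tag.startswith("I-") and prev_tag != "O" and prev_tag[2:] == tag[2:]:
            entities[-1]["end"] = len(text)  # extend the entity started on a previous token
        prev_tag = tag
    return {"id": example_id, "text": text, "entities": entities, "word_positions": word_positions}

print(iob2_to_example("s1", ["She", "graduated", "from", "NAIST", "."], ["O", "O", "O", "B-ORG", "O"]))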

Fine-tuning

torchrun --nproc_per_node 4 src/main.py \
    --do_train \
    --do_eval \
    --do_predict \
    --train_file data/eng.train.jsonl \
    --validation_file data/eng.testa.jsonl \
    --test_file data/eng.testb.jsonl \
    --model "studio-ousia/luke-large-lite" \
    --output_dir ./output/ \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 8 \
    --max_entity_length 64 \
    --max_mention_length 16 \
    --save_strategy epoch \
    --pretokenize false  # set to true to enforce word boundaries (word_positions) during tokenization

Evaluation/Prediction

torchrun --nproc_per_node 4 src/main.py \
    --do_eval \
    --do_predict \
    --validation_file data/eng.testa.jsonl \
    --test_file data/eng.testb.jsonl \
    --model PATH_TO_YOUR_MODEL \
    --output_dir ./output/ \
    --per_device_eval_batch_size 8 \
    --max_entity_length 64 \
    --max_mention_length 16 \
    --pretokenize false
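
For quick standalone inference with a fine-tuned LUKE NER model from the Hugging Face Hub, you can also call the entity-span classification API of transformers directly. This is a minimal sketch independent of this repository's scripts; the model name is taken from the table below, and the input text and word offsets are only illustrative:

import torch
from transformers import LukeTokenizer, LukeForEntitySpanClassification

model_name = "studio-ousia/luke-large-finetuned-conll-2003"
tokenizer = LukeTokenizer.from_pretrained(model_name)
model = LukeForEntitySpanClassification.from_pretrained(model_name)

text = "She graduated from NAIST."
# Enumerate all candidate spans over word boundaries (character offsets).
word_starts = [0, 4, 14, 19]
word_ends = [3, 13, 18, 24]
entity_spans = [(s, e) for i, s in enumerate(word_starts) for e in word_ends[i:]]

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(-1).squeeze(0).tolist()
for (start, end), label_id in zip(entity_spans, predictions):
    if label_id != 0:  # index 0 is the non-entity class
        print(text[start:end], model.config.id2label[label_id])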

Performance

CoNLL '03 English (test)

Model | Precision | Recall | F1
LUKE (paper) | - | - | 94.3
studio-ousia/luke-large-finetuned-conll-2003 on notebook | 93.86 | 94.53 | 94.20
studio-ousia/luke-large-finetuned-conll-2003 on script | 94.58 | 94.65 | 94.61
studio-ousia/luke-large-finetuned-conll-2003 on our code | 93.98 | 94.67 | 94.33
studio-ousia/luke-large-lite fine-tuned with our code | 93.66 | 94.79 | 94.22
mLUKE (paper) | - | - | 94.0
studio-ousia/mluke-large-lite-finetuned-conll-2003 on notebook* | 94.23 | 94.23 | 94.23
studio-ousia/mluke-large-lite-finetuned-conll-2003 on script* | 94.33 | 93.76 | 94.05
studio-ousia/mluke-large-lite-finetuned-conll-2003 on our code* | 93.76 | 93.92 | 93.84
studio-ousia/mluke-large-lite fine-tuned with our code | 94.10 | 94.49 | 94.29

Performance differences are mainly due to the different units of input used for tokenization. Note that the runs marked with * use slightly tweaked code when evaluating studio-ousia/mluke-large-lite-finetuned-conll-2003, because that model was fine-tuned with an erroneous entity_attention_mask (see issues #166 and #172 for details).
