Implementation, trained models, and result data for the paper Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (PDF on arXiv). The supplemental material is available for download via GitHub Releases or Zenodo.
Requirements:
- Python >= 3.7 (Conda)
- Jupyter notebook (for evaluation)
- GPU with CUDA-support (for training Transformer models)
First, we advise creating a new virtual environment for Python 3.7 with Conda:
conda create -n docrel python=3.7
conda activate docrel
Install all Python dependencies:
pip install -r requirements.txt
Download the dataset (and pretrained models):
# Navigate to data directory
cd data
# Wikipedia corpus
# - download
wget https://github.com/malteos/semantic-document-relations/releases/download/1.0/enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2
# - decompress
bzip2 -d enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2
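After decompressing, the corpus is a JSONL file with one Wikipedia article per line. A minimal sketch for inspecting it is shown below; note that the field names used here ("title", "text") are assumptions for illustration only, so check the actual keys in your copy of the corpus:

```python
import json

# Sample line standing in for one record of the decompressed corpus file
# (enwiki-20191101-pages-articles.weighted.10k.jsonl). Field names are
# assumed for illustration and may differ from the real corpus.
sample_line = '{"title": "Example article", "text": "Example body text."}'

def iter_jsonl(lines):
    """Yield one parsed document per JSONL line, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# In practice, pass an open file handle instead of a list of strings:
# with open("enwiki-20191101-pages-articles.weighted.10k.jsonl") as f:
#     for doc in iter_jsonl(f): ...
docs = list(iter_jsonl([sample_line]))
print(docs[0]["title"])
```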
# Train and test data
# - download
wget https://github.com/malteos/semantic-document-relations/releases/download/1.0/train_testdata__4folds.tar.gz
# - decompress
tar -xzf train_testdata__4folds.tar.gz
# Models
# - download
wget https://github.com/malteos/semantic-document-relations/releases/download/1.0/model_wiki.bert_base__joint__seq512.tar.gz
# - decompress
tar -xzf model_wiki.bert_base__joint__seq512.tar.gz
Run a predefined experiment (settings can be found in experiments/predefined/wiki):
# Config: wiki.bert_base__joint__seq512
# GPU ID: 1 (set via CUDA_VISIBLE_DEVICES=1)
# Output dir: ./output
python cli.py run ./output 1 wiki.bert_base__joint__seq512
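To queue several predefined configs in a row, a simple shell loop works. The sketch below is a dry run that only prints the commands; remove the `echo` to actually execute them (which requires the dataset and a CUDA GPU). The seq128 config name is taken from the experiment settings above and may differ in your checkout:

```shell
# Dry run: print one training command per predefined config.
# Remove "echo" to launch the experiments for real.
for config in wiki.bert_base__joint__seq128 wiki.bert_base__joint__seq512; do
  echo python cli.py run ./output 1 "$config"
done
```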
You can also run the Jupyter evaluation notebook on Google Colab.
If you are using our code, please cite our paper:
@InProceedings{Ostendorff2020,
title = {Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles},
booktitle = {Proceedings of the {ACM}/{IEEE} {Joint} {Conference} on {Digital} {Libraries} ({JCDL})},
author = {Ostendorff, Malte and Ruas, Terry and Schubotz, Moritz and Gipp, Bela},
year = {2020},
month = {Aug.},
}
License: MIT