
low-res-nmt

For evaluation, please run:

pip install -r requirements.txt

python evaluator.py --input-file-path <path-to-test-file> --target-file-path <path-to-target-file>
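evaluator.py is not reproduced here; as a rough guide, the sketch below shows the kind of scoring such a script typically performs, assuming the input file already contains one translation per line and that BLEU (via sacrebleu) is the metric. Both of these are assumptions, not a description of the actual script.

```python
# Hypothetical sketch of the evaluation step, NOT the repository's evaluator.py.
# Assumes plain-text files with one sentence per line and BLEU scoring via sacrebleu.
import argparse
import sacrebleu

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-file-path", required=True)   # generated translations (assumed)
    parser.add_argument("--target-file-path", required=True)  # reference translations
    args = parser.parse_args()

    with open(args.input_file_path, encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open(args.target_file_path, encoding="utf-8") as f:
        references = [line.strip() for line in f]

    # corpus_bleu expects a list of hypotheses and a list of reference streams
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"BLEU: {bleu.score:.2f}")

if __name__ == "__main__":
    main()
```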

Training our best model

DATA_PATH=/path/to/data

  • $DATA_PATH should contain these files (synthetic data generated from monolingual text; to recreate them, see "Reconstructing our pipeline" below):

    • predictions/predictions_english_st_regex.txt
    • unaligned_tokenized_rempunc.en
    • predictions/predictions_french_bt_regex.txt
    • unaligned_tokenized.fr
  • CUDA_VISIBLE_DEVICES="0" python train.py --data_path $DATA_PATH --experiment 1_st --batch_size 64 \
    --num_layer 2 --d_model 1024 --dff 1024 --epochs 3 \
    --p_wd_st 0.3 --p_wd_bt 0.1 --dropout_rate 0.4 --start 200000 \
    --st --bt
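The --p_wd_st and --p_wd_bt flags are presumably word-dropout probabilities applied to the self-trained and back-translated synthetic sentences; that reading of the flags is an assumption, not taken from train.py. A minimal sketch of word dropout as it is commonly applied to synthetic data:

```python
# Hypothetical illustration of word dropout on synthetic sentences; the exact
# behaviour of --p_wd_st / --p_wd_bt in train.py is an assumption.
import random

def word_dropout(sentence, p_drop, seed=None):
    """Randomly drop each token with probability p_drop (never dropping every token)."""
    rng = random.Random(seed)
    tokens = sentence.split()
    kept = [tok for tok in tokens if rng.random() >= p_drop]
    return " ".join(kept) if kept else sentence

# e.g. --p_wd_st 0.3 would correspond to noising self-training outputs like this:
print(word_dropout("le chat est assis sur le tapis", p_drop=0.3, seed=0))
```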

Reconstructing our pipeline (from scratch):

Split Data

$DATA_PATH should contain these files:

  • train.lang1

  • train.lang2

  • python split_data.py --data_path $DATA_PATH
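split_data.py is not reproduced here; the sketch below shows a plausible version of the split it performs on the aligned train.lang1/train.lang2 files. The 90/10 ratio and the split_val output names are assumptions; only the train/split_train.* names are taken from the commands below.

```python
# Hypothetical sketch of splitting aligned parallel files into train/validation sets;
# the actual split_data.py logic, ratio, and validation file names are assumptions.
import argparse, os, random

def split(data_path, val_fraction=0.1, seed=42):
    with open(os.path.join(data_path, "train.lang1"), encoding="utf-8") as f1, \
         open(os.path.join(data_path, "train.lang2"), encoding="utf-8") as f2:
        pairs = list(zip(f1.readlines(), f2.readlines()))

    random.Random(seed).shuffle(pairs)
    n_val = int(len(pairs) * val_fraction)
    val, train = pairs[:n_val], pairs[n_val:]

    os.makedirs(os.path.join(data_path, "train"), exist_ok=True)
    for name, subset in (("split_train", train), ("split_val", val)):
        with open(os.path.join(data_path, "train", f"{name}.lang1"), "w", encoding="utf-8") as f1, \
             open(os.path.join(data_path, "train", f"{name}.lang2"), "w", encoding="utf-8") as f2:
            for src, tgt in subset:
                f1.write(src)
                f2.write(tgt)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", required=True)
    split(parser.parse_args().data_path)
```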

Train Self-Training and Back-Translation models on parallel data

Self-Training:

CUDA_VISIBLE_DEVICES="0" python train.py --data_path $DATA_PATH --experiment 1_st --batch_size 64 \
    --num_layer 1 --d_model 1024 --dff 1024 --epochs 50 \
    --dropout_rate 0.4 \
    --train_lang1 train/split_train.lang1 \
    --train_lang2 train/split_train.lang2 \
    --val_lang1 train/split_train.lang1 \
    --val_lang2 train/split_train.lang2

Back-Translation:

The same command with the languages switched (training the reverse direction):

CUDA_VISIBLE_DEVICES="0" python train.py --data_path $DATA_PATH --experiment 1_bt --batch_size 64 \
    --num_layer 1 --d_model 1024 --dff 1024 --epochs 50 \
    --dropout_rate 0.4 \
    --train_lang1 train/split_train.lang2 \
    --train_lang2 train/split_train.lang1 \
    --val_lang1 train/split_train.lang2 \
    --val_lang2 train/split_train.lang1
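As the two commands show, the back-translation model is simply the forward model trained with source and target swapped. A small sketch of that data-loading symmetry (load_pairs is a hypothetical helper, not part of train.py):

```python
# Hypothetical illustration of how the forward (self-training) and backward
# (back-translation) models differ only in the direction of the parallel data.
from pathlib import Path

def load_pairs(data_path, src_file, tgt_file):
    """Load aligned sentences as (source, target) pairs."""
    src = Path(data_path, src_file).read_text(encoding="utf-8").splitlines()
    tgt = Path(data_path, tgt_file).read_text(encoding="utf-8").splitlines()
    return list(zip(src, tgt))

data_path = "/path/to/data"  # i.e. $DATA_PATH

# Forward (self-training) model: lang1 -> lang2
st_pairs = load_pairs(data_path, "train/split_train.lang1", "train/split_train.lang2")

# Backward (back-translation) model: lang2 -> lang1, i.e. the same files swapped
bt_pairs = load_pairs(data_path, "train/split_train.lang2", "train/split_train.lang1")
```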

Forward Generation on Monolingual Data

  • CUDA_VISIBLE_DEVICES="0" python generation.py --checkpoint_path /path/to/st/model \
    --npz_path ../model/data_and_vocab_bt_st_upsample_best.npz \
    --start 200000 --end 400000

Generated predictions will be saved to an output file: predictions_english_monolingual_$(START)_$(END).txt
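generation.py is not reproduced here; the --start and --end flags appear to select a slice of the monolingual corpus so it can be translated in chunks, and the sketch below illustrates that slicing and the output naming. translate_batch, the file name in the usage comment, and the exact behaviour are assumptions.

```python
# Hypothetical sketch of chunked generation over a monolingual corpus; the real
# generation.py (checkpoint restoring, decoding, output naming) is not reproduced here.
def translate_batch(sentences):
    # Placeholder for restoring the ST/BT checkpoint and decoding; it just echoes
    # the input so the sketch stays self-contained.
    return sentences

def generate_chunk(mono_path, start, end, lang="english"):
    with open(mono_path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f][start:end]  # the --start / --end slice

    predictions = translate_batch(sentences)

    out_path = f"predictions_{lang}_monolingual_{start}_{end}.txt"
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(predictions) + "\n")
    return out_path

# e.g. (assuming the forward model reads the French monolingual file):
# generate_chunk("unaligned_tokenized.fr", start=200000, end=400000)
```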

Backward Generation on Monolingual Data

  • CUDA_VISIBLE_DEVICES="0" python generation.py --checkpoint_path /path/to/bt/model \
    --npz_path ../model/data_and_vocab_bt_st_upsample_best.npz \
    --start 200000 --end 400000

Generated predictions will be saved to an output file: predictions_english_monolingual_$(START)_$(END).txt

Post-process with regex

  • python refine_preds_regex.py --file predictions/forward/txt
  • python refine_preds_regex.py --file predictions/backward/txt

(--file points at the forward and backward predictions generated in the previous steps.)
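refine_preds_regex.py is not reproduced here; the sketch below shows the kind of regex clean-up commonly applied to raw NMT output (collapsing immediately repeated tokens, normalizing whitespace). The actual rules and the output file name are assumptions.

```python
# Hypothetical sketch of regex-based post-processing of generated translations;
# the actual rules in refine_preds_regex.py are assumptions.
import argparse, re

def refine(line):
    line = re.sub(r"\b(\w+)( \1\b)+", r"\1", line)  # collapse repeated words ("the the" -> "the")
    line = re.sub(r"\s+", " ", line).strip()        # normalize whitespace
    return line

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", required=True)
    args = parser.parse_args()

    with open(args.file, encoding="utf-8") as f:
        refined = [refine(line) for line in f]

    out_path = args.file + ".regex.txt"  # hypothetical output name
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(refined) + "\n")
```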

Train the best model as described in the "Training our best model" section above, and repeat for n iterations!
