subword-miniproject-2

Mini-Project 2 for 11824. I improve upon two baseline methods (neural and non-neural) for paradigm completion and reinflection in three languages: Kabardian (kbd), Swahili (swc), and Mixtec (xty).

File Structure

The files are organized as follows:

  • dataset - Contains the provided train/dev/test files, as well as the augmented files and preprocessed train files.
  • 1_Scripts - Contains the scripts used for data preparation, augmentation, and model training, as well as inference to produce the final inflected forms.
  • 2_Final_Submission - The predicted inflected forms for the lemmas and tags of the three languages.
  • 3_Augment_Data - The extra data files I used to augment the training data for kbd and swc. All of these files were obtained from the SIGMORPHON 2019 Shared Task 1 data directory.

Run Commands

Navigate to the 'kbd' and 'swc' subfolders in 0_Scripts and run: sbatch run.sh
For 'xty', navigate further into the 'neural-transducer' directory and run: sbatch run_tagtransformer.sh

My Approach

Kabardian (kbd) and Swahili (swc) (code)

For these two languages, the neural baseline proved weaker than the non-neural baseline, so augmenting the training data seemed like a straightforward way to raise accuracy. Swahili data was readily available (around 9,000 additional examples), since it was one of the high-resource languages in SIGMORPHON 2019, while Kabardian data was scarcer (around 200 additional examples). All of these files are in the 3_Augment_Data folder above. I tried three augmentation approaches:

  • Simple concatenation - appending the new data to the existing train files. This worked very well and surpassed both baselines.
  • "Self-pollination" - generating Cartesian-product pairs of lemmas and tags, both drawn from the new files.
  • "Cross-pollination" - generating Cartesian-product pairs of lemmas from the new files with tags and affixes from the existing train data.

In both languages, simple concatenation worked better than self-pollination, which in turn worked better than cross-pollination. Kabardian showed a slight improvement in dev set accuracy (88.3 to 88.67), while Swahili improved dramatically (71.5 to 95.9).
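
The snippet below is a minimal sketch of these three strategies, not the actual scripts in 1_Scripts. It assumes SIGMORPHON-style TSV rows of (lemma, inflected form, tag string) and a hypothetical build_form callback standing in for however the target form is constructed (e.g., attaching an affix observed in the source data to the lemma):

```python
import itertools

def read_tsv(path):
    """Read a SIGMORPHON-style file: lemma <TAB> inflected form <TAB> tags."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t")) for line in f if line.strip()]

def concatenate(train, extra):
    """Simple concatenation: append the extra examples to the training data."""
    return train + extra

def self_pollinate(extra, build_form):
    """'Self-pollination': Cartesian product of lemmas and tag sets,
    both taken from the new files. `build_form` is a hypothetical helper
    that constructs the target surface form."""
    lemmas = {lemma for lemma, _, _ in extra}
    tagsets = {tags for _, _, tags in extra}
    return [(lemma, build_form(lemma, tags), tags)
            for lemma, tags in itertools.product(lemmas, tagsets)]

def cross_pollinate(train, extra, build_form):
    """'Cross-pollination': pair lemmas from the new files with tag sets
    (and their affixes) observed in the existing training data."""
    new_lemmas = {lemma for lemma, _, _ in extra}
    train_tagsets = {tags for _, _, tags in train}
    return [(lemma, build_form(lemma, tags), tags)
            for lemma, tags in itertools.product(new_lemmas, train_tagsets)]
```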

Mixtec (xty) (code)

Since I could not find high-quality augmentation data for Mixtec, and the neural baseline already gave good results, I focused on hyperparameter optimization instead. I ran a grid search over the following hyperparameters:

  • layers = [4, 8, 10]
  • architecture = [hmm, hmmfull, tagtransformer, taguniversaltransformer]
  • decode = [greedy, beam]
  • attention_heads = [4, 8]

The tagtransformer gave the best dev set results with 10 layers, 4 attention heads, and greedy decoding (80.16 on the dev set), closely followed by the taguniversaltransformer (79.36). Since the test files have no gold labels, I used "PLACEHOLDER" tokens in them and hence cannot report test set metrics.
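
For reference, here is a small sketch (not the actual run scripts) of how the resulting 48-point grid can be enumerated; the flag names printed at the end are illustrative placeholders, not the toolkit's real command-line options:

```python
import itertools

# Hyperparameter grid from the search described above.
GRID = {
    "layers": [4, 8, 10],
    "arch": ["hmm", "hmmfull", "tagtransformer", "taguniversaltransformer"],
    "decode": ["greedy", "beam"],
    "attention_heads": [4, 8],
}

def configurations(grid):
    """Yield one dict per point in the Cartesian product of the grid."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

if __name__ == "__main__":
    # 3 * 4 * 2 * 2 = 48 configurations in total.
    for i, cfg in enumerate(configurations(GRID)):
        print(f"run_{i:02d}: --arch {cfg['arch']} --layers {cfg['layers']} "
              f"--heads {cfg['attention_heads']} --decode {cfg['decode']}")
```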
