subword-miniproject-2

Mini-Project 2 for 11824. I improve upon two baseline methods (neural and non-neural) for paradigm completion and reinflection in three languages: Kabardian (kbd), Swahili (swc), and Mixtec (xty).

File Structure

The files are organized as follows:

  • dataset - Contains the provided train/dev/test files, as well as the augmented files and preprocessed train files.
  • 1_Scripts - Contains the scripts used for data preparation, augmentation, and model training, as well as inference to produce the final inflected forms.
  • 2_Final_Submission - The predicted inflected forms for the lemmas and tags of the three languages.
  • 3_Augment_Data - The extra data files I used to augment the training data for kbd and swc. All of these files were obtained from the SIGMORPHON 2019 Shared Task 1 data directory.

Run Commands

Navigate to the 'kbd' and 'swc' subfolders in 0_Scripts and run: sbatch run.sh
For 'xty', navigate further into the 'neural-transducer' directory and run: sbatch run_tagtransformer.sh

My Approach

Kabardian (kbd) and Swahili (swc) (code)

For these two languages, the neural baseline proved weaker than the non-neural baseline, so augmenting the training data seemed like a straightforward way to raise accuracy. Swahili data was readily available (around 9,000 additional examples), since it was one of the high-resource languages in SIGMORPHON 2019, while Kabardian data was scarcer (around 200 additional examples). All of these files are in the 3_Augment_Data folder above. I tried three augmentation approaches:

  • Simple concatenation - appending the new data to the existing train files. This worked very well and surpassed both baselines.
  • "Self-pollination" - generating Cartesian-product pairs of lemmas and tags, both drawn from the new files.
  • "Cross-pollination" - generating Cartesian-product pairs of lemmas from the new files with tags and affixes from the existing train data.

In both languages, simple concatenation worked better than self-pollination, which in turn worked better than cross-pollination. Kabardian showed a slight improvement in dev set accuracy (88.3 to 88.67), while Swahili improved dramatically (71.5 to 95.9).
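
The snippet below is a minimal sketch of these three strategies, not the actual scripts in 1_Scripts. It assumes SIGMORPHON-style TSV rows of (lemma, inflected form, tag string) and a hypothetical build_form callback standing in for however the target form is constructed (e.g., attaching an affix observed in the source data to the lemma):

```python
import itertools

def read_tsv(path):
    """Read a SIGMORPHON-style file: lemma <TAB> inflected form <TAB> tags."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t")) for line in f if line.strip()]

def concatenate(train, extra):
    """Simple concatenation: append the extra examples to the training data."""
    return train + extra

def self_pollinate(extra, build_form):
    """'Self-pollination': Cartesian product of lemmas and tag sets,
    both taken from the new files. `build_form` is a hypothetical helper
    that constructs the target surface form."""
    lemmas = {lemma for lemma, _, _ in extra}
    tagsets = {tags for _, _, tags in extra}
    return [(lemma, build_form(lemma, tags), tags)
            for lemma, tags in itertools.product(lemmas, tagsets)]

def cross_pollinate(train, extra, build_form):
    """'Cross-pollination': pair lemmas from the new files with tag sets
    (and their affixes) observed in the existing training data."""
    new_lemmas = {lemma for lemma, _, _ in extra}
    train_tagsets = {tags for _, _, tags in train}
    return [(lemma, build_form(lemma, tags), tags)
            for lemma, tags in itertools.product(new_lemmas, train_tagsets)]
```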

Mixtec (xty) (code)

Since I could not find high-quality augmentation data for Mixtec, and the neural baseline already gave good results, I focused on hyperparameter optimization instead. I ran a grid search over the following hyperparameters:

  • layers = [4, 8, 10]
  • architecture = [hmm, hmmfull, tagtransformer, taguniversaltransformer]
  • decode = [greedy, beam]
  • attention_heads = [4, 8]

The tagtransformer gave the best dev set results with 10 layers, 4 attention heads, and greedy decoding (80.16 on the dev set), closely followed by the taguniversaltransformer (79.36). Since the test files have no gold labels, I used "PLACEHOLDER" tokens in them and hence cannot report test set metrics.
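
For reference, here is a small sketch (not the actual run scripts) of how the resulting 48-point grid can be enumerated; the flag names printed at the end are illustrative placeholders, not the toolkit's real command-line options:

```python
import itertools

# Hyperparameter grid from the search described above.
GRID = {
    "layers": [4, 8, 10],
    "arch": ["hmm", "hmmfull", "tagtransformer", "taguniversaltransformer"],
    "decode": ["greedy", "beam"],
    "attention_heads": [4, 8],
}

def configurations(grid):
    """Yield one dict per point in the Cartesian product of the grid."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

if __name__ == "__main__":
    # 3 * 4 * 2 * 2 = 48 configurations in total.
    for i, cfg in enumerate(configurations(GRID)):
        print(f"run_{i:02d}: --arch {cfg['arch']} --layers {cfg['layers']} "
              f"--heads {cfg['attention_heads']} --decode {cfg['decode']}")
```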
