Armenian MFA

In-progress project on forced alignment of Armenian using the Montreal Forced Aligner (MFA).

We trained an acoustic model on the Armenian data from the FLEURS dataset. This data amounts to around 14 hours of Eastern Armenian speech (n=4380 sound files). We normalized the transcripts with the following goals:

to remove word-internal punctuation
to remove word-external punctuation
to convert digits into number lemmas
to find errors in the transcripts
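
The normalization goals above could be sketched roughly as follows. This is a minimal illustration, not the script used for the project: the function names and the `number_words` lookup are hypothetical, punctuation is detected through Unicode general categories (which cover Armenian marks such as ՞ and ։), and digit-to-word conversion is left to a table supplied by the caller. Tokens that still contain digits are returned separately so they can be checked by hand.

```python
import unicodedata


def is_punctuation(ch: str) -> bool:
    """True for any character in a Unicode punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


def strip_external(token: str) -> str:
    """Remove punctuation at the edges of a token (word-external)."""
    start, end = 0, len(token)
    while start < end and is_punctuation(token[start]):
        start += 1
    while end > start and is_punctuation(token[end - 1]):
        end -= 1
    return token[start:end]


def strip_internal(token: str) -> str:
    """Remove punctuation left inside a token (word-internal)."""
    return "".join(ch for ch in token if not is_punctuation(ch))


def normalize(text: str, number_words: dict[str, str]) -> tuple[str, list[str]]:
    """Return the normalized text plus tokens that still need manual review."""
    out, review = [], []
    for raw in text.split():
        token = strip_internal(strip_external(raw))
        if not token:
            continue  # the token was punctuation only
        if token.isdigit():
            # number_words is a hypothetical digit-string -> word lookup.
            token = number_words.get(token, token)
        if any(ch.isdigit() for ch in token):
            review.append(token)  # unresolved digits: flag as a possible error
        out.append(token)
    return " ".join(out), review
```

The real rules are likely more selective than this (for example, a hyphen inside a word may need different treatment from an emphasis mark), so the sketch is only a starting point.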

We manually created a pronunciation dictionary by examining the tokens in FLEURS against the Armenian Wiktionary entries on Wikipron.
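
One way to support this kind of dictionary building is to list the corpus tokens that have no pronunciation yet. The sketch below assumes a WikiPron-style TSV (one word, a tab, then space-separated phones per line). The function names are hypothetical and the entries are placeholders, not real Armenian pronunciations.

```python
from collections import defaultdict


def load_pronunciations(lines):
    """Parse 'word<TAB>phone phone ...' lines into {word: [phone lists]}."""
    lexicon = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, _, phones = line.partition("\t")
        if phones:
            # A word may appear on several lines, one per variant.
            lexicon[word].append(phones.split())
    return dict(lexicon)


def missing_tokens(transcripts, lexicon):
    """Return the sorted list of tokens with no dictionary entry."""
    seen = {tok for sent in transcripts for tok in sent.split()}
    return sorted(seen - set(lexicon))


# Placeholder entries, not real Armenian data.
tsv = ["aba\ta b a", "aba\ta p a", "ka\tk a"]
lexicon = load_pronunciations(tsv)
print(missing_tokens(["aba ka", "ka zu"], lexicon))
```

The tokens reported this way are the ones that would still need a manual entry.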

We first trained the model with a beam of 100. This run generated TextGrids with word and phone alignments for 4324 of the 4380 sound files. We then re-ran the model on the data with a beam of 1000, which produced TextGrids for 4379 sound files. The one remaining file appears to be broken.
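
To see which sound files still lack an alignment after a run, the output folder can be compared against the corpus. The sketch below assumes .wav sound files and that each TextGrid shares the base name of its sound file. The directory arguments and the function name are hypothetical.

```python
from pathlib import Path


def unaligned_files(corpus_dir, output_dir, audio_ext=".wav"):
    """Return base names of sound files with no matching TextGrid."""
    audio = {p.stem for p in Path(corpus_dir).rglob(f"*{audio_ext}")}
    grids = {p.stem for p in Path(output_dir).rglob("*.TextGrid")}
    return sorted(audio - grids)
```

Under these assumptions, the check would list 56 files after the run with a beam of 100 (4380 - 4324) and a single file after the run with a beam of 1000.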

Each TextGrid has the following structure:

words tier, generated by MFA.
phones tier, generated by MFA.
sentenceOriginal tier, manually generated. Lists the original transcript from FLEURS.
sentenceNormalized tier, manually generated. Lists the transcript that we created by normalizing the sentenceOriginal tier. The model was run over this tier.
notes tier, manually generated.
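
A quick sanity check on this layout is to read the tier names out of a TextGrid. The sketch below assumes Praat's long text format, in which each tier declares a line such as name = "words", and it uses no TextGrid library. The function names are hypothetical, and reading the file with the right encoding is left to the caller.

```python
import re

# Tier names taken from the list above.
EXPECTED_TIERS = ["words", "phones", "sentenceOriginal", "sentenceNormalized", "notes"]

# In Praat's long TextGrid format, each tier declares: name = "..."
TIER_NAME = re.compile(r'^\s*name\s*=\s*"(.*)"\s*$', re.MULTILINE)


def tier_names(textgrid_text):
    """Return tier names in the order they appear in the file."""
    return TIER_NAME.findall(textgrid_text)


def missing_tiers(textgrid_text):
    """Return the expected tiers that are absent from the TextGrid."""
    found = set(tier_names(textgrid_text))
    return [t for t in EXPECTED_TIERS if t not in found]
```

An empty result from `missing_tiers` means all five tiers are present; the order of the tiers is not checked.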
