Prepare partially diacritized input dataset #1

Open
ruohoruotsi opened this issue Nov 24, 2018 · 1 comment
@ruohoruotsi

To make it easier to normalize Yorùbá Wikipedia articles, create a partially diacritized dataset that carries only the diacritic marks below the letters (the under-dots).

The dataset can be used in the following ways:

  1. Train a model with partially diacritized text (i.e. sentences with correct lower marks) as input and the corresponding fully diacritized sentences as output. I believe this will give better accuracy than what we already have. If this gives very high accuracy, we can then consider step 2.
  2. Train a model that takes non-diacritized text and outputs partially diacritized text, then pass that output to the model from step 1 to get fully diacritized text, i.e. [non-diacritized text] ====> [partially diacritized text] ====> [fully diacritized text]

Motivation:
From my observation of how Yorùbá text is written, most people, especially young people, don't know the tonal marks (high, mid, and low) above the vowel letters, but many know how (and want to be able) to distinguish between symbols with and without the lower mark, e.g. E vs Ẹ, O vs Ọ and S vs Ṣ, especially now that Google Gboard is available on Android phones.

@ruohoruotsi ruohoruotsi added the enhancement New feature or request label Nov 24, 2018
@ruohoruotsi ruohoruotsi assigned ruohoruotsi and unassigned dadelani Dec 24, 2018
@ruohoruotsi

Two sets of parallel data are needed:

  • Partially diacritized ==> used for [non, partially] diacritized training pairs. Tracked by this issue: Scrape partially diacritized text yoruba-text#11
  • Fully diacritized ==> used for [partial, fully] diacritized training pairs. This can be based on the current ADR dataset, enhanced with Kola and Timilehin's fully diacritized contributions. The rule for creating the partial set will include decomposing the fully diacritized text so that just the accents (not the under-dots) are removed.
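
The decomposition rule described above could look roughly like the sketch below. It is only an illustration, not the project's actual preprocessing code: the function names (`to_partial`, `to_plain`, `make_pairs`) are hypothetical, and the set of combining marks treated as tone accents (grave, acute, macron) is an assumption that may need extending for other conventions found in the corpus.

```python
import unicodedata

# ASSUMPTION: these combining marks are the tone accents to strip.
TONE_MARKS = {
    "\u0300",  # combining grave accent (low tone)
    "\u0301",  # combining acute accent (high tone)
    "\u0304",  # combining macron (sometimes written for mid tone)
}
UNDER_DOT = "\u0323"  # combining dot below, as in ẹ, ọ, ṣ


def to_partial(text: str) -> str:
    """Drop tone accents but keep under-dots (hypothetical helper)."""
    # NFD splits each letter into its base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # NFC recomposes base letter + under-dot into a single code point.
    return unicodedata.normalize("NFC", kept)


def to_plain(text: str) -> str:
    """Drop both tone accents and under-dots (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed if ch not in TONE_MARKS and ch != UNDER_DOT
    )
    return unicodedata.normalize("NFC", kept)


def make_pairs(full: str):
    """Build the [non, partial] and [partial, fully] pairs for one sentence."""
    partial = to_partial(full)
    return (to_plain(full), partial), (partial, full)
```

Deriving both pairs from the same fully diacritized sentence keeps the two training sets aligned; partially diacritized text scraped from elsewhere would only need `to_plain` to produce its [non, partial] pair.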
