str uses both comma , and cedilla ¸ #319
Comments
changed str-equiv to:
- no longer change cedilla to comma
- map the confusables to cedilla instead of comma
- but also leave the comma as itself since it might be punctuation

changed str-to-ipa to:
- map both cedilla and comma to glottal stop
- document in the mapping why, and what's still not right about this.

Partially addresses #319 but does not fully fix it since we'd still want to disambiguate instances of commas between punctuation and glottal stop. Fully fixing it will pose all sorts of challenges in a module without statistical components and where tokenization assumes you can decide up front whether a character is a letter or not. Any heuristic that maps some but not all commas to glottal stop will likely break the ReadAlong Studio, unless tokenization is patched at the same time to use the same heuristics.
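To make the two-stage behaviour described above concrete, here is a minimal sketch using plain Python dictionaries. It is not the project's actual mapping format or API: the names `EQUIV`, `TO_IPA` and `apply_mapping` are hypothetical, and U+201A is only an assumed example of a "confusable" character, since the thread does not list which confusables are mapped.

```python
CEDILLA = "\u00b8"       # ¸ (spacing cedilla), the glottal stop on the new keyboard
COMMA = ","              # punctuation on the new keyboard, ambiguous in older text
GLOTTAL_STOP = "\u0294"  # ʔ in IPA

# Stage 1 (str-equiv, sketched): confusables go to cedilla, comma stays as itself.
# U+201A is an assumed confusable, chosen only for illustration.
EQUIV = {"\u201a": CEDILLA}

# Stage 2 (str-to-ipa, sketched): both cedilla and comma become a glottal stop.
TO_IPA = {CEDILLA: GLOTTAL_STOP, COMMA: GLOTTAL_STOP}


def apply_mapping(text: str, mapping: dict) -> str:
    """Replace each character found in the mapping; pass everything else through."""
    return "".join(mapping.get(ch, ch) for ch in text)


equiv_out = apply_mapping("a,b\u201ac", EQUIV)  # comma untouched, confusable -> cedilla
ipa_out = apply_mapping(equiv_out, TO_IPA)      # every comma and cedilla -> glottal stop
```

The sketch also shows the remaining problem: after the second stage a punctuation comma is indistinguishable from a glottal stop.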
#321 partially addresses this, but does not fully fix it since we'd still want to disambiguate instances of commas between punctuation and glottal stop. Fully fixing it will pose all sorts of challenges in a module without statistical components and where tokenization assumes you can decide up front whether a character is a letter or not. Any heuristic that maps some but not all commas to glottal stop will likely break the ReadAlong Studio, unless tokenization is patched at the same time to use the same heuristics.
When we merge #321, I will suggest we close this issue and open a separate one, tagged as an enhancement, for that. To be honest, I don't see us ever doing it, given the extent of restructuring that would be required.
In the new keyboard, cedilla is used exclusively as glottal stop and comma is used exclusively as punctuation. However, there is older text in which comma is used both as a glottal stop and as punctuation. We should not perpetuate this by converting cedilla to comma as we currently do in str mappings, but we should also try to make the str-equiv mapping account for some of this messiness.
We can use some heuristics, but they won't be comprehensive without a statistical model. For example, when two commas appear side by side, the first will always be a cedilla: ,, -> ¸,
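The double-comma heuristic could be sketched as a simple string rewrite. The function name `disambiguate_double_commas` is hypothetical, and this is only an illustration of the rule, not a proposed patch; in particular, it makes no attempt to handle runs of three or more commas sensibly.

```python
CEDILLA = "\u00b8"  # ¸ (spacing cedilla)


def disambiguate_double_commas(text: str) -> str:
    """Rewrite ',,' as '¸,': the first comma is taken to be a glottal stop
    and the second is kept as punctuation. Single commas are left alone."""
    return text.replace(",,", CEDILLA + ",")


example = disambiguate_double_commas("ab,, cd, ef")
```

Single commas stay ambiguous here, which is why the heuristic cannot be comprehensive on its own.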