str uses both comma , and cedilla , #319

Open
roedoejet opened this issue Feb 7, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@roedoejet
Owner

In the new keyboard, cedilla is used exclusively as the glottal stop and comma is used exclusively as punctuation. However, older text uses the comma for both glottal stops and punctuation. We should not perpetuate this by converting cedilla to comma, as we currently do in the str mappings, but we should also try to make str-equiv account for some of this messiness.

We can use some heuristics, but they won't be comprehensive without a statistical model. For example, when two commas appear side by side, the first will always be a cedilla: ,, -> ¸,
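
The double-comma heuristic could be sketched roughly as follows. This is only an illustration, not the g2p implementation: the function name `disambiguate_double_commas` is hypothetical, the cedilla is assumed to be U+00B8 (SPACING CEDILLA), and the input strings are placeholders rather than real str text.

```python
import re

CEDILLA = "\u00b8"  # assumed code point for the cedilla used as glottal stop
COMMA = ","


def disambiguate_double_commas(text: str) -> str:
    """Rewrite the first comma of each adjacent pair of commas as a cedilla.

    Hypothetical helper: it only covers the ",," case and leaves every
    single comma untouched, since a lone comma remains ambiguous.
    """
    return re.sub(r",,", CEDILLA + COMMA, text)


if __name__ == "__main__":
    # Placeholder input, not real str text.
    print(disambiguate_double_commas("ab,, cd"))
```

A lone comma is left as-is here, which is exactly the case the heuristic cannot resolve.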

@roedoejet roedoejet added the bug Something isn't working label Feb 7, 2024
joanise added a commit that referenced this issue Feb 12, 2024
changed str-equiv to:
 - no longer change cedilla to comma
 - map the confusables to cedilla instead of comma
 - but also leave the comma as itself since it might be punctuation

changed str-to-ipa to:
 - map both cedilla and comma to glottal stop
 - document in the mapping why, and what's still not right about this.

Partially addresses #319 but does not fully fix it since we'd still want
to disambiguate instances of commas between punctuation and glottal
stop.

Fully fixing it will pose all sorts of challenges in a module without
statistical components and where tokenization assumes you can decide up
front whether a character is a letter or not. Any heuristic that maps
some but not all commas to glottal stop will likely break the ReadAlong
Studio, unless tokenization is patched at the same time to use the same
heuristics.
@joanise
Collaborator

joanise commented Feb 12, 2024

#321 partially addresses this, but does not fully fix it since we'd still want to disambiguate instances of commas between punctuation and glottal stop.

Fully fixing it will pose all sorts of challenges in a module without statistical components and where tokenization assumes you can decide up front whether a character is a letter or not. Any heuristic that maps some but not all commas to glottal stop will likely break the ReadAlong Studio, unless tokenization is patched at the same time to use the same heuristics.
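
To illustrate the coupling described above, here is a minimal sketch in which the tokenizer and the converter both call the same predicate, so they cannot disagree about which commas are letters. None of these names come from g2p or ReadAlong Studio: `is_glottal_comma`, `is_letter`, `tokenize` and `to_ipa` are hypothetical, the heuristic is just the "comma followed by a comma" rule from the issue description, and the cedilla is assumed to be U+00B8.

```python
CEDILLA = "\u00b8"       # assumed code point for the glottal-stop cedilla
GLOTTAL_STOP = "\u0294"  # IPA glottal stop


def is_glottal_comma(text: str, i: int) -> bool:
    """Hypothetical heuristic: a comma immediately followed by another comma."""
    return text[i] == "," and i + 1 < len(text) and text[i + 1] == ","


def is_letter(text: str, i: int) -> bool:
    """Letter test that depends on context, not only on the character."""
    ch = text[i]
    return ch.isalpha() or ch == CEDILLA or is_glottal_comma(text, i)


def tokenize(text: str) -> list[tuple[str, bool]]:
    """Group characters into (token, is_word) runs using is_letter."""
    tokens: list[tuple[str, bool]] = []
    for i, ch in enumerate(text):
        word = is_letter(text, i)
        if tokens and tokens[-1][1] == word:
            tokens[-1] = (tokens[-1][0] + ch, word)
        else:
            tokens.append((ch, word))
    return tokens


def to_ipa(text: str) -> str:
    """Map cedillas and heuristic glottal commas to the glottal stop."""
    return "".join(
        GLOTTAL_STOP if ch == CEDILLA or is_glottal_comma(text, i) else ch
        for i, ch in enumerate(text)
    )
```

With the placeholder input `"ab,, cd"`, the first comma stays inside the word token and is converted to a glottal stop, while the second is treated as punctuation by both functions. If only one of the two functions used the heuristic, the word boundaries and the converted output would no longer line up.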

@joanise
Collaborator

joanise commented Feb 12, 2024

When we merge #321, I will suggest we close this issue and open a separate one, tagged as an enhancement, for the remaining disambiguation work. To be honest, I don't see us ever doing it, given the extent of restructuring that would be required.

joanise added a commit that referenced this issue Feb 20, 2024