str uses both comma , and cedilla ¸ #319

roedoejet · 2024-02-07T20:33:10Z

In the new keyboard, the cedilla is used exclusively as the glottal stop and the comma is used exclusively as punctuation. However, older text uses the comma for both glottal stops and punctuation. We should not perpetuate this by converting cedilla to comma, as we currently do in the str mappings, but we should also try to make str-equiv account for some of this messiness.

We can use some heuristics, but they won't be comprehensive without a statistical model. For example, when two commas appear side by side, the first will always be a cedilla: ,, -> ¸,
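A minimal sketch of that double-comma heuristic, in case it helps the discussion. The function name is made up for illustration, and the code assumes the cedilla in question is U+00B8 (spacing cedilla); it is not taken from the existing str mappings.

```python
# Assumption: the glottal-stop cedilla is U+00B8 (SPACING CEDILLA).
CEDILLA = "\u00b8"


def disambiguate_double_commas(text: str) -> str:
    """Rewrite each ',,' pair as cedilla + comma.

    Hypothetical helper: of two adjacent commas, the first is taken to be
    a glottal stop and the second to be punctuation. Single commas are
    left alone because they remain ambiguous.
    """
    return text.replace(",,", CEDILLA + ",")
```

This only covers the one unambiguous pattern; every lone comma is left as-is, which is why the heuristic cannot be comprehensive on its own.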

changed str-equiv to:
- no longer change cedilla to comma
- map the confusables to cedilla instead of comma
- but also leave the comma as itself since it might be punctuation

changed str-to-ipa to:
- map both cedilla and comma to glottal stop
- document in the mapping why, and what's still not right about this.

Partially addresses #319 but does not fully fix it since we'd still want to disambiguate instances of commas between punctuation and glottal stop.

Fully fixing it will pose all sorts of challenges in a module without statistical components and where tokenization assumes you can decide up front whether a character is a letter or not. Any heuristic that maps some but not all commas to glottal stop will likely break the ReadAlong Studio, unless tokenization is patched at the same time to use the same heuristics.
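To make the two changes concrete, here is a rough sketch using plain Python dictionaries. This is not the format of the real mapping files; the dictionary names, the `apply_mapping` helper, and the exact code points (U+00B8 for the cedilla, U+0327 and U+0326 for the combining confusables mentioned in #318) are assumptions for illustration only.

```python
CEDILLA = "\u00b8"       # assumed code point for the glottal-stop cedilla
GLOTTAL_STOP = "\u0294"  # IPA glottal stop

# str-equiv sketch: confusables now map to cedilla; comma has no entry,
# so it is left as itself since it might be punctuation.
STR_EQUIV = {
    " \u0327": CEDILLA,  # space + combining cedilla (assumed confusable)
    " \u0326": CEDILLA,  # space + combining comma below (assumed confusable)
}

# str-to-ipa sketch: both cedilla and comma map to glottal stop, which is
# still wrong for commas that are really punctuation.
STR_TO_IPA = {
    CEDILLA: GLOTTAL_STOP,
    ",": GLOTTAL_STOP,
}


def apply_mapping(text: str, mapping: dict[str, str]) -> str:
    """Apply a mapping by plain substring replacement, longest keys first."""
    for src in sorted(mapping, key=len, reverse=True):
        text = text.replace(src, mapping[src])
    return text
```

With these tables, `apply_mapping("a,b", STR_EQUIV)` leaves the comma untouched, while `apply_mapping("a,b", STR_TO_IPA)` turns it into a glottal stop regardless of whether it was punctuation.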

joanise · 2024-02-12T21:48:38Z

#321 partially addresses this, but does not fully fix it since we'd still want to disambiguate instances of commas between punctuation and glottal stop.

Fully fixing it will pose all sorts of challenges in a module without statistical components and where tokenization assumes you can decide up front whether a character is a letter or not. Any heuristic that maps some but not all commas to glottal stop will likely break the ReadAlong Studio, unless tokenization is patched at the same time to use the same heuristics.
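The tokenization problem can be illustrated with a toy tokenizer that, like the assumption described above, decides per character whether it is a letter. This is a simplified sketch and not the actual tokenizer; the input string is invented and is not a real word in any language.

```python
def tokenize(text: str, letters: set[str]) -> list[str]:
    """Split text into words, treating any character not in `letters` as a break."""
    tokens, current = [], ""
    for ch in text:
        if ch in letters:
            current += ch
        elif current:
            tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


ASCII_LETTERS = set("abcdefghijklmnopqrstuvwxyz")

# Invented input: the first comma stands for a glottal stop inside a word,
# the second is punctuation.
sample = "ta,n, sen"

# Comma is not a letter: the word-internal glottal stop splits the word.
comma_as_punctuation = tokenize(sample, ASCII_LETTERS)

# Comma is a letter: the punctuation comma is glued onto the word.
comma_as_letter = tokenize(sample, ASCII_LETTERS | {","})
```

Neither up-front choice gives the intended `ta,n` plus `sen`, which is why a comma heuristic in the mapping would need the same heuristic applied in tokenization.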

joanise · 2024-02-12T21:50:13Z

When we merge #321, I will suggest we close this issue and open a separate one, tagged as an enhancement, for the remaining disambiguation work. To be honest, I don't see us ever doing it given the extent of restructuring that will be required.

roedoejet added the bug label Feb 7, 2024

roedoejet mentioned this issue Feb 7, 2024

feat(str): accept space+comb-cedilla or space+comb-comma as equiv to cedilla #318

Merged

joanise mentioned this issue Feb 12, 2024

fix(str): cedilla is now the default glottal stop character #321

Merged
