Improve the tokenizer to disambiguate between RE specials and specials used as letters #192

joanise · 2022-10-04T19:54:07Z

PR #190 patches one narrow problem in the tokenizer: it makes it handle alternations correctly, but the fix is not general. For example, `^` should be stripped from rules, unless it's a letter in the language (see #190 (comment)).

A better solution would probably be to use a character inventory for the language, one that enumerates all characters that are not considered letters according to the Unicode standard but that are actually used as letters in the language, instead of parsing the input field of each g2p rule.
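To make the inventory idea concrete, here is a minimal sketch, not the actual g2p tokenizer. The names `is_letter`, `tokenize` and `extra_letters` are hypothetical, and treating Unicode categories L, M and N as word characters is an assumption made for illustration only.

```python
import unicodedata


def is_letter(ch, extra_letters=frozenset()):
    """True if ch is a word character by Unicode category (L, M or N:
    an assumption for this sketch) or is listed in the language's
    explicit inventory of extra letters."""
    return ch in extra_letters or unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text, extra_letters=frozenset()):
    """Split text into ("word", str) and ("other", char) tokens.

    extra_letters is the hypothetical per-language inventory: characters
    that Unicode does not class as letters but that the language uses as
    letters. No rule inputs are parsed to discover them.
    """
    tokens = []
    word = []
    for ch in text:
        if is_letter(ch, extra_letters):
            word.append(ch)
        else:
            if word:
                tokens.append(("word", "".join(word)))
                word = []
            tokens.append(("other", ch))
    if word:
        tokens.append(("word", "".join(word)))
    return tokens


# Without an inventory, "^" splits the word.
print(tokenize("a^b c"))
# With "^" declared as a letter of the language, "a^b" stays whole.
print(tokenize("a^b c", extra_letters=frozenset("^")))
```

With an explicit inventory, the tokenizer never has to guess whether a special character in a rule is RE syntax or a letter: only characters the language declares are kept as letters.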
