Support Unicode script extensions #155

Open
747 opened this issue Jun 13, 2021 · 0 comments

747 commented Jun 13, 2021

Is there any plan to support the script extensions (scx) property, which allows a character to belong to more than one script?
It has been available in many dynamic languages such as Perl, PHP, Python, and (recently) JavaScript, and would make script matching much more useful on real-world text.

For example, in JavaScript since ES2018:

// match by script (= Ruby /[\p{Hani}\p{Hira}\p{Kana}]+/)
"ア行〜タ行のデータ".match(/[\p{sc=Hani}\p{sc=Hira}\p{sc=Kana}]+/gu);
// => [ "ア行", "タ行のデ", "タ" ]

// match by script_extensions
"ア行〜タ行のデータ".match(/[\p{scx=Hani}\p{scx=Hira}\p{scx=Kana}]+/gu);
// => [ "ア行〜タ行のデータ" ]
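To make the difference concrete, here is a minimal sketch that lists the characters of the same sample string which match by script_extensions but not by script. The variable names are only for illustration, and the wave dash and the prolonged sound mark are written as escapes on the assumption that the sample uses U+301C and U+30FC.

```javascript
// Same sample string as above, with the two problematic characters escaped:
// U+301C WAVE DASH and U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK.
const text = "ア行\u301Cタ行のデ\u30FCタ";

// Single-character classes: one by Script, one by Script_Extensions.
const bySc = /[\p{sc=Hani}\p{sc=Hira}\p{sc=Kana}]/u;
const byScx = /[\p{scx=Hani}\p{scx=Hira}\p{scx=Kana}]/u;

// Keep the characters that only the scx-based class accepts.
const onlyScx = [...text].filter((ch) => byScx.test(ch) && !bySc.test(ch));

console.log(onlyScx.map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase()));
// => [ "U+301C", "U+30FC" ]
```

Both characters have the script value Common, which is why the sc-based match splits the string around them.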

It is not a silver bullet, given the complications of Unicode, but it would avoid most of the common pitfalls in Unicode script matching. Manually reproducing the equivalent of an scx property with only the plain script property often requires a non-trivial expression:

# implement \p{scx=Hira} equivalent
/[\p{Hira}、-〃〈-】〓-〟〰-〵〷〼〽\u3099-゜゠・ー﹅﹆。-・ー゙゚]/

Sorry if this has already been discussed somewhere; I couldn't find a relevant issue in this repository.
