Name2CType data wrong for many Indic scripts? #146

Open
deepestblue opened this issue Jan 12, 2020 · 2 comments

deepestblue commented Jan 12, 2020

I found this when trying to use Ruby Regexp on Tamil Unicode codepoint data.

irb(main):002:0> "\u0BAE\u0BC0\u0BA9\u0BCD\u0BA9".scan(/[[:alpha:]]+/).each { |s| puts s.dump }
"\u0BAE\u0BC0\u0BA9"
"\u0BA9"
=> ["மீன", "ன"]
irb(main):003:0>

Notice that both \u0BC0 and \u0BCD are combining marks in the Mark, Nonspacing [Mn] character category (a vowel sign and the virama, respectively), which should match the [:alpha:] class. But \u0BCD does not seem to match the class. Stack Overflow told me Ruby uses Onigmo under the hood, and I found the following excerpt in name2ctype.h in CR_Alpha, CR_Alnum, etc.

	0x0bca, 0x0bcc,
	0x0c01, 0x0c03,

Notice the missing 0x0bcd.

P.S. I found a number of other missing Indic codepoints as well in that file. If you agree this is a bug I can look in the file some more and do an audit. Thanks!
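An audit like the one offered above could be sketched in Ruby itself rather than by reading name2ctype.h by hand. The helper name `non_alpha_letters_and_marks` and the constant `TAMIL_BLOCK` are made up for this illustration, and the sketch only reports what the running Ruby's regexp engine does; it does not inspect Onigmo's tables directly.

```ruby
# Hypothetical audit helper (not part of Ruby or Onigmo): list the code
# points in a range that are in general category Letter or Mark but are
# NOT matched by the POSIX bracket expression [[:alpha:]].
def non_alpha_letters_and_marks(range)
  range.select do |cp|
    ch = [cp].pack("U") # build a UTF-8 string from the code point
    ch.match?(/[\p{L}\p{M}]/) && !ch.match?(/[[:alpha:]]/)
  end
end

# The Tamil block; swap in another block's range to audit other scripts.
TAMIL_BLOCK = 0x0B80..0x0BFF

non_alpha_letters_and_marks(TAMIL_BLOCK).each do |cp|
  puts format("U+%04X", cp)
end
```

Running the same helper over other Indic blocks would give a per-script list of candidates to discuss.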

@JoergWMittag

See Why do some Unicode combining markers (like \u0BCD) not match [:alpha:] in Ruby? on Stack Overflow for a discussion, partially reproduced below:

The two characters in question are (I have marked some interesting things in bold):

The Ruby documentation for the Regexp class does not explicitly spell out what [[:alpha:]] matches, but it does say that the POSIX bracket expressions match non-ASCII characters, and it gives [[:digit:]] as an example, saying it matches anything with the Unicode property Nd (Decimal Number).

While not explicitly documented, it makes sense to equate the Regexp POSIX bracket expression [[:alpha:]] with the Unicode property Alphabetic, which would mean that U+0BC0 matches and U+0BCD doesn't.

On the other hand, the documentation for Onigmo does explicitly specify the workings of [[:alpha:]]. In fact, it specifies it in two different places, and they contradict each other:

So what seems to be going on is that the Unicode Consortium does not consider U+0BCD to be alphabetic, and therefore Onigmo and Ruby do not classify it as [[:alpha:]]. In that case, the Onigmo documentation is incorrect, and the Ruby documentation is imprecise.
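One way to check this reading empirically is to compare [[:alpha:]] with Ruby's `\p{Alpha}` property over the Tamil block. This is only a sketch: the helper `matching_codepoints` is invented for the example, and agreement over one block is evidence for the equivalence, not proof of how Onigmo defines the class.

```ruby
# Hypothetical helper: the code points in a range whose one-character
# string matches the given pattern.
def matching_codepoints(range, pattern)
  range.select { |cp| [cp].pack("U").match?(pattern) }
end

tamil = 0x0B80..0x0BFF
posix = matching_codepoints(tamil, /[[:alpha:]]/)
prop  = matching_codepoints(tamil, /\p{Alpha}/)

puts "same set:       #{posix == prop}"
puts "U+0BC0 matches: #{posix.include?(0x0BC0)}"
puts "U+0BCD matches: #{posix.include?(0x0BCD)}"
```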

@deepestblue

Thanks, Joerg.

> While not explicitly documented, it makes sense to equate the Regexp POSIX bracket expression [[:alpha:]] with the Unicode property Alphabetic, which would mean that U+0BC0 matches and U+0BCD doesn't.

Given that [[:digit:]] matches Unicode category Nd, for the sake of consistency I'd rather [[:alpha:]] matched the union of the Unicode categories Letter and Mark, rather than the Unicode property Alphabetic.
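In the meantime, the proposed behaviour can be approximated in user code by spelling out the union explicitly. This is a workaround sketch, not a change to Onigmo; the constant `LETTER_OR_MARK` and the helper `scan_words` are names chosen for this example.

```ruby
# Union of general categories Letter and Mark, written out explicitly
# instead of relying on [[:alpha:]].
LETTER_OR_MARK = /[\p{L}\p{M}]+/

# Hypothetical helper: split text into runs matching the given pattern.
def scan_words(text, pattern)
  text.scan(pattern)
end

word = "\u0BAE\u0BC0\u0BA9\u0BCD\u0BA9"

# [[:alpha:]] breaks the word at U+0BCD, as in the original report.
p scan_words(word, /[[:alpha:]]+/).map(&:dump)

# The explicit union keeps the word in one piece.
p scan_words(word, LETTER_OR_MARK).map(&:dump)
```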
