Name2CType data wrong for many Indic scripts? #146

deepestblue · 2020-01-12T23:05:41Z

I found this when trying to use Ruby Regexp on Tamil Unicode codepoint data.

irb(main):002:0> "\u0BAE\u0BC0\u0BA9\u0BCD\u0BA9".scan(/[[:alpha:]]+/).each { |s| puts s.dump }
"\u0BAE\u0BC0\u0BA9"
"\u0BA9"
=> ["மீன", "ன"]
irb(main):003:0>

Notice that both \u0BC0 and \u0BCD are combining vowel markers in the Mark, Nonspacing [Mn] character category, which should match the [:alpha:] class. But \u0BCD does not seem to match the class. Stack Overflow told me Ruby uses Onigmo under the hood, and I found the following excerpt in name2ctype.h in CR_Alpha, CR_Alnum, etc.

	0x0bca, 0x0bcc,
	0x0c01, 0x0c03,

Notice the missing 0x0bcd.

P.S. I found a number of other missing Indic codepoints as well in that file. If you agree this is a bug I can look in the file some more and do an audit. Thanks!


JoergWMittag · 2020-01-12T23:49:17Z

See Why do some Unicode combining markers (like \u0BCD) not match [:alpha:] in Ruby? on Stack Overflow for a discussion, partially reproduced below:

The two characters in question are:

U+0BC0 Tamil Vowel Sign II, with the following (relevant) properties:
- General Category: Nonspacing Mark
- Alphabetic: Yes
U+0BCD Tamil Sign Virama, with the following (relevant) properties:
- General Category: Nonspacing Mark
- Alphabetic: No

The Ruby documentation for the Regexp class does not explicitly spell out what [[:alpha:]] matches, but it does say that the POSIX bracket expressions match non-ASCII characters, and it gives [[:digit:]] as an example, saying it matches anything with the Unicode property Nd (Decimal Number).

While not explicitly documented, it makes sense to equate the Regexp POSIX bracket expression [[:alpha:]] with the Unicode property Alphabetic, which would mean that U+0BC0 matches and U+0BCD doesn't.
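The comparison above can be checked directly from Ruby. The following is a minimal sketch, not part of the original answer: the `MARKS` table and the `classify` helper are hypothetical names introduced here for illustration, and the script assumes a Ruby whose regexp engine accepts the `\p{Mn}` and `\p{Alphabetic}` property names.

```ruby
# Compare how Ruby's regexp engine classifies the two Tamil combining marks.
# MARKS and classify are illustrative names, not part of Ruby or Onigmo.
MARKS = {
  "U+0BC0 TAMIL VOWEL SIGN II" => "\u0BC0",
  "U+0BCD TAMIL SIGN VIRAMA"   => "\u0BCD",
}.freeze

# Returns which of the three character classes match the given character.
def classify(char)
  {
    nonspacing_mark: char.match?(/\p{Mn}/),         # General Category Mn
    alphabetic:      char.match?(/\p{Alphabetic}/), # Unicode Alphabetic property
    posix_alpha:     char.match?(/[[:alpha:]]/),    # POSIX bracket expression
  }
end

MARKS.each do |name, char|
  puts "#{name}: #{classify(char)}"
end
```

If [[:alpha:]] does follow the Alphabetic property, the `alphabetic` and `posix_alpha` entries should agree for each character, while `nonspacing_mark` is true for both.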

On the other hand, the documentation for Onigmo does explicitly specify the workings of [[:alpha:]]. In fact, it specifies it in two different places, and they contradict each other:

In doc/RE, it says that [[:alpha:]] matches Letter | Mark.
In doc/UnicodeProps.txt, it seems to imply that [[:alpha:]] matches Alphabetic.

So what seems to be going on is that the Unicode Consortium does not consider U+0BCD to be alphabetic, and therefore Onigmo and Ruby do not classify it as [[:alpha:]]. In that case, the Onigmo documentation in doc/RE is incorrect, and the Ruby documentation is imprecise.

deepestblue · 2020-01-12T23:59:58Z

Thanks, Joerg.

"While not explicitly documented, it makes sense to equate the Regexp POSIX bracket expression [[:alpha:]] with the Unicode property Alphabetic, which would mean that U+0BC0 matches and U+0BCD doesn't."

Given [[:digit:]] matches Unicode category Nd, for the sake of consistency I'd rather [[:alpha:]] match the union of Unicode category Letter and Unicode category Mark, rather than Unicode property Alphabetic.
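For anyone who needs the Letter-or-Mark behaviour today, it can be spelled out explicitly instead of relying on [[:alpha:]]. This is only a sketch of that workaround, using the same Tamil string as the original report; the constant names are made up for illustration, and it says nothing about how Onigmo itself should define the class.

```ruby
# The Tamil string from the original report.
WORD = "\u0BAE\u0BC0\u0BA9\u0BCD\u0BA9"

# Runs matched by the POSIX bracket expression, as in the report.
POSIX_RUNS = WORD.scan(/[[:alpha:]]+/)

# Runs matched by an explicit union of General Category Letter and Mark.
LETTER_MARK_RUNS = WORD.scan(/[\p{L}\p{M}]+/)

puts POSIX_RUNS.map(&:dump).inspect
puts LETTER_MARK_RUNS.map(&:dump).inspect
```

With the explicit union, the virama U+0BCD no longer splits the word, because it is a Nonspacing Mark even though it is not Alphabetic.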
