ſ(U+017F) and K(U+212A) should not be case-insensitive equivalent to S and K #141

Open
dan42 opened this issue Aug 6, 2019 · 5 comments

dan42 commented Aug 6, 2019

The current behavior is certainly correct from some abstract Unicode-consortium perspective, but from a practical perspective, for programmers using regular expressions, it will usually produce an incorrect and inefficient result.

For example, I find the following cases problematic:

str.scan(/[a-z]/i)
Most programmers' notion of "lowercase and uppercase alphabet" does not include U+017F and U+212A.

str.scan(/\w/) != str.scan(/[a-z_\d]/i)
Most programmers would perceive these two regexes to be equivalent, except they are not.

str.scan(/<script/i)
Most programmers' notion of "html script tag" does not include <ſcript

etc.
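
One way to guard against cases like these today is to post-filter the matches so that only ASCII-only results survive. This is a minimal sketch, not something Ruby or Onigmo provides: `ascii_only_scan` is a hypothetical helper name, and it assumes the regexp has no capture groups (otherwise `String#scan` returns arrays rather than strings).

```ruby
# Hypothetical helper (not a Ruby or Onigmo API): run String#scan, then
# drop every match that contains a non-ASCII character such as
# U+017F (long s) or U+212A (Kelvin sign).
# Assumes `regexp` has no capture groups, so scan yields plain strings.
def ascii_only_scan(str, regexp)
  str.scan(regexp).select(&:ascii_only?)
end

# "aK" followed by long s and Kelvin sign, then two script-like tags.
text = "aK\u017F\u212A <\u017Fcript> <SCRIPT>"

p ascii_only_scan(text, /[a-z]/i)
p ascii_only_scan(text, /<script/i)
```

The filter runs after matching, so it does not remove the matching cost described in this issue; it only keeps the surprising matches out of the result.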

The only people who might want /[a-z]/i to match U+017F are people handling Fraktur and Gaelic languages, and in any case they would use /\p{LC}/ or such.

The only people who might want /[a-z]/i to match U+212A are ???

These two characters are very rare, so in > 99.9% of cases programmers will never encounter a problem. This makes the very tiny number of edge cases all the more tricky. And we must still pay the de-optimization penalty in all cases because these two characters are multibyte.

Related to this, I believe that /(?a:[a-z])/i (ascii subgroup) should not match U+017F and U+212A.
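
As a stopgap that does not depend on any mode flag, the class can be spelled out with both cases instead of using /i. This is only a sketch of a workaround, under the assumption that plain ASCII letters are what is wanted; the constant name `ASCII_LETTER` is made up for illustration.

```ruby
# Spell out both cases instead of relying on /i, so that no Unicode
# case-folding equivalents can be pulled into the class.
ASCII_LETTER = /[A-Za-z]/

long_s = "\u017F" # LATIN SMALL LETTER LONG S
kelvin = "\u212A" # KELVIN SIGN

p ASCII_LETTER.match?("s")    # true
p ASCII_LETTER.match?(long_s) # false
p ASCII_LETTER.match?(kelvin) # false
```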

Owner

k-takata commented Aug 8, 2019

This is the same behavior as perl:

$ perl -Mutf8 -e 'if ("ſ" =~ /(?a)s/i) {print "match"}'
match

Owner

k-takata commented Aug 9, 2019

Perl supports /(?aa)/i, but Onigmo doesn't support it (yet).

Author

dan42 commented Aug 10, 2019

Hmm, it looks like the onigmo/ruby default mode is like perl's ascii mode? In perl unicode mode /\d/ is the same as /\p{Digit}/ but /\d/a is like ruby /\d/, matching only [0-9]. I apologize, I don't know the delineation between onigmo and ruby very well. But it looks like in ruby, /\w/ behaves like /\w/u but differently from /(?u)\w/. So I think that ruby enables onigmo's ascii mode by default?

It looks like the only difference between modes "a" and "aa" is those two characters. I did a search for all case-insensitive equivalences, and those were the only ones where ascii and non-ascii characters were mixed (see below). If "aa" mode were implemented we could push for ruby to adopt it as the default mode. So please pretty please. m(_ _)m

["A", "a"]
["B", "b"]
["C", "c"]
["D", "d"]
["E", "e"]
["F", "f"]
["G", "g"]
["H", "h"]
["I", "i"]
["J", "j"]
["K", "k", "K"]
["L", "l"]
["M", "m"]
["N", "n"]
["O", "o"]
["P", "p"]
["Q", "q"]
["R", "r"]
["S", "s", "ſ"]
["T", "t"]
["U", "u"]
["V", "v"]
["W", "w"]
["X", "x"]
["Y", "y"]
["Z", "z"]
["µ", "Μ", "μ", "൜", "൵", "ർ", "ᵜ", "ᵵ", "ᵼ", "ⵜ", "\u2D75", "\u2D7C", "㵜", "㵵", "㵼", "䵜", "䵵", "䵼", "嵜", "嵵", "嵼", "浜", "浵", "浼", "絜", "絵", "絼", "赜", "赵", "赼", "鵜", "鵵", "鵼", "굜", "굵", "굼", "뵜", "뵵", "뵼", "최", "쵵", "쵼", "", "", "", "ﵜ", "ﵵ", "ﵼ"]
["À", "à"]
["Á", "á"]
["Â", "â"]
["Ã", "ã"]
["Ä", "ä"]
["Å", "å", "Å"]
["Æ", "æ"]
["Ç", "ç"]
etc, all non-ascii
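
A search like this can be approximated without the regex engine by using Unicode case folding directly. The sketch below is not the script that produced the list above; it assumes Ruby 2.4 or later (for `String#downcase(:fold)`) and only looks at BMP characters whose folding is a single ASCII letter.

```ruby
# Walk the non-ASCII part of the BMP (skipping surrogates) and collect
# every character whose Unicode case folding is a single ASCII letter.
mixed = (0x80..0xFFFF).each_with_object({}) do |cp, found|
  next if (0xD800..0xDFFF).cover?(cp) # surrogates are not valid characters
  ch = [cp].pack("U")
  folded = ch.downcase(:fold)
  found[ch] = folded if folded.match?(/\A[a-z]\z/)
end

mixed.each do |ch, folded|
  printf("U+%04X folds to %s\n", ch.ord, folded)
end
```

Multi-character foldings, such as the ligatures that fold to "st", are deliberately left out by the single-letter check.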

Author

dan42 commented Aug 10, 2019

This is more complicated than I expected... in ruby, /[[:alpha:]]/ behaves like /(?u)[[:alpha:]]/ but differently from /(?a)[[:alpha:]]/

So I have no idea what the defaults are anymore.

Author

dan42 commented Aug 26, 2019

After reading the code I finally managed to understand that ruby default mode (?d) is a mix between unicode mode (?u) and ascii mode (?a). If mode (?aa) was added to Onigmo it would be possible to switch ruby default mode to a mix of (?u) and (?aa) instead.

results for (all utf8 from U+0000 to U+FFFF).grep(regexp mode + expr).size

expr           (?d)   (?u)   (?a)  (?aa)  comment
\d               10    370     10
\w               63  50567     63
\s                5     24      5
[[:digit:]]     370    370     10
[[:word:]]    50561  50561     63
[[:alpha:]]   49655  49655     52
[[:blank:]]      18     18      2
[[:space:]]      24     24      5
[A-Za-z]         52     52     52
(?i)[a-z]        54     54     54     52  U+017F and U+212A
a\b               1      1      1
あ\b              1      1      0
st                0      0      0
(?i)st            2      2      2      0  ligatures U+FB05 and U+FB06
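
Counts like the ones above can be gathered with a few lines of Ruby. This is a reconstruction of the one-line recipe, not the exact script used: `BMP_CHARS` and `count_matches` are illustrative names, surrogates are skipped because they cannot be encoded as valid UTF-8, and the Unicode-mode numbers depend on the Unicode version shipped with the Ruby being used.

```ruby
# All BMP code points except surrogates, as one-character UTF-8 strings.
BMP_CHARS = (0x0000..0xFFFF)
  .reject { |cp| (0xD800..0xDFFF).cover?(cp) }
  .map { |cp| [cp].pack("U") }

# Count the characters matched by `expr` under an inline mode prefix
# such as "(?d)", "(?u)" or "(?a)" (an empty prefix means the default).
def count_matches(mode, expr)
  BMP_CHARS.grep(Regexp.new(mode + expr)).size
end

%w[(?d) (?u) (?a)].each do |mode|
  puts "#{mode} \\d => #{count_matches(mode, '\d')}"
end
```

The (?aa) column cannot be measured this way, since Onigmo does not accept that mode.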
