extend utf8 to 31bits #111

omochi · 2019-01-22T15:30:54Z

This extends UTF-8 to 31 bits.

The motivation is explained in #110.

I implemented the logic in the same manner as the original 21-bit logic.

I added a test program for the UTF-8 codec.

It can be run with the following steps (at least in my Mac environment):

$ aclocal
$ automake -ac
$ ./configure
$ make
$ make test_enc_utf8
$ ./test_enc_utf8

31-bit mode is enabled by the USE_UTF8_31BITS flag in utf_8.c.
To make this PR easier to test, I have enabled it.
If this is accepted, I will amend the commit to disable it.

The USE_UTF8_31BITS flag is also in test_enc_utf8.c.
The two flags must be kept in the same state.

Finally, I have attached my notes on implementing the decoding table.

5 bytes
            111110yy 10yyyxxx 10xxxxxx 10xxxxxx 10xxxxxx 
min            
U+200000          00   001000   000000   000000   000000  26 bit
                  F8       88       80       80       80 
max
U+3FFFFFF         11   111111   111111   111111   111111  
                  FB       BF       BF       BF       BF 

6 bytes
            1111110y 10yyyyxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
min            
U+4000000          0   000100   000000   000000   000000   000000  31 bit
                  FC       84       80       80       80       80
max
U+7FFFFFFF         1   111111   111111   111111   111111   111111
                  FD       BF       BF       BF       BF       BF
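The bit layouts above can be checked with a small encoder. This is only a sketch: `utf8_31_encode` is a hypothetical helper written for illustration, not a function from this PR or from Onigmo, and it does not reject surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the PR): encode a code point of up to
 * 31 bits with the 1..6-byte UTF-8 scheme. Returns the number of bytes
 * written to out, or 0 if cp does not fit in 31 bits. */
static size_t utf8_31_encode(uint32_t cp, unsigned char out[6])
{
    size_t len;
    unsigned char lead;

    assert(out != NULL);
    if (cp < 0x80u) { out[0] = (unsigned char)cp; return 1; }
    else if (cp < 0x800u)        { len = 2; lead = 0xC0; }
    else if (cp < 0x10000u)      { len = 3; lead = 0xE0; }
    else if (cp < 0x200000u)     { len = 4; lead = 0xF0; }
    else if (cp < 0x4000000u)    { len = 5; lead = 0xF8; } /* 111110yy */
    else if (cp <= 0x7FFFFFFFu)  { len = 6; lead = 0xFC; } /* 1111110y */
    else return 0;

    /* Fill the continuation bytes (10xxxxxx) from the end, six bits each. */
    for (size_t i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80u | (cp & 0x3Fu));
        cp >>= 6;
    }
    /* The remaining high bits go into the lead byte. */
    out[0] = (unsigned char)(lead | cp);
    return len;
}
```

Encoding the four boundary values in the tables (U+200000, U+3FFFFFF, U+4000000, U+7FFFFFFF) with this helper should reproduce the byte rows listed under each of them.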

Surrogate pairs
If U+D800 - U+DBFF and U+DC00 - U+DFFF are encoded directly as code points
using the UTF-8 rules, they become
ED A0 80 - ED AF BF, ED B0 80 - ED BF BF
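The byte ranges above follow from the plain 3-byte rule. As a sketch (both function names are hypothetical and exist only for this illustration), the first helper applies the rule without any surrogate check, and the second shows the test a decoder can use to reject such sequences:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: apply the 3-byte rule
 * (1110yyyy 10yxxxxx 10xxxxxx) to a value in U+0800..U+FFFF
 * without rejecting surrogates. */
static void encode_3byte(uint32_t cp, unsigned char out[3])
{
    assert(cp >= 0x800u && cp <= 0xFFFFu);
    out[0] = (unsigned char)(0xE0u | (cp >> 12));
    out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
}

/* A 3-byte sequence encodes a surrogate exactly when the lead byte is ED
 * and the second byte is in A0..BF. */
static int starts_surrogate(unsigned char b0, unsigned char b1)
{
    return b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF;
}
```

This is why the decoding table only needs a dedicated state for the ED lead byte: restricting its second byte to 80-9F excludes the whole surrogate range.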



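21 bits (original)

S0: first byte
      saw a 1-byte character → ACCEPT
      saw a 2-byte lead byte → S1
      saw E0 → S2
      E1-EF(!ED) → S3
      ED → S4
      saw F0 → S5
      saw F1-F3 → S6
      saw F4 → S7
S1: check for an ordinary continuation byte (1) 80-BF → ACCEPT
S2: second byte of a 3-byte character, but
    yyyy in the first byte was 0000, so
    y = 0 in the second byte is forbidden. A0-BF → S1
S3: check for an ordinary continuation byte (2) 80-BF → S1
S4: A0-BF here would be a surrogate, so
    only 80-9F is allowed → S1
S5: second byte of a 4-byte character, but
    yyy in the first byte was 000, so
    yy = 00 in the second byte is forbidden. 90-BF → S3
S6: check for an ordinary continuation byte (3) 80-BF → S3
S7: yyy in the first byte is 100, so
    to stay at or below U+10FFFF,
    yy in the second byte must be 00. 80-8F → S3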
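The 21-bit states can be written out as a small switch-based validator. This is a sketch of the notes only, not the table in utf_8.c: the names are hypothetical, ACCEPT is modeled as a return to S0, and treating only C2-DF as 2-byte lead bytes (rejecting the overlong C0 and C1) is an assumption that the notes do not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* State names follow the notes; ACCEPT is modeled as a return to S0. */
enum state { S0, S1, S2, S3, S4, S5, S6, S7, FAIL };

static int in_range(unsigned char c, unsigned char lo, unsigned char hi)
{
    return c >= lo && c <= hi;
}

/* One transition of the 21-bit machine. */
static enum state step21(enum state s, unsigned char c)
{
    switch (s) {
    case S0:
        if (c <= 0x7F)               return S0; /* 1-byte character */
        if (in_range(c, 0xC2, 0xDF)) return S1; /* assumption: C0/C1 rejected */
        if (c == 0xE0)               return S2;
        if (c == 0xED)               return S4;
        if (in_range(c, 0xE1, 0xEF)) return S3;
        if (c == 0xF0)               return S5;
        if (in_range(c, 0xF1, 0xF3)) return S6;
        if (c == 0xF4)               return S7;
        return FAIL;
    case S1: return in_range(c, 0x80, 0xBF) ? S0 : FAIL;
    case S2: return in_range(c, 0xA0, 0xBF) ? S1 : FAIL; /* no overlong 3-byte */
    case S3: return in_range(c, 0x80, 0xBF) ? S1 : FAIL;
    case S4: return in_range(c, 0x80, 0x9F) ? S1 : FAIL; /* no surrogates */
    case S5: return in_range(c, 0x90, 0xBF) ? S3 : FAIL; /* no overlong 4-byte */
    case S6: return in_range(c, 0x80, 0xBF) ? S3 : FAIL;
    case S7: return in_range(c, 0x80, 0x8F) ? S3 : FAIL; /* at most U+10FFFF */
    default: return FAIL;
    }
}

/* Returns 1 if the n bytes at s form complete, valid sequences. */
static int utf8_21_valid(const unsigned char *s, size_t n)
{
    enum state st = S0;
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n && st != FAIL; i++)
        st = step21(st, s[i]);
    return st == S0;
}
```

Input that ends in any state other than S0 is treated as truncated and therefore invalid.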

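31 bits

 S0: first byte
       saw a 1-byte character → ACCEPT
       saw a 2-byte lead byte → S1
       saw E0 → S2
       E1-EF(!ED) → S3
       ED → S4
       saw F0 → S5
       saw F1-F7 → S6
       saw F8 → S8
       saw F9-FB → S9
       saw FC → S10
       saw FD → S11
 S1: check for an ordinary continuation byte (1) 80-BF → ACCEPT
 S2: second byte of a 3-byte character, but
     yyyy in the first byte was 0000, so
     y = 0 in the second byte is forbidden. A0-BF → S1
 S3: check for an ordinary continuation byte (2) 80-BF → S1
 S4: A0-BF here would be a surrogate, so
     only 80-9F is allowed → S1
 S5: second byte of a 4-byte character, but
     yyy in the first byte was 000, so
     yy = 00 in the second byte is forbidden. 90-BF → S3
 S6: check for an ordinary continuation byte (3) 80-BF → S3
 S8: second byte of a 5-byte character, but
     yy in the first byte was 00, so
     yyy = 000 in the second byte is forbidden. 88-BF → S6
 S9: check for an ordinary continuation byte (4) 80-BF → S6
S10: second byte of a 6-byte character, but
     y in the first byte was 0, so
     yyyy = 0000 in the second byte is forbidden. 84-BF → S9
S11: check for an ordinary continuation byte (5) 80-BF → S9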
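Another way to read the 31-bit states is that the lead byte fixes the sequence length and the range allowed for the second byte, and every later byte must be 80-BF. The sketch below uses that reformulation instead of an explicit state table, so it is not how utf_8.c is implemented. The names are hypothetical, and rejecting C0 and C1 as 2-byte lead bytes is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length and permitted second-byte range for a lead byte. */
struct lead_info { size_t len; unsigned char lo, hi; };

static struct lead_info classify31(unsigned char c)
{
    struct lead_info r = { 0, 0x80, 0xBF };  /* len 0 means invalid lead */
    if (c <= 0x7F) {
        r.len = 1;
    } else if (c >= 0xC2 && c <= 0xDF) {     /* assumption: C0/C1 rejected */
        r.len = 2;
    } else if (c >= 0xE0 && c <= 0xEF) {
        r.len = 3;
        if (c == 0xE0) r.lo = 0xA0;          /* S2: no overlong form */
        if (c == 0xED) r.hi = 0x9F;          /* S4: no surrogates */
    } else if (c >= 0xF0 && c <= 0xF7) {
        r.len = 4;
        if (c == 0xF0) r.lo = 0x90;          /* S5 */
    } else if (c >= 0xF8 && c <= 0xFB) {
        r.len = 5;
        if (c == 0xF8) r.lo = 0x88;          /* S8 */
    } else if (c >= 0xFC && c <= 0xFD) {
        r.len = 6;
        if (c == 0xFC) r.lo = 0x84;          /* S10 */
    }
    return r;
}

/* Returns 1 if the n bytes at s are valid under the 31-bit rules. */
static int utf8_31_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;
    assert(s != NULL || n == 0);
    while (i < n) {
        struct lead_info li = classify31(s[i]);
        if (li.len == 0 || li.len > n - i)
            return 0;                        /* bad lead or truncated */
        for (size_t k = 1; k < li.len; k++) {
            unsigned char lo = (k == 1) ? li.lo : 0x80;
            unsigned char hi = (k == 1) ? li.hi : 0xBF;
            if (s[i + k] < lo || s[i + k] > hi)
                return 0;
        }
        i += li.len;
    }
    return 1;
}
```

Note that F4 90 80 80, which the 21-bit table rejects through S7, is accepted here because F1-F7 all lead to S6 in 31-bit mode.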

codecov-io · 2019-01-22T15:33:36Z

Codecov Report

Merging #111 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #111   +/-   ##
=======================================
  Coverage   82.59%   82.59%           
=======================================
  Files          62       62           
  Lines       12442    12442           
=======================================
  Hits        10276    10276           
  Misses       2166     2166

Impacted Files	Coverage Δ
enc/utf_8.c	`87.17% <ø> (ø)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7472d58...8146a4d. Read the comment docs.

coveralls · 2019-01-22T15:35:13Z

Coverage remained the same at 80.529% when pulling 8146a4d on omochi:utf8-31bits into 7472d58 on k-takata:master.

k-takata · 2019-01-23T01:04:57Z

According to the Codecov result, the modified code does not seem to be tested on CI.
Maybe L90-L96 in Makefile.am should be updated.
(Hmm, you are not including the changes to Makefile.in, so Codecov might not be checking the coverage?)
(I'm wondering whether I should delete configure, Makefile.in and the other automake-generated files from the repository, and run automake in CI instead.)

USE_UTF8_31BITS should be disabled before merging.

omochi · 2019-01-23T08:51:58Z

I added my new test to make test.
I also committed Makefile.in, aclocal.m4, config.h.in, configure, and sample/Makefile.in,
but they contain large diffs.

k-takata · 2019-01-24T23:59:31Z

Thank you. Now I can check the coverage.
It seems that L416-L417 in utf_8.c are not covered: https://codecov.io/gh/k-takata/Onigmo/compare/27fa4fbbcb064c5e18870081fbd652461db34239...c0ad88f52d632e7760af430bc2c25c2cd989ccf7/diff#D8-412...418
But it is acceptable. (Of course, it's nice if you could add tests for that.)

Could you revert the changes to the generated files (Makefile.in, aclocal.m4, config.h.in, configure, sample/Makefile.in) and comment out the #define USE_UTF8_31BITS lines with /* */?
I will merge this after that. (Sorry for the inconvenience.)

omochi · 2019-01-25T17:24:36Z

I pushed 3 commits:

1. Add test code to improve coverage.
   I now get 100% for utf_8.c. https://codecov.io/gh/k-takata/Onigmo/compare/27fa4fbbcb064c5e18870081fbd652461db34239...f28a24e5ad2624a53709d69166e660ff217f75b3/changes
2. Remove build files.
3. Disable USE_UTF8_31BITS in utf_8.c and test_enc_utf8.c.

k-takata · 2019-01-25T23:18:22Z

Thank you for increasing the test coverage.

> remove build files

Could you revert the build files (to the same state as the master branch) instead of removing?
I'm going to remove them in another PR: #115.
Just removing them is not enough and it breaks the CI.

omochi · 2019-01-26T03:40:26Z

Sorry for my misunderstanding.
I rebased these commits and rebuilt them into clean commits.

k-takata · 2019-01-26T03:51:54Z

Thank you.

omochi · 2019-01-26T03:55:38Z

Thanks for merging!

omochi force-pushed the utf8-31bits branch from a4d32d2 to b133e3b on January 26, 2019 03:34

omochi added 3 commits January 26, 2019 12:37

extend utf8 to 31bits

efd19d4

increase test coverage

b0dc621

disable 31bits flag

8146a4d

omochi force-pushed the utf8-31bits branch from b133e3b to 8146a4d on January 26, 2019 03:37

k-takata merged commit 2ee1e18 into k-takata:master Jan 26, 2019

omochi deleted the utf8-31bits branch January 26, 2019 03:55
