Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extend utf8 to 31bits #111

Merged
merged 3 commits into from
Jan 26, 2019
Merged

extend utf8 to 31bits #111

merged 3 commits into from
Jan 26, 2019

Conversation

omochi
Copy link
Contributor

@omochi omochi commented Jan 22, 2019

This extends utf8 to 31bits.

Explanation of motivation is here #110

I implemented logic with same manner on original 21 bits logic.

I add test program for testing utf-8 codec.

It can execute following steps. (at least in my mac environment)

$ aclocal
$ automake -ac
$ ./configure
$ make
$ make test_enc_utf8
$ ./test_enc_utf8

31bits mode is enabled by USE_UTF8_31BITS flag in utf8.c.
To ease testing this PR, I enabled it.
If this is accepted, I amend commit to disable it.

USE_UTF8_31BITS flag is also in test_enc_utf8.c.
We need to keep that these two flag have same.

Finally, I attached my memo to implement decoding table.

5 bytes
            111110yy 10yyyxxx 10xxxxxx 10xxxxxx 10xxxxxx 
min            
U+200000          00   001000   000000   000000   000000  26 bit
                  F8       88       80       80       80 
max
U+3FFFFFF         11   111111   111111   111111   111111  
                  FB       BF       BF       BF       BF 

6bytes
            1111110y 10yyyyxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
min            
U+4000000          0   000100   000000   000000   000000   000000  31 bit
                  FC       84       80       80       80       80
max
U+7FFFFFF          1   111111   111111   111111   111111   111111
                  FD       BF       BF       BF       BF       BF

サロゲートペア
U+D800 - U+DBFF, U+DC00 - U+DFFFをそのままコードポイントとして
UTF-8規則でエンコードすると
ED A0 80 - ED AF BF, ED B0 80 - ED BF BF



21bits(original)

S0: 最初の1バイト
      1バイト文字を見た → ACCEPT
      2バイト文字を見た → S1
      E0を見た → S2
      E1-EF(!ED) → S3
      ED → S4
      F0を見た → S5
      F1-F3を見た → S6
      F4を見た → S7
S1: 通常の後続バイトかどうか判定(1) 80-BF → ACCEPT
S2: 3バイト文字の2文字目だが、
    1バイト目のyyyyが000だったので、
    2バイト目のyの0が禁止。 A0-BF → S1
S3: 通常の後続バイトかどうか判定(2) 80-BF → S1
S4: A0-BFを見たらサロゲートペアなので、
    80-9Fだけ許可 → S1
S5: 4バイト文字の2文字目だが、
    1バイト目のyyyが000だったので、
    2バイト目のyyの00が禁止。90-BF → S3
S6: 通常の後続バイトかどうか判定(3) 80-BF → S3
S7: 1バイト目のyyyが100なので、
    U+10FFFF以下にするため、
    2バイト目のyyは00になる。80-8F → S3

31bits

 S0: 最初の1バイト
       1バイト文字を見た → ACCEPT
       2バイト文字を見た → S1
       E0を見た → S2
       E1-EF(!ED) → S3
       ED → S4
       F0を見た → S5
       F1-F7を見た → S6
       F8を見た → S8
       F9-FBを見た → S9
       FCを見た → S10
       FDを見た → S11
 S1: 通常の後続バイトかどうか判定(1) 80-BF → ACCEPT
 S2: 3バイト文字の2文字目だが、
     1バイト目のyyyyが000だったので、
     2バイト目のyの0が禁止。 A0-BF → S1
 S3: 通常の後続バイトかどうか判定(2) 80-BF → S1
 S4: A0-BFを見たらサロゲートペアなので、
     80-9Fだけ許可 → S1
 S5: 4バイト文字の2文字目だが、
     1バイト目のyyyが000だったので、
     2バイト目のyyの00が禁止。90-BF → S3
 S6: 通常の後続バイトかどうか判定(3) 80-BF → S3
 S8: 5バイト文字の2文字目だが、
     1バイト目のyyが00だったので、
     2バイト目のyyyの000が禁止。88-BF → S6
 S9: 通常の後続バイトかどうか判定(4) 80-BF → S6
S10: 6バイト文字の2文字目だが、
     1バイト目のyが0だったので、
     2バイト目のyyyyの0000が禁止。84-BF → S9
S11: 通常の後続バイトかどうか判定(5) 80-BF → S9

@codecov-io
Copy link

codecov-io commented Jan 22, 2019

Codecov Report

Merging #111 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #111   +/-   ##
=======================================
  Coverage   82.59%   82.59%           
=======================================
  Files          62       62           
  Lines       12442    12442           
=======================================
  Hits        10276    10276           
  Misses       2166     2166
Impacted Files Coverage Δ
enc/utf_8.c 87.17% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7472d58...8146a4d. Read the comment docs.

@coveralls
Copy link

coveralls commented Jan 22, 2019

Coverage Status

Coverage remained the same at 80.529% when pulling 8146a4d on omochi:utf8-31bits into 7472d58 on k-takata:master.

@k-takata
Copy link
Owner

It seems that the modified codes are not tested on CI according to the result of Codecov.
Maybe L90-L96 in Makefile.am should be updated.
(Hmm, you are not including the change of Makefile.in, so codecov might not check the coverage?)
(I'm thinking whether I should delete configure, Makefile.in and other automake generated files from the repository, and execute automake in CI.)

USE_UTF8_31BITS should be disabled before merging.

@omochi
Copy link
Contributor Author

omochi commented Jan 23, 2019

I added execution of my new test to make test.
And I committed Makefile.in, aclocal.m4, config.h.in, configure, sample/Makefile.in.
But they contains large diffs.

@k-takata
Copy link
Owner

Thank you. Now I can check the coverage.
It seems that L416-L417 in utf_8.c is not covered: https://codecov.io/gh/k-takata/Onigmo/compare/27fa4fbbcb064c5e18870081fbd652461db34239...c0ad88f52d632e7760af430bc2c25c2cd989ccf7/diff#D8-412...418
But it is acceptable. (Of course, it's nice if you could add tests for that.)

Could you revert the changes to the generated files (Makefile.in, aclocal.m4, config.h.in, configure, sample/Makefile.in) and comment out the #define USE_UTF8_31BITS lines with /* */?
I will merge this after that. (Sorry for the inconvenience.)

@omochi
Copy link
Contributor Author

omochi commented Jan 25, 2019

I pushed 3 commits.

@k-takata
Copy link
Owner

Thank you for increasing the test coverage.

  • remove build files

Could you revert the build files (to the same state as the master branch) instead of removing?
I'm going to remove them in another PR: #115.
Just removing them is not enough and it breaks the CI.

@omochi
Copy link
Contributor Author

omochi commented Jan 26, 2019

Sorry my misunderstand.
I rebased these commits and built ideal commits.

@k-takata k-takata merged commit 2ee1e18 into k-takata:master Jan 26, 2019
@k-takata
Copy link
Owner

Thank you.

@omochi
Copy link
Contributor Author

omochi commented Jan 26, 2019

Thanks to merge!

@omochi omochi deleted the utf8-31bits branch January 26, 2019 03:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants