AVRO-1517: [Perl] Encode UTF-8 strings as bytes #2979

Merged: 1 commit, Jun 26, 2024

jjatria (Contributor) commented Jun 26, 2024

From John Karp's original description of the issue at https://issues.apache.org/jira/browse/AVRO-1517:

By default in Perl, a string is a sequence of bytes, values 0-255.
However, if a Unicode character is included that cannot be represented
with a single byte, the string gets 'upgraded' to a non-byte-based
Unicode string allowing ordinals outside that range. When string
operations are done with byte and non-byte Unicode strings, the result
is always non-byte, with the byte string first 'upgraded'. Upgrading
consists of utf8 encoding and setting a utf8 flag on the string. ('utf8'
is a variant of UTF-8 used by Perl)

The Perl Avro API is accepting these Unicode strings as-is for the
'bytes' type. This is a problem because

  1. values >255 are not valid as bytes, and any encoding is their job

  2. As Avro assembles the serialized data, Perl 'upgrades' all the data,
    having the effect of utf8 encoding our serialized binary data.

The correct behavior is for the Avro Perl API to attempt to downgrade
the string and, if this fails because it contains values >255, to
raise an error. (The behavior of 'string' won't change; it will still
take Unicode strings as expected.)
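
The upgrade and downgrade behaviour described above can be reproduced with Perl's built-in `utf8::` functions. This is a minimal sketch for illustration only, not the code from this change; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $payload = "\xE9\xFF";      # two raw bytes, UTF8 flag off
my $wide    = "\x{263A}";      # one character above 255, UTF8 flag on

# Problem 2: concatenating with a Unicode string upgrades the whole buffer.
my $buffer = $payload . $wide;
print utf8::is_utf8($buffer) ? "buffer upgraded\n" : "buffer is bytes\n";

# Once encoded for output, each payload byte above 127 takes two octets.
my $octets = $buffer;
utf8::encode($octets);         # in-place: characters -> UTF-8 octets
print length($octets), "\n";   # 2 + 2 for the payload, 3 for U+263A

# The fix: downgrade before serializing. A true second argument (FAIL_OK)
# makes utf8::downgrade return false instead of dying.
my $upgraded = $payload;
utf8::upgrade($upgraded);      # same characters, internal utf8 encoding
print utf8::downgrade($upgraded, 1) ? "downgraded\n" : "failed\n";
print utf8::downgrade($wide, 1)     ? "downgraded\n" : "failed\n";
```

Run as written, this prints `buffer upgraded`, `7`, `downgraded` and `failed`: the upgraded byte string downgrades cleanly, while the string containing U+263A cannot be downgraded and would be the case where an error is raised.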

This change, based on the one submitted for that ticket, adds these behaviours and tests to exercise them.

Verifying this change

This change extends t/01_schema.t with tests for encoding different kinds of strings (including upgraded ones) as types 'bytes' and 'fixed'.
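
As a rough illustration of the string kinds such tests need to cover, here is a sketch using `Test::More`. The `as_bytes` helper is a hypothetical stand-in for the encoder's check; it is not the Avro API, and these are not the tests added to t/01_schema.t.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical stand-in for the encoder's check; not part of the Avro API.
sub as_bytes {
    my ($value) = @_;                 # copy, so the caller's scalar is untouched
    utf8::downgrade($value, 1)        # 1 = FAIL_OK: return false rather than die
        or die "value has code points above 255\n";
    return $value;
}

my $plain    = "\x00\xE9\xFF";        # byte string, UTF8 flag off
my $upgraded = $plain;
utf8::upgrade($upgraded);             # same characters, UTF8 flag on
my $wide     = "caf\x{E9} \x{263A}";  # contains a character above 255

is( as_bytes($plain), $plain, 'byte string passes through unchanged' );
ok( !utf8::is_utf8( as_bytes($upgraded) ), 'upgraded string is downgraded' );
is( length( as_bytes($upgraded) ), 3, 'downgraded string keeps its length' );

my $err = eval { as_bytes($wide); 1 } ? '' : $@;
like( $err, qr/above 255/, 'string with wide characters is rejected' );

done_testing;
```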

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Change was documented in the change log.

@github-actions github-actions bot added the Perl label Jun 26, 2024
@martin-g martin-g merged commit 677e982 into apache:main Jun 26, 2024
6 checks passed
@jjatria jjatria deleted the avro-1517-unicode-strings branch June 26, 2024 15:15