AVRO-1517: [Perl] Encode UTF-8 strings as bytes #2979

jjatria · 2024-06-26T08:41:28Z

From John Karp's description of the original issue at https://issues.apache.org/jira/browse/AVRO-1517:

By default in Perl, a string is a sequence of bytes, values 0-255.
However, if a Unicode character is included that cannot be represented
with a single byte, the string gets 'upgraded' to a non-byte-based
Unicode string allowing ordinals outside that range. When string
operations are done with byte and non-byte Unicode strings, the result
is always non-byte, with the byte string first 'upgraded'. Upgrading
consists of utf8 encoding and setting a utf8 flag on the string. ('utf8'
is a variant of UTF-8 used by Perl)

The Perl Avro API is accepting these Unicode strings as-is for the
'bytes' type. This is a problem because

1. values >255 are not valid as bytes, and any encoding is their job

2. As Avro assembles the serialized data, Perl 'upgrades' all the data,
   having the effect of utf8 encoding our serialized binary data.

The correct behavior is for the Avro Perl API to attempt to downgrade
the string, and if this fails because it contained values >255 then to
raise an error. (The behavior of 'string' won't change, it will still
take Unicode strings as expected.)
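
The second problem described above can be reproduced without Avro at all. The following is a minimal sketch using only core Perl; the variable names and the sample bytes are made up for illustration and are not taken from the Avro code base. It shows four raw bytes growing to eight once they are joined with a wide-character string and written out as utf8.

```perl
use strict;
use warnings;

# Four raw bytes standing in for already-serialized binary data
# (illustrative values, not real Avro output).
my $binary = "\xDE\xAD\xBE\xEF";

# A character string holding U+263A, which does not fit in a single byte,
# so Perl stores it as utf8 internally and sets the utf8 flag.
my $text = "\x{263A}";

# Joining the two upgrades the byte string: the result is flagged as utf8.
my $joined = $binary . $text;

# Emulate what ends up on the wire when the internal utf8 form is written:
# every character above 0x7F becomes a multi-byte sequence.
my $wire = $joined;
utf8::encode($wire);

printf "binary bytes:   %d\n", length $binary;                    # 4
printf "joined flagged: %d\n", utf8::is_utf8($joined) ? 1 : 0;    # 1
printf "bytes on wire:  %d\n", length $wire;                      # 11 (8 + 3)
```

Each of the four original bytes is above 0x7F, so each one becomes two bytes, and U+263A contributes three more.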

This change, based on the one submitted for that ticket, adds these behaviours and tests to exercise them.
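
For readers unfamiliar with the downgrade approach, here is a rough sketch of the idea using the core `utf8::downgrade` function. The helper name `to_bytes_or_die` and the error message are hypothetical and are not the code in this pull request; the sketch only shows the "downgrade, or raise an error" pattern described above.

```perl
use strict;
use warnings;

# Hypothetical helper, not the actual Avro implementation: return a
# byte-string copy of $value, or die if it holds a character above 255.
sub to_bytes_or_die {
    my ($value) = @_;
    my $copy = $value;    # work on a copy so the caller's string is untouched

    # With a true second argument, utf8::downgrade returns false on
    # failure instead of dying, so we can raise our own error.
    utf8::downgrade( $copy, 1 )
        or die "Cannot encode value as bytes: contains characters above 255\n";

    return $copy;
}

# An upgraded string whose only character still fits in one byte.
my $upgraded = "\xE9";
utf8::upgrade($upgraded);
my $bytes = to_bytes_or_die($upgraded);
printf "flagged after downgrade: %d\n", utf8::is_utf8($bytes) ? 1 : 0;    # 0
printf "length in bytes:         %d\n", length $bytes;                   # 1

# A string with a character above 255 cannot be downgraded.
my $ok = eval { to_bytes_or_die("\x{263A}"); 1 };
print "wide string rejected\n" unless $ok;
```

Downgrading a copy keeps the caller's data unchanged, which seems the safer choice for a serializer, though the real change may handle this differently.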

Verifying this change

This change extends t/01_schema.t with tests for encoding different kinds of strings (including upgraded ones) as types 'bytes' and 'fixed'.

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Change was documented in the change log.

github-actions bot added the Perl label Jun 26, 2024

martin-g approved these changes Jun 26, 2024

martin-g merged commit 677e982 into apache:main Jun 26, 2024
6 checks passed

jjatria deleted the avro-1517-unicode-strings branch June 26, 2024 15:15
