AVRO-1517: [Perl] Encode UTF-8 strings as bytes #2979

Merged
merged 1 commit into from
Jun 26, 2024

Commits on Jun 26, 2024

  1. AVRO-1517: [Perl] Encode UTF-8 strings as bytes

    From John Karp's original description of [the issue]:
    
    > By default in Perl, a string is a sequence of bytes, values 0-255.
    > However, if a Unicode character is included that cannot be represented
    > with a single byte, the string gets 'upgraded' to a non-byte-based
    > Unicode string allowing ordinals outside that range. When string
    > operations are done with byte and non-byte Unicode strings, the result
    > is always non-byte, with the byte string first 'upgraded'. Upgrading
    > consists of utf8 encoding and setting a utf8 flag on the string. ('utf8'
    > is a variant of UTF-8 used by Perl)
    >
    > The Perl Avro API is accepting these Unicode strings as-is for the
    > 'bytes' type. This is a problem because
    >
    >   1. values >255 are not valid as bytes, and any encoding is the caller's job
    >
    >   2. As Avro assembles the serialized data, Perl 'upgrades' all the data,
    >      having the effect of utf8 encoding our serialized binary data.
    >
    > The correct behavior is for the Avro Perl API to attempt to downgrade
    > the string, and if this fails because it contained values >255 then to
    > raise an error. (The behavior of 'string' won't change, it will still
    > take Unicode strings as expected.)
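
    The upgrade effect described in the quote can be reproduced in a few
    lines of plain Perl. This is only an illustrative sketch and does not
    use the Avro module; the variable names are made up for the example.

    ```perl
    use strict;
    use warnings;

    my $binary = "\xff\x00";      # two raw bytes, as serialized data might contain
    my $text   = "\x{263a}";      # one character with an ordinal above 255

    # Concatenating the two upgrades the byte string to a Unicode string.
    my $joined  = $binary . $text;
    my $flagged = utf8::is_utf8($joined) ? 1 : 0;   # 1: the utf8 flag is set
    my $chars   = length($joined);                  # 3 characters

    # If the result is later written out UTF-8 encoded, the byte 0xFF
    # becomes the two bytes 0xC3 0xBF, so the binary data is corrupted.
    my $wire = $joined;
    utf8::encode($wire);
    my $wire_len = length($wire);                   # 6 bytes, not 5
    ```

    The original `\xff` byte no longer appears on its own in `$wire`, which
    is the corruption the ticket describes for serialized binary data.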
    
    This change, based on the one submitted for that ticket, adds these
    behaviours and tests to exercise them.
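
    As a rough illustration of the downgrade-or-fail behaviour, the sketch
    below uses Perl's built-in `utf8::downgrade`. The `to_bytes` helper is
    hypothetical and is not the code added by this change; it only shows
    the general technique.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper, not the Avro module's actual code: returns a
    # byte-string copy of $value, or dies if any character is above 255.
    sub to_bytes {
        my ($value) = @_;
        my $bytes = $value;             # work on a copy of the caller's value
        utf8::downgrade($bytes, 1)      # true 2nd argument: return false instead of dying
            or die "bytes value contains a character above 255\n";
        return $bytes;
    }

    my $latin = "caf\x{e9}";            # every ordinal fits in one byte
    utf8::upgrade($latin);              # force the upgraded internal form
    my $ok = to_bytes($latin);          # downgrades cleanly

    my $wide = "snowman \x{2603}";      # U+2603 cannot be a single byte
    my $err  = eval { to_bytes($wide); 1 } ? '' : $@;
    ```

    Here `$ok` is a plain byte string of length 4 without the utf8 flag,
    while `$err` holds the error message raised for the wide character.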
    
    [the issue]: https://issues.apache.org/jira/browse/AVRO-1517
    jjatria committed Jun 26, 2024
    c24a0be