
Enhancement Request: Official Runes Library for Unicode Substring, Length, and String Manipulation #502

Open
Libresse opened this issue Aug 31, 2023 · 6 comments


@Libresse

Hi there,

I'd like to raise an issue we've run into while using Google's Go implementation of Starlark (starlark-go), related to how string processing is handled. It stems from the difference in how Starlark and Python treat strings.

In Python, all strings are Unicode, and operations like slicing or indexing take Unicode code points into account, rather than byte indices. On the flip side, Starlark treats all strings as ASCII, which can cause unexpected results when handling non-ASCII characters, especially those from non-Latin alphabets.

For instance, consider the Chinese string "你老公技术不错". In Python, a slice operation like s[:3] returns the first three characters, '你老公'. In Starlark, however, the same operation yields only "你". Worse, a slice like s[:2] falls in the middle of a character's encoding and returns the invalid byte sequence "\xe4\xbd".
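For reference, here is a minimal Starlark sketch of the behavior described above, assuming the UTF-8 representation used by starlark-go (each of these Chinese characters occupies 3 bytes):

s = "你老公技术不错"
print(s[:3])   # "你" (the first 3 bytes, i.e. a single character)
print(s[:2])   # "\xe4\xbd" (splits the encoding of "你")
print(len(s))  # 21 bytes; Python's len() reports 7 code points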

These scenarios point to a significant limitation in Starlark's handling of Unicode, which could affect a wide range of applications and users worldwide and limit the reach and potential of the language.

To address these issues, it would be worthwhile to consider introducing an official runes library (or similar feature) for Unicode substrings, string lengths, and broader string manipulation capabilities that accommodate non-ASCII character sets properly.

Such an enhancement would greatly improve Starlark’s accessibility and usability for global, multilingual users, and serve to reduce unexpected errors and inconsistencies in string processing in various languages.

Your attention to this matter and your help in improving Starlark would be greatly appreciated.

Best regards,

@Libresse
Author

#482

@adonovan
Collaborator

> Starlark treats all strings as ASCII

Not quite: Starlark strings are sequences of UTF-k codes (where k=8 in the Go implementation). In this respect Starlark-go behaves like Go, C, and C++ (and Rust, except that Rust disallows splitting a single rune's encoding, as in your s[:2] example).

What specific operations do you need? I would expect it is possible to implement many of them in pure Starlark using the various iterator methods on string (e.g. codepoints and codepoint_ords).
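For illustration, a code-point-aware slice and length could be written in pure Starlark along these lines (cp_slice and cp_len are hypothetical helper names; codepoints is the iterator method mentioned above):

def cp_slice(s, start, end):
    # Slice by code-point index rather than byte index.
    return "".join(list(s.codepoints())[start:end])

def cp_len(s):
    # Length in code points rather than bytes.
    return len(list(s.codepoints()))

print(cp_slice("你老公技术不错", 0, 3))  # "你老公"
print(cp_len("你老公技术不错"))          # 7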

@Libresse
Author

> Starlark treats all strings as ASCII

> Not quite: Starlark strings are sequences of UTF-k codes (where k=8 in the Go implementation). In this respect Starlark-go behaves like Go, C, and C++ (and Rust, except that Rust disallows splitting a single rune's encoding, as in your s[:2] example).

> What specific operations do you need? I would expect it is possible to implement many of them in pure Starlark using the various iterator methods on string (e.g. codepoints and codepoint_ords).

Thank you for your clarification regarding Starlark strings. We apologize for the misconception in our initial statement.

In terms of the specific operations needed, our platform's editors use Starlark scripts to edit content in batches. String manipulation operations such as substring extraction, replacement, and indexing are crucial for their tasks, but having to work with byte indices rather than characters makes these operations error-prone in Starlark.

While we appreciate the availability of iterator methods like codepoints and codepoint_ords, we believe that introducing an official runes library or a similar feature for Unicode substrings, string lengths, and broader string manipulation would greatly benefit global, multilingual users. This enhancement would address the limitations in handling non-ASCII characters and improve the accessibility and usability of Starlark.
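To illustrate the byte-index issue, a code-point-aware index lookup currently has to be written by hand, roughly along these lines (cp_find is a hypothetical helper name):

def cp_find(s, sub):
    # Return the code-point index of sub in s, or -1 if absent.
    # s.find() reports a byte offset, so convert it by counting
    # the code points that precede it.
    byte_index = s.find(sub)
    if byte_index < 0:
        return -1
    return len(list(s[:byte_index].codepoints()))

print("你老公技术不错".find("公"))      # 6 (byte offset)
print(cp_find("你老公技术不错", "公"))  # 2 (code-point offset)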

@adonovan
Collaborator

I'm not averse to the idea of a package for operations on UTF-k strings, but what operations do you need that cannot be expressed (or expressed efficiently) today in terms of codepoints?

@Libresse
Author

Libresse commented Oct 5, 2023

> I'm not averse to the idea of a package for operations on UTF-k strings, but what operations do you need that cannot be expressed (or expressed efficiently) today in terms of codepoints?

Operations like s[:2] to get the first 2 characters. For text made of 3-byte Unicode characters we have to write s[:6] to get the first 2 characters, but if the content is plain ASCII we have to write s[:2] instead, otherwise we get more than 2 characters.

@adonovan
Collaborator

adonovan commented Oct 5, 2023

> Operations like s[:2] to get the first 2 characters. For text made of 3-byte Unicode characters we have to write s[:6] to get the first 2 characters, but if the content is plain ASCII we have to write s[:2] instead, otherwise we get more than 2 characters.

Currently you have to express that as:

def codepoints(s):
    return list(s.codepoints())

codepoints("<世界>")[1:3] # ["世", "界"]

but we could specify that the value returned by s.codepoints() is indexable, so that s.codepoints()[1:3] would do what you want. Is there anything else?

(BTW, I suggest you use the term "code point" not "char", since code point is defined by Unicode, and "char" seems to mean whatever the speaker wants it to mean.)
