Japanese characters changed on paste in Editor #722

Open
takarabune opened this issue Nov 25, 2023 · 0 comments

System Information

  • Pythonista N/A (N/A), Default interpreter 3.10.4
  • iOS 17.1, model iPad11,1, resolution (portrait) 1536.0 x 2048.0 @ 2.0

Pythonista 3.4

Problem
Certain Japanese characters are changed when they are pasted into the IDE.
The affected characters include:
(〜で)で
だぢづでど ぱぴぷぺぽ がぎぐげご バビブべボ


⦅-゚

㈠-㉃㊀-㋾㌀-㍿

When the above is pasted into the IDE, it changes to:
(〜で)で
だぢづでど ぱぴぷぺぽ がぎぐげご バビブべボ
より
コト
⦅-゚
~
(一)-(至)一-ヲアパート-株式会社

Many characters are decomposed into separate characters, or are replaced by similar (but importantly different) characters. Some characters may look identical but have different code points. Note that the brackets are no longer full-width. Also note the change in the tilde-like character 〜 and in the symbols on the last line.
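The changes listed above look consistent with Unicode compatibility decomposition (NFKD). That is only my assumption about what happens on paste; I have not confirmed what the editor actually does. A minimal sketch that applies NFKD to a few of the affected characters, written with escapes so the snippet itself cannot be altered by pasting (the helper names `code_points` and `nfkd` are made up for this illustration):

```python
import unicodedata

# Escaped samples, so that this snippet survives pasting unchanged.
SAMPLES = [
    "\u3067",  # HIRAGANA LETTER DE
    "\uff08",  # FULLWIDTH LEFT PARENTHESIS
    "\u3220",  # PARENTHESIZED IDEOGRAPH ONE
    "\u337f",  # SQUARE CORPORATION
]


def code_points(text):
    """Return the code points of text, e.g. 'U+3066 U+3099'."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


def nfkd(text):
    """Apply compatibility decomposition (NFKD) to text."""
    return unicodedata.normalize("NFKD", text)


for original in SAMPLES:
    # Print the character name, its code point, and the decomposed code points.
    print(unicodedata.name(original), ":",
          code_points(original), "->", code_points(nfkd(original)))
```

If the editor really does something like this, the dakuten cases are recoverable later (canonical decomposition can be recomposed), but the full-width brackets and enclosed symbols are not, because compatibility mappings discard the original character.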

If the correct characters are already in the editor, they can be copied from there and pasted correctly into another text editor.

Demonstration code

import unicodedata

typed_text = ("で")
pasted_text = ("で") # manual copy and paste from previous line

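# The two strings look alike but are made of different code points
print(typed_text, "==", pasted_text, ":", typed_text == pasted_text)

# List the code point and Unicode name of every character in each string
print("typed text: ")
for c in typed_text:
    print(ord(c), unicodedata.name(c))

print("pasted text: ")
for c in pasted_text:
    print(ord(c), unicodedata.name(c))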

NB: Do not copy and paste this into the Pythonista IDE.
Save it with another text editor, then run the file in Pythonista.

It outputs the following in the console:

で == で : False
typed text: 
12391 HIRAGANA LETTER DE
pasted text: 
12390 HIRAGANA LETTER TE
12441 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
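For this particular pair, the two forms are canonically equivalent, so normalizing both sides to NFC before comparing makes them equal again. A small sketch using the code points from the output above (the helper name `same_after_nfc` is hypothetical):

```python
import unicodedata

typed_text = "\u3067"         # HIRAGANA LETTER DE (precomposed, 12391)
pasted_text = "\u3066\u3099"  # HIRAGANA LETTER TE (12390) + combining voiced sound mark (12441)


def same_after_nfc(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(typed_text == pasted_text)                # False
print(same_after_nfc(typed_text, pasted_text))  # True
```

This only helps for characters that were canonically decomposed. It does not bring back full-width brackets or the enclosed symbols.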

Expected behavior
I expect the pasted characters to be identical to the characters that were copied.

Comment
I noticed this when some regex patterns I had pasted in did not match correctly.
Tokenizing some pasted text showed that some characters had been decomposed.
When searching for a term in a SQLite database, a pasted-in query did not return an expected match because characters had been changed on pasting.
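The database case can be reproduced without the IDE. This is a sketch, not my actual script: the table name `words`, the sample term, and the helper `find` are made up for illustration. The stored term uses the precomposed character, while the query uses the decomposed form that pasting produces.

```python
import sqlite3
import unicodedata

stored = "\u3067\u3093\u308f"        # precomposed DE + N + WA
pasted = "\u3066\u3099\u3093\u308f"  # TE + combining voiced sound mark + N + WA


def find(conn, term):
    """Return all rows whose term matches exactly."""
    return conn.execute(
        "SELECT term FROM words WHERE term = ?", (term,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (term TEXT)")
conn.execute("INSERT INTO words VALUES (?)", (stored,))

raw_hits = find(conn, pasted)                                # decomposed query
nfc_hits = find(conn, unicodedata.normalize("NFC", pasted))  # recomposed query
print(len(raw_hits), len(nfc_hits))  # 0 1
```

Normalizing the query to NFC finds the row again, but only because this term was canonically decomposed. The same trick would not work for the characters that were swapped for compatibility equivalents.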

It can be worked around by not pasting, I suppose, or by using a third-party IDE or text editor.
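Another possible workaround, assuming the paste only alters non-ASCII characters (I have not verified this), is to keep the affected text in source files as ASCII escape sequences. A rough sketch with two hypothetical helpers, `to_ascii_literal` and `from_ascii_literal`, to be run outside the Pythonista editor to produce the escaped form:

```python
def to_ascii_literal(text):
    """Return text as ASCII-only escapes, e.g. backslash-u3067.

    Quote characters are not escaped, so take care when wrapping
    the result in quotes inside a source file.
    """
    return text.encode("unicode_escape").decode("ascii")


def from_ascii_literal(escaped):
    """Invert to_ascii_literal, restoring the original characters."""
    return escaped.encode("ascii").decode("unicode_escape")


original = "\u3067\uff08"  # HIRAGANA LETTER DE + FULLWIDTH LEFT PARENTHESIS
escaped = to_ascii_literal(original)
print(escaped)
print(from_ascii_literal(escaped) == original)
```

The escaped string contains only ASCII, so it should be safe to paste, and Python turns it back into the intended characters when it is used as a string literal.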

When inputting data in a finished script, it most likely is not an issue, as data will be read from an external file.

It points to Unicode not being handled properly on paste.
