
Segmenting a mixed Chinese-English sentence joins all the English words together #32

Closed
laubonghaudoi opened this issue Sep 28, 2022 · 8 comments

@laubonghaudoi

Input:

import pycantonese
pycantonese.pos_tag(pycantonese.segment("我今晚會 have dinner at home"))

The output is:

[('我', 'PRON'), ('今晚', 'ADV'), ('會', 'AUX'), ('havedinnerathome', 'VERB')]

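As you can see, havedinnerathome as a whole has become a single verb, which makes it impossible to restore the original sentence. Could the segmentation be done while preserving the spaces between English words?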

@jacksonllee
Owner

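Could the segmentation be done while preserving the spaces between English words?

Technically this should be doable in pycantonese.segment, but if you then want to use pycantonese.pos_tag, there is another problem: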

In [1]: import pycantonese

In [2]: pycantonese.pos_tag(['我', '今晚', '會', 'have', 'dinner', 'at', 'home'])
Out[2]: 
[('我', 'PRON'),
 ('今晚', 'ADV'),
 ('會', 'AUX'),
 ('have', 'VERB'),
 ('dinner', 'ADP'),
 ('at', 'ADV'),
 ('home', 'VERB')]

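Because pycantonese is built specifically for Cantonese, the POS tags it assigns to the English words will be wrong. Would that be a big problem in your case?

If you only need segmentation to respect the spaces and can set the tagging problem aside for now, I can look into how to do it.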

@laubonghaudoi
Author

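Thanks a lot. So you mean that before calling pycantonese.segment(), I have to split the English words on spaces myself, right? The main reason is that I'm currently writing https://github.com/CanCLID/typo-corrector , which relies on part-of-speech tags to fix typos and then puts the whole sentence back together, so I need to keep the spaces between English words. The POS tags of the English words don't need to be very accurate, though; accuracy matters more for the words written in Chinese characters. Written Cantonese found online often mixes Chinese and English, so it is a bit of a hassle to process.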
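The "put the sentence back together" step described above can be sketched as follows. This is only an illustration, not code from typo-corrector or pycantonese: the function name `join_tokens` and the `_ASCII_WORD` pattern are hypothetical, and the spacing rule (insert a space wherever at least one neighbouring token is an ASCII word) is an assumption chosen to match the example sentence in this issue.

```python
import re
from typing import List

# Hypothetical definition of an "English word" token: ASCII letters/digits,
# optionally containing apostrophes or hyphens after the first character.
_ASCII_WORD = re.compile(r"[A-Za-z0-9][A-Za-z0-9'\-]*")


def join_tokens(tokens: List[str]) -> str:
    """Rebuild a sentence from segmented tokens.

    Assumption: a space is inserted between two adjacent tokens whenever at
    least one of them is an ASCII word; Chinese tokens are joined directly.
    """
    pieces: List[str] = []
    prev_is_ascii = False
    for index, token in enumerate(tokens):
        is_ascii = bool(_ASCII_WORD.fullmatch(token))
        # Never put a space before the first token.
        if index > 0 and (is_ascii or prev_is_ascii):
            pieces.append(" ")
        pieces.append(token)
        prev_is_ascii = is_ascii
    return "".join(pieces)


sentence = join_tokens(["我", "今晚", "會", "have", "dinner", "at", "home"])
```

Because the rule always puts a space at a Chinese/English boundary, an input that originally had no space there (for example 會have) would not be restored exactly; handling that would require keeping the original spacing alongside each token.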

@jacksonllee
Owner

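So you mean that before calling pycantonese.segment(), I have to split the English words on spaces myself, right?

If I understand you correctly, do you mean something like this approach?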

In [1]: import pycantonese

In [2]: import itertools

In [3]: user_input = "我今晚會 have dinner at home"

In [4]: list(itertools.chain.from_iterable(pycantonese.segment(x) for x in user_input.split()))
Out[4]: ['我', '今晚', '會', 'have', 'dinner', 'at', 'home']

@laubonghaudoi
Author

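Yes, exactly, and I ended up implementing it myself. I'm just wondering whether pycantonese could do this along the way, so that it doesn't have to be implemented separately. Or does joining the words together like this serve some other purpose?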

@jacksonllee
Owner

jacksonllee commented Oct 4, 2022

Or does joining the words together like this serve some other purpose?

pycantonese.segment drops all whitespace in the user input because I ran into this in my own testing:

In [1]: import pycantonese

In [2]: pycantonese.segment('我 今晚會')  # with an accidental space
Out[2]: ['我', ' ', '今晚', '會']  # Note: not the current production behavior. I saw outputs like this and decided to sanitize the user input by removing all whitespace before applying word segmentation.

Now that I'm looking into the implementation of pycantonese.segment again, I think I see how I can update it to satisfy what both you and I have brought up (i.e., keeping English words separate in the output per this GitHub issue, and not showing superfluous, space-only words in the output as I've just described in this comment). Between this week and the next, I should hopefully be able to update pycantonese and make a new release to resolve this issue. Stay tuned!
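One way to meet both requirements at once can be sketched as below. This is not the change that was made in pycantonese; it is a minimal illustration under stated assumptions. The names `segment_mixed` and `_LATIN_WORD` are hypothetical, and the segmenter is passed in as a callable so the sketch does not depend on any particular library.

```python
import re
from typing import Callable, List

# Hypothetical pattern for an "English word": ASCII letters/digits, optionally
# joined by apostrophes or hyphens (e.g. "don't", "check-in").
_LATIN_WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def segment_mixed(text: str, segment_cjk: Callable[[str], List[str]]) -> List[str]:
    """Segment mixed text, keeping English words separate.

    Whitespace-delimited chunks that look like English words are emitted
    as-is. All other consecutive chunks are glued back together (dropping
    the whitespace between them) and handed to `segment_cjk`, so accidental
    spaces inside Chinese text never become tokens.
    """
    tokens: List[str] = []
    pending: List[str] = []  # non-English chunks waiting to be segmented together

    def flush() -> None:
        if pending:
            tokens.extend(segment_cjk("".join(pending)))
            pending.clear()

    for chunk in text.split():
        if _LATIN_WORD.fullmatch(chunk):
            flush()
            tokens.append(chunk)
        else:
            pending.append(chunk)
    flush()
    return tokens


# Stand-in segmenter for demonstration only: splits into single characters.
demo = segment_mixed("我今 晚會 have dinner at home", list)
```

With pycantonese installed, `pycantonese.segment` could be passed as `segment_cjk` in place of the character-level stand-in. A chunk that mixes scripts without a space (for example 會have) is treated as non-English here and left to the supplied segmenter, which is a known limitation of this sketch.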

@laubonghaudoi
Author

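Got it, so that's the reason. At least we can now confirm that merging the English words into havedinnerathome is not intended behavior but indeed a bug. Thanks a lot for fixing this; it should save the typo corrector a lot of effort from now on :D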

@jacksonllee
Owner

I've just resolved this issue by updating the upstream main branch. The new main branch behaves as desired:

In [1]: import pycantonese

In [2]: pycantonese.segment("我今晚會 have dinner at home")
Out[2]: ['我', '今晚', '會', 'have', 'dinner', 'at', 'home']

In [3]: pycantonese.segment("我今 晚會 have dinner at home")
Out[3]: ['我', '今晚', '會', 'have', 'dinner', 'at', 'home']

I was thinking of making a new release after resolving this issue, but on second thought, I'm going to hold off a bit: #33 is still up in the air, and the new Python 3.11 is coming in a month or so, so I'd like to wait for its Docker images etc. to be available for CI build support.

@jacksonllee
Owner

I forgot to mention that you've been acknowledged in the readme. Thanks for reporting this issue!
