Segmenting a mixed Chinese-English sentence joins all the English words together #32

laubonghaudoi · 2022-09-28T05:28:11Z

Input:

import pycantonese
pycantonese.pos_tag(pycantonese.segment("我今晚會 have dinner at home"))

The output is:

[('我', 'PRON'), ('今晚', 'ADV'), ('會', 'AUX'), ('havedinnerathome', 'VERB')]

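As you can see, havedinnerathome has become a single verb as a whole, which makes it impossible to restore the original sentence. Would it be possible to segment while preserving the spaces between English words?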

jacksonllee · 2022-09-29T02:04:32Z

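Would it be possible to segment while preserving the spaces between English words?

Technically this should be doable in pycantonese.segment, but since you then want to use pycantonese.pos_tag, another problem comes up: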

In [1]: import pycantonese

In [2]: pycantonese.pos_tag(['我', '今晚', '會', 'have', 'dinner', 'at', 'home'])
Out[2]: 
[('我', 'PRON'),
 ('今晚', 'ADV'),
 ('會', 'AUX'),
 ('have', 'VERB'),
 ('dinner', 'ADP'),
 ('at', 'ADV'),
 ('home', 'VERB')]

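Because pycantonese is built specifically for Cantonese, the POS tags for the English words will be wrong. Would that be a big problem in your case?

If you only need segmentation to respect the spaces and can set the tagging problem aside for now, I can look into how to do it.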
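One way to keep the unreliable English tags from being mistaken for real ones is to overwrite them after tagging. The following is only a minimal sketch, not part of pycantonese: the helper name mask_english_tags and the placeholder tag "X" (the Universal Dependencies tag for "other") are assumptions made for illustration.

```python
def mask_english_tags(tagged, placeholder="X"):
    """Replace the tag of every ASCII-letters-only token with a placeholder.

    `tagged` is a list of (word, tag) pairs, e.g. the output of a POS tagger.
    The helper name and the "X" placeholder are illustrative assumptions.
    """
    masked = []
    for word, tag in tagged:
        if word.isascii() and word.isalpha():
            # Looks like an English word: do not trust the Cantonese tagger.
            masked.append((word, placeholder))
        else:
            masked.append((word, tag))
    return masked


tagged = [
    ("我", "PRON"), ("今晚", "ADV"), ("會", "AUX"),
    ("have", "VERB"), ("dinner", "ADP"), ("at", "ADV"), ("home", "VERB"),
]
print(mask_english_tags(tagged))
```

The tags on the Chinese-character words are left untouched, so anything downstream that relies on them behaves as before.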

laubonghaudoi · 2022-09-29T06:22:18Z

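Thanks a lot. So you mean that before calling pycantonese.segment(), I have to split the English words on spaces myself, right? The main reason is that I'm currently writing https://github.com/CanCLID/typo-corrector , which relies on parts of speech to fix typos and then puts the whole sentence back together, so I need to preserve the spaces between English words. The POS tags of the English words don't need to be very accurate, though; accuracy matters more for the words written in Chinese characters. Written Cantonese found online often mixes Chinese and English, so it's a bit of a hassle to process.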
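Putting a sentence back together from separated tokens could look roughly like the sketch below. It assumes a simple spacing convention (a space wherever an ASCII word touches another token, and no space between Chinese-character words); the helper names are hypothetical, and the original spacing cannot be recovered exactly from the tokens alone.

```python
def is_ascii_word(token):
    """True for tokens made only of ASCII letters and digits."""
    return token.isascii() and token.isalnum()


def join_tokens(tokens):
    """Join tokens, adding a space wherever an ASCII word is involved."""
    parts = []
    for i, token in enumerate(tokens):
        if i > 0 and (is_ascii_word(token) or is_ascii_word(tokens[i - 1])):
            parts.append(" ")
        parts.append(token)
    return "".join(parts)


tokens = ["我", "今晚", "會", "have", "dinner", "at", "home"]
print(join_tokens(tokens))  # 我今晚會 have dinner at home
```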

jacksonllee · 2022-10-01T00:17:45Z

So you mean that before calling pycantonese.segment(), I have to split the English words on spaces myself, right?

If I understand correctly, do you mean something like this?

In [1]: import pycantonese

In [2]: import itertools

In [3]: user_input = "我今晚會 have dinner at home"

In [4]: list(itertools.chain.from_iterable(pycantonese.segment(x) for x in user_input.split()))
Out[4]: ['我', '今晚', '會', 'have', 'dinner', 'at', 'home']

laubonghaudoi · 2022-10-01T02:13:47Z

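Yes, exactly. I ended up implementing it myself. I just wanted to know whether pycantonese could handle this as well, so that it doesn't have to be implemented separately. Or is there some other purpose behind joining the words together like this?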

jacksonllee · 2022-10-04T05:02:53Z

Or is there some other purpose behind joining the words together like this?

pycantonese.segment drops all whitespace in the user input because I ran into this in my own testing:

In [1]: import pycantonese

In [2]: pycantonese.segment('我 今晚會')  # with an accidental space
Out[2]: ['我', ' ', '今晚', '會']  # note: not the behavior in production now. I saw outputs like this and decided to sanitize the user input by removing all whitespace before applying word segmentation.

Now that I'm looking into the implementation of pycantonese.segment again, I think I see how I can update it to satisfy what both you and I have brought up (i.e., keeping English words as separated in the output per this GitHub issue, as well as what I've just described in this comment re: not showing superfluous, space-only words in the output). Between this week and the next, I should hopefully be able to update pycantonese and make a new release to resolve this issue. Stay tuned!

laubonghaudoi · 2022-10-05T17:43:38Z

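I see, so that's the reason. Then at least we can confirm that merging the English words into havedinnerathome is not intended behavior but indeed a bug. Thanks for fixing this; the typo corrector should take a lot less effort from now on :D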

jacksonllee · 2022-10-05T21:13:36Z

I've just resolved this issue by updating the upstream main branch. The new main branch behaves as desired:

In [1]: import pycantonese

In [2]: pycantonese.segment("我今晚會 have dinner at home")
Out[2]: ['我', '今晚', '會', 'have', 'dinner', 'at', 'home']

In [3]: pycantonese.segment("我今 晚會 have dinner at home")
Out[3]: ['我', '今晚', '會', 'have', 'dinner', 'at', 'home']
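For readers who want similar behavior without upgrading, it can be approximated by splitting the input before segmentation. This is a rough sketch under stated assumptions, not the actual change made in pycantonese: the names split_mixed and segment_mixed, the regular expression for English words, and the character-by-character stand-in segmenter are all hypothetical.

```python
import re

# Assumption: a run of ASCII letters/digits (plus ' and -) is one English word.
_ENGLISH_WORD = re.compile(r"[A-Za-z0-9][A-Za-z0-9'-]*")


def _add_other(chunks, fragment):
    """Append a non-English fragment with all whitespace removed, if any is left."""
    cleaned = "".join(fragment.split())
    if cleaned:
        chunks.append(("other", cleaned))


def split_mixed(text):
    """Split text into ("en", word) and ("other", run) chunks, in order."""
    chunks = []
    pos = 0
    for match in _ENGLISH_WORD.finditer(text):
        _add_other(chunks, text[pos:match.start()])
        chunks.append(("en", match.group()))
        pos = match.end()
    _add_other(chunks, text[pos:])
    return chunks


def segment_mixed(text, segment_other):
    """Keep English words as they are; pass everything else to `segment_other`."""
    words = []
    for kind, chunk in split_mixed(text):
        if kind == "en":
            words.append(chunk)
        else:
            words.extend(segment_other(chunk))
    return words


# `list` is only a stand-in that splits character by character;
# a real word segmenter would be passed here instead.
print(segment_mixed("我今 晚會 have dinner at home", list))
```

Because whitespace inside the non-English runs is dropped before segmentation, an accidental space such as the one in "我今 晚會" produces no space-only tokens, while the English words stay separate.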

I was thinking of making a new release after resolving this issue, but on second thought, I'm going to hold off for a bit: #33 is still up in the air, and the new Python 3.11 is coming in a month or so, so I'd like to wait for its Docker images etc. to be available for CI build support.

jacksonllee · 2022-10-05T22:59:53Z

I forgot to mention that you've been acknowledged in the readme. Thanks for reporting this issue!

laubonghaudoi added the bug label Sep 28, 2022

laubonghaudoi mentioned this issue Oct 4, 2022

The segmenter is too slow #33

Open

jacksonllee closed this as completed in a4ef443 Oct 5, 2022

shivanraptor mentioned this issue Oct 9, 2023

Segmenter removes space of English words in code-mixed sentence #43

Open
