
分詞器速度太慢 #33

Open
laubonghaudoi opened this issue Oct 4, 2022 · 13 comments

Comments

@laubonghaudoi

The current .segment() is rather inefficient; it doesn't look like an optimal algorithm. @graphemecluster @ZhanruiLiang We may open a PR later to look into optimizing it. While we're at it, we can also fix the segmentation problem in #32.

@jacksonllee
Owner

We may open a PR later to look into optimizing it.

I'd recommend discussing what you guys have in mind before opening a PR. I've had a strong preference for (i) using only datasets and models that have no legal issues, especially for commercial usage, and (ii) keeping this package simple, so no heavy dependencies, etc. These constraints are the major reason why pycantonese has used only the HKCanCor and rime-cantonese datasets so far.

@ZhanruiLiang

75% of the CPU time in our typo-corrector use case is spent in tagger.py. The throughput is currently about 72 kB/sec in my test environment; a modern computer should be able to do much better than this.
It looks like _AveragedPerceptron.predict() is essentially doing matrix multiplication in pure Python. Would numpy be an acceptable new dependency for this project?

@jacksonllee
Owner

So you guys are talking about pycantonese.pos_tag but not pycantonese.segment? I got confused by how this issue was first raised.

It looks like _AveragedPerceptron.predict() is essentially doing matrix multiplication in pure Python. Would numpy be an acceptable new dependency for this project?

That's right. The open-source implementation of the averaged perceptron tagger that I got uses dicts as the main data structure for the matrix multiplication, which is okay for a pure Python implementation since we're dealing with sparse matrices. I'd be curious to see whether numpy arrays would speed things up.
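For readers following along, here is a minimal sketch of what dict-based averaged perceptron scoring looks like; the feature names and weight values are illustrative, not the actual pycantonese tagger.py code:

```python
# Hypothetical sketch of dict-based perceptron scoring, not the actual
# pycantonese code. Because feature vectors are sparse, a mapping of
# {feature: {label: weight}} touches only the nonzero entries.
from collections import defaultdict

def predict(features, weights, classes):
    """Score each class by summing the weights of the active features."""
    scores = defaultdict(float)
    for feat, value in features.items():
        if value == 0 or feat not in weights:
            continue
        for label, weight in weights[feat].items():
            scores[label] += value * weight
    # Break ties deterministically by label name.
    return max(classes, key=lambda label: (scores[label], label))

weights = {"suffix=ing": {"VERB": 1.5, "NOUN": -0.3}}
print(predict({"suffix=ing": 1.0}, weights, ["NOUN", "VERB"]))  # VERB
```

Each prediction walks only the features that actually fire, which is why plain dicts stay competitive for sparse inputs despite the Python-level loops.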

@laubonghaudoi
Author

Sorry, I've been talking about .segment() the whole time, not .pos_tag(). Also, I looked at the model; it seems to come from a project that has already been archived, and its page says to just call NLTK directly instead: https://github.com/sloria/textblob-aptagger

Would it be possible to change the segmenter to use NLTK?

@jacksonllee
Owner

Sorry, I've been talking about .segment() the whole time, not .pos_tag().

I've just made a new release of wordseg to speed up word segmentation, thanks to @ZhanruiLiang's tip. pycantonese.segment should now be a couple times faster than before. If you pip install --upgrade wordseg in your Python environment where pycantonese is installed, you should be able to use this updated word segmentation code.

Also, I looked at the model; it seems to come from a project that has already been archived, and its page says to just call NLTK directly instead: https://github.com/sloria/textblob-aptagger
Would it be possible to change the segmenter to use NLTK?

When I got the averaged perceptron code from the textblob codebase, it had already been marked as archived as you've described. What's the difference between the averaged perceptron tagger code in NLTK and the one in pycantonese? Last time I checked, the copy in NLTK hadn't really been worked on, and was essentially a straight up copy-and-paste from the textblob codebase like the pycantonese copy. If possible, I'd avoid having to include NLTK as a dependency if it didn't add anything new.

@graphemecluster

I should have replied earlier – this does not fix the problem at all.
What I observed from your code is that the time complexity of the algorithm is O(n²). To reduce it to O(n), you should rewrite the code using a trie. It's simple, and it's exactly what CanCLID's ToJyutping uses; it also performs word segmentation with rime-cantonese by longest-prefix match.
Simple JS implementations: https://github.com/CanCLID/to-jyutping/blob/main/src/Trie.ts and https://github.com/CanCLID/inject-jyutping/blob/main/lib/Trie.js
For the Python version of ToJyutping we simply use pygtrie.
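As a rough illustration of the longest-prefix-match idea, here is a hypothetical sketch with a toy word list; it is not the ToJyutping or wordseg code:

```python
# Hypothetical trie-based longest-prefix-match segmenter, illustrative
# only. Each trie node is a plain dict; END marks the end of a word.
END = object()

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def segment(text, trie):
    """Greedily match the longest dictionary word at each position."""
    result, i = [], 0
    while i < len(text):
        node, longest = trie, 0
        for j, ch in enumerate(text[i:], start=1):
            node = node.get(ch)
            if node is None:
                break
            if END in node:
                longest = j  # remember the longest match so far
        length = longest or 1  # unknown character: emit it alone
        result.append(text[i:i + length])
        i += length
    return result

trie = build_trie(["廣東話", "廣東", "大家", "鍾意"])
print(segment("大家鍾意廣東話", trie))  # ['大家', '鍾意', '廣東話']
```

Each position scans at most as far as the longest dictionary word, so for a bounded maximum word length the pass is effectively linear in the input.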

@graphemecluster

Here is the performance test I did yesterday:
Performance Test

@graphemecluster

Of course, I'd be happier with an ML-based segmenter if possible, though.

@ZhanruiLiang

That's right. The open-source implementation of the averaged perceptron tagger that I got uses dicts as the main data structure for the matrix multiplication, which is okay for a pure Python implementation since we're dealing with sparse matrices. I'd be curious to see whether numpy arrays would speed things up.

I tried a draft implementation using numpy (all tests passing), and it turned out to give a 3x speedup. The new bottleneck is _get_features, which takes about 50% of the time in the tag() function of tagger.py.
However, due to issue #34 my workspace is not in a clean state, so I can't upload it right now.
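The kind of numpy packing described above might look roughly like this; the index mappings and weight values are made up for illustration, and this is not the actual draft PR:

```python
# Hypothetical sketch: packing sparse perceptron weights into a dense
# numpy array so that scoring becomes a vectorized row sum.
import numpy as np

feat_index = {"suffix=ing": 0, "prev=the": 1}  # feature -> row
labels = ["NOUN", "VERB"]                       # column -> label

# weights[i, j] = weight of feature i for label j
weights = np.array([[-0.3, 1.5],
                    [0.8, -0.2]])

def predict(active_features):
    """Sum the weight rows of the active features and take the argmax."""
    rows = [feat_index[f] for f in active_features if f in feat_index]
    scores = weights[rows].sum(axis=0)  # one row per active feature
    return labels[int(np.argmax(scores))]

print(predict(["suffix=ing"]))  # VERB
```

The per-feature Python loop collapses into a single fancy-indexed sum, which is where a dense numpy layout can beat dict lookups once enough features fire per token.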

@jacksonllee
Owner

Thanks for the note, @graphemecluster! I've just made a new release of wordseg (my own word segmentation package that pycantonese uses) to speed up word segmentation, thanks to @ZhanruiLiang's tip. pycantonese.segment should now be a couple times faster than before. If you pip install --upgrade wordseg in your Python environment where pycantonese is installed, you should be able to use this updated word segmentation code.

Here's my timeit results with the new wordseg code:

In [8]: %timeit ToJyutping.get_jyutping_list("大家都好鍾意做廣東話嘢。")
19.2 µs ± 39.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [9]: %timeit pycantonese.segment("大家都好鍾意做廣東話嘢。")
9.79 µs ± 91.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Re: trie, I also have my own implementation in nskipgrams. I definitely agree the trie data structure is better for all the good reasons. Maybe I'll get to propagating it to my codebases one of these days.

@jacksonllee
Owner

jacksonllee commented Oct 4, 2022

@ZhanruiLiang Thank you for looking into possibly using numpy to speed up POS tagging. Re: #34, I'll take a look and keep you posted. EDIT: #34 has been resolved.

Thank you everyone for combing through my codebases, really appreciate it!

@jacksonllee
Owner

Both #32 and #34 have been resolved. @ZhanruiLiang The upstream main branch is now ready if you're still up for a pull request with your improved POS tagger code.

@ZhanruiLiang

I'm working on a PR but need time to get the tests passing.
