
分詞器速度太慢 #33

Open
laubonghaudoi opened this issue Oct 4, 2022 · 13 comments

Comments

@laubonghaudoi

The current .segment() is rather inefficient; it doesn't look like an optimal algorithm. @graphemecluster @ZhanruiLiang We may open a PR later to look into optimizing it. While we're at it, we can also fix the segmentation problem in #32.

@jacksonllee
Owner

We may open a PR later to look into optimizing it.

I'd recommend discussing what you guys have in mind before opening a PR. I've had a strong preference for (i) using only datasets and models that have no legal issues, especially for commercial usage, and (ii) keeping this package simple, so no heavy dependencies, etc. These constraints are the major reason why pycantonese has used only the HKCanCor and rime-cantonese datasets so far.

@ZhanruiLiang

75% of the CPU time in our typo-corrector use case is spent in tagger.py. The throughput is currently about 72 kB/sec in my test environment; a modern computer should be able to do much better than this.
It looks like _AveragedPerceptron.predict() is essentially doing matrix multiplication in pure Python. Would numpy be an acceptable new dependency for this project?

@jacksonllee
Owner

So you guys are talking about pycantonese.pos_tag but not pycantonese.segment? I got confused by how this issue was first raised.

It looks like _AveragedPerceptron.predict() is essentially doing matrix multiplication in pure Python. Would numpy be an acceptable new dependency for this project?

That's right. The open-source implementation of the averaged perceptron tagger that I got uses dicts as the main data structure for the matrix multiplication, which is okay for a pure Python implementation since we're dealing with sparse matrices. I'd be curious to see whether numpy arrays would speed things up.
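For readers following along, here is a minimal sketch of what dict-based averaged perceptron scoring looks like; the feature names and weight values are illustrative, not the actual pycantonese tagger.py code:

```python
# Hypothetical sketch of dict-based perceptron scoring, not the actual
# pycantonese code. Because feature vectors are sparse, a mapping of
# {feature: {label: weight}} touches only the nonzero entries.
from collections import defaultdict

def predict(features, weights, classes):
    """Score each class by summing the weights of the active features."""
    scores = defaultdict(float)
    for feat, value in features.items():
        if value == 0 or feat not in weights:
            continue
        for label, weight in weights[feat].items():
            scores[label] += value * weight
    # Break ties deterministically by label name.
    return max(classes, key=lambda label: (scores[label], label))

weights = {"suffix=ing": {"VERB": 1.5, "NOUN": -0.3}}
print(predict({"suffix=ing": 1.0}, weights, ["NOUN", "VERB"]))  # VERB
```

Each prediction walks only the features that actually fire, which is why plain dicts stay competitive for sparse inputs despite the Python-level loops.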

@laubonghaudoi
Author

Sorry, I've been talking about .segment() the whole time, not .pos_tag(). Also, I looked at the model; it seems to come from a project that has already been archived, and its page says to just call NLTK directly instead: https://github.com/sloria/textblob-aptagger

Would it be possible to change the segmenter to use NLTK?

@jacksonllee
Owner

Sorry, I've been talking about .segment() the whole time, not .pos_tag().

I've just made a new release of wordseg to speed up word segmentation, thanks to @ZhanruiLiang's tip. pycantonese.segment should now be a couple times faster than before. If you pip install --upgrade wordseg in your Python environment where pycantonese is installed, you should be able to use this updated word segmentation code.

Also, I looked at the model; it seems to come from a project that has already been archived, and its page says to just call NLTK directly instead: https://github.com/sloria/textblob-aptagger
Would it be possible to change the segmenter to use NLTK?

When I got the averaged perceptron code from the textblob codebase, it had already been marked as archived as you've described. What's the difference between the averaged perceptron tagger code in NLTK and the one in pycantonese? Last time I checked, the copy in NLTK hadn't really been worked on, and was essentially a straight up copy-and-paste from the textblob codebase like the pycantonese copy. If possible, I'd avoid having to include NLTK as a dependency if it didn't add anything new.

@graphemecluster

I should have replied earlier – this does not fix the problem at all.
What I observed from your code is that the time complexity of the algorithm is O(n²). To reduce it to O(n), you should rewrite the code using a trie. It's simple, and it's exactly what CanCLID's ToJyutping uses; it also performs word segmentation with rime-cantonese by longest-prefix match.
Simple JS implementations: https://github.com/CanCLID/to-jyutping/blob/main/src/Trie.ts and https://github.com/CanCLID/inject-jyutping/blob/main/lib/Trie.js
For the Python version of ToJyutping we simply use pygtrie.
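As a rough illustration of the longest-prefix-match idea, here is a hypothetical sketch with a toy word list; it is not the ToJyutping or wordseg code:

```python
# Hypothetical trie-based longest-prefix-match segmenter, illustrative
# only. Each trie node is a plain dict; END marks the end of a word.
END = object()

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def segment(text, trie):
    """Greedily match the longest dictionary word at each position."""
    result, i = [], 0
    while i < len(text):
        node, longest = trie, 0
        for j, ch in enumerate(text[i:], start=1):
            node = node.get(ch)
            if node is None:
                break
            if END in node:
                longest = j  # remember the longest match so far
        length = longest or 1  # unknown character: emit it alone
        result.append(text[i:i + length])
        i += length
    return result

trie = build_trie(["廣東話", "廣東", "大家", "鍾意"])
print(segment("大家鍾意廣東話", trie))  # ['大家', '鍾意', '廣東話']
```

Each position scans at most as far as the longest dictionary word, so for a bounded maximum word length the pass is effectively linear in the input.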

@graphemecluster

Here is the performance test I did yesterday:
Performance Test

@graphemecluster

Of course, I'd be happier with an ML-based segmenter if possible, though.

@ZhanruiLiang

That's right. The open-source implementation of the averaged perceptron tagger that I got uses dicts as the main data structure for the matrix multiplication, which is okay for a pure Python implementation since we're dealing with sparse matrices. I'd be curious to see whether numpy arrays would speed things up.

I tried a draft implementation using numpy (all tests passing), and it turned out to give a 3x speedup. The new bottleneck is _get_features, which takes about 50% of the time in the tag() function of tagger.py.
However, due to issue #34 my workspace is not in a clean state, so I can't upload it right now.
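The kind of numpy packing described above might look roughly like this; the index mappings and weight values are made up for illustration, and this is not the actual draft PR:

```python
# Hypothetical sketch: packing sparse perceptron weights into a dense
# numpy array so that scoring becomes a vectorized row sum.
import numpy as np

feat_index = {"suffix=ing": 0, "prev=the": 1}  # feature -> row
labels = ["NOUN", "VERB"]                       # column -> label

# weights[i, j] = weight of feature i for label j
weights = np.array([[-0.3, 1.5],
                    [0.8, -0.2]])

def predict(active_features):
    """Sum the weight rows of the active features and take the argmax."""
    rows = [feat_index[f] for f in active_features if f in feat_index]
    scores = weights[rows].sum(axis=0)  # one row per active feature
    return labels[int(np.argmax(scores))]

print(predict(["suffix=ing"]))  # VERB
```

The per-feature Python loop collapses into a single fancy-indexed sum, which is where a dense numpy layout can beat dict lookups once enough features fire per token.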

@jacksonllee
Owner

Thanks for the note, @graphemecluster! I've just made a new release of wordseg (my own word segmentation package that pycantonese uses) to speed up word segmentation, thanks to @ZhanruiLiang's tip. pycantonese.segment should now be a couple times faster than before. If you pip install --upgrade wordseg in your Python environment where pycantonese is installed, you should be able to use this updated word segmentation code.

Here's my timeit results with the new wordseg code:

In [8]: %timeit ToJyutping.get_jyutping_list("大家都好鍾意做廣東話嘢。")
19.2 µs ± 39.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [9]: %timeit pycantonese.segment("大家都好鍾意做廣東話嘢。")
9.79 µs ± 91.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Re: trie, I also have my own implementation in nskipgrams. I definitely agree the trie data structure is better for all the good reasons. Maybe I'll get to propagating it to my codebases one of these days.

@jacksonllee
Owner

jacksonllee commented Oct 4, 2022

@ZhanruiLiang Thank you for looking into possibly using numpy to speed up POS tagging. Re: #34, I'll take a look and keep you posted. EDIT: #34 has been resolved.

Thank you everyone for combing through my codebases, really appreciate it!

@jacksonllee
Owner

Both #32 and #34 have been resolved. @ZhanruiLiang The upstream main branch is now ready if you're still up for a pull request with your improved POS tagger code.

@ZhanruiLiang

I'm working on a PR but need time to get the tests passing.
