The word segmenter is too slow #33
Comments
I'd recommend discussing what you guys have in mind before opening a PR. I've had a strong preference for (i) using only datasets and models that have no legal issues, especially for commercial usage, and (ii) keeping this package simple, so no heavy dependencies, etc. These constraints are the major reason why pycantonese has used only the HKCanCor and rime-cantonese datasets so far.
75% of the CPU time in our typo-corrector use case is spent in tagger.py. Currently the throughput is about 72 kB/sec in my test environment. I think a modern computer can do much better than this.
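For anyone wanting to reproduce this kind of per-function breakdown, a minimal `cProfile` sketch looks like the following. The workload here is a stand-in dict-lookup loop, not pycantonese's actual tagger; only the profiling pattern is the point.

```python
import cProfile
import io
import pstats

def hot_path(text):
    # Stand-in for tagger-style work: per-token dict lookups.
    weights = {w: len(w) for w in text.split()}
    return sum(weights.get(tok, 0) for tok in text.split())

def workload():
    for _ in range(10_000):
        hot_path("this is a small stand-in sentence")

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time to see which functions dominate,
# analogous to finding that tagger.py accounts for 75% of CPU time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumtime").print_stats(5)
report = out.getvalue()
```

Sorting by `cumtime` attributes time to callers as well as callees, which is usually what you want when deciding where to optimize.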
So you guys are talking about
That's right. The open-source implementation of the averaged perceptron tagger that I got uses dicts as the major data structure for matrix multiplication, which is okay for a pure Python implementation as we're dealing with sparse matrices. I'd be curious to see if numpy arrays may speed things up.
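For context, the dict-based scoring described above looks roughly like this. This is a simplified sketch, not pycantonese's actual code; the feature names, weights, and tag set are made up. Each sparse feature vector is just a list of active feature keys, and the "matrix multiplication" is a loop over those keys:

```python
from collections import defaultdict

# Sketch of dict-based averaged perceptron scoring over sparse features.
# weights maps feature -> {tag: weight}; only active features are touched.
weights = {
    "word=foo": {"NOUN": 1.5, "VERB": -0.2},
    "prev_tag=DET": {"NOUN": 0.8, "VERB": 0.1},
}

def predict(features):
    scores = defaultdict(float)
    for feat in features:
        for tag, w in weights.get(feat, {}).items():
            scores[tag] += w
    # Fall back to an arbitrary default tag if no feature fired.
    return max(scores, key=scores.get) if scores else "NOUN"

tag = predict(["word=foo", "prev_tag=DET"])  # NOUN: 2.3, VERB: -0.1
```

The appeal of this layout is that cost scales with the number of *active* features, not the full feature space; the downside is pure-Python loop and hashing overhead on every token.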
Sorry, what I've been asking all along is: could the word segmenter be changed to use NLTK?
I've just made a new release of
When I got the averaged perceptron code from the textblob codebase, it had already been marked as archived as you've described. What's the difference between the averaged perceptron tagger code in NLTK and the one in pycantonese? Last time I checked, the copy in NLTK hadn't really been worked on, and was essentially a straight-up copy-and-paste from the textblob codebase, like the pycantonese copy. If possible, I'd avoid including NLTK as a dependency if it didn't add anything new.
I should have replied earlier – this does not fix the problem at all.
Of course, I'd be happier with an ML-based segmentation if possible though.
I tried a draft implementation using numpy (passing tests) and it turned out to give a 3x speedup. The new bottleneck becomes `_get_features`, which takes 50% of the time in the `tag()` function of tagger.py.
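A numpy rewrite of the dict-based scoring might look like the following sketch. This is an illustration of the general idea, not the draft PR's actual code: features and tags are mapped to integer indices, the weights become a dense `(n_features, n_tags)` array, and scoring a token is a row-gather plus a sum.

```python
import numpy as np

# Sketch: dense weight matrix replacing the dict-of-dicts layout.
# Rows are features, columns are tags (toy values, same as the dict sketch).
feat_index = {"word=foo": 0, "prev_tag=DET": 1}
tags = ["NOUN", "VERB"]
W = np.array([[1.5, -0.2],
              [0.8,  0.1]])

def predict(features):
    rows = [feat_index[f] for f in features if f in feat_index]
    if not rows:
        return tags[0]  # arbitrary default when no feature fires
    scores = W[rows].sum(axis=0)   # shape (n_tags,)
    return tags[int(np.argmax(scores))]

tag = predict(["word=foo", "prev_tag=DET"])
```

The trade-off is memory: a dense array stores zeros for every inactive (feature, tag) pair, so for a very sparse model it can be worth batching tokens so the vectorized arithmetic amortizes the per-call overhead.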
Thanks for the note, @graphemecluster! I've just made a new release of
Here's my timeit results with the new

```
In [8]: %timeit ToJyutping.get_jyutping_list("大家都好鍾意做廣東話嘢。")
19.2 µs ± 39.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [9]: %timeit pycantonese.segment("大家都好鍾意做廣東話嘢。")
9.79 µs ± 91.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

Re: trie, I also have my own implementation in nskipgrams. I definitely agree the trie data structure is better for all the good reasons. Maybe I'll get to propagating it to my codebases one of these days.
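For readers unfamiliar with why a trie helps here, below is a minimal sketch of trie-backed greedy longest-match segmentation. The toy dictionary and code are illustrative only (this is neither nskipgrams' nor pycantonese's implementation): each character descends one level of the trie, so matching all dictionary words starting at a position costs one pass over the text, with no per-candidate hashing of whole substrings.

```python
# Toy trie for greedy longest-match word segmentation.
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def segment(text, trie):
    out, i = [], 0
    while i < len(text):
        node, j, best = trie, i, i + 1  # fall back to a single character
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "$" in node:
                best = j  # longest dictionary match seen so far
        out.append(text[i:best])
        i = best
    return out

trie = build_trie(["廣東話", "廣東", "鍾意"])
words = segment("好鍾意廣東話", trie)  # → ['好', '鍾意', '廣東話']
```

Greedy longest-match is the simplest policy; a production segmenter would typically combine the trie with a scoring model rather than always taking the longest word.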
@ZhanruiLiang Thank you for looking into possibly using numpy to speed up POS tagging. Re: #34, thank you everyone for combing through my codebases, really appreciate it!
Both #32 and #34 have been resolved. @ZhanruiLiang The upstream
I'm working on a PR but need time to get the tests passing. |
Currently `.segment()` is rather inefficient; it doesn't seem to be the optimal algorithm. @graphemecluster @ZhanruiLiang I may open a PR later to look into optimizing it, and fix the word-segmentation problem in #32 along the way as well.