Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use KoreanAnalyzer for Korean language (ko) #2174

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sudokim
Copy link

@sudokim sudokim commented Aug 28, 2023

This PR enables the use of KoreanAnalyzer, an analyzer specialized for Korean.

The previous CJKAnalyzer only splits sequences into bi-grams, while KoreanAnalyzer splits a sentence into morphemes.

LUCENE-8231

@lintool
Copy link
Member

lintool commented Aug 30, 2023

Hi @sudokim thanks for the PR! Do you have any idea if effectiveness improves as a result of switching the analyzer? E.g., on MIRACL or Mr.Tydi?

@sudokim
Copy link
Author

sudokim commented Sep 1, 2023

Hi @lintool, here is the comparison result between CJKAnalyzer(previous) and KoreanAnalyzer(suggested) on Mr. Tydi v1.1 (ko) dataset.

Analyzer Indexing time Index Size Train Recall@100 Dev Recall@100 Test Recall@100
CJK 3:57.55 1.3G 0.6178 0.6733 0.6188
Korean 11:12.56 880M 0.7162 0.7409 0.6971

It seems that KoreanAnalyzer performs better, although indexing takes much longer.

@lintool
Copy link
Member

lintool commented Sep 1, 2023

Great! Do you happen to have MRR scores? And also results on MIRACL? (Which will give us nDCG scores.)

@sudokim
Copy link
Author

sudokim commented Sep 2, 2023

Sure! Here are the results:

Mr.Tydi v1.1

Indexing Time Index Size
CJK 3:57.55 1.3G
Korean 11:12.56 880M
Train Recall@100 Dev Recall@100 Test Recall@100 Train MRR Dev MRR Test MRR
CJK 0.6178 0.6733 0.6188 0.2596 0.2888 0.2848
Korean 0.7162 0.7409 0.6971 0.3103 0.3281 0.3025

MIRACL

Indexing Time Index Size
CJK 3:43.85 1.3G
Korean 11:13.62 876M
Dev nDCG@10 Dev Recall@100
CJK 0.4190 0.7831
Korean 0.4528 0.8554

@lintool
Copy link
Member

lintool commented Sep 2, 2023

Awesome, that's great!

We'll get this merged in... but it triggers a long dependency chain... we need to fix the regression... we also need to fix the pre-built indexes for pyserini, etc.

Let me queue this up and figure out the cleanest way to do this.

In the meantime, would you be willing to add a test case that confirms tokenization is done "correctly"?

cc/ @crystina-z @thakur-nandan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants