Use `KoreanAnalyzer` for Korean language (ko) #2174

sudokim · 2023-08-28T05:30:03Z

This PR enables the use of KoreanAnalyzer, an analyzer specialized for Korean.

The previous CJKAnalyzer only splits sequences into bi-grams, while KoreanAnalyzer splits a sentence into morphemes.
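To make the contrast concrete, here is a minimal sketch of overlapping character bi-grams, which is roughly what a bi-gram-based analyzer emits for a run of Hangul. It is a plain Python illustration, not Lucene's `CJKAnalyzer` implementation; the function name `char_bigrams` and the sample text are hypothetical. A morphological analyzer instead relies on a dictionary to find morpheme boundaries, which a few lines of code cannot reproduce.

```python
from typing import List


def char_bigrams(text: str) -> List[str]:
    """Return overlapping character bi-grams for each whitespace-separated token.

    Illustrative only: this is an assumption-level sketch of bi-gram splitting,
    not the behavior of any specific Lucene class.
    """
    grams: List[str] = []
    for token in text.split():
        if len(token) < 2:
            # A lone character cannot form a bi-gram; keep it as-is in this sketch.
            grams.append(token)
            continue
        # Slide a window of width 2 across the token.
        grams.extend(token[i:i + 2] for i in range(len(token) - 1))
    return grams


print(char_bigrams("한국어 분석기"))
```

Every adjacent pair of syllables becomes an index term regardless of where word or morpheme boundaries fall, so a bi-gram index tends to contain many terms that are not meaningful units.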

Related Lucene issue: LUCENE-8231

lintool · 2023-08-30T02:32:41Z

Hi @sudokim, thanks for the PR! Do you know whether effectiveness improves as a result of switching the analyzer? E.g., on MIRACL or Mr. TyDi?

sudokim · 2023-09-01T12:35:16Z

Hi @lintool, here is a comparison between CJKAnalyzer (previous) and KoreanAnalyzer (suggested) on the Mr. TyDi v1.1 (ko) dataset.

| Analyzer | Indexing time | Index Size | Train Recall@100 | Dev Recall@100 | Test Recall@100 |
|---|---|---|---|---|---|
| CJK | 3:57.55 | 1.3G | 0.6178 | 0.6733 | 0.6188 |
| Korean | 11:12.56 | 880M | 0.7162 | 0.7409 | 0.6971 |

KoreanAnalyzer gives higher Recall@100 on all three splits and a smaller index, although indexing takes about 2.8 times as long.

lintool · 2023-09-01T16:34:25Z

Great! Do you happen to have MRR scores? And also results on MIRACL? (Which will give us nDCG scores.)

sudokim · 2023-09-02T16:16:15Z

Sure! Here are the results:

Mr. TyDi v1.1

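| Analyzer | Indexing Time | Index Size |
|---|---|---|
| CJK | 3:57.55 | 1.3G |
| Korean | 11:12.56 | 880M |

| Analyzer | Train Recall@100 | Dev Recall@100 | Test Recall@100 | Train MRR | Dev MRR | Test MRR |
|---|---|---|---|---|---|---|
| CJK | 0.6178 | 0.6733 | 0.6188 | 0.2596 | 0.2888 | 0.2848 |
| Korean | 0.7162 | 0.7409 | 0.6971 | 0.3103 | 0.3281 | 0.3025 |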

MIRACL

| Analyzer | Indexing Time | Index Size |
|---|---|---|
| CJK | 3:43.85 | 1.3G |
| Korean | 11:13.62 | 876M |

| Analyzer | Dev nDCG@10 | Dev Recall@100 |
|---|---|---|
| CJK | 0.4190 | 0.7831 |
| Korean | 0.4528 | 0.8554 |

lintool · 2023-09-02T17:38:54Z

Awesome, that's great!

We'll get this merged in, but it triggers a long dependency chain: we need to fix the regression, and we also need to fix the pre-built indexes for Pyserini, etc.

Let me queue this up and figure out the cleanest way to do this.

In the meantime, would you be willing to add a test case that confirms tokenization is done "correctly"?

cc/ @crystina-z @thakur-nandan

Use KoreanAnalyzer for Korean language (ko)

0f19331
