Ko-Sentence-BERT-SKTBERT

🌼 Korean SentenceBERT: Sentence Embeddings using Siamese BERT-Networks, built on SKT KoBERT and the Kakao Brain KorNLU datasets

Installation

  • Because the huggingface transformers, sentence-transformers, and tokenizers library code is modified directly, using a virtual environment is recommended.
  • The Docker image used is available on Docker Hub.
git clone https://github.com/SKTBrain/KoBERT.git
cd KoBERT
pip install -r requirements.txt
pip install .
cd ..
git clone https://github.com/BM-K/KoSentenceBERT_SKTBERT.git
pip install -r requirements.txt
  • Move the transformer, tokenizers, and sentence_transformers directories into opt/conda/lib/python3.7/site-packages/.
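
After copying the directories, a quick import check confirms that the patched libraries are the ones being picked up. This is a minimal sketch; the KoBERT loader below follows SKTBrain/KoBERT's documented API and downloads the pretrained weights on first use.

# Sanity check: all three patched libraries plus KoBERT should import cleanly.
import transformers
import tokenizers
import sentence_transformers
from kobert.pytorch_kobert import get_pytorch_kobert_model

bert_model, vocab = get_pytorch_kobert_model()  # downloads KoBERT weights on first call
print('KoBERT loaded:', type(bert_model).__name__)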

Train Models

  • λͺ¨λΈ ν•™μŠ΅μ„ μ›ν•˜μ‹œλ©΄ KoSentenceBERT 디렉토리 μ•ˆμ— KorNLUDatasets이 μ‘΄μž¬ν•˜μ—¬μ•Ό ν•©λ‹ˆλ‹€.
  • STSλ₯Ό ν•™μŠ΅ μ‹œ λͺ¨λΈ ꡬ쑰에 맞게 데이터λ₯Ό μˆ˜μ •ν•˜μ˜€μœΌλ©°, 데이터와 ν•™μŠ΅ 방법은 μ•„λž˜μ™€ κ°™μŠ΅λ‹ˆλ‹€ :

    KoSentenceBERT/KorNLUDatasets/KorSTS/tune_test.tsv

    A subset of the STS test dataset
python training_nli.py      # Train on NLI data only
python training_sts.py      # Train on STS data only
python con_training_sts.py  # Train on NLI data, then fine-tune on STS data
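
For reference, STS fine-tuning with the sentence-transformers API typically follows the recipe below. This is a sketch of the standard training loop, not the exact contents of con_training_sts.py; the paths, example pair, and hyperparameters are illustrative.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the NLI-trained model and continue on STS sentence pairs.
model = SentenceTransformer('./output/training_nli')

# KorSTS gold scores are on a 0-5 scale; CosineSimilarityLoss expects 0-1.
train_examples = [
    InputExample(texts=['한 남자가 음식을 먹는다.', '한 남자가 빵 한 조각을 먹는다.'],
                 label=3.8 / 5.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=4,
          output_path='./output/training_nli_sts')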

Pre-Trained Models

The pooling mode is the MEAN strategy; during training, models are saved to the output directory.
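
For context, the MEAN strategy is configured when the SentenceTransformer is assembled from a transformer module and a pooling module. A sketch using the standard sentence-transformers API (the module path is illustrative):

from sentence_transformers import SentenceTransformer, models

# Wrap the trained transformer, then average its token embeddings (MEAN strategy).
word_embedding_model = models.Transformer('./output/training_sts/0_Transformer')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])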

Directory          Training method
training_nli       Train on NLI only
training_sts       Train on STS only
training_nli_sts   Train on NLI + STS

ν•™μŠ΅λœ pt νŒŒμΌμ€ λ‹€μŒ λ“œλΌμ΄λΈŒμ— μžˆμŠ΅λ‹ˆλ‹€.
https://drive.google.com/drive/folders/1fLYRi7W6J3rxt-KdGALBXMUS2W4Re7II?usp=sharing

각 폴더에 μžˆλŠ” resultνŒŒμΌμ„ output 디렉토리에 λ„£μœΌμ‹œλ©΄ λ©λ‹ˆλ‹€.
ex) sts ν•™μŠ΅ 파일 μ‚¬μš©μ‹œ μœ„ λ“œλΌμ΄λΈŒμ—μ„œ sts/result.pt νŒŒμΌμ„ output/training_sts/0_Transformer에 λ„£μœΌμ‹œλ©΄ λ©λ‹ˆλ‹€.
output/training_sts/0_Transformer/result.pt
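
Once result.pt is in place, a quick smoke test (a sketch; the sample sentence is arbitrary) confirms the weights load:

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('./output/training_sts')
embedding = embedder.encode('안녕하세요.')
print(embedding.shape)  # one fixed-size sentence vector, e.g. (768,)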

Performance

Seed κ³ μ •, test set

Model     | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman
NLI       | 65.05 | 68.48 | 68.81 | 68.18 | 68.90 | 68.20 | 65.22 | 66.81
STS       | 80.42 | 79.64 | 77.93 | 77.43 | 77.92 | 77.44 | 76.56 | 75.83
STS + NLI | 78.81 | 78.47 | 77.68 | 77.78 | 77.71 | 77.83 | 75.75 | 75.22
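
Numbers like these are what sentence-transformers' EmbeddingSimilarityEvaluator reports: Pearson and Spearman correlations between gold scores and cosine, Euclidean, Manhattan, and dot-product similarities. A sketch of how to reproduce the evaluation (the tune_test.tsv column names are assumptions):

import csv
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

sents1, sents2, scores = [], [], []
with open('KorNLUDatasets/KorSTS/tune_test.tsv', encoding='utf-8') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        sents1.append(row['sentence1'])            # column names are assumptions
        sents2.append(row['sentence2'])
        scores.append(float(row['score']) / 5.0)   # normalize 0-5 gold scores to 0-1

model = SentenceTransformer('./output/training_sts')
evaluator = EmbeddingSimilarityEvaluator(sents1, sents2, scores)
evaluator(model)  # logs the full Pearson/Spearman table, returns the main score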

Application Examples

  • Below are a few examples of how the generated sentence embeddings can be used in downstream applications.
  • All examples use the STS pre-trained model.

Semantic Search

SemanticSearch.py finds the sentences in a corpus that are most similar to a given query.
First, we compute embeddings for every sentence in the corpus.

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = './output/training_sts'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['ν•œ λ‚¨μžκ°€ μŒμ‹μ„ λ¨ΉλŠ”λ‹€.',
          'ν•œ λ‚¨μžκ°€ λΉ΅ ν•œ 쑰각을 λ¨ΉλŠ”λ‹€.',
          'κ·Έ μ—¬μžκ°€ 아이λ₯Ό λŒλ³Έλ‹€.',
          'ν•œ λ‚¨μžκ°€ 말을 탄닀.',
          'ν•œ μ—¬μžκ°€ λ°”μ΄μ˜¬λ¦°μ„ μ—°μ£Όν•œλ‹€.',
          '두 λ‚¨μžκ°€ 수레λ₯Ό 숲 μ†μœΌλ‘œ λ°€μ—ˆλ‹€.',
          'ν•œ λ‚¨μžκ°€ λ‹΄μœΌλ‘œ 싸인 λ•…μ—μ„œ 백마λ₯Ό 타고 μžˆλ‹€.',
          'μ›μˆ­μ΄ ν•œ λ§ˆλ¦¬κ°€ λ“œλŸΌμ„ μ—°μ£Όν•œλ‹€.',
          'μΉ˜νƒ€ ν•œ λ§ˆλ¦¬κ°€ 먹이 λ’€μ—μ„œ 달리고 μžˆλ‹€.']

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['ν•œ λ‚¨μžκ°€ νŒŒμŠ€νƒ€λ₯Ό λ¨ΉλŠ”λ‹€.',
           '고릴라 μ˜μƒμ„ μž…μ€ λˆ„κ΅°κ°€κ°€ λ“œλŸΌμ„ μ—°μ£Όν•˜κ³  μžˆλ‹€.',
           'μΉ˜νƒ€κ°€ λ“€νŒμ„ κ°€λ‘œ 질러 먹이λ₯Ό μ«“λŠ”λ‹€.']

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    # We use np.argpartition to only partially sort the top_k results
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx in top_results[0:top_k]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))
        


κ²°κ³ΌλŠ” λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€ :

======================


Query: ν•œ λ‚¨μžκ°€ νŒŒμŠ€νƒ€λ₯Ό λ¨ΉλŠ”λ‹€.

Top 5 most similar sentences in corpus:
ν•œ λ‚¨μžκ°€ μŒμ‹μ„ λ¨ΉλŠ”λ‹€. (Score: 0.6800)
ν•œ λ‚¨μžκ°€ λΉ΅ ν•œ 쑰각을 λ¨ΉλŠ”λ‹€. (Score: 0.6735)
ν•œ λ‚¨μžκ°€ 말을 탄닀. (Score: 0.1256)
두 λ‚¨μžκ°€ 수레λ₯Ό 숲 μ†¦μœΌλ‘œ λ°€μ—ˆλ‹€. (Score: 0.1077)
ν•œ λ‚¨μžκ°€ λ‹΄μœΌλ‘œ 싸인 λ•…μ—μ„œ 백마λ₯Ό 타고 μžˆλ‹€. (Score: 0.0968)


======================


Query: 고릴라 μ˜μƒμ„ μž…μ€ λˆ„κ΅°κ°€κ°€ λ“œλŸΌμ„ μ—°μ£Όν•˜κ³  μžˆλ‹€.

Top 5 most similar sentences in corpus:
μ›μˆ­μ΄ ν•œ λ§ˆλ¦¬κ°€ λ“œλŸΌμ„ μ—°μ£Όν•œλ‹€. (Score: 0.6832)
ν•œ μ—¬μžκ°€ λ°”μ΄μ˜¬λ¦°μ„ μ—°μ£Όν•œλ‹€. (Score: 0.2885)
μΉ˜νƒ€ ν•œ λ§ˆλ¦¬κ°€ 먹이 λ’€μ—μ„œ 달리고 μžˆλ‹€. (Score: 0.2278)
κ·Έ μ—¬μžκ°€ 아이λ₯Ό λŒλ³Έλ‹€. (Score: 0.2018)
ν•œ λ‚¨μžκ°€ 말을 탄닀. (Score: 0.1397)


======================


Query: μΉ˜νƒ€κ°€ λ“€νŒμ„ κ°€λ‘œ 질러 먹이λ₯Ό μ«“λŠ”λ‹€.

Top 5 most similar sentences in corpus:
μΉ˜νƒ€ ν•œ λ§ˆλ¦¬κ°€ 먹이 λ’€μ—μ„œ 달리고 μžˆλ‹€. (Score: 0.8141)
두 λ‚¨μžκ°€ 수레λ₯Ό 숲 μ†¦μœΌλ‘œ λ°€μ—ˆλ‹€. (Score: 0.3707)
μ›μˆ­μ΄ ν•œ λ§ˆλ¦¬κ°€ λ“œλŸΌμ„ μ—°μ£Όν•œλ‹€. (Score: 0.1842)
ν•œ λ‚¨μžκ°€ 말을 탄닀. (Score: 0.1716)
ν•œ λ‚¨μžκ°€ λ‹΄μœΌλ‘œ 싸인 λ•…μ—μ„œ 백마λ₯Ό 타고 μžˆλ‹€. (Score: 0.1519)

Clustering

Clustering.py shows an example of clustering similar sentences based on the similarity of their sentence embeddings.
As before, we first compute an embedding for each sentence.

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = './output/training_sts'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['ν•œ λ‚¨μžκ°€ μŒμ‹μ„ λ¨ΉλŠ”λ‹€.',
          'ν•œ λ‚¨μžκ°€ λΉ΅ ν•œ 쑰각을 λ¨ΉλŠ”λ‹€.',
          'κ·Έ μ—¬μžκ°€ 아이λ₯Ό λŒλ³Έλ‹€.',
          'ν•œ λ‚¨μžκ°€ 말을 탄닀.',
          'ν•œ μ—¬μžκ°€ λ°”μ΄μ˜¬λ¦°μ„ μ—°μ£Όν•œλ‹€.',
          '두 λ‚¨μžκ°€ 수레λ₯Ό 숲 μ†μœΌλ‘œ λ°€μ—ˆλ‹€.',
          'ν•œ λ‚¨μžκ°€ λ‹΄μœΌλ‘œ 싸인 λ•…μ—μ„œ 백마λ₯Ό 타고 μžˆλ‹€.',
          'μ›μˆ­μ΄ ν•œ λ§ˆλ¦¬κ°€ λ“œλŸΌμ„ μ—°μ£Όν•œλ‹€.',
          'μΉ˜νƒ€ ν•œ λ§ˆλ¦¬κ°€ 먹이 λ’€μ—μ„œ 달리고 μžˆλ‹€.',
          'ν•œ λ‚¨μžκ°€ νŒŒμŠ€νƒ€λ₯Ό λ¨ΉλŠ”λ‹€.',
          '고릴라 μ˜μƒμ„ μž…μ€ λˆ„κ΅°κ°€κ°€ λ“œλŸΌμ„ μ—°μ£Όν•˜κ³  μžˆλ‹€.',
          'μΉ˜νƒ€κ°€ λ“€νŒμ„ κ°€λ‘œ 질러 먹이λ₯Ό μ«“λŠ”λ‹€.']

corpus_embeddings = embedder.encode(corpus)

# Then, we perform k-means clustering using sklearn:
from sklearn.cluster import KMeans

num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")

κ²°κ³ΌλŠ” λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€ :

Cluster  1
['κ·Έ μ—¬μžκ°€ 아이λ₯Ό λŒλ³Έλ‹€.', 'μ›μˆ­μ΄ ν•œ λ§ˆλ¦¬κ°€ λ“œλŸΌμ„ μ—°μ£Όν•œλ‹€.', '고릴라 μ˜μƒμ„ μž…μ€ λˆ„κ΅°κ°€κ°€ λ“œλŸΌμ„ μ—°μ£Όν•˜κ³  μžˆλ‹€.']

Cluster  2
['ν•œ λ‚¨μžκ°€ μŒμ‹μ„ λ¨ΉλŠ”λ‹€.', 'ν•œ λ‚¨μžκ°€ λΉ΅ ν•œ 쑰각을 λ¨ΉλŠ”λ‹€.', 'ν•œ λ‚¨μžκ°€ νŒŒμŠ€νƒ€λ₯Ό λ¨ΉλŠ”λ‹€.']

Cluster  3
['μΉ˜νƒ€ ν•œ λ§ˆλ¦¬κ°€ 먹이 λ’€μ—μ„œ 달리고 μžˆλ‹€.', 'μΉ˜νƒ€κ°€ λ“€νŒμ„ κ°€λ‘œ 질러 먹이λ₯Ό μ«“λŠ”λ‹€.']

Cluster  4
['ν•œ λ‚¨μžκ°€ 말을 탄닀.', '두 λ‚¨μžκ°€ 수레λ₯Ό 숲 μ†¦μœΌλ‘œ λ°€μ—ˆλ‹€.', 'ν•œ λ‚¨μžκ°€ λ‹΄μœΌλ‘œ 싸인 λ•…μ—μ„œ 백마λ₯Ό 타고 μžˆλ‹€.']

Cluster  5
['ν•œ μ—¬μžκ°€ λ°”μ΄μ˜¬λ¦°μ„ μ—°μ£Όν•œλ‹€.']

Citing

KorNLU Datasets

@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}

Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}

@article{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    journal= "arXiv preprint arXiv:2004.09813",
    month = "04",
    year = "2020",
    url = "http://arxiv.org/abs/2004.09813",
}
