Tweet-Sentiment-Extraction

This repository contains my semester project for the course Intelligent Data Analysis & Machine Learning II, taken in the winter semester 2022/23 at the University of Potsdam. The project provides a solution to the Kaggle Tweet Sentiment Extraction competition.

Problem setting & dataset

Problem setting

For a given tweet, predict the word or phrase that best supports its sentiment label (positive, negative, or neutral). The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.).

Dataset

The dataset of the competition can be downloaded from the Kaggle website.

Needed files:
train.csv - the training set
test.csv - the test set
sample_submission.csv - a sample submission file in the correct format

Columns:
textID - unique ID for each piece of text
text - the text of the tweet
sentiment - the general sentiment of the tweet
selected_text - [train only] the text that supports the tweet's sentiment
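
As a minimal sketch of how these columns can be read, the snippet below parses a tiny in-memory CSV with the same header as train.csv. The two rows are made up for illustration and are not taken from the competition data, and the helper names `load_rows` and `span_is_in_text` are hypothetical. It also checks the property the task relies on: selected_text is a substring of text.

```python
import csv
import io

# Made-up rows for illustration only; they are NOT taken from the competition data.
SAMPLE_CSV = '''textID,text,sentiment,selected_text
a1,"I love this song, so good!",positive,love this song
b2,"Stuck in traffic again, awful",negative,awful
'''


def load_rows(handle):
    """Read a CSV file handle into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def span_is_in_text(row):
    """Return True if selected_text occurs verbatim inside text."""
    return row["selected_text"] in row["text"]


rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["textID"], row["sentiment"], span_is_in_text(row))
```

To read the real training file, the same helper could be called on `open("train.csv", newline="", encoding="utf-8")`, assuming the file sits in the working directory.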

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

Approaches

Modelling as a Question Answering (QA) problem

Given a question and a context, train a transformer model that is able to find the answer in the context.

We have:
Question: sentiment column (positive/negative/neutral)
Context: text column
Answer: selected_text column
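
The mapping above can be sketched as a conversion of one row into a SQuAD-style record, where the answer is stored together with its character offset in the context. This is an illustrative assumption about the format, not necessarily the exact code used in this repository; the function name `to_qa_example` and the sample tweet are hypothetical.

```python
def to_qa_example(text_id, text, sentiment, selected_text):
    """Convert one training row into a SQuAD-style question answering record.

    Assumption: the answer span occurs verbatim in the tweet, so its start
    offset can be located with str.find (first occurrence).
    """
    start = text.find(selected_text)
    if start == -1:
        raise ValueError(f"selected_text not found in text for id {text_id}")
    return {
        "id": text_id,
        "question": sentiment,   # positive / negative / neutral
        "context": text,         # the full tweet
        "answers": {"text": [selected_text], "answer_start": [start]},
    }


# Made-up tweet for illustration only.
example = to_qa_example(
    "a1", "I love this song, so good!", "positive", "love this song"
)
print(example["answers"])
```

Storing the character offset lets a tokenizer later map the answer onto start and end token positions, which is what extractive QA models are trained to predict.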

Models

deepset/roberta-base-squad2 (roberta-base model, fine-tuned using the SQuAD2.0 dataset)
distilbert-base-cased-distilled-squad (DistilBERT-base-cased, fine-tuned using (a second step of) knowledge distillation on SQuAD v1.1)

Evaluation metric

The metric in this competition is the word-level Jaccard score. The Jaccard score is defined as the size of the intersection divided by the size of the union of the sample sets:

$$Jaccard(U,V) = \frac{|U \cap V|}{|U \cup V|}$$
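
A minimal sketch of the word-level score, assuming that U and V are the sets of lowercased, whitespace-separated words of the predicted and the ground-truth selection. The handling of two empty strings is an assumption added here to avoid a division by zero.

```python
def jaccard(str1, str2):
    """Word-level Jaccard score between two strings."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        # Assumption: two empty selections count as a perfect match.
        return 1.0
    common = a & b
    # |U ∪ V| = |U| + |V| - |U ∩ V|
    return len(common) / (len(a) + len(b) - len(common))


# Intersection {this, song} has 2 words, union {love, this, song, rocks} has 4.
print(jaccard("love this song", "this song rocks"))  # 0.5
```

Because the score works on sets of words, repeated words and word order inside a selection do not affect it.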

Results

Fine-tuned roberta-base-squad2: 70.2 Jaccard score
Fine-tuned distilbert-base-cased-distilled-squad: 68.6 Jaccard score

