
Bilingual Sentiment Analysis (On Two Regional Languages)

Start here: Analysis_OG.ipynb

💭 Background

This project applies concepts and techniques from natural language processing and opinion mining. The goal is to build an artificial intelligence system that classifies Hindi and Marathi text code-mixed with English according to its overall polarity (i.e. positive, negative, or neutral).

Sentiment vs. Software

Sentiment analysis uses natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. It is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media content, and healthcare materials, with applications ranging from marketing to customer service to clinical medicine.

🔧 Progress

Mining and Collecting the data

The main goal is to gather as many comments as possible for this model. We collected comments related to social and political views from major social media websites such as Facebook and YouTube, drawing from many sources so that the data covers the full range of polarities. In total we collected about 5000 comments.

Tagging the data

The next step was to tag all the data according to its polarity (i.e. Positive, Negative, Neutral). The tagging scheme is as follows (a small labelling sketch is shown after the list):

  • Positive Comment : 3
  • Negative Comment : 1
  • Neutral Comment : 2
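
As a minimal sketch of this labelling step (the file and column names below are hypothetical, not taken from the repository), the mapping converts polarity tags into the numeric labels above:

```python
import pandas as pd

# Numeric labels from the tagging scheme above
POLARITY_LABELS = {"negative": 1, "neutral": 2, "positive": 3}

# Hypothetical file and column names, for illustration only
comments = pd.read_csv("comments_tagged.csv")   # columns: text, polarity
comments["label"] = comments["polarity"].str.lower().map(POLARITY_LABELS)
```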

Data Pre-Processing

Once all the data is tagged, we pre-process it before feeding it to the model. The goal of preprocessing text data is to take it from its raw, readable form to a format that the computer can work with more easily. Most text data, including the data used in this project, arrives as strings of text. Preprocessing is all the work that takes the raw input data and prepares it for insertion into a model.

While preprocessing for numerical data depends largely on the data itself, preprocessing of text data is a fairly straightforward process, although understanding each step and its purpose is less trivial. Our preprocessing method consists of two stages: preparation and vectorization. The preparation stage consists of steps that clean up the data and cut the fat (a sketch of this stage follows the list):

  • removing URLs
  • making all text lowercase
  • removing numbers
  • removing punctuation
  • tokenization
  • removing stopwords (words that typically add no meaning)
  • lemmatization
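
A minimal sketch of the preparation stage using NLTK is shown below. The English stopword list and WordNet lemmatizer are illustrative assumptions; Hindi and Marathi tokens in the code-mixed comments would largely pass through them unchanged, and the notebook may handle them differently.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def prepare(text):
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)                # 1. remove URLs
    text = text.lower()                                                # 2. lowercase
    text = re.sub(r"\d+", " ", text)                                   # 3. remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))   # 4. remove punctuation
    tokens = nltk.word_tokenize(text)                                  # 5. tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]                 # 6. remove stopwords
    return [lemmatizer.lemmatize(t) for t in tokens]                   # 7. lemmatization

print(prepare("Check https://example.com this movie was really good!! 10/10"))
```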

Splitting data 75-25 (train-test)

train_test_split returns four arrays: training data, test data, training labels, and test labels. By default, train_test_split splits the data into 75% training data and 25% test data, which we can think of as a good rule of thumb.

The test_size keyword argument specifies what proportion of the original data is used for the test set. Here we set test_size=0.3, which gives 70% training data and 30% test data.
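
A minimal sketch of the split using scikit-learn is shown below; the feature matrix and label vector are toy stand-ins, and the random seed is an assumption for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the vectorized comments and their numeric polarity labels
X = np.random.rand(100, 20)
y = np.random.choice([1, 2, 3], size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # 70% train / 30% test, as described above
    random_state=42,   # assumed seed for reproducibility
)
print(X_train.shape, X_test.shape)   # (70, 20) (30, 20)
```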

Hyperparameter Tuning

Hyperparameter tuning was applied to the various algorithms used in Analysis_OG.ipynb, such as linear regression and XGBoost.
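
A minimal sketch of how such tuning could look for the XGBoost model, using scikit-learn's GridSearchCV, is shown below; the parameter grid, fold count, and data are illustrative assumptions rather than the values actually used in Analysis_OG.ipynb.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Toy stand-ins for the training split; XGBoost expects labels encoded as 0..n_classes-1
X_train = np.random.rand(100, 20)
y_train = np.random.choice([0, 1, 2], size=100)

# Illustrative parameter grid; the actual search space may differ
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```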

Accuracy and other Values

Accuracy, precision, recall, and F-score for every algorithm used are given in Values. Overall accuracy is around 70%.
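
These metrics can be reproduced with scikit-learn's classification_report, as in the sketch below; the true and predicted labels are toy stand-ins for the test-set results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Toy stand-ins for the test-set labels and a model's predictions
y_test = np.random.choice([1, 2, 3], size=30)
y_pred = np.random.choice([1, 2, 3], size=30)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))   # per-class precision, recall, F-score
```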

💡 Work to be done

  • Contextual understanding and tone
  • Sentiment analysis at Brandwatch?
  • The caveats of sentiment analysis
  • Predictions for the future of sentiment analysis

❓ Open questions

📚 Resources

Sentiment Analysis-related publications
