Amir79Naziri/TwitterSentimentAnalysisWithSpark_Project

Twitter Sentiment Analysis Using Spark

Table of Contents
  1. About The Project
  2. Usage
  3. Phases
  4. Contact

About The Project

This project implements a sentiment analyzer with the Spark ML library. It uses tweets and retweets from a Twitter dataset for training and testing the model. The project is built up in several phases, each completing one part of the implementation; the Phases section below describes them step by step.


Usage

Requirements

The project uses Spark (Spark SQL) for data exploration, model training, and streaming, so Spark must be installed to run the notebooks. Moreover, the project is implemented entirely in Python, so PySpark must be installed for Spark to run the code.

$ pip install pyspark

The project also needs Spark SQL for its built-in DataFrame objects, which are used for querying data and training the model. Note that RDDs are not used anywhere in this project.

$ pip install pyspark[sql]

(Optional) Finally, to set the environment variables that let Jupyter run Spark code, install findspark. (The environment variables can also be set manually without findspark.)

$ pip install findspark

Run

Run all cells in order. Before running the stream_sentiment_analysis.ipynb notebook, please run the simulate_streaming.py script.


Phases


Introduce Data

The data was gathered by Sentiment140 from the Twitter API as a CSV file (1,600,000 lines) with emoticons removed. The data file format has 6 fields:

  • The polarity of the tweet (0 = negative, 4 = positive)
  • The id of the tweet (2087)
  • The date of the tweet (Sat May 16 23:58:44 UTC 2009)
  • The query (lyx). If there is no query, then this value is NO_QUERY.
  • The user that tweeted (robotickilldozr)
  • The text of the tweet (Lyx is cool)

For more details and downloading data, please visit here.


Preprocess Data

For this phase, the data_exploration.ipynb notebook is implemented and explained in detail. In addition, around 300,000 lines of the raw dataset were randomly sampled and set aside for the next phase (streaming simulation).

Simulate Streaming

In this phase, the simulate_streaming.py script simulates real-time data generation every second: each second, a new CSV file is created, representing a newly fetched tweet. Obviously, this interval is very unrealistic, so feel free to decrease it as much as possible to bring the simulation closer to real time.
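The simulation boils down to writing one small CSV file per interval into a watched directory. A self-contained sketch, not the actual script (the directory name, file naming, and sample rows are assumptions):

```python
import csv
import time
from pathlib import Path

def simulate_stream(rows, out_dir="stream_input", interval=1.0):
    """Write each row as its own CSV file, one file per interval,
    mimicking tweets arriving over time."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i, row in enumerate(rows):
        with open(out / f"tweet_{i:06d}.csv", "w", newline="") as f:
            csv.writer(f).writerow(row)
        time.sleep(interval)  # the "fetch" delay; lower it for a denser stream

# Demo with two fake tweets and no delay.
simulate_stream([(0, "sample negative tweet"), (4, "sample positive tweet")],
                interval=0.0)
```

A streaming Spark job can then treat `stream_input/` as a file source and pick up each new file as it appears.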

Introduce and Train Model

In this phase, Logistic Regression is used to classify tweets, and the Spark ML library is used for all machine learning algorithms. For more details about the implementation, check the classifier_model.ipynb file.

Using the Model in a Real-Time Scenario

In this phase, the stream_sentiment_analysis.ipynb notebook simulates real-time data classification every five seconds. Obviously, this interval is very unrealistic, so feel free to decrease it as much as possible to bring it closer to real time.

⚠️ Warning: before running the stream_sentiment_analysis.ipynb file, please run the simulate_streaming.py script.


Further Work

This is a small project intended for practicing Spark. In the real world, data should be on the order of gigabytes before Spark pays off. Also, Logistic Regression may not be the best choice; feel free to try other models such as Naive Bayes, SVMs, or even neural networks.

Also, all configurations are set to run the code locally, so to run it on a cluster, one should change the configuration and possibly the partitioning strategy.


Contact

Amirreza Naziri
Email: [email protected]

