
# IMDb Movie Review Rating Prediction

This is an application for building models that predict the ratings of IMDb movie reviews. It works by extracting features from the IMDb Largest Review Dataset. The feature-extractor jobs are implemented in MapReduce fashion and executed on a Hadoop cluster; TF-IDF, n-gram count, and exclamation/question mark count features are available. After feature extraction, a Random Forest classifier is trained on the extracted features. You can then predict the rating of an unseen review from the application and see the top 5 most similar reviews. Similar reviews are found by running a Hadoop job that computes cosine similarity scores between the given review and every review in the dataset.
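As a rough illustration of the MapReduce approach, a feature-extractor job written with mrjob might look like the sketch below. This is not the repository's actual implementation; the job name and the CSV column layout are assumptions. It computes one of the feature types listed above, the exclamation/question mark counts per review.

```python
# Minimal sketch of an mrjob feature extractor.
# Assumed CSV layout: review_id,rating,text (no quoted commas in the
# first two fields); the real preprocessed files may differ.
from mrjob.job import MRJob

class MRPunctuationCount(MRJob):  # hypothetical job name
    def mapper(self, _, line):
        review_id, _rating, text = line.split(',', 2)
        yield review_id, (text.count('!'), text.count('?'))

    def reducer(self, review_id, counts):
        # Each review arrives on a single line, so this just sums the pairs.
        excl, quest = 0, 0
        for e, q in counts:
            excl += e
            quest += q
        yield review_id, [excl, quest]

if __name__ == '__main__':
    MRPunctuationCount.run()
```

The similarity job described above would follow the same mapper/reducer shape, emitting a cosine similarity score per review instead of punctuation counts.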

## Technical Details

The jobs are implemented in Python using the mrjob library. The website is built with Flask and JavaScript (Vue.js); from its landing page you can train models, make predictions, and find similar reviews.
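To show how the pieces could fit together, here is a minimal sketch of a Flask prediction endpoint. The route, request fields, and on-disk layout are assumptions rather than the repository's actual API, and it assumes the pickled object is a pipeline that accepts raw review text.

```python
# Hypothetical Flask prediction endpoint; names and paths are assumptions.
import os
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    payload = request.get_json()
    # Assumed layout: models/<model_name>/model.pkl (see the next section).
    model_path = os.path.join('models', payload['model_name'], 'model.pkl')
    with open(model_path, 'rb') as f:
        model = pickle.load(f)
    rating = model.predict([payload['review']])[0]
    return jsonify({'rating': int(rating)})

if __name__ == '__main__':
    app.run()
```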

All information related to a model, such as the selected feature types, the dataset, the model's pickle file, and the training data, is stored in a separate folder on the file system. Users can name the models they build; at prediction time, that model name is used to select the corresponding model.
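A sketch of how such per-model persistence could work is shown below; the models/ directory and file names are assumptions for illustration, not the repository's exact scheme.

```python
# Hypothetical persistence helper: stores a trained model plus its
# metadata under models/<name>/.
import json
import os
import pickle

def save_model(name, model, feature_types, train_data_path):
    model_dir = os.path.join('models', name)
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, 'model.pkl'), 'wb') as f:
        pickle.dump(model, f)
    with open(os.path.join(model_dir, 'meta.json'), 'w') as f:
        json.dump({'features': feature_types,
                   'train_data': train_data_path}, f)
```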

## How to Run the Application

First of all, you need a Hadoop cluster up and running on your machine. Check out this article for setting up a Hadoop cluster.

The application was developed on Ubuntu 20. The steps below assume that your default pip and Python binaries are named pip3 and python3.

1. Install the necessary libraries:

```bash
pip3 install -r requirements.txt
```

2. Download the dataset. The sample.json and part1.json files are used in the application and need to be preprocessed.

3. Preprocess both files using the following commands:

```bash
python3 processing.py --input /path/to/sample.json --output /output/path/to/preprocessed_sample.csv
python3 processing.py --input /path/to/part1.json --output /output/path/to/preprocessed_part1.csv
```

4. Make a directory in HDFS named /input and upload the processed datasets to it:

```bash
hadoop fs -mkdir /input
hadoop fs -put /path/to/preprocessed_sample.csv /input/preprocessed_sample.csv
hadoop fs -put /path/to/preprocessed_part1.csv /input/preprocessed_part1.csv
```

5. Run the application by executing the run.sh file:

```bash
./run.sh
```

## Some Screenshots