Skip to content

Exploratory data and sentiment analysis on Reddit Comments

Notifications You must be signed in to change notification settings

sfox1975/Reddit_Comments

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

Reddit_Comments

Exploratory data and sentiment analysis on Reddit Comments

Data Set Overview

  • Reddit.com = 4th busiest website in the USA
  • Over 1TB of Reddit comments (2007 - 2015) are available online
  • Over 50GB of new comment data is generated each month
  • Data could provide interesting insights into how people interact online

Personal Motivation

  • Limited experience working with text based data sets = great learning experience
  • Exploratory analysis on 1 month's worth of comments: September 2009 (1.2GB)

Usage

Monthly comment data (2007 - 2015) can be downloaded from the following site: https://archive.org/download/2015_reddit_comments_corpus/reddit_data/

To run the notebook, type jupyter notebook Reddit_Comments.ipynb from command line

Organization

The notebook contains the following sections:

  • Exploratory analysis - overview of the data and some feature engineering
  • Sentiment analysis - sentiment scoring on comment body, using VADER
  • Features important to score - analysis of which features appear to drive the commment score
  • Map Individual Commenters to Score / Sentiment Space

Key Findings

  • Small subset of dedicated commenters
  • Yet, top commenters are not rewarded with higher scores
  • Most comments on reddit are short
  • Long-winded comments (>500 characters) = few high or low scores
  • Subreddit matters for score and sentiment
  • Individual commenters can be segmented in interesting ways

Some Possible Next Steps

  • Interactive graphs = really cool way to view reddit.com
  • Machine learning applications:
  • Score prediction, using above features and NLP (in progress)
  • Unsupervised learning to classify authors into categories (possible moderator application)
  • Sentiment analysis improvement (compared to VADER) using Reddit corpus

About

Exploratory data and sentiment analysis on Reddit Comments

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published