Data analysis of two datasets, the Databricks Online Retail dataset and the UCI Adult dataset, using Apache Spark

pallavitilloo/Data-Analytics-for-Big-Data

Data-Analytics-for-Big-Data

Data analytics in PySpark (Apache Spark) on the UCI Adult dataset.

  • Data cleaning
  • Feature engineering
    • Distill and transform the features into vectors.
    • Use a one-hot encoder to process categorical features.
  • Build a logistic regression model and a gradient-boosted tree model to fit the dataset.
  • Tune and evaluate both models.
  • Make predictions on the testing set and display the areaUnderROC for each model.
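
To make two of the steps above concrete, here is a minimal plain-Python sketch of one-hot encoding and of what areaUnderROC measures. It is an illustration only, not the repository's Spark code: the helper names `one_hot` and `area_under_roc`, the category names, and the toy labels and scores are all assumptions made for this example. The AUC is computed from its pairwise definition (the fraction of positive/negative pairs in which the positive example gets the higher score, with ties counted as half), which may differ slightly from how a Spark evaluator approximates the ROC curve.

```python
def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`.

    `categories` is the ordered list of known category names (assumed
    to be collected from the training data beforehand).
    """
    return [1 if value == c else 0 for c in categories]


def area_under_roc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    `labels` holds 0/1 ground truth, `scores` the predicted probability
    of class 1. Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Illustrative category names for a categorical column.
    categories = ["Private", "State-gov", "Self-emp-inc"]
    print(one_hot("State-gov", categories))

    # Toy test-set labels and model scores (made up for this sketch).
    labels = [1, 1, 0, 0]
    scores = [0.8, 0.4, 0.6, 0.2]
    print(area_under_roc(labels, scores))
```

An areaUnderROC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why it is a convenient single number for comparing the two models on the same testing set.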

Data analytics in PySpark (Apache Spark) on the Databricks Online Retail dataset.

  • Measuring the number of items per invoice
  • Computing total spending per customer
  • Counting the number of units sold for each item
  • Checking whether a returning customer spends less or more than on their previous purchase
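
The four analyses above can be sketched in plain Python on a handful of toy rows. This is an illustration of the logic only, not the repository's Spark code: the row layout, the `invoice_seq` ordering field, the function names, and all values are assumptions made for this example, and "items per invoice" is interpreted here as the number of line items on an invoice. In PySpark the same ideas would typically be expressed with `groupBy` aggregations and a window function such as `lag`.

```python
from collections import defaultdict

# Toy rows (made up): (invoice_no, invoice_seq, customer_id, stock_code, quantity, unit_price)
# `invoice_seq` is a hypothetical stand-in for the invoice date ordering.
ROWS = [
    ("A1", 1, "C1", "S1", 2, 5.0),
    ("A1", 1, "C1", "S2", 1, 10.0),
    ("A2", 2, "C2", "S1", 4, 5.0),
    ("A3", 3, "C1", "S3", 3, 2.0),
    ("A4", 4, "C2", "S2", 3, 10.0),
]


def summarize(rows):
    """Aggregate line items per invoice, spend per customer, units per item."""
    items_per_invoice = defaultdict(int)
    spend_per_customer = defaultdict(float)
    units_per_item = defaultdict(int)
    invoice_totals = {}  # invoice_no -> [invoice_seq, customer_id, total]
    for inv, seq, cust, code, qty, price in rows:
        amount = qty * price
        items_per_invoice[inv] += 1
        spend_per_customer[cust] += amount
        units_per_item[code] += qty
        entry = invoice_totals.setdefault(inv, [seq, cust, 0.0])
        entry[2] += amount
    return (dict(items_per_invoice), dict(spend_per_customer),
            dict(units_per_item), invoice_totals)


def compare_with_previous(invoice_totals):
    """Label each repeat invoice as 'more', 'less' or 'same' versus the
    same customer's previous invoice total."""
    by_customer = defaultdict(list)
    for inv, (seq, cust, total) in invoice_totals.items():
        by_customer[cust].append((seq, inv, total))
    result = {}
    for invoices in by_customer.values():
        invoices.sort()  # order each customer's invoices by invoice_seq
        for (_, _, prev), (_, inv, cur) in zip(invoices, invoices[1:]):
            if cur > prev:
                result[inv] = "more"
            elif cur < prev:
                result[inv] = "less"
            else:
                result[inv] = "same"
    return result


if __name__ == "__main__":
    items, spend, units, totals = summarize(ROWS)
    print(items)
    print(spend)
    print(units)
    print(compare_with_previous(totals))
```

A customer's first invoice has no previous purchase to compare against, so it gets no label; only repeat invoices appear in the comparison result.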
