
Sparkify

Table of Contents

  1. Description
  2. File Descriptions
  3. Getting Started
    1. Dependencies
    2. Installation

Description

Use case: Sparkify, an innovative startup, delivers music streaming services to users across the United States. Our users engage with the service daily, either through the free tier with intermittent advertisements or the premium subscription model, which offers ad-free music at a fixed monthly cost. Users have the flexibility to upgrade, downgrade, or cancel their service as desired, making it crucial for us to ensure their satisfaction. Each interaction a user has with our service, such as playing songs, logging out, liking tracks with a thumbs-up, listening to ads, or changing their subscription level, generates valuable data. This data holds key insights for maintaining user happiness and driving Sparkify's success. As members of the data team, our mission is to accurately predict which users are likely to churn, either by downgrading from the premium tier to the free tier or canceling their subscription altogether. By identifying these users in advance, Sparkify can proactively offer them discounts and incentives, potentially saving millions in revenue for the business.

As the volume of log data has outgrown the memory of a standard desktop computer, Sparkify has adopted Apache Spark™, a distributed data-processing engine. Udacity™ hosts the complete 12 GB dataset on AWS™ S3, and you can analyze it by running a Spark cluster on AWS or IBM™ Cloud.
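
For illustration, here is a minimal sketch of loading the event log into Spark; the S3 path is an assumption and may differ from the actual bucket key:

from pyspark.sql import SparkSession

# Start a Spark session; on a managed cluster one is usually provided.
spark = SparkSession.builder.appName("Sparkify").getOrCreate()

# Reading via s3a:// also requires the hadoop-aws package and AWS
# credentials; the path below is illustrative, not the exact key.
events = spark.read.json("s3a://udacity-dsnd/sparkify/sparkify_event_data.json")
events.printSchema()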

The aim of the project is to build a Natural Language Processing (NLP) model that predicts in real time whether a user will churn.

The project is divided into the following key sections:

  1. Data processing: build an ETL pipeline that extracts the data from its source, cleans it, and saves the result.
  2. Machine learning: build an ML pipeline that trains a model to predict whether a user will churn (a sketch of such a pipeline follows this list).
  3. Web app: run a web app that shows the model's predictions in real time.
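
As a rough illustration of section 2, the sketch below shows the kind of MLlib pipeline this implies, assuming each user's sequence of visited pages is treated as text (hence the NLP framing); the column names and the choice of classifier are assumptions, not the repository's exact code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF
from pyspark.ml.classification import LogisticRegression

# Turn each user's page-visit "sentence" into TF-IDF features, then classify.
tokenizer = Tokenizer(inputCol="page_sequence", outputCol="tokens")
tf = CountVectorizer(inputCol="tokens", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="features")
clf = LogisticRegression(featuresCol="features", labelCol="churn")

pipeline = Pipeline(stages=[tokenizer, tf, idf, clf])
# model = pipeline.fit(train_df)  # train_df has page_sequence and churn columns
# model.save("bestModel")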

File Descriptions

The file structure is as follows:

- README.md: this file.
- requirements.txt: the list of dependencies.
- web_app.py: Gradio script that runs the web app.
- bestModel.zip: the trained PySpark MLlib model, compressed.
- exemples.json: example inputs for the web app.
- Sparkify.ipynb: notebook containing the ETL process and the ML/NLP pipeline.

Getting Started

Dependencies

  • Python 3.6+
  • Big data and machine learning: PySpark
  • Web app: Gradio
  • Notebooks: IPython
  • AWS access: Boto3 (the AWS SDK for Python)

Installation

  1. Clone the git repository:

git clone https://github.com/eljandoubi/Sparkify.git

  2. Change into the project directory:

cd Sparkify

  3. Create a conda environment:

conda create -n "Sparkify" python=3.6
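
  4. Activate the environment so that the dependencies are installed into it:

conda activate Sparkify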

  5. Install the dependencies:

pip install -r requirements.txt

  6. In the project's directory, run the notebook to prepare the data and to train and save the model; you first need to set your AWS access key ID and secret access key.
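
One common way to supply the credentials (assuming the notebook reads them from the environment, as Boto3 does by default) is:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

Then run: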

ipython Sparkify.ipynb

  7. To run the web app, execute the following command:

python web_app.py
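
For orientation, here is a rough sketch of what such an app can look like; it assumes bestModel.zip unpacks to an MLlib PipelineModel and that the input is a text field, so the function and column names are illustrative rather than the repository's exact code:

import gradio as gr
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkifyApp").getOrCreate()
model = PipelineModel.load("bestModel")  # after unzipping bestModel.zip

def predict_churn(page_sequence: str) -> str:
    # Wrap the single input in a one-row DataFrame for the Spark pipeline.
    df = spark.createDataFrame([(page_sequence,)], ["page_sequence"])
    pred = model.transform(df).select("prediction").first()[0]
    return "likely to churn" if pred == 1.0 else "not likely to churn"

gr.Interface(fn=predict_churn, inputs="text", outputs="text").launch()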
