nasseredine/udacity-dend
Udacity Data Engineering Nanodegree projects

This repository contains the links to the projects I submitted for the Udacity Data Engineering Nanodegree Program.

Check the program's syllabus for more details.

Project 1a: Data Modeling with Postgres (Relational Database)


Model user activity data to create a relational database and ETL pipeline in PostgreSQL for a music streaming app.

  • Installed PostgreSQL and configured a new user and database.
  • Designed a star schema optimized for song play analysis queries, with a fact table and dimension tables that use constraints to enforce data quality.
  • Built an ETL pipeline that extracts data from .json files, transforms incorrect values, and inserts the records into the tables using the psycopg2 Python package.
  • Ran tests to verify the database and table creation.
  • Created example queries with expected results.
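The insert step of such a pipeline can be sketched as below. The table name, column names, and record fields are illustrative assumptions, not the repository's actual schema:

```python
# Sketch of the transform-and-insert step of the ETL pipeline.
# Table and column names here are hypothetical examples.
SONGPLAY_INSERT = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
"""

def transform_record(record: dict) -> tuple:
    """Pick and clean the fields needed for one songplay row."""
    return (
        record["ts"],
        int(record["userId"]),        # cast the string id to an integer
        record["level"],
        record.get("song_id"),        # None when no matching song was found
        record.get("artist_id"),
        record["sessionId"],
        record["location"],
        record["userAgent"],
    )

def load(conn, records):
    """Insert transformed rows through a psycopg2 connection."""
    with conn.cursor() as cur:
        for record in records:
            cur.execute(SONGPLAY_INSERT, transform_record(record))
    conn.commit()

if __name__ == "__main__":
    import json
    import psycopg2
    conn = psycopg2.connect("dbname=sparkifydb user=student password=student")
    with open("data/log_data/sample.json") as f:
        load(conn, [json.loads(line) for line in f])
```

Keeping the transformation in a pure function makes it testable without a live database connection.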

Project 1b: Data Modeling with Apache Cassandra (NoSQL Database)

Model event data to create a non-relational database and ETL pipeline in Apache Cassandra for a music streaming app.

  • Installed Apache Cassandra.
  • Designed denormalized tables modeled around the queries they serve (one table per query).
  • Built an ETL pipeline that extracts data from .csv files, transforms incorrect values, and inserts the records into the tables using the cassandra Python package.
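A minimal sketch of the one-table-per-query approach, assuming a hypothetical session-based lookup (the table and column names are illustrative):

```python
# Sketch: in Cassandra the table is designed around the query, so the
# partition key (session_id) matches the WHERE clause it must serve.
CREATE_SESSION_TABLE = """
    CREATE TABLE IF NOT EXISTS songs_by_session (
        session_id int,
        item_in_session int,
        artist text,
        song text,
        length float,
        PRIMARY KEY (session_id, item_in_session)
    );
"""

def insert_cql(table: str, columns: tuple) -> str:
    """Build a parameterised INSERT statement for the cassandra driver."""
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"

if __name__ == "__main__":
    from cassandra.cluster import Cluster
    session = Cluster(["127.0.0.1"]).connect("sparkify")
    session.execute(CREATE_SESSION_TABLE)
    stmt = insert_cql(
        "songs_by_session",
        ("session_id", "item_in_session", "artist", "song", "length"),
    )
    session.execute(stmt, (338, 4, "Faithless", "Music Matters", 495.3))
```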

Project 2: Data Warehouse (Amazon Redshift)

Build a data warehouse and an ETL pipeline that extracts data from Amazon S3, stages it in Amazon Redshift, and transforms it into a set of dimensional tables for the analytics team.

  • Created an IAM role granting Redshift read-only access to S3.
  • Created a security group to allow access to Redshift from a specific IP address.
  • Programmatically created a Redshift cluster and attached the previous policies using the boto3 Python package.
  • Copied data from S3 into staging tables in Redshift.
  • Transformed the data with SQL (PostgreSQL dialect) to create a set of dimensional analytics tables in Redshift.
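Creating the cluster programmatically with boto3 might look like the sketch below; the cluster identifier, node sizing, and credentials are illustrative assumptions:

```python
# Sketch of programmatic Redshift cluster creation with boto3.
# All identifiers and sizes here are example values.
def cluster_config(role_arn: str, password: str) -> dict:
    """Keyword arguments for the boto3 redshift.create_cluster call."""
    return {
        "ClusterType": "multi-node",
        "NodeType": "dc2.large",
        "NumberOfNodes": 4,
        "DBName": "sparkify",
        "ClusterIdentifier": "sparkify-cluster",
        "MasterUsername": "awsuser",
        "MasterUserPassword": password,
        "IamRoles": [role_arn],  # the role granting read-only access to S3
    }

if __name__ == "__main__":
    import boto3
    redshift = boto3.client("redshift", region_name="us-west-2")
    redshift.create_cluster(**cluster_config(
        role_arn="arn:aws:iam::123456789012:role/redshift-s3-read",
        password="<master-password>",
    ))
```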


Project 3: Data Lake (Apache Spark)

Build a data lake and an ETL pipeline in Apache Spark that loads data from S3, processes it into analytics tables, and writes them back to S3.
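The core of such a Spark job can be sketched as follows; the S3 paths, column names, and app name are illustrative assumptions:

```python
# Sketch of one step of the data lake ETL: raw JSON in, Parquet table out.
# Paths and column names are hypothetical examples.
SONG_COLUMNS = ["song_id", "title", "artist_id", "year", "duration"]

def table_path(output_base: str, table: str) -> str:
    """S3 prefix where a given analytics table is written."""
    return f"{output_base.rstrip('/')}/{table}/"

def process_song_data(spark, input_path: str, output_base: str):
    """Read raw song JSON, build the songs table, write Parquet back to S3."""
    df = spark.read.json(input_path)
    songs = df.select(*SONG_COLUMNS).dropDuplicates(["song_id"])
    (songs.write
          .partitionBy("year", "artist_id")   # partition for efficient reads
          .mode("overwrite")
          .parquet(table_path(output_base, "songs")))

if __name__ == "__main__":
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("data-lake-etl").getOrCreate()
    process_song_data(spark,
                      "s3a://udacity-dend/song_data/*/*/*/*.json",
                      "s3a://my-output-bucket/analytics")
```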


Project 4: Data Pipelines (Apache Airflow)

Improve the company's data infrastructure by creating and automating a set of data pipelines with Apache Airflow, and by monitoring and debugging the production pipelines.
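A skeleton of such a pipeline is sketched below; the DAG name, task names, schedule, and retry settings are all assumptions, with placeholder operators standing in for the real staging, loading, and quality-check tasks:

```python
# Sketch of an Airflow pipeline skeleton. Task names, schedule, and
# defaults are hypothetical; EmptyOperator stands in for real operators.
from datetime import datetime, timedelta

def default_args() -> dict:
    """Retry and dependency defaults applied to every task in the DAG."""
    return {
        "owner": "sparkify",
        "start_date": datetime(2019, 1, 12),
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
        "email_on_retry": False,
        "depends_on_past": False,
    }

def build_dag():
    """Wire the pipeline: stage from S3, load tables, run quality checks."""
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    dag = DAG("sparkify_etl",
              default_args=default_args(),
              schedule_interval="@hourly",
              catchup=False)
    with dag:
        start = EmptyOperator(task_id="begin_execution")
        stage = EmptyOperator(task_id="stage_events_and_songs")
        load = EmptyOperator(task_id="load_fact_and_dimensions")
        check = EmptyOperator(task_id="run_data_quality_checks")
        end = EmptyOperator(task_id="stop_execution")
        start >> stage >> load >> check >> end
    return dag
```

Keeping the defaults in a plain function makes the retry policy inspectable without importing Airflow itself.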
