Movies-ETL

Extract, Transform, Load

Purpose

  1. Create an ETL pipeline from raw data to a SQL database.
  2. Extract data from disparate sources using Python.
  3. Clean and transform data using Pandas.
  4. Use regular expressions (Regex) to parse data and to transform text into numbers.
  5. Load data with PostgreSQL and verify in PgAdmin.

The project extracted a large dataset from Kaggle along with a Wikipedia JSON file, then transformed the data into a usable dataset for a "hack-a-thon" competition. Once the data was cleaned and narrowed in scope, the DataFrames were loaded into PostgreSQL.

Extracting

Wikipedia Movies JSON file, starting with 193 columns (Pic 1)

Kaggle movie metadata, 24 columns (Pic 2)

Kaggle ratings data, 2,602,489 rows by 4 columns (Pic 3)
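The extraction step can be sketched as below. The file names and tiny stand-in files are assumptions made so the example runs end to end; in the actual project the full Wikipedia JSON dump and Kaggle CSVs would be used.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical file names; tiny stand-in files are written first so
# the sketch is runnable without the real multi-gigabyte downloads.
tmp = tempfile.mkdtemp()
wiki_path = os.path.join(tmp, "wikipedia-movies.json")
meta_path = os.path.join(tmp, "movies_metadata.csv")
ratings_path = os.path.join(tmp, "ratings.csv")

with open(wiki_path, "w") as f:
    json.dump([{"title": "Example Movie", "Box office": "$1.2 million"}], f)
pd.DataFrame({"id": [1], "budget": [100]}).to_csv(meta_path, index=False)
pd.DataFrame(
    {"userId": [1], "movieId": [1], "rating": [4.5], "timestamp": [0]}
).to_csv(ratings_path, index=False)

# Extract: the Wikipedia dump is JSON, the Kaggle files are CSVs.
with open(wiki_path) as f:
    wiki_movies_raw = json.load(f)
kaggle_metadata = pd.read_csv(meta_path, low_memory=False)
ratings = pd.read_csv(ratings_path)

print(len(wiki_movies_raw), kaggle_metadata.shape, ratings.shape)
```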

Transforming

Wikipedia Data

Wikipedia Movies transformed, 22 columns (Pic 4)

Wikipedia Movies with column names made more succinct and uniform, 7,033 rows of data (Pic 5)
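One part of this transformation is the regex step named in the Purpose section: parsing money strings such as box-office figures into numbers. The patterns and helper below are a minimal sketch of that idea, not the repo's exact code, and cover only two common dollar formats.

```python
import re

def parse_dollars(s):
    """Convert a dollar string like '$1.2 million' to a float, else None."""
    if not isinstance(s, str):
        return None
    # Form: "$1.2 million" / "$1.2 billion"
    if re.match(r"\$\s*\d+\.?\d*\s*milli?on", s, flags=re.IGNORECASE):
        value = float(re.sub(r"\$|\s|[a-zA-Z]", "", s))
        return value * 1_000_000
    if re.match(r"\$\s*\d+\.?\d*\s*billi?on", s, flags=re.IGNORECASE):
        value = float(re.sub(r"\$|\s|[a-zA-Z]", "", s))
        return value * 1_000_000_000
    # Form: "$123,456,789"
    if re.match(r"\$\s*\d{1,3}(?:,\d{3})+", s):
        return float(re.sub(r"\$|,", "", s))
    return None

print(parse_dollars("$1.2 million"))   # 1200000.0
print(parse_dollars("$123,456,789"))   # 123456789.0
```

In the full pipeline a function like this would be applied to a DataFrame column, e.g. with `str.extract` followed by `apply(parse_dollars)`.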

Kaggle Data

Wikipedia Movies merged with the Kaggle movie data, all column names and row counts, 6,052 rows (Pic 8)

Merged movies joined with the Kaggle ratings, all column names and row counts, 6,052 rows (Pic 9)
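The two merges above can be sketched with `pd.merge`. The join key (`imdb_id`) and column names here are assumptions for illustration, not the repo's exact schema.

```python
import pandas as pd

# Stand-in frames for the two movie sources and a ratings summary.
wiki_movies_df = pd.DataFrame(
    {"imdb_id": ["tt0001", "tt0002"], "wiki_title": ["A", "B"]}
)
kaggle_metadata = pd.DataFrame(
    {"imdb_id": ["tt0001", "tt0002"], "budget": [100, 200]}
)
rating_counts = pd.DataFrame({"imdb_id": ["tt0001"], "rating_count": [7]})

# Inner join combines the two movie sources on the shared IMDb id,
# suffixing any overlapping columns by origin.
movies_df = pd.merge(
    wiki_movies_df, kaggle_metadata, on="imdb_id", suffixes=("_wiki", "_kaggle")
)

# Left join keeps every movie, even those with no ratings yet.
movies_with_ratings_df = pd.merge(
    movies_df, rating_counts, on="imdb_id", how="left"
)
print(movies_with_ratings_df.shape)  # (2, 4)
```

The left join explains why the row count stays at 6,052 after adding ratings: no movies are dropped, and movies without ratings simply get NaN counts.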

Loading

Creating the Movie Database

Sending the data to PostgreSQL (Pic 10)
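The load step typically uses a SQLAlchemy engine and `DataFrame.to_sql`. The Postgres connection string below is an assumption (commented out); an in-memory SQLite engine stands in so the sketch runs without a database server.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical Postgres connection string for the real pipeline:
# db_string = "postgresql://postgres:<password>@localhost:5432/movie_data"
engine = create_engine("sqlite://")  # in-memory stand-in

movies_df = pd.DataFrame({"imdb_id": ["tt0001"], "title": ["Example"]})
movies_df.to_sql("movies", engine, if_exists="replace", index=False)

# A very large file like the 2.6M-row ratings data is best appended
# in chunks rather than loaded in one shot:
# for chunk in pd.read_csv("ratings.csv", chunksize=1_000_000):
#     chunk.to_sql("ratings", engine, if_exists="append", index=False)

print(pd.read_sql("SELECT COUNT(*) AS n FROM movies", engine)["n"][0])
```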

Verifying the data in PgAdmin

Movies Query (Pic 11)

Ratings Query (Pic 12)
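The same row-count check run in PgAdmin can also be run from Python with `pd.read_sql`; SQLite again stands in for PostgreSQL so the sketch executes, and the sample table is an assumption.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # stand-in for the Postgres engine
pd.DataFrame({"movieId": [1, 1, 2], "rating": [4.5, 3.0, 5.0]}).to_sql(
    "ratings", engine, index=False
)

# Equivalent to the PgAdmin verification query: SELECT COUNT(*) FROM ratings;
row_count = pd.read_sql("SELECT COUNT(*) AS row_count FROM ratings", engine)
print(row_count["row_count"][0])  # 3
```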

Summary

A Wikipedia JSON file and two Kaggle CSV files were extracted, transformed, and joined. The resulting movies and ratings tables were loaded into a PostgreSQL database for the hack-a-thon event.
