This repository shows how one can use popular open source database Postgres to build a Data Warehouse.
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent As the log file has only song's title and the artist name, song_id and artist_id are extracted by comparing song's title, it's duration in song database and artist name in artist database. songplay_id is set as auto increment field in songplay table.
users in the app. Users created by extracting user_id, first_name, last_name, gender, level from log files.
Added constraint to overwrite user's first name, last name, gender and level when we encounter the user id again in log file. Also added constraint to make user_id not null.
This table is created from song data set. Added constraint to make song_id Primary key which means it is unique and not null.
artist_id, name, location, latitude, longitude Artist table is also created from song data set. Added constraint to make artist_id Primary Key which means we assume artist id is unique and not null.
time dimension shows time user played a particular song. The timestamp from log file is processed to store start_time, hour, day, week, month, year and weekday when user played the song. timestamp is being used as Primary key as other entries in the table are calculated from this value.
Taken together these tables should allow Data Engineering team at Sparkify to analyze user data. songplay contains individual records of what song user's played at what time. It also will support aggregate queries like how many songs user played on the platform, which are popular songs and who are popular artists on the platform, how many songs a paid customer played and which are popular browsers for users on Sparkify. songplays table along with time table also identifies number of users in the system which can help in scaling the platform.
ETL Process currently parses song files first. This process extracts song_id, song_title, duration, year and artist id. Same data is used to populate artist table by extracting artist name, artist id, artist location, location's latitude and longitude.
Once song and artist tables are populated, log data is processed and each line of log is used to populate user and time dimension tables. songplay table is populated by comparing song's title, duration and artist names to the corrsponding values in song and artist tables.