mohanhh/CloudDataLakeWithAWS

This project shows how to use Spark to transform log data and store it in an S3 bucket.


Fact Table

songplays -

Holds records from the log data associated with song plays, i.e. records with page NextSong.

songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Because the log file contains only the song's title and the artist's name, song_id and artist_id are resolved by matching the song's title and duration against the song database and the artist's name against the artist database. songplay_id is an auto-increment field in the songplays table, generated with Spark's monotonically_increasing_id(). Query used:

    select (monotonically_increasing_id() + 1) as song_play_id,
           songs_log.ts as start_time, songs_log.userId, songs_log.level,
           songs.song_id, artist.artist_id,
           songs_log.sessionId as session_id, songs_log.location,
           songs_log.userAgent as user_agent
    from songs_log songs_log
    join songs on songs_log.song = songs.title
    join artist artist on songs_log.artist = artist.name
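A minimal PySpark sketch of this step, assuming the usual song and log data set column names; the paths, view names and helper variables (log_df, song_df) are illustrative, not the repository's exact code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify_etl").getOrCreate()

# Paths are placeholders -- point them at the unzipped song_data / log_data folders or S3.
log_df = spark.read.json("log_data/*.json")
song_df = spark.read.json("song_data/*/*/*/*.json")

# Only NextSong events are song plays.
log_df.filter(log_df.page == "NextSong").createOrReplaceTempView("songs_log")
song_df.createOrReplaceTempView("songs")
song_df.selectExpr("artist_id", "artist_name as name").createOrReplaceTempView("artist")

songplays = spark.sql("""
    select (monotonically_increasing_id() + 1) as song_play_id,
           songs_log.ts as start_time, songs_log.userId as user_id, songs_log.level,
           songs.song_id, artist.artist_id,
           songs_log.sessionId as session_id, songs_log.location,
           songs_log.userAgent as user_agent
    from songs_log
    join songs on songs_log.song = songs.title
    join artist on songs_log.artist = artist.name
""")
```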

Dimension Tables

users -

Users in the app, created by extracting user_id, first_name, last_name, gender and level from the log files.

A constraint was added to overwrite a user's first name, last name, gender and level whenever the same user id is encountered again in the log file. To select a unique set of users, the user information is taken from the log data ordered by timestamp (ts) in descending order, and the latest record per user is extracted with Spark SQL and the row_number() window function. This way, if a user updates their subscription or name, we always keep the latest information. The following queries are used to extract user information from the latest log entries.

The users table is first built from the log data:

    select userId as user_id, firstName as first_name, lastName as last_name, gender, level, ts
    from log_data

Each row is then numbered within its user_id partition, ordered by ts in descending order:

    select user_id, first_name, last_name, gender, level,
           row_number() over (partition by user_id order by ts desc) as ts_order
    from users_table

Because the entries are partitioned by user_id and ordered by timestamp in descending order, the latest entry has row_number 1, so this query returns the latest user information:

    select u1.user_id, u1.first_name, u1.last_name, u1.gender, u1.level
    from users u1
    where ts_order = 1
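The same deduplication can be sketched with the DataFrame API and a window function (log_df is the log data read as in the sketch above; column names follow the log schema):

```python
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

users = (
    log_df
    .selectExpr("userId as user_id", "firstName as first_name",
                "lastName as last_name", "gender", "level", "ts")
    .where(col("user_id") != "")
    # Number each user's records from newest to oldest ...
    .withColumn("ts_order",
                row_number().over(Window.partitionBy("user_id").orderBy(col("ts").desc())))
    # ... and keep only the newest record per user, so renames and
    # subscription changes always win.
    .where(col("ts_order") == 1)
    .drop("ts", "ts_order")
)
```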

songs - songs in the music database

song_id, title, artist_id, year, duration

This table is created from the song data set. Data is written to Parquet, partitioned by year and artist_id. The assumption is that artist_id and song_id are unique.
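Continuing the sketch above (song_df is the song data set read earlier; the output bucket is a placeholder), the songs table might be written like this:

```python
songs = (song_df
         .selectExpr("song_id", "title", "artist_id", "year", "duration")
         .dropDuplicates(["song_id"]))          # song_id assumed unique

# Partition the Parquet output by year and artist_id, as described above.
songs.write.mode("overwrite").partitionBy("year", "artist_id").parquet("s3a://<bucket>/songs/")
```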

artists - artists in the music database

artist_id, name, location, latitude, longitude

The artists table is also created from the song data set. The assumption is that artist_id is unique.
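A matching sketch for the artists table (the artist_* column names are assumptions about the song data set schema):

```python
artists = (song_df
           .selectExpr("artist_id", "artist_name as name", "artist_location as location",
                       "artist_latitude as latitude", "artist_longitude as longitude")
           .dropDuplicates(["artist_id"]))      # artist_id assumed unique

artists.write.mode("overwrite").parquet("s3a://<bucket>/artists/")
```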

time - timestamps of records in songplays broken down into specific units

time_id, start_time, hour, day, week, month, year, weekday

The time dimension records when a user played a particular song. The timestamp from the log file is processed to store the start_time, hour, day, week, month, year and weekday of the play. The timestamp is used as the key, since the other columns in the table are derived from it.
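A sketch of deriving the time table from the log timestamps (ts is assumed to be a Unix epoch in milliseconds, hence the division by 1000; log_df is from the sketch above):

```python
from pyspark.sql.functions import col, hour, dayofmonth, weekofyear, month, year, date_format

time_table = (
    log_df
    .select((col("ts") / 1000).cast("timestamp").alias("start_time"))
    .withColumn("hour", hour("start_time"))
    .withColumn("day", dayofmonth("start_time"))
    .withColumn("week", weekofyear("start_time"))
    .withColumn("month", month("start_time"))
    .withColumn("year", year("start_time"))
    .withColumn("weekday", date_format("start_time", "E"))
    .dropDuplicates(["start_time"])              # the timestamp is the key
)
```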

Taken together, these tables should allow the Data Engineering team at Sparkify to analyze user data. songplays contains individual records of which song a user played and when. It also supports aggregate queries such as how many songs a user played on the platform, which songs and artists are the most popular, how many songs paid customers played, and which browsers are the most popular among Sparkify users. The songplays table, together with the time table, also identifies the number of users in the system, which can help in scaling the platform.

ETL Process

The ETL process currently parses the song files first. This step extracts song_id, song_title, duration, year and artist_id. The same data is used to populate the artists table by extracting the artist name, artist id, artist location, and the location's latitude and longitude.

Once the songs and artists tables are populated, the log data is processed and each log line is used to populate the users and time dimension tables. The songplays table is populated by comparing the song's title, duration and artist name to the corresponding values in the songs and artists tables.

How to run the code

Use the etl.ipynb notebook to run the ETL process. It first checks for the existence of the song_data and log_data folders, deletes them if found, then unzips the song_data and log_data zip files and runs etl.py.
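Roughly, the notebook's preparation step looks like the following sketch (the zip file names and extraction layout are assumptions based on the description above):

```python
import shutil
import subprocess
import zipfile
from pathlib import Path

for folder in ("song_data", "log_data"):
    if Path(folder).exists():
        shutil.rmtree(folder)                     # remove any stale extraction
    with zipfile.ZipFile(f"{folder}.zip") as zf:  # assumed archive name
        zf.extractall(folder)

subprocess.run(["python", "etl.py"], check=True)  # run the ETL job
```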
