
Snowflake_External_Stage_DataPipeline

Project Overview

The "Snowflake External Stage Data Pipeline" project automates the data ingestion process from a MySQL database to Snowflake using an External Stage. This project leverages AWS S3 as an intermediary storage solution, showcasing two distinct pipelines for loading data into Snowflake: one utilizing AWS Lambda and the other using Apache Airflow. Incremental loading is a critical feature for efficiently managing data pipelines, as it ensures that only new or modified data is transferred, saving on both processing time and costs. This approach allows for flexibility and demonstrates various methods of leveraging cloud services for efficient data management.

YouTube Video

Features

  • External Staging on AWS S3: Data is first moved from MySQL to an S3 bucket, which serves as an external stage for Snowflake (see the setup sketch after this list).
  • Lambda-Based Pipeline: Utilizes AWS Lambda for serverless data loading from S3 to Snowflake, optimizing for efficiency and cost.
  • Airflow-Based Pipeline: Leverages Apache Airflow for orchestrated data transfer, providing robust scheduling and monitoring capabilities.
  • Incremental Data Loading: Implements an incremental loading approach, ensuring only new or updated records are transferred, optimizing the data ingestion process.
  • Metadata-Driven Ingestion: Uses metadata to dynamically manage data flows, enhancing the flexibility and scalability of the pipeline.
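
As a concrete illustration of the external-stage setup, the sketch below creates an S3 stage and copies staged files into a target table via the Snowflake Python connector. All names (account, stage, bucket, table) and credentials are hypothetical placeholders, not the project's actual configuration.

import snowflake.connector

# Hypothetical connection settings; real values would come from config/secrets.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="COMPUTE_WH",
    database="BOOKS_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# One-time setup: a CSV file format and an external stage over the S3 bucket.
cur.execute("""
    CREATE FILE FORMAT IF NOT EXISTS csv_format
    TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1
""")
cur.execute("""
    CREATE STAGE IF NOT EXISTS books_ext_stage
    URL = 's3://my-bucket/amazone_books/'
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
    FILE_FORMAT = csv_format
""")

# Per-run load: COPY INTO picks up only files it has not loaded before.
cur.execute("COPY INTO amazone_books FROM @books_ext_stage")
conn.close()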

Technologies Used

  • MySQL: Source database for data ingestion.
  • Snowflake: Cloud data warehouse for data storage and analysis.
  • AWS S3: Used as external staging for data before loading into Snowflake.
  • AWS Lambda: Serverless computing service for executing data loading scripts.
  • Apache Airflow: Workflow automation and scheduling tool that orchestrates the data pipeline.
  • Python: Primary programming language for scripting and automation.

How It Works

  • Data is extracted incrementally from MySQL and loaded into an S3 bucket (a sketch of this step follows this list).
  • Two separate pipelines are defined for data loading into Snowflake:
    • Lambda Pipeline: Triggered upon new data arrival in S3, executing a script to load data into Snowflake.
    • Airflow Pipeline: Scheduled tasks for data extraction, loading to S3, and subsequent loading into Snowflake using External Stage.
  • Metadata-driven approach allows for dynamic adaptation of the data ingestion process as source data evolves.
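
A minimal sketch of the incremental extract step from the first bullet above: rows with a book_id above the stored watermark are pulled from MySQL, written as CSV, and uploaded to S3. Connection details, the bucket name, and the watermark handling are illustrative assumptions.

import csv
import io

import boto3
import mysql.connector

def extract_new_books(last_book_id):
    """Pull rows past the watermark, upload them to S3, return the new watermark."""
    conn = mysql.connector.connect(
        host="localhost", user="etl_user", password="...", database="books"
    )
    cur = conn.cursor()
    cur.execute(
        "SELECT book_id, book_title, book_amount, book_author, "
        "book_rating, book_link, business_date "
        "FROM amazone_books WHERE book_id > %s ORDER BY book_id",
        (last_book_id,),
    )
    rows = cur.fetchall()
    conn.close()
    if not rows:
        return last_book_id  # nothing new to load

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["book_id", "book_title", "book_amount", "book_author",
                     "book_rating", "book_link", "business_date"])
    writer.writerows(rows)

    # Name the object by ID range so every run produces a distinct file.
    new_watermark = rows[-1][0]
    boto3.client("s3").put_object(
        Bucket="my-bucket",  # placeholder
        Key="amazone_books/books_%d_%d.csv" % (last_book_id + 1, new_watermark),
        Body=buf.getvalue(),
    )
    return new_watermark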

Future Enhancements

  • Explore additional optimizations for handling larger datasets.
  • Implement error handling and retry mechanisms for increased reliability.
  • Enhance monitoring and alerting for pipeline operations.

Source Data

Two tables are ingested into Snowflake.

Dedicated DAGs are developed for each table:

  1. amazone_books
  2. amazonebook_reviews

Source Table DDLs

CREATE TABLE amazone_books (
	book_id INT NOT NULL AUTO_INCREMENT
	,book_title TEXT
	,book_amount FLOAT
	,book_author TEXT
	,book_rating FLOAT
	,book_link TEXT
	,business_date DATE DEFAULT(CURRENT_DATE)
	,PRIMARY KEY (book_id)
	);

CREATE TABLE amazonebook_reviews (
	book_id INT NOT NULL
	,reviewer_name TEXT
	,rating FLOAT
	,review_title TEXT
	,review_content TEXT
	,reviewed_on DATE
	,business_date DATE DEFAULT(CURRENT_DATE)
	);

For the incremental load, primary keys are required in the tables. The respective primary keys are listed below; a MERGE-based upsert keyed on them is sketched after the list.

  • amazone_books
    • book_id
  • amazonebook_reviews
    • book_id
    • reviewer_name
    • business_date
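
Because amazonebook_reviews is keyed on three columns, an upsert keyed on all of them keeps reprocessed files from producing duplicate rows. Below is a hedged sketch of such a MERGE, to be executed through the Snowflake connector as above; the stage name and file-format option are assumptions, and the project's actual load logic is in the linked docs.

# Hypothetical MERGE-based upsert keyed on the composite primary key above.
MERGE_REVIEWS_SQL = """
MERGE INTO amazonebook_reviews AS tgt
USING (
    SELECT $1 AS book_id, $2 AS reviewer_name, $3 AS rating,
           $4 AS review_title, $5 AS review_content,
           $6 AS reviewed_on, $7 AS business_date
    FROM @reviews_ext_stage (FILE_FORMAT => 'csv_format')
) AS src
ON  tgt.book_id       = src.book_id
AND tgt.reviewer_name = src.reviewer_name
AND tgt.business_date = src.business_date
WHEN MATCHED THEN UPDATE SET
    rating         = src.rating,
    review_title   = src.review_title,
    review_content = src.review_content,
    reviewed_on    = src.reviewed_on
WHEN NOT MATCHED THEN INSERT
    (book_id, reviewer_name, rating, review_title,
     review_content, reviewed_on, business_date)
VALUES
    (src.book_id, src.reviewer_name, src.rating, src.review_title,
     src.review_content, src.reviewed_on, src.business_date)
"""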

Note: This source data comes from another project. To learn more about how it is generated, please refer to AmazonBooks_DataPipeline.

The Airflow DAG IDs for the respective tables are:

  • amazone_books
    • Snowflake_ExternalStage_amazone_books_Dag
  • amazonebook_reviews
    • Snowflake_ExternalStage_source_amazonebook_review_Dag
    • Snowflake_ExternalStage_destination_amazonebook_review_Dag

For a detailed understanding of data ingestion using AWS Lambda, please refer to amazone_books.md.
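
As a rough sketch of that Lambda path (the linked doc has the real implementation), a handler fired by an S3 ObjectCreated event could issue a COPY INTO scoped to just the new file. Environment-variable names and the object layout are illustrative assumptions.

import os

import snowflake.connector

def lambda_handler(event, context):
    """Fired by an S3 ObjectCreated event; loads only the newly arrived file."""
    conn = snowflake.connector.connect(
        account=os.environ["SF_ACCOUNT"],
        user=os.environ["SF_USER"],
        password=os.environ["SF_PASSWORD"],
        warehouse="COMPUTE_WH",
        database="BOOKS_DB",
        schema="PUBLIC",
    )
    cur = conn.cursor()
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]  # e.g. amazone_books/books_1_50.csv
        # Scope the COPY to the file named in the event; COPY INTO also skips
        # files it has already loaded, so reruns stay idempotent.
        filename = key.split("/", 1)[1]
        cur.execute(
            "COPY INTO amazone_books FROM @books_ext_stage FILES = ('%s')" % filename
        )
    conn.close()
    return {"status": "loaded"}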

For a detailed understanding of data ingestion using Airflow, please refer to amazonebook_review.md.
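
And a minimal skeleton of the Airflow path, reusing the Snowflake_ExternalStage_amazone_books_Dag ID listed above; the schedule, task names, and callables are assumptions, with the full DAG described in the linked doc.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_mysql_to_s3(**context):
    ...  # incremental extract, as sketched under "How It Works"

def copy_s3_to_snowflake(**context):
    ...  # COPY INTO via the external stage

with DAG(
    dag_id="Snowflake_ExternalStage_amazone_books_Dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # assumed schedule
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_mysql_to_s3",
        python_callable=extract_mysql_to_s3,
    )
    load = PythonOperator(
        task_id="copy_s3_to_snowflake",
        python_callable=copy_s3_to_snowflake,
    )
    extract >> load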

Source System

(diagram of the source system)

Pipeline Using AWS Lambda

(diagram of the Lambda-based pipeline)

Pipeline Using Airflow

(diagram of the Airflow-based pipeline)
