This project hosts Apache Airflow on Docker to manage the workflow of a data engineering pipeline. The data is the NYC Yellow Taxi trip data for January 2022. It is extracted from the source, transformed using Python, and then loaded into the data warehouse (BigQuery) on Google Cloud Platform.
This task downloads the dataset from the dataset URL and saves it locally in the Airflow Docker container by running a Bash command.
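As a rough illustration of the download step, the sketch below builds the shell command such a task could run. The use of `curl`, the `/opt/airflow` target directory, and the helper name `build_download_command` are assumptions made for this example; the actual command lives in the DAG file linked at the end of this README.

```python
import shlex

# Dataset URL as listed in this README.
DATASET_URL = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    "yellow_tripdata_2022-01.parquet"
)
# Assumption: a writable directory inside the Airflow container.
LOCAL_DIR = "/opt/airflow"


def build_download_command(url: str, dest_dir: str) -> str:
    """Return a curl command that saves `url` into `dest_dir` (hypothetical helper)."""
    file_name = url.rsplit("/", 1)[-1]
    dest_path = f"{dest_dir.rstrip('/')}/{file_name}"
    # -f: fail on HTTP errors, -sS: quiet but still print errors, -L: follow redirects
    return f"curl -fsSL {shlex.quote(url)} -o {shlex.quote(dest_path)}"


download_command = build_download_command(DATASET_URL, LOCAL_DIR)
print(download_command)
```

In a DAG, a string like this would typically be passed as the `bash_command` argument of Airflow's `BashOperator`.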
The next task calls a Python function to upload the locally stored dataset to Google Cloud Storage (GCS). We first initialize a storage client and get a reference to the bucket, then upload the file from its local path to that bucket.
This task takes 3 parameters:
- bucket: The GCS bucket name specified in the docker-compose.yaml file
- object_name: The target path and file name of the object in the bucket
- local_file: The source path and file name of the file downloaded in the earlier step
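The upload step described above might look roughly like the following sketch. It uses the `google-cloud-storage` client library; the optional `client` argument is an addition for this example (it makes the function easy to exercise without real credentials) and is not necessarily part of the function in the repository's DAG.

```python
def upload_to_gcs(bucket: str, object_name: str, local_file: str, client=None) -> None:
    """Upload `local_file` to `gs://<bucket>/<object_name>`.

    `client` is an assumption for this sketch: when omitted, a real
    google.cloud.storage.Client is created from the ambient credentials.
    """
    if client is None:
        # Imported lazily so the module loads even without the library installed.
        from google.cloud import storage

        client = storage.Client()

    gcs_bucket = client.bucket(bucket)       # reference to the target bucket
    blob = gcs_bucket.blob(object_name)      # target path and file name in the bucket
    blob.upload_from_filename(local_file)    # source path of the downloaded file
```

In Airflow, a function like this is usually wired in through a `PythonOperator`, with the three parameters supplied via `op_kwargs`.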
This task defines the BigQuery external table resource: the project ID, dataset ID (set in Terraform), and table ID, along with the external data configuration, namely the source data format and the source URIs (which reference the GCS object uploaded in the previous step).
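To make the shape of that resource concrete, here is a minimal sketch that assembles it as a plain dictionary using the field names of the BigQuery table resource. The project, dataset, and table IDs are taken from the queries in this README; the bucket name and object path in the URI are placeholders, and `build_external_table_resource` is a hypothetical helper.

```python
from typing import Sequence


def build_external_table_resource(
    project_id: str,
    dataset_id: str,
    table_id: str,
    source_uris: Sequence[str],
    source_format: str = "PARQUET",
) -> dict:
    """Return a BigQuery table resource describing an external table."""
    return {
        "tableReference": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        },
        "externalDataConfiguration": {
            "sourceFormat": source_format,
            "sourceUris": list(source_uris),
        },
    }


table_resource = build_external_table_resource(
    project_id="skillful-octane-358220",
    dataset_id="trips_data_all",
    table_id="external_table",
    # Placeholder URI: substitute the real bucket and object path.
    source_uris=["gs://your-bucket-name/raw/yellow_tripdata_2022-01.parquet"],
)
```

A dictionary of this form can be handed to an operator that accepts a `table_resource` argument, such as `BigQueryCreateExternalTableOperator` in the Airflow Google provider.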
QUERY:
CREATE OR REPLACE TABLE `skillful-octane-358220.trips_data_all.non_partioned` AS
SELECT * FROM `skillful-octane-358220.trips_data_all.external_table`;
QUERY:
CREATE OR REPLACE TABLE `skillful-octane-358220.trips_data_all.trip_data_partioned`
PARTITION BY
DATE(tpep_pickup_datetime) AS
SELECT * FROM `skillful-octane-358220.trips_data_all.external_table`;
QUERY:
CREATE OR REPLACE TABLE `skillful-octane-358220.trips_data_all.trip_data_partioned_clustered`
PARTITION BY
DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `skillful-octane-358220.trips_data_all.external_table`;
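The three queries above differ only in their PARTITION BY and CLUSTER BY clauses. As an illustration, the sketch below generates the same statements from parameters; `build_ctas` is a hypothetical helper written for this README, not a function from the repository.

```python
from typing import Optional


def build_ctas(
    target: str,
    source: str,
    partition_by: Optional[str] = None,
    cluster_by: Optional[str] = None,
) -> str:
    """Build a CREATE OR REPLACE TABLE ... AS SELECT statement for BigQuery."""
    lines = [f"CREATE OR REPLACE TABLE `{target}`"]
    if partition_by:
        lines.append(f"PARTITION BY {partition_by}")
    if cluster_by:
        lines.append(f"CLUSTER BY {cluster_by}")
    # The AS keyword closes whichever clause comes last.
    lines[-1] += " AS"
    lines.append(f"SELECT * FROM `{source}`;")
    return "\n".join(lines)


DATASET = "skillful-octane-358220.trips_data_all"

partitioned_clustered_sql = build_ctas(
    target=f"{DATASET}.trip_data_partioned_clustered",
    source=f"{DATASET}.external_table",
    partition_by="DATE(tpep_pickup_datetime)",
    cluster_by="VendorID",
)
print(partitioned_clustered_sql)
```

Omitting `cluster_by` gives the partitioned-only query, and omitting both gives the non-partitioned copy. The resulting string could then be submitted as a query job, for example through Airflow's `BigQueryInsertJobOperator`.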
Dataset: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
docker-compose.yaml: https://github.com/AakashDeorukhkar/Airflow_NYC_Taxi_Data/blob/main/docker-compose.yaml
Dockerfile: https://github.com/AakashDeorukhkar/Airflow_NYC_Taxi_Data/blob/main/Dockerfile
requirements.txt: https://github.com/AakashDeorukhkar/Airflow_NYC_Taxi_Data/blob/main/requirements.txt
DAG: https://github.com/AakashDeorukhkar/Airflow_NYC_Taxi_Data/tree/main/dags
Screenshot/Images: https://github.com/AakashDeorukhkar/Airflow_NYC_Taxi_Data/tree/main/Images