This repository contains the code and files for an Airflow pipeline that performs ELT (Extract, Load, Transform) operations on yearly weather data from ncei.noaa.gov for stations in Germany.
The project consists of the following components:
- weather_pipeline: A package that contains the modules and functions for the ELT tasks: downloading CSVs, processing and writing them to PostgreSQL, and applying transformations. The package is separated into three parts:
  - Extract
  - Load
  - Transform
- pipeline.py: A Python script that runs the pipeline as a standalone application without Airflow.
- aws_tasks: A folder that contains Python scripts for creating and deleting AWS resources, such as creating S3 buckets and uploading files in single and multipart fashion using the boto3 library (see the sketch after this list).
- dags: A folder that contains the Airflow DAG.
- tmp: A folder that stores temporary files generated during the pipeline execution.
- Dockerfile: A file that defines the Docker image for running PostgreSQL in a containerized environment.
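As mentioned in the aws_tasks item, the S3 helpers are built on boto3. The repository's scripts are not reproduced here; the snippet below is only a sketch of the two upload styles, with the bucket name, region, file paths, and threshold chosen as placeholders:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Create a bucket (a LocationConstraint is required outside us-east-1).
s3.create_bucket(
    Bucket="my-weather-bucket",  # placeholder name
    CreateBucketConfiguration={"LocationConstraint": "eu-central-1"},
)

# Single-call upload for small files.
s3.upload_file("tmp/2023.csv", "my-weather-bucket", "raw/2023.csv")

# Multipart upload for large files: upload_file switches to multipart
# automatically once the file size exceeds multipart_threshold.
config = TransferConfig(multipart_threshold=8 * 1024 * 1024,
                        multipart_chunksize=8 * 1024 * 1024)
s3.upload_file("tmp/large_year.csv", "my-weather-bucket", "raw/large_year.csv", Config=config)
```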
The ELT tasks are grouped by their broad conceptual functionality into three packages. Each package implements its own class, which can be extended with new functionality (a usage sketch follows this list):
- Extract: The `Extractor` class takes a resource URL as input and downloads the data from the web source. The class can be extended with additional sources as requirements grow.
- Load: The `Loader` class writes the data to PostgreSQL. It supports two methods for writing into the tables: from a CSV file or from a list of rows.
- Transform: Executes SQL queries through a transformations handler class; the `transform_*` functions apply the respective transformations and save the results locally or on AWS S3:

```python
transform = Transform(db_handler)
result_cursor = transform.run("sql_query_name", **kwargs)
```
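The snippet above shows the Transform entry point from the repository; how the other two stages are invoked depends on the package's actual constructors. Purely as an illustration, a standalone run might be wired roughly like this (module paths, method names, arguments, and the URL are assumptions, not the repository's verbatim API):

```python
from weather_pipeline.extract import Extractor   # hypothetical module paths
from weather_pipeline.load import Loader
from weather_pipeline.transform import Transform

db_handler = ...  # the project's PostgreSQL handler, configured from tmp/config.ini

# Download one yearly CSV from the NOAA web source (placeholder URL).
extractor = Extractor("https://www.ncei.noaa.gov/path/to/yearly_data.csv")
csv_path = extractor.download(dest_dir="tmp")

# Write the downloaded data into PostgreSQL, either from the CSV file or row by row.
loader = Loader(db_handler)
loader.write_csv(csv_path, table="readings")
# loader.write_rows(rows, table="readings")  # alternative row-wise method

# Apply a named SQL transformation (this call mirrors the snippet above).
transform = Transform(db_handler)
result_cursor = transform.run("monthly_avg_by_station", start_year=1990, end_year=2000)
```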
- `transform_station_monthly_temp_avg()`: Derives the monthly average temperature for each weather station over a specific period (1990 to 2000). The underlying SQL query (`monthly_avg_by_station`) groups the 'TMAX' readings table by the 'DATE' field and uses the `PERCENTILE_CONT` function within each group to calculate the median temperature for Germany on each day. The result is structured as a dictionary containing station IDs, months, and corresponding average temperatures. The derived information is stored locally as a JSON file (`stations_avg_temp.json`) and then uploaded to an Amazon S3 bucket for broader accessibility.
- `transform_get_avg_top5()`: Retrieves the average TMAX values for the five closest stations to each city in the 'cities' table. The SQL query uses Common Table Expressions (CTEs) to rank stations by proximity. The result is saved locally as a CSV file and uploaded to Amazon S3.
The current SQL query, although logically correct, can become resource-intensive when executed over a large number of table entries. To improve resource efficiency, a potential optimization is to introduce a precomputed table named `germany_means`. This table acts as a repository for precomputed results, eliminating the need to recalculate medians for the same dates; a sketch of the idea follows.
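As a rough illustration of that idea (table and column names such as `readings`, `DATE`, and `TMAX` are assumptions, not taken from the repository), the daily country-wide medians could be materialized once and then joined against instead of being recomputed per query:

```python
import psycopg2

# Hypothetical one-off precomputation of the daily median TMAX for Germany.
CREATE_GERMANY_MEANS = """
CREATE TABLE IF NOT EXISTS germany_means AS
SELECT "DATE" AS reading_date,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "TMAX") AS median_tmax
FROM readings
GROUP BY "DATE";
"""

conn = psycopg2.connect(host="localhost", dbname="DBNAME", user="USERNAME", password="your_password")
with conn, conn.cursor() as cur:
    cur.execute(CREATE_GERMANY_MEANS)
conn.close()
```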
To run the pipeline, install the prerequisites:

```bash
pip install -r requirements.txt
```

You also need an AWS account and credentials, and the following environment variables set (a quick credential check is sketched after the list):

- `AWS_ACCESS_KEY_ID`: Your AWS access key ID
- `AWS_SECRET_ACCESS_KEY`: Your AWS secret access key
- `AWS_REGION`: Your AWS region
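To confirm that boto3 actually picks these variables up before kicking off the pipeline, a quick identity check is enough; this snippet is a convenience, not part of the repository:

```python
import os
import boto3

# boto3 reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment;
# the region is passed explicitly because this project uses AWS_REGION.
sts = boto3.client("sts", region_name=os.environ.get("AWS_REGION"))
identity = sts.get_caller_identity()
print("Authenticated as:", identity["Arn"])
```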
Follow these steps to set up a PostgreSQL container and enable the necessary extensions, specifically `earthdistance` and `cube`. These extensions are used for geospatial calculations involving latitude and longitude.
Create a new PostgreSQL container:

```bash
docker run --name postgres-container -e POSTGRES_USER=USERNAME -e POSTGRES_PASSWORD=your_password -e POSTGRES_DB=DBNAME -p 5432:5432 -d postgres:latest
```

Adjust the values for `-e POSTGRES_USER`, `-e POSTGRES_PASSWORD`, and `-e POSTGRES_DB` as needed.
Access the PostgreSQL container's bash shell:

```bash
docker exec -it postgres-container bash
```

Inside the container's shell, execute the following commands to create the required extensions (create `cube` first, since `earthdistance` depends on it):

```bash
psql -U USERNAME -d DBNAME -c "CREATE EXTENSION cube;"
psql -U USERNAME -d DBNAME -c "CREATE EXTENSION earthdistance;"
```

Replace `USERNAME` with your PostgreSQL username and adjust the database name (`DBNAME`).
To verify that the extensions are installed, connect to the PostgreSQL database:

```bash
psql -U USERNAME -d DBNAME
```

Within the PostgreSQL prompt, check for the installed extensions:

```
\dx
```

This command should display a list of installed extensions, including `earthdistance` and `cube`.
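To see the two extensions in action, you can compute a great-circle distance between two latitude/longitude points. psycopg2 and the placeholder connection values from the steps above are assumed here:

```python
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="DBNAME",
                        user="USERNAME", password="your_password")
with conn, conn.cursor() as cur:
    # ll_to_earth and earth_distance come from earthdistance (built on cube);
    # the result is the distance in meters between Berlin and Munich.
    cur.execute(
        "SELECT earth_distance(ll_to_earth(%s, %s), ll_to_earth(%s, %s));",
        (52.52, 13.405, 48.137, 11.575),
    )
    print(cur.fetchone()[0])
conn.close()
```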
- Save your `default` and `aws` credentials in `tmp/config.ini` (a sketch of reading this file follows the list):

  ```ini
  [default]
  POSTGRES_USER = LOCAL_USER
  POSTGRES_PASSWORD = password
  POSTGRES_HOST = localhost
  POSTGRES_PORT = PORT
  POSTGRES_DB = DBNAME

  [aws]
  POSTGRES_USER = AWS_USER
  POSTGRES_PASSWORD = password
  POSTGRES_HOST = dbname*.rds.amazonaws.com
  POSTGRES_PORT = AWS_PORT
  POSTGRES_DB = postgres
  ```
- Set the appropriate values for the `S3_BUCKET` and `S3_DIR` variables in `config.py`.
- Run the pipeline with the command `python pipeline.py`.
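As referenced above, the values in `tmp/config.ini` can be read with the standard-library `configparser`; the code that actually consumes them lives in the project, so the snippet below only sketches the lookup (section and key names match the example file above):

```python
import configparser

# Read the credentials file written in the step above.
config = configparser.ConfigParser()
config.read("tmp/config.ini")

# Use [default] for a local PostgreSQL container, [aws] for the RDS instance.
section = config["default"]
conn_params = {
    "host": section["POSTGRES_HOST"],
    "port": section["POSTGRES_PORT"],
    "dbname": section["POSTGRES_DB"],
    "user": section["POSTGRES_USER"],
    "password": section["POSTGRES_PASSWORD"],
}
print(conn_params)
```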
To run the pipeline using Airflow, start the Airflow webserver and scheduler in separate terminals:

```bash
airflow webserver
airflow scheduler
```

Then go to the Airflow UI at http://localhost:8080 and trigger the `weather_pipeline_dag` DAG.
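The repository's DAG in `dags/` defines the actual tasks; purely for orientation, a minimal DAG (recent Airflow, 2.4+) with the same id and an extract → load → transform ordering might look like this (the callables and the yearly schedule are assumptions, not the repository's code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; the real DAG calls into the weather_pipeline package.
def extract(): ...
def load(): ...
def transform(): ...

with DAG(
    dag_id="weather_pipeline_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@yearly",  # assumption: yearly NOAA files suggest a yearly run
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> load_task >> transform_task
```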