
Apache Airflow: An Overview

Apache Airflow is an open-source platform used to programmatically author, schedule, and monitor workflows. Here’s a simple breakdown of what Airflow is and why it is important:

What is Apache Airflow?

  • Workflow Management System: Airflow is designed to manage complex workflows and data processing pipelines.
  • Directed Acyclic Graphs (DAGs): It uses DAGs to represent workflows. Each DAG consists of nodes (tasks) and edges (dependencies between tasks).
  • Python-Based: Workflows are written in Python, making it easy to create, modify, and maintain pipelines using a familiar programming language.
  • Extensible and Scalable: Airflow supports plugins, custom operators, and scalable deployment options, making it highly customizable to different use cases.

Why Airflow?

  1. Automation:

    • Automates repetitive tasks and complex workflows, reducing manual effort.
    • Ensures workflows run on schedule without human intervention.
  2. Visualization:

    • Provides a user-friendly web interface to visualize and monitor workflows.
    • Allows users to see task progress, dependencies, logs, and results.
  3. Scheduling:

    • Allows precise scheduling of tasks using cron-like syntax.
    • Supports catching up on missed tasks and backfilling historical data.
  4. Modularity:

    • Promotes the modular design of workflows where each task can be independently managed and maintained.
    • Facilitates the reuse of code through custom operators and libraries.
  5. Extensibility:

    • Easily integrates with various data sources and destinations like AWS, Google Cloud, and databases.
    • Supports custom plugins and operators to extend its functionality (a minimal custom-operator sketch follows this list).
  6. Scalability:

    • Scales to handle large volumes of tasks and workflows.
    • Can be deployed on single machines or distributed clusters.
  7. Community and Support:

    • Backed by a large open-source community, providing a wealth of plugins, integrations, and support.
    • Regular updates and a vast repository of knowledge and tutorials.
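
To make the extensibility point concrete, here is a minimal sketch of a custom operator (the class name and greeting are hypothetical, purely for illustration; operators are described in more detail under Core Concepts below). A real custom operator would typically wrap a client for an external system.

from airflow.models import BaseOperator


class GreetOperator(BaseOperator):
    """Toy custom operator used only to illustrate the extension point."""

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # execute() is what Airflow calls when the task runs;
        # its return value is pushed to XCom by default.
        self.log.info("Hello, %s!", self.name)
        return f"Hello, {self.name}!"

Once defined, such an operator is used in a DAG exactly like the built-in ones, e.g. GreetOperator(task_id="greet", name="Airflow").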

Core Concepts

  1. DAG (Directed Acyclic Graph):

    • A collection of tasks with dependencies, defining the order of execution.
    • Each DAG is a Python script defining the tasks and their dependencies.
  2. Tasks:

    • The smallest unit of work in a DAG. Tasks can be anything from running a SQL query to triggering an API request.
    • Tasks are defined using operators, which are predefined templates for common operations.
  3. Operators:

    • Action Operators: Perform an action (e.g., PythonOperator, BashOperator).
    • Transfer Operators: Move data from one system to another (e.g., S3ToRedshiftOperator).
    • Sensor Operators: Wait for a certain condition to be met (e.g., S3KeySensor, HttpSensor).
  4. Executors:

    • Determine how tasks are run (locally, on a cluster, etc.). Examples include LocalExecutor, CeleryExecutor, and KubernetesExecutor.
    • Queue: A list of tasks waiting to be executed.
    • Worker: A process that picks up tasks from the queue and executes them.
  5. Webserver:

    • The user interface for Airflow. It allows users to view the status of DAGs, task instances, logs, and more.
  6. Metastore:

    • The metadata database that stores information about DAGs, tasks, execution status, and more.
    • Commonly uses databases like MySQL, PostgreSQL, or SQLite.
  7. Triggerer:

    • A lightweight, asyncio-based process that runs deferred (deferrable) tasks, letting them wait for external events or conditions without occupying a worker slot.
  8. Task Instance:

    • A specific run of a task within a DAG. Each time a task is executed, a new task instance is created.
  9. Workflow:

    • A sequence of tasks that perform a specific process or job. In Airflow, a workflow is represented by a DAG.
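
To tie these concepts together, below is a minimal DAG sketch (the DAG id, schedule, and task logic are illustrative placeholders): two tasks defined with built-in operators and one dependency between them.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def _transform():
    # Placeholder for real transformation logic.
    print("transforming data")


with DAG(
    dag_id="example_pipeline",        # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # cron-like expressions also work, e.g. "0 6 * * *"
    catchup=False,                    # skip backfilling missed runs in this example
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = PythonOperator(task_id="transform", python_callable=_transform)

    # The edge of the graph: extract must finish before transform starts.
    extract >> transform

Dropping this file into the dags/ folder is enough for the scheduler to pick it up; each scheduled run then produces one DAG Run with one task instance per task.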

Use Cases and Benefits

  1. Data Engineering and ETL:

    • Airflow is commonly used for Extract, Transform, Load (ETL) processes in data pipelines. It ensures data consistency and manages dependencies.
  2. Machine Learning Pipelines:

    • Airflow orchestrates complex ML workflows, from data preprocessing to model training and deployment.
  3. Task Automation:

    • Automates repetitive tasks, such as sending daily reports or data backups.
  4. Monitoring and Alerting:

    • Provides real-time monitoring of task execution and alerting for failures or delays.

By leveraging Apache Airflow, organizations can streamline their workflow management, reduce errors, and improve the efficiency of their data and task automation processes.

======================================================================================================================================================================================

Airflow Drawbacks

  1. Airflow is not a data streaming solution:

    • What it means: Airflow is not designed to handle data that needs to be processed in real-time or near real-time.
    • Implication: You cannot schedule tasks to run every second in Airflow. It is not built to handle such high-frequency scheduling.
  2. Airflow is not a data processing framework:

    • What it means: Airflow itself is not meant to process large volumes of data like frameworks such as Apache Spark.
    • Implication: You should not use Airflow tasks (or operators) to directly process data. For example, you shouldn't run extensive data transformations directly in Airflow tasks.
  3. Proper Use of Airflow:

    • Orchestrator: Airflow is designed to orchestrate data workflows. This means it helps in scheduling and managing tasks that trigger other tools or processes that handle data processing.
    • Example: Use operators like SparkSubmitOperator to submit jobs to Spark from Airflow (see the sketch below). Airflow manages the workflow, but Spark does the heavy lifting of processing the data.

In summary, Airflow is best used as a workflow orchestrator to manage and schedule tasks that invoke other tools or systems for data processing, rather than performing the data processing itself.
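
As a sketch of this orchestration pattern (assuming the apache-airflow-providers-apache-spark package is installed; the application path and connection ID are placeholders), Airflow only submits and monitors the job while Spark does the actual processing:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_orchestration",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Airflow triggers and tracks the job; the heavy lifting happens on the Spark cluster.
    submit_job = SparkSubmitOperator(
        task_id="submit_spark_job",
        application="/opt/jobs/transform.py",  # placeholder path to the Spark application
        conn_id="spark_default",               # assumes a Spark connection configured in Airflow
    )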

======================================================================================================================================================================================

Single Node Architecture in Airflow

In a single node setup, all components of Apache Airflow run on a single machine. This includes the scheduler, web server, metadata database, and worker(s). Here’s a breakdown of each component and how they interact in a single node environment:

  1. Scheduler:

    • The scheduler is responsible for reading the DAGs (Directed Acyclic Graphs) and scheduling the tasks to be executed.
    • It places the tasks in a queue for execution.
  2. Web Server:

    • The web server provides a user interface to interact with Airflow.
    • It allows users to view the status of DAGs, start or stop DAGs, and see logs and task details.
  3. Metadata Database:

    • This database stores the state of the Airflow environment, including DAG definitions, task instances, and other configuration details.
    • Common choices for the metadata database include PostgreSQL, MySQL, and SQLite.
  4. Worker(s):

    • Workers are the processes that actually execute the tasks defined in the DAGs.
    • In a single node setup, these workers run on the same machine as the scheduler and web server.

How It Works in a Single Node Setup

  1. DAG Definition:

    • DAGs are defined as Python scripts and placed in a specified directory (typically the dags/ folder).
    • The scheduler reads these DAG files at regular intervals.
  2. Scheduling:

    • The scheduler determines which tasks need to be run based on the DAG definitions and the current state of tasks.
    • It places the tasks that need to be executed into a queue.
  3. Task Execution:

    • Workers pick up tasks from the queue and execute them.
    • Since everything runs on a single node, the workers, scheduler, and web server share the same machine resources.
  4. Monitoring and Management:

    • The web server allows users to monitor and manage DAGs and tasks.
    • Users can see task statuses, logs, and other details through the web interface.
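
A minimal airflow.cfg sketch for the single-node setup described above, assuming a local PostgreSQL metadata database (the connection string is a placeholder, and option locations vary slightly between Airflow versions):

[core]
# LocalExecutor runs tasks as parallel subprocesses on the same machine;
# SQLite only supports the SequentialExecutor.
executor = LocalExecutor
dags_folder = /opt/airflow/dags

[database]
# Placeholder credentials and host for the metadata database.
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow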

Benefits of Single Node Architecture

  1. Simplicity:

    • Easy to set up and manage.
    • Ideal for development, testing, and small-scale deployments.
  2. Lower Resource Requirements:

    • All components share the same machine, so there’s no need for multiple servers.
  3. Quick Setup:

    • Faster to deploy as there’s no need to configure a distributed environment.

Limitations of Single Node Architecture

  1. Scalability:

    • Not suitable for large-scale deployments due to resource limitations of a single machine.
    • As the workload increases, the single node can become a bottleneck.
  2. Fault Tolerance:

    • If the single node fails, the entire Airflow environment goes down.
  3. Performance:

    • Limited by the hardware resources of the single node (CPU, memory, disk I/O).

Use Cases

  • Development and Testing: Great for local development and testing of DAGs before deploying to a production environment.
  • Small Projects: Suitable for small projects with light workloads that don’t require high availability or scalability.

In summary, single node architecture in Airflow is a straightforward setup where all components run on a single machine, making it simple to manage but limited in terms of scalability and fault tolerance.

======================================================================================================================================================================================

Multi-Node Architecture with Celery in Airflow

In a multi-node setup, different components of Apache Airflow are distributed across multiple machines. This setup leverages Celery for task execution, allowing for better scalability and fault tolerance. Here’s a detailed look at how this architecture works:

Components in a Multi-Node Setup

  1. Scheduler:

    • The scheduler reads DAGs and schedules tasks for execution.
    • It runs on a separate machine or multiple machines for high availability.
  2. Web Server:

    • The web server provides the user interface for interacting with Airflow.
    • It also runs on a separate machine or multiple machines for load balancing and high availability.
  3. Metadata Database:

    • This central database stores the state and configuration of the Airflow environment.
    • It runs on a dedicated database server or a managed database service.
  4. Celery Workers:

    • Celery workers are responsible for executing the tasks scheduled by the scheduler.
    • Multiple worker nodes can be deployed to distribute the workload.
  5. Celery Broker:

    • The broker is the messaging system that mediates between the scheduler and the workers.
    • Common choices include RabbitMQ and Redis.
  6. Result Backend:

    • The result backend is used to store the results of the task executions.
    • It can be the same as the metadata database or a different storage system.

How It Works in a Multi-Node Setup

  1. DAG Definition:

    • DAGs are defined and placed in a shared storage location accessible by all Airflow components.
  2. Scheduling:

    • The scheduler reads the DAG files and schedules tasks.
    • Tasks are placed in a queue managed by the Celery broker.
  3. Task Execution:

    • Celery workers pick up tasks from the broker and execute them.
    • Tasks are executed on different worker nodes, distributing the load.
  4. Result Storage:

    • Results of task executions are stored in the result backend.
  5. Monitoring and Management:

    • The web server allows users to monitor and manage DAGs and tasks.
    • Users can view the status, logs, and other details through the web interface.

Benefits of Multi-Node Architecture

  1. Scalability:

    • Easy to scale by adding more worker nodes to handle increased workload.
  2. Fault Tolerance:

    • The failure of a single node doesn’t bring down the entire system.
    • Redundancy can be added to the scheduler, web server, and worker nodes.
  3. Performance:

    • Distributing tasks across multiple workers improves performance and reduces the load on individual nodes.

Limitations of Multi-Node Architecture

  1. Complexity:

    • More complex to set up and manage compared to a single node setup.
    • Requires configuring and maintaining multiple components.
  2. Resource Management:

    • Effective resource management is required to ensure all components perform optimally.

Use Cases

  • Large-Scale Deployments: Ideal for production environments with high workloads and the need for scalability.
  • High Availability Requirements: Suitable for use cases where uptime and reliability are critical.

Example Architecture Diagram

Here's a simplified diagram of a multi-node setup using Celery:

          +-------------------+
          | Metadata Database |
          +-------------------+
                    |
          +---------+---------+
          |                   |
    +-----+-----+       +-----+------+
    | Scheduler |       | Web Server |
    +-----+-----+       +-----+------+
          |                   |
    +-----+-------------------+------+
    |         Celery Broker          |
    +--------------------------------+
          |                   |
    +-----+-----+       +-----+------+
    |  Celery   |       |   Celery   |
    |  Worker   |       |   Worker   |
    +-----------+       +------------+

Steps to Set Up Multi-Node Architecture with Celery

  1. Install Celery and a Broker:

    • Install Celery and a message broker like RabbitMQ or Redis.
  2. Configure Airflow to Use Celery:

    • Update the airflow.cfg file to set the executor to CeleryExecutor.
    • Configure the broker URL and result backend (a configuration sketch follows these steps).
  3. Deploy Scheduler and Web Server:

    • Deploy the scheduler and web server on separate machines.
  4. Deploy Celery Workers:

    • Start Celery workers on multiple machines to handle task execution.
  5. Set Up Monitoring and Management Tools:

    • Use tools like Flower to monitor Celery workers.
    • Ensure logs and task states are appropriately stored and accessible.

In summary, a multi-node architecture with Celery in Airflow distributes the workload across multiple machines, enhancing scalability, fault tolerance, and performance. This setup is more complex but essential for large-scale and production environments.
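
For step 2 above, a minimal airflow.cfg sketch assuming Redis as the Celery broker and PostgreSQL for the metadata database and result backend (host names and credentials are placeholders):

[core]
executor = CeleryExecutor

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@db-host:5432/airflow

[celery]
# Messaging layer between the scheduler and the workers.
broker_url = redis://redis-host:6379/0
# Where task results are stored (here, the same PostgreSQL instance).
result_backend = db+postgresql://airflow:airflow@db-host:5432/airflow

With this in place, each worker machine runs airflow celery worker, while the scheduler and web server are started on their own machines with airflow scheduler and airflow webserver.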

======================================================================================================================================================================================

Workflow Overview

This section explains the lifecycle of a DAG (Directed Acyclic Graph) and its tasks in Apache Airflow, highlighting the key steps and states that matter for debugging.

Steps in the Lifecycle of a DAG

  1. Creating a DAG:

    • You create a new DAG by writing a Python script (e.g., dag.py).
    • This script is placed in the dags folder, which is monitored by Airflow.
  2. Scheduler Detection:

    • By default, the Scheduler scans the dags folder every five minutes to detect new DAG files.
    • It might take up to five minutes for a new DAG to appear in the Airflow UI.
  3. Modifying a DAG:

    • When you modify an existing DAG, the Scheduler re-parses the DAG file every 30 seconds (default interval) to detect changes.
    • It might take up to 30 seconds for modifications to be reflected (the corresponding settings are sketched after these steps).
  4. Running a DAG:

    • The Scheduler creates a DAG Run object with the state Running when it starts executing a DAG.
    • The Scheduler then creates task instances for each task in the DAG.
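
These detection intervals correspond to scheduler settings in airflow.cfg; a sketch with their default values (tune as needed):

[scheduler]
# How often (in seconds) the dags folder is scanned for new files.
dag_dir_list_interval = 300
# Minimum interval (in seconds) between re-parses of an already known DAG file.
min_file_process_interval = 30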

Task Lifecycle and States

  1. Task Instance Creation:

    • For each task in the DAG, a task instance object is created.
    • Initial state of the task instance is None.
  2. Task Scheduling:

    • The task instance state changes to Scheduled once it's ready to be picked up for execution.
  3. Task Queuing:

    • The Scheduler sends the task instance object to the Executor’s queue.
    • The state of the task instance changes to Queued.
  4. Task Execution:

    • The Executor creates a subprocess to run the task.
    • The state of the task instance changes to Running.
  5. Task Completion:

    • Upon completion, the task instance state changes to Success or Failed based on the outcome.
  6. DAG Completion:

    • The Scheduler checks if all tasks in the DAG have been executed.
    • If all tasks are completed, the DAG Run state changes to Success.

Monitoring and Debugging

  • Airflow UI:
    • You can monitor the states of both the DAG Run and individual task instances in the Airflow UI.
    • The UI will reflect the states as None, Scheduled, Queued, Running, Success, or Failed.

Summary of Key Points

  • Scheduler Detection Intervals: New DAGs are detected every five minutes, and modifications to existing DAGs are detected every 30 seconds.
  • State Transitions:
    • DAG Run States: Running, Success, or Failed.
    • Task Instance States: None, Scheduled, Queued, Running, Success, or Failed.
  • Execution Flow:
    • The Scheduler schedules tasks and places them in the Executor's queue.
    • The Executor runs tasks as subprocesses.
    • Task and DAG states are updated and can be monitored via the Airflow UI.

Diagram of the Process

Here’s a simplified diagram to visualize the process:

+----------------------------------+
| Create DAG (dag.py)              |
+----------------------------------+
                 |
+----------------------------------+
| Place in dags/ folder            |
+----------------------------------+
                 |
+----------------------------------+
| Scheduler detects DAG            |
+----------------------------------+
                 |
+----------------------------------+
| Scheduler creates DAG Run        |
+----------------------------------+
                 |
+----------------------------------+
| Create Task Instances            |
+----------------------------------+
                 |
+----------------------------------+
| Schedule Tasks                   |
+----------------------------------+
                 |
+----------------------------------+
| Queue Tasks                      |
+----------------------------------+
                 |
+----------------------------------+
| Run Tasks (Executor)             |
+----------------------------------+
                 |
+----------------------------------+
| Task Completion (Success/Failed) |
+----------------------------------+
                 |
+----------------------------------+
| DAG Run Completion               |
+----------------------------------+

In summary, understanding these steps and states helps in debugging tasks and data pipelines in Apache Airflow by providing clear checkpoints to monitor and identify issues.