Airflow: Overview & Architecture

Deepak Belwal
3 min read · Mar 4, 2024


Airflow’s architecture allows for scalability and flexibility in managing workflows. It supports various executors, databases, and message brokers, enabling users to customize their setup based on their specific requirements.

Airflow can be configured in different architectures, from a single-machine deployment to a fully distributed one, depending on specific requirements.
Here are some key aspects of Airflow’s architecture:

  1. Scheduler: The scheduler is responsible for triggering and orchestrating the execution of workflows. It determines when tasks should be executed based on their dependencies and schedules.
  2. Executor: The executor is responsible for running tasks. Airflow supports several executor types, such as SequentialExecutor, LocalExecutor, CeleryExecutor, and KubernetesExecutor. The choice of executor depends on factors like scalability, parallelism, and resource management (a configuration sketch follows this list).
  3. Web Server and UI: Airflow provides a web server that serves as the user interface for managing workflows. It allows users to visualize DAGs, monitor task status, and manage configurations.
  4. Metadata Database: Airflow uses a metadata database, such as PostgreSQL or MySQL, to store information about DAGs, tasks, and their execution history. The metadata database helps maintain the state of workflows and provides valuable data for monitoring and troubleshooting.
  5. Message Broker (Optional): In distributed setups using executors like Celery, a message broker such as RabbitMQ or Redis is used to manage communication between the scheduler and workers. The message broker ensures reliable and efficient task execution in a distributed environment.
  6. Workers: Workers are responsible for executing tasks in a distributed setup. They receive task information from the scheduler via the message broker and execute tasks based on the dependencies and configurations defined in the DAG.
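
The choice of executor, metadata database, and message broker all come together in Airflow's configuration (airflow.cfg or AIRFLOW__SECTION__KEY environment variables). As a minimal sketch, assuming Airflow 2.3+ where the metadata connection lives in the [database] section (older releases keep it under [core]), you can inspect how an installation is wired up from Python:

    from airflow.configuration import conf

    # Which executor runs tasks: SequentialExecutor, LocalExecutor, CeleryExecutor, ...
    print(conf.get("core", "executor"))

    # Connection string for the metadata database (e.g. PostgreSQL or MySQL)
    print(conf.get("database", "sql_alchemy_conn"))

    # Broker used by Celery-based deployments (RabbitMQ or Redis); only meaningful in distributed setups
    print(conf.get("celery", "broker_url"))

The same values are typically set through airflow.cfg or environment variables such as AIRFLOW__CORE__EXECUTOR, which is how one codebase can move from a local setup to a distributed one without changing any DAG code.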

How Jobs Run in Apache Airflow

In Apache Airflow, jobs are represented as Directed Acyclic Graphs (DAGs) consisting of tasks and their dependencies. Here’s an overview of how jobs run in Airflow:

  1. DAG Definition: You define a DAG by writing Python code that specifies the tasks and their dependencies; a minimal example follows this list. Each task represents a unit of work, such as running a script, executing a SQL query, or triggering an external job.
  2. Scheduler: The Airflow scheduler continuously monitors the defined DAGs and their dependencies. It determines when a task is ready to be executed based on its dependencies and the specified schedule. The scheduler triggers task instances to be executed.
  3. Task Execution: When a task instance is triggered, Airflow assigns it to an available executor for execution. Executors can be local or distributed, such as Celery, Kubernetes, or Dask. The executor runs the task by executing the defined code or triggering the external job.
  4. Task Dependencies: Airflow ensures that tasks are executed in the correct order based on their dependencies. Tasks can have dependencies on other tasks, and Airflow ensures that a task is only executed after its dependencies have successfully completed.
  5. Monitoring and Logging: Airflow provides a web server that allows you to monitor the status of tasks and workflows. You can view the progress of task execution, check logs, and track the overall status of the job. Airflow also supports email notifications and alerts for task failures or other events.
  6. Retries and Backfilling: Airflow has built-in mechanisms for handling task failures. It supports configurable task retries, allowing failed tasks to be retried automatically. Airflow also supports backfilling, which lets you run a DAG for a historical date range.
  7. Parallelism: Airflow can execute tasks in parallel, depending on the available resources and the configuration settings. Parallelism can be achieved by using distributed executors or by configuring the concurrency settings in Airflow.
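
Putting several of these pieces together, here is a minimal sketch of a DAG with two dependent tasks, a daily schedule, and automatic retries. The dag_id, task names, and schedule are illustrative, and the schedule argument is called schedule_interval on Airflow versions older than 2.4:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract():
        print("extracting data")


    def load():
        print("loading data")


    with DAG(
        dag_id="example_etl",              # illustrative name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",                 # the scheduler triggers one run per day
        catchup=False,                     # do not automatically backfill past runs
        default_args={
            "retries": 2,                          # retry a failed task twice
            "retry_delay": timedelta(minutes=5),   # wait 5 minutes between retries
        },
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)

        # load runs only after extract has completed successfully
        extract_task >> load_task

Backfilling a historical window for a DAG like this is typically done from the CLI (airflow dags backfill with a start and end date), and parallelism is tuned through global settings such as parallelism and per-DAG settings such as max_active_tasks.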

Conclusion:

It’s important to note that Airflow provides a flexible and extensible framework for job scheduling and orchestration. It supports a wide range of integrations and can be customized to fit specific use cases.
Airflow is a tool that models workflows as DAGs of tasks. Workflows are defined in Python and can be easily scheduled and monitored.
