Airflow

Daria Barysheva

Contents

Data model
Idempotence
Architecture

Airflow is a platform that allows to develop, plan, run, and monitor complex workflows. It fits perfectly with ETL/ELT processes and also can be useful if you need to periodically run any processes and monitor their execution. In order to describe processes, Aiflow uses the Python programming language.

Data model

The Airflow data model uses the following basic objects.

DAG (Directed acyclic graph)

It is the core concept of Airflow representing a semantic association of Tasks that should be performed in the strictly defined order — according to the specified schedule. Visually, the DAG looks like a directed acyclic graph, i.e. a graph without cyclic dependencies.

The simple DAG shown below consists of three Tasks. You can see that it defines the order of running Tasks and dependencies between them. Using Airflow you can also configure the scheduler for this DAG, actions in case of failure, number of retries, etc. The DAG does not care about what happens with data — this function is performed by Tasks themselves.

A simple DAG

Task

A task is a DAG’s node that describes operations to be applied to the data. For example, loading data from external sources, aggregating and indexing, removing duplicates, and other typical ETL operations.

Operator

This is a template for completing Tasks. While Tasks describe what actions to perform with data, Operators define how to perform these actions at the code level — using Python functions, Bash scripts, etc.

Airflow supports a lot of built-in core Operators. Besides that, you can use a huge set of community provider packages. Also, you can create your own Operators by extending the BaseOperator class.

A separate type of Operators — Sensors — allow users to define a reaction to certain events: the occurrence of a specific time, loading some data, starting of other DAGs/Tasks, etc.

Examples of useful Operators are listed below.

Airflow Operators

Operator

Description

PythonOperator

Execution of the Python code

BranchPythonOperator

Execution of the code depending on the specified conditions

BashOperator

Running Bash scripts

SimpleHttpOperator

Sending HTTP requests

PostgresOperator

Sending queries to a PostgreSQL database

MySqlOperator

Sending queries to a MySQL database

SqlSensor

Verifying the SQL query execution

DockerOperator

Running a Docker container for the Task

KubernetesPodOperator

Creating a separate Kubernetes pod for the Task

EmailOperator

Sending Emails

DummyOperator

The "empty" Operator that is used for grouping Tasks

Idempotence

One of the most important concepts underlying AirFlow is storing information about every DAG’s start time according to the specified schedule. For example, if you specify that the DAG should start at 01.01.2022 00:00:00 once a day, AirFlow will store information about running of this DAG using the following timestamps: 01.01.2022 00:00:00, 02.01.2022 00:00:00, 03.01.2022 00:00:00, and so on. The timestamps used in this example are called execution_date, the corresponding DAG instances — DAG Runs, and the Task entities associated with specific DAG Runs — Task Instances.

This concept is very important for maintaining the idempotence: start or restart of the Task with the same dataset for some date in the past allows you to reproduce the results obtained earlier. In addition, it is possible to run Tasks of one DAG for different timestamps simultaneously.

The concept of Task Instances and Dag Runs

Architecture

Airflow architecture is based on the following components:

Web Server. Provides a convenient user interface that allows to run and monitor DAGs and Tasks, track their schedule, execution status, and so on.
Metadata Database. A shared database that is used by Scheduler, Executor, and Web Server components to store the system metadata: global variables, data source connection settings, statuses of Task Instances and DAG Runs, etc. As this repository is based on the SQLAlchemy library, it requires the installation of a compatible database. For example, MySQL or PostgreSQL.
Scheduler. A service responsible for planning in Airflow. It handles both triggering scheduled workflows and submitting Tasks to the Executor for running.

The Scheduler analyzes the results of DAGs parsing and checks the existence of Tasks ready to run regularly (by default, once a minute). When the predefined conditions for Tasks running are met, the scheduler initializes Task Instances. To perform active Tasks, the scheduler uses the Executor specified in the settings.

For certain Metadata Database versions (PostgreSQL 9.6+ and MySQL 8+), using of several schedulers is supported. It helps to provide the maximum fault tolerance.

Executor. A mechanism that runs Task Instances. It works in conjunction with the Scheduler. In the default Airflow installation, the Executor runs everything inside the Scheduler (within the same process). But for production needs it is recommended to use Executors that push Tasks execution out to Workers.

Executor types

Type

Description

SequentialExecutor

Launches the Tasks sequentially and suspends the Scheduler for the duration of their execution. That is why this Executor is recommended only for testing — it is not suitable for a production environment

LocalExecutor

Starts a new child process for each Task, that allows to process multiple Tasks in parallel. It perfectly simulates a production environment in a test environment, but it is also not recommended for real use since the low fault tolerance: in case of a failure on the machine where Executor is running, the Task is not transferred to other nodes

CeleryExecutor

Allows to have multiple Workers running on different nodes. The executor uses Celery and requires additional installation of the message broker. For example, Redis or RabbitMQ. When the load increases, it is enough to connect an additional Worker. If any Worker fails, its work is transferred to the rest of the nodes

DaskExecutor

It is similar to CeleryExecutor, but instead of the Celery for parallel computing it uses the Dask library

KubernetesExecutor

Launches a new Worker on a separate pod in Kubernetes for each Task Instance

CeleryKubernetesExecutor

Allows to run CeleryExecutor and KubernetesExecutor simultaneously. The specific type is selected depending on the status of the Tasks queue

DebugExecutor

Used to run and debug pipelines from the IDE

Worker. A separate process in which Tasks are executed. The location of the Worker is determined by the selected type of the Executor.
DAG Directory. A folder of DAG files that can be read by the Scheduler, Executor, and any Worker that the Executor has.

IMPORTANT
All DAGs in the form of python-scripts should be located in the DAG Directory. Web UI is used only for monitoring DAGs, not for adding or editing.

The high-level architecture view of Airflow is shown below. Depending on the Executor type, this scheme can contain additional components: for example, the Message Queue between the Executor and its Workers for the CeleryExecutor. However, in such particular cases you can still think of an Executor and its Workers as a single logical component that handles the Tasks execution.

Airflow architecture

NOTE

For more details about Airflow concepts, see AirFlow documentation.

Found a mistake? Seleсt text and press Ctrl+Enter to report it