What we cover

In this tutorial we are going to install Apache Airflow on your system and implement a basic pipeline.


Windows Subsystem for Linux 2 (WSL2)

If you have a Windows machine, you need Windows Subsystem for Linux 2 (WSL2). Here, we follow the instructions provided by Microsoft's Craig Loewen to set up WSL2 (see this post to learn more).

wsl.exe --install

This will automatically install the open source operating system Ubuntu and the latest WSL Linux kernel version onto your machine (inside a virtual machine). You can then use Linux command line tools to install packages, run commands, and interact with the Linux kernel.

Your distribution will start after you boot up again, completing the installation.

You can use

wsl --update

to manually update your WSL Linux kernel, and you can use

wsl --update rollback

to roll back to a previous WSL Linux kernel version. To learn more about WSL, take a look at this post from Microsoft: "What is the Windows Subsystem for Linux?".

Install Miniforge

Next, we install Miniforge (an alternative to Anaconda and Miniconda) on your Linux system. We use wget to download the installer directly from the terminal:

wget https://github.com/conda-forge/miniforge/releases/download/4.12.0-0/Miniforge3-4.12.0-0-Linux-x86_64.sh
sh Miniforge3-4.12.0-0-Linux-x86_64.sh

Install Visual Studio Code

It is also recommended to install Visual Studio Code in your new Linux system.

To learn more, read the post "Tips and Tricks for Linux development with WSL and Visual Studio Code".

Miniforge

For this tutorial, I recommend using Miniforge (the community-led alternative to Anaconda we installed above):

On Windows, open your Linux (WSL2) terminal; on macOS or Linux, open a terminal window.

We create an environment with a specific version of Python and install pip. We call the environment airflow (if Python 3.10 is not available on your system, you can replace it with 3.9 or 3.8):

conda create -n airflow python=3.10 pip

When conda asks you to proceed (proceed ([y]/n)?), type y.

To install Airflow, we mainly follow the installation tutorial provided by Apache Airflow. Note that we use pip to install Airflow and some additional modules in our environment. When pip asks you to proceed (proceed ([y]/n)?), simply type y.

conda activate airflow
pip install --upgrade pip
pip install virtualenv

Here are the commands for macOS and Linux:

export AIRFLOW_HOME=~/airflow
pip install "apache-airflow==2.3.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.1/constraints-3.10.txt"
pip install pandas
pip install -U scikit-learn
airflow standalone

We only run airflow standalone once, when we first install Airflow. If you want to run the individual parts of Airflow manually rather than using the all-in-one standalone command, check out the instructions provided here.

In this section, we take a look at how to start Airflow:

conda activate airflow
export AIRFLOW_HOME=~/airflow
airflow webserver

Here, we mainly follow the instructions provided in this Apache Airflow tutorial:

conda activate airflow
export AIRFLOW_HOME=~/airflow
python ~/airflow/dags/my_airflow_dag.py

If the script does not raise an exception, it means that you have not done anything wrong, and that your Airflow environment is somewhat sound.

If you want to learn more about the content of the my_airflow_dag.py script, review the Airflow tutorial.
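As a point of reference, here is a minimal sketch of what a my_airflow_dag.py script could contain, modeled on the official Airflow tutorial DAG. The task ids (task_print_date, task_sleep, task_templated) match the ones we test below; the commands and schedule are illustrative, not the exact script:

```python
# A minimal DAG sketch, modeled on the official Airflow tutorial.
# Save as ~/airflow/dags/my_airflow_dag.py (requires an Airflow environment).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="my_airflow_dag",
    start_date=datetime(2021, 5, 20),
    schedule_interval=timedelta(days=1),
) as dag:

    # Print the current date
    task_print_date = BashOperator(
        task_id="task_print_date",
        bash_command="date",
    )

    # Sleep for a few seconds; retry once on failure
    task_sleep = BashOperator(
        task_id="task_sleep",
        bash_command="sleep 5",
        retries=1,
    )

    # Use Jinja templating to access the logical date ({{ ds }})
    task_templated = BashOperator(
        task_id="task_templated",
        bash_command="echo 'Logical date is {{ ds }}'",
    )

    # task_print_date runs first, then the other two tasks
    task_print_date >> [task_sleep, task_templated]
```

The `>>` operator at the end defines the dependencies: task_sleep and task_templated only run after task_print_date has succeeded.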

First, we use the command line to do some metadata validation. Let's run a few commands in your terminal to test your script:

airflow db init
airflow dags list
airflow tasks list my_airflow_dag
airflow tasks list my_airflow_dag --tree

Let's start our tests by running one actual task instance for a specific date (independent of other tasks).

The date specified in this context is called the "logical date" (also called execution date), which simulates the scheduler running your task or DAG for a specific date and time, even though it physically will run now (or as soon as its dependencies are met).

This is because each run of a DAG conceptually represents not a specific date and time, but an interval between two times, called a data interval. A DAG run's logical date is the start of its data interval.
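The relationship between logical date and data interval can be illustrated with plain datetime arithmetic (no Airflow required). For a daily schedule, the logical date 2021-05-20 used below stands for the interval from 2021-05-20 to 2021-05-21:

```python
# Illustration of "logical date" vs. data interval for a daily schedule.
from datetime import datetime, timedelta

schedule = timedelta(days=1)          # a daily DAG
logical_date = datetime(2021, 5, 20)  # the date we pass to `airflow tasks test`

# The logical date is the *start* of the run's data interval
data_interval_start = logical_date
data_interval_end = logical_date + schedule

print(data_interval_start)  # 2021-05-20 00:00:00
print(data_interval_end)    # 2021-05-21 00:00:00
```

In normal scheduled operation, the run for this interval is only triggered after the interval has ended, i.e. at data_interval_end.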

The general command layout is as follows:

command subcommand dag_id task_id date
airflow tasks test my_airflow_dag task_print_date 2021-05-20

Take a look at the last lines in the output (ignore warnings for now):

airflow tasks test my_airflow_dag task_sleep 2021-05-20
airflow tasks test my_airflow_dag task_templated 2021-05-20

Everything looks like it's running fine, so let's run a backfill.

backfill will respect your dependencies, emit logs into files, and talk to the database to record status.

If you have a webserver up (airflow webserver starts one), you will be able to track the progress visually as your backfill proceeds.

airflow dags backfill my_airflow_dag \
    --start-date 2021-05-20 \
    --end-date 2021-06-01

Let's proceed to the Airflow user interface (UI) - see next step.

Note that if you use depends_on_past=True, individual task instances will depend on the success of their previous task instance (that is, previous according to the logical date). In that case you may want to consider setting wait_for_downstream=True as well. While depends_on_past=True causes a task instance to depend on the success of its previous task instance, wait_for_downstream=True will cause it to also wait for all task instances immediately downstream of the previous task instance to succeed.
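These flags are typically set via a default_args dictionary that is passed to the DAG. The keys below follow Airflow's BaseOperator arguments; the combination shown is illustrative:

```python
# Hypothetical default_args for a DAG; keys follow Airflow's BaseOperator arguments.
default_args = {
    "depends_on_past": True,     # wait for the previous run of the same task
    "wait_for_downstream": True  # also wait for its immediate downstream tasks
}

# These would then be passed to the DAG, e.g.:
#   DAG("my_airflow_dag", default_args=default_args, ...)
print(default_args)
```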

airflow webserver

Open the Airflow web interface in your browser (by default at http://localhost:8080):

Now start experimenting with the Airflow web interface:

Congratulations! You have completed the tutorial and learned how to:

✅ Install Apache Airflow
✅ Start Apache Airflow
✅ Create a simple pipeline

Next, you may want to proceed with this tutorial to build a simple Python machine learning pipeline using pandas and scikit-learn:


Thank you for participating in this tutorial. If you found any issues along the way I'd appreciate it if you'd raise them by clicking the "Report a mistake" button at the bottom left of this site.

Jan Kirenz | kirenz.com | CC BY-NC 2.0 License