In this tutorial we are going to install Apache Airflow on your system. Furthermore, we will implement a basic pipeline.
If you have a Windows machine, you need Windows Subsystem for Linux 2 (WSL2). Here, we follow the instructions provided by Microsoft's Craig Loewen to set up WSL2 (see this post to learn more).
wsl.exe --install
This will automatically install the open source operating system Ubuntu and the latest WSL Linux kernel version onto your machine (inside a virtual machine). This means that, inside this environment, you use the Linux command line tools to install packages, run commands, and interact with the Linux kernel.
Your distribution will start after you boot up again, completing the installation.
To open your Linux distribution later, enter wsl or bash in PowerShell.
You can use
wsl --update
to manually update your WSL Linux kernel, and you can use
wsl --update rollback
to roll back to a previous WSL Linux kernel version. To learn more about WSL, take a look at this post from Microsoft: "What is the Windows Subsystem for Linux?".
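Before continuing, it can be useful to confirm that a terminal really is a WSL session. The following is a minimal sketch, assuming the common heuristic that WSL kernels report a release string containing "microsoft". The function name looks_like_wsl is hypothetical, and the check is a convention rather than an official API.

```python
import platform


def looks_like_wsl(kernel_release: str) -> bool:
    """Heuristic: WSL kernel release strings usually contain 'microsoft'."""
    return "microsoft" in kernel_release.lower()


if __name__ == "__main__":
    # platform.release() returns the kernel release of the running system.
    release = platform.release()
    print(f"Kernel release: {release}")
    print(f"Looks like WSL: {looks_like_wsl(release)}")
```

Python is available once the conda environment from the following steps exists, so this check is best run after the Miniforge installation.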
Next, we install Miniforge (an alternative to Anaconda and Miniconda) on your Linux system.
Open your Linux terminal by entering wsl or bash in PowerShell.
Next, we download the Miniforge installer with wget (we use wget to download directly from the terminal).
Choose the installer that matches your architecture (e.g. x86_64 (amd64)). Here is the example for x86_64 (amd64):
wget https://github.com/conda-forge/miniforge/releases/download/4.12.0-0/Miniforge3-4.12.0-0-Linux-x86_64.sh
sh Miniforge3-4.12.0-0-Linux-x86_64.sh
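The download URL above encodes both the Miniforge release and the machine architecture. As a rough sketch, the URL can be assembled from those two parts. This assumes that other architectures follow the same file naming as the x86_64 example, which should be checked against the Miniforge releases page. The function name miniforge_installer_url is hypothetical.

```python
import platform

# Release tag taken from the wget command above.
MINIFORGE_VERSION = "4.12.0-0"


def miniforge_installer_url(version: str, machine: str) -> str:
    """Build the download URL for a Linux Miniforge installer.

    Assumes the asset naming seen in the wget command above:
    Miniforge3-<version>-Linux-<machine>.sh
    """
    filename = f"Miniforge3-{version}-Linux-{machine}.sh"
    return (
        "https://github.com/conda-forge/miniforge/releases/download/"
        f"{version}/{filename}"
    )


if __name__ == "__main__":
    # On Linux, platform.machine() reports the same value as `uname -m`.
    print(miniforge_installer_url(MINIFORGE_VERSION, platform.machine()))
```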
It is also recommended to install Visual Studio Code in your new Linux system.
To learn more, read the post "Tips and Tricks for Linux development with WSL and Visual Studio Code".
To start this tutorial, I recommend using Miniforge (a community-led alternative to Anaconda).
On Windows, open your Linux terminal. On macOS or Linux, open a terminal window.
We create an environment with a specific version of Python and install pip. We call the environment airflow (if you don't have Python 3.10, you can replace it with 3.9 or 3.8):
conda create -n airflow python=3.10 pip
When conda asks you to proceed (proceed ([y]/n)?), type y.
To install Airflow, we mainly follow the installation tutorial provided by Apache Airflow. Note that we use pip to install Airflow and some additional modules in our environment. If you are asked to proceed (proceed ([y]/n)?), simply type y.
conda activate airflow
pip install --upgrade pip
We also need virtualenv, so we install it:
pip install virtualenv
Airflow needs a home directory; your-home-directory/airflow is the default. Here is the command for Mac and Linux:
export AIRFLOW_HOME=~/airflow
pip install "apache-airflow==2.3.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.1/constraints-3.10.txt"
pip install pandas
pip install -U scikit-learn
The airflow standalone command initializes the database, creates a user, and starts all components:
airflow standalone
We only run this command once when we install Airflow. If you want to run the individual parts of Airflow manually rather than using the all-in-one standalone command, check out the instructions provided here.
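The constraint file used in the pip install command above is tied to both the Airflow version and the Python version of your environment. If you created the environment with Python 3.9 or 3.8, the last part of the URL has to change accordingly. Here is a minimal sketch that builds the matching URL, assuming the URL pattern visible in the command above. The function names are hypothetical.

```python
import sys

# Airflow version used in this tutorial.
AIRFLOW_VERSION = "2.3.1"


def constraint_url(airflow_version: str, python_version: str) -> str:
    """Build the constraints URL for an Airflow and Python version pair."""
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


def current_python_version() -> str:
    """Return the running interpreter's version as 'major.minor'."""
    return f"{sys.version_info.major}.{sys.version_info.minor}"


if __name__ == "__main__":
    url = constraint_url(AIRFLOW_VERSION, current_python_version())
    # Print the full install command for the active interpreter.
    print(f'pip install "apache-airflow=={AIRFLOW_VERSION}" --constraint "{url}"')
```

Run the script inside the activated airflow environment so that the reported Python version is the one pip installs into.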
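Look for the username and password in the terminal output and store them somewhere.
Log in to the Airflow UI with your username and password.
To stop Airflow, press Ctrl+c in the terminal (this will shut down components).
In this section, we take a look at how to start Airflow: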
Activate your airflow environment if needed:
conda activate airflow
export AIRFLOW_HOME=~/airflow
airflow webserver
Log in with your username and password.
To stop Airflow, press Ctrl+c in the terminal (this will shut down all components).
Here, we mainly follow the instructions provided in this Apache Airflow tutorial:
Create a folder called dags in your Airflow home (i.e. ~/airflow/dags).
Create a file called my_airflow_dag.py in your ~/airflow/dags folder.
Activate your airflow environment if needed:
conda activate airflow
export AIRFLOW_HOME=~/airflow
python ~/airflow/dags/my_airflow_dag.py
If the script does not raise an exception, you have not made any serious mistakes and your Airflow environment is somewhat sound.
If you want to learn more about the content of the my_airflow_dag.py script, review the Airflow tutorial.
First, we use the command line to do some metadata validation. Let's run a few commands in your terminal to test your script:
airflow db init
airflow dags list
airflow tasks list my_airflow_dag
airflow tasks list my_airflow_dag --tree
Let's start our tests by running one actual task instance for a specific date (independent of other tasks).
The date specified in this context is called the "logical date" (also called execution date), which simulates the scheduler running your task or DAG for a specific date and time, even though it physically will run now (or as soon as its dependencies are met).
This is because each run of a DAG conceptually represents not a specific date and time, but an interval between two times, called a data interval. A DAG run's logical date is the start of its data interval.
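To make the relation between logical date and data interval concrete, here is a small sketch in plain Python (no Airflow required). It assumes a fixed daily schedule, which is an assumption for illustration only. The function name data_interval is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple


def data_interval(logical_date: datetime, schedule: timedelta) -> Tuple[datetime, datetime]:
    """Return (start, end) of the data interval for a fixed-interval schedule.

    The logical date marks the start of the interval. The run itself
    can only be scheduled once the interval has ended.
    """
    return logical_date, logical_date + schedule


if __name__ == "__main__":
    start, end = data_interval(datetime(2021, 5, 20), timedelta(days=1))
    print(f"logical date:       {start:%Y-%m-%d %H:%M}")
    print(f"data interval ends: {end:%Y-%m-%d %H:%M}")
```

With a daily schedule, the run with logical date 2021-05-20 therefore covers the data from 2021-05-20 up to 2021-05-21.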
The general command layout is as follows:
command subcommand dag_id task_id date
Test task_print_date:
airflow tasks test my_airflow_dag task_print_date 2021-05-20
Take a look at the last lines in the output (ignore warnings for now).
Test task_sleep:
airflow tasks test my_airflow_dag task_sleep 2021-05-20
Test task_templated:
airflow tasks test my_airflow_dag task_templated 2021-05-20
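The three test commands above share the layout command subcommand dag_id task_id date. If you prefer to generate them rather than type them, the following sketch assembles the argument lists in Python. The helper name tasks_test_command is hypothetical, and the script only prints the commands instead of running them.

```python
import shlex
from typing import List

DAG_ID = "my_airflow_dag"
TASK_IDS = ["task_print_date", "task_sleep", "task_templated"]
LOGICAL_DATE = "2021-05-20"


def tasks_test_command(dag_id: str, task_id: str, date: str) -> List[str]:
    """Assemble: command subcommand dag_id task_id date."""
    return ["airflow", "tasks", "test", dag_id, task_id, date]


if __name__ == "__main__":
    for task_id in TASK_IDS:
        argv = tasks_test_command(DAG_ID, task_id, LOGICAL_DATE)
        # To execute instead of print: subprocess.run(argv, check=True)
        print(shlex.join(argv))
```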
Everything looks like it's running fine so let's run a backfill.
backfill will respect your dependencies, emit logs into files and talk to the database to record status.
If you do have a webserver up, you will be able to track the progress. airflow webserver will start a web server if you are interested in tracking the progress visually as your backfill progresses.
The date range in this context is a start_date and optionally an end_date, which are used to populate the run schedule with task instances from this DAG.
airflow dags backfill my_airflow_dag \
--start-date 2021-05-20 \
--end-date 2021-06-01
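To get a feeling for how many runs a backfill creates, the sketch below lists the logical dates for the date range used above. It assumes a daily schedule and that both the start date and the end date are included. Both are assumptions for illustration, and the function name daily_logical_dates is hypothetical.

```python
from datetime import date, timedelta
from typing import Iterator


def daily_logical_dates(start: date, end: date) -> Iterator[date]:
    """Yield one logical date per day from start to end (both inclusive)."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


if __name__ == "__main__":
    runs = list(daily_logical_dates(date(2021, 5, 20), date(2021, 6, 1)))
    print(f"{len(runs)} runs, first {runs[0]}, last {runs[-1]}")
```

Each of these dates corresponds to one DAG run, and every run contains one task instance per task in the DAG.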
Let's proceed to the Airflow user interface (UI) - see next step.
Note that if you use depends_on_past=True, individual task instances will depend on the success of their previous task instance (that is, previous according to the logical date). In that case, you may want to consider setting wait_for_downstream=True when using depends_on_past=True. While depends_on_past=True causes a task instance to depend on the success of its previous task instance, wait_for_downstream=True will cause a task instance to also wait for all task instances immediately downstream of the previous task instance to succeed.
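Both settings are ordinary task arguments, so one way to apply them to every task is through the default_args dictionary of the DAG. The following is a minimal sketch. The retry values are hypothetical placeholders, and the dictionary would be passed to the DAG as DAG(..., default_args=default_args).

```python
from datetime import timedelta

# Hypothetical default_args; each key is applied to every task in the DAG.
default_args = {
    # Wait for the same task in the previous run to succeed.
    "depends_on_past": True,
    # Also wait for the tasks immediately downstream of that previous task instance.
    "wait_for_downstream": True,
    # Placeholder retry settings, not taken from this tutorial.
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

if __name__ == "__main__":
    for key, value in default_args.items():
        print(f"{key} = {value!r}")
```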
airflow webserver
Open the Airflow web interface in your browser:
Now start experimenting with the Airflow web interface:
Select my_airflow_dag from the list of DAGs.
Congratulations! You have completed the tutorial and learned how to:
✅ Install Apache Airflow
✅ Start Apache Airflow
✅ Create a simple pipeline
Next, you may want to proceed with this tutorial to build a simple Python machine learning pipeline using pandas and scikit-learn:
Thank you for participating in this tutorial. If you found any issues along the way I'd appreciate it if you'd raise them by clicking the "Report a mistake" button at the bottom left of this site.
Jan Kirenz | kirenz.com | CC BY-NC 2.0 License