Python, Data Engineering
Airflow
Aug 14, 2018

1. What is Airflow and why would you use it?

Airflow is a platform for programmatically authoring, scheduling and monitoring workflows: you describe each job as a DAG (directed acyclic graph) of tasks in Python, and Airflow takes care of running and tracking it.

Wait, you may say, I can do that with cron!

Yes, you can, but with Airflow:

- you get a web UI showing the state of every run, with logs one click away,
- failed tasks can be retried automatically and you can get alerted by email,
- you can express dependencies between tasks instead of guessing at timings (see the sketch below),
- you can backfill past runs when you add a new job or fix an old one.

Convinced? ;)
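
To make the dependency point concrete, here is a minimal sketch with hypothetical task ids; plain cron has no equivalent of the last line:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

dag = DAG('cron_cant_do_this', start_date=datetime(2018, 1, 1), schedule_interval='@daily')

# 'load' will only start once 'extract' has finished successfully
extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)
extract >> load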

2. Installation and startup

pip install apache-airflow

Not very complicated.

Before the first run, initialize the metadata database, then start the scheduler and the webserver:

airflow initdb
airflow scheduler
airflow webserver

The web UI is then available at http://localhost:8080 by default.

3. Best practices

You may feel tempted to create a git repository in your DAG folder, but this is not the best solution. It’s much easier and more logical to keep your DAG file in the repo where your project lives and symlink it into the DAG folder:

ln -s /path-to-your-project-repo/my_project_dag.py /home/me/airflow/dags/

DAG names and DAG file names

Keep the DAG name and the DAG file name the same (or at least obviously related), so that when you see a DAG in the web UI you immediately know which file defines it.
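
A minimal sketch of what that looks like, with a hypothetical project name:

# file: my_project_dag.py
from airflow import DAG
from datetime import datetime

dag = DAG(
  'my_project_dag',  # the dag_id matches the file name
  start_date=datetime(2018, 1, 1),
  schedule_interval='@daily'
)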

Jinja templating

You can pass arguments to the command with Jinja templating instead of building the command string yourself. You can then keep all your parameters in a separate JSON file.

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime


default_args = {
  'depends_on_past': False,  # don't wait for the previous run to have succeeded
  'start_date': datetime(2018, 1, 15, 16, 45, 0),
  'email': ['test_mail@gmail.com'],
  'email_on_failure': False,
  'retries': 0
}

dag = DAG(
  'dag_name',
  schedule_interval='0 12 * * *',  # every day at noon
  default_args=default_args,
  catchup=False  # don't backfill runs between start_date and today
)

first_task = BashOperator(
  task_id='first_task',
  # the {{ ... }} placeholders are rendered by Jinja before the command runs
  bash_command='echo {{ params.number }} {{ params.subparam.one }}',
  # you can keep params in a json file
  params={'number': '10', 'subparam': {'one': '1'}},
  dag=dag
)
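
Since the params above could live in a JSON file, here is a minimal sketch of loading them, assuming a hypothetical params.json next to the DAG file containing the dict shown above:

import json
import os

# params.json would contain: {"number": "10", "subparam": {"one": "1"}}
params_path = os.path.join(os.path.dirname(__file__), 'params.json')
with open(params_path) as f:
  params = json.load(f)

first_task = BashOperator(
  task_id='first_task',
  bash_command='echo {{ params.number }} {{ params.subparam.one }}',
  params=params,
  dag=dag
)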

Further reading: a good tutorial and another good tutorial.

Airflow’s purpose is rather straightforward, so the best way to learn it is by doing.