
Apache Airflow for data pipelines and ETL management

I have been looking for good workflow management software and found Apache Airflow to be superior to other solutions.

I’ve taken some time to write a fairly detailed blog post on using Airflow to develop ETL pipelines.

Airflow is a great tool which allows you to:

  • centrally manage and track the execution of all your ETL jobs using a web UI
  • manage shared connections to databases
  • implement complex dependencies between tasks in the form of a directed acyclic graph (DAG)
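
To make the last point concrete, here is a minimal sketch of what "dependencies as a directed acyclic graph" means. It is not Airflow code: it uses Python's standard-library `graphlib` (Python 3.9+) as a stand-in, and the task names are made up for illustration. Each task lists the tasks it depends on, and a topological sort yields an order in which every task runs only after its upstream tasks have finished.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_staging": {"extract_orders", "extract_customers"},
    "upsert_target": {"load_staging"},
    "report": {"upsert_target"},
}


def execution_order(deps):
    """Return one valid run order; raises graphlib.CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The two extract tasks have no dependencies, so they may come out in either order; everything downstream is fixed by the graph. A scheduler such as Airflow works from the same kind of graph, with scheduling, retries and logging added on top.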

In the blog post I cover a detailed implementation of two pipelines: one from Amazon S3 to Redshift, and another from one table in S3 to another table using an upsert. I also show how Airflow is used to administer tasks and track logs, among other things.
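
For readers who have not met the term, an upsert updates rows that already exist in the target and inserts the ones that do not. The sketch below shows one common way to do it: load the new data into a staging table, delete the matching rows from the target, then insert everything from staging, all inside one transaction. This is an illustration under assumptions, not the implementation from the blog post: it uses an in-memory SQLite database as a stand-in, and the table names `target` and `staging` and the key column `id` are hypothetical.

```python
import sqlite3


def upsert_from_staging(conn):
    """Merge rows from `staging` into `target`, keyed on `id` (hypothetical schema)."""
    with conn:  # one transaction: commits on success, rolls back on error
        # Remove target rows that are about to be replaced.
        conn.execute("DELETE FROM target WHERE id IN (SELECT id FROM staging)")
        # Insert both the replacements and the brand-new rows.
        conn.execute("INSERT INTO target (id, value) SELECT id, value FROM staging")


# In-memory database standing in for the real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "old"), (2, "keep")])
conn.executemany("INSERT INTO staging VALUES (?, ?)", [(1, "new"), (3, "added")])
conn.commit()

upsert_from_staging(conn)
rows = conn.execute("SELECT id, value FROM target ORDER BY id").fetchall()
print(rows)  # [(1, 'new'), (2, 'keep'), (3, 'added')]
```

Running both statements in a single transaction means a failure partway through leaves the target table unchanged rather than half-updated.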

You can read the full blog post at this link.

A small preview:

[Screenshot dated 2018-02-11; image not included]

This post is licensed under CC BY 4.0 by the author.
