Airflow & Kubernetes Jobs

Elimor
Jun 21, 2021

A short story about getting to Airflow and Kubernetes, and a new operator that may help your Data Engineering team.

For those anxious to get to the point, here is the github repo we will lead up to:

Still with us? Great.

A bit of background: I’ve been using Airflow in production with a team of about 10 other data engineers spread across the world for almost 3 years now. It’s been a long journey to get to Airflow, and we’ve had a lot of growing pains. When I first started with the team, almost 7 years ago, our stack was a loose collection of PHP, Bash, and Pentaho** scripts.

In the early days a typical sprint’s workflow went something like this…

  1. We would talk about the tickets for the new sprint, usually data errors to patch or new data sources to figure out how to ingest.
  2. Regardless of the ticket, we would eventually CR and merge our new code to the master branch.
  3. Then, someone with permissions would go to the dedicated EC2 server with access to our production databases and pull the new code from master.
  4. The same person would often then open up a specific user’s crontab on that EC2 instance, the one dedicated to scheduling the execution of the new code.
  5. Cross our fingers and hope it all goes ok!

You may have noticed a few things missing. No CI/CD to speak of, no good development or staging environments, no containerization, etc. It was the wild west days of a small startup racing madly to produce enough value to keep from burning up and facing the same fate most other startups do.

To be fair, this way of doing things worked well enough for us for a long time. It was very quick and straightforward to get new code deployed and our team was small, disciplined, and we cared a lot about maintaining our data quality.

And we succeeded! We were acquired.

We soon had some new problems. A larger fish had decided that we were valuable enough to buy because we did a thing better than they knew how to do it. But now we had more work to do than ever and a team spread across the world to do it. The old way would not work for us anymore.

Enter Airflow.

I had long been a strong proponent of using Python as our main language of choice on the data team. PHP is simply not the right tool for the kinds of data problems we faced, and Pentaho was, in my opinion, a messy and hard-to-maintain band-aid solution to a team problem, not a data problem. Airflow was just starting to enter the conversation in local meetups, and a recent hire had some positive experiences with it in their previous positions. So we made a leap of faith.

With the acquisition came some seriously strong new talent from other arms of the organization. Our new principal developer had extensive experience with Kubernetes in his organization before his acquisition, and he took on the task of wrestling this Airflow thing into our new EKS cluster. (Apache Airflow has since released an official Helm chart for this.) Thanks to his work we got our v1 Airflow up and running in EKS, deployed through a CI/CD solution (check!). Now we just had to learn how to actually write these DAGs.
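
That official chart didn’t exist when we started, but for anyone setting this up today, a baseline install onto a Kubernetes cluster like EKS can be as simple as the following (the release and namespace names here are just illustrative):

    # Add the official Apache Airflow chart repository and install a baseline release.
    # Using "airflow" as both release and namespace name is illustrative; pick your own.
    helm repo add apache-airflow https://airflow.apache.org
    helm repo update
    helm upgrade --install airflow apache-airflow/airflow \
      --namespace airflow --create-namespace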

Clueless and ignorant, we made the fatal mistake of thinking of Airflow as a complete data engineering framework in which to write and deploy our ETL jobs: a data-pipeline, orchestration, all-data-everything tool that could solve our data team’s needs. If we had a new DAG, we would often write a new plugin or new operators to solve the specific needs of that ETL job.

A few sprints in we realized our mistake. Deployments of new code now resulted in running jobs getting terminated, and developers found it harder and harder to test and run their code reliably. We thought we had found the answer, but it had become a mess!

If this sounds familiar, I feel your pain. I’ve talked with folks at other organizations that have fallen into similar situations. They frequently either dug themselves in deeper, hoping things would work out OK, or abandoned Airflow entirely. As for us, our principal developer came across this excellent article on the weaknesses of Airflow and how to think about a way forward focusing on its strengths. (For more interesting thoughts on the weaknesses of Airflow, read Dagster’s comparison between the two tools here.)

What we came up with was a solution roughly like this…

  1. For all our old code, make a Docker image that describes the environment of our old master EC2 instance.
  2. Bundle our legacy code into this image.
  3. Any new code can live in its own repo related to the project it’s dealing with; the only big requirement is that it is Docker-ready.
  4. On the Airflow side, write the relevant YAML files to describe a Kubernetes Job that runs the business logic, and then write a DAG that controls the execution of that Job (a minimal sketch follows below).

Simply put, what used to be our mega crontab has essentially become Airflow.
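
To make step 4 concrete, here is a minimal sketch of such a Job manifest for a hypothetical nightly ingest task; the image name, command, and file path are all illustrative:

    # k8s/jobs/nightly-ingest.yaml -- a hypothetical example manifest
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: nightly-ingest
    spec:
      backoffLimit: 2          # retry the pod up to twice on failure
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: nightly-ingest
              image: registry.example.com/data/nightly-ingest:latest  # built from a Docker-ready repo
              command: ["python", "-m", "ingest.run"]

The DAG’s only responsibility is then to submit this manifest on a schedule and watch the Job to completion; all of the business logic lives in the image.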

Here are some major things we gain with this new approach:

A. Our execution flow and business logic are separate!

B. Developers only need to focus on the code that accomplishes the specific task at hand. You have one job; do it. (And as an added bonus, you can write in any language you want.)

C. If for some reason the Airflow service completely goes down, running jobs are not impacted at all. We fix Airflow and it auto-recovers seamlessly.

Confident in this new approach, we began our journey of migrating the code we had started writing inside Airflow into this new way of doing things.

Where are we now?

Admittedly, our team and I didn’t come up with the ‘best’ way to do things. If my story shows nothing else, it’s that we as developers often stumble through trying to get to whatever is better, while ‘best’ is always a moving target depending on the teams and the times.

With all that in mind, I spent some time recently reflecting on my experience working with teams that use Airflow in production to sketch out a general-use Kubernetes Job Operator that I hope will be useful to the community. If you’re interested in or knowledgeable about these things, or you’re on the Airflow team, I’d be happy for your feedback and help in integrating it upstream someday soon.
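
As a rough illustration of the intended usage (the import path and parameter names below are hypothetical placeholders, not a settled API), a DAG driving the Job manifest from earlier might look something like this:

    # dag_nightly_ingest.py -- a minimal sketch. The operator's import path and
    # its job_yaml_path parameter are hypothetical placeholders, not a final API.
    from datetime import datetime

    from airflow import DAG
    from kubernetes_job_operator import KubernetesJobOperator  # hypothetical import

    with DAG(
        dag_id="nightly_ingest",
        start_date=datetime(2021, 6, 1),
        schedule_interval="0 3 * * *",  # what used to be a crontab entry
        catchup=False,
    ) as dag:
        # Submit the Job manifest to the cluster and wait for it to complete.
        # Airflow only orchestrates; the business logic lives in the Docker image.
        KubernetesJobOperator(
            task_id="nightly_ingest",
            job_yaml_path="k8s/jobs/nightly-ingest.yaml",  # hypothetical parameter
        )

And because the Job runs in the cluster rather than inside Airflow’s workers, an Airflow outage mid-run leaves it executing untouched, which is exactly the recovery behavior described in point C above.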

Cheers.

**Pentaho: Consider yourself lucky if you aren’t familiar with it! It’s the answer to ‘What if Airflow was made primarily for non-programmers?’ And if you must code, you “simply” drag and drop ‘tasks’ that you click into to add short JavaScript or Bash scripts. Also it runs on the JVM. Also, when you commit changes, all you can see is what look like giant XML files.
