SeattleDataGuy
Scaling Airflow – Astronomer Vs Cloud Composer Vs Managed Workflows For Apache Airflow

Over the last three months, I have taken on two different migrations that moved companies from manually managed Airflow VMs to Cloud Composer and MWAA (Managed Workflows for Apache Airflow).

These are two great options for starting your first Airflow project. They take care of a lot of issues, such as scaling workloads across workers and managing the various other components required to run Airflow at scale.

Of course, both Cloud Composer and MWAA are services provided by two of the larger cloud providers. But there is a third option called Astronomer.

In this article, I want to discuss some of the benefits and limitations of Cloud Composer and MWAA as well as discuss why Astronomer might be a better fit for many companies.

Why Use A Managed Instance Of Airflow?

Maintenance and management of services and databases often hold many data teams back. We already deal with daily pipeline failures due to the engineering team updating that one column or removing a table altogether without telling us.

On top of that, managing your own Airflow instance generally means running a metadata database like Postgres, a message broker like Redis, a logging location like S3, and several other components to manage your workers as they scale.

Each of these components increases technical complexity and the opportunity for failures. If your team is small or just needs to prioritize other work, then managed versions of Airflow are often a good choice.

But which managed Airflow service works best for you?

Comparing Composer And MWAA to Astronomer

I want to start by acknowledging that this article may come across as very positive towards one specific product. The truth is, if you're just getting started, Cloud Composer and MWAA are great.

They can help you stand up a POC or an MVP without too many external logistical components or agreements. Just click "create an environment."

Where Astronomer shines is when you set up more complex jobs and need more flexibility. So let's dive into some of the benefits and challenges you will face with each of these solutions.

Ease Of Set-Up

Without a doubt, setting up MWAA or Cloud Composer is easier than setting up Astronomer. Really.

It's just a few clicks and you have a fully managed, easily scalable Airflow instance at your fingertips. It does take about 20-30 minutes for the entire system to spin up.

Astronomer takes a little more lift, but personally, I found it still far easier than trying to set up your own VMs to scale.

Overall, what you gain in ease of set-up you will lose in customizability.

What You Lose In Ease Of Set-Up You Gain In Customizability

Solutions are all about trade-offs. What Cloud Composer and MWAA give you in ease of set-up, they take away in customizability: beyond your Python libraries, there isn't much you can edit.

I found this out first-hand as I migrated a project over to Cloud Composer. My first question: how do I update my Docker image?

Of course, there were other issues, such as adding drivers for older databases and other plugins, that were less straightforward. What might seem like an easy-to-solve problem, like installing a driver, can become very taxing and require complex workarounds.

At the end of the day, your team will need to assess its overall goals to know whether it needs a lot of flexibility or needs to deploy DAGs fast.

Docker Image Limitations

One of the biggest challenges you will run into with Cloud Composer or MWAA is that you can't go into the Docker image the service builds and edit it.

This is fine if your DAGs are very simple and don't require complex drivers or external dependencies. However, as soon as your DAGs require complex technical dependencies, you will likely have to resort to a lot of workarounds.

For example, I saw a Stack Overflow answer where a developer had to go to the cluster running the VM and manually force certain components to run on it, which kind of defeats the purpose of Docker if you now have a manual step to remember in future deployments.

Now let's compare that to Astronomer.

How would you customize your Docker image?

Easy: there is an article on exactly that topic. Yes, Astronomer has a steeper learning curve, but the trade-off is the ability to build much more complex integrations between your Airflow instance and other systems, such as integrating with dbt or including an IBM driver file.

All easy to do.
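For instance, once dbt is baked into your image, a DAG can call it with a plain BashOperator. This is a minimal sketch, assuming dbt is installed in the image and the project lives at the hypothetical path /usr/local/airflow/dbt:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical location where the dbt project was copied into the image.
DBT_PROJECT_DIR = "/usr/local/airflow/dbt"

with DAG(
    dag_id="dbt_daily_run",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_PROJECT_DIR} --profiles-dir {DBT_PROJECT_DIR}",
    )
```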

Executor Limitations

When it comes to how you run your Airflow instance, executors are a key component in determining how tasks get scheduled onto workers. Choosing the wrong executor means your system may fail to scale and won't be able to run your jobs fast enough.

The problem is that in some cases you may want a very specific executor. Astronomer lets you use the Local, Celery, and Kubernetes executors. In comparison, MWAA only supports the Celery executor, and Cloud Composer also blocks many configurations. This limitation has become a major pain point for some companies, which end up building massive workarounds just to run their jobs the way they need to on their managed service.

In the end, this can be very limiting depending on how you want to deploy your DAGs.
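To make this concrete, here is a minimal sketch of the kind of per-task tuning the Kubernetes executor allows, assuming an Airflow 2.x deployment running that executor (available on Astronomer, not on MWAA); the resource numbers and DAG name are just illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s

# Request extra memory for one heavy task by overriding its pod spec.
heavy_task_config = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # Airflow's task container is named "base"
                    resources=k8s.V1ResourceRequirements(
                        requests={"memory": "4Gi"},
                        limits={"memory": "8Gi"},
                    ),
                )
            ]
        )
    )
}


def transform():
    # Placeholder for a memory-hungry transformation.
    pass


with DAG(
    dag_id="executor_override_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="heavy_transform",
        python_callable=transform,
        executor_config=heavy_task_config,
    )
```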

Astronomer Works Closely With Airflow

To add to the many benefits of Astronomer, I will say that they focus on Airflow a lot. They contribute heavily to the Airflow code base and work closely with the Airflow team in general.

This leads to many of their employees being experts in Airflow. They work hard to understand the current needs of their end-users and have a great grasp of Airflow best practices. The difference is that Astronomer is Airflow only.

They aren't looking to sell you on S3 or Cloud Storage. Instead, they want to figure out how to best deliver Airflow as a service.

In contrast, AWS and GCP focus on everything, meaning their products will always lag. This is notable: MWAA is still back on 2.0.1 while Airflow has been updated to 2.2.0, which Astronomer can already support.

If you ask an AWS solutions architect for advice on how to better set up MWAA, you might just get generalized advice. It's hard to be a master of every tool and solution AWS has to offer, compared to Astronomer, which focuses only on Airflow.

Astronomer always stays in step with Airflow releases and is usually ready to support the most recent version of Airflow almost immediately.

When To Use What

MWAA and Cloud Composer are great solutions when the DAGs you want to build are pretty simple and you don't want to involve too many extra components; honestly, they are a great place to start your Airflow journey. If you're trying to build a case for Airflow or just trying to prove that it works, then go ahead and build it on Composer or MWAA.
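As a rough illustration, this is the kind of self-contained DAG that runs on Composer or MWAA without touching the underlying image; everything it needs can go in requirements.txt (the function body is a placeholder):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_load():
    # Placeholder: pull from an API and load to the warehouse using
    # libraries that can be installed from requirements.txt alone.
    pass


with DAG(
    dag_id="simple_daily_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
```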

Of course, as your company grows and your needs become more complex, you will more than likely find yourself needing to migrate to Astronomer. Also, if you work in an industry with tighter regulations, requirements around private PyPI setups, or a need for more fine-grained access controls, then Astronomer shines here as well.

I am a little surprised that AWS and GCP have made it so difficult to update your Docker image or customize much about how your Airflow instance is launched. Perhaps they just want to make sure the tools stay as optimized as possible.

Also, it just makes them easier to launch.

Finding the right version of managed Airflow doesn't have to be hard. Start by looking at what use cases you have now and what the future holds. You can easily start with a tool like Cloud Composer and eventually switch to Astronomer. You will lose out on some best practices along the way, but on the flip side, you will already have some of your use cases set up and can then work with their team to improve your DAGs as you go.

If you enjoyed this article, then please check out these articles and videos:

How To Improve Your Data Analytics Strategy For 2022 - Reducing Costs And Improving Insights

Data Engineer Vs Analytics Engineer Vs Analyst

Why Migrate To The Modern Data Stack And Where To Start

5 Great Data Engineering Tools For 2021 -- My Favorite Data Engineering Tools

4 SQL Tips For Data Scientists

What Are The Benefits Of Cloud Data Warehousing And Why You Should Migrate

Top comments (2)

Ben Summers

Interesting. Never thought of problems scaling/customizing Airflow itself, as I just used it to schedule jobs on other platforms (e.g. a DAG could be: launch/acquire an EMR cluster, start the job, wait for the job to finish, kill the cluster if it's our responsibility).

Shai Almog

Interesting read!

Can I ask a newbie question?

How do you debug airflow instances at scale?