Apache Airflow Best Practices E-Book

Dylan Intorf
Description

Data professionals face the challenge of managing complex data pipelines, orchestrating workflows across diverse systems, and ensuring scalable, reliable data processing. This definitive guide to mastering Apache Airflow, written by experts in engineering, data strategy, and problem-solving across tech, financial, and life sciences industries, is your key to overcoming these challenges.
Covering everything from Airflow fundamentals to advanced topics such as custom plugin development, multi-tenancy, and cloud deployment, this book provides a structured approach to workflow orchestration. You’ll start with an introduction to data orchestration and Apache Airflow 2.x updates, followed by DAG authoring, managing Airflow components, and connecting to external data sources. Through real-world use cases, you’ll learn how to implement ETL pipelines and orchestrate ML workflows in your environment, and scale Airflow for high availability and performance. You’ll also learn how to deploy Airflow in cloud environments, tackle operational considerations for scaling, and apply best practices for CI/CD and monitoring.
By the end of this book, you’ll be proficient in operating and using Apache Airflow, authoring high-quality workflows in Python, and making informed decisions crucial for production-ready Airflow implementations.





Apache Airflow Best Practices

A practical guide to orchestrating data workflows with Apache Airflow

Dylan Intorf

Dylan Storey

Kendrick van Doorn

Apache Airflow Best Practices

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

The authors acknowledge the use of cutting-edge AI, such as ChatGPT, with the sole aim of enhancing the language and clarity within the book, thereby ensuring a smooth reading experience for readers. It’s important to note that the content itself has been crafted by the authors and edited by a professional publishing team.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Ali Abidi

Publishing Product Manager: Apeksha Shetty

Book Project Manager: Shambhavi Mishra

Senior Editor: Joseph Sunil

Technical Editor: Seemanjay Ameriya

Copy Editor: Safis Editing

Proofreader: Joseph Sunil

Indexer: Pratik Shirodkar

Production Designer: Joshua Misquitta

Senior DevRel Marketing Executive: Vinishka Kalra

First published: October 2024

Production reference: 1101024

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-80512-375-0

www.packtpub.com

To my partner, Kristen, for always supporting my dreams and encouraging me to have confidence the pipelines won’t break in the middle of the night.

– Kendrick van Doorn

Contributors

About the authors

Dylan Intorf is a seasoned technology leader with a B.Sc. in computer science from Arizona State University. With over a decade of experience in software and data engineering, he has delivered custom, tailored solutions to the technology, financial, and insurance sectors. Dylan’s expertise in data and infrastructure management has been instrumental in optimizing Airflow deployments and operations for several Fortune 25 companies.

Dylan Storey holds a B.Sc. and M.Sc. in biology from California State University, Fresno, and a Ph.D. in life sciences from the University of Tennessee, Knoxville, where he specialized in leveraging computational methods to study complex biological systems. With over 15 years of experience, Dylan has successfully built, grown, and led teams to drive the development and operation of data products across various scales and industries, including many of the top Fortune-recognized organizations. He is also an expert in leveraging AI and machine learning to automate processes and decisions, enabling businesses to achieve their strategic goals.

Kendrick van Doorn is an accomplished engineering and business leader with a strong foundation in software development, honed through impactful work with federal agencies and consulting technology firms. With over a decade of experience in crafting technology and data strategies for leading brands, he has consistently driven innovation and efficiency. Kendrick holds a B.Sc. in computer engineering from Villanova University, an M.Sc. in systems engineering from George Mason University, and an MBA from Columbia University.

About the reviewers

Ayoade Adegbite is an accomplished data and analytics engineer with extensive experience in leveraging advanced data tools and enterprise ecosystems to deliver actionable insights across diverse industries. He excels in designing sophisticated analytical models, ensuring data integrity, and implementing impactful data solutions. With a strong background in ETL processes, data visualization, and robust documentation practices, Ayoade has consistently driven significant improvements in data-driven decision-making and operational efficiency.

Ayoade has also optimized business processes and revitalized operations as a consultant, utilizing technologies such as Airflow and dbt. Ayoade is a member of the Apache Airflow Champion initiative.

Ananth Packkildurai, the author of the influential Data Engineering Weekly newsletter, has made significant contributions to the data industry through his deep expertise and innovative insights. His work has been instrumental in shaping how companies approach data engineering, and industry leaders highly regard his thought leadership. In the past, Ananth worked in companies such as Slack and Zendesk to build petabyte-scale data infrastructure, including a data pipeline, search infrastructure, customer-facing analytics, and an observability platform.

Frank Breetz is a highly experienced data consultant with over a decade of expertise in the field. His proficiency in Apache Airflow is extensive and well-rounded. During his tenure at Astronomer, Frank advised numerous clients on Airflow best practices and helped establish industry standards. Currently, at LinQuest Corporation, he continues to leverage Airflow alongside various other technologies while developing a Data Management Framework.

Frank’s in-depth knowledge and practical experience make him an invaluable resource for mastering Apache Airflow. He holds a Master’s degree in Computer Science and a Bachelor’s degree in Physics.

Vipul Bharat Marlecha is an accomplished software engineer with a focus on large-scale distributed data systems. His career spans roles at major tech companies, including Netflix, DoorDash, Twitter, and Nomura. He is particularly skilled in managing big data, designing scalable systems, and delivering solutions that emphasize impact over activity.

Table of Contents

Preface

Part 1: Apache Airflow: History, What, and Why

1

Getting Started with Airflow 2.0

What is data orchestration?

Industry use cases

Exploring Apache Airflow

Apache Airflow 2.0

Standout features of Apache Airflow

A look ahead

Core concepts of Airflow

Why Airflow may not be right

When to choose Airflow

Zen of Python

Idempotency

Code as configuration

Skills to use Apache Airflow effectively

Summary

2

Core Airflow Concepts

Technical requirements

DAGs

Decorators and a DAG definition

Scheduling with Apache Airflow and moving away from CRON

Tasks

Task operators

The first task – defining the DAG and extract

Defining the transform task

Xcoms

Defining the load task

Setting the flow of tasks and dependencies

Executing the DAG example

Task groups

Triggers

Summary

Part 2: Airflow Basics

3

Components of Airflow

Technical requirements

Overall architecture

Executors

Local Executors (Sequential and Local)

Parallelism

Celery Executor (Remote Executor)

Kubernetes Executor (Remote Executor)

Dask Executor (Remote Executor)

Kubernetes Local Executor (Hybrid Executor)

Scheduler

Summary

4

Basics of Airflow and DAG Authoring

Technical requirements

Designing a DAG

DAG authoring example architecture development

DAG example overview

Initial workflow requirements

Bringing our first Airflow DAG together

Extracting images from the NASA API

The NASA API

Building an API request in Jupyter Notebook

Automating your code with a DAG

Writing your first DAG

Instantiating a DAG object

Defining default arguments

Defining the first task

What are operators?

Defining the first task’s Python code

Defining the second task

Setting the task order

Summary

Part 3: Common Use Cases

5

Connecting to External Sources

Technical requirements

Connectors make Apache Airflow

Computing outside of Airflow

Where are these connections?

Connections stored in the metadata database

A quick note about secrets being added through the Airflow UI

Creating Connections from the CLI

Testing of Connections

Using environment variables

Airflow metadata database

Secrets management service

Secrets Cache

How to test environment variables and secret store Connections

Best practices

Building an email or Slack alert

Key considerations

Airflow notification types

Email notification

Creating a Slack webhook

Creating the Airflow Connection

Let’s build an example DAG

Summary

6

Extending Functionality with UI Plugins

Technical requirements

Understanding Airflow UI plugins

Creating a metrics dashboard plugin

Step 1 – project structure

Step 2 – view implementation

Step 3 – metrics dashboard HTML template

Step 4 – plugin implementation

Summary

References

7

Writing and Distributing Custom Providers

Technical requirements

Structuring your provider

General directory structure

Authoring your provider

Registering our provider

Authoring our hook

Authoring our operators

Authoring our sensor

Testing

Functional examples

Summary

8

Orchestrating a Machine Learning Workflow

Technical requirements

Basics of a machine learning-based project

Our recommendation system – movies for you

Designing our DAG

Implementing the DAG

Determining whether data has changed

Fetching data

Pre-processing stage

KNN feature creation

Deep learning model training

Promoting assets to production

Summary

9

Using Airflow as a Driving Service

Technical requirements

QA testing service

Designing the system

Choosing how to configure our workflows

Defining our general DAG topology

Creating our DAGs from our configurations

Scheduling (and unscheduling) our DAGs

Summary

Part 4: Scale with Your Deployed Instance

10

Airflow Ops: Development and Deployment

Technical requirements

DAG deployments

Bundling

De-coupled DAG delivery

Repository structures

Mono-repo

Multi-repo

Connection and Variable management

Environment variables

Secrets backends

Airflow deployment methods

Kubernetes

Virtual machines

Service providers

Localized development

Virtual environments

Docker Compose

Cloud development environments

Testing

Testing environments

Testing DAGs

Testing providers

Testing Airflow

Summary

11

Airflow Ops Best Practices: Observation and Monitoring

Technical requirements

Monitoring core Airflow components

Scheduler

Metadata database

Triggerer

Executors/workers

Web server

Monitoring your DAGs

Logging

Alerting

SLA monitoring

Performance profiling

Summary

12

Multi-Tenancy in Airflow

Technical requirements

When to choose multi-tenancy

Component configuration

The Celery Executor

The Kubernetes executor

The scheduler and triggerer

DAGs

Web UI

Summary

13

Migrating Airflow

Technical requirements

General management activities for a migration

Inventory

Sequence

Migrate

Monitor

Technical approaches for migration

Automating code migrations

QA/testing design

Planning a migration between Airflow environments

Connections and variables

DAGs

Summary

Index

Other Books You May Enjoy

Part 1: Apache Airflow: History, What, and Why

This part has the following chapters:

Chapter 1, Getting Started with Airflow 2.0
Chapter 2, Core Airflow Concepts

1

Getting Started with Airflow 2.0

In modern software development and data processing, orchestration plays a pivotal role in ensuring the coordination and execution of complex workflows. As organizations strive to manage their ever-growing data and application landscapes, the need for an efficient orchestration system becomes paramount.

With Airflow 2.0 having been available for some time and its capabilities growing quickly, we decided to distill our experience operating Airflow into patterns that have worked well in practice, to help others on the same path.

Our goal with this book is to help engineers and organizations adopting Apache Airflow as their orchestration solution get the most out of their technology selection by guiding them to better choices as they go through their adoption journey and scale.

In this chapter, we will learn what data orchestration is and how it is applied to several industries facing data challenges. In addition, we will explore the basic benefits of Apache Airflow and its features that may benefit your organization. We will take a look ahead at what you can expect to learn by reading this book and practicing industry-leading techniques for orchestrating your data pipelines with Apache Airflow. Apache Airflow remains the industry leader in data orchestration and pipeline management. With this success comes a set of tenets and principles that have been identified as best practices. We will cover some of the best practices and approaches in this chapter and identify the skills needed to be successful.

In this chapter, we’re going to cover the following main topics:

What is data orchestration?
Exploring Apache Airflow
Core concepts of Airflow
Skills to use Apache Airflow effectively

What is data orchestration?

In today’s data-driven world, organizations face the challenge of handling vast amounts of data from diverse sources. Data orchestration is the key to managing this complex data landscape efficiently. It involves the coordination, automation, and monitoring of data workflows, ensuring the smooth execution of tasks and the timely delivery of valuable insights.

Orchestration, in the context of software development and data engineering, refers to the process of automating and managing the execution of interconnected tasks or processes to achieve a specific goal. These tasks might involve data processing, workflow scheduling, service provisioning, and more. The purpose of orchestration is to streamline the flow of operations, optimize resource utilization, and ensure that tasks are executed in a well-coordinated manner.

Traditional, manual orchestration is cumbersome and prone to errors, especially as the complexity of workflows increases. However, with modern orchestration tools and frameworks, developers can automate these intricate processes, resulting in enhanced efficiency and reliability.

Industry use cases

Regardless of the industry, Apache Airflow can bring benefits to any data engineering or data analysis team. To illustrate this, here are some examples of how a few key industries we have worked with in the past may use this leading data orchestrator to meet their needs:

E-commerce: An e-commerce brand may need an automated ETL/ELT pipeline to extract, transform, and load data from various sources, such as sales, customer interactions, and current inventory
Banking/fintech: Leading financial firms may use Apache Airflow to orchestrate the processing of transaction data to identify fraud or risks in their reporting/billing systems
Retail: Major retailers and brands can use Apache Airflow to help automate their machine learning (ML) workloads to better predict user trends and purchases based on seasonality or current market environments

Now that we have learned what data orchestration is, how it is important for organizations, and some basic industry use-case examples, let us explore Apache Airflow, which is one of the most popular platforms and the core topic of this book.

Exploring Apache Airflow

Apache Airflow is known within the data engineering community as the go-to open source platform for “developing, scheduling, and monitoring batch-oriented workflows.” (Apache.org Airflow documentation: https://airflow.apache.org/docs/apache-airflow/stable/index.html)

Apache Airflow has emerged as the go-to open source platform for data orchestration and remains the leader as a result of its active development community. It offers a robust and flexible solution to the challenges of managing complex data workflows. Airflow enables data engineers, data scientists, artificial intelligence (AI)/ML engineers, and MLOps and DevOps professionals to design, schedule, and monitor data pipelines with ease.

The power of Apache Airflow lies in its ability to represent data workflows as directed acyclic graphs (DAGs). This intuitive approach allows users to visualize and understand relationships between tasks, making it easier to create and maintain complex data pipelines. Furthermore, Airflow’s extensibility and modularity allow users to customize the platform to their specific needs, making it an ideal choice for businesses of all sizes and industries.
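To make the DAG idea concrete, here is a minimal sketch of a three-task workflow; the DAG id, task names, and schedule are illustrative rather than taken from the book, and it assumes a recent Airflow 2 release (2.4 or later, where the schedule argument replaces schedule_interval):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator


def _transform():
    # Placeholder for real transformation logic.
    print("transforming data")


with DAG(
    dag_id="example_dag_structure",  # hypothetical DAG id for illustration
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = PythonOperator(task_id="transform", python_callable=_transform)
    load = EmptyOperator(task_id="load")

    # The >> operator declares the graph's edges: extract -> transform -> load.
    extract >> transform >> load

Because the dependencies form a directed acyclic graph, the Airflow UI can render exactly this structure in its graph view, which is what makes larger pipelines easier to reason about.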

Apache Airflow 2.0

The release of Apache Airflow 2.0 in December 2020 stands as one of the largest achievements of the community since Airflow was originally created as an internal solution at Airbnb in 2014. The move to 2.0 was a large lift for the community and came with hundreds of updates and bug fixes, many of them shaped by the 2019 Airflow community survey.

This release brought with it an updated UI, a new highly available scheduler, a refactored Kubernetes executor, and task groups, a simpler way to group related tasks within a DAG. It was a groundbreaking achievement and laid out the roadmap for future releases that have only made Airflow an even more valuable tool for the community.

Standout features of Apache Airflow

Apache Airflow has brought with it a multitude of features to support the different needs of organizations and teams. Some of our favorites revolve around sensing, task grouping, and operators, but each can be grouped into one of the following categories:

Extensible: Users can create custom operators and sensors or access a wide range of community-contributed plugins, enabling seamless integration with various technologies and services. This extensibility enhances Airflow’s adaptability to different environments and use cases, making its potential limited only by the engineer’s imagination.
Dynamic: The platform supports dynamic workflows, meaning the number of tasks and their configurations can be determined at runtime, based on variables, external sensors, or data captured during a run (see the sketch after this list). This feature makes Airflow more flexible as workflows can adapt to changing conditions or input parameters, resulting in better resource utilization and improved efficiency.
Scalable: Airflow’s distributed architecture ensures scalability to handle large-scale and computationally intensive workflows. As businesses grow and their data processing demands increase, Airflow can accommodate these requirements by distributing tasks across multiple workers, reducing processing times, and improving overall performance.
Built-in monitoring: Airflow provides a web-based UI to monitor the status of workflows and individual tasks. This interface allows users to visualize task execution and inspect logs, facilitating transparency and easy debugging. By gaining insights into workflow performance, users can optimize their processes and identify potential bottlenecks.
Ecosystem: Airflow seamlessly integrates with a wide range of technologies and cloud providers. This integration allows users to access diverse data sources and services, making it easier to design comprehensive workflows that interact with various systems. Whether working with databases, cloud storage, or other tools, Airflow can bridge the gap between different components.
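As a hedged illustration of the Dynamic point above, the sketch below uses dynamic task mapping (available since Airflow 2.3; the schedule argument assumes 2.4+) to decide at runtime how many task instances to create; the file names and DAG name are invented for the example:

from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def dynamic_mapping_example():  # hypothetical DAG name for illustration
    @task
    def list_files():
        # In practice this might list objects in cloud storage or call an API at runtime.
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def process(file_name: str):
        print(f"processing {file_name}")

    # One mapped task instance is created per returned file name, determined at runtime.
    process.expand(file_name=list_files())


dynamic_mapping_example()

If the upstream task returns five file names on the next run, five mapped instances appear automatically, with no change to the DAG code.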

Apache Airflow brings with it years of open source development and well-thought-out designs by hundreds of contributors. It is the leading data orchestration tool, and learning how to better utilize its key features will help you become a better data engineer and manager.

A look ahead

Throughout this book, we will explore the essential features of Apache Airflow, providing you with the knowledge to leverage its full potential in your data orchestration journey. The key topics covered include the following:

Why use Airflow?: Tenets, skills, and first principles
Airflow basics: Understanding core concepts (DAGs, tasks, operators, deferrables, connections, and so on), the components of Airflow, and the basics of DAG authoring
Common use cases: Unlocking the potential of Airflow with ETL pipelines, custom plugins, and orchestrating workloads across systems
Scaling with your team: Hardening your Airflow instance for production workloads with CI/CD, monitoring, and the cloud

By the end of this book, you will have a comprehensive understanding of Apache Airflow’s best practices, enabling you to build robust, scalable, and efficient data pipelines that drive your organization’s success. Let’s embark on this practical guide for data pipeline orchestration using Apache Airflow and unlock the true potential of data-driven decision-making.

Core concepts of Airflow

Apache Airflow is a dynamic, extensible, and flexible framework for building workflows as code. Defining automated workflows in code enables better version control, development through CI/CD, straightforward testing, and reuse of extensible components and operators from a thriving community of committers.

Airflow is known for its approach to scheduling tasks and workflows. It can use CRON expressions or its built-in schedule presets. In addition, backfilling lets you rerun pipelines for past intervals, which is useful when logic changes and historical data needs to be reprocessed. These are powerful operational capabilities that need to be accounted for as part of the design.
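As a minimal sketch of these scheduling and backfill options (assuming Airflow 2.4 or later and an illustrative DAG id), a CRON expression or a preset can be passed to schedule, and catchup=True asks the scheduler to backfill one run per interval since start_date:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="scheduling_example",  # hypothetical DAG id for illustration
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",  # a CRON expression; presets such as "@daily" also work
    catchup=True,  # backfill a run for every interval since start_date
) as dag:
    EmptyOperator(task_id="placeholder")

Past intervals can also be rerun on demand with the airflow dags backfill CLI command, which is the typical way to reprocess history after a logic change.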

Following these guidelines will help you lay a foundation for scaling your Airflow deployments and increase the effectiveness of your workflows from both authorship and operational viewpoints.

Why Airflow may not be right

Before we jump into the tenets of Apache Airflow, we should take a moment to acknowledge when it is not a good fit for organizations. While each of the following statements can probably be shown to be “false” for sufficiently motivated and clever engineers (and we all know plenty of them), they are generally considered anti-patterns and should be avoided.

Some of these anti-patterns include the following:

Teams with no or limited experience in Python programming. Implementing DAGs in Python can be a complex process and requires hands-on experience to maintain the code.
Streaming or non-batch workflows and pipelines, where the use case requires immediate updates. Airflow is designed for batch-oriented and scheduled tasks.

When to choose Airflow

Airflow is best used for implementing “batch-oriented” scheduled data pipelines. Often, use cases can include ETL/ELT, reverse ETL, ML, AI, and business intelligence (BI). Throughout this book, we will review major use cases that have been seen at different industry-leading companies. Some key use cases include the following:

ETL pipelines of data: Almost every implementation of Airflow helps to automate tasks of this type, whether it is to consolidate data within a data warehouse or move data through different tools (a minimal sketch follows this list)
Writing and distributing custom plugins for organizations that have a unique stack and needs that have not been addressed by the open source community: Airflow allows for easy customization of the environment and ecosystem to your needs
Extending the UI functionality with plugins: Modify and adjust the UI to allow for new views, charts, and widgets to integrate with external systems
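As promised above, here is a minimal ETL sketch using the TaskFlow API; the source data, fields, and DAG name are invented for illustration and stand in for whatever warehouse or tooling your environment uses (again assuming Airflow 2.4 or later):

from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def simple_etl():  # hypothetical DAG name for illustration
    @task
    def extract():
        # Stand-in for pulling rows from a source system or API.
        return [{"order_id": 1, "amount": 10.0}]

    @task
    def transform(rows):
        # Stand-in for cleaning or enriching the extracted data.
        return [{**row, "amount_cents": int(row["amount"] * 100)} for row in rows]

    @task
    def load(rows):
        # Stand-in for writing the transformed rows to a warehouse table.
        print(f"loading {len(rows)} rows")

    load(transform(extract()))


simple_etl()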