Description

With so many tools to choose from in today’s data engineering development stack, as well as significant operational complexity, data engineers are often overwhelmed, spending less time gleaning value from their data and more time maintaining complex data pipelines. Guided by a lead specialist solutions architect at Databricks with 10+ years of experience in data and AI, this book shows you how the Delta Live Tables framework simplifies data pipeline development by allowing you to focus on defining input data sources, transformation logic, and output table destinations.
This book gives you an overview of the Delta Lake format, the Databricks Data Intelligence Platform, and the Delta Live Tables framework. It teaches you how to apply data transformations by implementing the Databricks medallion architecture and continuously monitor the data quality of your pipelines. You’ll learn how to handle incoming data using the Databricks Auto Loader feature and automate real-time data processing using Databricks workflows. You’ll master how to recover from runtime errors automatically.
By the end of this book, you’ll be able to build a real-time data pipeline from scratch using Delta Live Tables, leverage CI/CD tools to deploy data pipeline changes automatically across deployment environments, and monitor, control, and optimize cloud costs.

Building Modern Data Applications Using Databricks Lakehouse

Develop, optimize, and monitor data pipelines on Databricks

Will Girten

Building Modern Data Applications Using Databricks Lakehouse

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Apeksha Shetty

Publishing Product Managers: Arindam Majumder and Nilesh Kowadkar

Book Project Manager: Shambhavi Mishra

Senior Content Development Editor: Shreya Moharir

Technical Editor: Seemanjay Ameriya

Copy Editor: Safis Editing

Proofreader: Shreya Moharir

Indexer: Manju Arasan

Production Designer: Prashant Ghare

Senior DevRel Marketing Coordinator: Nivedita Singh

First published: October 2024

Production reference: 1181024

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN 978-1-80107-323-3

www.packtpub.com

To my beautiful and caring wife, Ashley, and our smiley son, Silvio Apollo, thank you for your unwavering support and encouragement.

– Will Girten

Contributors

About the author

Will Girten is a lead specialist solutions architect who joined Databricks in early 2019. With over a decade of experience in data and AI, Will has worked in various business verticals, from healthcare to government and financial services. Will’s primary focus has been helping enterprises implement data warehousing strategies for the lakehouse and performance-tuning BI dashboards, reports, and queries. Will is a certified Databricks Data Engineering Professional and Databricks Machine Learning Professional. He holds a Bachelor of Science in computer engineering from the University of Delaware.

I want to give a special thank you to one of my greatest supporters, mentors, and friends, YongSheng Huang.

About the reviewer

Oleksandra Bovkun is a senior solutions architect at Databricks. She helps customers adopt the Databricks Platform to implement a variety of use cases, follow best practices for implementing data products, and extract the maximum value from their data. During her career, she has participated in multiple data engineering and MLOps projects, including data platform setups, large-scale performance optimizations, and on-prem to cloud migrations. She previously had data engineering and software development roles for consultancies and product companies. She aims to support companies in their data and AI journey to maximize business value using data and AI solutions. She is a regular presenter at conferences, meetups, and user groups in Benelux and Europe.

Table of Contents

Preface

Part 1: Near-Real-Time Data Pipelines for the Lakehouse

1

An Introduction to Delta Live Tables

Technical requirements

The emergence of the lakehouse

The Lambda architectural pattern

Introducing the medallion architecture

The Databricks lakehouse

The maintenance predicament of a streaming application

What is the DLT framework?

How is DLT related to Delta Lake?

Introducing DLT concepts

Streaming tables

Materialized views

Views

Pipeline

Pipeline triggers

Workflow

Types of Databricks compute

Databricks Runtime

Unity Catalog

A quick Delta Lake primer

The architecture of a Delta table

The contents of a transaction commit

Supporting concurrent table reads and writes

Tombstoned data files

Calculating Delta table state

Time travel

Tracking table changes using change data feed

A hands-on example – creating your first Delta Live Tables pipeline

Summary

2

Applying Data Transformations Using Delta Live Tables

Technical requirements

Ingesting data from input sources

Ingesting data using Databricks Auto Loader

Scalability challenge in structured streaming

Using Auto Loader with DLT

Applying changes to downstream tables

APPLY CHANGES command

The DLT reconciliation process

Publishing datasets to Unity Catalog

Why store datasets in Unity Catalog?

Creating a new catalog

Assigning catalog permissions

Data pipeline settings

The DLT product edition

Pipeline execution mode

Databricks runtime

Pipeline cluster types

A serverless compute versus a traditional compute

Loading external dependencies

Data pipeline processing modes

Hands-on exercise – applying SCD Type 2 changes

Summary

3

Managing Data Quality Using Delta Live Tables

Technical requirements

Defining data constraints in Delta Lake

Using temporary datasets to validate data processing

An introduction to expectations

Expectation composition

Hands-on exercise – writing your first data quality expectation

Acting on failed expectations

Hands-on example – failing a pipeline run due to poor data quality

Applying multiple data quality expectations

Decoupling expectations from a DLT pipeline

Hands-on exercise – quarantining bad data for correction

Summary

4

Scaling DLT Pipelines

Technical requirements

Scaling compute to handle demand

Hands-on example – setting autoscaling properties using the Databricks REST API

Automated table maintenance tasks

Why auto compaction is important

Vacuuming obsolete table files

Moving compute closer to the data

Optimizing table layouts for faster table updates

Rewriting table files during updates

Data skipping using table partitioning

Delta Lake Z-ordering on MERGE columns

Improving write performance using deletion vectors

Serverless DLT pipelines

Introducing Enzyme, a performance optimization layer

Summary

Part 2: Securing the Lakehouse Using the Unity Catalog

5

Mastering Data Governance in the Lakehouse with Unity Catalog

Technical requirements

Understanding data governance in a lakehouse

Introducing the Databricks Unity Catalog

A problem worth solving

An overview of the Unity Catalog architecture

Unity Catalog-enabled cluster types

Unity Catalog object model

Enabling Unity Catalog on an existing Databricks workspace

Identity federation in Unity Catalog

Data discovery and cataloging

Tracking dataset relationships using lineage

Observability with system tables

Tracing the lineage of other assets

Fine-grained data access

Hands-on example – data masking healthcare datasets

Summary

6

Managing Data Locations in Unity Catalog

Technical requirements

Creating and managing data catalogs in Unity Catalog

Managed data versus external data

Saving data to storage volumes in Unity Catalog

Setting default locations for data within Unity Catalog

Isolating catalogs to specific workspaces

Creating and managing external storage locations in Unity Catalog

Storing cloud service authentication using storage credentials

Querying external systems using Lakehouse Federation

Hands-on lab – extracting document text for a generative AI pipeline

Generating mock documents

Defining helper functions

Choosing a file format randomly

Creating/assembling the DLT pipeline

Summary

7

Viewing Data Lineage Using Unity Catalog

Technical requirements

Introducing data lineage in Unity Catalog

Tracing data origins using the Data Lineage REST API

Visualizing upstream and downstream transformations

Identifying dependencies and impacts

Hands-on lab – documenting data lineage across an organization

Summary

Part 3: Continuous Integration, Continuous Deployment, and Continuous Monitoring

8

Deploying, Maintaining, and Administrating DLT Pipelines Using Terraform

Technical requirements

Introducing the Databricks provider for Terraform

Setting up a local Terraform environment

Importing the Databricks Terraform provider

Configuring workspace authentication

Defining a DLT pipeline source notebook

Applying workspace changes

Configuring DLT pipelines using Terraform

name

notification

channel

development

continuous

edition

photon

configuration

library

cluster

catalog

target

storage

Automating DLT pipeline deployment

Hands-on exercise – deploying a DLT pipeline using VS Code

Setting up VS Code

Creating a new Terraform project

Defining the Terraform resources

Deploying the Terraform project

Summary

9

Leveraging Databricks Asset Bundles to Streamline Data Pipeline Deployment

Technical requirements

Introduction to Databricks Asset Bundles

Elements of a DAB configuration file

Specifying a deployment mode

Databricks Asset Bundles in action

User-to-machine authentication

Machine-to-machine authentication

Initializing an asset bundle using templates

Hands-on exercise – deploying your first DAB

Hands-on exercise – simplifying cross-team collaboration with GitHub Actions

Setting up the environment

Configuring the GitHub Action

Testing the workflow

Versioning and maintenance

Summary

10

Monitoring Data Pipelines in Production

Technical requirements

Introduction to data pipeline monitoring

Exploring ways to monitor data pipelines

Using DBSQL Alerts to notify data validity

Pipeline health and performance monitoring

Hands-on exercise – querying data quality events for a dataset

Data quality monitoring

Introducing Lakehouse Monitoring

Hands-on exercise – creating a lakehouse monitor

Best practices for production failure resolution

Handling pipeline update failures

Recovering from table transaction failure

Hands-on exercise – setting up a webhook alert when a job runs longer than expected

Summary

Index

Other Books You May Enjoy

Preface

As datasets have exploded in size with the advent of cheap cloud storage, and processing data in near real time has become an industry standard, many organizations have turned to the lakehouse architecture, which combines the fast BI speeds of a traditional data warehouse with the scalable ETL processing of big data in the cloud. The Databricks Data Intelligence Platform – built upon several open source technologies, including Apache Spark, Delta Lake, MLflow, and Unity Catalog – eliminates friction points and accelerates the design and deployment of modern data applications built for the lakehouse.

In this book, you’ll start with an overview of the Delta Lake format, cover core concepts of the Databricks Data Intelligence Platform, and master building data pipelines using the Delta Live Tables framework. We’ll dive into applying data transformations, implementing the Databricks medallion architecture, and continuously monitoring the quality of data landing in your lakehouse. You’ll learn how to react to incoming data using the Databricks Auto Loader feature and automate real-time data processing using Databricks workflows. You’ll also learn how to use CI/CD tools such as Terraform and Databricks Asset Bundles (DABs) to deploy data pipeline changes automatically across deployment environments, as well as monitor, control, and optimize cloud costs along the way. By the end of this book, you will have mastered building a production-ready, modern data application using the Databricks Data Intelligence Platform.

With Databricks recently named a Leader in the 2024 Gartner Magic Quadrant for Data Science and Machine Learning Platforms, the demand for mastering a skillset in the Databricks Data Intelligence Platform is only expected to grow in the coming years.

Who this book is for

This book is for data engineers, data scientists, and data stewards tasked with enterprise data processing for their organizations. It simplifies learning advanced data engineering techniques on Databricks, making the implementation of a cutting-edge lakehouse accessible to individuals with varying levels of technical expertise. However, beginner-level knowledge of Apache Spark and Python is needed to make the most out of the code examples in this book.

What this book covers

Chapter 1, An Introduction to Delta Live Tables, discusses building near-real-time data pipelines using the Delta Live Tables framework. It covers the fundamentals of pipeline design as well as the core concepts of the Delta Lake format. The chapter concludes with a simple example of building a Delta Live Table pipeline from start to finish.

Chapter 2, Applying Data Transformations Using Delta Live Tables, explores data transformations using Delta Live Tables, guiding you through the process of cleaning, refining, and enriching data to meet specific business requirements. You will learn how to use Delta Live Tables to ingest data from a variety of input sources, register datasets in Unity Catalog, and effectively apply changes to downstream tables.

Chapter 3, Managing Data Quality Using Delta Live Tables, introduces several techniques for enforcing data quality requirements on newly arriving data. You will learn how to define data quality constraints using Expectations in the Delta Live Tables framework, as well as monitor the data quality of a pipeline in near real time.

Chapter 4, Scaling DLT Pipelines, explains how to scale a Delta Live Tables (DLT) pipeline to handle the unpredictable demands of a typical production environment. You will take a deep dive into configuring pipeline settings using the DLT UI and Databricks Pipeline REST API. You will also gain a better understanding of the daily DLT maintenance tasks that are run in the background and how to optimize table layouts to improve performance.

Chapter 5, Mastering Data Governance in the Lakehouse with Unity Catalog, provides a comprehensive guide to enhancing data governance and compliance of your lakehouse using Unity Catalog. You will learn how to enable Unity Catalog on a Databricks workspace, enable data discovery using metadata tags, and implement fine-grained row and column-level access control of datasets.

Chapter 6, Managing Data Locations in Unity Catalog, explores how to effectively manage storage locations using Unity Catalog. You will learn how to govern data access across various roles and departments within an organization while ensuring security and auditability with the Databricks Data Intelligence Platform.

Chapter 7, Viewing Data Lineage Using Unity Catalog, discusses tracing data origins, visualizing data transformations, and identifying upstream and downstream dependencies by tracing data lineage in Unity Catalog. By the end of the chapter, you will be equipped with the skills needed to validate that data is coming from trusted sources.

Chapter 8, Deploying, Maintaining, and Administrating DLT Pipelines Using Terraform, covers deploying DLT pipelines using the Databricks Terraform provider. You will learn how to set up a local development environment and automate a continuous build and deployment pipeline, along with best practices and future considerations.

Chapter 9, Leveraging Databricks Asset Bundles to Streamline Data Pipeline Deployment, explores how DABs can be used to streamline the deployment of data analytics projects and improve cross-team collaboration. You will gain an understanding of the practical use of DABs through several hands-on examples.

Chapter 10, Monitoring Data Pipelines in Production, delves into the crucial task of monitoring data pipelines in Databricks. You will learn various mechanisms for tracking pipeline health, performance, and data quality within the Databricks Data Intelligence Platform.

To get the most out of this book

While not a mandatory requirement, to get the most out of this book, it’s recommended that you have beginner-level knowledge of Python and Apache Spark, and at least some knowledge of navigating around the Databricks Data Intelligence Platform. It’s also recommended to have the following dependencies installed locally in order to follow along with the hands-on exercises and code examples throughout the book:

Software/hardware covered in the book: Python 3.6+ and Databricks CLI 0.205+

Operating system requirements: Windows, macOS, or Linux
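
To confirm that a compatible version of the Databricks CLI is installed and authenticated against your workspace, you can run something like the following from a terminal; the workspace URL is a placeholder that you would replace with your own:

$ databricks --version
$ databricks auth login --host https://<your-workspace-url>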

Furthermore, it’s recommended that you have a Databricks account and workspace to log in, import notebooks, create clusters, and create new data pipelines. If you do not have a Databricks account, you can sign up for a free trial on the Databricks website https://www.databricks.com/try-databricks.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Building-Modern-Data-Applications-Using-Databricks-Lakehouse. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “The result of the data generator notebook should be three tables in total: youtube_channels, youtube_channel_artists, and combined_table.”

A block of code is set as follows:

@dlt.table(
    name="random_trip_data_raw",
    comment="The raw taxi trip data ingested from a landing zone.",
    table_properties={
        "quality": "bronze"
    }
)

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

@dlt.table(
    name="random_trip_data_raw",
    comment="The raw taxi trip data ingested from a landing zone.",
    table_properties={
        "quality": "bronze",
        "pipelines.autoOptimize.managed": "false"
    }
)

Any command-line input or output is written as follows:

$ databricks bundle validate

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Click the Run all button at the top right of the Databricks workspace to execute all the notebook cells, verifying that all cells execute successfully.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Building Modern Data Applications Using Databricks Lakehouse, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there. You can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/978-1-80107-323-3

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly

Part 1:Near-Real-Time Data Pipelines for the Lakehouse

In this first part of the book, we’ll introduce the core concepts of the Delta Live Tables (DLT) framework. We’ll cover how to ingest data from a variety of input sources and apply the latest changes to downstream tables. We’ll also explore how to enforce requirements on incoming data so that your data teams can be alerted of potential data quality issues that might contaminate your lakehouse.

This part contains the following chapters:

Chapter 1, An Introduction to Delta Live Tables
Chapter 2, Applying Data Transformations Using Delta Live Tables
Chapter 3, Managing Data Quality Using Delta Live Tables
Chapter 4, Scaling DLT Pipelines

1

An Introduction to Delta Live Tables

In this chapter, we will examine how the data industry has evolved over the last several decades. We’ll also look at why real-time data processing is closely tied to how quickly a business can react to the latest signals in its data. We’ll address why trying to build your own streaming solution from scratch may not be sustainable, and why its maintenance does not scale easily over time. By the end of the chapter, you should have a clear understanding of the types of problems the Delta Live Tables (DLT) framework solves and the value the framework brings to data engineering teams.

In this chapter, we’re going to cover the following main topics:

The emergence of the lakehouse
The importance of real-time data in the lakehouse
The maintenance predicament of a streaming application
What is the Delta Live Tables framework?
How are Delta Live Tables related to Delta Lake?
An introduction to Delta Live Tables concepts
A quick Delta Lake primer
A hands-on example – creating your first Delta Live Tables pipeline

Technical requirements

It’s recommended to have access to a Databricks premium workspace to follow along with the code examples at the end of the chapter. It’s also recommended to have Databricks workspace permissions to create an all-purpose cluster and a DLT pipeline using a cluster policy. Users will create and attach a notebook to a cluster and execute the notebook cells. All code samples can be downloaded from this chapter’s GitHub repository, located at https://github.com/PacktPublishing/Building-Modern-Data-Applications-Using-Databricks-Lakehouse/tree/main/chapter01. This chapter will create and run a new DLT pipeline using the Core product edition. As a result, the pipeline is estimated to consume around 5–10 Databricks Units (DBUs).

The emergence of the lakehouse

During the early 1980s, the data warehouse was a great tool for processing structured data. Combined with the right indexing methods, data warehouses allowed us to serve business intelligence (BI) reports at blazing speeds. However, after the turn of the century, data warehouses could not keep up with newer data formats such as JSON, as well as new data modalities such as audio and video. Simply put, data warehouses struggled to process the semi-structured and unstructured data that most businesses used. Additionally, data warehouses struggled to scale to the millions or billions of rows common in the new information era of the early 2000s. Overnight batch data processing jobs soon ran into BI reports scheduled to refresh during the early morning business hours.

At the same time, cloud computing became a popular choice among organizations because it provided enterprises with an elastic computing capacity that could quickly grow or shrink, based on the current computing demand, without having to deal with the upfront costs of provisioning and installing additional hardware on-premises.

Modern extract, transform, and load (ETL) processing engines such as Apache Hadoop and Apache Spark™ addressed the performance problem of processing big data ETL pipelines, ushering in a new concept, the data lake. Conversely, data lakes were terrible for serving BI reports and oftentimes delivered degraded performance for many concurrent user sessions. Furthermore, data lakes had poor data governance. They were prone to sloppy data wrangling patterns, leading to many expensive copies of the same datasets that frequently diverged from the source of truth. As a result, these data lakes quickly earned the nickname of data swamps. The big data industry needed a change. The lakehouse pattern was this change and aimed to combine the best of both worlds – fast BI reports and fast ETL processing of structured, semi-structured, and unstructured data in the cloud.

The Lambda architectural pattern

In the early 2010s, data streaming gained a foothold in the data industry, and many enterprises needed a way to support both batch ETL processing and append-only streams of data. Furthermore, in data architectures with many concurrent ETL processes, jobs needed to simultaneously read and change the underlying data. It was not uncommon for organizations to experience frequent conflicting write failures that led to data corruption and even data loss. As a result, in many early data architectures, a two-pronged Lambda architecture was built to provide a layer of isolation between these processes.

Figure 1.1 – A Lambda architecture was oftentimes created to support both real-time streaming workloads and batch processes such as BI reports

Using the Lambda architecture, downstream processes such as BI reports or Machine Learning (ML) model training could execute calculations on a snapshot of data, while streaming processes could apply near real-time data changes in isolation. However, these Lambda architectures duplicated data to support concurrent batch and streaming workloads, leading to inconsistent data changes that needed to be reconciled at the end of each business day.

Introducing the medallion architecture

In an effort to clean up data lakes and prevent bad data practices, data lake architects needed a data processing pattern that would meet the high demands of modern-day ETL processing. In addition, organizations needed a simplified architecture for batch and streaming workloads, easy data rollbacks, good data auditing, and strong data isolation, while scaling to process terabytes or even petabytes of data daily.

As a result, a design pattern within the lakehouse emerged, commonly referred to as the medallion architecture. This data processing pattern physically isolates data processing and improves data quality by applying business-level transformations in successive data hops, also called data layers.

Figure 1.2 – The lakehouse medallion architecture

A typical design pattern for organizing data within a lakehouse (as shown in Figure 1.2) includes three distinct data layers – a bronze layer, a silver layer, and finally, a gold layer:

The bronze layer serves as a landing zone for raw, unprocessed data.
Filtered, cleaned, and augmented data with a defined structure and enforced schema will be stored in the silver layer.
Lastly, a refined, or gold, layer will deliver pristine, business-level aggregations ready to be consumed by downstream BI and ML systems.

Moreover, this simplified data architecture unifies batch and streaming workloads, by storing datasets in a big data format that supports concurrent batch and streaming data operations.
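
As a rough sketch of these successive data hops, the following PySpark example moves data through hypothetical bronze, silver, and gold Delta tables; the landing path, column names, and table names are illustrative assumptions rather than a prescribed implementation:

from pyspark.sql import functions as F

# Bronze: land the raw, unprocessed records as-is (spark is the notebook's SparkSession)
raw_df = spark.read.json("/landing/taxi_trips/")  # hypothetical landing zone path
raw_df.write.format("delta").mode("append").saveAsTable("bronze.taxi_trips_raw")

# Silver: filter, clean, and enforce a defined structure
silver_df = (
    spark.read.table("bronze.taxi_trips_raw")
        .filter(F.col("trip_distance") > 0)
        .select("trip_id", "pickup_ts", "trip_distance", "fare_amount")
)
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver.taxi_trips_cleaned")

# Gold: business-level aggregation ready for BI consumption
gold_df = (
    spark.read.table("silver.taxi_trips_cleaned")
        .groupBy(F.to_date("pickup_ts").alias("trip_date"))
        .agg(F.sum("fare_amount").alias("daily_revenue"))
)
gold_df.write.format("delta").mode("overwrite").saveAsTable("gold.daily_revenue")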

The Databricks lakehouse

The Databricks lakehouse combines the processing power of Apache Spark with a new high-performance processing engine, called the Photon Engine. Combined with open data formats for data storage and support for a wide range of data types, including structured, semi-structured, and unstructured data, the Photon Engine can process a wide variety of workloads using a single, consistent snapshot of the data in cheap and resilient cloud storage. In addition, the Databricks lakehouse simplifies data architecture by unifying batch and streaming processing with a single API – the Spark DataFrame API. Lastly, the Databricks lakehouse was built with data governance and data security in mind, allowing organizations to centrally define data access patterns and consistently apply them across their businesses.

In this book, we’ll cover three major features that the Databricks lakehouse is anchored in:

The Delta Lake format
The Photon Engine
Unity Catalog

While Delta Lake can be used to process both batch and streaming workloads concurrently, most data teams choose to implement their ETL pipelines using a batch execution model, mainly for simplicity’s sake. Let’s look at why that might be the case.

The maintenance predicament of a streaming application

Spark Structured Streaming provides near-real-time stream processing with fault tolerance, and exactly-once processing guarantees through the use of a DataFrame API that is near-identical to batch processing in Spark. As a result of a common DataFrame API, data engineering teams can convert existing batch Spark workloads to streaming with minimal effort. However, as the volume of data increases and the number of ingestion sources and data pipelines naturally grows over time, data engineering teams face the burden of augmenting existing data pipelines to keep up with new data transformations or changing business logic. In addition, Spark Streaming comes with additional configuration maintenance such as updating checkpoint locations, managing watermarks and triggers, and even backfilling tables when a significant data change or data correction occurs. Advanced data engineering teams may even be expected to build data validation and system monitoring capabilities, adding even more custom pipeline features to maintain. Over time, data pipeline complexity will grow, and data engineering teams will spend most of their time maintaining the operation of data pipelines in production and less time gleaning insights from their enterprise data. It’s evident that a framework is needed that allows data engineers to quickly declare data transformations, manage data quality, and rapidly deploy changes to production where they can monitor pipeline operations from a UI or other notification systems.
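
To make that maintenance burden concrete, here is a sketch of the kind of configuration a hand-rolled Structured Streaming job tends to accumulate; the paths, schema, and column names are illustrative assumptions:

from pyspark.sql import functions as F

windowed_totals = (
    spark.readStream.format("json")
        .schema("event_id STRING, event_ts TIMESTAMP, amount DOUBLE")
        .load("/landing/events/")                       # hypothetical input path
        .withWatermark("event_ts", "10 minutes")        # late-data tolerance the team must tune
        .groupBy(F.window("event_ts", "5 minutes"))
        .agg(F.sum("amount").alias("total_amount"))
)

query = (
    windowed_totals.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/checkpoints/events_agg")  # must be tracked and reset on backfills
        .trigger(processingTime="1 minute")                       # trigger cadence managed by hand
        .toTable("silver.events_agg")
)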

What is the DLT framework?

DLT is a declarative framework that aims to simplify the development and maintenance operations of a data pipeline by abstracting away a lot of the boilerplate complexities. For example, rather than declaring how to transform, enrich, and validate data, data engineers can declare what transformations to apply to newly arriving data. Furthermore, DLT provides support to enforce data quality, preventing a data lake from becoming a data swamp. DLT gives data teams the ability to choose how to handle poor-quality data, whether that means printing a warning message to the system logs, dropping invalid data, or failing a data pipeline run altogether. Lastly, DLT automatically handles the mundane data engineering tasks of maintaining optimized data file sizes of the underlying tables, as well as cleaning up obsolete data files that are no longer present in the Delta transaction log (Optimize and Vacuum operations are covered later in the A quick Delta Lake primer section). DLT aims to ease the maintenance and operational burden on data engineering teams so that they can focus their time on uncovering business value from the data stored in their lakehouse, rather than spending time managing operational complexities.
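
For instance, a data quality rule in DLT can be declared directly on top of a table definition; in this minimal sketch (with an assumed upstream dataset and column names), rows that violate the rule are dropped rather than failing the pipeline:

import dlt
from pyspark.sql import functions as F

@dlt.table(name="trips_cleaned", comment="Trips with a valid, positive distance.")
@dlt.expect_or_drop("valid_distance", "trip_distance > 0")  # invalid rows are dropped and counted in pipeline metrics
def trips_cleaned():
    return dlt.read("trips_raw").filter(F.col("fare_amount").isNotNull())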

How is DLT related to Delta Lake?

The DLT framework relies heavily on the Delta Lake format to incrementally process data at every step of the way. For example, streaming tables and materialized views defined in a DLT pipeline are backed by a Delta table. Features that make Delta Lake an ideal storage format for a streaming pipeline include support for Atomicity, Consistency, Isolation, and Durability (ACID) transactions so that concurrent data modifications such as inserts, updates, and deletions can be incrementally applied to a streaming table. Plus, Delta Lake features scalable metadata handling, allowing Delta Lake to easily scale to petabytes and beyond. If there is an incorrect data computation, Delta Lake offers time travel – the ability to restore a copy of a table to a previous snapshot. Lastly, Delta Lake inherently tracks audit information in each table’s transaction log. Provenance information, such as which operation modified the table, which cluster ran it, which user issued it, and the precise timestamp, is captured alongside the data files. Let’s look at how DLT leverages Delta tables to quickly and efficiently define data pipelines that can scale over time.
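
As a quick illustration of these Delta Lake capabilities, the following snippet inspects a table’s transaction history and reads an earlier snapshot; the table name and version number are placeholders:

# Inspect the audit trail recorded in the Delta transaction log
spark.sql("DESCRIBE HISTORY silver.taxi_trips_cleaned").show(truncate=False)

# Time travel: read the table as it existed at an earlier version
previous_snapshot = (
    spark.read.format("delta")
        .option("versionAsOf", 3)   # assumed earlier version number
        .table("silver.taxi_trips_cleaned")
)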

Introducing DLT concepts

The DLT framework automatically manages task orchestration, cluster creation, and exception handling, allowing data engineers to focus on defining transformations, data enrichment, and data validation logic. Data engineers will define a data pipeline using one or more dataset types. Under the hood, the DLT system will determine how to keep these datasets up to date. A data pipeline using the DLT framework is made up of the streaming tables, materialized views, and views dataset types, which we’ll discuss in detail in the following sections. We’ll also briefly discuss how to visualize the pipeline, view its triggering method, and look at the entire pipeline data flow from a bird’s-eye view. We’ll also briefly understand the different types of Databricks compute and runtime, and Unity Catalog. Let’s go ahead and get started.

Streaming tables

Streaming tables leverage the benefits of Delta Lake and Spark Structured Streaming to incrementally process new data as it arrives. This dataset type is useful when data must be ingested, transformed, or enriched at a high throughput and low latency. Streaming tables were designed specifically for data sources that append new data only and do not include data modification, such as updates or deletes. As a result, this type of dataset can scale to large data volumes, since it can incrementally apply data transformations as soon as new data arrives and does not need to recompute the entire table history during a pipeline update.
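
In the Python API, a streaming table is declared by returning a streaming DataFrame from a decorated function, as in this minimal sketch that ingests an append-only landing zone (the path is an assumption):

import dlt

@dlt.table(name="orders_raw", comment="Incrementally ingested, append-only order events.")
def orders_raw():
    return (
        spark.readStream.format("cloudFiles")            # Databricks Auto Loader
            .option("cloudFiles.format", "json")
            .load("/landing/orders/")                     # hypothetical landing zone
    )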

Materialized views

Materialized views leverage Delta Lake to compute the latest changes to a dataset and materialize the results in cloud storage. This dataset type is great when the data source includes data modifications such as updates and deletions, or a data aggregation must be performed. Under the hood, the DLT framework will perform the calculations to recompute the latest data changes to the dataset, using the full table’s history. The output of this calculation is stored in cloud storage so that future queries can reference the pre-computed results, as opposed to re-performing the full calculations each time the table is queried. As a result, this type of dataset will incur additional storage and compute costs each time the materialized view is updated. Furthermore, materialized views can be published to Unity Catalog, so the results can be queried outside of the DLT data pipeline. This is great when you need to share the output of a query across multiple data pipelines.
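
A materialized view is declared the same way but from a batch (non-streaming) query, and DLT stores the recomputed result on each update. A minimal sketch, assuming an upstream orders_raw dataset with the columns shown:

import dlt
from pyspark.sql import functions as F

@dlt.table(name="daily_order_totals", comment="Order totals aggregated per day.")
def daily_order_totals():
    return (
        dlt.read("orders_raw")                            # batch read of an upstream pipeline dataset
            .groupBy(F.to_date("order_ts").alias("order_date"))
            .agg(F.sum("order_amount").alias("total_amount"))
    )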

Views

Views also recompute the latest results of a particular query but do not materialize the results to cloud storage, which helps save on storage costs. This dataset type is great when you want to quickly check the intermediate result of data transformations in a data pipeline or apply other ad hoc data validations. Furthermore, the results of this dataset type cannot be published to Unity Catalog and are only available within the context of the data pipeline.
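
Views use the @dlt.view decorator; the result is available to other datasets in the pipeline but is never materialized to storage or published to Unity Catalog. A small sketch for an intermediate sanity check, again assuming the orders_raw dataset:

import dlt
from pyspark.sql import functions as F

@dlt.view(comment="Intermediate check: orders with a missing or negative amount.")
def suspicious_orders():
    return dlt.read("orders_raw").filter(
        F.col("order_amount").isNull() | (F.col("order_amount") < 0)
    )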

The following table summarizes the differences between the different dataset types in the DLT framework and when it’s appropriate to use one dataset type versus the other:

Streaming table: Ingestion workloads, when you need to continuously append new data to a target table with high throughput and low latency.

Materialized view: Data operations that include data modifications, such as updates and deletions, or when you need to perform aggregations on the full table history.

View: When you need to query intermediate data without publishing the results to Unity Catalog (e.g., perform data quality checks on intermediate transformations).

Table 1.1 – Each dataset type in DLT serves a different purpose

Pipeline

A DLT pipeline is the logical data processing graph of one or more streaming tables, materialized views, or views. The DLT framework will take dataset declarations, using either the Python API or SQL API, and infer the dependencies between each dataset. Once a pipeline update runs, the DLT framework will update the datasets in the correct order using a dependency graph, called a dataflow graph.
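
Dependencies are inferred from how datasets reference one another; in this short sketch, DLT places orders_cleaned downstream of orders_raw in the dataflow graph simply because it reads from it:

import dlt
from pyspark.sql import functions as F

@dlt.table
def orders_cleaned():
    # Reading another pipeline dataset is what creates the edge in the dataflow graph
    return dlt.read("orders_raw").filter(F.col("order_amount") > 0)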

Pipeline triggers

A pipeline will be executed based on some triggering event. DLT offers three types of triggers – manual, scheduled, and continuous. Once triggered, the pipeline will initialize and execute the dataflow graph, updating the state of each dataset.

Workflow

Databricks workflows is a managed orchestration feature of the Databricks Data Intelligence Platform that allows data engineers to chain together one or more dependent data processing tasks. For more complex data processing use cases, it may be necessary to build a data pipeline using multiple, nested DLT pipelines. For those use cases, Databricks workflows can simplify the orchestration of these data processing tasks.

Types of Databricks compute

There are four types of computational resources available to Databricks users from the Databricks Data Intelligence Platform.

Job computes

A job compute is an ephemeral collection of virtual machines (VMs) with the Databricks Runtime (DBR) installed that is dynamically provisioned for the duration of a scheduled job. Once the job is complete, the VMs are immediately released back to the cloud provider. Since job clusters do not utilize the UI components of the Databricks Data Intelligence Platform (e.g., notebooks and the query editor), they are billed at a lower Databricks Unit (DBU) rate for the entirety of their execution.

All-purpose computes

An all-purpose compute is a collection of ephemeral VMs with the DBR installed that is dynamically provisioned by a user, either directly from the Databricks UI via a button click or via the Databricks REST API (using the /api/2.0/clusters/create endpoint, for example), and it remains running until a user, or an expiring auto-termination timer, terminates the cluster. Upon termination, the VMs are returned to the cloud provider, and Databricks stops assessing additional DBUs.
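
As an illustration of the REST path, the following Python sketch calls the clusters/create endpoint; the workspace URL, token, runtime version, and node type are placeholder assumptions you would replace with values valid for your workspace and cloud provider:

import requests

workspace_url = "https://<your-workspace-url>"   # placeholder
token = "<personal-access-token>"                # placeholder

response = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_name": "ad-hoc-dev-cluster",
        "spark_version": "14.3.x-scala2.12",     # assumed DBR version
        "node_type_id": "i3.xlarge",             # assumed (AWS) node type
        "num_workers": 2,
        "autotermination_minutes": 60,
    },
)
print(response.json())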

Instance pools

Instance pools are a feature in Databricks that helps reduce the time it takes to provision additional VMs and install the DBR. Instance pools will pre-provision VMs from the cloud provider and hold them in a logical container, similar to a valet keeping your car running in a valet parking lot.

For some cloud providers, it can take 15 minutes or more to provision an additional VM, leading to long waits during troubleshooting cycles or ad hoc development tasks, such as log inspection or rerunning failed notebook cells during the development of new features.

Additionally, instance pools improve efficiency when many jobs are scheduled to execute closely together or with overlapping schedules. For example, as one job finishes, rather than releasing the VMs back to the cloud provider, the job cluster can place the VMs into the instance pool to be reused by the next job.

Before returning the VMs to the instance pool, the Databricks container installed on the VM is destroyed, and a new container is installed on the VM containing the DBR when the next scheduled job requests the VM.

Important note

Databricks will not assess additional DBUs while VMs sit idle in an instance pool. However, the cloud provider will continue to charge for as long as the VMs are held in the instance pool.

To help control costs, instance pools provide an autoscaling feature that allows the size of the pool to grow and shrink, in response to demand. For example, the instance pool might grow to 10 VMs during peak hours but shrink back to 1 or 2 during lulls in the processing demand.

Databricks SQL warehouses

The last type of computational resource featured in the Databricks Data Intelligence Platform is Databricks SQL (DBSQL) warehouses. DBSQL warehouses are designed to run SQL workloads such