Data Engineering with Databricks Cookbook

Pulkit Chadha

Description

Written by a Senior Solutions Architect at Databricks, Data Engineering with Databricks Cookbook will show you how to use Apache Spark, Delta Lake, and Databricks effectively for data engineering, starting with a comprehensive introduction to data ingestion and loading with Apache Spark.
What makes this book unique is its recipe-based approach, which will help you put your knowledge to use straight away and tackle common problems. You’ll be introduced to a range of data manipulation and data transformation solutions, find out how to manage and optimize Delta tables, and get to grips with ingesting and processing streaming data. The book will also show you how to improve the performance of Apache Spark applications and Delta Lake. Advanced recipes later in the book will teach you how to use Databricks to implement DataOps and DevOps practices, as well as how to orchestrate and schedule data pipelines using Databricks Workflows. You’ll also go through the full process of setting up and configuring Unity Catalog for data governance.
By the end of this book, you’ll be well-versed in building reliable and scalable data pipelines using modern data engineering technologies.
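To give a flavor of the book’s recipe style, the following is a minimal, hypothetical sketch of the kind of task covered in the opening chapter: reading a CSV file with Apache Spark. The file path, options, and session name are illustrative placeholders, not examples taken from the book.

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("csv-ingestion-sketch").getOrCreate()

# Read a CSV file, treating the first row as a header and inferring column types
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/raw/sales.csv")  # hypothetical path
)

df.printSchema()  # inspect the inferred schema
df.show(5)        # preview the first few rows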




Data Engineering with Databricks Cookbook

Build effective data and AI solutions using Apache Spark, Databricks, and Delta Lake

Pulkit Chadha

Data Engineering with Databricks Cookbook

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Apeksha Shetty

Publishing Product Manager: Deepesh Patel

Book Project Manager: Shambhavi Mishra

Senior Editor: Rohit Singh

Technical Editor: Kavyashree KS

Copy Editor: Safis Editing

Proofreaders: Safis Editing and Rohit Singh

Indexer: Manju Arasan

Production Designer: Alishon Mendonca

DevRel Marketing Executive: Nivedita Singh

First published: May 2024

Production reference: 1100524

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-83763-335-7

www.packtpub.com

To my wonderful wife, Latika Kapoor, thank you for always being there for me. Your support and belief in me have helped me achieve my dream of writing this book. You’ve motivated me every step of the way, and I couldn’t have done it without you.

– Pulkit Chadha

Contributors

About the author

Pulkit Chadha is a seasoned technologist with over 15 years of experience in data engineering. His proficiency in crafting and refining data pipelines has been instrumental in driving success across diverse sectors such as healthcare, media and entertainment, hi-tech, and manufacturing. Pulkit’s tailored data engineering solutions are designed to address the unique challenges and aspirations of each enterprise he collaborates with.

An alumnus of the University of Arizona, Pulkit holds a master’s degree in management information systems along with multiple cloud certifications. His impactful career includes tenures at Dell Services, Adobe, and Databricks, shaping data-driven decision-making and business growth.

About the reviewers

Gaurav Chawla is a seasoned data scientist and machine learning engineer at JP Morgan Chase with more than a decade of expertise in machine learning and software engineering. His focus lies in specialized areas such as fraud detection and the end-to-end development of real-time machine learning models. Gaurav holds a master’s degree in data science from Columbia University in the City of New York.

I express gratitude to my wife, Charika, for her constant support, and to our delightful son, Angad, who brings immense joy into our lives. I extend thanks to my parents for imparting valuable teachings and fostering a resilient foundation for my character.

Jaime Andres Salas is a highly enthusiastic data professional. With more than six years of expertise in data engineering and data management, he has extensive experience in designing and maintaining large-scale enterprise data platforms such as data warehouses, data lakehouses, and data pipeline solutions. Jaime Andres holds a bachelor’s degree in electronic engineering from Espol and an MBA from UOC.

Throughout his professional journey, he has successfully undertaken significant big data and data engineering projects for a diverse range of industries, including retail, production, brewing, and insurance.

Mohit Raja Sudhera has over a decade of extensive experience in data and cloud engineering. He currently leads a team of talented engineers at one of the world’s most prominent and innovative healthcare providers.

His core competencies lie in designing and delivering scalable and robust data-intensive solutions utilizing highly performant systems such as Spark, Kafka, Snowflake, and Azure Databricks. Mohit spearheads the architecture and standardization of data-driven capabilities responsible for optimizing the performance and latency of reporting dashboards.

His dedication to self-education within the data engineering domain has afforded him invaluable exposure to crafting and executing data-intensive jobs/pipelines.

Table of Contents

Preface

Part 1 – Working with Apache Spark and Delta Lake

1

Data Ingestion and Data Extraction with Apache Spark

Technical requirements

Reading CSV data with Apache Spark

How to do it...

There’s more…

See also

Reading JSON data with Apache Spark

How to do it...

There’s more…

See also

Reading Parquet data with Apache Spark

How to do it...

See also

Parsing XML data with Apache Spark

How to do it…

There’s more…

See also

Working with nested data structures in Apache Spark

How to do it…

There’s more…

See also

Processing text data in Apache Spark

How to do it…

There’s more…

See also

Writing data with Apache Spark

How to do it…

There’s more…

See also

2

Data Transformation and Data Manipulation with Apache Spark

Technical requirements

Applying basic transformations to data with Apache Spark

How to do it...

There’s more…

See also

Filtering data with Apache Spark

How to do it…

There’s more…

See also

Performing joins with Apache Spark

How to do it...

There’s more…

See also

Performing aggregations with Apache Spark

How to do it...

There’s more…

See also

Using window functions with Apache Spark

How to do it...

There’s more…

Writing custom UDFs in Apache Spark

How to do it...

There’s more…

See also

Handling null values with Apache Spark

How to do it...

There’s more…

See also

3

Data Management with Delta Lake

Technical requirements

Creating a Delta Lake table

How to do it...

There’s more…

See also

Reading a Delta Lake table

How to do it...

There’s more...

See also

Updating data in a Delta Lake table

How to do it...

See also

Merging data into Delta tables

How to do it...

There’s more…

See also

Change data capture in Delta Lake

How to do it...

See also

Optimizing Delta Lake tables

How to do it...

There’s more...

See also

Versioning and time travel for Delta Lake tables

How to do it...

There’s more...

See also

Managing Delta Lake tables

How to do it...

See also

4

Ingesting Streaming Data

Technical requirements

Configuring Spark Structured Streaming for real-time data processing

Getting ready

How to do it…

How it works…

There’s more…

See also

Reading data from real-time sources, such as Apache Kafka, with Apache Spark Structured Streaming

Getting ready

How to do it…

How it works…

There’s more…

See also

Defining transformations and filters on a Streaming DataFrame

Getting ready

How to do it…

See also

Configuring checkpoints for Structured Streaming in Apache Spark

Getting ready

How to do it…

How it works…

There’s more…

See also

Configuring triggers for Structured Streaming in Apache Spark

Getting ready

How to do it…

How it works…

See also

Applying window aggregations to streaming data with Apache Spark Structured Streaming

Getting ready

How to do it…

There’s more…

See also

Handling out-of-order and late-arriving events with watermarking in Apache Spark Structured Streaming

Getting ready

How to do it…

There’s more…

See also

5

Processing Streaming Data

Technical requirements

Writing the output of Apache Spark Structured Streaming to a sink such as Delta Lake

Getting ready

How to do it…

How it works…

See also

Idempotent stream writing with Delta Lake and Apache Spark Structured Streaming

Getting ready

How to do it…

See also

Merging or applying Change Data Capture on Apache Spark Structured Streaming and Delta Lake

Getting ready

How to do it…

There’s more…

Joining streaming data with static data in Apache Spark Structured Streaming and Delta Lake

Getting ready

How to do it…

There’s more…

See also

Joining streaming data with streaming data in Apache Spark Structured Streaming and Delta Lake

Getting ready

How to do it…

There’s more…

See also

Monitoring real-time data processing with Apache Spark Structured Streaming

Getting ready

How to do it…

There’s more…

See also

6

Performance Tuning with Apache Spark

Technical requirements

Monitoring Spark jobs in the Spark UI

How to do it…

See also

Using broadcast variables

How to do it…

How it works…

There’s more…

Optimizing Spark jobs by minimizing data shuffling

How to do it…

See also

Avoiding data skew

How to do it…

There’s more...

Caching and persistence

How to do it…

There’s more…

Partitioning and repartitioning

How to do it…

There’s more…

Optimizing join strategies

How to do it…

See also

7

Performance Tuning in Delta Lake

Technical requirements

Optimizing Delta Lake table partitioning for query performance

How to do it…

There’s more…

See also

Organizing data with Z-ordering for efficient query execution

How to do it…

How it works…

See also

Skipping data for faster query execution

How to do it…

See also

Reducing Delta Lake table size and I/O cost with compression

How to do it…

How it works…

See also

Part 2 – Data Engineering Capabilities within Databricks

8

Orchestration and Scheduling Data Pipelines with Databricks Workflows

Technical requirements

Building Databricks workflows

How to do it…

See also

Running and managing Databricks Workflows

How to do it...

See also

Passing task and job parameters within a Databricks Workflow

How to do it...

See also

Conditional branching in Databricks Workflows

How to do it...

See also

Triggering jobs based on file arrival

Getting ready

How to do it…

See also

Setting up workflow alerts and notifications

How to do it…

There’s more…

See also

Troubleshooting and repairing failures in Databricks Workflows

How to do it...

See also

9

Building Data Pipelines with Delta Live Tables

Technical requirements

Creating a multi-hop medallion architecture data pipeline with Delta Live Tables in Databricks

How to do it…

How it works…

See also

Building a data pipeline with Delta Live Tables on Databricks

How to do it…

See also

Implementing data quality and validation rules with Delta Live Tables in Databricks

How to do it…

How it works…

See also

Quarantining bad data with Delta Live Tables in Databricks

How to do it…

See also

Monitoring Delta Live Tables pipelines

How to do it…

See also

Deploying Delta Live Tables pipelines with Databricks Asset Bundles

Getting ready

How to do it…

There’s more…

See also

Applying changes (CDC) to Delta tables with Delta Live Tables

How to do it…

See also

10

Data Governance with Unity Catalog

Technical requirements

Connecting to cloud object storage using Unity Catalog

Getting ready

How to do it…

See also

Creating and managing catalogs, schemas, volumes, and tables using Unity Catalog

Getting ready

How to do it…

See also

Defining and applying fine-grained access control policies using Unity Catalog

Getting ready

How to do it…

See also

Tagging, commenting, and capturing metadata about data and AI assets using Databricks Unity Catalog

Getting ready

How to do it…

See also

Filtering sensitive data with Unity Catalog

Getting ready

How to do it…

See also

Using Unity Catalog’s lineage data for debugging, root cause analysis, and impact assessment

Getting ready

How to do it…

See also

Accessing and querying system tables using Unity Catalog

Getting ready

How to do it…

See also

11

Implementing DataOps and DevOps on Databricks

Technical requirements

Using Databricks Repos to store code in Git

Getting ready

How to do it…

There’s more…

See also

Automating tasks by using the Databricks CLI

Getting ready

How to do it…

There’s more…

See also

Using the Databricks VSCode extension for local development and testing

Getting ready

How to do it…

See also

Using Databricks Asset Bundles (DABs)

Getting ready

How to do it…

See also

Leveraging GitHub Actions with Databricks Asset Bundles (DABs)

Getting ready

How to do it…

See also

Index

Other Books You May Enjoy

Part 1 – Working with Apache Spark and Delta Lake

In this part, we will explore the essentials of data operations with Apache Spark and Delta Lake, covering data ingestion, extraction, transformation, and manipulation to align with business analytics. We will delve into Delta Lake for reliable data management with ACID transactions and versioning, and tackle streaming data ingestion and processing for real-time insights. This part concludes with performance tuning strategies for both Apache Spark and Delta Lake, ensuring efficient data processing within the Lakehouse architecture.

This part contains the following chapters:

Chapter 1, Data Ingestion and Data Extraction with Apache Spark
Chapter 2, Data Transformation and Data Manipulation with Apache Spark
Chapter 3, Data Management with Delta Lake
Chapter 4, Ingesting Streaming Data
Chapter 5, Processing Streaming Data
Chapter 6, Performance Tuning with Apache Spark
Chapter 7, Performance Tuning in Delta Lake
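As a small taste of what this part covers, here is a minimal, hypothetical sketch of writing a Delta Lake table and then using time travel to read an earlier version, topics addressed by the Chapter 3 recipes. The path and sample data are placeholders, and a Spark session with the delta-spark package configured is assumed.

from pyspark.sql import SparkSession

# Assumes a Spark session already configured with the delta-spark package
spark = SparkSession.builder.appName("delta-time-travel-sketch").getOrCreate()

path = "/data/delta/layers"  # hypothetical table location

# Write an initial DataFrame as a Delta table (creates version 0)
spark.createDataFrame([(1, "bronze"), (2, "silver")], ["id", "layer"]) \
    .write.format("delta").mode("overwrite").save(path)

# Append another row (creates version 1)
spark.createDataFrame([(3, "gold")], ["id", "layer"]) \
    .write.format("delta").mode("append").save(path)

# Time travel: read the table as it looked at version 0
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()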