Written by a Senior Solutions Architect at Databricks, Data Engineering with Databricks Cookbook will show you how to effectively use Apache Spark, Delta Lake, and Databricks for data engineering, starting with a comprehensive introduction to data ingestion and loading with Apache Spark.
What makes this book unique is its recipe-based approach, which will help you put your knowledge to use straight away and tackle common problems. You’ll be introduced to a variety of data manipulation and data transformation solutions, find out how to manage and optimize Delta tables, and get to grips with ingesting and processing streaming data. The book will also show you how to diagnose and resolve performance problems in Apache Spark applications and Delta Lake. Advanced recipes later in the book will teach you how to use Databricks to implement DataOps and DevOps practices, as well as how to orchestrate and schedule data pipelines using Databricks Workflows. You’ll also go through the full process of setting up and configuring Unity Catalog for data governance.
By the end of this book, you’ll be well-versed in building reliable and scalable data pipelines using modern data engineering technologies.
Data Engineering with Databricks Cookbook
Build effective data and AI solutions using Apache Spark, Databricks, and Delta Lake
Pulkit Chadha
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Apeksha Shetty
Publishing Product Manager: Deepesh Patel
Book Project Manager: Shambhavi Mishra
Senior Editor: Rohit Singh
Technical Editor: Kavyashree KS
Copy Editor: Safis Editing
Proofreaders: Safis Editing and Rohit Singh
Indexer: Manju Arasan
Production Designer: Alishon Mendonca
DevRel Marketing Executive: Nivedita Singh
First published: May 2024
Production reference: 1100524
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-83763-335-7
www.packtpub.com
To my wonderful wife, Latika Kapoor, thank you for always being there for me. Your support and belief in me have helped me achieve my dream of writing this book. You’ve motivated me every step of the way, and I couldn’t have done it without you.
– Pulkit Chadha
Pulkit Chadha is a seasoned technologist with over 15 years of experience in data engineering. His proficiency in crafting and refining data pipelines has been instrumental in driving success across diverse sectors such as healthcare, media and entertainment, hi-tech, and manufacturing. Pulkit’s tailored data engineering solutions are designed to address the unique challenges and aspirations of each enterprise he collaborates with.
An alumnus of the University of Arizona, Pulkit holds a master’s degree in management information systems along with multiple cloud certifications. His impactful career includes tenures at Dell Services, Adobe, and Databricks, shaping data-driven decision-making and business growth.
Gaurav Chawla is a seasoned data scientist and machine learning engineer at JP Morgan Chase with more than a decade of expertise in machine learning and software engineering. He specializes in areas such as fraud detection and the end-to-end development of real-time machine learning models. Gaurav holds a master’s degree in data science from Columbia University in the City of New York.
I express gratitude to my wife, Charika, for her constant support, and to our delightful son, Angad, who brings immense joy into our lives. I extend thanks to my parents for imparting valuable teachings and fostering a resilient foundation for my character.
Jaime Andres Salas is an exceptionally enthusiastic data professional. With more than six years of experience in data engineering and data management, he has designed and maintained large-scale enterprise data platforms such as data warehouses, data lakehouses, and data pipeline solutions. Jaime Andres holds a bachelor’s degree in electronic engineering from Espol and an MBA from UOC.
Throughout his professional journey, he has successfully undertaken significant big data and data engineering projects for a diverse range of industries, including retail, production, brewing, and insurance.
Mohit Raja Sudhera has over a decade of extensive experience in data and cloud engineering. He currently leads a team of talented engineers at a prominent and innovative global healthcare provider.
His core competencies lie in designing and delivering scalable and robust data-intensive solutions utilizing highly performant systems such as Spark, Kafka, Snowflake, and Azure Databricks. Mohit spearheads the architecture and standardization of data-driven capabilities responsible for optimizing the performance and latency of reporting dashboards.
His dedication to continuous learning in the data engineering domain has given him invaluable hands-on experience in crafting and executing data-intensive jobs and pipelines.
In this part, we will explore the essentials of data operations with Apache Spark and Delta Lake, covering data ingestion, extraction, transformation, and manipulation to align with business analytics. We will delve into Delta Lake for reliable data management with ACID transactions and versioning, and tackle streaming data ingestion and processing for real-time insights. This part concludes with performance tuning strategies for both Apache Spark and Delta Lake, ensuring efficient data processing within the Lakehouse architecture.
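As a brief taste of what these chapters cover, the following minimal sketch shows Delta Lake’s versioned, ACID-compliant writes from PySpark. It assumes a Databricks environment (or a local Spark session configured with the delta-spark package); the table path and sample columns are purely illustrative:

from pyspark.sql import SparkSession

# On Databricks, a Delta-enabled Spark session is already available;
# locally, the delta-spark package and Delta SQL extensions would be required.
spark = SparkSession.builder.getOrCreate()

# Write some illustrative rows as a Delta table -- every write is an
# ACID transaction that produces a new table version.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/demo_delta_table")

# Time travel: read the table as of its first version (version 0).
first_version = (spark.read.format("delta")
                 .option("versionAsOf", 0)
                 .load("/tmp/demo_delta_table"))
first_version.show()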
This part contains the following chapters:
Chapter 1, Data Ingestion and Data Extraction with Apache Spark
Chapter 2, Data Transformation and Data Manipulation with Apache Spark
Chapter 3, Data Management with Delta Lake
Chapter 4, Ingesting Streaming Data
Chapter 5, Processing Streaming Data
Chapter 6, Performance Tuning with Apache Spark
Chapter 7, Performance Tuning in Delta Lake