Discover what makes the Databricks Data Intelligence Platform the go-to choice for top-tier machine learning solutions. Written by a team of industry experts at Databricks with decades of combined experience in big data, machine learning, and data science, Databricks ML in Action presents cloud-agnostic, end-to-end examples with hands-on illustrations of executing data science, machine learning, and generative AI projects on the Databricks Platform.
You’ll develop expertise in Databricks' managed MLflow, Vector Search, AutoML, Unity Catalog, and Model Serving as you learn to apply them practically in everyday workflows. This Databricks book not only offers detailed code explanations but also facilitates seamless code importation for practical use. You’ll discover how to leverage the open-source Databricks platform to enhance learning, boost skills, and elevate productivity with supplemental resources.
By the end of this book, you'll have mastered the use of Databricks for data science, machine learning, and generative AI, enabling you to deliver outstanding data products.
Page count: 303
Year of publication: 2024
Databricks ML in Action
Learn how Databricks supports the entire ML lifecycle end to end, from data ingestion to model deployment
Stephanie Rivera
Anastasia Prokaieva
Amanda Baker
Hayley Horn
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Ali Abidi
Publishing Product Manager: Sanjana Gupta
Content Development Editor: Priyanka Soam
Technical Editor: Kavyashree K S
Copy Editor: Safis Editing
Project Coordinator: Shambhavi Mishra
Proofreader: Priyanka Soam
Indexer: Rekha Nair
Production Designer: Jyoti Kadam
Marketing Coordinator: Nivedita Singh
First published: May 2024
Production reference: 2220724
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul's Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80056-489-3
www.packtpub.com
To the strong women who have come before me, for their sacrifices and for exemplifying the power of determination. To the memory of my grandmother, Hazel Adolph, for being my best friend and cheerleader.
– Stephanie Rivera
To the women who pursued a STEM career and did not give up, no matter what obstacles appeared on their way. Some would say science is not for girls; well, prove them wrong.
– Anastasia Prokaieva
To my mother, Mary Baker. Thank you for showing me what true strength is, for being both a voice of reason and unbridled support, and for believing in me no matter what.
– Amanda Baker
This is dedicated to the women who inspired me to lead by example, to my mom, Susan Charba, who reminded me that I could be the one who inspires, and to the women still working their way up. I’ll send the elevator back down. There is plenty of room for us all.
– Hayley Horn
Stephanie Rivera has worked in big data and machine learning since 2011. She collaborates with teams and companies as they design their data intelligence platform as a senior solutions architect for Databricks.
Previously, Stephanie was the VP of data intelligence for a global company, ingesting 20+ terabytes of data daily. She led the data science, data engineering, and business intelligence teams.
Her data career has also included contributing to and leading a team creating software that teaches people how to explore fictional planets using data science algorithms. Stephanie authored numerous sections of Booz Allen Hamilton’s publication The Field Guide to Data Science.
I want to thank my loving partner, Rami Alba Lucio, Databricks coworkers, family, and friends for their unwavering support.
Anastasia Prokaieva began her career 9 years ago as a research scientist at CEA (France), focusing on large-scale data analysis and satellite data assimilation, processing terabytes of data. She has been working within the big data analysis and machine learning domain since then. In 2021, she joined Databricks and became the regional AI subject matter expert.
On a daily basis, Anastasia consults Databricks users on best practices for implementing AI projects end to end. She also delivers training and workshops to democratize AI. Anastasia holds two MSc degrees in theoretical physics and energy science.
I would like to thank my partner, Julien, and my family for their tremendous support. My gratitude to my talented teammates all around the globe, as you inspire me every day!
Amanda Baker began her career in data 8 years ago. She loves leveraging her skills as a data scientist to orchestrate transformative journeys for companies across diverse industries as a solutions architect for Databricks. Her experiences have brought her from large corporations to small start-ups and everything in between. Amanda is a graduate of Carnegie Mellon University and the University of Washington.
Thank you to my partner, Emmanuel, my parents, sisters, and friends for their enduring love and support.
Hayley Horn started her data career 15 years ago as a data quality consultant on enterprise data integration projects. As a data scientist, she specialized in customer insights and strategy. Hayley has presented at data science and AI conferences in the US and Europe. She is currently a senior solutions architect for Databricks, with expertise in data science and technology modernization.
A graduate of the MS data science program at Southern Methodist University in Dallas, Texas, USA, she is now a capstone advisor to students in their final semesters of the program.
I’d like to thank my husband, Kevin, and my sons, Dyson and Dalton, for their encouragement and enthusiastic support.
Jeanne Choo studied plant biology and zoology at college, teaching herself how to code along the way in order to better understand how genes evolve over time. She was then hired by AI Singapore, a Singaporean government entity, as its first AI engineer. At AI Singapore, she was key in building the technical consulting practice, apprenticeship program, and technical stack from nothing. She set up the AI Apprenticeship Program, which won the Talent Accelerator Award at the Asia-Pacific IDC Digital Transformation Awards.
Most of Jeanne’s work has focused on challenges specific to the Asia-Pacific region. Some past projects include building ML pipelines for Japanese search query understanding, training speech recognition models for Singlish, and building NLP models and corpora for Indonesian. At Databricks, she advises customers on best practices related to all things MLOps and generative AI.
Benjamin Faircloth completed a Master of Science degree in Data Science at Northwestern University. Ben is a key figure in AI and machine learning within the public sector. As a Delivery Solutions Architect and Embedded Product Lead at Databricks, he has driven significant enhancements in service delivery and innovation across top accounts. Ben’s contributions to developing Databricks' delivery methodologies have streamlined processes and improved customer engagement. He led pivotal LLM hackathon initiatives with the Department of Defense, advancing AI capabilities crucial for national security. Throughout his career, Ben has witnessed the transformation from conceptual data architectures to the practical implementation of the lakehouse paradigm. At a significant moment in his career, as this book goes to press, his daughter Lydia is also entering the world, marking a personal milestone alongside his professional achievements. Committed to the mission, Ben strives to drive further innovation in applying advanced technologies to meet critical objectives.
Clever Anjos is a principal solutions architect at Qlik, a data analytics and data integration software company.
He has been working for Qlik since 2018 but has been around the Qlik ecosystem as a partner and customer since 2009. He is a business discovery professional with several years of experience working with Qlik, AWS, Google Cloud, Databricks, and other BI technologies.
He is a highly active member of Qlik Community, with over 8,000 posts and 4,500 page views.
In May 2022, he was named Qlik Community’s featured member.
Amreth Chandrasehar is an engineering leader in the cloud, AI/ML engineering, observability, and SRE. Over the last few years, Amreth has played a key role in cloud migration, generative AI, AIOps, observability, and ML adoption at various organizations. Amreth is also co-creator of the Conducktor Platform, serving T-Mobile’s over 100 million customers, and a tech/customer advisory board member at various companies on observability. Amreth has also co-created and open sourced Kardio.io, a service health dashboard tool. Amreth has been invited to and spoken at several key conferences and has won several awards.
I would like to thank my wife, Ashwinya, and my son, Athvik, for their patience and support provided during my review of this book.
Databricks subject matter experts, we want to extend our heartfelt gratitude to each of you who took the time to review the text and code. The speed at which Databricks evolves makes staying current an incredible feat. Your guidance and support throughout this journey have been invaluable. Your contributions have not only enhanced the technical accuracy and depth of our book but have also provided invaluable context and perspective rooted in your firsthand experiences at Databricks.
In this book, you will discover what makes the Databricks Data Intelligence Platform the go-to choice for top-tier machine learning solutions. Databricks ML in Action presents cloud-agnostic, end-to-end examples with hands-on illustrations of executing data science, machine learning, and generative AI projects on the Databricks Platform. You’ll develop expertise in Databricks’ managed MLflow, Vector Search, AutoML, Unity Catalog, and Model Serving as you learn to apply them practically in everyday workflows. This Databricks book not only offers detailed code explanations but also facilitates seamless code importation for practical use. You’ll discover how to leverage the open source Databricks platform to enhance your learning, boost your skills, and elevate your productivity with supplemental resources. By the end of this book, you’ll have mastered the use of Databricks for data science, machine learning, and generative AI, enabling you to deliver outstanding data products.
This book is for machine learning engineers, data scientists, and technical managers seeking hands-on expertise in implementing and leveraging the Databricks Data Intelligence Platform and its lakehouse architecture to create data products.
Chapter 1, Getting Started and Lakehouse Concepts, covers the different techniques and methods for data engineering and machine learning. The goal is not to unveil insights into data never seen before; if that were the case, this would be an academic paper. Instead, the goal of this chapter is to use open and free data to demonstrate advanced technology and best practices. The chapter also lists and describes each dataset used in the book.
Chapter 2, Designing Databricks: Day One, covers workspace design, model life cycle practices, naming conventions, what not to put in DBFS, and other preparatory topics. The Databricks platform is simple to use. However, there are many options available to cater to the different needs of different organizations. During my years as a contractor and my time at Databricks, I have seen teams succeed and fail. I will share with you the successful dynamics as well as any configurations that accompany those insights in this chapter.
Chapter 3, Building the Bronze Layer, begins your data journey in the Databricks DI Platform by exploring the fundamentals of the Bronze layer of the Medallion architecture. The Bronze layer is the first step in transforming your data for downstream projects, and this chapter will focus on the Databricks features and techniques you have available for the necessary transformations. We will start by introducing you to Auto Loader, a tool to automate data ingestion, which you can implement with or without Delta Live Tables (DLT) to insert and transform your data.
Chapter 4, Getting to Know Your Data, explores the features within the Databricks DI Platform that help improve and monitor data quality and facilitate data exploration. There are numerous approaches to getting to know your data better with Databricks. First, we cover how to oversee data quality with DLT to catch quality issues early and prevent the contamination of entire pipelines. We will take our first close look at Lakehouse Monitoring, which helps us analyze data changes over time and can alert us to changes that concern us.
Chapter 5, Feature Engineering on Databricks, progresses from Chapter 4, where we harnessed the power of Databricks to explore and refine our datasets, to delve into the components of Databricks that enable the next step – feature engineering. We will start by covering Databricks Feature Engineering (DFE) in Unity Catalog to show you how you can efficiently manage engineered features using Unity Catalog. Understanding how to leverage DFE in UC is crucial for creating reusable and consistent features across training and inference. Then, you will learn how to leverage Structured Streaming to calculate features on a stream, which allows you to create stateful features needed for models to make quick decisions.
Chapter 6, Tools for Model Training and Experimenting, examines how to use data science to search for a signal hidden in the noise of data. We will leverage the features we created within the Databricks platform during the previous chapter. We will start by using AutoML in a basic modeling approach, providing auto-generated code and quickly enabling data scientists to establish a baseline model to beat. When searching for a signal, we experiment with different features, hyperparameters, and models. Historically, tracking these configurations and their corresponding evaluation metrics is a time-consuming project in and of itself. A low-overhead tracking mechanism, such as the tracking provided by MLflow, an open source platform for managing data science projects and supporting MLOps, will reduce the burden of manually capturing configurations. More specifically, we’ll introduce MLflow Tracking, an MLflow component that significantly improves tracking each permutation’s many outputs. However, that is only the beginning.
Chapter 7, Productionizing ML on Databricks, explores productionizing a machine learning model using Databricks products, which makes the journey more straightforward and cohesive by incorporating functionality such as the Unity Catalog Registry, Databricks Workflows, Databricks Asset Bundles, and Model Serving capabilities. This chapter will cover the tools and practices to take your models from development to production.
Chapter 8, Monitoring, Evaluating, and More, covers how to create visualizations for dashboards in both the new Lakeview dashboards and the standard DBSQL dashboards. Deployed models can be shared via a web application. Therefore, we will not only introduce Hugging Face Spaces but also deploy the RAG chatbot using a Gradio app to apply what we have learned.
Software/hardware covered in the book
Operating system requirements
Databricks
Windows, macOS, or Linux
Python and its associated libraries
Windows, macOS, or Linux
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
This book contains a few long screenshots, captured to show an overview of workflows as well as the UI. As a result, the content in these images may appear small at 100% zoom. Please check out the PDF copy provided with the book and zoom in for clearer images.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Databricks-ML-In-Action. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “For example, you could select the ml_in_action.favorita_forecasting.train_set table.”
A block of code is set as follows:

import opendatasets as od
od.download("https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data", raw_data_path)
dbutils.fs.ls(raw_data_path + "/store-sales-time-series-forecasting/")

Any command line input or output is written as follows:

/usr/local/bin/databricks configure

Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Once you have a dataset, return to the Canvas tab.”
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Databricks ML in Action, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:
https://packt.link/free-ebook/978-1-80056-489-3
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits to your email directly.
The goal of this part is not to unveil insights into data never seen before. If that were the case, this would be an academic paper. Instead, the goal is to use open and free data to demonstrate advanced technology and best practices. This part will list and describe each dataset present in the book. It also introduces you to the successful dynamics as well as any configurations that accompany the insights in this part. This part covers workspace design, model life cycle practices, naming conventions, what not to put in DBFS, and other preparatory topics.
This part has the following chapters:
Chapter 1, Getting Started and Lakehouse Concepts
Chapter 2, Designing Databricks: Day One
Chapter 3, Building the Bronze Layer
“Give me six hours to chop down a tree, and I will spend the first four sharpening the axe.”
– Abraham Lincoln
We will start with a basic overview of the Databricks Data Intelligence Platform (DI Platform), an open platform built on a lakehouse architecture, and the advantages this provides when developing machine learning (ML) applications. For brevity, we will use terms such as Data Intelligence Platform and Databricks interchangeably throughout the book. This chapter will introduce the different projects and associated datasets we’ll use throughout the book. Each project intentionally highlights a function or component of the DI Platform. Use the example projects as hands-on lessons for each platform element we cover. We progress through these projects in the last section of each chapter – namely, Applying our learning.
Here is what you will learn in this chapter:
Components of the Data Intelligence Platform
Advantages of the Databricks Platform
Applying our learning
The Data Intelligence Platform allows your entire organization to leverage your data and AI. It’s built on a lakehouse architecture to provide an open, unified foundation for all data and governance layers. It is powered by a Data Intelligence Engine, which understands the context of your data. For practical purposes, let’s talk about the components of the Databricks Data Intelligence Platform:
Figure 1.1 – The components of the Databricks Data Intelligence Platform
Let’s check out the following list with the descriptions of the items in the figure:
Delta Lake: The data layout within the Data Intelligence Platform is automatically optimized based on common data usage patterns
Unity Catalog: A unified governance model to secure, manage, and share your data assets
Data Intelligence Engine: This uses AI to enhance the platform’s capabilities
Databricks AI: ML tools to support end-to-end ML solutions and generative AI capabilities, including creating, tuning, and serving LLMs
Delta Live Tables: Enables automated data ingestion and data quality
Workflows: A fully integrated orchestration service to automate, manage, and monitor multi-task workloads, queries, and pipelines
Databricks SQL (DBSQL): An SQL-first interface, similar to how you would interact with a data warehouse, with functionality such as text-to-SQL, which lets you use natural language to generate queries
Now that we have our elements defined, let’s discuss how they help us achieve our ML goals.
Databricks’ implementation of a lakehouse architecture is unique. Databricks’ foundation is built on a Delta-formatted data lake that Unity Catalog governs. Therefore, it combines a data lake’s scalability and cost-effectiveness with a data warehouse’s governance. This means not only are table-level permissions managed through access control lists (ACLs) but file and object-level access are also regulated. This change in architecture from a data lake and/or a data warehouse to a unified platform is ideal – a lakehouse facilitates a wide range of new use cases for analytics, business intelligence, and data science projects across an organization. See the Introduction to Data Lakes blog post in the Further reading section for more information on lakehouse benefits.
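As a small, hedged illustration of that table-level governance, here is what granting and reviewing access on a Unity Catalog table can look like; the table and group names reuse or extend the examples in this book and are only illustrative:

# Grant read access on a table to a group (table-level ACL)
spark.sql("GRANT SELECT ON TABLE ml_in_action.favorita_forecasting.train_set TO `data-science-team`")

# Review who currently has access to the table
spark.sql("SHOW GRANTS ON TABLE ml_in_action.favorita_forecasting.train_set").show()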
This section will discuss the importance of open source frameworks and two critical advantages they provide – transparency and flexibility.
How open source features relate to the Data Intelligence Platform is unique. This uniqueness lies in the concepts of openness and transparency, often referred to as the “glass box” approach by Databricks. It means that when you use the platform to create assets, there’s no inscrutable black box that forces you to depend on a specific vendor for usage, understanding, or storage. A genuinely open lakehouse architecture uses open data file formats to make accessing, sharing, and removing your data simple. Databricks has optimized the managed version of Apache Spark to leverage the open data format Delta (which we’ll cover in more detail shortly). This is one of the reasons why the Delta format is ideal for most use cases. However, nothing stops you from using something such as the CSV or Parquet format. Furthermore, Databricks introduced Delta Lake Universal Format (Delta Lake UniForm) to easily integrate with other file formats such as Iceberg or Hudi. For more details, check out the Further reading section at the end of this chapter.
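To make that flexibility concrete, here is a minimal sketch, assuming a Spark DataFrame named df and a raw_data_path variable like the one used later in the book; the same data can be persisted as Delta, Parquet, or CSV:

# Write the same DataFrame in whichever open format suits the use case
df.write.format("delta").mode("overwrite").save(raw_data_path + "/sales_delta")
df.write.format("parquet").mode("overwrite").save(raw_data_path + "/sales_parquet")
df.write.format("csv").option("header", "true").mode("overwrite").save(raw_data_path + "/sales_csv")

Delta remains the default recommendation on Databricks because of the transaction log and optimizations discussed later in this chapter.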
Figure 1.2 illustrates the coming together of data formats with UniForm.
Figure 1.2 – Delta Lake UniForm makes consuming Hudi and Iceberg file formats as easy as consuming Delta
The ability to use third-party and open source software fuels rapid innovation. New advances in data processing and ML can be quickly tested and integrated into your workflow. In contrast, proprietary systems often have longer wait times for vendors to incorporate updates. Waiting on a vendor to adopt an open source innovation may sound like a rare problem, but it is the rule rather than the exception. This is especially true for data science. The speed of software and algorithmic advances is incredible. Evidence of this frantic pace of innovation can be seen daily on the Hugging Face community website. Developers share libraries and models on Hugging Face; hundreds of libraries are updated daily on the site alone.
Delta, Spark, the Pandas API on Spark (see Figure 1.3), and MLflow are notable examples of consistent innovation, largely driven by their transparency as open source projects. We mention these specifically because they were all initially created by either the founders of Databricks or company members following its formation.
ML developers benefit significantly from this transparency, as it provides them with unparalleled flexibility, easy integration, and robust support from the open source community – all without the overhead of maintaining an open source full stack.
Starting development as a contractor using Databricks is super-fast compared to when companies require a fresh development environment to be set up. Some companies require a service request to install Python libraries. This can be a productivity killer for data scientists. In Databricks, many of your favorite libraries are pre-installed and ready to use, and of course, you can easily install your own libraries as well.
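For instance, a library can be installed for a single notebook session with one cell; the library below is just an example (it is the one used for the Kaggle downloads later in the book):

# Databricks notebook magic: installs the library for this notebook session only
%pip install opendatasets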
Additionally, there is a large and vibrant community of Databricks users. The Databricks community website is an excellent resource to ask and answer questions about anything related to Databricks. We’ve included a link in the Further reading section at the end of this chapter.
Figure 1.3 – The pandas API on Spark
The pandas API on Spark has nearly identical syntax to standard pandas, making distributed computing with Spark easier to learn for those who have written pandas code in Python.
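As a rough sketch of that similarity, assuming the raw_data_path variable used later in the book and the family and sales columns from the Kaggle Favorita training file:

import pyspark.pandas as ps

# pandas-like syntax, but the work is distributed by Spark under the hood
psdf = ps.read_csv(raw_data_path + "/store-sales-time-series-forecasting/train.csv")
psdf.groupby("family")["sales"].mean().head()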
While continuing with a focus on transparency, let’s move on to Databricks AutoML.
Databricks refers to its AutoML solution as a glass box. This terminology highlights the fact that there is nothing hidden from the user. This feature in the Data Intelligence Platform leverages an open source library, Hyperopt, in conjunction with Spark for hyperparameter tuning. It intelligently explores different model types in addition to optimizing the parameters in a distributed fashion. The use of Hyperopt allows each run within the AutoML experiment to inform the next run, reducing the overall number of runs needed to reach an optimal solution compared to a grid search. Each run in the experiment has an associated notebook with the code for the model. This method increases productivity, reduces unnecessary computing, and lets scientists perform experiments instead of writing boilerplate code. Once AutoML has converged on the algorithmically optimal solution, there is a “best notebook” for the best scoring model. We’ll expand on AutoML in several chapters throughout this book.
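A minimal sketch of launching such an experiment with the AutoML Python API follows; the training DataFrame and target column name are assumptions for illustration:

from databricks import automl

# Glass-box AutoML: every trial is logged to MLflow with an editable, generated notebook
summary = automl.classify(
    dataset=train_df,        # assumed DataFrame of training data
    target_col="label",      # assumed name of the column to predict
    timeout_minutes=30,
)

# Points at the generated notebook for the best scoring model
print(summary.best_trial.notebook_url)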
For data scientists, transparency is especially important. We do not trust black box models. How do you use them without understanding them? A model is only as good as the data going in. Beyond the issue of trust, black boxes also raise concerns about the reproducibility of our research and the explainability of model drivers.
When we create a model, who does it belong to? Can we get access to it? Can we tweak, test, and, most importantly, reuse it? The amount of time put into the model’s creation is not negligible. Databricks AutoML gives you everything to explain, reproduce, and reuse the models it creates. In fact, you can take the model code or model object and run it on a laptop or wherever. This open source, glass-box, reproducible, and reusable methodology is our kind of open.
Flexibility is also an essential aspect of the Databricks platform, so let’s dive into the file format Delta, an open source project that makes it easy to adapt to many different use cases. For those familiar with Parquet, you can think of Delta as Parquet-plus – Delta files are Parquet files with a transaction log. The transaction log is a game changer. The increased reliability and optimizations make Delta the foundation of Databricks’ lakehouse architecture. The data lake side of the lakehouse is vital to data science, streaming, and unstructured and semi-structured data formats. Delta has also made the warehouse side possible. There are entire books on Delta; see the Further reading section for some examples. We are focusing on the fact that it is an open file format with key features that support building data products.
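As a quick, hedged illustration of that transaction log, you can inspect a Delta table’s history directly; the table name reuses the example table referenced earlier in this book:

# Each row is one transaction recorded in the Delta log
history = spark.sql("DESCRIBE HISTORY ml_in_action.favorita_forecasting.train_set")
history.select("version", "timestamp", "operation").show()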
Having an open file format is essential to maintain ownership of your data. Not only do you want to be able to read, alter, and open your data files, but you also want to keep them in your cloud tenant. Maintaining control over your data is possible in the Databricks Data Intelligence Platform. There is no need to put the data files into a proprietary format or lock them away in a vendor’s cloud. Take a look at Figure 1.4 to see how Delta is part of the larger ecosystem.
Figure 1.4 – The Delta Kernel connection ecosystem
The Delta Kernel introduces a fresh approach, offering streamlined, focused, and reliable APIs that abstract away the intricacies of the Delta protocol. By simply updating the Kernel version, connector developers can seamlessly access the latest Delta features without needing to modify any code.
The freedom and flexibility of open file formats make it possible to integrate with new and existing external tooling. Delta Lake, in particular, offers unique support to create data products thanks to features such as time-travel versioning, exceptional speed, and the ability to update and merge changes. Time travel, in this context, refers to the capability of querying different versions of your data table, allowing you to revisit the state of the table before your most recent changes or transformations (see Figure 1.5). The more obvious use is to back up after making a mistake rather than writing out multiple copies of the table as a safety measure. A possibly less obvious use for time travel is reproducible research. You can access the data your model was trained on in the previous week without creating an additional copy of the data. Throughout the book, we will detail features of the Data Intelligence Platform you can use to facilitate reproducible research. The following figure shows you how the previous version of a table, relative to a timestamp or a version number, can be queried.
Figure 1.5 – A code example of the querying techniques to view previous versions of a table
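For readers following along without the figure, here is a minimal sketch of both techniques; the version number and timestamp are illustrative:

# Query the table as of a specific version number
df_v1 = spark.sql("SELECT * FROM ml_in_action.favorita_forecasting.train_set VERSION AS OF 1")

# Query the table as it looked at a point in time, for example last week's training data
df_last_week = spark.sql("SELECT * FROM ml_in_action.favorita_forecasting.train_set TIMESTAMP AS OF '2024-01-15'")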
Next, let us discuss the speed of Databricks’ lakehouse architecture. In November 2021, Databricks set a new world record for the gold standard performance benchmark for data warehousing. The Barcelona Supercomputing Center shared research supporting this finding. This record-breaking speed resulted from Databricks’ engines (Spark and Photon) paired with Delta (see the Databricks Sets Official Data Warehousing Performance Record link in the Further reading section).
Delta’s impressive features include change data feed (CDF), change data capture (CDC), and schema evolution. Each plays a specific role in data transformation in support of ML.
Starting with Delta’s CDF capability, it is exactly what it sounds like – a feed of the changed data. Let’s say you have a model looking for fraud, and that model needs to know how many transaction requests have occurred in the last 10 minutes. It is not feasible to rewrite the entire table each time a value for an account needs to be updated. The feature value, or in this case, the number of transactions that occurred in the last 10 minutes, needs to be updated only when the value has changed. The use of CDF in this example enables updates to be passed to an online feature store; see Chapters 5 and 6 for more details.
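Here is a minimal sketch of that pattern, assuming a hypothetical transactions table; CDF must be enabled on the table, after which you can read only the rows that changed instead of rescanning everything:

# Enable the change data feed on an existing Delta table (hypothetical table name)
spark.sql("""
    ALTER TABLE ml_in_action.examples.transactions
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only the changes since a given table version
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("ml_in_action.examples.transactions")
)
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()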
Finally, let’s talk about change data capture, a game-changer in the world of data management. Unlike traditional filesystems, CDC in Delta has been purposefully designed to handle data updates efficiently. Let’s take a closer look at CDC and explore its capabilities through two practical scenarios:
Scenario 1 – effortless record updates: Picture a scenario involving Rami, one of your customers. He initially made a purchase in Wisconsin but later relocated to Colorado, where he continued to make purchases. In your records, it’s essential to reflect Rami’s new address in Colorado. Here’s where Delta’s CDC shines. It effortlessly updates Rami’s customer record without treating him as a new customer. CDC excels at capturing and applying updates seamlessly, ensuring data integrity without any hassles.
Scenario 2 – adapting to evolving data sources: Now, consider a situation where your data source experiences unexpected changes, resulting in adding a new column containing information about your customers. Let’s say this new column provides insights into the colors of items purchased by customers. This is valuable data that you wouldn’t want to lose. Delta’s CDC, combined with its schema evolution feature, comes to the rescue.
Schema evolution, explored in depth in Chapter 3, enables Delta to gracefully adapt to schema changes without causing any disruptions. When dealing with a new data column, Delta smoothly incorporates this information, ensuring your data remains up to date while retaining its full historical context. This ensures that you can leverage valuable insights for both present and future analyses.
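To make the two scenarios concrete, here is a hedged sketch; the table names, join key, and incoming DataFrames (updates_df and new_purchases_df) are assumptions for illustration:

from delta.tables import DeltaTable

# Scenario 1: upsert changed customer records instead of rewriting the table
customers = DeltaTable.forName(spark, "ml_in_action.examples.customers")
(
    customers.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()      # Rami's existing row is updated in place (WI -> CO)
    .whenNotMatchedInsertAll()   # genuinely new customers are inserted
    .execute()
)

# Scenario 2: append data arriving with a new column and let the schema evolve
(
    new_purchases_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")  # the new item_color column is added to the table schema
    .saveAsTable("ml_in_action.examples.purchases")
)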
This book is heavily project-based. Each chapter starts with an overview of the important concepts and Data Intelligence Platform features that will prepare you for the main event – the Applying our learning sections. Every Applying our learning section has a Technical requirements
