Description

In the age of information, strategic management of data is critical to organizational success. The constant challenge lies in maintaining data accuracy and preventing data pipelines from breaking. Data Observability for Data Engineering is your definitive guide to implementing data observability successfully in your organization.
This book unveils the power of data observability, a fusion of techniques and methods that allow you to monitor and validate the health of your data. You’ll see how it builds on data quality monitoring and understand its significance from the data engineering perspective. Once you're familiar with the techniques and elements of data observability, you'll get hands-on with a practical Python project to reinforce what you've learned. Toward the end of the book, you’ll apply your expertise to explore diverse use cases and experiment with projects to seamlessly implement data observability in your organization.
Equipped with the mastery of data observability intricacies, you’ll be able to make your organization future-ready and resilient and never worry about the quality of your data pipelines again.




Data Observability for Data Engineering

Proactive strategies for ensuring data accuracy and addressing broken data pipelines

Michele Pinto

Sammy El Khammal

BIRMINGHAM—MUMBAI

Data Observability for Data Engineering

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Reshma Raman

Publishing Product Manager: Heramb Bhavsar

Content Development Editor: Joseph Sunil

Technical Editor: Devanshi Ayare

Copy Editor: Safis Editing

Project Coordinator: Shambhavi Mishra

Proofreader: Safis Editing

Indexer: Subalakshmi Govindhan

Production Designer: Prafulla Nikalje

Marketing Coordinator: Vinishka Kalra

First published: December 2023

Production reference: 1151223

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN 978-1-80461-602-4

www.packtpub.com

Contributors

About the authors

Michele Pinto is the Chief Technology Officer at Sustainable Brand Platform. With over 15 years of experience, Michele has a keen sense of how data observability and data engineering are closely linked. He started his career as a software engineer and has worked since then in various positions such as Big Data Engineer, Big Data Architect, Head of Data and, in recent years, in the role of Head of Engineering as well as CTO. He strongly believes in hard work and teamwork and loves to create the best possible conditions for his teams to work in an environment that inspires and motivates them to perform at their best.

To Sara, Giulia and Pierino for the time I took from you while working on this book and for the love you give me every day. To my mother Antonia and my father Giuseppe, who have believed in me for more than 40 years.

Thanks to Andy Petrella, Kensu and the Data Owls team for the great challenges and good times we had together, which contributed to the content and soul of this book.

Sammy El Khammal is a Product Manager at Kensu. After studying business across three continents at Solvay, Monash and Waseda, he dove headfirst into the data realm. Over the past five years, he evolved from customer success to Product Management, also making his mark as a public speaker at O’Reilly. His debut book is a testament to this journey, aiming to demystify Data Observability for a broader audience. Beyond his professional pursuits, Sammy is a guitar aficionado and a triathlon aspirant, showcasing a life enriched by varied interests and continuous learning.

I would like to extend my heartfelt thanks to those who have stood by me and offered their unwavering support during this journey. A special acknowledgment goes to my parents, Martine and Abdelmajid, whose guidance and love have shaped the person I am today. Additionally, I am immensely grateful to my friends, whose companionship and encouragement have been invaluable. Their belief in me and my work has been a constant source of motivation.

About the reviewers

Andy Rabone is an experienced Data Engineering manager, having worked in the fields of data warehousing, data engineering and business intelligence across healthcare, legal, and financial services for the past 15 years. He is currently engineering lead for a critical regulatory transformation project for a large FTSE 100 company, and passionately promotes the value of data observability and data quality in his work. He lives in South Wales with his wife and son, and enjoys indulging his son’s early fascination with robotics, following US sports, and playing guitar every now and then.

Sunil Mandowara is a Lead Data Analytics Architect who helps organizations achieve their data analytics and AI vision by building scalable, observable, and secure data analytics platforms. He has 20 years of hands-on experience carrying out 15+ complex data platform implementations across various domains, using a variety of technologies, including distributed data processing, big data, REST, data lakes, data security, cloud, SaaS, and microservices.

Across industries, he has set up data teams from the ground up. He has also implemented data governance and quality practices, which significantly boosted team productivity and data value.

I want to thank my twins, Manan and Mahi, for allowing me to spend time reviewing this excellent book.

Table of Contents

Preface

Part 1: Introduction to Data Observability

1

Fundamentals of Data Quality Monitoring

Learning about the maturity path of data in companies

Identifying information bias in data

Data producers

Data consumers

The relationship between producers and consumers

Asymmetric information among stakeholders

Exploring the seven dimensions of data quality

Accuracy

Completeness

Consistency

Conformity

Integrity

Timeliness

Uniqueness

Consequences of data quality issues

Turning data quality into SLAs

An agreement as a starting point

The incumbent responsibilities of producers

Considerations for SLOs and SLAs

Indicators of data quality

Data source metadata

Schema

Lineage

Application

Statistics and KPIs

Examples of SLAs, SLOs, and SLIs

Alerting on data quality issues

Using indicators to create rules

The data scorecard

Summary

2

Fundamentals of Data Observability

Technical requirements

From data quality monitoring to data observability

Three principles of data observability

Data observability in IT observability

Key components of data observability

The contract between the application owner and the marketing team

Observing a timeliness issue

Observing a completeness issue

Observing a change in data distribution

Data observability in the enterprise ecosystem

Measuring the return on investment – defining the goals

Summary

Part 2: Implementing Data Observability

3

Data Observability Techniques

Analyzing the data

Monitoring data asynchronously

Monitoring data synchronously

Analyzing the application

The anatomy of an external analyzer

Pros and cons of the application analyzer method

Advantages

Disadvantages

Principles of monkey patching for data observability

Wrapping the function

Consolidating the findings

Pros and cons of the monkey patching method

Advanced techniques for data observability – distributed tracing

Summary

4

Data Observability Elements

Technical requirements

Prerequisites and installation requirements

Kensu – a data observability framework

kensu-py – an overview of the monkey patching technique

Static and dynamic elements

Defining the data observability context

Application or process

Code base

Code version

Project

Environment

User

Timestamp

The application run

Getting the metadata of the data sources

Data source

Schema

Mastering lineage

Types of lineage and dependencies

Lineage run

What’s in the log?

Computing observability metrics

What’s in the log?

Data observability for AI models

Model method

Model training

Model metrics

What’s in the log?

The feedback loop in data observability

Summary

5

Defining Rules on Indicators

Technical requirements

Determining SLOs

Project versus data source SLOs

Use case

Turning SLOs into rules

Different types of rules

Implementation of the rules

Project – continuous validation of the data

Concepts of CI/CD

Deploying the rules in a CI/CD pipeline

Summary

Part 3: How to Adopt Data Observability in Your Organization

6

Root Cause Analysis

Data incident management

Detecting the issue

Impact analysis

Root cause analysis

Troubleshooting

Preventing further issues

Applying the method – a practical example

Anomaly detection

Simple indicator deterministic cases

Multiple indicators deterministic cases

Time series analysis

Case study

Summary

7

Optimizing Data Pipelines

Concepts of data pipelines and data architecture

What is a data pipeline?

Defining the types of data pipelines

The properties of a data pipeline

Rationalizing the costs

Data pipeline costs

Using data observability to rationalize costs

Summary

8

Organizing Data Teams and Measuring the Success of Data Observability

Defining and understanding data teams

The roles of a data team

Organizing a data team

Data mesh, data quality, and data observability – a virtuous circle

Data mesh

Building the virtuous circle

The first steps toward data observability and how to measure success

Measuring success

Summary

Part 4: Appendix

9

Data Observability Checklist

Challenges of implementing data observability

Costs

Overhead

Security

Complexity increase

Legacy system

Information overload

Checklist to implement data observability

Start with the right data or application

Choosing the right data observability tool

Selecting the metrics to follow

Compute the return on investment

Scaling with data observability

Summary

10

Pathway to Data Observability

Technical roadmap to include data observability

Allocating the right resources to your data observability project

Defining clear objectives with the team

Choosing a data pipeline

Setting success criteria with the team and stakeholders

Implementing data observability in applications

Continuously improving observability

Scaling data observability

Using observability for data catalogs

Using observability to ensure ML and AI reliability

Using observability to complete a data quality management program

Implementing data observability in a project

Resources and the first pipeline

Success criteria for PetCie’s implementation

The implementation phase at PetCie

Continuously improving observability at PetCie

Deploying observability at scale at PetCie

Outcomes

Summary

Index

Other Books You May Enjoy

Part 1: Introduction to Data Observability

In this section, we introduce data quality fundamentals, including key metrics and their application in Service Level Agreements to build trust in data pipelines. We then explore data observability, enhancing data quality monitoring with real-time insights for more effective management of data systems.

This part has the following chapters:

Chapter 1, Fundamentals of Data Quality Monitoring
Chapter 2, Fundamentals of Data Observability

1

Fundamentals of Data Quality Monitoring

Welcome to the exciting world of Data Observability for Data Engineering!

As you open the pages of this book, you will embark on a journey that will immerse you in data observability. The knowledge within this book is designed to equip you, as a data engineer, data architect, data product owner, or data engineering manager, with the skills and tools necessary to implement best practices in your data pipelines.

In this book, you will learn how data observability can help you build trust in your organization. Observability provides insights directly from within the process, offering a fresh approach to monitoring. It’s a method for determining whether the pipeline is functioning properly, especially in terms of adhering to its data quality standards.

Let’s get real for a moment. In our world, where we’re swimming in data, it’s easy to feel like we’re drowning. Data observability isn’t just some fancy term – it’s your life raft. Without it, you’re flying blind, making decisions based on guesswork. Who wants to be in that hot seat when data disasters strike? Not you.

This book isn’t just another item on your reading list; it’s the missing piece in your data puzzle. It’s about giving you the superpower to spot the small issues in your data before they turn into full-blown catastrophes. Think about the cost, not just in dollars, but in sleepless nights and lost trust, when data incidents occur. Scary, right?

But here’s the kicker: data observability isn’t just about avoiding nightmares; it’s about building a foundation of trust. When your data’s in check, your team can make bold, confident decisions without that nagging doubt. That’s priceless.

Data observability is not just a buzzword – we are deeply convinced it is the backbone of any resilient, efficient, and reliable data pipeline. This book will take you on a comprehensive exploration of the core principles of data observability, the techniques you can use to develop an observability approach, the challenges faced when implementing it, and the best practices being employed by industry leaders. This book will be your compass in the vast universe of data observability by providing you with various examples that allow you to bridge the gap between theory and practice.

The knowledge in this book is organized into four essential parts. In part one, we will lay the foundation by introducing the fundamentals of data quality monitoring and how data observability takes it to the next level. This crucial groundwork will ensure you understand the core concepts and will set the stage for the next topics.

In part two, we will move on to the practical aspects of implementing data observability. You will dive into various techniques and elements of observability and learn how to define rules on indicators. This part will provide you with the skills to apply data observability in your projects.

The third part will focus on adopting data observability at scale in your organization. You will discover the main benefits of data observability by learning how to conduct root cause analysis, how to optimize pipelines, and how to foster a culture change within your team. This part is essential to ensure the successful implementation of a data observability program.

Finally, the fourth part will contain additional resources focused on data engineering, such as a data observability checklist and a technical roadmap to implement it, leaving you with strong takeaways so that you can stand on your own two feet.

Let’s start with a hypothetical scenario. You are a data engineer, coming back from your holidays and ready to start the quarter. You have a lot of new projects for the year. However, the second you reach your desktop, Lucy from the marketing team calls out to you: “The marketing report of last month is totally wrong – please fix it ASAP. I need to update my presentation!”

This is annoying; all the work that’s been scheduled for the day is delayed, and you need to check the numbers. You open your Tableau dashboard and start a Zoom meeting with the marketing team. The first task of the day: understand what she meant by wrong. Indeed, the turnover seems odd. It’s time for you to have a look at the SQL database feeding the dashboard. Again, you see the same issue. This is strange and will require even more investigation.

After hours of manual and tedious checks, contacting three different teams, and sending 12 emails, you finally find the culprit: an ingestion script feeding the company’s master database was modified to express the turnover in thousands of dollars instead of units. Because the data team didn’t know that the metric would be used by the marketing team, the change was never communicated and the pipeline was fed with the wrong data.

It’s not the first time this has happened. Hours of productivity are ruined by firefighting data issues. It’s decided – you need to implement a new strategy to avoid this.
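To see what a first line of defense could look like, here is a minimal sketch of an automated check that would have flagged this incident long before 12 emails were sent. It assumes the monthly turnover lands in a pandas DataFrame and that a historical baseline is available; the column name, threshold, and numbers are hypothetical and not taken from this book’s project.

import pandas as pd

def check_turnover_scale(current: pd.DataFrame, baseline_mean: float,
                         tolerance: float = 10.0) -> None:
    # Flag a sudden shift in scale, such as a silent switch from dollars
    # to thousands of dollars in an upstream ingestion script.
    current_mean = current["turnover"].mean()
    ratio = current_mean / baseline_mean
    if ratio > tolerance or ratio < 1 / tolerance:
        raise ValueError(
            f"Turnover scale changed by a factor of {ratio:.4f}: "
            "did an upstream script change its units?"
        )

# Hypothetical usage: the baseline mean is about 52,000 dollars, but the
# latest load silently switched to thousands of dollars.
this_month = pd.DataFrame({"turnover": [51.2, 48.9, 55.3]})
check_turnover_scale(this_month, baseline_mean=52_000)  # raises ValueError

This kind of check is exactly what data quality monitoring, and later data observability, formalizes: instead of discovering the issue through an angry phone call, the pipeline itself reports that something has drifted.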

Observability is closely tied to the notion of data quality, which is often defined as a way of measuring data indicators. Data quality is one thing, but monitoring it is something else! Throughout this chapter, we will explore the principles of data quality, understand how they can guide you on the data observability journey, and see why the information bias between stakeholders is key to understanding the need for data quality and observability in the data pipeline.

Data quality comes from the need to ensure correct and sustainable data pipelines. We will look at the different stakeholders of a data pipeline and describe why they need data quality. We will also define data quality through several concepts, which will help you understand how a common base can be created between stakeholders.

By the end of this chapter, you will understand how data quality can be monitored and turned into metrics, preparing the ground for data observability.

In this chapter, we’ll cover the following topics:

Learning about the maturity path of data in companies
Identifying information bias in data
Exploring the seven dimensions of data quality
Turning data quality into SLAs
Indicators of data quality
Alerting on data quality issues

Learning about the maturity path of data in companies

The relationship between companies and data started a long time ago, at least at the end of the 1980s, when computers first spread widely through offices. As computers and data became more and more widespread in the following years, corporate data usage went through a very long period, of at least two decades, during which investments in data grew, but only linearly. We cannot speak of a data winter, but we can consider it a long wait for the spring that led to the explosion of data investments we have experienced since the second half of the 2000s. That long wait was brought to an end by at least three fundamental factors:

The collapse in the cost of the resources needed to store and process data (memory and CPUs)
The advent of IoT devices, widespread access to the internet, and the subsequent tsunami of available data
The diffusion and accessibility of relatively simple yet advanced technologies dedicated to processing large amounts of data, such as Spark, Delta Lake, NoSQL databases, Hive, and Kafka

When these three fundamental pillars became accessible, the most attentive companies embarked on a complex path in the world of data, a maturity path that is still ongoing today, with several phases, each with its challenges and problems:

Figure 1.1 – An example of the data maturity path

Each company started this path differently, but usually, the first problem to solve was managing the continuously growing availability of data coming from increasingly popular applications, such as e-commerce websites, social platforms, the gaming industry, and mobile apps. The solution to these problems has been to invest in small teams of software engineers who experimented with the use and integration of big data technologies and platforms, among them Hadoop, with its main components HDFS, MapReduce, and YARN, responsible respectively for storing enormous volumes of data, processing them, and managing resources, all in a distributed system. More recent technologies, such as Spark, Flink, Kafka, NoSQL, and Parquet, provided a further boost to this process. These software engineers were unaware that they were the first generation of a new role that is now one of the most popular and in-demand roles in software engineering – the data engineer.

These early teams were often seen as research and development teams, and expectations of them grew with increasing investment. So, the next step was to ask how these teams could express their potential. Consequently, the step after that was to invest in an analytics team that could work alongside, or as a consumer of, the data engineering team. The natural way to start extracting value from data was the adoption of advanced analytics and the introduction of techniques and solutions based on machine learning. Companies then began to develop a corporate data culture and appreciate the great potential and competitiveness that data could provide. Whether they realized it or not, they were becoming data-driven companies, or at least data-informed ones; in the meantime, data began to be taken seriously – as a real asset and a critical component, not just a mysterious box from which to extract an insight only when strictly necessary.

The first results and the constant growth of the available data triggered a real race that pushed companies to invest more and more in personnel and data technologies. This led to the proliferation of new roles (data product manager, data architect, machine learning engineer, and so on) and an explosion of data experts in the company, which in turn created new and unexplored organizational problems. The centralized data team model revealed all its limits in terms of scalability and its lack of readiness to support the real problems of the business. Therefore, the process of decentralizing these data experts began and, in addition to solving these issues, introduced new challenges, such as the need to adopt data governance processes and methodologies. Consequently, with this decentralization, with data becoming more and more central to the company, and with the need to strengthen data quality skills, what was merely important yesterday is becoming a priority today: governing and monitoring the quality of data.

The spread of data and data teams across companies has strengthened their data culture. Interactions between decentralized actors are increasingly governed by contracts that the various teams establish among themselves. Data is no longer seen as an unreliable, obscure object to be relied on only when necessary. Each team works daily with data, and data is now a real product that must comply with quality standards on par with any other product generated in the company. The quality of data is of extreme importance; it is no longer one problem among many, it is the problem.

In this section, we learned about the data maturity path that many companies are facing and understood the reasons that are pushing companies to invest more and more in data quality.

In the next section, we will understand how to identify information bias in data, introduce the roles of data producers and data consumers, and cover the expectations and responsibilities of these two actors toward data quality.

Identifying information bias in data

Let’s talk about a sneaky problem in the world of data: information bias. This bias arises from a misalignment between data producers and consumers. When the expectations and understanding of data quality are not in sync, information bias manifests, distorting the data’s reliability and integrity. This section will unpack the concept of information bias in the context of data quality, exploring how discrepancies in producers’ and consumers’ perspectives can skew the data landscape. By delving into the roles, relationships, and responsibilities of these key stakeholders, we’ll shed light on the technical intricacies that underpin a successful data-driven ecosystem.

Data is a primary asset of a company’s intelligence. It allows companies to get insights, drive projects, and generate value. At the genesis of all data-driven projects, there is a business need:

Creating a sales report to evaluate top-performing employees
Evaluating the churn of young customers to optimize marketing efforts
Forecasting tire sales to avoid overstocking

These projects rely on a data pipeline, a succession of applications that manipulate raw data to create the final output, often in the form of a report:

Figure 1.2 – Example of a data pipeline

In each case, the produced data serves the interests of a consumer, which can be, among others, a manager, an analyst, or a decision-maker. In a data pipeline, the applications or processes, such as Flink or Power BI in Figure 1.2, consume and produce data sources, such as JSON files or SQL databases.
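To make this concrete, here is a minimal, hypothetical sketch of a single pipeline step written in Python: it consumes one data source (a raw JSON file) and produces another (a SQL table) for downstream applications such as a reporting tool. The file path, table, and column names are purely illustrative and are not the applications shown in Figure 1.2.

import pandas as pd
from sqlalchemy import create_engine

def run_ingestion_step(raw_path: str, db_url: str) -> None:
    # Consume: read the raw orders exported by an upstream application.
    orders = pd.read_json(raw_path)

    # Transform: aggregate the turnover per month.
    orders["order_date"] = pd.to_datetime(orders["order_date"])
    monthly = (
        orders.assign(month=orders["order_date"].dt.to_period("M").astype(str))
              .groupby("month", as_index=False)["amount"].sum()
              .rename(columns={"amount": "turnover"})
    )

    # Produce: write the table that the dashboard will consume.
    engine = create_engine(db_url)
    monthly.to_sql("monthly_turnover", engine, if_exists="replace", index=False)

# Hypothetical usage:
# run_ingestion_step("raw_orders.json", "sqlite:///marketing.db")

Each such step is both a consumer and a producer of data sources, which is exactly why the stakeholders described next end up on both sides of the data quality conversation.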

There are several stakeholders in a pipeline at each step or application: the producers on one hand and the consumers on the other hand. Let’s look at these stakeholders in detail.

Data producers

The producer creates the data and makes it available to other stakeholders. By definition, a producer is not the final user of the data. It can be a data engineering team serving the data science team, an analyst serving the board of managers, or a cross-functional team that produces data products available for the organization. In our pipeline, for instance, an engineer coding the Spark ingestion job is a producer.

As a data producer, you are responsible for the content you serve, and you care about maintaining the right level of service for your consumers. Data producers also need to create new projects to fulfill as many needs as possible coming from various teams, so they must balance maintaining quality for existing projects with delivering new ones.

As a data producer, you have to maintain a high level of service. This can be achieved by doing the following:

Defining clear data quality targets: Understand what is required to maintain high quality, and communicate those standards to all the data source stakeholders
Ensuring those targets are met thanks to a robust validation process: Put the quality targets into practice and verify the quality of the data, from extraction to transformation and delivery (see the sketch after this list)
Keeping accurate and up-to-date data documentation: Document how the process modified the data with instruments such as data lineage and metrics
Collaborating with the data consumers: Ensure you set the right quality standards so that you can correctly maintain them and adapt to evolving needs
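As an illustration of what such a validation process can look like, the following sketch checks a few quality targets (completeness, uniqueness, and timeliness) on a pandas DataFrame before the data is delivered. The thresholds and column names are hypothetical examples; in practice, they should come from the agreement with your consumers, which we discuss in the SLA sections of this chapter.

import pandas as pd

def validate_before_delivery(df: pd.DataFrame) -> list:
    violations = []

    # Completeness target: at most 1% missing customer IDs.
    missing_ratio = df["customer_id"].isna().mean()
    if missing_ratio > 0.01:
        violations.append(f"completeness: {missing_ratio:.1%} missing customer_id")

    # Uniqueness target: order_id must be unique.
    if df["order_id"].duplicated().any():
        violations.append("uniqueness: duplicated order_id values")

    # Timeliness target: the data must cover up to yesterday.
    latest = pd.to_datetime(df["order_date"]).max()
    if latest < pd.Timestamp.today().normalize() - pd.Timedelta(days=1):
        violations.append(f"timeliness: latest record is {latest:%Y-%m-%d}")

    return violations

# A producer would run this at the end of the pipeline and stop or alert on
# any violation instead of silently publishing the data:
# problems = validate_before_delivery(monthly_orders)
# if problems:
#     raise RuntimeError("Quality targets not met: " + "; ".join(problems))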

We emphasize that collaboration with consumers is key to fulfilling the producer’s responsibilities. Let’s see the other end of the value chain: data consumers.

Data consumers

The consumer uses the data created by one or several producers. It can be a user, a team, or another application. A consumer may or may not be the final user of the data; it may simply be the producer of another dataset, which means that a consumer can become a producer and vice versa. Here are some examples of consumers:

Data or business analysts: They use data produced by the producers to extract insights that will support business decisions
Business managers: They use the data to make (strategic) decisions or follow indicators
Other producers: The consumer is a new intermediary in the data value chain who uses produced data to create new datasets