
Driving Data Quality with Data Contracts (E-Book)

Andrew Jones

Description

Despite the passage of time and the evolution of technology and architecture, the challenges we face in building data platforms persist. Our data often remains unreliable, lacks trust, and fails to deliver the promised value.
With Driving Data Quality with Data Contracts, you’ll discover the potential of data contracts to transform how you build your data platforms, finally overcoming these enduring problems. You’ll learn how establishing contracts as the interface allows you to explicitly assign responsibility and accountability for the data to those who know it best—the data generators—and give them the autonomy to generate and manage data as required. The book will show you how data contracts ensure that consumers get quality data with clearly defined expectations, enabling them to build on that data with confidence to deliver valuable analytics, performant ML models, and trusted data-driven products.
By the end of this book, you’ll have gained a comprehensive understanding of how data contracts can revolutionize your organization’s data culture and provide a competitive advantage by unlocking the real value within your data.

You can read this e-book in the Legimi apps or in any app that supports the following format:

EPUB

Year of publication: 2023




Driving Data Quality with Data Contracts

A comprehensive guide to building reliable, trusted, and effective data platforms

Andrew Jones

BIRMINGHAM—MUMBAI

Driving Data Quality with Data Contracts

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Reshma Raman

Publishing Product Manager: Apeksha Shetty

Senior Editor: Nathanya Dias

Technical Editor: Kavyashree K S

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Manju Arasan

Production Designer: Prashant Ghare

Marketing Coordinator: Nivedita Singh

First published: July 2023

Production reference: 1300623

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-83763-500-9

www.packtpub.com

To my wife, Debs, for all your love and support, and my children, Alex and Rosie – your laughter is my favorite sound.

– Andrew Jones

Foreword

An ounce of prevention is worth a pound of cure. This nugget of wisdom holds true in both the worlds of health and data. Unfortunately, trust in data is easily lost and hard to regain. Data contracts have taken our community by storm as a socio-technical solution to achieve and maintain high levels of trust.

My first exposure to data contracts was through Andrew’s presentation at a data quality meetup in London, hosted at the headquarters of a popular grocery chain. He and I both spoke at the meetup, although I was happy to be overshadowed. The room was rapt as he painted a picture of the whys, whats, and hows of data contracts at GoCardless. Most of us were only familiar with data contracts as a buzzword, but here Andrew was showing us the actual YAML specifications.

I remember that the Q&A portion was overflowing with questions for Andrew and we had to cut the section short. Thankfully, this book answers all of the questions I had.

Like his talk, this book is a bridge between theory and practice. Chapter 1, A Brief History of Data Platforms, paired with Chapter 7, A Contract-Driven Data Architecture, provides a strong conceptual foundation for data contracts. The final two chapters, Chapter 9, Implementing Data Contracts in Your Organization, and Chapter 10, Data Contracts in Practice, provide powerful tools to think about the practice of data contracts.

Along the way, as a reader, I am grateful for how the book progressively introduces complexity, interweaves real examples with explanations, and leaves me with opportunities to learn further. Whether you’re a data practitioner who is tired of being blamed for data quality issues or a business stakeholder who wants to promote data trust, this book is the gold standard for learning about data contracts.

Kevin Hu, PhD

Co-founder and CEO at Metaplane

Contributors

About the author

Andrew Jones is a principal engineer at GoCardless, one of Europe’s leading fintechs. He has over 15 years’ experience in the industry, the first half spent primarily as a software engineer before he moved into the data infrastructure and data engineering space. Joining GoCardless as its first data engineer, he led his team to build their data platform from scratch. After initially following a typical data architecture and becoming frustrated by the same old challenges he’d faced for years, he started thinking there must be a better way, which led to him coining and defining the ideas around data contracts. Andrew is a regular speaker and writer, and he is passionate about helping organizations get maximum value from data.

Thanks to everyone at GoCardless who supported me in taking data contracts from a one-pager to successful implementation, and then an industry topic worthy of an entire book! Also, a big thank you to everyone in the data community who has been so generous with their time and helped develop the ideas that became data contracts.

About the reviewers

John Thomas, a data analytics architect and dedicated book reviewer, combines his passion for data and technology in his work. He has successfully designed and implemented data warehouses, lakes, and meshes for organizations worldwide. With expertise in data integration, ETL processes, governance, and streaming, John’s eloquent book reviews resonate with both tech enthusiasts and book lovers. His reviews offer insights into the evolving technological landscape shaping the publishing industry.

Animesh Kumar is the CTO and co-founder @Modern, and the co-creator of the Data Developer Platform (DDP) infrastructure specification. With over 20 years in the data engineering space, Animesh has donned multiple hats, including those of architect, VP engineer, CTO, CPO, and founder, across a wide range of technology firms. He has architected engineering solutions for several A-players, including the likes of the NFL, Gap, Verizon, Rediff, Reliance, SGWS, Gensler, and TOI. He is presently dedicated to building DataOS as a direct implementation of the DDP infrastructure specification.

Table of Contents

Preface

Part 1: Why Data Contracts?

1

A Brief History of Data Platforms

The enterprise data warehouse

The big data platform

The modern data stack

The state of today’s data platforms

The lack of expectations

The lack of reliability

The lack of autonomy

The ever-increasing use of data in business-critical applications

Summary

Further reading

2

Introducing Data Contracts

What is a data contract?

An agreed interface between the generators of data, and its consumers

Setting expectations around that data

Defining how the data should be governed

Facilitating the explicit generation of quality data

The four principles of data contracts

When to use data contracts

Data contracts and the data mesh

Domain ownership

Data as a product

Self-serve data platform

Federated computational governance

Data contracts enable a data mesh

Summary

Further reading

Part 2: Driving Data Culture Change with Data Contracts

3

How to Get Adoption in Your Organization

Using data contracts to change an organization

Articulating the value of your data

Building data products

What is a data product?

Adopting a data product mindset

Designing a data product

Walking through an example of a data product

Summary

Further reading

4

Bringing Data Consumers and Generators Closer Together

Who is a consumer, and who is a generator?

Data consumers

Data generators

Assigning responsibility and accountability

Feeding data back to the product teams

Managing the evolution of data

Summary

Further reading

5

Embedding Data Governance

Why we need data governance

The requirements of data governance

How data governance programs are typically applied

Promoting data governance through data contracts

Assigning responsibility for data governance

Responsibilities of the data generators

Introducing the data architecture council

Working together to implement federated data governance

Summary

Further reading

Part 3: Designing and Implementing a Data Architecture Based on Data Contracts

6

What Makes Up a Data Contract

The schema of a data contract

Defining a schema

Using a schema registry as the source of truth

Evolving your data over time

Evolving your schemas

Migrating your consumers

Defining the governance and controls

Summary

Further reading

7

A Contract-Driven Data Architecture

A step-change in building data platforms

Building generic data tooling

Introducing a data infrastructure team

A case study from GoCardless in promoting autonomy

Promoting autonomy through decentralization

Introducing the principles of a contract-driven data architecture

Automation

Guidelines and guardrails

Consistency

Providing self-served data infrastructure

Summary

Further reading

8

A Sample Implementation

Technical requirements

Creating a data contract

Providing the interfaces to the data

Introducing IaC

Creating the interfaces from the data contract

Creating libraries for data generators

Populating a central schema registry

Registering a schema with the Confluent schema registry

Managing schema evolution

Implementing contract-driven tooling

Summary

Further reading

9

Implementing Data Contracts in Your Organization

Getting started with data contracts

The ability to define a data contract

The ability to provision an interface for the data for consumers to query

The ability of generators to write data to the interface

Migrating to data contracts

Discovering data contracts

What is a data catalog?

Why are data catalogs important for discovering data contracts?

What is data lineage?

Why is data lineage important for data contracts?

Building a mature data contracts-backed data culture

Summary

Further reading

10

Data Contracts in Practice

Designing a data contract

Identifying the purpose

Considering the trade-offs

Defining the data contract

Deploying the data contract

Monitoring and enforcing data contracts

The data contract’s definition

The quality of the data

The performance and dependability of the data

Data contract publishing patterns

Writing directly to the interface

Materialized views on CDC

The transactional outbox pattern

The listen-to-yourself pattern

Summary

Further reading

Index

Other Books You May Enjoy

Part 1: Why Data Contracts?

In this part, we will look briefly at the history of data platforms and how that led us to our current state, where data is unreliable, untrustworthy, and unable to drive real business value. We’ll then introduce data contracts, what they are, their guiding principles, and how they solve those problems.

This part comprises the following chapters:

Chapter 1, A Brief History of Data Platforms

Chapter 2, Introducing Data Contracts

1

A Brief History of Data Platforms

Before we can appreciate why we need to make a fundamental shift to a data contracts-backed data platform in order to improve the quality of our data, and ultimately the value we can get from that data, we need to understand the problems we are trying to solve. I’ve found the best way to do this is to look back at the recent generations of data architectures. By doing that, we’ll see that despite the vast improvements in the tooling available to us, we’ve been carrying through the same limitations in the architecture. That’s why we continue to struggle with the same old problems.

Despite these challenges, the importance of data continues to grow. As it is used in more and more business-critical applications, we can no longer accept data platforms that are unreliable, untrusted, and ineffective. We must find a better way.

By the end of this chapter, we’ll have explored the three most recent generations of data architectures at a high level, focusing on just the source and ingestion of upstream data, and the consumption of data downstream. We will gain an understanding of their limitations and bottlenecks and why we need to make a change. We’ll then be ready to learn about data contracts.

In this chapter, we’re going to cover the following main topics:

The enterprise data warehouse

The big data platform

The modern data stack

The state of today’s data platforms

The ever-increasing use of data in business-critical applications

The enterprise data warehouse

We’ll start by looking at the data architecture that was prevalent in the late 1990s and early 2000s, which was centered around an enterprise data warehouse (EDW). As we discuss the architecture and its limitations, you’ll start to notice how many of those limitations continue to affect us today, despite over 20 years of advancement in tools and capabilities.

EDW is the collective term for a reporting and analytics solution. You would typically engage one or two big vendors to provide these capabilities for you. It was expensive, and only larger companies could justify the investment.

The architecture was built around a large database in the center. This was likely an Oracle or MS SQL Server database, hosted on-premises (this was before the advent of cloud services). The extract, transform, and load (ETL) process was performed on data from source systems or, more accurately, on the underlying databases of those systems. That data could then be used to drive reporting and analytics.

The following diagram shows the EDW architecture:

Figure 1.1 – The EDW architecture

Because this ETL ran directly against the database of the source system, reliability was a problem. It created load on the database that could negatively impact the performance of the upstream service. That, and the limitations of the technology available at the time, meant we could perform only a few transformations on the data.

We also had to update the ETL process as the database schema and the data evolved over time, relying on the data generators to let us know when that happened. Otherwise, the pipeline would fail.
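To make that fragility concrete, here is a minimal sketch of this style of extraction. It is purely illustrative: it uses SQLite as a stand-in for both the source system’s database and the warehouse, and the table and column names are hypothetical. The point is that the job hardcodes the source schema, so any change made by the data generators breaks it.

import sqlite3

# Hypothetical stand-ins: in the EDW era the source would be the service's
# Oracle or MS SQL Server database, and the target an on-premises warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

# A table owned by the upstream service team (illustrative schema).
source.execute("CREATE TABLE payments (id, customer_id, amount_pence, created_at)")
source.execute("INSERT INTO payments VALUES (1, 42, 1999, '2005-06-01')")

# A fixed reporting table in the warehouse.
warehouse.execute("CREATE TABLE fact_payments (id, customer_id, amount, created_at)")

def run_nightly_etl():
    # Extract: query the source system's internal tables directly, adding
    # load to the same database that serves the live application.
    rows = source.execute(
        "SELECT id, customer_id, amount_pence, created_at FROM payments"
    ).fetchall()

    # Transform: only light reshaping was feasible with the tooling of the time.
    transformed = [(r[0], r[1], r[2] / 100.0, r[3]) for r in rows]

    # Load: write into the warehouse's reporting table.
    warehouse.executemany("INSERT INTO fact_payments VALUES (?, ?, ?, ?)", transformed)
    warehouse.commit()

run_nightly_etl()

# If the upstream team renames amount_pence or drops created_at, the SELECT
# above fails and the pipeline breaks; nothing forces them to tell us first.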

Those who owned databases were somewhat aware of the ETL work and the business value it drove. There were few barriers between the data generators and consumers and good communication.

However, the major limitation of this architecture was the database used for the data warehouse. It was very expensive and, as it was deployed on-premises, was of a fixed size and hard to scale. That created a limit on how much data could be stored there and made available for analytics.

It became the responsibility of the ETL developers to decide what data should be available, depending on the business needs, and to build and maintain that ETL process by getting access to the source systems and their underlying databases.

And so, this is where the bottleneck was. The ETL developers controlled what data went in, and they were the only ones who could make data available in the warehouse. Data would only be made available if it met a strong business need, and that typically meant the only data in the warehouse was data that drove the company KPIs. If you wanted some data for an analysis and it wasn’t already in there, you had to put a ticket in their backlog and hope for the best. If it ever did get prioritized, it was probably too late for what you wanted it for.

Note

Let’s use an example to illustrate how different roles worked together in this architecture.

Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She’s aware that some of the data from that database is extracted by a data analyst, Bukayo, and that is used to drive top-level business KPIs.

Bukayo can’t do much transformation on the data, due to the limitations of the technology and the cost of infrastructure, so the reporting he produces is largely based on the raw data.

There are no defined expectations between Vivianne and Bukayo, and Bukayo relies on Vivianne telling him in advance whether there are any changes to the data or the schema.

The extraction is not reliable. The ETL process can affect the performance of the database, and so it may be switched off when there is an incident. Schema and data changes are not always known in advance. The downstream database also has limited performance and cannot easily be scaled to deal with an increase in data or usage.

Both Vivianne and Bukayo lack autonomy. Vivianne can’t change her database schema without getting approval from Bukayo. Bukayo can only get a subset of data, with little say over the format. Furthermore, any potential users downstream of Bukayo can only access the data he has extracted, severely limiting the accessibility of the organization’s data.

This won’t be the last time we see a bottleneck that prevents access to, and the use of, quality data. Let’s look now at the next generation of data architecture and the introduction of big data, which was made possible by the release of Apache Hadoop in 2006.

The big data platform

As the internet took off in the 1990s and the size and importance of data grew with it, the big tech companies started developing a new generation of data tooling and architectures that aimed to reduce the cost of storing and transforming vast quantities of data. In 2003, Google published a paper describing its Google File System, and in 2004 followed that up with another paper, titled MapReduce: Simplified Data Processing on Large Clusters. These ideas were then implemented at Yahoo! and open sourced as Apache Hadoop in 2006.

Apache Hadoop contained two core modules. The Hadoop Distributed File System (HDFS) gave us the ability to store almost limitless amounts of data reliably and efficiently on commodity hardware. The MapReduce engine then gave us a model on which we could implement programs to process and transform this data, at scale, also on commodity hardware.
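To give a flavor of that programming model, here is a minimal sketch in plain Python rather than actual Hadoop code: a map function emits key-value pairs, the framework groups them by key, and a reduce function aggregates each group. The event records and field names are hypothetical.

from collections import defaultdict

# Illustrative input: raw event records as they might sit in HDFS.
records = [
    {"country": "GB", "amount": 20.0},
    {"country": "GB", "amount": 5.0},
    {"country": "FR", "amount": 12.5},
]

def map_phase(record):
    # Emit (key, value) pairs; here, revenue keyed by country.
    yield record["country"], record["amount"]

def reduce_phase(key, values):
    # Aggregate all the values emitted for one key.
    return key, sum(values)

# The framework's job: run map over every record, shuffle by key, then reduce.
grouped = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

print([reduce_phase(key, values) for key, values in grouped.items()])
# [('GB', 25.0), ('FR', 12.5)]

In a real cluster, these two functions would be distributed across many machines and terabytes of files, which is exactly what made such jobs hard to write and operate well.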

This led to the popularization of big data, which was the collective term for our reporting, ML, and analytics capabilities with HDFS and MapReduce as the foundation. These platforms used open source technology and could be on-premises or in the cloud. The reduced costs made this accessible to organizations of any size, who could either implement it themselves or use a packaged enterprise solution provided by the likes of Cloudera and MapR.

The following diagram shows the reference data platform architecture built upon Hadoop:

Figure 1.2 – The big data platform architecture

At the center of the architecture is the data lake, implemented on top of HDFS or a similar filesystem. Here, we could store an almost unlimited amount of semi-structured or unstructured data. This still needed to be put into an EDW in order to drive analytics, as data visualization tools such as Tableau needed a SQL-compatible database to connect to.

Because there were no expectations set on the structure of the data in the data lake, and no limits on the amount of data, it was very easy to write as much as you could and worry about how to use it later. This led to the concept of extract, load, and transform (ELT), as opposed to ETL: the idea was to extract and load the data into the data lake first, without any processing, and then apply schemas and transformations later, as part of loading the data into the data warehouse or reading it in other downstream processes.
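Here is a minimal sketch of that pattern, using a local directory as a stand-in for the data lake; the file layout and field names are purely illustrative. Notice that the load step imposes no schema at all, and every expectation is deferred to whoever eventually reads the data.

import json
from pathlib import Path

LAKE = Path("lake/events")  # stand-in for an HDFS or object-store path
LAKE.mkdir(parents=True, exist_ok=True)

def load_raw(batch_id, raw_records):
    # Extract and load: dump whatever the source produced, as-is, with no schema.
    (LAKE / f"batch_{batch_id}.json").write_text(json.dumps(raw_records))

def transform_for_warehouse():
    # Transform, much later: impose a schema "on read" once someone needs the data.
    rows = []
    for path in sorted(LAKE.glob("batch_*.json")):
        for record in json.loads(path.read_text()):
            # Each consumer has to guess at field names, types, and defaults here.
            rows.append((record.get("user_id"), float(record.get("amount", 0)), record.get("ts")))
    return rows

load_raw(1, [{"user_id": "u1", "amount": "9.99", "ts": "2015-03-01T12:00:00Z"}])
print(transform_for_warehouse())  # [('u1', 9.99, '2015-03-01T12:00:00Z')]

With no expectations set at write time, the guesswork inside transform_for_warehouse is exactly where the problems described next come from.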

We then had much more data than ever before. With a low barrier to entry and cheap storage, data was easily added to the data lake, whether there was a consumer requirement in mind or not.

However, in practice, much of that data was never used. For a start, it was almost impossible to know what data was in there and how it was structured. It lacked documentation, had no set expectations on its reliability and quality, and had no governance over how it was managed. Then, once you did find some data you wanted to use, you needed to write MapReduce jobs using Hadoop or, later, Apache Spark. This was very difficult to do, particularly at any scale, and only achievable by a large team of specialist data engineers. Even then, those jobs tended to be unreliable and have unpredictable performance.

This is why we started hearing people refer to it as the data swamp. While much of the data was likely valuable, the inaccessibility of the data lake meant it was never used. Gartner introduced the term dark data to describe this, where data is collected and never used, and the costs of storing and managing that data outweigh any value gained from it (https://www.gartner.com/en/information-technology/glossary/dark-data). In 2015, IDC estimated 90% of unstructured data could be considered dark (https://www.kdnuggets.com/2015/11/importance-dark-data-big-data-world.html).

Another consequence of this architecture was that it moved the end data consumers further away from the data generators. Typically, a central data engineering team was introduced to focus solely on ingesting the data into the data lake, building the tools and the connections required to do that from as many source systems as possible. They were the ones interacting with the data generators, not the ultimate consumers of the data.

So, despite the advance in tools and technologies, in practice, we still had many of the same limitations as before. Only a limited amount of data could be made available for analysis and other uses, and we had that same bottleneck controlling what that data was.

Note

Let’s return to our example to illustrate how different roles worked together with this architecture.

Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She may or may not be aware that some of the data from that database is extracted in a raw form, and is unlikely to know exactly what the data is. Certainly, she doesn’t know why.

Ben is a data engineer who works on the ELT pipeline. He aims to extract as much of the data as possible into the data lake. He doesn’t know much about the data itself, or what it will be used for. He spends a lot of time dealing with changing schemas that break his pipelines.