Description

Traditional monolithic data platforms struggle with scalability and burden central data teams with excessive cognitive load, leading to challenges in managing technological debt. As maintenance costs escalate, these platforms lose their ability to provide sustained value over time. With two decades of hands-on experience implementing data solutions and his pioneering work in the Open Data Mesh Initiative, Andrea Gioia brings practical insights and proven strategies for transforming how organizations manage their data assets.
Managing Data as a Product introduces a modular and distributed approach to data platform development, centered on the concept of data products. In this book, you’ll explore the rationale behind this shift, understand the core features and structure of data products, and learn how to identify, develop, and operate them in a production environment. The book guides you through designing and implementing an incremental, value-driven strategy for adopting data product-centered architectures, including strategies for securing buy-in from stakeholders. Additionally, it explores data modeling in distributed environments, emphasizing its crucial role in fully leveraging modern generative AI solutions.
By the end of this book, you’ll have gained a comprehensive understanding of product-centric data architecture and the essential steps needed to adopt this modern approach to data management.




Managing Data as a Product

Design and build data-product-centered socio-technical architectures

Andrea Gioia

Managing Data as a Product

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

The author acknowledges the use of cutting-edge AI, such as ChatGPT, with the sole aim of enhancing the language and clarity within the book, thereby ensuring a smooth reading experience for readers. It’s important to note that the content itself has been crafted by the author and edited by a professional publishing team.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Apeksha Shetty

Publishing Product Manager: Nilesh Kowadkar

Book Project Manager: Aparna Nair

Senior Content Development Editor: Priyanka Soam

Technical Editor: Sweety Pagaria

Copy Editor: Safis Editing

Proofreader: Priyanka Soam

Indexer: Rekha Nair

Production Designer: Joshua Misquitta

Senior DevRel Marketing Executive: Nivedita Singh

First published: November 2024

Production reference: 2291124

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-83546-853-1

www.packtpub.com

To my parents, Delia and Salvatore, for your endless encouragement, patience, and love. This is for you.

– Andrea Gioia

Foreword

I have known and worked with Andrea for more than 7 years. He is undoubtedly recognized as an innovation leader in the field of data management and one of the most active thinkers in the global data community.

Throughout my professional career, Andrea has been, and still is, a mentor and an example to follow for the passion he conveys in the continuous pursuit of advancements in all aspects of data management.

We live in the era of the Fourth Industrial Revolution: in this scenario of uncertainty, companies must be able to continuously build new, dynamic capabilities to obtain competitive advantage and survive. From this point of view, core enterprise data is an essential resource that companies can leverage; however, data is a liquid asset and only generates value if it is actually used and, even more so, reused. Storing data without using it does not generate value, but only liability. To encourage the reuse of data and derive value from it, it is necessary to view data no longer as a by-product of software applications but as a first-class product in the enterprise architecture.

In this volume, Andrea highlights that managing data as a product is not a purely technological issue but requires a holistic, 360-degree understanding of the company’s business strategy and operations, considering the market context, the business architecture, the organizational architecture, the operating model, the state of the IT architecture, and many other aspects. In the current post-digitalization era, a data management practitioner increasingly needs T-shaped, transversal skills to successfully implement the change of perspective that leads to managing data as a product.

For years, Andrea has been an example of a complete professional, open-minded and always attentive to all of these aspects. Through his advisory experience alongside customers, he has come to understand that modularity in architecture design is an essential feature for managing the intrinsic complexity of a modern data ecosystem. This belief led him to define the paradigm described in this book for the development of modern data architectures.

However, as a deep expert in the modern data management paradigms that inspired this book, Andrea understood that architectural modularity alone is not enough to guarantee the effective use of data assets unless it is supported by a shared enterprise knowledge model. He realized that data management must go beyond the pure management of elementary data and extend to organically managing multiple types of metadata and corporate knowledge, focusing on the explicit modeling of domain semantics. This is a human-centric process that can boost the accuracy of new generative AI techniques applied to data but requires the active contribution of domain experts. If this company-specific, irreplaceable knowledge base is evolved in synergy with the development of data as products, it becomes easier to enable the actual composability of the data products that make up the modular architecture and to encourage their reuse, maximizing overall value for the company.

This book disseminates the main elements of the vision that the author has developed over the years, enriched with the lessons he has collected during his advisory journey. It is a piece of work that will inspire many people in the evolution of their professional lives – not only data engineers but also data managers, chief data officers, chief information officers, and enterprise architects.

Finally, I would like to thank Andrea for asking me to be a reviewer of this volume. It was a privilege for me!

Giulio Scotti

Governance Advisor, Quantyca

Contributors

About the author

Andrea Gioia is a partner and CTO at Quantyca, a consulting firm specializing in data management, and co-founder of blindata.io, a SaaS platform for data governance and compliance. With over 20 years of experience, Andrea has led cross-functional teams delivering complex data projects across multiple industries. As CTO, he advises clients on defining and executing their data strategies. Andrea is a frequent speaker and writer, serving as the main organizer of the Data Engineering Italian Meetup and leading the Open Data Mesh Initiative. He is an active DAMA member and has been part of the DAMA Italy Chapter's scientific committee since 2023.

This book would not have been possible without the collaboration and support of everyone I work with daily to drive innovation in the data management market. I am deeply grateful to all my colleagues at Quantyca and Blindata, as well as to our partners and clients.

About the reviewer

Andrew Jones is an independent data consultant, helping data leaders transform their organizations into ones where data drives revenue and is dependable, governed, valuable, and guaranteed with data contracts. In 2021, he created data contracts, an architectural pattern that brings data teams closer to the business, drives team autonomy, and embeds quality and governance as part of the data platform.

In 2023, he wrote the definitive book on data contracts, Driving Data Quality with Data Contracts. Through his independent consulting practice, he helps organizations big and small adopt data products, data meshes, and data contracts. He is based near London, UK, and is a regular writer and public speaker.

Table of Contents

Preface

Part 1: Data Products and the Power of Modular Architectures

1

From Data as a Byproduct to Data as a Product

Reviewing the history of monolithic data platforms

Data warehouse

Data lake

DWH versus data lake

Modern data stack

Understanding why monolithic data platforms fail

Monolithic versus modular architecture

The power of modularity

Failure loops

Exploring why we need to manage data as a product

Being data-centric

Data is everybody’s business

Data product thinking

Putting it all together

Summary

Further reading

2

Data Products

Defining a data product

Pure data products versus analytical applications

Why do we need pure data products?

Pure data product definition

The rise of data-driven applications

Exploring key characteristics of pure data product

Popular ilities

Relevance

Accuracy

Reusability

Composability

Dissecting the anatomy of a pure data product

Anatomy overview

Data

Metadata

Application and infrastructure

Interfaces

Classifying pure data products

Source-aligned versus consumer-aligned

Domain-aligned versus value stream-aligned

Other classifications

Summary

Further reading

3

Data Product-Centered Architectures

Designing a data-product-centric architecture

System architecture

Socio-technical architecture

Architectural principles

Architectural components

Dissecting the architecture’s operational plane

Core capabilities

Data product development

Governance policy-making

XOps platform engineering

Data transformation enabling

Dissecting the architecture’s management plane

Identity system

Intelligent system

Control system

Coordination system

Operating model

Exploring alternative approaches to modern data management

Data mesh

Data fabric

Data-centric approach and dataware

Summary

Further reading

Part 2: Managing the Data Product Lifecycle

4

Identifying Data Products and Prioritizing Developments

Modeling a business domain

Introducing DDD

Connecting problem and solution spaces

Identifying subdomains

Identifying bounded contexts

Mapping business capabilities

Discovering data products with event storming

Understanding a business strategy

Turning business strategy into actionable business cases

Analyzing processes with event storming

Managing the data product portfolio

Validating data product proposals

Describing data product with data product canvas

Optimizing a data product portfolio

Summary

Further reading

5

Designing and Implementing Data Products

Designing data products and their interactions

Understanding the data product local environment

Understanding the data product global ecosystem

Understanding data product internals

Managing data product metadata

Describing data products

Introducing the Data Product Descriptor Specification

Describing interface components in DPDS

Describing internal components in DPDS

Managing data product data

Sourcing data

Processing data

Serving data

Summary

Further reading

6

Operating Data Products in Production

Deploying data products

Understanding continuous integration

Automating the build and release processes

Understanding CD

Defining the deployment pipeline

Governing data products

Collecting and sharing metadata

Implementing computational policy

Observing data products

Controlling data products

Consuming data products

Discovering data products

Accessing data products

Composing data products

Evolving data products

Versioning data products

Deprecating data products

Summary

Further reading

7

Automating Data Product Lifecycle Management

Understanding the XOps platform

Mobilizing the data ecosystem

Understanding platform value engines

Exploring platform architecture

Boosting developer experience

Implementing data product building blocks

Defining data product modules and blueprints

Leveraging sidecars to manage cross-cutting concerns

Supporting computational policy and ontology development

Boosting operational experience

Orchestrating the data product deployment pipeline

Controlling data product operations

Boosting consumer experience

Managing the data product marketplace

Supporting data product composition

Evaluating make-or-buy options

Deciding when to make the call

Solving the dilemma between make or buy

Future-proofing your investments

Summary

Further reading

Part 3: Designing a Successful Data Product Strategy

8

Moving through the Adoption Journey

Understanding adoption phases

Tracing the journey ahead

Getting key stakeholders on board

Delving into the assessment phase

Preparing for the assessment

Understanding the why

Defining the how

Delving into the bootstrap phase

Setting the foundation

Building the first data products

Defining key governance policies

Implementing the thinnest viable platform

Enabling the enablers

Transitioning to the next phase

Delving into the expand phase

Crossing the chasm

Scaling the adoption

Evolving governance policies and platform services

Dealing with legacy systems

Delving into the sustain phase

Becoming the new normal

Remaining capable to adapt

Driving the adoption with an adaptive data strategy

Understanding evolutionary strategy pillars

Keeping strategy and execution aligned with EDGE

Planning the execution with EDGE

Evolving the strategy with EDGE

Monitoring the adoption process with fitness functions

Summary

Further reading

9

Team Topologies and Data Ownership at Scale

Introducing Team Topologies

Dissecting organizational architecture

Leveraging Team Topologies

Understanding team types

Understanding interaction modes

Mapping the organizational architecture

Delving into the fractal organization

Defining operational teams

Data product teams

Platform teams

Governance teams

Enabling team

Defining management teams

Data strategy committee

Data portfolio management committee

Operations management committee

Putting it all together

Evaluating decentralization strategies

Understanding when to decentralize

Understanding what to decentralize

Moving toward decentralization

Summary

Further reading

10

Distributed Data Modeling

Introducing data modeling

What is a data model?

Implicit and explicit models

Data modeling process

Data model representations

Operational and analytical data modeling methodologies

Exploring distributed physical modeling

Dimensional modeling

Centralized dimensional modeling

Distributed dimensional modeling

Data Vault modeling

Unified star schema modeling

Managing the physical model life cycle

Exploring distributed conceptual modeling

Ubiquitous language

From string to things

Managing the conceptual model life cycle

Summary

Further reading

11

Building an AI-Ready Information Architecture

Exploring information architecture

Data assets

Information architecture pyramid

Information plane

Knowledge plane

Managing enterprise knowledge

Enterprise ontology

Federated modeling team

Managing knowledge as a product

Building an enterprise knowledge graph

Knowledge graph architectures

Connecting data and knowledge plane

Knowledge-driven data management

Leveraging modern AI

The generative AI revolution

Boosting generative AI with domain knowledge

Future-proof your AI investment

Summary

Further reading

12

Bringing It All Together

Core beliefs shaping the future of data management

Data management is not a supporting function anymore

Data management is not just about data

Data management is not just an IT responsibility

Setting yourself up for success

Be optimistic, not naïve

Be a reflective practitioner, not a methodological purist

Focus on the system, not on the parts

Decentralize to scale, do not scale to decentralize

Be a change agent, not a change manager

Be curious, not obsessed

Final remarks

The evolution of data management

Data-centered and people-driven organizations

Summary

Further reading

Index

Other Books You May Enjoy

Preface

Hello, and welcome to Managing Data as a Product! I’m excited to share everything I’ve learned about managing data as a product and how this new paradigm can solve recurrent problems in data architectures that, despite huge investment, periodically collapse under the weight of their own complexity, making sustainable evolution a real challenge.

Ironically, the most successful data platforms, those that bring the greatest value to an organization, are often the first to struggle. Their success drives rapid growth in both the number of managed data assets and users, which leads to complexity. This complexity gradually slows down their growth until the platforms become too costly to maintain and too slow to evolve. However, this march toward self-destruction isn’t inevitable. We can rethink how we design data management solutions, so they don’t fall victim to their success but instead exploit it, multiplying the value they generate for the organization while growing.

Managing data as a product allows us to handle growing complexity by modularizing the data management architecture. Each data product is a modular unit that helps isolate complexity into smaller, manageable parts. Over time, the collection of developed data products forms a portfolio of building blocks that can be easily recombined to support new use cases. This way, while the platform’s complexity remains stable as it grows, the value derived from the managed data assets increases. Implementing new business cases becomes simpler, as existing data products can be reused rather than built from scratch.

However, managing data as a product is a profound paradigm shift from traditional monolithic data architectures, impacting not only technology but also, and especially, the organization. Throughout this book, chapter by chapter, we’ll explore practical, actionable steps to adopt this new paradigm, addressing all key aspects from both a technical and organizational perspective.

As we’ll see, adopting a data-as-a-product approach is challenging, but it’s well worth the effort. This book is a travel guide inspired by my experience, aimed at helping you find the best path for your unique context to successfully navigate this paradigm shift.

What this book covers

Chapter 1, From Data as a Byproduct to Data as a Product, shows how modularizing data architecture with data products solves recurring problems that make its sustainable evolution challenging over time.

Chapter 2, Data Products, defines what a data product is, outlining its key characteristics and explaining the essential components that make it up, highlighting how each element contributes to its overall function and value.

Chapter 3, Data Product-Centered Architectures, explores the foundational principles of a data product-centered architecture, analyzing the key operational and organizational capabilities required to manage it. We also compare other modern approaches such as data meshes and data fabrics with the data-as-product paradigm to highlight their similarities and key differences.

Chapter 4, Identifying Data Products and Prioritizing Developments, explains how to identify and prioritize data products using a value-driven approach. It starts by identifying relevant business cases through domain-driven design and event storming, then shows how to define the data products needed to support those business cases.

Chapter 5, Designing and Implementing Data Products, explores the process of designing a data product based on identified requirements, starting with techniques for defining scope, interfaces, and ecosystem relationships. It then examines the core components of a data product, their development process, and how to describe them with machine-readable documents. Finally, it analyzes the data flow, focusing on components responsible for sourcing, processing, and serving data.

Chapter 6, Operating Data Products in Production, covers the entire lifecycle of a data product, from release to decommissioning. It introduces CI/CD methodologies, explores managing a data product in production with a focus on governance, observability, and access control, and discusses techniques for evolving and reusing data products in a distributed environment.

Chapter 7, Automating Data Product Lifecycle Management, explains how to speed up the adoption of a data product-centric paradigm by creating a self-serve platform to mobilize the entire data ecosystem. It covers the platform’s main features, how it improves the experience for developers, operators, and consumers, and the key factors in deciding whether to build, buy, or use a hybrid approach in implementing it.

Chapter 8, Moving through the Adoption Journey, covers the adoption of the data-as-a-product paradigm. It outlines the key phases of the process, exploring objectives, challenges, and activities for each stage. Finally, it discusses how to create a flexible data strategy that evolves with each phase, building on previous learnings.

Chapter 9, Team Topologies and Data Ownership at Scale, explains how to design an organizational structure for managing data as a product. It introduces the team topologies framework, including team types and interaction modes, and explores how to organize teams for efficient data product delivery. Finally, it looks at how to integrate these teams into the organization and decide between the centralized or decentralized data management model.

Chapter 10, Distributed Data Modeling, examines data modeling in a decentralized, data product-centered architecture. It defines data models and emphasizes intentionality in modeling, then examines physical modeling techniques for distributed environments. Finally, it covers conceptual data modeling and its role in guiding the design and evolution of data products within a cohesive ecosystem.

Chapter 11, Building an AI-Ready Information Architecture, explores how to build an information architecture that maximizes the value of managed data, starting with developed data products. It covers how different planes of the information architecture add context to data and focuses especially on the knowledge plane, where shared conceptual models ensure semantic interoperability between data products. Finally, it explores how federated modeling teams can create and link conceptual models to physical data, forming an enterprise knowledge graph crucial for unlocking the potential of generative AI.

Chapter 12, Bringing It All Together, revisits key concepts from earlier chapters, tying them to the core beliefs about data management that inspired this book. It wraps up with practical advice for becoming a more successful data management practitioner.

To get the most out of this book

In this book, both data products and the self-serve platform needed to support their development and operation are described at a logical level, without reference to any specific technology stack. Therefore, no prior knowledge of specific technologies is required to read and understand the content.

In some chapters, examples of metadata are provided to describe the components of a data product. This metadata is generally represented as JSON snippets. To use and modify them, we suggest a text editor that can recognize JSON syntax, such as Visual Studio Code.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Managing-Data-as-a-Product. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “The Promises object contains all the metadata through which the data product declares the intent of the port.”

A block of code is set as follows:

:purchases a rdf:Property ;
    rdfs:domain :Customer ;
    rdfs:range [ a owl:Class ;
        owl:unionOf ( :Product :Service ) ] .

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Select System info from the Administration panel.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share your thoughts

Once you’ve read Managing Data as a Product, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there. You can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/978-1-83546-853-1

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly

Part 1: Data Products and the Power of Modular Architectures

In this part, we’ll explore the rationale and implications of treating data as a product. We’ll break down the key components of a data product, starting with interfaces accessible to external consumers, moving through internal data management applications, and ending with the infrastructure needed for production use. Finally, we’ll highlight the qualities and capabilities that a data architecture centered on data products must embody to balance the agility needed for scalability with the governance essential for sustainability.

This part has the following chapters:

Chapter 1, From Data as a Byproduct to Data as a Product

Chapter 2, Data Products

Chapter 3, Data Product-Centered Architectures

1

From Data as a Byproduct to Data as a Product

In this book, we will explore how to transition from managing data merely as a byproduct that supports applications to managing data as a product in its own right. Before tackling the various aspects that contribute to this paradigm shift, it’s crucial to understand why managing data as a product is important and how this practice enables us to surpass the limits of today’s data platforms.

In this chapter, we will explore the history of monolithic data platforms, which have characterized the evolution of data management over the last 30 years. We will seek to understand the common problems that make them incapable of sustainably managing the accidental complexity they generate as they grow. Finally, we will see why addressing the fundamental issues, instead of merely treating surface-level symptoms, requires more than just technological innovations. It calls for a paradigm shift that leads us toward more sustainable socio-technical architectures, based on the key practice of managing data as a product.

This chapter will cover the following main topics:

Reviewing the history of monolithic data platforms

Understanding why monolithic data platforms fail

Exploring why we need to manage data as a product

Reviewing the history of monolithic data platforms

Managing the substantial amount of data that’s generated by every company daily is a complex endeavor. It calls for dedicated resources and technological support in the form of a specific data platform.

Nowadays, data platforms often fall short in delivering the expected value compared to the investments made, primarily due to organizations’ inability to sustainably manage the complexity they generate over time.

System complexity

The complexity of a system is determined by the number of its components multiplied by the number of correlations between them. A database with 10,000 tables is not much more complex than a database with 100 tables if the tables themselves are not correlated. Each table tells its own story. It can be manipulated without concern for the meaning of other tables and the potential impacts that the executed action may have on them. However, the complexity between the two databases is very different if the tables are highly correlated. In systems with high levels of correlation, complexity generally grows quadratically with the number of components.
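To make the intuition concrete, here is a minimal Python sketch (an illustration, not taken from the book) of why complexity grows roughly quadratically when components are highly correlated:

# Back-of-the-envelope sketch: the number of potential pairwise correlations
# among n components is n * (n - 1) / 2, which grows roughly quadratically with n.
def max_pairwise_correlations(n_components: int) -> int:
    return n_components * (n_components - 1) // 2

for n_tables in (100, 10_000):
    print(f"{n_tables:>6,} tables -> up to {max_pairwise_correlations(n_tables):,} potential correlations")
#    100 tables -> up to 4,950 potential correlations
# 10,000 tables -> up to 49,995,000 potential correlations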

Not all the complexity of a system is related to the complexity of the specific function for which the system is designed (essential complexity). Part of the complexity is also derived from the approaches and tools used to develop it (accidental complexity). Essential complexity is incompressible, but accidental complexity can be reduced or at least kept under control to prevent it from growing to the point where the entire system becomes unmanageable and incapable of evolving.

To create systems capable of supporting the complexity they generate, it is necessary to intervene in the tools and development approaches that generate accidental complexity. Rarely in my experience have I seen a data platform fail due to a wrong technological choice (that is, tools). Even when it did happen, the blame could not be squarely placed on individual tools but rather on the failure to use them within a coherent and purposeful architecture. If data platforms fail, crushed under the weight of their complexity, it is predominantly due to the socio-technical architecture approaches that were employed to develop and evolve them over time (that is, approaches).

For this reason, modern approaches to data management focus on these architectural aspects rather than individual technology tools. However, it has not always been this way. The subsequent approaches to data management that we have seen over the years have been primarily driven by technologies, seen as key to solving data management issues. Before delving into the common elements leading to the failure of monolithic data architectures in more detail, let’s explore the main approaches that have evolved over time and on which most monolithic platforms are based.

Data warehouse

The concept of a data warehouse (DWH) was first introduced in the late 80s by IBM researchers Barry Devlin and Paul Murphy. They proposed the idea of a centralized repository to store and manage large volumes of data from various sources to support decision-making processes.

The adoption of DWH began to spread widely in the 90s, thanks to the conceptual work of authors such as Bill Inmon and Ralph Kimball. Before that, data integrations were predominantly carried out using point-to-point logic, while analyses were mostly performed by directly extracting data from source systems.

This tactical approach to data management made perfect sense in the early days of the third industrial revolution when the digitalization of business processes was just beginning. The applications were still few, and the produced data was limited. The focus of IT strategy was rightly on automating core business processes through new applications, rather than managing the data assets they made available.

However, as digitalization progressed, the number of applications and generated data quickly reached a level where managing integrations with point-to-point logic became too complex and no longer sustainable. The DWH emerged as a response to these scalability issues, introducing the idea of having a dedicated platform and team for data management. The DWH represents a significant shift in both technological and organizational architecture. With the introduction of the DWH, data management becomes a full-fledged organizational function with dedicated resources, processes, and objectives.

From a technological architecture viewpoint, the DWH is based on the idea of collecting all data generated by digitalized processes in a single repository, consolidating them, and making them available for analysis through a unified model. The DWH assumes the role of a single source of truth for company data, serving as the foundation for all analyses and decision-making processes. It also decouples data producers from consumers, allowing the costs of integration and consolidation for a specific data domain to be incurred only once and then reused for the implementation of multiple analytical use cases.

However, like any architecture, the DWH has its limitations. Specifically, it prioritizes the quality and reusability of consolidated data in the unified model over agility in development. Consequently, while on one hand integration costs decrease and data consumption is simplified, on the other hand the lead time for integrating new data sources and releasing new analyses increases.

The end of the millennium witnessed a constant growth in the number of applications within organizations, leading to a tremendous increase in the volume, variety, and velocity of data to manage. This shift resulted in a transition from simply talking about data to addressing the new challenges of big data. Simultaneously, the number and variety of data consumers for analytical purposes significantly grew. Traditional directional reporting expanded to include analyses for a broader audience, machine learning (ML)-based models, and various data-driven applications.

Due to these trends, at the beginning of the millennium, the advantages of DWHs began to be overshadowed by their disadvantages. The DWH became a bottleneck, squeezed between data producers and consumers, increasingly seen not as an enabler but as a hindrance to digital transformation initiatives.

Moreover, the growth in the volume, variety, and velocity of data, along with diverse consumption patterns, made DWHs unsuitable for optimally managing all types of data and workloads. Finally, the surge in data volume led to a spike in the costs of the hardware and software that was used to implement DWH platforms – costs that were increasingly less justified by the return on investment (ROI) of analytical initiatives.

Data lake

The data lake approach began gaining traction in the early 2000s as a response to the challenges faced by the DWH in managing big data and new analytical workloads, particularly those related to ML and artificial intelligence (AI). From a logical architecture perspective, a data lake, like a DWH, serves as a centralized repository of data. However, from a technological architecture standpoint, it relies on distributed storage and computing solutions, such as HDFS and MapReduce, among the first available on the market. These distributed technologies are built on commodity hardware and open source software, making the costs for managing big data volumes significantly lower than those associated with traditional analytical databases used by DWHs.

The adoption of this approach in the early 2000s was not solely driven by cost considerations. In general, vendors entering this market positioned data lakes as platforms for more agile data management compared to traditional DWHs. Specifically, the separation between storage and computation allows for direct querying of raw data (schema on read), theoretically eliminating the need for extensive upfront processing to make data available to downstream consumers.

The combination of cost savings and the promise of agility led to the rapid uptake of this approach, even in many companies that didn’t necessarily have truly big data to manage.

The initial idea of addressing the need for agility in developing new analytical solutions by focusing almost exclusively on data collection, with minimal interventions in terms of structuring, cleaning, normalizing, and enriching the data, did not prove to be highly effective. Many early data lake projects that heavily emphasized this agile and destructured approach resulted in systems that were complex to manage and difficult to use, where data fragmentation from source systems was quickly reproduced within the data lake, turning it into a data swamp.

Today, modern data lakes strike a more balanced approach between the need for agility and the need for control. They generally rely on a medallion architecture: a layered architecture where raw data is progressively transformed and enriched as it flows through different layers before being made available for consumption.
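As a purely illustrative sketch (the layer names, fields, and rules below are assumptions, not a prescription from the book), a medallion-style flow can be thought of as successive refinement steps:

# Illustrative medallion-style flow: bronze (raw), silver (cleaned),
# gold (consumption-ready aggregates). All names and rules are assumptions.
raw_orders = [  # bronze: data landed as-is from a hypothetical source system
    {"order_id": "1", "amount": "120.5", "country": " IT "},
    {"order_id": "2", "amount": None, "country": "it"},
]

def to_silver(rows):
    # silver: normalized, typed records; unusable rows are dropped
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue
        cleaned.append({
            "order_id": row["order_id"],
            "amount": float(row["amount"]),
            "country": row["country"].strip().upper(),
        })
    return cleaned

def to_gold(rows):
    # gold: enriched, business-level view (revenue per country)
    revenue = {}
    for row in rows:
        revenue[row["country"]] = revenue.get(row["country"], 0.0) + row["amount"]
    return revenue

print(to_gold(to_silver(raw_orders)))  # {'IT': 120.5}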

Data lakes have been a key element in the evolution of data management technologies, with significant impacts also on DWHs. The most notable impacts are as follows:

Horizontal scaling on commodity hardware to effectively manage large quantities of data

Separation between storage and computation and the corresponding ability to reuse the same data for different workloads

Utilization of enterprise-level open source software supported by vendors through diverse commercial models, promoting a reduction in lock-in and accelerating innovation in the industry

Reduction of costs for implementing analytics solutions alongside market democratization

DWH versus data lake

The data lake approach has not replaced the DWH and vice versa. Both have evolved, addressing their respective limitations.

Modern DWHs are built on distributed, horizontally scalable architectures. Storage and computation are separated and independently scalable. They have become multimodal data management platforms capable of handling various workloads on stored data. Additionally, they enable the management of diverse data types: structured, semi-structured, and even unstructured.

On the other hand, modern data lakes have progressively acquired functionalities typical of a DWH, such as transactionality, time travel, and SQL-based query interfaces. Query engines have now achieved performance levels comparable to those of DWHs. Many aspects related to data management optimization are automatically handled by the platform, broadening the user base and reducing barriers in terms of technical knowledge required for implementation.

The dichotomy that once existed no longer holds. The technologies that support both approaches can effectively handle all types of data across major workloads in a highly scalable manner. Even the approaches to data modeling and governance have converged. Modern data lake management approaches have become more organized and structured, while those for DWH management have become more flexible and faster.

It is possible to build a data platform based solely on one of these approaches and their respective technologies without worrying about potential limitations. However, many organizations still prefer a hybrid approach, leveraging the strengths of data lake technologies for flexibility and scalability in managing raw data, and the strengths of DWH technologies for achieving the best performance in analyzing enriched data.

Modern data stack

In the early 2010s, major cloud providers released fully managed versions of their DWHs. Simultaneously, they introduced fully managed versions of data lake platforms. In the same period, pure-play vendors also entered the market, offering cloud solutions for DWHs (for example, Snowflake) and data lakes (for example, Databricks). Within a short time, the market was flooded with solutions capable of ensuring the necessary scalability to handle large amounts of data in an elastic and fully managed manner. Traditional vendors initiated a process of porting their products to the cloud to avoid falling behind and becoming legacy. The momentary lack of cloud-native solutions for managing data on the new cloud-based platforms led to the emergence of many startups, which covered the essential functionalities constituting the technological stack needed to manage the data life cycle in the cloud, from acquisition to analysis.

The second half of the 2010s witnessed a race between traditional players committed to transitioning their suites to the cloud and new players eager to gain as much market share as possible in specific functionalities. Taking advantage of the transition period to the cloud for traditional vendors and then focusing intensely on specific functionalities, the new players managed to capture significant market shares among organizations determined to migrate their analytical workloads to the cloud. This marked the beginning of a technological unbundling cycle characterized by an explosion of point solutions, a landscape that we still find ourselves in today. The modern data stack (MDS) is the term that’s used to refer to this new ecosystem of specific and cloud-native solutions supporting data management.

The MDS has brought considerable technological innovation, further reducing accidental complexity associated with tools and consequently increasing the productivity of data teams. However, the extreme fragmentation of tools required for data management has increased the complexity of developing and operationally managing the core capabilities of a data platform. Today, data platform architectures are based on many more tools than in the past. These tools must be selected, integrated, and governed from both security and cost perspectives. In summary, while the innovative drive of the MDS ecosystem has reduced development times and analysis maintenance costs, on the other hand, it has increased operational costs in terms of developing and maintaining the underlying platform. It is not always clear whether the balance between advantages and disadvantages is positive or negative. It is likely that in the coming years, after a strongly expansive phase (unbundling), the offering will converge again toward a rationalization phase (bundling), where we may see some MDS vendors merging, others being acquired by big tech, and some potentially failing after the driving force of the collected investments diminishes.

The MDS is not a completely new approach to data management. As we have seen, it is a new technological proposition that integrates with the popular approaches mentioned previously.

Understanding why monolithic data platforms fail

If we look at the evolution of data management over the last 40 years, we’ll see a story of incredible technological revolutions and just as many project failures. At the beginning of this chapter, we mentioned that the main reason for these failures is the complexity generated by data management platforms, and this complexity grows approximately quadratically with the size of the platform. Therefore, these are not typical project failures as we are accustomed to understanding them. Data platforms rarely fail before their launch, never making it into production. Instead, they often experience failures related to their ability to evolve and survive over time. Platforms don’t fail immediately but over time, as they struggle to deliver the expected value in proportion to the constantly increasing maintenance costs they generate.

Like a Jenga tower becoming increasingly unstable as more pieces are added until it collapses, data platforms often implode under the weight of their own complexity (complexity catastrophes). At this point, you must start over and build a new platform from scratch. Many organizations have gone through different generations of data platforms throughout their history, trying new technological stacks offered by a constantly evolving market, but facing the same problems again and again.

Technology has helped solve some symptoms of the problems but not the root causes of accidental complexity. If not managed, this complexity inevitably brings us back to the starting point. These root causes can be traced back to the monolithic architecture, both technological and organizational, that is common to all the data management approaches presented in the previous section. But what exactly does it mean for a data platform to have a monolithic architecture? Let’s explore that together.

Monolithic versus modular architecture

A data platform, when it is first created, is generally not a complex system. It becomes complex as it grows because its components increase and the interconnections between them multiply. In nature, complex systems are either governed or self-governed by structuring their components into subsystems, organized in hierarchies or more complex topologies. By doing so, they manage to redistribute the total complexity of the system among its parts and govern it. Similarly, a data platform is not monolithic or less monolithic based on its level of socio-technical centralization. A data platform is considered monolithic when it cannot reorganize its socio-technical components into modules that act as distinct, autonomous subsystems, each capable of masking part of the system's overall complexity.

It’s important not to confuse the concept of modules with the concepts of architectural components. Generally, monolithic platforms have (at least on slides) a clear architecture broken down into layers, with macro components for each layer. However, they are not modular. Modules are units of a larger system, structurally independent from each other but capable of working together. A modular system must be organized in an architecture that allows modules to maintain their structural independence on one hand but cooperate on the other. A module is a component of a system, but it’s not necessarily true the other way around. Similarly, a modular system must have a clear architecture to ensure cooperation between modules, but it’s not necessarily true that a system with a clear architecture is modular.

It’s also crucial to observe that for a module to be considered as such and to preserve its structural integrity over time, it must not only have a clear interface that distinguishes its internal components from the rest of the system but must also ensure that its internal components are strongly connected and relatively loosely coupled to the other modules of the system. Some modules will be more connected than others, but overall, a modular system ensures decoupling between its modules. Therefore, it’s not enough to draw circles around the components of a system to make the system modular. To become a module, components must be grouped based on their structural coherence.

The lack of modularity, not the lack of decentralization, as is often believed, is the main characteristic of a monolithic platform. Both monolithic and modular platforms can be more or less centralized at the technological and organizational levels without fundamentally changing their nature. Even a distributed platform, composed of components with a clear interface, can lack modularity if the level of coupling between them is so high that it erodes the autonomy of each module, making the platform a distributed monolith.

So, we can define a monolithic platform as a non-modular platform. I introduced the concept of modularity as a tool to govern the accidental complexity generated by the growth of the platform, suggesting accordingly that the lack of modularity is one of the main structural problems leading to the failure of monolithic platforms. Let’s learn how modularization can help us build platforms that do not reproduce the problems of the past.

The power of modularity

Modularization – in other words, the process of articulating a complex system into subsystems – is a practice that we constantly see applied in nature. The human body, for example, is an emblem of modularization, being organized into subsystems that work together at different levels of abstraction to ensure the correct functioning of the entire organism.

The benefits of modularization are evident not only in nature but also in many complex human activities. Since humans have limited cognitive capacities, to manage a complex system, they must divide it into smaller and sufficiently decoupled parts that can be analyzed separately. This modular structure allows humans to encapsulate the complexity of the system into functional parts defined by a clear interface. The interface, in addition to delimiting the perimeter of the module, provides an abstraction that allows part of the complexity to be hidden from the external observer (information hiding). It becomes possible to reduce cognitive load by deciding whether to reason about the internal functioning of a module, ignoring the rest of the system, or to reason about the interactions between the modules of the system, ignoring their internal functioning.
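A minimal sketch in Python (the class and field names are hypothetical, chosen only for illustration) of what information hiding looks like in practice: an external observer reasons only about the module’s interface, never about its internals.

# Hypothetical module: consumers depend only on the public interface
# (get_active_customers); the internal storage and filtering logic stay hidden
# and can change without affecting the rest of the system.
class CustomerModule:
    def __init__(self, raw_records):
        self._records = raw_records  # internal detail, hidden behind the interface

    def _is_active(self, record):  # internal detail, hidden behind the interface
        return record.get("status") == "active"

    def get_active_customers(self):  # the module's interface
        return [r["name"] for r in self._records if self._is_active(r)]

module = CustomerModule([
    {"name": "Acme", "status": "active"},
    {"name": "Globex", "status": "churned"},
])
print(module.get_active_customers())  # ['Acme']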

Modules can also be structured in a hierarchy with successive levels of abstraction. Modules at one level are composed of modules from the level below and define more abstract functions or concepts. For example, the human body is composed of various types of cells, cells can be grouped into organs, and organs can be grouped to form organ systems such as the respiratory system and the circulatory system. The structuring of modules in a modular system is another key element that can reduce cognitive load. It allows reasoning to be shifted across different levels of abstraction according to objectives, ignoring the details of the functioning of modules present in other levels.

The structure of modular systems can be regulated and evolved based on human cognitive capabilities. When a module becomes too complex, it can be divided into smaller modules. When modules become too numerous, they can be regrouped into macro modules belonging to a higher level of abstraction in the hierarchy.

Failure loops

The principles of encapsulation, information hiding, and abstraction, typical of a modular structure, are strategies that humans have always used, borrowing them from nature, to overcome the cognitive limitations that would otherwise prevent us from successfully managing overly complex activities or systems.

These principles are systematically employed in many engineering practices, including software engineering. So, why do data platforms traditionally have a monolithic structure and not a modular one? Let’s try to understand this together using systems thinking, which provides an excellent conceptual framework for analyzing the structure, interactions, and dynamics of a complex system.

In a system, components are interconnected. There are interactions between components, meaning that when one component acts on another, it changes its state and, consequently, its behavior. The component performing the action can, in turn, receive feedback following the executed action, modifying its state and, consequently, future behavior. Feedback can be direct when it comes without intermediaries from the component on which the action was performed, or indirect when it comes from another component following a more elaborate cycle of feedback. Feedback loops are the main tool used in systems thinking to holistically study the emergent behaviors of a system based on the interactions between its components.

There are two main types of feedback loops: reinforcing loops and balancing loops. Reinforcing loops amplify a specific behavior of the system while balancing loops reduce it. Therefore, feedback loops are control mechanisms that regulate the overall functioning of a system. For a system to survive, it must structure the interactions between its components in such a way that the resulting composition of these feedback loops does not make it too rigid and incapable of adapting but, at the same time, does not make it unstable and out of control. Systems are constantly seeking a balance between stability and the ability to change to survive. This equilibrium isn’t static; rather, it’s dynamic, with continual changes unfolding within the system, maintaining overall stability.

Unlike the ways we are accustomed to modeling reality, in complex systems, cause and effect and action and reaction are separated in time and space. They are separated in space because the action of one component can trigger feedback loops that create repercussions on components even very distant within the connection space. They are separated in time because an action on one component can lead to consequences on another component after some time; feedback can be delayed and asynchronous with respect to the action. These nonlinear dynamics, orthogonal to our classical way of thinking, often lead to errors when analyzing the causes that lead to the emergence of certain behaviors in a system. Systems thinking seeks to provide modeling methods that better reflect the actual functioning of a system, helping us avoid these types of analysis errors. Let’s look at a generic example of modeling a system using the systems thinking building blocks described so far, before applying them to the systemic analysis of data platforms:

Figure 1.1 – Causal loop diagram

In the causal loop diagram (CLD) shown in Figure 1.1, relationships among different components of a system (population, births, and deaths) are depicted. Between births and population, there is a reinforcing loop – the more births there are, the larger the population becomes, and the larger the population, the more births there will be over time. Similarly, between the population and deaths, there is a balancing loop – the larger the population, the more deaths there will be over time, and the more deaths, the smaller the population becomes. Any action, whether from internal or external components of the system, causing an increase or decrease in births or deaths, leads to variations in the population. The population can change over time, growing or decreasing, so long as there is always a dynamic balance within the system that prevents it from going out of control (that is, extinction or unsustainable overpopulation).
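To make the dynamics tangible, here is a toy simulation of the diagram (the rates are arbitrary assumptions): as long as the reinforcing loop on births and the balancing loop on deaths stay in proportion, the population changes gradually without going out of control.

# Toy simulation of the causal loop diagram: births reinforce population
# growth, deaths balance it. The rates are arbitrary, for illustration only.
population = 1_000
birth_rate, death_rate = 0.03, 0.02  # per-year fractions of the population

for year in range(1, 51):
    births = population * birth_rate    # reinforcing loop
    deaths = population * death_rate    # balancing loop
    population += births - deaths       # net effect of the two loops
    if year % 10 == 0:
        print(f"year {year}: population ~ {population:,.0f}")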

The population example is based on a feedback loop structure that’s common in many systems. This is known as the balancing process. Systems thinking has identified many of these recurring patterns, recognizable in various types of systems. To explain what makes modularization of data platforms challenging, we’ll turn to two of these archetypes: shifting the burden and limits to growth.

Shifting the burden is a pattern that occurs when a quick fix is used to solve a problem, temporarily alleviating the symptoms without addressing the underlying issue. The quick fix relieves the symptoms faster and is therefore preferred over the actual cure, which requires more time and effort. Eventually, the problem re-emerges, and the quick fix becomes the habitual way of dealing with it. In the data platform context, this archetype shows up when technology is used to mitigate every emerging problem without addressing the root causes. Time and again, we have believed that a new technology was the ultimate solution, only to reproduce, sooner or later, the very problems we hoped to solve. Over time, we have become overly dependent on technology, turning data management into data technology management.

A CLD of the shifting the burden archetype, when applied to data platforms, is shown in Figure 1.2:

Figure 1.2 – Shifting the burden

Limits to growth is a pattern that manifests when a system's growth runs up against a limit. The system grows until it reaches that limit, at which point the actions that drove the growth start yielding diminishing returns. The cost of maintaining past growth rates becomes unsustainable, jeopardizing the stability of the system.

Within a data platform, the resource limiting growth is the cognitive load that the data team can bear. Initially, this is not a problem: the platform is small, and new data sources are integrated and new analyses developed quickly. However, as the platform grows, so does its complexity. As mentioned earlier, in a data platform, complexity can increase rapidly with size because of the high number of connections between data. Since the value of the platform lies more in the connections between data than in the data itself, this is the essential complexity inherent in its development. As complexity increases, so does the cognitive load on the development team, which should naturally force the team to slow delivery down. Because organizational pressure is generally high, instead of slowing delivery down, the team starts reducing the quality of its work, increasingly resorting to workarounds that introduce technical debt. Unmanaged technical debt leads to unorganized, and often unnecessary, growth of the platform, increasing its complexity – the accidental complexity, in this case. Complexity, in turn, increases the cognitive load, triggering the reinforcing feedback loop R2, shown by the dashed line in Figure 1.3:

Figure 1.3 – Limits to growth
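As a complement to the diagram, here is a toy Python simulation of the same dynamic. Every quantity in it – the team's capacity, the quarterly growth of essential complexity, the share of overload converted into technical debt – is an invented coefficient used only to illustrate the shape of the loop, not a measurement:

# Toy simulation of the limits-to-growth dynamic sketched in Figure 1.3.
# All coefficients are illustrative assumptions.
def simulate_platform(quarters=16, team_capacity=60.0):
    complexity, tech_debt = 10.0, 0.0
    for quarter in range(1, quarters + 1):
        cognitive_load = complexity + tech_debt
        overload = max(0.0, cognitive_load - team_capacity)
        # Under organizational pressure, overload is absorbed as technical debt
        # instead of slowing delivery down...
        tech_debt += 0.5 * overload
        # ...and unmanaged debt feeds the platform's accidental complexity (the R2 loop).
        complexity += 5.0 + 0.2 * tech_debt
        delivery_speed = 1.0 if overload == 0 else max(0.0, 1.0 - overload / team_capacity)
        print(f"Q{quarter:02d}  load={cognitive_load:6.1f}  debt={tech_debt:6.1f}  speed={delivery_speed:4.2f}")

simulate_platform()

For the first quarters, the load grows linearly and the simulated delivery speed stays nominal; once the load exceeds the assumed capacity, debt starts feeding complexity, the load curve bends sharply upward, and the speed collapses – the saturation described next.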

Externally, the delivery speed remains unchanged, but internally, the cognitive load on the development team grows rapidly until it saturates the team's capacity. At this point, shortcuts are no longer sufficient to compensate, and maintaining the past delivery speed becomes unsustainable. The platform becomes a legacy system that is difficult to evolve and costly to maintain. It's time to consider replatforming on a new technology stack. The solution to these challenges is modularization.

Modularizing the platform would reduce the cognitive load on the development team (technological architecture). Modularization also allows the cognitive load to be distributed across multiple independent teams (organizational architecture) if necessary. However, modularization involves upfront costs, and its returns in terms of sustainability are not immediately evident. It is also not a one-time activity but a process that must be carried out in tandem with the platform's evolution, slowing down its growth rate. Optimizing a system as a whole often involves balancing activities – in our case, modularization – that slow down growth to preserve sustainability.

“The optimal rate in a system is far less than the fastest possible growth. Optimal growth is the one that allows for the enhancement and resilience of the system, fostering a sustainable and harmonious balance over time.”

– Donella Meadows

Without an understanding of the dynamics at play at the systemic level, obtaining the sponsorship and buy-in necessary to initiate such an activity at the organizational level becomes challenging. Why embark on a complex socio-technological transformation when we can solve the problems we’ve had so far using new, more innovative technologies (that is, shifting the burden)? Why deliberately slow down the delivery speed for potential medium-term sustainability benefits? We are aligned with current goals, and end-of-year incentives are tied to current performance, not to a promise of future value (that is, limits to growth).

The story of continuous stop-and-go over the last 30 years – a story dotted with new platforms built on the ashes of old ones that could no longer evolve – tells us that, as difficult as it may be, it's time to change course and try new approaches to data management. We need approaches that focus on socio-technical architectures and address the foundational problems condemning us to failure, rather than seeking easy band-aid fixes in the latest technology trends – approaches that allow us to break free from the failure cycles described in this section, in which we've been trapped like a hamster on its wheel for too many years. In this book, we will explore a possible way out, centered around the idea of managing data as a product. But what exactly does that mean, and how does it relate to what has been discussed so far? Let's explore this together.

Exploring why we need to manage data as a product

To escape the quagmire we find ourselves in, it is necessary to radically change the mental model we use to approach data management and, consequently, the organizational structures and associated operational practices. It’s a systemic change – a paradigm shift in data management practice.

As we’ve seen in the previous sections, attempts to address these problems have been predominantly cosmetic, not radical. We’ve tried to modify the system tactically, reacting to surface-level problems as they arise.

In systems thinking, a system can be changed from the outside by acting on the parts of it where small interventions can lead to significant and lasting change over time; these parts are called leverage points. Donella Meadows, a renowned researcher in this field, classified possible leverage points into 12 categories, ranked by effectiveness. It's not necessary to delve into the details of each one. Suffice it to know that the categories identified by Meadows can be summarized into four main macro-categories, generally represented by a model known as the iceberg model (see Figure 1.4):

Figure 1.4 – The iceberg model

The visible events – the symptoms of deeper problems – on which we have predominantly focused until now are the least effective leverage points. It is difficult to change the system durably by intervening on them. To make a qualitative leap in the system – that is, to make it capable of handling greater complexity – it is necessary to attack the levels of the iceberg that lie below the waterline, outside the realm of the immediately visible. The deeper you go, the more effective the leverage points become. A substantial change in the system's functioning coincides with a paradigm shift. It requires joint and coherent interventions on multiple leverage points, starting from the base of the iceberg (that is, mental models) and then moving up to the higher levels (that is, organization and procedures). Let's see how.

Being data-centric

The third industrial revolution has led to the complete digitization of the core processes that govern our organizations. Applications, in addition to making processes more efficient, generate enormous volumes of data. This data has potentially immense value for organizations. It is a common belief that it represents a real asset, a strategic element with which to differentiate and compete in the market.

In 2016, Klaus Schwab, founder and executive chairman of the World Economic Forum (WEF), introduced the concept of the fourth industrial revolution to describe the period we are in now, a time of unprecedented changes and innovations reshaping the economic landscape. Data plays a fundamental role in this context. With many existing processes already digitized, the focus shifts to creating new ones, innovating, and evolving business models in directions unimaginable until a few years ago. And it is precisely here that data comes into play. Data is central to enabling these processes of innovative creation.

With the fourth industrial revolution, we are experiencing a transition from a model centered around applications, which were crucial for digitizing processes, to a model centered around data. The data produced by applications increasingly holds more value than the applications themselves. The lifespan of an application within an organization has shrunk significantly: applications today come and go under the pressure of constant technological innovation. Data, on the other hand, remains, and it is the true core asset of an organization.

While conceptually, the focus has shifted from applications to data, in practice, this transition has not fully taken place. Our mental models have not made this leap, and consequently, neither have the ways in which we manage data.

Applications still effectively drive the evolution of our IT architectures. Data outside of applications (data on the outside), despite being the real asset, is still treated as a second-class citizen. First we choose the applications; only afterward do we work out what data they produce and how to integrate it with the information assets already available – not the other way around.

In summary, there is a disconnect between what we say when we talk about data and what we do in practice. It is essential to start from here, from the awareness of this gap, and adapt our mental model accordingly by placing data and its management at the center of the organization.

“Data is the center of the universe; applications are ephemeral.”

– The Data-Centric Manifesto

Data is everybody’s business

Moving up from the mental model to organizational structures, we can observe how data management in companies is heavily centralized in one or more dedicated teams. If we consider data management as a socio-technical system, data teams are part of the system, while all other teams are not. The other teams, including those in IT that manage infrastructure and applications, are part of the external context in which the system operates and with which it interacts. This organizational model for distributing data management responsibilities conflicts with the mental model that places data at the center of the organization. It rests primarily on two mistaken assumptions. Let's see what they are.

The first assumption stems from a misinterpretation of data as an asset. If data is an asset, then it has value; if it has value, collecting more data adds more value, additively. However, data is a unique type of asset: its value is not immediately realizable. Data must be managed appropriately to actually produce value. Data management is not comparable to refining crude oil; it is not a one-off transformation that turns raw data into valuable information. The value of data, even when properly transformed and enriched, tends to degrade over time, so data management is an ongoing effort. Given these premises, it becomes clear that simply accumulating data from sources is not enough to generate value. In reality, as we have seen, the opposite is true: as the accumulated data grows, the system's complexity increases until management costs surpass the value produced – the point at which the value of the collected data plummets (that is, limits to growth).