Decentralizing data and centralizing governance are practical, scalable, and modern approaches to data analytics. However, implementing a data mesh can feel like changing the engine of a moving car. Most organizations struggle to get started, getting caught up in the concept of data domains and spending months trying to organize them. This is where Engineering Data Mesh in Azure Cloud can help.
The book starts by assessing your existing framework before helping you architect a practical design. As you progress, you’ll focus on the Microsoft Cloud Adoption Framework for Azure and the cloud-scale analytics framework, which will help you quickly set up a landing zone for your data mesh in the cloud.
The book also resolves common challenges related to the adoption and implementation of a data mesh faced by real customers. It touches on the concepts of data contracts and helps you build practical data contracts that work for your organization. The last part of the book covers some common architecture patterns used for modern analytics frameworks such as artificial intelligence (AI).
By the end of this book, you’ll be able to transform existing analytics frameworks into a streamlined data mesh using Microsoft Azure, thereby navigating challenges and implementing advanced architecture patterns for modern analytics workloads.
Page count: 417
Year of publication: 2024
Engineering Data Mesh in Azure Cloud
Implement data mesh using Microsoft Azure's Cloud Adoption Framework
Aniruddha Deswandikar
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Niranjan Naikwadi
Publishing Product Manager: Yasir Ali Khan
Book Project Manager: Kirti Pisat
Senior Editor: Tazeen Shaikh
Technical Editor: Seemanjay Ameriya
Copy Editor: Safis Editing
Proofreader: Safis Editing
Indexer: Subalakshmi Govindhan
Production Designer: Joshua Misquitta
DevRel Marketing Coordinator: Vinishka Kalra
First published: March 2024
Production reference: 1150324
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80512-078-0
www.packtpub.com
To my dear father, Ashok, and my cherished late mother, Asha, who have been guiding lights on my life's journey. To my beloved wife, Reshma, whose unwavering support and encouragement have been my constant source of strength.
Aniruddha Deswandikar has more than three decades of industry experience working with start-ups, enterprises, and software companies. He has been an architect and a technology leader at Microsoft for almost two decades, helping Microsoft customers build scalable applications and analytical solutions. He has spent the past three years helping customers adopt and implement the data mesh architecture. He is one of the subject matter experts on data mesh and cloud-scale analytics at Microsoft Europe, helping both customers and internal teams to understand and deploy data mesh on Azure.
Vinod Kumar is a customer success leader for Microsoft Global Accounts with over 25 years of industry experience delivering end-to-end cloud solutions to customers. Based in Singapore, he leads the Asia team to help customers build resilient cloud architectures, embrace digital innovation and transformation, secure their use of the cloud, and make informed decisions using AI and data solutions. He is a mechanical engineer from the College of Engineering, Guindy. He is a passionate technology leader who inspires people, embraces tech, and champions inclusion. He is an author of multiple books on SQL Server and an avid community speaker.
I'd like to thank my daughter, Saranya, and my whole family for giving me the space I needed to contribute to this book.
In 2019, Zhamak Dehghani published her whitepaper on data mesh during her time at Thoughtworks. While it caught the attention of many large corporations, adopting data mesh was not easy. Most large companies have a strong legacy of analytical systems, and migrating them to a mesh architecture can be a daunting task. At the same time, the theoretical concepts of data mesh can be confusing when you map them to an actual analytical system.
In 2021, I started working with a large Microsoft customer that was struggling with their centralized data analytics platform. The platform was based on a central data lake and a single technology stack. It was rigid and hard for all the stakeholders to adopt. As a result, many projects were creating their own siloed infrastructure, producing islands of data, technology, and expertise. We observed the dilemma the central analytics team was facing and proposed the data mesh architecture. It seemed that data mesh would solve most of their challenges around agility and adoption, while also opening the door to new challenges, such as federated governance.
In the next year, we helped onboard this customer to data mesh. It was a long journey of multiple workshops followed by a consulting engagement where we built data mesh artifacts for them. Since then, I have been engaged with multiple customers on data mesh projects. As a member of a team of subject-matter experts on data mesh at Microsoft Europe, I have also guided other Microsoft team members on how to engage, design, and manage a data mesh project.
Along the way, I have realized that translating the theory of data mesh into a practical, production-ready system can be a challenge. A lot of terms get thrown around that actually can represent large projects in themselves.
This book consolidates information on all the challenges (and their solutions) involved in implementing data mesh on Microsoft Azure, from understanding data mesh terminology and mapping it to Microsoft Azure artifacts, to the topics that other data mesh resources only mention in passing and leave for you to research yourself. Some of these topics, such as master data management, data quality, and monitoring, can be large, complex systems in themselves.
The driving motivation behind writing this book is to help you understand the concepts of data mesh and to dive into their practical implementation. With this book, you will focus more on the benefits of a decentralized architecture and apply them to your own analytical landscape, rather than getting caught up in all the data mesh terminology.
This book is for individuals who manage centralized analytical systems built on Microsoft Azure for medium-sized or large corporations and are looking to offer more agility and flexibility to their stakeholders.
This book is also ideal for small companies that currently do not have a well-designed analytical system and want to explore the idea of building a distributed analytical system to handle future growth and agility requirements.
Chapter 1, Introducing Data Meshes, briefly covers the concepts from Zhamak Dehghani's original whitepaper and book on data mesh.
Chapter 2, Building a Data Mesh Strategy, guides you in evaluating your company’s current analytics maturity level, aligning the data strategy with the business strategy, and understanding the role a data mesh architecture could play in that.
Chapter 3, Deploying Data Mesh Using the Azure Cloud-Scale Analytics Framework, covers Microsoft’s own cloud-scale analytics framework for implementing data mesh.
Chapter 4, Building a Data Mesh Governance Framework Using Microsoft Azure Services, explains how the key to a successful data mesh implementation is managing federated governance. This chapter covers all the aspects of data mesh governance and aligns them with the Microsoft Azure services that can be used to implement them.
Chapter 5, Security Architecture for Data Meshes, covers the security challenges that come with distributed data. Chapter 4 discusses network security; in this chapter, we will discuss various aspects of data security, such as access control and retention.
Chapter 6, Automating Deployment through Azure Resource Manager and Azure DevOps, looks at how distributed data and analytics bring distributed environments and products. The key to efficiently managing your environment is automation. This chapter walks you through all the aspects of automating the deployment and management of a data mesh.
Chapter 7, Building a Self-Service Portal for Common Data Mesh Operations, explores how data mesh promotes agility and innovation by democratizing data and analytical technologies. One of the ways to empower data mesh users is to give them tools to discover data and deployment environments. A common practice is to build a self-service data mesh portal. This chapter provides guidance on how to design and build a self-service portal.
Chapter 8, How to Design, Build, and Manage Data Contracts, looks at how data mesh federates data ownership. Each team is responsible for the quality and reliability of their own data. In such a scenario, how do you build trust? This chapter discusses the formal method and process of maintaining data contracts and SLAs that help build trust and increase the reliability of data mesh.
Chapter 9, Data Quality Management, explores how, as a data mesh grows, data products become dependent on each other for their outcomes. Some of these products deliver key analytics that are critical to business operations, and poor data quality in one data product can impact multiple others. This chapter showcases how to build or buy an enterprise-class data quality management system.
Chapter 10, Master Data Management, looks at Master Data Management (MDM), which provides a unified, consistent view of critical data entities across the organization; this is essential for data mesh’s principle of domain-oriented decentralized data ownership and architecture. In this chapter, we will look at buy-and-build options for MDM for data mesh.
Chapter 11, Monitoring and Data Observability, covers monitoring and data observability, which are crucial for a data mesh as they enable real-time insights into the health, performance, and reliability of data across decentralized domains. This is also one of the most challenging capabilities to implement, as it involves monitoring both data products and the data itself. In this chapter, we will design a Data Mesh Operations Center (DMOC) to consolidate all the monitoring aspects into a single pane of glass.
Chapter 12, Monitoring Data Mesh Costs and Building a Cross-Charging Model, covers how analytical systems are typically cost centers. They are investments, and there are many ways to manage and distribute costs. This chapter looks at various cost models, systems of monitoring costs, and ways of distributing the costs of shared and individual components.
Chapter 13, Understanding Data-Sharing Topologies in a Data Mesh, looks at how one of the features of data mesh is to minimize the movement of data across the enterprise. It introduces the concept of in-place sharing. However, in-place sharing has its limitations and challenges. This chapter discusses various data-sharing topologies and describes the different scenarios for using each topology.
Chapter 14, Advanced Analytics Using Azure Machine Learning, Databricks, and the Lakehouse Architecture, is a reference chapter that describes one of the most commonly used architectures for advanced analytics: the lakehouse architecture. The lakehouse architecture combines the scalable storage capabilities of a data lake with the data management and ACID transaction features of a data warehouse, enabling both analytical and transactional workloads on the same platform.
Chapter 15, Big Data Analytics Using Azure Synapse Analytics, covers how big data processing is a common scenario in most companies today. This reference chapter discusses a possible architecture with Azure Synapse Analytics.
Chapter 16, Event-Driven Analytics Using Azure Event Hubs, Azure Stream Analytics, and Azure Machine Learning, looks at how certain areas, such as social media data analysis, logistics, and supply chain, require the real-time or near-real-time analysis of data. This kind of data processing needs different kinds of services and storage. This chapter discusses these event processing components and how to lay them out in a real-time analytics architecture.
Chapter 17, AI Using Azure Cognitive Services and Azure OpenAI, looks at how AI and machine learning have very different needs when it comes to data processing. They need quick cycles of training and re-training as data and models drift with time. Large language models bring in concepts such as prompt engineering and chaining. This chapter describes modern architectures for how to build Azure Cognitive Services- and Azure OpenAI-based models for natural-language-based interactions with your corporate data.
While having read the original data mesh materials by Zhamak Dehghani would definitely be an advantage, it’s not a must. This book provides documentation references for all the Microsoft Azure services mentioned, but some working knowledge of Microsoft Azure will help you save time reading the docs.
Software/hardware covered in the book | Operating system requirements
Microsoft Azure (Azure SQL Database, Azure Synapse, Azure Data Lake Gen2, Microsoft Purview, Microsoft Active Directory, Azure Resource Manager, Azure Log Analytics Workspace, Azure Data Explorer) | NA
Great Expectations | Windows or Linux with Python 3.8 to 3.11
Profisee | Deployed using Azure Marketplace
PowerShell, Azure Command Line Interface | Windows
Python 3.8 to 3.11 | Windows 11 or Ubuntu 22.04
SQL Server Management Studio | Windows 10 or Windows 11
For installation and setup of the preceding tools and platforms, please see the following references:
Installing PowerShell: https://learn.microsoft.com/en-us/powershell/scripting/install/installing-powershell?view=powershell-7.4
Installing Azure Command Line Interface: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli
Installing SQL Server Management Studio: https://learn.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver16
Setting up SQL Server Management Studio to query Azure SQL Database: https://learn.microsoft.com/en-us/azure/azure-sql/database/connect-query-ssms?view=azuresql
Installing Python: https://www.python.org/downloads/release/python-3110/
Profisee SaaS Enterprise Data Management: https://profisee.com/

Note that the format of Chapters 14, 15, 16, and 17 is different from that of the previous chapters. That is because these chapters are architectural references. The aim of these chapters is to provide guidance on how to set up analytics for a given workload. You might also observe portions of text being repeated across those chapters. This is also by design. At a later point, you might want to refer to a specific reference chapter directly. In order to make sure you have everything you need in those four chapters, we repeat some of the text in them. Each reference chapter is designed to be a quick read and lets you explore all the components of the architecture using the reference links provided.
If you are using the digital version of this book, we advise you to type the code yourself. Doing so will help you avoid any potential errors related to the copying and pasting of code.
The GitHub repository functions as a valuable resource for future reference, enabling you to report any issues. Furthermore, the author can upload extra updates and examples to the repository, providing ongoing support. Access the repository on GitHub at https://github.com/PacktPublishing/Engineering-Data-Mesh-in-Azure-Cloud.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “It sets a mask on Column01 as a default number mask, from the sixth digit to the fourteenth digit.”
A block of code is set as follows:
# Grant access to individual user at a Subscription Level
function GrantAccessAtSubscription ($userID, $roleDef, $subScope) {
    New-AzRoleAssignment -SignInName $userID `
        -RoleDefinitionName $roleDef `
        -Scope $subScope
}

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Each Azure service comes with its own built-in roles. Azure Data Lake comes with three built-in roles: Reader, Contributor, and Owner.”
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Engineering Data Mesh in Azure Cloud, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there. You can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
1. Scan the QR code or visit the link below: https://packt.link/free-ebook/9781805120780
2. Submit your proof of purchase
3. That’s it! We’ll send your free PDF and other benefits to your email directly
Part 1 starts with the theory of the data mesh architecture as described by Zhamak Dehghani in her original whitepaper (https://www.thoughtworks.com/insights/whitepapers/the-data-mesh-shift) and maps it to Microsoft Azure’s Well-Architected Framework, Cloud Adoption Framework, and cloud-scale analytics framework. Crossing the chasm from theory to implementation is difficult for companies. This section will make it easier to understand the theory and apply it to your Microsoft Azure-based analytical systems. Whether you already have an existing central analytical system that you wish to migrate to a data mesh architecture or you are building an analytical system from the ground up, this part of the book will help you pave the way forward to adopting the data mesh architecture on Microsoft Azure.
This part has the following chapters:
Chapter 1, Introducing Data Meshes
Chapter 2, Building a Data Mesh Strategy
Chapter 3, Deploying a Data Mesh Using the Azure Cloud-Scale Analytics Framework
Chapter 4, Building a Data Mesh Governance Framework Using Microsoft Azure Services
Chapter 5, Security Architecture for Data Meshes
Chapter 6, Automating Deployment through Azure Resource Manager and Azure DevOps
Chapter 7, Building a Self-Service Portal for Common Data Mesh Operations

Before we start designing and implementing a data mesh architecture, it is important to understand why you should consider a data mesh in the first place. This chapter briefly walks through the history of business intelligence (BI) and analytics. We will go through the events and transitions that shaped how analytics has evolved over the last few decades and the current challenges that make a data mesh architecture an alternative to traditional centralized analytical systems.
In this chapter, we’re going to cover the following main topics:
Exploring the evolution of modern data analytics
Discovering the challenges of modern-day enterprises
Data as a product (DaaP)
Data domains
The data mesh solution

After the advent of databases in the late 1970s and early 1980s, they were treated as a central source of truth (SOT) and designed to record transactions and produce daily, weekly, and monthly financial reports. These are largely termed online transaction processing (OLTP) systems.
In the late 1980s, businesses felt the need to understand how their business was performing and to investigate any changes to sales, production, revenue, or other important aspects of the business so that they could operate more efficiently. But in order to conduct this investigation, they had to run complex queries across all the tables in their database and be able to slice and dice the data to dig deeper into it. They also had to aggregate values to find totals and averages across a period of time. Because the relational model spread data across multiple tables, these queries had to aggregate and join data across those tables. As a result of these complex joins and aggregations, the queries started getting expensive and more demanding in terms of execution time and resources. Database engineers soon realized that they needed a new method of storing the same data so that it would be easier to query and aggregate. This was the birth of online analytical processing (OLAP) systems, and the database transformed into a data warehouse.
Data warehouses dominated the analytics world for over three decades and are still the analytics tool of choice for small and medium businesses.
At the turn of the millennium, as the dot-com revolution started flourishing and businesses went online, computer engineers realized that data is not always tabular and structured. The requirements of an online business changed dynamically. Formats for user profiles, product data, and user interaction data changed constantly and could not be stored against a fixed schema definition. The volumes of data that an online business needed to handle were also exponentially higher than for a traditional business. Computer hardware had also advanced, providing more compute, storage, and memory in smaller and cheaper machines that could be racked on top of each other, forming what we now call data centers.
The main challenge with an OLAP approach was that data had to move from storage to memory for joining tables, aggregations, and calculations. As businesses got more complex and with the advent of the internet and the dot-com boom, the amount of data collected started getting larger and structurally more fluid in nature. Engineers during this time challenged the traditional concept of moving structured data to memory and processing it in a central unit. They started experimenting with storing data as text files with a fluid structure and querying these large volumes of files using a distributed storage and compute architecture. They brought the compute to where the data was residing. Thus started the era of Hadoop, MapReduce, and big data processing using Apache Spark and NoSQL. These technologies split the data into smaller pieces, processed them on parallel compute nodes, and then combined the results.
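To make this split, process, and combine pattern concrete, the following is a minimal, illustrative PySpark sketch (the input path is hypothetical, and the example is an assumption for illustration rather than the book's own code). It counts word occurrences across many text files: each partition is processed in parallel on worker nodes, and the partial counts are then combined:

# Illustrative PySpark sketch: split the data, process partitions in parallel,
# then combine the partial results (the classic map/reduce pattern).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-process-combine").getOrCreate()

# Hypothetical input path; each partition is handled by a separate executor.
lines = spark.sparkContext.textFile("data/logs/*.txt")

word_counts = (
    lines.flatMap(lambda line: line.split())   # split lines into words
         .map(lambda word: (word, 1))          # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # combine partial counts across nodes
)

print(word_counts.take(10))
spark.stop()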
This completely changed the way data was stored and processed. Storage systems used for storing these semi-structured files were called data lakes. They were built on a distributed file storage mechanism called Hadoop Distributed File System (HDFS).
Amazon launched the Elastic Compute Cloud (EC2) cloud platform in 2006, followed by Microsoft Azure in 2010. This further revolutionized the computing landscape.
Up until public cloud platforms such as Amazon Web Services (AWS) and Microsoft Azure were launched, data centers and the hardware that went into the data centers were all purchased by companies. The data center was a capital expenditure (CapEx). However, after the launch of the public cloud, servers were now available as pay-as-you-go subscriptions. Renting servers in the cloud became an operational expense (OpEx). Hardware expenditure went from CapEx to OpEx, providing practically unlimited compute and storage whenever you needed it, without having to raise a purchase order and go through procurement cycles. This also provided a boost to artificial intelligence (AI) and machine learning (ML). AI and ML further added more semi-structured and unstructured data such as images, sound, and video files. The data lake, with all this structured, semi-structured, and unstructured data, now moved to the cloud and was used to store massive amounts of data. For large enterprises, this data went into terabytes and petabytes.
The data warehouse was still a part of every enterprise. While storage and processing evolved, most of the final processed models were still stored in a data warehouse. This created a bottleneck as the performance of the data warehouse drove the speed of analytics for an enterprise. This architecture also created a lot of moving parts in terms of separate components for big data processing, the data warehouse, and real-time data processing.
For some time (around the 2010s), enterprises used a combination of a data lake for semi-structured and unstructured data and a data warehouse for structured data. It was a complex cloud data platform with multiple moving parts to be maintained.
Today, enterprises are moving to the data lakehouse architecture. A lakehouse combines the flexibility and scalability of a data lake and the modeling and analytics efficiency of a data warehouse. It maintains semi-structured data along with a log of all Create, Read, Update, and Delete (CRUD) operations and uses the combination of the data and log files to perform complex data warehouse queries without having to store the data in a relational or star-schema format. This technology provides many benefits. It’s low-cost, as storing flat files is cheaper. It allows for flexibility as the table schema can change over time without having to modify the entire database design. Traditional BI and advanced ML-based analytics can be performed on the same physical data. We don’t need to have separate stores for data warehouse and ML jobs.
Data lakehouses are still evolving, and enterprises are in the process of adopting this new analytics architecture.
Figure 1.1 shows a timeline of how BI and modern-day analytics evolved:
Figure 1.1 – A centralized analytics system
Next, we’ll look at some challenges that enterprises face today with the traditional analytical systems described in this section.
As enterprises get bigger, the ability to be agile and competitive becomes challenging. Some enterprises have large departments for sales, marketing, engineering, and so on. Many large corporations split their global businesses into regions or zones. Each of these sub-organizations operates as an independent unit. They have their own business complexity, analytics needs, and speed at which analytical output is required. They choose their tools according to their requirements. But at the end of the day, they are asked to move their data to a central lakehouse or data warehouse for enterprise-wide analytics, which uses a specific set of tools and mandates a certain data format. This strategy has multiple challenges:
The sub-organizations are forced to use the tools in the central analytical system to perform their analytics.
The sub-organizations then start building their own local analytical platforms to speed up their analytics using tools that best suit them. These analytical islands start producing their own output and deliver it directly to the business. No one else has access to their data, nor is the data managed under any common standards. This creates analytical silos all across the organization, hindering collaboration and innovation.
The enterprise misses out on all local innovation and is rigidly tied to what the central analytical system provides, making it less agile and competitive and hindering innovation.
A lot of organizational effort is spent on extract, transform, and load (ETL) pipelines and Hadoop/MapReduce modules to move data to the central data lake or data warehouse and analytical stores.

Figure 1.2 depicts what a centralized analytical system looks like in an enterprise:
Figure 1.2 – A centralized analytical system in an enterprise
These challenges mandate a new strategy and architecture on how an enterprise should look at data analytics.
In 2019, Zhamak Dehghani, while working as a principal consultant at Thoughtworks, coined the term data mesh (https://www.thoughtworks.com/what-we-do/data-and-ai/data-mesh): a decentralized data analytics architecture that allows large, distributed organizations to scale their analytics using domain-oriented decentralization.
Now that you understand the current architecture of a typical centralized analytical system, in the next section, we will talk about the various challenges faced by a growing large organization and how a centralized system can hinder the speed of innovation that the enterprise needs to keep itself ahead of the curve. We will also dive deeper into different elements of a data mesh architecture and how it solves challenges faced by enterprises.
Historically, data has always been treated as a backend. It was used by the middle tier and then surfaced to the frontend. Applications did not do a lot with the data other than aggregating and presenting it with better visuals. Relational database systems also ensured that data adhered to a schema and format and that all mandatory fields were populated. As a result, applications received quality data and had to do minimal checks on quality. But with semi-structured data, this equation changes. Semi-structured data does not comply with fixed schemas and rules of how data is formatted and populated. Advanced analytics, ML, and big data analytics need a lot of processing on the data before it’s consumed by any algorithm and application. ML algorithms provide exponentially accurate output as the volume of quality data increases.
In a paper published in 2001 (https://homl.info/6), Microsoft researchers Michele Banko and Erik Brill showed that different ML algorithms performed comparably well as long as ample, quality, labeled data was provided. This paper suggests that the quality and quantity of data play a bigger role in determining the accuracy and performance of ML models than the choice of model itself. As a result, data scientists are continuously seeking quality data. This also makes the role of data engineers critical, as they curate and engineer data for the data scientists.
Another aspect of modern analytics with ML is that it needs continuous training. ML algorithms are trained on historical data, and as the business collects new data, new patterns may be introduced, which makes the trained algorithms inaccurate over time. This is typically referred to as data drift. There are also situations where business requirements change, demanding that the models be retuned and retrained. This is called concept drift. To make retraining efficient, data analytics teams build automated operations around this retraining process. This process is called ML operations, or MLOps (https://learn.microsoft.com/en-us/azure/machine-learning/concept-model-management-and-deployment?view=azureml-api-2).
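As a hedged illustration of what a drift check inside an MLOps pipeline might look like (the feature, threshold, and data below are synthetic and hypothetical, not a prescription from the book), a simple two-sample statistical test can compare the training data with newly collected data and flag when retraining may be due:

# Illustrative data drift check: compare the distribution of a feature in the
# training set against newly collected data using a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
training_prices = rng.normal(loc=100, scale=15, size=5_000)  # data the model was trained on
incoming_prices = rng.normal(loc=112, scale=15, size=5_000)  # newly collected data (shifted)

result = ks_2samp(training_prices, incoming_prices)

# A very small p-value suggests the two distributions differ, i.e. possible data drift.
if result.pvalue < 0.01:  # hypothetical threshold
    print(f"Drift detected (p={result.pvalue:.3g}); trigger the retraining pipeline.")
else:
    print(f"No significant drift (p={result.pvalue:.3g}).")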
For MLOps to work consistently, the quality and availability of data used for training becomes critical. This requirement leads to versioning and service-level agreements (SLAs) on data being fed into the training process. At this point, data starts to acquire the characteristics of an application or a product. Data itself needs a DevOps-like process to ensure consistent quality and availability to ensure that all downstream processes work reliably.
And hence, the concept of DaaP was born:
Figure 1.3 – DaaP
In the next section, we will learn about how a network of these data products creates a data mesh and all the advantages and challenges that come with it.
One of the concepts defined in https://www.thoughtworks.com/what-we-do/data-and-ai/data-mesh is that of data domains. Data domains are defined as a logical grouping of data and teams aligned with certain business domains, such as sales, marketing, or production. While each of these domains may have multiple data products, all the teams and the data used to build these products fall under the same domain. This domain team is responsible for managing and maintaining the data in their domain. This is described as domain ownership.
However, in reality, we have found that adopting the domain concept can be challenging for many companies, because every company has its own structure. For example, large global companies that run their business in different geographical zones have sales, marketing, production, and local finance departments in every location. Each of these departments works independently based on its local market and country requirements. Grouping all sales teams across the world into a single domain is not practical. Hence, these large companies might choose to make their geographical zones their domains. And while sales teams from North America might want to get data from European sales teams to analyze similar trends, they need not belong to the same domain. This can be further complicated for companies with multiple lines of products that need to be separated but could have common domains (finance, sales) crossing the product lines.
To simplify this, a domain could be referred to as just a logical grouping of data products that need to be managed together because they have very similar needs or access common data and resources:
Figure 1.4 – Data domains and data products
The focus of building a data mesh architecture should be to decentralize data, centralize governance, and monitor and improve the collaboration and agility of enterprise analytics.
As we learned in the Discovering the challenges of modern-day enterprises section, having a central data lake or data warehouse has several disadvantages, especially for large organizations. In the previous section, we learned about changes in data processing requirements driven by ML and advanced analytics and how data now needs to be treated like a product with its own complete life cycle.
To explain a data mesh in one sentence, a data mesh is a centrally managed network of decentralized data products. The data mesh breaks the central data lake into decentralized islands of data that are owned by the teams that generate the data. The data mesh architecture proposes that data be treated like a product, with each team producing its own data/output using its own choice of tools arranged in an architecture that works for them. This team completely owns the data/output they produce and exposes it for others to consume in a way they deem fit for their data. Here are some examples.
A team from marketing gathers social media data for their products and curates it into a clean dataset that can be used by other marketing and sales teams for analytics. Instead of moving this data to a central data lake, they make the dataset available as raw JSON files in a data lake managed by this team. Other teams can reference this data, import it into their ML notebooks, or copy it into their local storage and transform it in some useful way.
Another team generates a real-time Sales Volume by Month by Product key performance indicator (KPI). This value is made available through an API that can be called with parameters of Date and Product Identifier.
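As a sketch of how such a KPI endpoint might be exposed (a hypothetical service: the framework choice of FastAPI, the route, the parameter names, and the lookup logic are all illustrative assumptions, not the book's prescribed design):

# Hypothetical sketch of a KPI API exposed by a data product team.
from fastapi import FastAPI, HTTPException

app = FastAPI(title="Sales KPI data product")

# Stand-in for the team's curated KPI store (for example, a lakehouse table or cache).
SALES_VOLUME = {("2024-03", "SKU-001"): 1250, ("2024-03", "SKU-002"): 860}

@app.get("/kpis/sales-volume")
def sales_volume_by_month_by_product(month: str, product_id: str) -> dict:
    key = (month, product_id)
    if key not in SALES_VOLUME:
        raise HTTPException(status_code=404, detail="No KPI value for that month/product")
    return {"month": month, "product_id": product_id, "sales_volume": SALES_VOLUME[key]}

A consumer would then call, for example, GET /kpis/sales-volume?month=2024-03&product_id=SKU-001 and receive the current KPI value without ever copying the producing team's underlying sales data.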
In each of the preceding examples, the team that generates the data is responsible for the quality, consistency, and availability of the data. For other teams to reliably use their data, they need to provide some guarantees of quality and availability.
Also, the fact that the data is available for others to use needs to be announced in some way. This means that there needs to be a way for people to search for and discover this data.
These data quality and availability guarantees and the ability to discover data need to be managed centrally by some common team: a team of data governors.
In summary, a data mesh decentralizes data and data responsibilities and centralizes management and governance. Each data product team needs to maintain, manage, and provide access to their data, just like developers manage their code, build products, and provide access to these products. A data mesh brings aspects of application life cycle management, application versioning, and DevOps to the world of data.
At a high level, the data mesh architecture proposes the following:
Decentralizing data to the department/team where it originates (data products)
Decentralizing the decision of selecting tools used by each department/team to build their analytical output
Decentralizing the responsibility of data quality and life cycle management to individual departments/teams (data products and data domains)
Centralizing data access management to allow different teams to access each other’s data in a secure and standardized manner
Centralizing data governance tools such as data catalogs and common pipelines to get data from legacy and external systems
Centralizing infrastructure deployment as Infrastructure-as-Code (IaC) to bring agility, standardization, and centralized management to the infrastructure that is deployed to the individual pools
Providing a self-service platform for data producers and consumers to develop, manage, and share data products

Figure 1.5 shows a high-level data mesh concept sketch as depicted in Zhamak Dehghani’s original text on a data mesh:
Figure 1.5 – The data mesh architecture
This architecture has the following advantages:
Democratizes and streamlines access to enterprise-wide data, thus increasing the speed of innovation around data
Brings agility to creating new analytical products or changing existing products
Promotes sharing and discoverability of data
Brings a culture of responsibility toward data quality and life cycle management
Standardizes infrastructure deployment through centrally managed IaC templates
Promotes the reuse of common pipelines and processing modules
Centralizes data governance tools such as data catalogs, master data management (MDM) tools, and data quality management (DQM) tools

Decentralizing data and centralizing access management does bring about some challenges, especially around data movement. How do different teams access each other’s data for their processing? Do they copy the data locally to their storage? Do they access the data directly from the source and copy it into memory structures such as DataFrames in Python?
One of the characteristics of a data mesh is to minimize data movement by keeping data at the source and accessing it directly. This is called in-place sharing. However, in-place sharing is not always viable as distances between producer and consumer could be far, and network latencies could make it impractical. We need to strike a balance between in-place sharing and data movement based on what is best suited for the specific scenario.
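As an illustration of in-place sharing (the storage account, container, and file names are hypothetical, and the sketch assumes the pandas and adlfs packages are installed and that appropriate Azure credentials are available), a consuming team can load a producer's dataset directly from the source data lake into an in-memory DataFrame instead of copying it into their own storage:

# Illustrative in-place read of a producer's dataset from Azure Data Lake
# Storage Gen2 into a pandas DataFrame (no copy into the consumer's own lake).
import pandas as pd

# Hypothetical producer storage account; anon=False makes adlfs use the
# ambient Azure credentials instead of anonymous access.
storage_options = {"account_name": "marketingdatalake", "anon": False}

social_media = pd.read_json(
    "abfss://curated/social/products.json",   # hypothetical container and path
    storage_options=storage_options,
    lines=True,                               # assuming newline-delimited JSON
)

# The consumer analyzes the data in memory; the source files never move.
print(social_media.head())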
Implementing a data mesh architecture also involves a lot of non-technical changes to the organization. It requires a cultural mindset shift from centralized to decentralized, from authoritative to autonomous. While it democratizes data access, it also puts a lot of responsibility on individual teams to ensure that their data is maintained, versioned, available, and reliable. The data mesh, through its design and processes, must build a system of trust among these teams so that they can confidently use each other’s data products and accelerate innovation.
We saw in this chapter how data analytics evolved over time as technology advanced and as business needs changed. One of the main objectives of walking through this history is to realize that, once again, we are at the cusp of a change. Data-driven organizations are putting pressure on data products to deliver faster innovation to keep the company ahead of the competitive curve. We also saw how data preprocessing has become critical to modern-day analytics, which uses machine learning for accurate predictions and forecasting. Clean, curated data itself becomes like a product that other products can consume to get innovative insights. This drives the need for a more collaborative and agile analytical environment where data can be discovered and used to build data products, as opposed to the centralized dashboards and reports of the past. A data mesh is one of the ways to bring this agile and collaborative framework to life.
However, a data mesh is a long-term strategy and not a quick solution. In Chapter 2, we will look at building a data analytics strategy that leads to a data mesh architecture.
A data mesh may not be helpful to everybody, and adopting it just because of the hype could be overkill. This chapter will discuss the conditions under which a data mesh is applicable and, for those who can benefit from a data mesh architecture, what should be considered before adopting it. In order to build a data mesh, a company needs to first recognize the current state of its analytical solutions and define its future state. This chapter will walk through the main strategic areas to consider when building your data analytics strategy.
In this chapter, we’re going to cover the following main topics:
Is a data mesh for everybody?
Aligning your analytics strategy with your business strategy
Understanding data maturity models
Building the technology stack
The analytics team
Data governance
Approaches to building your data mesh

The answer is no. So, who should adopt a data mesh architecture?
Medium-size companies that have autonomous departments (sales, marketing, finance, human resources) with their own analytical needs but are forced to consolidate data in a central location
Large multi-national companies that have business across multiple geographical zones and run as independent businesses catering to local market needs
Small companies and start-ups forecasting exponential growth that rely on data for their business
Small companies that don’t see exponential growth in data should continue using a central data lake or data warehouse.
Companies that by design or by regulation are prohibited from sharing data across intra-business boundaries will see benefits from some characteristics of a data mesh, but not all. For example, pharma companies working with highly sensitive patient information do not allow data to be exchanged between departments. We will discuss mesh topologies for such companies later in this book.
Companies where the current analytics platform is providing all the required agility and innovation that the company needs, and any inefficiencies are just a matter of minor technological decisions.

In order to understand how a data mesh can help an enterprise, it is important to understand and build a data analytics strategy. The remaining part of this chapter will discuss the various aspects of building a data analytics strategy before you consider implementing a data mesh.
A successful data strategy is one that aligns with the business strategy, delivering business outcomes. Depending on the nature of the business and the industry it operates in, there can be different business strategies. A business operating in a very competitive space might want to have a pricing advantage, and hence reducing manufacturing or service costs might be the core strategy for the business. An online business might have a strategy around engaging its customers or marketing the right products to the right audience. It’s important to ensure that the results of your data analytics are providing the right key performance indicators (KPIs) and answering the required questions for your business to align with this strategy. Because, let’s face it, any technology initiative will only get buy-in when it supports the goals of the company.
Understanding and aligning your technology strategy with your business strategy is beyond the scope of this book. However, we will focus on aligning your analytics strategy, which is a crucial step to complete before building a technology strategy. You need to dig deep into your business and decide on the operating model of your analytical framework. You need to organize your stakeholders into domains. Look at their technical knowledge and understand how they will participate in your analytical process.
A traditional operating model divides the company into business and IT:
Figure 2.1 – Traditional operating model
As tools and technologies improve with natural-language interactions and friendly user interfaces, knowledge gaps become increasingly insignificant. Business analysts can build complex online analytical processing (OLAP) structures by dragging and dropping entities and columns onto a canvas, something that previously had to be requested from a data engineer. As a result, IT teams and business teams can collaborate better and be more agile. Hence, a more modern operating model has hybrid and virtual teams overlapping business and technology knowledge:
Figure 2.2 – A collaborative operating model
The operating model needs to be understood from a present and future perspective as part of the data strategy.
In the next section, we will discuss data analytics maturity models for a company so that you can analyze and understand how mature your current analytical platform is before you can start architecting your data mesh.
Data analytics maturity models