The Data Lakehouse architecture is a new paradigm that enables large-scale analytics. This book will guide you in developing your data architecture the right way to ensure your organization's success.
The first part of the book discusses the different data architectural patterns used in the past and the need for a new architectural paradigm, as well as the drivers that have caused this change. It covers the principles that govern the target architecture, the components that form the Data Lakehouse architecture, and the rationale and need for those components. The second part deep dives into the different layers of the Data Lakehouse. It covers various scenarios and components for data ingestion, storage, data processing, data serving, analytics, governance, and data security. The third part of the book focuses on the practical implementation of the Data Lakehouse architecture on a cloud computing platform. It covers various ways to combine the Data Lakehouse pattern to realize macro-patterns, such as Data Mesh and Data Hub-Spoke, based on the organization's needs and maturity level. The frameworks introduced are practical, and organizations can readily benefit from their application.
By the end of this book, you'll clearly understand how to implement the Data Lakehouse architecture pattern in a scalable, agile, and cost-effective manner.
Architecting a modern and scalable data analytics platform
Pradeep Menon
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Sunith Shetty
Senior Editor: David Sugarman
Content Development Editor: Priyanka Soam
Technical Editor: Sonam Pandey
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Production Designer: Joshua Misquitta
Marketing Coordinator: Abeer Riyaz Dawe
First published: March 2022
Production reference: 1070222
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80181-593-2
www.packt.com
Many people have contributed to the creation of this book. From its inception to its publishing, my mentors, friends, colleagues, and family have constantly motivated, guided, and supported me. Unfortunately, there is not enough space to thank all of them. However, I will make five key mentions that were absolutely pivotal for creating this book.
Firstly, I want to thank my parents, who have supported me through the thick and thin of life. Their upbringing ensured that I was capable enough to undertake the Herculean task of writing a book.
Secondly, I want to thank my wife, Archana, and my daughter, Anaisha. They have constantly supported me while writing this book. They ensured that the boat was afloat as I burnt the midnight oil.
Thirdly, I want to thank my colleague and an accomplished architect, Debananda Ghosh. His technical knowledge, understanding of the complex dynamics of data, and honest feedback helped me make manifold improvements to this book's contents.
Fourthly, I want to thank the Packt Publishing team: Sunith Shetty, Priyanka Soam, Aishwarya Mohan, and David Sugarman. This team is an author's dream – open to ideas, dedicated, and diligent. I'm thankful for the fantastic support provided by the team that made the writing process an absolute pleasure.
And finally, I want to thank my best friend and beloved pet, Pablo (a beagle). Without him, I wouldn't have had a chance to complete any book. He has single-handedly made me disciplined in my approach to life. The dedication and focus required to complete a book are directly attributable to the discipline instilled in me by him.
Pradeep Menon is a seasoned data analytics professional with more than 18 years of experience in data and AI.
Pradeep can balance business and technical aspects of any engagement and cross-pollinate complex concepts across many industries and scenarios.
Currently, Pradeep works as a data and AI strategist at Microsoft. In this role, he is responsible for driving big data and AI adoption for Microsoft's strategic customers across Asia.
Pradeep is also a distinguished speaker and blogger and has given numerous keynotes on cloud technologies, data, and AI.
Debananda Ghosh is a senior specialist and global black belt (Cloud Analytics Asia) at Microsoft. He completed his Bachelor of Engineering at Jadavpur University and is pursuing a postgraduate degree in data science and business analytics at the McCombs School of Business at the University of Texas at Austin. He specializes in the fields of data and AI. His expertise includes data warehousing, database administration, data engineering, machine learning, data science product innovation, data and AI architecture and presales, and cloud analytics product sales. He has worked with customers in finance, manufacturing, utilities, telecoms, retail, e-commerce, and aviation. Currently working in the Microsoft Cloud Analytics product field, he helps industry partners achieve their digital transformation projects using advanced analytics and AI capabilities.
Digital transformation is a reality. All organizations, big or small, have to embrace this reality to be relevant in the future. Data is at the core of this, and data analytics is the catalyst for this transformation. Therefore, an agile, scalable, and robust data architecture for analytics is pivotal for forging data as a strategic asset.
However, very few organizations can successfully harness their data estate for analytics. Many grapple with obsolete enterprise data warehouse architectural patterns or have jumped onto the data lake bandwagon without a proper architectural framework. Moreover, much of the discussion around the trending term "Data Lakehouse" reflects various vendors' product-centric views rather than an architectural paradigm. This book views the concept of the Data Lakehouse through an architectural lens.
This book provides a comprehensive framework for developing a modern data analytics architecture. While writing it, I have focused on the architectural constructs of a Data Lakehouse. The book covers the different layers and components of the architecture and explores how these layers interoperate to form a robust, scalable, and modular architecture that can be deployed on any platform.
By the end of this book, you will understand the need for a new data architecture pattern called Data Lakehouse, the details of the different layers and components of a Data Lakehouse architecture, and the methods required to deploy this architecture in a cloud computing platform and scale it to achieve the macro-patterns of Data Mesh and Hub-spoke.
This book is for anyone who wants to understand how to architect modern analytics and become well-versed with the modern data architecture patterns that enable large-scale analytics. It explains concepts in a straightforward, non-technical manner. The book's target audience includes data architects, big data engineers, data strategists and practitioners, data stewards, and cloud computing practitioners.
Chapter 1, Introducing the Evolution of Data Analytics Patterns, provides an overview of the evolution of the data architecture patterns for analytics.
Chapter 2, The Data Lakehouse Architecture Overview, provides an overview of the various components that form the Data Lakehouse architecture pattern.
Chapter 3, Ingesting and Processing Data in a Data Lakehouse, deep dives into the methods of ingesting and processing batch and streaming data in a Data Lakehouse.
Chapter 4, Storing and Serving Data in a Data Lakehouse, discusses the types of datastores of a data lake and various methods of serving data from a Data Lakehouse.
Chapter 5, Deriving Insights from a Data Lakehouse, discusses the ways in which business intelligence, artificial intelligence, and data exploration can be carried out.
Chapter 6, Applying Data Governance in a Data Lakehouse, discusses how data can be governed, how to implement and maintain data quality, and how data needs to be cataloged.
Chapter 7, Applying Data Security in a Data Lakehouse, discusses various components used to secure the Data Lakehouse and ways to provide proper access to the right users.
Chapter 8, Implementing a Data Lakehouse on Microsoft Azure, focuses on implementing a Data Lakehouse on the Microsoft Azure cloud computing platform.
Chapter 9, Scaling the Data Lakehouse Architecture, discusses how Data Lakehouses can be scaled to realize the macro-architecture patterns of Data Mesh and Hub-spoke.
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801815932_ColorImages.pdf.
Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "The two types of metadata that need to be cataloged include Functional and Technical."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you've read Data Lakehouse in Action, we'd love to hear your thoughts! Please click https://packt.link/r/1-801-81593-3 to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
This section describes the evolution of data architecture patterns for analytics. It addresses the challenges posed by different architectural patterns and establishes a new paradigm, that is, the data lakehouse. An overview of the data lakehouse architecture is also provided, which includes coverage of the principles that govern the target architecture, the components that form the data lakehouse architecture, the rationale and need for those components, and the architectural principles adopted to make a data lake scalable and robust.
This section comprises the following chapters:
Chapter 1, Introducing the Evolution of Data Analytics Patterns
Chapter 2, The Data Lakehouse Architecture Overview
Data analytics is an ever-changing field. A little history will help you appreciate the strides in this field and how data architectural patterns have evolved to fulfill the ever-changing need for analytics.
First, let's start with some definitions:
What is analytics? Analytics is defined as any action that converts data into insights.
What is data architecture? Data architecture is the structure that enables the storage, transformation, exploitation, and governance of data.
Analytics and the data architecture that enables it go back a long way. Let's now explore some of the patterns that have evolved over the last few decades.
This chapter explores the genesis of data growth and explains the need for a new paradigm in data architecture. This chapter starts by examining the predominant paradigm, the enterprise data warehouse, popular in the 1990s and 2000s. It explores the challenges associated with this paradigm and then covers the drivers that caused an explosion in data. It further examines the rise of a new paradigm, the data lake, and its challenges. Furthermore, this chapter ends by advocating the need for a new paradigm, the data lakehouse. It clarifies the key benefits delivered by a well-architected data lakehouse.
We'll cover all of this in the following topics:
Discovering the enterprise data warehouse era
Exploring the five factors of change
Investigating the data lake era
Introducing the data lakehouse paradigm
The Enterprise Data Warehouse (EDW) pattern, popularized by Ralph Kimball and Bill Inmon, was predominant in the 1990s and 2000s. The needs of this era were relatively straightforward (at least compared to the current context). The focus was predominantly on optimizing database structures to satisfy reporting requirements. Analytics was synonymous with reporting. Machine learning was a specialized field and was not ubiquitous in enterprises.
A typical EDW pattern is depicted in the following figure:
Figure 1.1 – A typical EDW pattern
As shown in Figure 1.1, the pattern entails source systems composed of databases or flat-file structures. The data sources are predominantly structured, that is, rows and columns. A process called Extract-Transform-Load (ETL) first extracts the data from the source systems. The process then transforms the data into a shape and form that is conducive for analysis. Once the data is transformed, it is loaded into an EDW. From there, subsets of the data are populated to downstream data marts. Data marts can be conceived of as mini data warehouses that cater to the business requirements of a specific department.
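To make the flow concrete, here is a minimal ETL sketch in Python using pandas and SQLAlchemy. The connection strings, table names, and transformation steps are illustrative assumptions, not a prescribed implementation:

```python
# A minimal, illustrative ETL sketch: extract from a source database,
# transform into a reporting-friendly shape, and load into a warehouse.
# Connection strings and table names are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("postgresql://user:pass@source-host/sales_db")
warehouse = create_engine("postgresql://user:pass@edw-host/warehouse")

# Extract: pull structured rows and columns from the source system.
orders = pd.read_sql(
    "SELECT order_id, customer_id, amount, order_date FROM orders", source
)

# Transform: cleanse and reshape into a form conducive for analysis.
orders = orders.dropna(subset=["customer_id"])
orders["order_date"] = pd.to_datetime(orders["order_date"])
daily_sales = (
    orders.groupby(orders["order_date"].dt.date)["amount"]
          .sum()
          .reset_index(name="total_amount")
)

# Load: write the transformed data into an EDW fact table.
daily_sales.to_sql("fact_daily_sales", warehouse, if_exists="append", index=False)
```

In practice, such pipelines were typically built with dedicated ETL tooling rather than hand-written scripts; that specialization is precisely the skills bottleneck discussed later in this chapter.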
As you can imagine, this pattern primarily was focused on the following:
Creating a data structure that is optimized for storage and modeled for reporting
Focusing on the reporting requirements of the business
Harnessing the structured data into actionable insights
Every coin has two sides, and the EDW pattern is no exception. It has its pros and it has its cons. This pattern has survived the test of time. It was widespread and well adopted because of the following key advantages:
Since most of the analytical requirements were related to reporting, this pattern effectively addressed many organizations' reporting requirements.
Large enterprise data models were able to structure an organization's data into logical and physical models. This pattern gave a structure to manage the organization's data in a modular and efficient manner.
Since this pattern catered only to structured data, the technology required to harness structured data was mature and readily available. Relational Database Management Systems (RDBMSes) had evolved and were well placed to harness structured data for reporting.
However, the pattern also had its own set of challenges, which surfaced as data volumes grew and new data formats started emerging. A few challenges associated with the EDW pattern are as follows:
This pattern was not as agile as changing business requirements demanded. Any change in a reporting requirement had to go through a long-winded process of data model changes, ETL code changes, and corresponding changes to the reporting system. The ETL process was often a specialized skill and became a bottleneck that lengthened the data-to-insight turnaround time. The nature of analytics is unique: the more output you see, the more you demand. Many EDW projects were deemed failures, not from a technical perspective but from a business one. Operationally, the design changes required to cater to these fast-evolving requirements were too difficult to handle.
As data volumes grew, this pattern proved too cost-prohibitive. Massively parallel-processing database technologies that specialized in data warehouse workloads started evolving, but the cost of maintaining these databases was also prohibitive. It involved expensive software licenses, frequent hardware refreshes, and substantial staffing costs. The return on investment was no longer justifiable.
As data formats evolved, the challenges associated with the EDW became more evident. Database technologies were developed to cater to semi-structured data (JSON), but the fundamental concept was still RDBMS-based, and the underlying technology could not effectively cater to these new types of data. There was more value in analyzing data that was not structured, and the sheer variety of data was too complex for EDWs to handle.
The EDW was focused predominantly on Business Intelligence (BI). It facilitated the creation of scheduled reports, ad hoc data analysis, and self-service BI. Although it catered to most of the personas who performed analysis, it was not conducive to AI/ML use cases. The data in the EDW was already cleansed and structured with a razor-sharp focus on reporting, which left little room for a data scientist (a statistical modeler at that time) to explore the data and form new hypotheses. In short, the EDW was primarily focused on BI.
While the EDW pattern was becoming mainstream, a perfect storm was brewing that would change the landscape. The following section focuses on the five factors that came together to change the data architecture pattern for good.
The year 2007 changed the world as we know it; the day Steve Jobs took the stage and announced the iPhone launch was a turning point in the age of data. That day brewed the perfect "data" storm.
A perfect storm is a meteorological event that occurs as a result of a rare combination of factors. In the world of data evolution, such a perfect storm occurred in the last decade, one that has catapulted data as a strategic enterprise asset. Five ingredients caused the perfect "data" storm.
Figure 1.2 – Ingredients of the perfect "data" storm
As depicted in Figure 1.2, there were five factors to the perfect storm. An exponential growth of data and an increase in computing power were the first two factors. These two factors coincided with a decrease in storage cost. The rise of AI and the advancement of cloud computing coalesced at the same time to form the perfect storm.
These factors developed independently and converged together, changing and shaping industries. Let's look into each of these factors briefly.
The exponential growth of data is the first ingredient of the perfect storm.
Figure 1.3 – Estimated data growth between 2010 and 2020
According to the International Data Corporation (IDC), the total volume of data generated will reach around 163 ZB (zettabytes, where one zettabyte is a trillion gigabytes) by 2025. In 2010, that number was approximately 0.5 ZB. This exponential growth of data is attributed to vast improvements in internet technologies, which have fueled the growth of many industries. The telecommunications industry was transformed first, and it in turn transformed many other industries. Data became ubiquitous, and every business craved more data bandwidth. Social media platforms took off as well: the likes of Facebook, Twitter, and Instagram flooded the internet with more data. Streaming services and e-commerce also generated tons of data, which was used to shape and influence consumer behavior. Last, but not least, technological leaps in the Internet of Things (IoT) space generated loads of data.
The traditional EDW pattern could not cope with this growth in data. It was designed for structured data, and big data had changed the definition of usable data. Data was now big (volume); some of it flowed continuously (velocity); it was generated in different shapes and forms (variety); and it came from a plethora of sources with noise (veracity).
The exponential increase in computing power is the second ingredient of the perfect storm.
Figure 1.4 – Estimated growth in transistors per microprocessors between 2010 and 2020
Moore's law is the prediction, made by American engineer Gordon Moore in 1965, that the number of transistors per silicon chip doubles every year (revised in 1975 to every two years). The law has held to its forecast remarkably well so far. In 2010, the number of transistors in a microprocessor was around 2 billion; in 2020, that number stood at 54 billion. This exponential increase in computing power dovetails with the rise of cloud computing technologies that provide near-limitless compute at an affordable price point.
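As a quick sanity check on those figures, a short, purely illustrative calculation shows the doubling period they imply over that decade:

```python
import math

# Transistor counts cited above (approximate, per microprocessor).
count_2010 = 2e9    # ~2 billion in 2010
count_2020 = 54e9   # ~54 billion in 2020

doublings = math.log2(count_2020 / count_2010)  # ~4.75 doublings
years_per_doubling = 10 / doublings             # ~2.1 years

print(f"{doublings:.2f} doublings, one roughly every {years_per_doubling:.1f} years")
```

That works out to roughly one doubling every two years, consistent with the revised formulation of the law.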
The increase in computing power at a reasonable price point provided a much-needed impetus for big data. Organizations can now procure more and more compute at a much lower price point. The compute available in cloud computing can now be used to process and analyze data on demand.
The rapid decrease in storage cost is the third ingredient of the perfect storm.
Figure 1.5 – The estimated decrease in storage cost between 2010 and 2020
The cost of storage has also decreased exponentially. In 2010, the average cost of storing a GB of data on a Hard Disk Drive (HDD) was around $0.10. That number fell to approximately $0.01 within 10 years. In the traditional EDW pattern, organizations had to be picky about which data to store for analysis and which data to discard; holding data was an expensive proposition. However, the exponential decrease in storage cost meant that all data could now be stored at a fraction of the previous cost. There was no longer a need to pick and choose what should be stored and what should be discarded. Data in whatever shape or form could now be kept at a fraction of the price. The mantra of store first, analyze later could now be implemented.
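To put those per-gigabyte prices in perspective, here is a small illustrative calculation of what they imply for storing a petabyte (the figures are the approximate ones cited above):

```python
# Illustrative only: what the cited per-GB HDD prices imply for a petabyte.
gb_per_pb = 1_000_000
cost_2010 = 0.10 * gb_per_pb   # ~$100,000 per PB in 2010
cost_2020 = 0.01 * gb_per_pb   # ~$10,000 per PB in 2020
print(f"2010: ${cost_2010:,.0f}/PB, 2020: ${cost_2020:,.0f}/PB")
```

At roughly a tenth of the cost, keeping everything became cheaper than deciding what to throw away.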
Artificial Intelligence (AI) systems are not new to the world. In fact, their genesis goes back to the 1950s, when statistical models were used to estimate values of data points based on past data. This field was out of focus for an extended period, as the computing power and large corpus of data required to run these models were not available.
Figure 1.6 – Timeline of the evolution of AI
However, after a long hibernation, AI technologies saw a resurgence in the early 2010s. This resurgence was partly due to the abundance of powerful computing resources and the equally abundant availability of data. AI models could now be trained faster, and the results were stunningly accurate.
The combination of reduced storage costs and more available computing resources was a boon for AI. More and more complex models could now be trained.
Figure 1.7 – Accuracy of AI systems in matching humans for image recognition
This was especially true for deep learning algorithms. For instance, a deep learning technique called Convolutional Neural Networks (CNNs) became very popular for image recognition. Over time, deeper and deeper neural networks were created. Now, AI systems have surpassed human beings at detecting objects in images.
As AI systems became more accurate, they grew in popularity. This fueled a virtuous cycle, and more and more businesses employed AI in their digital transformation agendas.
The advancement of cloud computing is the fifth ingredient of the perfect storm.