Revolutionize your approach to data processing in the fast-paced business landscape with this essential guide to data engineering. Discover the power of scalable, efficient, and secure data solutions through expert guidance on data engineering principles and techniques. Written by two industry experts with over 60 years of combined experience, it offers deep insights into best practices, architecture, agile processes, and cloud-based pipelines.
You’ll start by defining the challenges data engineers face and learning how an agile, future-proof data solution architecture addresses them. As you explore the extensive toolkit and master the capabilities of its various instruments, you’ll gain the knowledge needed for independent research. Covering everything from data engineering fundamentals onward, the guide uses real-world examples to illustrate potential solutions, and it elevates your skills in architecting scalable data systems, implementing agile development processes, and designing cloud-based data pipelines. The book further equips you to harness serverless computing and microservices to build resilient data applications.
By the end, you'll be armed with the expertise to design and deliver high-performance data engineering solutions that are not only robust, efficient, and secure but also future-ready.
Data Engineering Best Practices
Architect robust and cost-effective data solutions in the cloud era
Richard J. Schiller
David Larochelle
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Apeksha Shetty
Publishing Product Manager: Nilesh Kowadkar
Book Project Manager: Hemangi Lotlikar
Senior Editor: David Sugarman
Technical Editor: Sweety Pagaria
Copy Editor: Safis Editing
Proofreader: David Sugarman
Indexer: Manju Arasan and Tejal Soni
Production Designer: Alishon Mendonca
DevRel Marketing Coordinator: Nivedita Singh
First published: September 2024
Production reference: 1060924
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK
ISBN 978-1-80324-498-3
www.packtpub.com
Richard J. Schiller is a chief architect, distinguished engineer, and startup entrepreneur with 40 years of experience delivering real-time large-scale data processing systems. He holds an MS in computer engineering from Columbia University’s School of Engineering and Applied Science and a BA in computer science and applied mathematics. He has been involved with two prior successful startups and has coauthored three patents. He is a hands-on systems developer and innovator.
David Larochelle has been involved in data engineering for startups, Fortune 500 companies, and research institutes. He holds a BS in computer science from the College of William & Mary, a Master’s in computer science from the University of Virginia, and a Master’s in communication from the University of Pennsylvania. David’s career spans over 20 years across a wide range of organizations, including startups, established companies, and research labs.
Kamal Baig has over 19 years of experience within the IT space. He has a solid background in data and application development integration and seamlessly transitioned into the Azure solutions architect role. Throughout his career, Kamal has consistently demonstrated a deep understanding of data architecture principles and best practices, leveraging Azure technologies to design and implement cutting-edge solutions that meet the complex needs of modern enterprises. His expertise spans data analytics modernization, data warehouses, data mesh, and data products. Coming from CPG, hospitality, and education domains, he has designed scalable data solutions to ensure security, compliance, and regulatory requirements to align with organizational goals.
John Bremer has 20 years of experience in the market research and data science space. A pioneer creating impactful innovation and value for clients and stakeholders, John has successfully designed and executed research and data strategies and projects for various industries and sectors, leveraging his expertise in data analysis, data mining, and data science. As the President of Phantom 4 Solutions, he provides on-demand support and consulting for organizations in many roles, including Chief Research Officer, Chief Data Science Officer, or Chief Data Analytics Officer. John has a proven track record of managing and transforming high-performance quant teams, and is a respected and valued consultant and decision-maker on data-related matters.
Lindsey Nix is an experienced product manager with a demonstrated history of working in the aerospace, finance, and semiconductor industries. Lindsey is skilled in management, system requirements, software documentation, technical writing, business development, strategic planning, and information assurance. She is a strong consulting professional with a Master’s degree in business administration, systems engineering, and data analytics from San Jose State University.
Shanthababu Pandian has over 23 years of IT experience, specializing in data architecting, engineering, analytics, DQ&G, data science, ML, and Gen AI. He holds a BA in electronics and communication engineering, three Master’s degrees (M.Tech, MBA, M.S.) from a prestigious Indian university, and has completed postgraduate programs in AIML from the University of Texas and data science from IIT Guwahati. He is a director of data and AI in London, UK, leading data-driven transformation programs focusing on team building and nurturing AIML and Gen AI. He helps global clients achieve business value through scalable data engineering and AI technologies. He is also a national and international speaker, author, technical reviewer, and blogger.
Marianna Petrovich brings over 30 years of experience to the table. Her passion for software engineering, cloud and data intricacies, quality, and governance is evident in her work. Marianna’s expertise in data engineering has made her a sought-after consultant and advisor. Trusted for her knowledge of modern data platforms and cloud tools, she guides clients with her exceptional skills in both data and engineering. Currently, she heads the enterprise data engineering team at Circana. Holding a Master’s degree in big data from ASU, Marianna resides in Northern California with her husband and eight children. Her aspiration is to inspire the next generation by teaching data engineering to children.
Bill Sun is a senior IT enterprise and solutions architect with expertise in cloud computing, big data, AI/ML, and DevOps. Known for his strong communication skills and leadership, Bill has driven significant projects at Fortune 500 companies. His accomplishments include cloud migrations, data pipeline optimizations, and the development of unified platform services. Bill holds a Master’s in computer science from Johns Hopkins, BA degrees from Tsinghua University, and multiple certifications, including Azure and AWS.
Are you an IT professional, IT manager, or business leader looking for an effective large-scale data engineering solution platform? Have you experienced the pain of slogging through piles of literature? Have you had to implement a series of painful proofs of concept? If so, this book is for you.
You will emerge on the other side able to implement correctly architected, data-engineered solutions that address real problems you will face in the development process.
Data engineering is rapidly evolving, and the modern data engineer needs to be equipped with software engineering practices to succeed in today’s fast-paced data-driven world. This hands-on book takes a practical approach to applying software and data engineering practices to modern use cases, including the following:
- Migrating to cloud-based storage and processing
- Applying Agile methodologies
- Prioritizing governance, privacy, and security

This book is ideal for data engineers and analytics teams looking to enhance their skills and gain a competitive edge in the industry. While reading the book, you will be prompted with ideas, questions, and plans for implementation that you would not have considered otherwise.
This book assumes that you have a foundational knowledge of at least one cloud vendor service, in particular, Amazon Web Services (AWS) or Microsoft’s Azure. Additionally, you should be well versed in a scripting language (such as Python) and a primary language (such as Java or C/C++), have encountered concurrent/distributed big data processing, and ideally have some experience with analytic services such as Azure Analysis Services (AAS), Microsoft Power BI, or other third-party analytic solutions. This book is largely aimed at developers and architects who understand Python and cloud computing but want a complete framework for future-proofing successful solutions.
The book is not prescriptive regarding IT solutions, but it does raise key considerations to evaluate as the technology field evolves. After reading this book, IT architects will be equipped to engage with cloud vendors and third-party vendors following best practices, so that any solution developed remains robust, of high quality, and cost-effective over time.
This book’s structure is as follows:
- Mission/vision
- Principles
- Architecture
- Best practices
- Design patterns
- Use cases

Where pertinent, vendor selection criteria are presented, with business value statements affecting their weighting, so that decisions correctly implement an organization’s goals. Real-life examples and lessons sum up key points. The book is structured to enable you to envision a reference architecture for your organization and then see the implementation of the business solution in the context of that reference architecture. As you absorb the content of the chapters, it is a best practice to organize the solution forming in your mind. This is our first key consideration:
“Envision what it means to my company’s goals.”
Organize your notes and takeaways from the perspective of “What does it mean for my goals?” while building up a reference architecture and solution strawman.
By the end of this book, you will be able to architect, design, and implement end-to-end cloud-based data processing pipelines. You will also be able to provide customers with access to data as a product supporting various machine learning, analytic, and big data use cases… all within a well-architected data framework. You will know how to build or buy logical components aligned to the architected data framework’s principles and best practices using Agile software development processes tuned to work for an organization. Although this book will not supply all the answers, it will shine a light on the path to success while avoiding the pitfalls encountered by many, including the author’s own experiences. It will save you countless hours of frustration and enable more rapid creation of better-architected systems.
If you are an IT professional, IT manager, or business leader looking to build a large-scale data engineering solution, then this book will provide you with a solid set of best practices. As a data engineer, it will give you the details behind the best-practice recommendations so you can assess the right approaches for your effort. All this should take many hours of pain out of your engineering efforts. If you have to implement a series of proofs of concept, then this book points to the technologies and vendors that you should avoid so that the proof of concept does not become a proof of failure (POF). If all this is of interest to you, then this book is for you.
This book has been written at an intermediate level for data engineers, architects, and managers. There are no tools that you need on your desktop; however, if you want to become hands-on with the tools and technologies referenced, there will be short links (to the {https://packt-debp.link} domain) that are similar to traditional endnotes in each chapter. The journey toward best practices begins with the business context, the mission, vision, and principles that set the foundation for success, and then the development of an architecture. This is followed by engineering designs across a number of important areas driven by people, process, and technology needs.
As the book progresses, the technical topics get deeper, ending with machine learning and GenAI, a practical look at how to tune LLMs with RAG and prompt engineering, and a thorough exploration of knowledge engineering.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Data Engineering Best Practices, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
- Scan the QR code or visit the following link: https://packt.link/free-ebook/978-1-80324-498-3
- Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly.

We begin with the task of defining the business problem statement.
“Businesses are faced with an ever-changing technological landscape. Competition requires one to innovate at scale to remain relevant; this causes a constant implementation stream of total cost of ownership (TCO) budget allocations for refactoring and re-envisioning during what would normally be a run/manage phase of a system’s lifespan.”
This rapid rate of change means the goalposts are constantly moving. “Are we there yet?” is a question I heard from my kids constantly when traveling. It came from not knowing where we were or having any idea of the effort to get to where we were going, with a driver (me) who had never driven to that destination before. Thank goodness for Garmin (automobile navigation systems) and Google Maps, and not the outdated paper maps that were used in the past. See how technology even impacted that metaphor? Garmin is being displaced by Google for mapping use cases. This is not always because it is better but because it is free (if you wish to be subjected to data collection and advertising interruptions) and it is hosted on everyone’s smart device.
Now, I can tell my grandkids that in exactly 1 hour and 29 minutes, they will walk into their home after spending the weekend with their grandparents. The blank stare I get in response tells it all. Mapped data, rendered with real-time technology, has changed us completely.
Technological change can appear revolutionary when it’s occurring, but when looking back over time, the progression of change appears to be a no-brainer series of events that we take for granted, and even evolutionary. That is what is happening today with data, information, knowledge, and analytical data stores in the cloud. The term DataOps was popularized by Andy Palmer, co-founder and CEO of Tamr {https://packt-debp.link/MGj4EU}. The data management and analytics world has referenced the term often. In 2015, Palmer stated that DataOps is not just a buzzword, but a critical approach to managing data in today’s complex, data-driven world.
I believe that it’s time for data engineers and data scientists to embrace a similar (to DevOps) new discipline – let’s call it DataOps – that at its core addresses the needs of data professionals on the modern internet and inside the modern enterprise. (Andy Palmer {https://packt-debp.link/ihlztK})
In Figure 1.1, observe how data quality, integration, engineering, and security are tied together with a solid DataOps practice:
Figure 1.1 – DataOps in the enterprise
The goal of this chapter is to set up the foundation for understanding why the best practices presented in this book are structured as they are. This foundation will provide a firm footing to make the framework you adopt in your everyday engineering tasks more secure and well-grounded. There are many ways to look at solutions to data engineering challenges, and each vendor, engineering school, and cloud provider will have its own spin on the formula for success. That success will ultimately depend on what you can get working today and keep working in the future. A unique balance of various forces will need to be obtained. However, this balance may be easily upset if the foundation is not correct.

As a reader, you will have naturally formed biases toward certain engineering challenges. These can force you into niche (or single-minded) focus directions – for example, a fixation on robust, highly available multi-region operations with a de-emphasized pipeline software development effort. As a result, you may overbuild robustness and underdevelop key features. Likewise, you can focus on hyper-agile streaming of development changes into production at the cost of consumer data quality.

More generally, there is a significant risk from just doing IT and losing focus on why we need to carefully structure the processing of data in a modern information processing system. You must not neglect the need to capture data with its semantic context, thus making it true and relevant, instead of the software system becoming the sole interpretation of the data. This freedom makes data and context equal to information that is fit for purpose, now and in the future.
We can begin with the business problem statement.
Data engineering approaches are rapidly morphing today. They will coalesce into a systemic, consistent whole. At the core of this transformation is the realization that data is information that needs to represent facts and truths along with the rationalization that created those facts and truths over time. There must not be any false facts in future information systems. That term may strike you as odd. Can a fact be false? This question may be a bit provocative. But haven’t we often built IT systems to determine just that?
We process data in software systems that preserve business context and meaning but force the data to be served only through those systems. The data does not stand alone, and if consumed out of context, it would lead to these false facts propagating into the business environment. Data can’t stand alone today; it must be transformed by information processing systems, which have technical limitations. Pragmatic programmers’ {https://packt-debp.link/zS3jWY} imperfect tools and technology will produce imperfect solutions. Nevertheless, the engineer is still tasked with removing as many false facts as possible, if not all of them, when producing a solution. That has been elusive in the past.
We often take shortcuts. We also justify these shortcuts with statements like: “there simply is not enough time!” or “there’s no way we can get all that data!” The business “can’t afford to curate it correctly,” or lastly “there’s no funding for boiling the ocean.” We do not need to boil the ocean.
What we are going to think about is how we are going to turn that ocean directly into steam! This should be our response, not a rollover! This rethinking mindset is exactly what is needed as we engineer solutions that will be future-proof. What is hard is still possible if we rethink the problem fully. To turn that metaphor around – we will use data as the new fuel for the engine of innovation.
Fun fact
In 2006, mathematician Clive Humby coined the phrase “data is the new oil” {https://packt-debp.link/SiG2rL}.
Data systems must become self-healing of false facts to enable them to be knowledge-complete. After all, what is a true fact? Is it not just a hypothesis backed up by evidence until such time that future observations disprove a prior truth? Likewise, organizing information into knowledge requires not just capturing semantics, context, and time series relevance but also the asserted reason for a fact being represented as information truth within a dataset. This is what knowledge defines: truth! However, it needs correct representation.
Note
The truth of a knowledge base is composed of facts that are proven by assertions that withstand the test of time and do not hide information context that makes up the truth contained within the knowledge base.
But sometimes, when we do not have enough information, we guess. This guessing is based on intuition and prior experience with similar patterns of interconnected information from related domains. We humans can be very wrong with our guesses. But strongly intuited guesses can lead to great leaps in innovation which can later be backfilled with empirically collected data.
Until then, we often stretch the truth to span gaps in knowledge. Information relationship patterns need to be retained, along with the hypotheses that record these educated guesses. In this manner, data truths can be guessed. They can also be guessed well! These guesses can even be unwound when proven to be wrong. It is essential that data is organized in a new way to support intelligence. Reasoning is needed to support or refute hypotheses, and retaining information as knowledge to form truth is just as important. If we don’t address organizing big data to form knowledge and truth within a framework consumable by the business, we are just wasting cycles and funding on cloud providers.
This book will focus on best practices; there are a couple of poor practices that need to be highlighted. These form anti-patterns that have crept into the data engineer’s tool bag over time that hinder the mission we seek to be successful in. Let’s look into these anti-patterns next.
What are anti-patterns? To answer that, start with patterns: architectural patterns form blueprints that ease implementation. Just like when constructing a physical building, a civil architect will use blueprints to definitively communicate expectations to the engineers. If a common solution is recurring and successful, it is reused often as a pattern, like the framing of a wall or a truss for a type of roofline. An anti-pattern, by contrast, is a pattern to be avoided: for example, running plumbing through an outside wall in a cold climate, because the cold temperature could freeze those pipes.
The first anti-pattern we describe deals with retaining stuff as data that we think is valuable but that can no longer be understood or processed given how it was stored; its contextual meaning is lost because it was never captured when the data was first retained in storage (such as cloud storage).
The second anti-pattern involves not knowing the business owner’s meaning for column-formatted data, or how those columns relate to each other to form business meaning, because this meaning was only preserved in the software solution, not in the data itself. We rely on entity relationship diagrams (ERDs) that are not worth the paper they were printed on to gain some degree of clarity, which is lost the next time an agile developer fails to update them. Knowing what we must avoid as we develop a future-proof, data-engineered solution will help set the foundation for this book.
In order to get a better understanding of the two anti-patterns just introduced, the following specific examples should help illustrate what to avoid.
As an example of what not to do, in the past, I examined a system that retained years of data, only to be reminded that the data was useless after three months. This is because the processing code that created that data had changed hundreds of times in prior years and continued to evolve without being noted in the dataset produced by that processing. The assumptions put into those non-mastered datasets were not preserved in the data framework. Keeping that data around was a red herring, just waiting for some future big data analyst to try and reuse it. When I asked, “Why was it even retained?” I was told it had to be, according to company policy. We are often faced with someone who thinks piles of stuff are valuable, even if they’re not processable. Some data can be the opposite of valuable. It can be a business liability if reused incorrectly. Waterfall-gathered business requirements or even loads of agile development stories will not solve this problem without a solid data framework for data semantics as well as data lineage for the data’s journey from information to knowledge. Without this smart data framework, the insights gathered would be wrong!
Likewise, as another not-to-do example, I once built an elaborate, colorful graphical rendering of web consumer usage across several published articles. It was truly a work of art, though I say so myself. The insight clearly illustrated that some users were just not engaging with a few key classes of information that were expensive to curate. However, it was a work of pure fiction and had to be scrapped! This was because I misused one key dataset column that was loaded with data that was, in fact, the inverted rank of users’ access rather than an actual usage value.
During the development of the data processing system, the prior developers produced no metadata catalog, no data architecture documentation, and no self-serve textual definitions of the columns. All that information was retained in the mind of one self-serving data analyst. The analyst was holding the business data hostage and pocketing huge compensation for generating insights that only that individual could produce. Any attempt to dethrone this individual was met with one key and powerful consumer of the insights overruling IT management. As a result, the implementation of desperately needed, governance-mandated enterprise standards for analytics was stopped. Using the data in such an environment was a walk through a technical minefield.
Organizations must avoid this scenario at all costs. It is a data-siloed, poor-practice anti-pattern. It arises due to individuals seeking to preserve a niche position or a siloed business agenda. In the case just illustrated, that anti-pattern was used to kill the governance-mandated enterprise standard for analytics. The problem can be prevented by properly implementing governance in a data framework where data becomes self-explanatory.
Let’s consider a real-world scenario that illustrates both of these anti-patterns. A large e-commerce company has many years of customer purchase data that includes a field called customer_value. Originally, this field was calculated as the total amount the customer spent, but its meaning has changed repeatedly over the years without updates to the supporting documentation. After a few years, it was calculated as total_spending – total_returns. Later, it became predicted_lifetime_value based on a machine learning (ML) model. When a new data scientist joins the company and uses the field to segment customers for a marketing campaign, the results are disastrous! High-value customers from early years are undervalued while new customers are overvalued! This example illustrates how retaining data without proper context (anti-pattern #1) and a lack of clear documentation for data fields (anti-pattern #2) can lead to significant mistakes.
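One hedged way to guard against this kind of silent definition drift is to version the business definition of each column alongside the data itself. The following minimal sketch is plain Python; the dates, formulas, and the customer_value history are invented to match the hypothetical scenario above, not taken from any real system:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class ColumnDefinition:
    """One versioned business definition of a dataset column."""
    definition: str       # plain-language business meaning
    derivation: str       # formula or model used to compute the value
    effective_from: date  # when this definition started to apply

@dataclass
class ColumnHistory:
    """Every definition a column has carried over its lifetime."""
    name: str
    versions: List[ColumnDefinition] = field(default_factory=list)

    def definition_on(self, as_of: date) -> ColumnDefinition:
        """Return the definition that applied on a given date."""
        applicable = [v for v in self.versions if v.effective_from <= as_of]
        if not applicable:
            raise ValueError(f"No definition of {self.name} before {as_of}")
        return max(applicable, key=lambda v: v.effective_from)

# Hypothetical history of the customer_value field from the scenario above
customer_value = ColumnHistory("customer_value", [
    ColumnDefinition("Total amount spent by the customer",
                     "SUM(order_total)", date(2015, 1, 1)),
    ColumnDefinition("Net spend after returns",
                     "SUM(order_total) - SUM(return_total)", date(2018, 6, 1)),
    ColumnDefinition("Predicted lifetime value",
                     "ML model clv_v3", date(2021, 3, 1)),
])

# A new data scientist can now ask what the field meant for 2017 records
print(customer_value.definition_on(date(2017, 5, 1)).derivation)  # SUM(order_total)
```

With even this much metadata retained next to the data, the marketing segmentation mistake described above would have been visible before the campaign ran.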
Our effort in writing this book is to strive to highlight for the data engineer the reality that in our current information technology solutions, we process data as information, when, in fact, we want to use it to inform the business knowledgably.
Today, we glue solutions together with code that manipulates data to mimic information for business consumption. What we really want to do is retain the business information with the data and make the data smart, so that information in context forms knowledge that yields insights for the data consumer. The progression begins with raw data, which is transformed into information and then into knowledge through the preservation of semantics and context; finally, analytic-derived insights are developed. This progression will be elaborated on in future chapters. In Chapter 18, we have included a number of use cases that you will find interesting. From my experience over the years, I’ve learned that making data smarter has always been rewarded.
The resulting insights may be presented to the business in new innovative manners when the business requires those insights from data. The gap we see in the technology landscape is that in order for data to be leveraged as an insight generator, its data journey must be an informed one. Innovation can’t be pre-canned by the software engineer. It is teased out of the minds of business and IT leaders from the knowledge the IT data system presents from different stages of the data journey. This requires data, its semantics, its lineage, its direct or inferred relationships to concepts, its time series, and its context to be retained.
Technology tools and data processing techniques are not yet available to address this need in a single solution, but the need is clearly envisioned. One monolithic data warehouse, data lake, knowledge graph, or in-memory repository can’t solve the total user-originated demand today. Tools need time to evolve. We will need to implement tactically and think strategically regarding what data (also known as truths) we present to the analyst.
Key thought
Implement: Just enough, just in time.
Think strategically: Data should be smart.
Applying innovative modeling approaches can bring systemic and intrinsic risk. Leveraging new technologies will produce key advantages for the business. Minimizing the risk of technical or delivery failure is essential. When thinking of the academic discussions debating data mesh versus data fabric, we see various cloud vendors and tool providers embracing the need for innovation… but also creating a new technical gravity that can suck in the misinformed business IT leader.
Remember, this is an evolutionary event, and for some it can become an extinction-level event. Microsoft and Amazon can embrace well-architected best practices that foster greater cloud spend and greater cloud vendor lock-in. Cloud platform-as-a-service (PaaS) offerings, cloud architecture patterns, and biased vendor training can be terminal events for a system and its builders. The same goes for tool providers such as the creators of relational database management systems (RDBMSs), data lakes, operational knowledge graphs, or real-time in-memory storage systems. None of the providers or their niche consulting engagements come with warning signs. As a leader trying to minimize risk and maximize gain, you need to keep an eye on the end goal:
“I want to build a data solution that no one can live without – that lasts forever!”
To accomplish this goal, you will need to be very clear on the mission and retain a clear vision going forward. With a well-developed set of principles, best practices, a clear position on key considerations, and an unchallenged governance model… the objective is attainable. Be prepared for battle! The field is always evolving, and there will be challenges to the architecture over time, maybe before it is even operational. Our suggestion is to always be ready for these challenges and not to count on political power alone to enforce compliance or governance of the architecture.
You will want to consider these steps when building a modern system:
- Collect the objectives and key results (OKRs) from the business and show successes early and often.
- Always have a demo ready for key stakeholders at a moment’s notice.
- Keep those key stakeholders engaged and satisfied as the return on investment (ROI) is demonstrated. Also, remember that they are funding your effort.
- Keep careful track of the feature-to-cost ratio and know who is getting value and at what cost as part of the system’s total cost of ownership (TCO).
- Never break a data service level agreement (SLA) or data contract without giving the stakeholders and users enough time to accommodate the impacts. It’s best not to break the agreement at all, since it clearly defines the data consumer’s expectations!
- Architect data systems that are backward compatible and never produce a broken contract once the business has engaged the system to glean insight (see the sketch after this list). Pulling the rug out from under the business will have more impact than not delivering a solution in the first place, since they will have set up their downstream expectations based on your delivery.

You can see that there are many patterns to consider and some to avoid when building a modern data solution. Software engineers, data admins, data scientists, and data analysts will come with their own perspectives and technical requirements in addition to the OKRs that the business will demand. Not all technical players will honor the nuances that their peers’ disciplines require. Yet, the data engineer has to deliver the future-proof solution while balancing on top of a pyramid of change.
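To make the backward-compatibility point concrete, here is a minimal sketch in plain Python that flags schema changes that would break an existing data contract. The function, schemas, and column names are hypothetical; a real implementation would typically hook into your schema registry or CI pipeline rather than a hand-written dictionary:

```python
from typing import Dict, List

def breaking_changes(current: Dict[str, str], proposed: Dict[str, str]) -> List[str]:
    """Compare two column->type schemas and list changes that would
    break downstream consumers of the data contract."""
    problems = []
    for column, col_type in current.items():
        if column not in proposed:
            problems.append(f"column removed: {column}")
        elif proposed[column] != col_type:
            problems.append(f"type changed: {column} {col_type} -> {proposed[column]}")
    # New columns are additive and treated as backward compatible.
    return problems

current_contract = {"order_id": "string", "order_total": "decimal", "customer_id": "string"}
proposed_schema = {"order_id": "string", "order_total": "float", "channel": "string"}

issues = breaking_changes(current_contract, proposed_schema)
if issues:
    print("Do not deploy without stakeholder sign-off:", issues)
```

Additive changes such as new columns pass quietly, while removals and type changes are surfaced before stakeholders are surprised by them.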
In the next section, we will show you how to keep the technological edge and retain the balance necessary to create a solution that withstands the test of time.
To future-proof a solution means to create a solution that is relevant to the present, scalable, and cost-effective, and will still be relevant in the future. This goal is attainable with a constant focus on building out a reference architecture with best practices and design patterns.
The goal is as follows:
Develop a scalable, affordable IT strategy, architecture, and design that leads to the creation of a future-proof data processing system.
When faced with the preceding goal, you have to accept that change is evolutionary rather than revolutionary, and the data architecture must be built to stay solid and future-proof through that evolution. Making a system 100% future-proof is an illusion; however, the goal of attaining a near future-proof system must always remain a prime driver of your core principles.
The attraction of shiny lights must never become bait to catch an IT system manager in a web of errors, even though cool technology may attract a lot of venture and seed capital or even create a star on one’s curriculum vitae (CV). It may just as well all fade away after a breakthrough in a niche area is achieved by a disrupter. Just look at what happened when OpenAI, ChatGPT, and related large language model (LLM) technology started to roll out. Conversational artificial intelligence (AI) has changed many systems already.
After innovation rollout, what was once hard is now easy and often available in open source to become commoditized. Even if a business software method or process-oriented intellectual property (IP) is locked away with patent protection… after some time – 10, 15, or 20 years – it is also free for reuse. In the filing disclosure of the IP, valuable insights are also made available to the competition. There can only be so many cutting-edge tech winners, and brilliant minds tend to develop against the same problem at the same time until a breakthrough is attained, often creating similar approaches. It is at this stage that data engineering is nearing an inflection point.
There will always be many more losers than winners. Depending on the size of an organization’s budget and its culture for risk/reward, there can arise a shiny light idea that becomes a blazing star. 90% of those who pursue the shooting star wind up developing a dud that fades away along with an entire IT budget. Our suggestion is to follow the business’s money and develop agilely to minimize the risk of IT-driven failure.
International Data Corporation (IDC) and the business intelligence organization Qlik came up with the following comparison:
“Data is the new water.”
You can say that data is oil or that it is water – a great idea is getting twisted and repurposed, even in these statements. It’s essential that data becomes information and that information is rendered in such a way as to create direct, inferred, and derived knowledge. Truth needs to be defined as knowledge in context, including time. We need systems that are not mere data processing systems but knowledge-aware systems that support intelligence, insight, and the development of truths that withstand the test of time. In that way, a system may be future-proof. Data is too murky, like dirty water. It’s clouded by the following:
- Nonsense structures developed to support current machine insufficiency
- Errors due to misunderstanding of the data’s meaning and lineage
- Deliberate opacity due to privacy and security
- Missing context or state due to missing metadata
- Missing semantics due to complex relationships not being recorded, because of missing data and a lack of funding to properly model the data for the domain in which it was collected

Data life cycle processes and costs are often not considered fully. Business use cases drive what is important (note: we will elaborate a lot more on how use cases are represented by conceptual, logical, and physical architectures in Chapters 5-7 of this book). Use cases are often not identified early enough. The data services that were implemented as part of the solution are often left undocumented. They are neither communicated well nor maintained well over the data’s timeframe of relevancy. The result is that the data’s quality melts down like a sugar cube left in the rain: its efficacy degrades organically over time. This may be accelerated by the business and technical contracts not being maintained, and with that neglect comes the loss of trust in a dataset’s governance. The resulting friction between business silos becomes palpable. A potential solution has been to create business data services with data contracts. These contracts are defined by well-maintained metadata and describe the dataset at rest (its semantics) as well as its origin (its lineage) and security methods. They also include software service contracts for the timely maintenance of the subscribed quality metrics.
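As a rough illustration of what such a data contract might capture, the following sketch bundles semantics, lineage, security, and quality expectations into one publishable artifact. It is plain Python; the dataset name, owner, thresholds, and field choices are invented for illustration and do not follow any particular tool’s standard:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DataContract:
    """Metadata that a business data service publishes alongside a dataset."""
    dataset: str
    owner: str                   # accountable business owner
    semantics: Dict[str, str]    # column -> business meaning
    lineage: List[str]           # upstream sources and pipeline stages
    security: str                # classification or access policy label
    freshness_sla_hours: int     # maximum acceptable data age
    min_completeness: float      # required fraction of non-null rows

orders_contract = DataContract(
    dataset="retail.orders_gold",
    owner="merchandising-analytics",
    semantics={"order_total": "Order value in EUR, after discounts, before tax"},
    lineage=["pos_feed_raw", "orders_bronze", "orders_silver"],
    security="internal-restricted",
    freshness_sla_hours=24,
    min_completeness=0.98,
)
```

The point is less the exact fields than the habit: the contract travels with the data, so consumers never have to reverse-engineer meaning from the pipeline code.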
Businesses need to enable datasets to be priced, enhanced as value-added sets, and even sold to the highest bidder. This is driven over time by the cost of maintaining data systems, which can only increase. Keeping the data relevant (correct) while it is submitted for value-added enrichment and re-integration into commoditized data exchanges is a key objective:
Don’t move data; enrich it in place along with its metadata to preserve semantics and lineage!
The highest bidder builds on the data according to the framework architecture and preserves the semantic domain for which the data system was modeled. Like a ratchet that never loses its grip, datasets need to remain correct and hold on to reality over time. The reality for which the dataset was created can then be preserved by value-added resellers without sacrificing quality or the data service level.
Observe that, over time, the cost of maintaining data correctness, context, and relevance will exceed any single organization’s ability to sustain it for a domain. Naturally, it remains instinctual for the IT leader to hold on to the data and produce a silo. This natural tendency to hide the imperfections of an established system that is literally melting down must be addressed in the future data architecture’s approach. Allowing the data to evolve/drift, be value-added, and yet remain correct and maintainable is essential. Imperfect alignment of facts, assertions, and other modeled relationships within a domain would be diminished with this approach.
Too often in today’s processing systems, the data is curated to the point where it is considered good enough for now. Yet, it is not good enough for future repurposing. It carries all the assumptions, gaps, fragments, and partial data implementations that made it just good enough. If the data is right and self-explanatory, its data service code is simpler. The system solution is engineered to be elegant. It is built to withstand the pressure of change since the data organization was designed to evolve and remain 100% correct for the business domain.
“There is never enough time or money to get it right… the first time! There is always time to get it right later… again and again!”
This pragmatic approach can stop the IT leader’s search for a better data engineering framework. Best practices could become a bother, since the solution just works and we don’t want to fix what works. However, you must get real regarding the current tooling choices available. The cost to implement any solution must be a right fit, yet as part of the architecture due diligence process, you still need to push against the edge of technology to seize innovation opportunities when they are ripe for the taking.
Consider semantic graph technology in OWL/RDF, with its modeling and validation complexities and SPARQL querying, compared to using labeled property graphs with custom code for the semantic representation of data in a subject domain’s knowledge base. Both have advantages and disadvantages; however, neither scales without implementing a change-data-capture mechanism that syncs an in-memory analytics storage area to support real-time analytics use cases. Cloud technology has not kept up with making a one-size-fits-all data store, data lake, or data warehouse. Put better, one technology solution that fits all use cases and operational service requirements does not exist.
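For readers who have not touched the RDF side of that comparison, here is a minimal sketch using the open source rdflib library; the chemical-property triples and the example namespace are invented for illustration:

```python
# pip install rdflib  -- assumed available for this sketch
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")

g = Graph()
# Assert two triples: acetone is a solvent with a boiling point of 56.05 C
g.add((EX.acetone, RDF.type, EX.Solvent))
g.add((EX.acetone, EX.boilingPointC, Literal(56.05)))

# SPARQL query over the in-memory triple store
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?solvent ?bp WHERE {
        ?solvent a ex:Solvent ;
                 ex:boilingPointC ?bp .
    }
""")
for row in results:
    print(row.solvent, row.bp)
```

A labeled property graph would express the same facts as nodes and typed edges, queried with custom code or a vendor query language instead of SPARQL, which is exactly the trade-off the paragraph above describes.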
Since one size does not fit all, one data representation does not fit all use cases.
A monolithic data lake, Delta Lake, raw data storage, or data warehouse does not fit the business needs. Logical segmentation and often physical segmentation of data are needed to create the right-sized solution needed to support required use cases. The data engineer has to balance cost, security, performance, scale, and reliability requirements, as well as provider limitations. Just as one shoe size does not fit all… the solution has to be implementable and remain functional over time.
One facet of the data engineering best practices presented in this book is the need for a primary form of data representation for important data patterns. A raw ingest zone is envisioned to hold input Internet of Things (IoT) data, raw retailer point-of-sale data, chemical property reference data, or web analytics usage data. We are proposing that the concept of the zone be a formalization of the layers set forth in the Databricks Medallion Architecture (https://www.databricks.com/glossary/medallion-architecture). It may be worth reading through the structure of that architecture pattern or waiting until you get a chance to read Chapter 6, where a more detailed explanation is provided.
Raw data may need data profiling applied as part of ingest processing, to make sure that input data is not rejected due to syntactic or semantic incorrectness. This profiled data may even be normalized in a basic manner prior to the next stage of the data pipeline journey. Its transformation then proceeds into the bronze zone, later into the silver zone, then the gold zone, and finally the data is made ready for the consumption zone (for real-time, self-serve analytics use cases).
The bronze, silver, and gold zones host information of varying classes. The gold zone’s data organization looks a lot like a classic data warehouse, and the bronze zone looks like a data lake, with the silver zone being a cache-enabled data lake holding a lot of derived, imputed, and inferred data drawn from processing data in the bronze zone. This silver zone data supports online transaction processing (OLTP) use cases but stores processed outputs in the gold zone. The gold zone may also support OLTP use cases directly against information.
The consumption zone is enabled to provide for the measures, calculated metrics, and online analytic processing (OLAP) needs of the user. Keeping it all in sync can become a nightmare of complexity without a clear framework and best practices to keep the system correct. Just think about the loss of linear dataflow control in an AWS or Azure cloud PaaS solution required to implement this zone blueprint. Without a clear architecture, data framework, best practices, and governance… be prepared for many trials and errors.
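As a minimal sketch of that zone flow, assuming a PySpark session with Delta Lake configured, data might move from raw landing through the bronze, silver, and gold zones like this; the storage paths, table names, and columns are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("zone-pipeline").getOrCreate()

# Raw/ingest zone: land the data exactly as received.
raw = spark.read.json("s3://lake/raw/pos_sales/")

# Bronze zone: persist the ingested records with minimal typing.
raw.write.format("delta").mode("append").save("s3://lake/bronze/pos_sales")

# Silver zone: cleanse, deduplicate, and derive columns.
bronze = spark.read.format("delta").load("s3://lake/bronze/pos_sales")
silver = (bronze
          .dropDuplicates(["transaction_id"])
          .filter(F.col("amount").isNotNull())
          .withColumn("sale_date", F.to_date("sold_at")))
silver.write.format("delta").mode("overwrite").save("s3://lake/silver/pos_sales")

# Gold zone: aggregate into warehouse-style facts for consumption.
gold = silver.groupBy("store_id", "sale_date").agg(F.sum("amount").alias("daily_sales"))
gold.write.format("delta").mode("overwrite").save("s3://lake/gold/daily_store_sales")
```

In practice, each hop would also record lineage and data quality metrics so that the consumption zone can trust what it serves.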
When architecting, data engineering best practices must take into consideration current cloud provider limitations and constraints that drive costs for data movement and third-party analytics tool deployment. Consider the ultimate: a zettabyte cube of memory with sub-millisecond access for terabytes of data, where compute code resides with data to support relationships in a massive fabric or mesh. Impossible, today! But wait… maybe tomorrow this will be reality. Meanwhile, how do you build today in order to effortlessly move to that vision in the future? This is the focus of the best practices in this book. All trends point to the eventual creation of big data, AI-enabled data systems.
There are some key trends and concepts forming as part of that vision. Data sharing, confidential computing, and concepts such as bring your algorithm to the data