Revolutionize your approach to data processing in the fast-paced business landscape with this essential guide to data engineering. Discover the power of scalable, efficient, and secure data solutions through expert guidance on data engineering principles and techniques. Written by two industry experts with over 60 years of combined experience, it offers deep insights into best practices, architecture, agile processes, and cloud-based pipelines.
You’ll start by defining the challenges data engineers face and learning how an agile, future-proof data solution architecture addresses them. As you explore the extensive toolkit and master the capabilities of its various instruments, you’ll gain the knowledge needed for independent research. Covering everything from data engineering fundamentals onward, the guide uses real-world examples to illustrate potential solutions, and it elevates your skills in architecting scalable data systems, implementing agile development processes, and designing cloud-based data pipelines. The book further equips you to harness serverless computing and microservices to build resilient data applications.
By the end, you'll be armed with the expertise to design and deliver high-performance data engineering solutions that are not only robust, efficient, and secure but also future-ready.
Data Engineering Best Practices
Architect robust and cost-effective data solutions in the cloud era
Richard J. Schiller
David Larochelle
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Apeksha Shetty
Publishing Product Manager: Nilesh Kowadkar
Book Project Manager: Hemangi Lotlikar
Senior Editor: David Sugarman
Technical Editor: Sweety Pagaria
Copy Editor: Safis Editing
Proofreader: David Sugarman
Indexer: Manju Arasan and Tejal Soni
Production Designer: Alishon Mendonca
DevRel Marketing Coordinator: Nivedita Singh
First published: September 2024
Production reference: 1060924
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK
ISBN 978-1-80324-498-3
www.packtpub.com
Richard J. Schiller is a chief architect, distinguished engineer, and startup entrepreneur with 40 years of experience delivering real-time large-scale data processing systems. He holds an MS in computer engineering from Columbia University’s School of Engineering and Applied Science and a BA in computer science and applied mathematics. He has been involved with two prior successful startups and has coauthored three patents. He is a hands-on systems developer and innovator.
David Larochelle has been involved in data engineering for startups, Fortune 500 companies, and research institutes. He holds a BS in computer science from the College of William & Mary, a Master’s in computer science from the University of Virginia, and a Master’s in communication from the University of Pennsylvania. David’s career spans over 20 years across a wide range of organizations, including startups, established companies, and research labs.
Kamal Baig has over 19 years of experience within the IT space. He has a solid background in data and application development integration and seamlessly transitioned into the Azure solutions architect role. Throughout his career, Kamal has consistently demonstrated a deep understanding of data architecture principles and best practices, leveraging Azure technologies to design and implement cutting-edge solutions that meet the complex needs of modern enterprises. His expertise spans data analytics modernization, data warehouses, data mesh, and data products. Coming from CPG, hospitality, and education domains, he has designed scalable data solutions to ensure security, compliance, and regulatory requirements to align with organizational goals.
John Bremer has 20 years of experience in the market research and data science space. A pioneer creating impactful innovation and value for clients and stakeholders, John has successfully designed and executed research and data strategies and projects for various industries and sectors, leveraging his expertise in data analysis, data mining, and data science. As the President of Phantom 4 Solutions, he provides on-demand support and consulting for organizations in many roles, including Chief Research Officer, Chief Data Science Officer, or Chief Data Analytics Officer. John has a proven track record of managing and transforming high-performance quant teams, and is a respected and valued consultant and decision-maker on data-related matters.
Lindsey Nix is an experienced product manager with a demonstrated history of working in the aerospace, finance, and semiconductor industries. Lindsey is skilled in management, system requirements, software documentation, technical writing, business development, strategic planning, and information assurance. She is a strong consulting professional with a Master’s degree in business administration, systems engineering, and data analytics from San Jose State University.
Shanthababu Pandian has over 23 years of IT experience, specializing in data architecting, engineering, analytics, DQ&G, data science, ML, and Gen AI. He holds a BA in electronics and communication engineering, three Master’s degrees (M.Tech, MBA, M.S.) from a prestigious Indian university, and has completed postgraduate programs in AIML from the University of Texas and data science from IIT Guwahati. He is a director of data and AI in London, UK, leading data-driven transformation programs focusing on team building and nurturing AIML and Gen AI. He helps global clients achieve business value through scalable data engineering and AI technologies. He is also a national and international speaker, author, technical reviewer, and blogger.
Marianna Petrovich brings over 30 years of experience to the table. Her passion for software engineering, cloud and data intricacies, quality, and governance is evident in her work. Marianna’s expertise in data engineering has made her a sought-after consultant and advisor. Trusted for her knowledge of modern data platforms and cloud tools, she guides clients with her exceptional skills in both data and engineering. Currently, she heads the enterprise data engineering team at Circana. Holding a Master’s degree in big data from ASU, Marianna resides in Northern California with her husband and eight children. Her aspiration is to inspire the next generation by teaching data engineering to children.
Bill Sun is a senior IT enterprise and solutions architect with expertise in cloud computing, big data, AI/ML, and DevOps. Known for his strong communication skills and leadership, Bill has driven significant projects at Fortune 500 companies. His accomplishments include cloud migrations, data pipeline optimizations, and the development of unified platform services. Bill holds a Master’s in computer science from Johns Hopkins, BA degrees from Tsinghua University, and multiple certifications, including Azure and AWS.
Are you an IT professional, IT manager, or business leader looking for an effective large-scale data engineering solution platform? Have you experienced the pain of slogging through piles of literature? Have you had to implement a series of painful proofs of concept? If so, this book is for you.
You will emerge on the other side able to implement correctly architected, data-engineered solutions that address real problems you will face in the development process.
Data engineering is rapidly evolving, and the modern data engineer needs to be equipped with software engineering practices to succeed in today’s fast-paced data-driven world. This hands-on book takes a practical approach to applying software and data engineering practices to modern use cases, including the following:
- Migrating to cloud-based storage and processing
- Applying Agile methodologies
- Prioritizing governance, privacy, and security

This book is ideal for data engineers and analytics teams looking to enhance their skills and gain a competitive edge in the industry. While reading the book, you will be prompted with ideas, questions, and plans for implementation that you would not have considered otherwise.
This book assumes that you have a foundational knowledge of at least one cloud vendor service, in particular, Amazon Web Services (AWS) or Microsoft’s Azure. Additionally, you should be well versed in a scripting language (such as Python) and a primary language (such as Java or C/C++), have encountered concurrent/distributed big data processing, and ideally have some experience with analytic services such as Azure Analysis Services (AAS), Microsoft Power BI, or other third-party analytic solutions. This book is largely aimed at developers and architects who understand Python and cloud computing but want a complete framework for future-proofing successful solutions.
The book is not prescriptive regarding IT solutions, but it does raise key considerations to evaluate as the technology field evolves. After reading this book, IT architects will be equipped to engage with cloud vendors and third-party vendors following best practices, so that any solution developed remains robust, of high quality, and cost-effective over time.
This book’s structure is as follows:
- Mission/vision
- Principles
- Architecture
- Best practices
- Design patterns
- Use cases

Where pertinent, vendor selection criteria are presented, with business value statements affecting their weighting, so that decisions correctly implement an organization’s goals. Real-life examples and lessons sum up key points. The book is structured to enable you to envision a reference architecture for your organization and then see the implementation of the business solution in the context of that reference architecture. As you absorb the content of the chapters, it is a best practice to organize the solution forming in your mind. This is our first key consideration:
“Envision what it means to my company’s goals.”
Organize your notes and takeaways from the perspective of “What does it mean for my goals?” while building up a reference architecture and solution strawman.
By the end of this book, you will be able to architect, design, and implement end-to-end cloud-based data processing pipelines. You will also be able to provide customers with access to data as a product supporting various machine learning, analytic, and big data use cases… all within a well-architected data framework. You will know how to build or buy logical components aligned to the architected data framework’s principles and best practices using Agile software development processes tuned to work for an organization. Although this book will not supply all the answers, it will shine a light on the path to success while avoiding the pitfalls encountered by many, including the author’s own experiences. It will save you countless hours of frustration and enable more rapid creation of better-architected systems.
If you are an IT professional, IT manager, or business leader looking to build a large-scale data engineering solution, then this book will provide you with a solid set of best practices. As a data engineer, it will give you the details behind the best-practice recommendations so you can assess the right approaches for your effort. All this should take many hours of pain out of your engineering efforts. If you have to implement a series of proofs of concept, then this book points to the technologies and vendors that you should avoid so that the proof of concept does not become a proof of failure (POF). If all this is of interest to you, then this book is for you.
This book has been written at an intermediate level for data engineers, architects, and managers. There are no tools that you need on your desktop; however, if you want to become hands-on with the tools and technologies referenced, there will be short links (to the {https://packt-debp.link} domain) that are similar to traditional endnotes in each chapter. The journey toward best practices begins with the business context, the mission, vision, and principles that set the foundation for success, and then the development of an architecture. This is followed by engineering designs across a number of important areas driven by people, process, and technology needs.
As the book progresses, the technical topics get deeper, ending with machine learning and GenAI, a practical look at how to tune LLMs with RAG and prompt engineering, and a thorough exploration of knowledge engineering.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Data Engineering Best Practices, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
- Scan the QR code or visit the following link: https://packt.link/free-ebook/978-1-80324-498-3
- Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly.

We begin with the task of defining the business problem statement.
“Businesses are faced with an ever-changing technological landscape. Competition requires one to innovate at scale to remain relevant; this causes a constant implementation stream of total cost of ownership (TCO) budget allocations for refactoring and re-envisioning during what would normally be a run/manage phase of a system’s lifespan.”
This rapid rate of change means the goalposts are constantly moving. “Are we there yet?” is a question I heard from my kids constantly when traveling. It came from not knowing where we were or having any idea of the effort to get to where we were going, with a driver (me) who had never driven to that destination before. Thank goodness for Garmin (automobile navigation systems) and Google Maps, and not the outdated paper maps that were used in the past. See how technology even impacted that metaphor? Garmin is being displaced by Google for mapping use cases. This is not always because it is better but because it is free (if you wish to be subjected to data collection and advertising interruptions) and it is hosted on everyone’s smart device.
Now, I can tell my grandkids that in exactly 1 hour and 29 minutes, they will walk into their home after spending the weekend with their grandparents. The blank stare I get in response tells it all. Mapped data, rendered with real-time technology, has changed us completely.
Technological change can appear revolutionary when it’s occurring, but when looking back over time, the progression of change appears to be a no-brainer series of events that we take for granted, and even evolutionary. That is what is happening today with data, information, knowledge, and analytical data stores in the cloud. The term DataOps was popularized by Andy Palmer, co-founder and CEO of Tamr {https://packt-debp.link/MGj4EU}. The data management and analytics world has referenced the term often. In 2015, Palmer stated that DataOps is not just a buzzword, but a critical approach to managing data in today’s complex, data-driven world.
I believe that it’s time for data engineers and data scientists to embrace a similar (to DevOps) new discipline – let’s call it DataOps – that at its core addresses the needs of data professionals on the modern internet and inside the modern enterprise. (Andy Palmer {https://packt-debp.link/ihlztK})
In Figure 1.1, observe how data quality, integration, engineering, and security are tied together with a solid DataOps practice:
Figure 1.1 – DataOps in the enterprise
The goal of this chapter is to set up the foundation for understanding why the best practices presented in this book are structured as they are. This foundation will provide a firm footing to make the framework you adopt in your everyday engineering tasks more secure and well-grounded. There are many ways to look at solutions to data engineering challenges, and each vendor, engineering school, and cloud provider will have its own spin on the formula for success. That success will ultimately depend on what you can get working today and keep working in the future. A unique balance of various forces will need to be obtained. However, this balance may be easily upset if the foundation is not correct.

As a reader, you will have naturally formed biases toward certain engineering challenges. These can force you into niche (or single-minded) focus directions – for example, a fixation on robust, highly available multi-region operations with a de-emphasized pipeline software development effort. As a result, you may overbuild robustness and underdevelop key features. Likewise, you can focus on hyper-agile streaming of development changes into production at the cost of consumer data quality.

More generally, there is a significant risk from just doing IT and losing focus on why we need to carefully structure the processing of data in a modern information processing system. You must not neglect the need to capture data with its semantic context, thus making it true and relevant, instead of the software system becoming the sole interpretation of the data. This freedom makes data and context equal to information that is fit for purpose, now and in the future.
We can begin with the business problem statement.
Data engineering approaches are rapidly morphing today. They will coalesce into a systemic, consistent whole. At the core of this transformation is the realization that data is information that needs to represent facts and truths along with the rationalization that created those facts and truths over time. There must not be any false facts in future information systems. That term may strike you as odd. Can a fact be false? This question may be a bit provocative. But haven’t we often built IT systems to determine just that?
We process data in software systems that preserve business context and meaning but force the data to be served only through those systems. The data does not stand alone, and if consumed out of context, it would lead to these false facts propagating into the business environment. Data can’t stand alone today; it must be transformed by information processing systems, which have technical limitations. Pragmatic programmers’ {https://packt-debp.link/zS3jWY} imperfect tools and technology will produce imperfect solutions. Nevertheless, the engineer is still tasked with removing as many false facts as possible, if not all of them, when producing a solution. That has been elusive in the past.
We often take shortcuts. We also justify these shortcuts with statements like: “there simply is not enough time!” or “there’s no way we can get all that data!” The business “can’t afford to curate it correctly,” or lastly “there’s no funding for boiling the ocean.” We do not need to boil the ocean.
What we are going to think about is how we are going to turn that ocean directly into steam! This should be our response, not a rollover! This rethinking mindset is exactly what is needed as we engineer solutions that will be future-proof. What is hard is still possible if we rethink the problem fully. To turn that metaphor around – we will use data as the new fuel for the engine of innovation.
Fun fact
In 2006, mathematician Clive Humby coined the phrase “data is the new oil” {https://packt-debp.link/SiG2rL}.
Data systems must become self-healing of false facts to enable them to be knowledge-complete. After all, what is a true fact? Is it not just a hypothesis backed up by evidence until such time that future observations disprove a prior truth? Likewise, organizing information into knowledge requires not just capturing semantics, context, and time series relevance but also the asserted reason for a fact being represented as information truth within a dataset. This is what knowledge defines: truth! However, it needs correct representation.
Note
The truth of a knowledge base is composed of facts that are proven by assertions that withstand the test of time and do not hide information context that makes up the truth contained within the knowledge base.
But sometimes, when we do not have enough information, we guess. This guessing is based on intuition and prior experience with similar patterns of interconnected information from related domains. We humans can be very wrong with our guesses. But strongly intuited guesses can lead to great leaps in innovation which can later be backfilled with empirically collected data.
Until then, we often stretch the truth to span gaps in knowledge. Information relationship patterns need to be retained, along with the hypotheses that record these educated guesses. In this manner, data truths can be guessed. They can also be guessed well! These guesses can even be unwound when proven to be wrong. It is essential that data is organized in a new way to support intelligence. Reasoning is needed to support or refute hypotheses, and retaining information as knowledge to form truth is just as important. If we don’t address organizing big data to form knowledge and truth within a framework consumable by the business, we are just wasting cycles and funding on cloud providers.
This book will focus on best practices; there are a couple of poor practices that need to be highlighted. These form anti-patterns that have crept into the data engineer’s tool bag over time that hinder the mission we seek to be successful in. Let’s look into these anti-patterns next.
What are anti-patterns? To answer that, start with patterns: architectural patterns form blueprints that ease implementation. Just like when constructing a physical building, a civil architect will use blueprints to definitively communicate expectations to the engineers. If a common solution is recurring and successful, it is reused often as a pattern, like the framing of a wall or a truss for a type of roofline. An anti-pattern, by contrast, is a pattern to be avoided: for example, running plumbing through an outside wall in a cold climate, because the cold temperature could freeze those pipes.
The first anti-pattern we describe deals with retaining stuff as data that we think is valuable but that can no longer be understood or processed given how it was stored; its contextual meaning is lost because it was never captured when the data was first retained in storage (such as cloud storage).
The second anti-pattern involves not knowing the business owner’s meaning for column-formatted data, or how those columns relate to each other to form business meaning, because this meaning was only preserved in the software solution, not in the data itself. We rely on entity relationship diagrams (ERDs) that are not worth the paper they were printed on to gain some degree of clarity, which is lost the next time an agile developer fails to update them. Knowing what we must avoid as we develop a future-proof, data-engineered solution will help set the foundation for this book.
In order to get a better understanding of the two anti-patterns just introduced, the following specific examples should help illustrate what to avoid.
As an example of what not to do, in the past, I examined a system that retained years of data, only to be reminded that the data was useless after three months. This is because the processing code that created that data had changed hundreds of times in prior years and continued to evolve without being noted in the dataset produced by that processing. The assumptions put into those non-mastered datasets were not preserved in the data framework. Keeping that data around was a red herring, just waiting for some future big data analyst to try and reuse it. When I asked, “Why was it even retained?” I was told it had to be, according to company policy. We are often faced with someone who thinks piles of stuff are valuable, even if they’re not processable. Some data can be the opposite of valuable. It can be a business liability if reused incorrectly. Waterfall-gathered business requirements or even loads of agile development stories will not solve this problem without a solid data framework for data semantics as well as data lineage for the data’s journey from information to knowledge. Without this smart data framework, the insights gathered would be wrong!
Likewise, as another not-to-do example, I once built an elaborate, colorful graphical rendering of web consumer usage across several published articles. It was truly a work of art, though I say so myself. The insight clearly illustrated that some users were just not engaging with a few key classes of information that were expensive to curate. However, it was a work of pure fiction and had to be scrapped! This was because I misused one key dataset column that was loaded with data that was, in fact, the inverted rank of users’ access rather than an actual usage value.
During the development of the data processing system, the prior developers produced no metadata catalog, no data architecture documentation, and no self-serve textual definitions of the columns. All that information was retained in the mind of one self-serving data analyst. The analyst was holding the business data hostage and pocketing huge compensation for generating insights that only that individual could produce. Any attempt to dethrone this individual was met with one key and powerful consumer of the insights overruling IT management. As a result, the implementation of desperately needed, governance-mandated enterprise standards for analytics was stopped. Using the data in such an environment was a walk through a technical minefield.
Organizations must avoid this scenario at all costs. It is a data-siloed, poor-practice anti-pattern. It arises due to individuals seeking to preserve a niche position or a siloed business agenda. In the case just illustrated, that anti-pattern was used to kill the governance-mandated enterprise standard for analytics. The problem can be prevented by properly implementing governance in a data framework where data becomes self-explanatory.
Let’s consider a real-world scenario that illustrates both of these anti-patterns. A large e-commerce company has many years of customer purchase data that includes a field called customer_value. Originally, this field was calculated as the total amount the customer spent, but its meaning has changed repeatedly over the years without updates to the supporting documentation. After a few years, it was calculated as total_spending – total_returns. Later, it became predicted_lifetime_value based on a machine learning (ML) model. When a new data scientist joins the company and uses the field to segment customers for a marketing campaign, the results are disastrous! High-value customers from early years are undervalued while new customers are overvalued! This example illustrates how retaining data without proper context (anti-pattern #1) and a lack of clear documentation for data fields (anti-pattern #2) can lead to significant mistakes.
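One hedged way to guard against this kind of silent definition drift is to version the business definition of each column alongside the data itself. The following minimal sketch is plain Python; the dates, formulas, and the customer_value history are invented to match the hypothetical scenario above, not taken from any real system:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class ColumnDefinition:
    """One versioned business definition of a dataset column."""
    definition: str       # plain-language business meaning
    derivation: str       # formula or model used to compute the value
    effective_from: date  # when this definition started to apply

@dataclass
class ColumnHistory:
    """Every definition a column has carried over its lifetime."""
    name: str
    versions: List[ColumnDefinition] = field(default_factory=list)

    def definition_on(self, as_of: date) -> ColumnDefinition:
        """Return the definition that applied on a given date."""
        applicable = [v for v in self.versions if v.effective_from <= as_of]
        if not applicable:
            raise ValueError(f"No definition of {self.name} before {as_of}")
        return max(applicable, key=lambda v: v.effective_from)

# Hypothetical history of the customer_value field from the scenario above
customer_value = ColumnHistory("customer_value", [
    ColumnDefinition("Total amount spent by the customer",
                     "SUM(order_total)", date(2015, 1, 1)),
    ColumnDefinition("Net spend after returns",
                     "SUM(order_total) - SUM(return_total)", date(2018, 6, 1)),
    ColumnDefinition("Predicted lifetime value",
                     "ML model clv_v3", date(2021, 3, 1)),
])

# A new data scientist can now ask what the field meant for 2017 records
print(customer_value.definition_on(date(2017, 5, 1)).derivation)  # SUM(order_total)
```

With even this much metadata retained next to the data, the marketing segmentation mistake described above would have been visible before the campaign ran.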
Our effort in writing this book is to strive to highlight for the data engineer the reality that in our current information technology solutions, we process data as information, when, in fact, we want to use it to inform the business knowledgably.
Today, we glue solutions together with code that manipulates data to mimic information for business consumption. What we really want to do is retain the business information with the data and make the data smart, so that information in context forms knowledge that yields insights for the data consumer. The progression begins with raw data, which is transformed into information and then into knowledge through the preservation of semantics and context; finally, analytic-derived insights are developed. This progression will be elaborated on in future chapters. In Chapter 18, we have included a number of use cases that you will find interesting. From my experience over the years, I’ve learned that making data smarter has always been rewarded.
The resulting insights may be presented to the business in new innovative manners when the business requires those insights from data. The gap we see in the technology landscape is that in order for data to be leveraged as an insight generator, its data journey must be an informed one. Innovation can’t be pre-canned by the software engineer. It is teased out of the minds of business and IT leaders from the knowledge the IT data system presents from different stages of the data journey. This requires data, its semantics, its lineage, its direct or inferred relationships to concepts, its time series, and its context to be retained.
Technology tools and data processing techniques are not yet available to address this need in a single solution, but the need is clearly envisioned. One monolithic data warehouse, data lake, knowledge graph, or in-memory repository can’t solve the total user-originated demand today. Tools need time to evolve. We will need to implement tactically and think strategically regarding what data (also known as truths) we present to the analyst.
Key thought
Implement: Just enough, just in time.
Think strategically: Data should be smart.
Applying innovative modeling approaches can bring systemic and intrinsic risk. Leveraging new technologies will produce key advantages for the business. Minimizing the risk of technical or delivery failure is essential. When thinking of the academic discussions debating data mesh versus data fabric, we see various cloud vendors and tool providers embracing the need for innovation… but also creating a new technical gravity that can suck in the misinformed business IT leader.
Remember, this is an evolutionary event, and for some it can become an extinction-level event. Microsoft and Amazon can embrace well-architected best practices that foster greater cloud spend and greater cloud vendor lock-in. Cloud platform-as-a-service (PaaS) offerings, cloud architecture patterns, and biased vendor training can be terminal events for a system and its builders. The same goes for tool providers such as the creators of relational database management systems (RDBMSs), data lakes, operational knowledge graphs, or real-time in-memory storage systems. None of the providers or their niche consulting engagements come with warning signs. As a leader trying to minimize risk and maximize gain, you need to keep an eye on the end goal:
“I want to build a data solution that no one can live without – that lasts forever!”
To accomplish this goal, you will need to be very clear on the mission and retain a clear vision going forward. With a well-developed set of principles, best practices, a clear position on key considerations, and an unchallenged governance model… the objective is attainable. Be prepared for battle! The field is always evolving, and there will be challenges to the architecture over time, maybe before it is even operational. Our suggestion is to always be ready for these challenges and not to count on political power alone to enforce compliance or governance of the architecture.
You will want to consider these steps when building a modern system:
- Collect the objectives and key results (OKRs) from the business and show successes early and often.
- Always have a demo ready for key stakeholders at a moment’s notice.
- Keep those key stakeholders engaged and satisfied as the return on investment (ROI) is demonstrated. Also, remember that they are funding your effort.
- Keep careful track of the feature-to-cost ratio and know who is getting value and at what cost as part of the system’s total cost of ownership (TCO).
- Never break a data service level agreement (SLA) or data contract without giving the stakeholders and users enough time to accommodate the impacts. It’s best not to break the agreement at all, since it clearly defines the data consumer’s expectations!
- Architect data systems that are backward compatible and never produce a broken contract once the business has engaged the system to glean insight (see the sketch after this list). Pulling the rug out from under the business will have more impact than not delivering a solution in the first place, since they will have set up their downstream expectations based on your delivery.

You can see that there are many patterns to consider and some to avoid when building a modern data solution. Software engineers, data admins, data scientists, and data analysts will come with their own perspectives and technical requirements in addition to the OKRs that the business will demand. Not all technical players will honor the nuances that their peers’ disciplines require. Yet, the data engineer has to deliver the future-proof solution while balancing on top of a pyramid of change.
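To make the backward-compatibility point concrete, here is a minimal sketch in plain Python that flags schema changes that would break an existing data contract. The function, schemas, and column names are hypothetical; a real implementation would typically hook into your schema registry or CI pipeline rather than a hand-written dictionary:

```python
from typing import Dict, List

def breaking_changes(current: Dict[str, str], proposed: Dict[str, str]) -> List[str]:
    """Compare two column->type schemas and list changes that would
    break downstream consumers of the data contract."""
    problems = []
    for column, col_type in current.items():
        if column not in proposed:
            problems.append(f"column removed: {column}")
        elif proposed[column] != col_type:
            problems.append(f"type changed: {column} {col_type} -> {proposed[column]}")
    # New columns are additive and treated as backward compatible.
    return problems

current_contract = {"order_id": "string", "order_total": "decimal", "customer_id": "string"}
proposed_schema = {"order_id": "string", "order_total": "float", "channel": "string"}

issues = breaking_changes(current_contract, proposed_schema)
if issues:
    print("Do not deploy without stakeholder sign-off:", issues)
```

Additive changes such as new columns pass quietly, while removals and type changes are surfaced before stakeholders are surprised by them.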
In the next section, we will show you how to keep the technological edge and retain the balance necessary to create a solution that withstands the test of time.
To future-proof a solution means to create a solution that is relevant to the present, scalable, and cost-effective, and will still be relevant in the future. This goal is attainable with a constant focus on building out a reference architecture with best practices and design patterns.
The goal is as follows:
Develop a scalable, affordable IT strategy, architecture, and design that leads to the creation of a future-proof data processing system.
When faced with the preceding goal, you have to accept that change is evolutionary rather than revolutionary, and the data architecture must be built to stay solid and future-proof through that evolution. Making a system 100% future-proof is an illusion; however, the goal of attaining a near future-proof system must always remain a prime driver of your core principles.
The attraction of shiny lights must never become bait to catch an IT system manager in a web of errors, even though cool technology may attract a lot of venture and seed capital or even create a star on one’s curriculum vitae (CV). It may just as well all fade away after a breakthrough in a niche area is achieved by a disrupter. Just look at what happened when OpenAI, ChatGPT, and related large language model (LLM) technology started to roll out. Conversational artificial intelligence (AI) has changed many systems already.
After innovation rollout, what was once hard is now easy and often available in open source to become commoditized. Even if a business software method or process-oriented intellectual property (IP) is locked away with patent protection… after some time – 10, 15, or 20 years – it is also free for reuse. In the filing disclosure of the IP, valuable insights are also made available to the competition. There can only be so many cutting-edge tech winners, and brilliant minds tend to develop against the same problem at the same time until a breakthrough is attained, often creating similar approaches. It is at this stage that data engineering is nearing an inflection point.
There will always be many more losers than winners. Depending on the size of an organization’s budget and its culture for risk/reward, there can arise a shiny light idea that becomes a blazing star. 90% of those who pursue the shooting star wind up developing a dud that fades away along with an entire IT budget. Our suggestion is to follow the business’s money and develop agilely to minimize the risk of IT-driven failure.
International Data Corporation (IDC) and the business intelligence organization Qlik came up with the following comparison:
“Data is the new water.”
You can say that data is oil or that it is water – a great idea is getting twisted and repurposed, even in these statements. It’s essential that data becomes information and that information is rendered in such a way as to create direct, inferred, and derived knowledge. Truth needs to be defined as knowledge in context, including time. We need systems that are not mere data processing systems but knowledge-aware systems that support intelligence, insight, and the development of truths that withstand the test of time. In that way, a system may be future-proof. Data is too murky, like dirty water. It’s clouded by the following:
- Nonsense structures developed to support current machine insufficiency
- Errors due to misunderstanding of the data’s meaning and lineage
- Deliberate opacity due to privacy and security
- Missing context or state due to missing metadata
- Missing semantics due to complex relationships not being recorded, because of missing data and a lack of funding to properly model the data for the domain in which it was collected

Data life cycle processes and costs are often not considered fully. Business use cases drive what is important (note: we will elaborate a lot more on how use cases are represented by conceptual, logical, and physical architectures in Chapters 5-7 of this book). Use cases are often not identified early enough. The data services that were implemented as part of the solution are often left undocumented. They are neither communicated well nor maintained well over the data’s timeframe of relevancy. The result is that the data’s quality melts down like a sugar cube left in the rain: its efficacy degrades organically over time. This may be accelerated by the business and technical contracts not being maintained, and with that neglect comes the loss of trust in a dataset’s governance. The resulting friction between business silos becomes palpable. A potential solution has been to create business data services with data contracts. These contracts are defined by well-maintained metadata and describe the dataset at rest (its semantics) as well as its origin (its lineage) and security methods. They also include software service contracts for the timely maintenance of the subscribed quality metrics.
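As a rough illustration of what such a data contract might capture, the following sketch bundles semantics, lineage, security, and quality expectations into one publishable artifact. It is plain Python; the dataset name, owner, thresholds, and field choices are invented for illustration and do not follow any particular tool’s standard:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DataContract:
    """Metadata that a business data service publishes alongside a dataset."""
    dataset: str
    owner: str                   # accountable business owner
    semantics: Dict[str, str]    # column -> business meaning
    lineage: List[str]           # upstream sources and pipeline stages
    security: str                # classification or access policy label
    freshness_sla_hours: int     # maximum acceptable data age
    min_completeness: float      # required fraction of non-null rows

orders_contract = DataContract(
    dataset="retail.orders_gold",
    owner="merchandising-analytics",
    semantics={"order_total": "Order value in EUR, after discounts, before tax"},
    lineage=["pos_feed_raw", "orders_bronze", "orders_silver"],
    security="internal-restricted",
    freshness_sla_hours=24,
    min_completeness=0.98,
)
```

The point is less the exact fields than the habit: the contract travels with the data, so consumers never have to reverse-engineer meaning from the pipeline code.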
Businesses need to enable datasets to be priced, enhanced as value-added sets, and even sold to the highest bidder. This is driven over time by the cost of maintaining data systems, which can only increase. Keeping the data relevant (correct) while it is submitted for value-added enrichment and re-integration into commoditized data exchanges is a key objective:
Don’t move data; enrich it in place along with its metadata to preserve semantics and lineage!
The highest bidder builds on the data according to the framework architecture and preserves the semantic domain for which the data system was modeled. Like a ratchet that never loses its grip, datasets need to remain correct and hold on to reality over time. The reality for which the dataset was created can then be preserved by value-added resellers without sacrificing quality or the data service level.
Observe that, over time, the cost of maintaining data correctness, context, and relevance will exceed any single organization’s ability to sustain it for a domain. Naturally, it remains instinctual for the IT leader to hold on to the data and produce a silo. This natural tendency to hide the imperfections of an established system that is literally melting down must be addressed in the future data architecture’s approach. Allowing the data to evolve/drift, be value-added, and yet remain correct and maintainable is essential. Imperfect alignment of facts, assertions, and other modeled relationships within a domain would be diminished with this approach.
Too often in today’s processing systems, the data is curated to the point where it is considered good enough for now. Yet, it is not good enough for future repurposing. It carries all the assumptions, gaps, fragments, and partial data implementations that made it just good enough. If the data is right and self-explanatory, its data service code is simpler. The system solution is engineered to be elegant. It is built to withstand the pressure of change since the data organization was designed to evolve and remain 100% correct for the business domain.
“There is never enough time or money to get it right… the first time! There is always time to get it right later… again and again!”
This pragmatic approach can stop the IT leader’s search for a better data engineering framework. Best practices could become a bother, since the solution just works and we don’t want to fix what works. However, you must get real regarding the current tooling choices available. The cost to implement any solution must be a right fit, yet as part of the architecture due diligence process, you still need to push against the edge of technology to seize innovation opportunities when they are ripe for the taking.
Consider semantic graph technology in OWL/RDF, with its modeling and validation complexities and SPARQL querying, compared to using labeled property graphs with custom code for the semantic representation of data in a subject domain’s knowledge base. Both have advantages and disadvantages; however, neither scales without implementing a change-data-capture mechanism that syncs an in-memory analytics storage area to support real-time analytics use cases. Cloud technology has not kept up with making a one-size-fits-all data store, data lake, or data warehouse. Put better, one technology solution that fits all use cases and operational service requirements does not exist.
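For readers who have not touched the RDF side of that comparison, here is a minimal sketch using the open source rdflib library; the chemical-property triples and the example namespace are invented for illustration:

```python
# pip install rdflib  -- assumed available for this sketch
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")

g = Graph()
# Assert two triples: acetone is a solvent with a boiling point of 56.05 C
g.add((EX.acetone, RDF.type, EX.Solvent))
g.add((EX.acetone, EX.boilingPointC, Literal(56.05)))

# SPARQL query over the in-memory triple store
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?solvent ?bp WHERE {
        ?solvent a ex:Solvent ;
                 ex:boilingPointC ?bp .
    }
""")
for row in results:
    print(row.solvent, row.bp)
```

A labeled property graph would express the same facts as nodes and typed edges, queried with custom code or a vendor query language instead of SPARQL, which is exactly the trade-off the paragraph above describes.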
Since one size does not fit all, one data representation does not fit all use cases.
A monolithic data lake, Delta Lake, raw data storage, or data warehouse does not fit the business needs. Logical segmentation and often physical segmentation of data are needed to create the right-sized solution needed to support required use cases. The data engineer has to balance cost, security, performance, scale, and reliability requirements, as well as provider limitations. Just as one shoe size does not fit all… the solution has to be implementable and remain functional over time.
One facet of the data engineering best practices presented in this book is the need for a primary form of data representation for important data patterns. A raw ingest zone is envisioned to hold input Internet of Things (IoT) data, raw retailer point-of-sale data, chemical property reference data, or web analytics usage data. We are proposing that the concept of the zone be a formalization of the layers set forth in the Databricks Medallion Architecture (https://www.databricks.com/glossary/medallion-architecture). It may be worth reading through the structure of that architecture pattern or waiting until you get a chance to read Chapter 6, where a more detailed explanation is provided.
Raw data may need data profiling applied as part of ingest processing, to make sure that input data is not rejected due to syntactic or semantic incorrectness. This profiled data may even be normalized in a basic manner prior to the next stage of the data pipeline journey. Its transformation then proceeds into the bronze zone, later into the silver zone, then the gold zone, and finally the data is made ready for the consumption zone (for real-time, self-serve analytics use cases).
The bronze, silver, and gold zones host information of varying classes. The gold zone’s data organization looks a lot like a classic data warehouse, and the bronze zone looks like a data lake, with the silver zone being a cache-enabled data lake holding a lot of derived, imputed, and inferred data drawn from processing data in the bronze zone. This silver zone data supports online transaction processing (OLTP) use cases but stores processed outputs in the gold zone. The gold zone may also support OLTP use cases directly against information.
The consumption zone is enabled to provide for the measures, calculated metrics, and online analytic processing (OLAP) needs of the user. Keeping it all in sync can become a nightmare of complexity without a clear framework and best practices to keep the system correct. Just think about the loss of linear dataflow control in an AWS or Azure cloud PaaS solution required to implement this zone blueprint. Without a clear architecture, data framework, best practices, and governance… be prepared for many trials and errors.
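As a minimal sketch of that zone flow, assuming a PySpark session with Delta Lake configured, data might move from raw landing through the bronze, silver, and gold zones like this; the storage paths, table names, and columns are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("zone-pipeline").getOrCreate()

# Raw/ingest zone: land the data exactly as received.
raw = spark.read.json("s3://lake/raw/pos_sales/")

# Bronze zone: persist the ingested records with minimal typing.
raw.write.format("delta").mode("append").save("s3://lake/bronze/pos_sales")

# Silver zone: cleanse, deduplicate, and derive columns.
bronze = spark.read.format("delta").load("s3://lake/bronze/pos_sales")
silver = (bronze
          .dropDuplicates(["transaction_id"])
          .filter(F.col("amount").isNotNull())
          .withColumn("sale_date", F.to_date("sold_at")))
silver.write.format("delta").mode("overwrite").save("s3://lake/silver/pos_sales")

# Gold zone: aggregate into warehouse-style facts for consumption.
gold = silver.groupBy("store_id", "sale_date").agg(F.sum("amount").alias("daily_sales"))
gold.write.format("delta").mode("overwrite").save("s3://lake/gold/daily_store_sales")
```

In practice, each hop would also record lineage and data quality metrics so that the consumption zone can trust what it serves.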
When architecting, data engineering best practices must take into consideration current cloud provider limitations and constraints that drive costs for data movement and third-party analytics tool deployment. Consider the ultimate: a zettabyte cube of memory with sub-millisecond access for terabytes of data, where compute code resides with data to support relationships in a massive fabric or mesh. Impossible, today! But wait… maybe tomorrow this will be reality. Meanwhile, how do you build today in order to effortlessly move to that vision in the future? This is the focus of the best practices in this book. All trends point to the eventual creation of big data, AI-enabled data systems.
There are some key trends and concepts forming as part of that vision. Data sharing, confidential computing, and concepts such as bring your algorithm to the data