



Simplifying Data Engineering and Analytics with Delta

Create analytics-ready data that fuels artificial intelligence and business intelligence

Anindita Mahapatra

BIRMINGHAM—MUMBAI

Simplifying Data Engineering and Analytics with Delta

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Dhruv Jagdish Kataria

Senior Editor: Tazeen Shaikh

Content Development Editors: Sean Lobo, Priyanka Soam

Technical Editor: Devanshi Ayare

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Manju Arasan

Production Designer: Roshan Kawale

Marketing Coordinator: Nivedita Singh

First published: July 2022

Production reference: 1290622

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80181-486-7

www.packt.com

This book is dedicated to my parents for their unconditional love and support.

While there are too many to name here, I would like to thank my mentors and colleagues who have encouraged and aided me in this journey. Last but not least, I would like to thank the team at Packt for all their help and guidance throughout the process.

Foreword

My father was one of the first chief information officers (CIOs) back in the mid-1980s. He led all of IT for the largest commercial property insurer in the world. He reported to the CEO, which, at that time, was uncommon as most IT functions reported to the CFO because they were cost centers. Every weekend he would bring home some type of new technology: an Apple IIe, an IBM PC, even a "portable" computer that weighed 40 lbs. My sisters and I would play with them for hours on end, creating spreadsheets and writing basic programs. At the time, I viewed him as being on the bleeding edge of technology, a real "techie."

When I graduated college and went to work at IBM in 1991, I came home and tried to talk about technology with my father using all the speeds and feeds of the mid-range and Unix systems that I had just been trained on. Each time I mentioned a particular technical specification, he would ask me "What does that do?" or "Why is that important?" His questions frustrated me. When I explained why the SPEC-INT metric was important, he would look confused. I began to think my father wasn't the techie I once believed him to be. And I was right. Part of me was disappointed with this realization. But, over time, I came to see that his expertise was not the technology itself, but understanding the business strategy deeply and translating how specific capabilities provided by technology could be applied to make the business strategy succeed.

Fast forward 30+ years and I'm now the vice president of Global Value Acceleration at Databricks, one of the fastest-growing software companies in history. I lead a global team of consultants, or translators, that help prospects and customers connect the technical power of our data and AI platform to the meaningful business value its capabilities will deliver as they pursue their business strategy.

Looking back, I realize that I've been doing value translation my entire career. I found that when the business strategy meets the technical strategy and they are well aligned, magic happens. Executives who hold budgets and decision-making authority accelerate and approve initiatives and their associated spending. Likewise, when the translation work isn't done or isn't done well, they deny those requests. Over my career, I've learned that when those requests fail, it's generally not the fault of the technology. It comes down to the quality of the translation and the underlying story.

The need for translators in data is significant and increasing. According to a recent McKinsey article, "(data) translators play a critical role in bridging the technical expertise of data engineers and data scientists with the operational expertise of marketing, supply chain, manufacturing, risk, and other frontline managers. In their role, translators help ensure that the deep insights generated through sophisticated analytics translate into impact at scale in an organization. By 2026, the McKinsey Global Institute estimates that demand for translators in the United States alone may reach two to four million."

Through thousands of translation engagements with global enterprises over the last decade, my team, with our business value assessment (BVA) methodology, has proven to be a critical ingredient in the success of large initiatives. The recipe for translating complex technology to the C-suite for investment consideration follows a simple framework built around a story that draws executives in, making it easy for them to say "yes":

Key strategic priorities
Use cases aligned with those priorities
Technical barriers in the way of success
Capabilities required to succeed
Value to be realized when successful
Return on investment
Success plan

According to International Data Corporation (IDC), 95% of technology investments require financial justification. This framework provides the financial justification that is needed, but it also reinforces the urgency to act by connecting the project to the most important priorities or business problems that the C-suite and board have their eyes on, and it specifies the capabilities required for success. When you put these together, you have a CFO-ready business case that qualifies and quantifies the value, setting your project apart from all others.

This is why I've been so excited about this book. The opportunity to apply powerful technology such as Delta and deliver impact all the way to the boardroom of your employer is real and required for success in today's market. When I first worked with Anindita at Databricks, it was clear to me that she has a special talent that few technical people have. She is a translator. She can speak succinctly about very complex technical topics, make them easy to understand at any level, and connect the technology to why it matters to the business. Her ability to do this for our customers and for other Databricks employees has helped her, and Databricks, succeed in many ways.

As you read on from here, note how everything from data modeling to operationalizing Delta pipelines is made easy to understand and translatable to the business. Anindita, in her special way, will guide you to become a better data engineer while infusing you with specific skills to become a data translator, whose future value may just be priceless.

Doug May

VP, Global Value Acceleration

Databricks Inc.

Contributors

About the author

Anindita Mahapatra is a lead solutions architect at Databricks in the data and AI space helping clients across all industry verticals reap value from their data infrastructure investments. She teaches a data engineering and analytics course at Harvard University as part of their extension school program. She has extensive big data and Hadoop consulting experience from Think Big/Teradata, prior to which she was managing the development of algorithmic app discovery and promotion for both Nokia and Microsoft stores. She holds a master's degree in liberal arts and management from Harvard Extension School, a master's in computer science from Boston University, and a bachelor's in computer science from BITS Pilani, India.

About the reviewer

Oleksandra Bovkun is a solutions architect for data and AI platforms and systems. She works with customers and engineering teams to develop architectures and solutions for data platforms and guides them through the implementation. She has extensive experience and expertise in open source technologies such as Apache Spark, MLflow, Delta Lake, Kubernetes, and Helm, and programming languages such as Python and Scala. Furthermore, she specializes in data platform architecture, especially in DevOps and MLOps processes. Oleksandra has more than 10 years of experience in the field of software development and data engineering. She likes to discover new technologies and tools, architecture patterns, and open source projects.

Table of Contents

Preface

Section 1 – Introduction to Delta Lake and Data Engineering Principles

Chapter 1: Introduction to Data Engineering

The motivation behind data engineering

Use cases

How big is big data?

But isn't ML and AI all the rage today?

Understanding the role of data personas

Big data ecosystem

What characterizes big data?

Classifying data

Reaping value from data

Top challenges of big data systems

Evolution of data systems

Rise of cloud data platforms

SQL and NoSQL systems

OLTP and OLAP systems

Distributed computing

SMP and MPP computing

Parallel and distributed computing

Business justification for tech spending

Strategy for business transformation to use data as an asset

Big data trends and best practices

Summary

Chapter 2: Data Modeling and ETL

Technical requirements

What is data modeling and why should you care?

Advantages of a data modeling exercise

Stages of data modeling

Data modeling approaches for different data stores

Understanding metadata – data about data

Data catalog

Types of metadata

Why is metadata management the nerve center of data?

Moving and transforming data using ETL

Scenarios to consider for building ETL pipelines

Job orchestration

How to choose the right data format

Text format versus binary format

Row versus column formats

When to use which format

Leveraging data compression

Common big data design patterns

Ingestion

Transformations

Persist

Summary

Further reading

Chapter 3: Delta – The Foundation Block for Big Data

Technical requirements

Motivation for Delta

A case of too many is too little

Data silos to data swamps

Characteristics of curated data lakes

DDL commands

DML commands

APPEND

Demystifying Delta

Format layout on disk

The main features of Delta

ACID transaction support

Schema evolution

Unifying batch and streaming workloads

Time travel

Performance

Life with and without Delta

Lakehouse

Summary

Section 2 – End-to-End Process of Building Delta Pipelines

Chapter 4: Unifying Batch and Streaming with Delta

Technical requirements

Moving toward real-time systems

Streaming concepts

Lambda versus Kappa architectures

Streaming ETL

Extract – file-based versus event-based streaming

Transforming – stream processing

Loading – persisting the stream

Handling streaming scenarios

Joining with other static and dynamic datasets

Recovering from failures

Handling late-arriving data

Stateless and stateful in-stream operations

Trade-offs in designing streaming architectures

Cost trade-offs

Handling latency trade-offs

Data reprocessing

Multi-tenancy

De-duplication

Streaming best practices

Summary

Chapter 5: Data Consolidation in Delta Lake

Technical requirements

Why consolidate disparate data types?

Delta unifies all types of data

Structured data

Semi-structured data

Unstructured data

Avoiding patches of data darkness

Addressing problems in flight status using Delta

Augmenting domain knowledge constraints to quality

Continuous quality monitoring

Curating data in stages for analytics

RDD, DataFrames, and datasets

Spark transformations and actions

Spark APIs and UDFs

Ease of extending to existing and new use cases

Delta Lake connectors

Specialized Delta Lakes by industry

Data governance

GDPR and CCPA compliance

Role-based data access

Summary

Chapter 6: Solving Common Data Pattern Scenarios with Delta

Technical requirements

Understanding use case requirements

Minimizing data movement with Delta time travel

Delta cloning

Handling CDC

CDC

Change Data Feed (CDF)

Handling Slowly Changing Dimensions (SCD)

SCD Type 1

SCD Type 2

Summary

Chapter 7: Delta for Data Warehouse Use Cases

Technical requirements

Choosing the right architecture

Understanding what a data warehouse really solves

Lacunas of data warehouses

Discovering when a data lake does not suffice

Addressing concurrency and latency requirements with Delta

Visualizing data using BI reporting

Can cubes be constructed with Delta?

Analyzing tradeoffs in a push versus pull data flow

Why is being open such a big deal?

Considerations around data governance

The rise of the lakehouse category

Summary

Chapter 8: Handling Atypical Data Scenarios with Delta

Technical requirements

Emphasizing the importance of exploratory data analysis (EDA)

From big data to good data

Data profiling

Statistical analysis

Applying sampling techniques to address class imbalance

How to detect and address imbalance

Synthetic data generation to deal with data imbalance

Addressing data skew

Providing data anonymity

Handling bias and variance in data

Bias versus variance

How do we detect bias and variance?

How do we fix bias and variance?

Compensating for missing and out-of-range data

Monitoring data drift

Summary

Chapter 9: Delta for Reproducible Machine Learning Pipelines

Technical requirements

Data science versus machine learning

Challenges of ML development

Formalizing the ML development process

What is a model?

What is MLOps?

Aspirations of a modern ML platform

The role of Delta in an ML pipeline

Delta-backed feature store

Delta-backed model training

Delta-backed model inferencing

Model monitoring with Delta

From business problem to insight generation

Summary

Chapter 10: Delta for Data Products and Services

Technical requirements

DaaS

The need for data democratization

Delta for unstructured data

NLP data (text and audio)

Image and video data

Data mashups using Delta

Data blending

Data harmonization

Federated query

Facilitating data sharing with Delta

Setting up Delta sharing

Benefits of Delta sharing

Data clean room

Summary

Section 3 – Operationalizing and Productionalizing Delta Pipelines

Chapter 11: Operationalizing Data and ML Pipelines

Technical requirements

Why operationalize?

Understanding and monitoring SLAs

Scaling and high availability

Planning for DR 

How to decide on the correct DR strategy

How Delta helps with DR

Guaranteeing data quality

Automation of CI/CD pipelines 

Code under version control

Infrastructure as Code (IaC)

Unit and integration testing

Data as code – An intelligent pipeline

Summary

Chapter 12: Optimizing Cost and Performance with Delta

Technical requirements

Improving performance with common strategies

Where to look and what to look for

Optimizing with Delta

Changing the data layout in storage

Other platform optimizations

Automation

Is cost always inversely proportional to performance?

Best practices for managing performance

Summary

Chapter 13: Managing Your Data Journey

Provisioning a multi-tenant infrastructure

Data democratization via policies and processes

Capacity planning

Managing and monitoring

Data sharing

Data migration

COE best practices

Summary

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Preface

Delta helps you generate reliable insights at scale and simplifies the architecture around data pipelines, allowing you to focus primarily on refining the use cases being worked on. This is especially important considering that the same architecture is reused when onboarding new use cases.

In this book, you'll learn the principles of distributed computing, data modeling techniques, big data design patterns, and templates that help solve end-to-end data flow problems for common scenarios and are reusable across use cases and industry verticals. You'll also learn how to recover from errors and the best practices around handling structured, semi-structured, and unstructured data using Delta. Next, you'll get to grips with features such as ACID transactions on big data, disciplined schema evolution, time travel to help rewind a dataset to a different time or version, and unified batch and streaming capabilities that will help you build agile and robust data products.

By the end of this book, you'll be able to use Delta as the foundational block for creating analytics-ready data that fuels all AI/BI use cases.
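To give a flavor of those features, here is a minimal PySpark sketch. It assumes a Delta-enabled SparkSession (configured as in the Delta Lake quickstart) and a hypothetical table path, /tmp/delta/orders; it is illustrative only, not an example taken from the book:

path = "/tmp/delta/orders"

# ACID write: readers never see a partially written version of the table
(spark.range(100)
    .withColumnRenamed("id", "order_id")
    .write.format("delta").mode("overwrite").save(path))

# Disciplined schema evolution: the new column is accepted only because we opt in
from pyspark.sql.functions import lit
(spark.range(10)
    .withColumnRenamed("id", "order_id")
    .withColumn("channel", lit("web"))
    .write.format("delta").mode("append").option("mergeSchema", "true").save(path))

# Time travel: rewind the dataset to an earlier version
first_version = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Unified batch and streaming: the same table can also be consumed as a stream
order_stream = spark.readStream.format("delta").load(path)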

Who this book is for

Individuals in the data domain such as data engineers, data scientists, ML practitioners, and BI analysts working with big data will be able to put their knowledge to work with this practical guide to executing pipelines and supporting diverse use cases using the Delta protocol. Basic knowledge of SQL, Python programming, and Spark is required to get the most out of this book.

What this book covers

Chapter 1, Introduction to Data Engineering, covers how data is the new oil. Just as oil has to burn to get heat and light, data also has to be harnessed to get valuable insights. The quality of insights will depend on the quality of the data. So, understanding how to manage data is an important function for every industry vertical. This chapter introduces the fundamental principles of data engineering and addresses the growing trends in the industry of data-driven organizations and how to leverage IT data operation units as a competitive advantage instead of viewing them as a cost center.

Chapter 2, Data Modeling and ETL, covers how leveraging the scalability and elasticity of the cloud helps turn on compute on demand and move CAPEX allocation towards OPEX. This chapter introduces common big data design patterns and best practices for modeling big data.

Chapter 3, Delta – The Foundation Block for Big Data, introduces Delta as a file format and points out the features that Delta brings to the table over vanilla Parquet and why it is a natural choice for any pipeline. Delta is an overloaded term – it is a protocol first, a table next, and a lake finally!

Chapter 4, Unifying Batch and Streaming with Delta, covers how the trend is toward real-time ingestion, analysis, and consumption of data. Batching is simply a special case of a streaming workload. Reader/writer isolation is necessary so that multiple producers and consumers working on the same data assets can operate independently, with the promise that bad or partial data is never presented to the user.

Chapter 5, Data Consolidation in Delta Lake, covers how bringing data together from various silos is only the first step towards building a data lake. The real deal is in increased reliability, quality, and governance, which needs to be enforced to get the most out of the data and infrastructure investment while adding value to any BI or AI use case built on top of it.

Chapter 6, Solving Common Data Pattern Scenarios with Delta, covers common CRUD operations on big data and looks at use cases where they can be applied as a repeatable blueprint.

Chapter 7, Delta for Data Warehouse Use Cases, covers the journey from databases to data warehouses to data lakes, and finally, to lakehouses. The unification of data platforms has never been more important. Is it possible to house all kinds of use cases with a single architecture paradigm? This chapter focuses on the data handling needs and capability requirements that drive the next round of innovation.

Chapter 8, Handling Atypical Data Scenarios with Delta, covers several conditions, such as data imbalance, skew, and bias, that need to be addressed to ensure data is not only cleansed and transformed per the business requirements but is also well suited to the underlying compute and to the use case at hand. Even when the logic of the pipelines has been ironed out, statistical attributes of the data need to be monitored to ensure that the characteristics the pipeline was originally designed for still hold and that the distributed compute is used to its fullest.

Chapter 9, Delta for Reproducible Machine Learning Pipelines, emphasizes that if ML is hard, then reproducible ML and productionizing of ML is even harder. A large part of ML is data preparation. The quality of insights will be as good as the quality of the data that is used to build the models. In this chapter, we look at the role of Delta in ensuring reproducible ML.

Chapter 10, Delta for Data Products and Services, covers consumption patterns of data democratization that ensure the curated data gets into the hands of the consumers in a timely and secure manner so that the insights can be leveraged meaningfully. Data can be served both as a product and a service, especially in the context of a mesh architecture involving multiple lines of businesses specializing in different domains.

Chapter 11, Operationalizing Data and ML Pipelines, looks at the aspects of a mature pipeline that make it production worthy. A lot of the data around us remains in unstructured form and carries a wealth of information, and integrating it with more structured transactional data is where firms can not only gain competitive intelligence but also begin to get a holistic view of their customers to employ predictive analytics.

Chapter 12, Optimizing Cost and Performance with Delta, looks at how running a pipeline faster has cost implications that translate directly to increased infrastructural savings. This applies to both the ETL pipeline that brings in the data and curates it as well as the consumption pipeline where the stakeholders tap into this curated data. In this chapter, we look at strategies such as file skipping, z-ordering, small file coalescing, and bloom filtering to improve query runtime.

Chapter 13, Managing Your Data Journey, emphasizes the need for policies around data access and data use that need to be honored as per regulatory and compliance guidelines. In some industries, it may be necessary to provide evidence of all data access and transformations. Hence, there is a need to be able to set controls in place, detect if something has been changed, and provide a transparent audit trail.

To get the most out of this book

Basic knowledge of SQL, Python programming, and Spark is required to get the most out of this book. Delta is open source and can be run both on-prem and in the cloud. Because of the rise in cloud data platforms, a lot of the descriptions and examples are in the context of cloud storage.

Use the following GitHub link for the Delta Lake documentation and quickstart guide to help you set up your environment and become familiar with the necessary APIs: https://github.com/delta-io/delta.
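As a rough sketch of what that setup can look like with the open source delta-spark package (the package names, versions, and paths shown here are assumptions; follow the quickstart for the authoritative steps), a local PySpark environment can be wired up as follows:

# pip install pyspark delta-spark   (versions per the quickstart guide)
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (SparkSession.builder
           .appName("delta-quickstart")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write and read back a tiny Delta table to verify that everything is wired up
spark.range(3).write.format("delta").mode("overwrite").save("/tmp/delta/smoke_test")
spark.read.format("delta").load("/tmp/delta/smoke_test").show()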

Databricks is the original creator of Delta, which was open sourced to the Linux Foundation and is supported by a large user community. Examples in this book cover some Databricks-specific features to provide a complete view of features and capabilities. Newer features continue to be ported from Databricks to open source Delta. Please refer to the proposed roadmap for the feature migration details: https://github.com/delta-io/delta/issues/920.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta.

If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/UI11F.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "There is no need to run the REPAIR TABLE command when you're working with the Delta format".

A block of code is set as follows:

SELECT COUNT(*) FROM some_parquet_table

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "On the other hand, a data swamp is a large body of data that is ungoverned and unreliable."

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Simplifying Data Engineering and Analytics with Delta, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.

Section 1 – Introduction to Delta Lake and Data Engineering Principles

Understanding modern data architectures and sound data engineering principles and practices is crucial to ensuring that your AI and BI strategies are reliable and defensible. The insights you generate are only going to be as good as the quality of the underlying data, so the upfront effort put into understanding the data, modeling it, and transforming it per the business needs goes a long way toward fostering innovation, productivity, and agility in your data teams.

This part includes the following chapters:

Chapter 1, Introduction to Data Engineering
Chapter 2, Data Modeling and ETL
Chapter 3, Delta – The Foundation Block for Big Data

Chapter 1: Introduction to Data Engineering

"Water, water, everywhere, nor any drop to drink...

Data data everywhere, not a drop of insight!"

With the deluge of data around us, it is important to crunch it meaningfully and promptly to extract value from all the noise. This is where data engineering steps in. If collecting data is the first step, drawing useful insights is the next. Data engineering encompasses several personas that come together with their unique individual skill sets and processes to bring this to fruition. Data usually outlives the technology, and it continues to grow. New tools and frameworks come to the forefront to solve a lot of old problems. It is important to understand business requirements, the accompanying tech challenges, and the typical shifts in paradigms that solve these age-old problems in a better manner.

By the end of this chapter, you should have an appreciation of the data landscape, the players, and the advances in distributed computing and cloud infrastructure that make it possible to support the high pace of innovation.

In this chapter, we will cover the following topics:

The motivation behind data engineering
Data personas
Big data ecosystem
Evolution of data stores
Trends in distributed computing
Business justification for tech spending

The motivation behind data engineering

Data engineering is the process of converting raw data into analytics-ready data that is more accessible, usable, and consumable than its raw format. Modern companies are increasingly becoming data-driven, which means they use data to make business decisions and to gain better insights into their customers and business operations. They can use these insights to improve profitability, reduce costs, and gain a competitive edge in the market. Behind the scenes, a series of tasks and processes are performed by a host of data personas who build reliable pipelines to source, transform, and analyze data so that it is a repeatable and mostly automated process.

Different systems produce different datasets that need to function as individual units but are also brought together to provide a holistic view of the state of the business – for example, a customer buying merchandise through different channels such as the web, in-app, or in-store. Analyzing activity across all the channels will help predict the next customer purchase and possibly the next channel as well. In other words, having all the datasets in one place can help answer questions that couldn't be answered by the individual systems. So, data consolidation is an industry trend that breaks down individual silos. However, each of these systems may have been designed differently, with different requirements and service-level agreements (SLAs), and now all of that data needs to be normalized and consolidated in a single place to facilitate better analytics.
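To make the idea concrete, here is an illustrative PySpark sketch. The DataFrames web_df, app_df, and store_df and their column names are hypothetical; the point is simply normalizing several channel feeds to a shared schema and consolidating them into a single view:

from pyspark.sql import functions as F

def normalize(df, channel):
    # Keep a common set of columns and tag each record with its source channel
    return (df.select("customer_id", "item_id", "amount", "event_ts")
              .withColumn("channel", F.lit(channel)))

consolidated = (normalize(web_df, "web")
                .unionByName(normalize(app_df, "in-app"))
                .unionByName(normalize(store_df, "in-store")))

# A single consolidated table can now answer cross-channel questions,
# such as how much each customer spends per channel
spend = (consolidated.groupBy("customer_id", "channel")
                     .agg(F.sum("amount").alias("spend")))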

The following diagram compares the process of farming to that of processing and refining data. In both setups, there are different producers and consumers and a series of refining and packaging steps:

Figure 1.1 – Farming compared to a data pipeline

In this analogy, there is a farmer, and the process consists of growing crops, harvesting them, and making them available in a grocery store. This produce eventually becomes a ready-to-eat meal. Similarly, a data engineer is responsible for creating ready-to-consume data so that each consumer does not have to invest in the same heavy lifting. Each cook taps into different points of the pipeline and makes different recipes based on the specific needs of the use cases that need to be catered for. However, the freshness and quality of the produce are what make for a delightful meal, irrespective of the recipe that's used.

We are at the interesting conjunction of big data, the cloud, and artificial intelligence (AI), all of which are fueling tremendous innovation in every conceivable industry vertical and generating data exponentially. Data engineering is increasingly important as data drives business use cases in every industry vertical. You may argue that data scientists and machine learning practitioners are the unicorns of the industry, and they can work their magic for business. That is certainly a stretch of the imagination. Simple algorithms and a lot of good reliable data produce better insights than complicated algorithms with inadequate data. Some examples of how pivotal data is to the very existence of some of these businesses are listed in the following section.

Use cases

In this section, we've taken a popular use case from a few industry verticals to highlight how data is being used as a driving force for their everyday operations and the scale of data involved:

Security Information and Event Management (SIEM) cyber security systems for threat detection and prevention.

This involves user activity monitoring and auditing for suspicious activity patterns and entails collecting a large volume of logs across several devices and systems, analyzing them in real time, correlating data, and reporting on findings via alerts and dashboard refreshes.

Genomics and drug development in health and life sciences.

The Human Genome project took almost 15 years to complete. A single human genome requires about 100 gigabytes of storage, and it is estimated that by 2025, 40 exabytes of data will be required to process and store all the sequenced genomes. This data helps researchers understand and develop cures that are more targeted and precise.

Autonomous vehicles.

Autonomous vehicles use a lot of unstructured image data that's been generated from cameras on the body of the car to make safe driving decisions. It is estimated that an active vehicle generates about 5 TB every hour. Some of it will be thrown away after a decision has been made, but a part of it will be saved both locally as well as transmitted to a data center for long-term trend monitoring.

IoT sensors in Industry 4.0 smart factories in manufacturing.

Smart manufacturing and the Industry 4.0 revolution, which are powered by advances in IoT, are enabling a lot of efficiencies in machine and human utilization on the shop floor. Data is at the forefront of scaling these smart factory initiatives with real-time monitoring, predictive maintenance, early alerting, and digital twin technology to create closed-loop operations.

Personalized recommendations in retail.

In an omnichannel experience, personalization helps retailers engage better with their customers, irrespective of the channel they choose to engage with, all while picking up the relevant state from the previous channel they may have used. They can address concerns before the customer churns to a competitor. Personalization at scale can not only deliver a percentage lift in sales but can also reduce marketing and sales costs.

Gaming/entertainment.

Games such as Fortnite and Minecraft have captivated children and adults alike who spend several hours in a multi-player online game session. It is estimated that Fortnite generates 100 MB of data per user, per hour. Music and video streaming also rely a lot on recommendations for new playlists. Netflix receives more than a million new ratings every day and uses several parameters to bin users to understand similarities in their preferences.

Smart agriculture.

The agriculture market in North America is estimated to be worth 6.2 billion US dollars and uses big data to understand weather patterns for smart irrigation and crop planting, as well as to check soil conditions for the right fertilizer dose. John Deere uses computer vision to detect weeds and can localize the use of sprays to help preserve the quality of both the environment and the produce.

Fraud detection in the Fintech sector.

Detecting and preventing fraud is a constant effort as fraudsters find new ways to game the system. Because we are constantly transacting online, a lot of digital footprints are left behind. By some estimates, about 10% of insurance company payments are made due to fraud. AI techniques such as ML algorithms applied to biometric data can detect unusual patterns, which leads to better monitoring and risk assessment so that the user can be alerted before much damage is done.

Forecasting use cases across a wide variety of verticals.

Every business has some need for forecasting, either to predict sales, stock inventory, or supply chain logistics. It is not as straightforward as projection – other patterns influence this, such as seasonality, weather, and shifts in micro or macro-economic conditions. Data that's been augmented over several years by additional data feeds helps create more realistic and accurate forecasts.

How big is big data?

90% of the data that's generated thus far has been generated in the last 2 years alone. At the time of writing, it is estimated that 2.5 quintillion (18 zeros) bytes of data is produced every day. A typical commercial aircraft generates 20 terabytes of data per engine every hour it's in flight.

We are just at the beginning stages of autonomous driving vehicles, which rely on data points to operate. The world's population is about 7.7 billion. The number of connected devices is about 10 billion, with portions of the world not yet connected to the internet. So, this number will only grow as the number of IoT sensors and other connected devices grows. People have an appetite for apps and services that generate data, including search, social media, communication, services such as YouTube and Uber, photo and video services such as Snapchat and Facebook, and more. The following statistics give you a better idea of the data that's generated all around us and of how we need to swim effectively through all the waves and turbulence it creates to digest the most useful nuggets of information.

Every minute, the following occurs (approximately):

16 million text messages
1 million Tinder swipes
160 million emails
4 million YouTube videos
0.5 million tweets
0.5 million Snapchat shares

With so much data being generated, there is a need for robust data engineering tools and frameworks and reliable data and analytics platforms to harness this data and make sense of it. This is where data engineering comes to the rescue. Data is as important an asset as code is, so there should be governance around it. Structured data only accounts for 5-10% of enterprise data; semi-structured and unstructured data needs to be added to complete this picture.

Data is the new oil and is at the heart of every business. However, raw data by itself is not going to make a dent in a business. It is the useful insights generated from curated data that are the refined, consumable oil that businesses aspire to. Data drives ML, which, in turn, gives businesses their competitive advantage. This is the age of digitization, where most successful businesses see themselves as tech companies first. Start-ups have the advantage of selecting the latest digital platforms while traditional companies are all undergoing digital transformations. Why should I care so much about the underlying data? I have highly qualified ML practitioners, the unicorns of the industry, who can use sophisticated algorithms and their special skill sets to make magic!

In this section, we established the importance of curating data since raw data by itself isn't going to make a dent in a business. In the next section, we will explore the influence that curated data has on the effectiveness of ML initiatives.

But isn't ML and AI all the rage today?

AI and ML are catchy buzzwords, and everybody wants to be on the bandwagon and use ML to differentiate their product. However, the hardest part about ML is not ML – it is managing everything else around ML creation. This was shown by Google in a 2015 paper (https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf). Garbage in, garbage out still holds true: the magic wand of ML will only work if the boxes surrounding it are well developed, and most of those boxes represent data engineering tasks. In short, high-quality curated data is the foundational layer of any ML application, and the data engineering practices that curate this data are the backbone that holds it all together:

Figure 1.2 – The hardest part about ML is not ML, but rather everything else around it

Technologies come and go, so understanding the core challenges around data is critical. As technologists, we create more impact when we align solutions with business challenges. Speed to insights is what all businesses demand and the key to this is data. The data and IT functional areas within an organization that were traditionally viewed as cost centers are now being viewed as revenue-generating sources. Organizations where business and tech cooperate, instead of competing with each other, are the ones most likely to succeed with their data initiatives. Building data services and products involves several personas. In the next section, we will articulate the varying skill sets of these personas within an organization.

Understanding the role of data personas

Since data engineering is such a crucial field, you may be wondering who the main players are and what skill sets they possess. Building a data product involves several folks, all of whom need to come together with seamless handoffs to ensure a successful end product or service is created. It would be a mistake to create silos and increase both the number and complexity of integration points as each additional integration is a potential failure point. Data engineering has a fair overlap with software engineering and data science tasks:

Figure 1.3 – Data engineering requires multidisciplinary skill sets

All these roles require an understanding of data engineering:

Data engineers focus on building and maintaining the data pipelines that ingest and transform data. This role has a lot in common with a software engineering role, coupled with lots of data.
BI analysts focus on SQL-based reporting and can be operational or domain-specific subject-matter experts (SMEs), such as financial or supply chain analysts.
Data scientists and ML practitioners are statisticians who explore and analyze the data (via Exploratory Data Analysis (EDA)) and use modeling techniques at various levels of sophistication.
DevOps and MLOps focus on the infrastructure aspects of monitoring and automation. MLOps is DevOps coupled with the additional task of managing the life cycle of analytic models.
ML engineers span both the data engineer and data scientist roles.
Data leaders, such as chief data officers, are data stewards at the top of the food chain – the ultimate governors of data.

The following diagram shows the typical placement of the four main data personas working collaboratively on a data platform to produce business insights to give the company a competitive advantage in the industry:

Figure 1.4 – Data personas working in collaboration

Let's take a look at a few of these points in more detail:

DevOps is responsible for all operational aspects of the data platform and traditionally does a lot of scripting and automation.
Data/ML engineers are responsible for building the data pipeline and taking care of the extract, transform, load (ETL) aspects of the pipeline.
Data scientists of varying skill levels build models.
Business analysts create reporting dashboards from aggregated, curated data.

Big data ecosystem

The big data ecosystem has a fairly large footprint that's contributed by several infrastructures, analytics (BI and AI) technologies, data stores, and apps. Some of these are open source, while others are proprietary. Some are easy to wield, while others have steeper learning curves. Big data management can be daunting as it brings in another layer of challenges over existing data systems. So, it is important to understand what qualifies as a big data system and know what set of tools should be used for the use case at hand.

What characterizes big data?

Big data was initially characterized by three Vs (volume, velocity, and variety). This involves processing a lot of data coming into a system at high velocity with varying data types. Two more Vs were subsequently added (veracity and value). This list continues to grow and now includes variability and visibility. Let's look at the top five and see what each of them means:

Volume: This is measured by the size of data, both historical and current:
  The number of records in a file or table
  The size of the data in gigabytes, terabytes, and so on
Velocity: This refers to the frequency at which new data arrives:
  Batches have a well-defined interval, such as daily or hourly.
  Real time is either continuous or micro-batch, typically in seconds.
Variety: This refers to the structural nature of the data:
  Structured data is usually relational and has a well-defined schema.
  Semi-structured data has a self-describing schema that can evolve, such as the XML and JSON formats.
  Unstructured data refers to free-text documents, audio, and video data that's usually in binary format.
Veracity: This refers to the trustworthiness and reliability of the data:
  Lineage refers to not just the source but also the subsequent systems where transformations took place, ensuring that data fidelity is maintained and can be audited. To guarantee such reliability, data lineage must be maintained.
Value: This refers to the business impact that the dataset has – that is, how valuable the data is to the business.

Classifying data

Different classification gauges can be used. The common ones are based on the following aspects:

As the volume of data increases, we move from regular systems to big data systems. Big data is typically terabytes of data that cannot fit on a single computer node.
As the velocity of the data increases, we move toward big data systems specialized in streaming. In batch systems, irrespective of when data arrives, it is processed at a predefined regular interval. Streaming systems come in two flavors: if it's continuous, data is processed as it arrives; if it's micro-batch, data is aggregated in small batches, typically every few seconds or milliseconds (see the sketch after this list).
When it comes to variety – that is, the structure of the data – increasing amounts of semi-structured and unstructured data move us toward the realm of big data systems. In structured data, the schema is well known and stable, so it's assumed to be fairly static and rigid to the definition. With semi-structured data, the schema is built into the data and can evolve. In unstructured data such as images, audio, and video, there is some metadata but no real schema to the binary data that's sent.
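In Spark Structured Streaming terms, the batch versus micro-batch distinction often comes down to the trigger that is chosen. The following is an illustrative sketch only; it assumes a Delta-enabled SparkSession and hypothetical paths and is not tied to any specific example in this book:

events = spark.readStream.format("delta").load("/tmp/delta/events")

# Micro-batch: process whatever has arrived every 10 seconds
(events.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/chk/events_10s")
    .trigger(processingTime="10 seconds")
    .start("/tmp/delta/events_bronze"))

# Batch as a special case of streaming: process everything available once, then stop
(events.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/chk/events_once")
    .trigger(once=True)
    .start("/tmp/delta/events_daily"))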

The following diagram shows what trends in data characteristics signal a move toward big data systems. For example, demographic data is fairly structured with predefined fields, operational data moves toward the semi-structured realm as schemas evolve, and the most voluminous is behavioral data as it encompasses user sentiment, which is constantly changing and is best captured by unstructured data such as text, audio, and images:

Figure 1.5 – Classifying data

Now that we have covered the different types of data, let's see how much processing needs to be done before it can be consumed.

Reaping value from data

As data is refined and moves further along the pipeline, there is a tradeoff between the value that's added and the cost of the data. In other words, more time, effort, and resources are used, which is why the cost increases, but the value of the data increases as well:

Figure 1.6 – The layers of data value

The analogy we're using here is that of cutting carbon to create a diamond. The raw data is the carbon, which gets increasingly refined. The more processing layers there are, the more refined and curated the data and the greater its value. However, it is also more time-consuming and expensive to produce the artifact.

Top challenges of big data systems

People, technology, and processes are the three prongs that every enterprise has to balance. Technology changes around us at a pace that is hard to keep up with, even as it gives us better tools and frameworks. Tools are great, but until you train people to use them effectively, you cannot create the solutions a business needs. Sound and effective business processes help you pass information quickly and break down data silos.

According to Gartner, the three main challenges of big data systems are as follows:

Data silos
Fragmented tools
People with the skill sets to wield them

The following diagram shows these challenges:

Figure 1.7 – Big data challenges

Any imbalance or immaturity in these areas results in poor insights. These challenges around data quality and data staleness lead to inaccurate, delayed, and hence unusable insights.

Evolution of data systems

We have been collecting data for decades. The flat file storage of the 60s led to the data warehouses of the 80s, then to Massively Parallel Processing (MPP) and NoSQL databases, and eventually to data lakes. New paradigms continue to be coined, but it would be fair to say that most enterprise organizations have settled on some variation of a data lake:

Figure 1.8 – Evolution of big data systems

Cloud adoption continues to grow, with even highly regulated industries such as healthcare and Fintech embracing the cloud as a cost-effective way to keep pace with innovation; otherwise, they risk being left behind. People who have used security as the reason for not going to the cloud should be reminded that the massive data breaches that have been splashing the media in recent years have all been from on-premises setups. Cloud architectures face more scrutiny and are in some ways more governed and secure.

Rise of cloud data platforms

The data challenges remain the same. However, over time, the three major shifts in architecture offerings have been due to the introduction of the following:

Data warehouses
Hadoop, heralding the start of data lakes
Cloud data platforms, refining the data lake offerings

The use cases that we've been trying to solve for all three generations can be placed into three categories, as follows:

SQL-based BI reporting
Exploratory data analysis (EDA)
ML

Data warehouses were good at handling modest volumes of structured data and excelled at BI reporting use cases, but they had limited support for semi-structured data and practically no support for unstructured data. Their workloads could only support batch processing. Once ingested, the data was in a proprietary format, and the systems were expensive, so older data would be dropped in favor of accommodating new data. Also, because they were running at capacity, interactive queries had to wait for ingestion workloads to finish to avoid putting strain on the system. There were no ML capabilities built into these systems.

Hadoop came with the promise of handling large volumes of data and could support all types of data, along with streaming capabilities. In theory, all the use cases were feasible. In practice, they weren't. Schema on read meant that the ingestion path was greatly simplified and people dumped their data, but the consumption paths became more difficult. Managing a Hadoop cluster was complex, so upgrading software versions was a challenge. Hive was SQL-like and the most popular of all the Hadoop stack offerings, but access performance was slow. So, part of the curated data was pushed back into data warehouses because of its structure, which meant that data personas were left to stitch together two systems, adding fragility and increasing end-to-end latency.

Cloud data platforms


