Data Engineering with Apache Spark, Delta Lake, and Lakehouse - Manoj Kukreja - E-Book

Data Engineering with Apache Spark, Delta Lake, and Lakehouse E-Book

Manoj Kukreja

0,0
34,79 €

-100%
Sammeln Sie Punkte in unserem Gutscheinprogramm und kaufen Sie E-Books und Hörbücher mit bis zu 100% Rabatt.

Mehr erfahren.
Beschreibung

In the world of ever-changing data and schemas, it is important to build data pipelines that can auto-adjust to changes. This book will help you build scalable data platforms that managers, data scientists, and data analysts can rely on.
Starting with an introduction to data engineering, along with its key concepts and architectures, this book will show you how to use Microsoft Azure Cloud services effectively for data engineering. You'll cover data lake design patterns and the different stages through which the data needs to flow in a typical data lake. Once you've explored the main features of Delta Lake to build data lakes with fast performance and governance in mind, you'll advance to implementing the lambda architecture using Delta Lake. Packed with practical examples and code snippets, this book takes you through real-world examples based on production scenarios faced by the author in his 10 years of experience working with big data. Finally, you'll cover data lake deployment strategies that play an important role in provisioning the cloud resources and deploying the data pipelines in a repeatable and continuous way.
By the end of this data engineering book, you'll know how to effectively deal with ever-changing data and create scalable data pipelines to streamline data science, ML, and artificial intelligence (AI) tasks.

Das E-Book können Sie in Legimi-Apps oder einer beliebigen App lesen, die das folgende Format unterstützen:

EPUB
MOBI

Seitenzahl: 364

Veröffentlichungsjahr: 2021

Bewertungen
0,0
0
0
0
0
0
Mehr Informationen
Mehr Informationen
Legimi prüft nicht, ob Rezensionen von Nutzern stammen, die den betreffenden Titel tatsächlich gekauft oder gelesen/gehört haben. Wir entfernen aber gefälschte Rezensionen.



Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Create scalable pipelines that ingest, curate, and aggregate complex data in a timely and secure way

Manoj Kukreja

BIRMINGHAM—MUMBAI

Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Copyright © 2021 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kunal Parikh

Publishing Product Manager: Sunith Shetty

Senior Editor: Davud Sugarman

Content Development Editor: Joseph Sunil

Technical Editor: Sonam Pandey

Copy Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Tejal Daruwale Soni

Production Designer: Prashant Ghare

First published: October 2021

Production reference: 1170921

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80107-774-3

www.packt.com

In memory of my two fathers, my mother, and my father-in-law, whom I lost recently. Hope you keep showering your blessings from far and beyond. To my darling wife Roopika, a simple thank you is not good enough. To my daughter Saesha who inspired me to write this book, and to my other daughter Shagun who is living my other dream.

I love you all.

– Manoj Kukreja

Foreword

I don't think it is required anymore to explain the importance data plays in the modern world. Many of the online services we use today are powered by data systems of varying complexity. Machine learning emerged as a powerful tool for certain tasks in big part thanks to modern data platforms that can collect, process, and distribute large volumes of data. These data platforms also provide a strong foundation for data analytics and help drive real world decisions in healthcare, business, and education.

One of the challenges that I have faced when I have started working in data space, which I'm sure is familiar to many of you, is the lack of good books and other educational resources on the topic. There is a lot of documentation on specific tools that you can use, but knowledge of what good architectures look like and which patterns work in which specific cases only resides with a few experts who have done this type of work before. This is especially true when it comes to emerging cloud technologies.

Manoj is one of such experts since he has been implementing data platforms for companies in different industries for the last 10 years. In this book, he will share his knowledge and experience on architecting and implementing modern data platforms using Microsoft Azure, one of the leading cloud providers for data workloads. You will learn practical skills on implementing data lake architectures, implementing data pipelines, and running them in a production environment. Books like this are useful because they give you a download of the knowledge and experience that Manoj has accumulated over the years and you know that recipes and advice he gives will work in a real-life implementation because he has implemented similar systems many times before.

Getting practical data engineering skills is a great investment in your career and knowing patterns that you can rely on will spare you lots of trial and error and will help you get results much faster. Hope you enjoy it!

Danil Zburivsky,

Senior Consultant at Sourced Group

Data Engineering and Cloud Infrastructure specialist

Ottawa, Ontario, Canada

Contributors

About the author

Manoj Kukreja is a Principal Architect at Northbay Solutions who specializes in creating complex Data Lakes and Data Analytics Pipelines for large-scale organizations such as banks, insurance companies, universities, and US/Canadian government agencies. Previously, he worked for Pythian, a large managed service provider where he was leading the MySQL and MongoDB DBA group and supporting large-scale data infrastructure for enterprises across the globe. With over 25 years of IT experience, he has delivered Data Lake solutions using all major cloud providers including AWS, Azure, GCP, and Alibaba Cloud. On weekends, he trains groups of aspiring Data Engineers and Data Scientists on Hadoop, Spark, Kafka and Data Analytics on AWS and Azure Cloud.

About the reviewers

Rajeesh Punathil is currently a Data Architect with a prominent energy company based in Canada. He leads the digital team in building cloud data analytics solutions aimed to help businesses arrive at right questions and answers.In prior roles he has designed data analytics pipelines in AWS and Azure stack for major companies like Sunlife, CIBC , City National bank, Hudsons Bay etc. He holds a B Tech in Electronics and Telecom Engineering from University of Calicut and has over 15 Years of IT experience. He consider himself as a forever student, and his constant striving for knowledge has made him an expert and leader in the field of data analytics and big data. He has completed AWS Big Data Analytics certification to his credit.

Dang Trung Anh is a principal engineer and software architect with over 15 years of experience in delivering big software systems for various customers across many areas. He has spent most of his career writing many software systems that support millions of users. He has had various roles from an ordinary full-stack developer, through software architect, to director of engineering. He is the author of many viral articles related to software engineering on Medium.

Table of Contents

Preface

Section 1: Modern Data Engineering and Tools

Chapter 1: The Story of Data Engineering and Analytics

The journey of data

Exploring the evolution of data analytics

Core capabilities of storage and compute resources

Availability of varying datasets

The paradigm shift to distributed computing

Adoption of cloud computing

Data storytelling

The monetary power of data

Organic growth

Summary

Chapter 2: Discovering Storage and Compute Data Lakes

Introducing data lakes

Exploring the benefits of data lakes

Adhering to compliance frameworks

Segregating storage and compute in a data lake

Discovering data lake architectures

The CAP theorem

Summary

Chapter 3: Data Engineering on Microsoft Azure

Introducing data engineering in Azure

Performing data engineering in Microsoft Azure

Self-managed data engineering services (IaaS)

Azure-managed data engineering services (PaaS)

Data processing services in Microsoft Azure

Data engineering as a service (SaaS)

Data cataloging and sharing services in Microsoft Azure

Opening a free account with Microsoft Azure

Summary

Section 2: Data Pipelines and Stages of Data Engineering

Chapter 4: Understanding Data Pipelines

Exploring data pipelines

Components of a data pipeline

Process of creating a data pipeline

Discovery phase

Design phase

Development phase

Deployment phase

Running a data pipeline

Sample lakehouse project

Summary

Chapter 5: Data Collection Stage – The Bronze Layer

Architecting the Electroniz data lake

The cloud architecture

The pipeline design

The deployment strategy

Understanding the bronze layer

Configuring data sources

Data preparation

Configuring data destinations

Building the ingestion pipelines

Building a batch ingestion pipeline

Testing the ingestion pipelines

Building the streaming ingestion pipeline

Summary

Chapter 6: Understanding Delta Lake

Understanding how Delta Lake enables the lakehouse

Understanding Delta Lake

Preparing Azure resources

Creating a Delta Lake table

Changing data in an existing Delta Lake table

Performing time travel

Performing upserts of data

Understanding isolation levels

Understanding concurrency control

Cleaning up Azure resources

Summary

Chapter 7: Data Curation Stage – The Silver Layer

The need for curating raw data

Unstandardized data

Invalid data

Non-uniform data

Inconsistent data

Duplicate data

Insecure data

The process of curating raw data

Inspecting data

Getting approval

Cleaning data

Verifying data

Developing a data curation pipeline

Preparing Azure resources

Creating the pipeline for the silver layer

Running the pipeline for the silver layer

Verifying curated data in the silver layer

Verifying unstandardized data

Verifying invalid data

Verifying non-uniform data

Verifying duplicate data

Verifying insecure data

Cleaning up Azure resources

Summary

Chapter 8: Data Aggregation Stage – The Gold Layer

The need to aggregate data

The process of aggregating data

Developing a data aggregation pipeline

Preparing the Azure resources

Creating the pipeline for the gold layer

Running the aggregation pipeline

Understanding data consumption

Accessing silver layer data

Accessing gold layer data

Verifying aggregated data in the gold layer

Meeting customer expectations

Summary

Section 3: Data Engineering Challenges and Effective Deployment Strategies

Chapter 9: Deploying and Monitoring Pipelines in Production

The deployment strategy

Developing the master pipeline

Testing the master pipeline

Scheduling the master pipeline

Monitoring pipelines

Adding durability features

Dealing with failure conditions

Adding alerting features

Summary

Chapter 10: Solving Data Engineering Challenges

Schema evolution

Sharing data

Preparing the Azure resources

Creating a data share

Data governance

Preparing the Azure resources

Creating a data catalog

Cleaning up Azure resources

Summary

Chapter 11: Infrastructure Provisioning

Infrastructure as code

Deploying infrastructure using Azure Resource Manager

Creating ARM templates

Deploying ARM templates using the Azure portal

Deploying ARM templates using the Azure CLI

Deploying ARM templates containing secrets

Deploying multiple environments using IaC

Cleaning up Azure resources

Summary

Chapter 12: Continuous Integration and Deployment (CI/CD) of Data Pipelines

Understanding CI/CD

Traditional software delivery cycle

Modern software delivery cycle

Designing CI/CD pipelines

Developing CI/CD pipelines

Creating an Azure DevOps organization

Creating the Electroniz infrastructure CI/CD pipeline

Creating the Electroniz code CI/CD pipeline

Creating the CI/CD life cycle

Summary

Other Books You May Enjoy

Section 1: Modern Data Engineering and Tools

This section introduces you to the world of data engineering. It gives you an understanding of data engineering concepts and architectures. Furthermore, it educates you on how to effectively utilize the Microsoft Azure cloud services for data engineering.

This section contains the following chapters:

Chapter 1, The Story of Data Engineering and AnalyticsChapter 2, Discovering Storage and Compute Data Lake ArchitecturesChapter 3, Data Engineering on Microsoft Azure

Chapter 1: The Story of Data Engineering and Analytics

Every byte of data has a story to tell. The real question is whether the story is being narrated accurately, securely, and efficiently. In the modern world, data makes a journey of its own—from the point it gets created to the point a user consumes it for their analytical requirements.

But what makes the journey of data today so special and different compared to before? After all, Extract, Transform, Load (ETL) is not something that recently got invented. In fact, I remember collecting and transforming data since the time I joined the world of information technology (IT) just over 25 years ago.

In this chapter, we will discuss some reasons why an effective data engineering practice has a profound impact on data analytics.

In this chapter, we will cover the following topics:

The journey of dataExploring the evolution of data analyticsThe monetary power of data

Remember:

the road to effective data analytics leads through effective data engineering.

The journey of data

Data engineering is the vehicle that makes the journey of data possible, secure, durable, and timely. A data engineer is the driver of this vehicle who safely maneuvers the vehicle around various roadblocks along the way without compromising the safety of its passengers. Waiting at the end of the road are data analysts, data scientists, and business intelligence (BI) engineers who are eager to receive this data and start narrating the story of data. You can see this reflected in the following screenshot:

Figure 1.1 – Data's journey to effective data analysis

Traditionally, the journey of data revolved around the typical ETL process. Unfortunately, the traditional ETL process is simply not enough in the modern era anymore. Due to the immense human dependency on data, there is a greater need than ever to streamline the journey of data by using cutting-edge architectures, frameworks, and tools.

You may also be wondering why the journey of data is even required. Gone are the days where datasets were limited, computing power was scarce, and the scope of data analytics was very limited. We now live in a fast-paced world where decision-making needs to be done at lightning speeds using data that is changing by the second. Let's look at how the evolution of data analytics has impacted data engineering.

Exploring the evolution of data analytics

Data analytics has evolved over time, enabling us to do bigger and better. For many years, the focus of data analytics was limited to descriptive analysis, where the focus was to gain useful business insights from data, in the form of a report. This type of analysis was useful to answer question such as "What happened?". A hypothetical scenario would be that the sales of a company sharply declined within the last quarter.

Very quickly, everyone started to realize that there were several other indicators available for finding out what happened, but it was the why it happened that everyone was after. The core analytics now shifted toward diagnostic analysis, where the focus is to identify anomalies in data to ascertain the reasons for certain outcomes. An example scenario would be that the sales of a company sharply declined in the last quarter because there was a serious drop in inventory levels, arising due to floods in the manufacturing units of the suppliers. This form of analysis further enhances the decision support mechanisms for users, as illustrated in the following diagram:

Figure 1.2 – The evolution of data analytics

Important note

Both descriptive analysis and diagnostic analysis try to impact the decision-making process using factual data only.

Since the advent of time, it has always been a core human desire to look beyond the present and try to forecast the future. If we can predict future outcomes, we can surely make a lot of better decisions, and so the era of predictive analysis dawned, where the focus revolves around "What will happen in the future?". Predictive analysis can be performed using machine learning (ML) algorithms—let the machine learn from existing and future data in a repeated fashion so that it can identify a pattern that enables it to predict future trends accurately.

Now that we are well set up to forecast future outcomes, we must use and optimize the outcomes of this predictive analysis. Based on the results of predictive analysis, the aim of prescriptive analysis is to provide a set of prescribed actions that can help meet business goals.

Important note

Unlike descriptive and diagnostic analysis, predictive and prescriptive analysis try to impact the decision-making process, using both factual and statistical data.

But how can the dreams of modern-day analysis be effectively realized? After all, data analysts and data scientists are not adequately skilled to collect, clean, and transform the vast amount of ever-increasing and changing datasets.

The data engineering practice is commonly referred to as the primary support for modern-day data analytics' needs.

The following are some major reasons as to why a strong data engineering practice is becoming an absolutely unignorable necessity for today's businesses:

Core capabilities of compute and storage resourcesAvailability of varying datasetsThe paradigm shift to distributed computingAdoption of cloud computingData storytelling

We'll explore each of these in the following subsections.

Important note

Having a strong data engineering practice ensures the needs of modern analytics are met in terms of durability, performance, and scalability.

Core capabilities of storage and compute resources

25 years ago, I had an opportunity to buy a Sun Solaris server—128 megabytes (MB) random-access memory (RAM), 2 gigabytes (GB) storage—for close to $ 25K. The intended use of the server was to run a client/server application over an Oracle database in production. Given the high price of storage and compute resources, I had to enforce strict countermeasures to appropriately balance the demands of online transaction processing (OLTP) and online analytical processing (OLAP) of my users. One such limitation was implementing strict timings for when these programs could be run; otherwise, they ended up using all available power and slowing down everyone else.

Today, you can buy a server with 64 GB RAM and several terabytes (TB) of storage at one-fifth the price. The extra power available can do wonders for us. Multiple storage and compute units can now be procured just for data analytics workloads. The extra power available enables users to run their workloads whenever they like, however they like. In fact, it is very common these days to run analytical workloads on a continuous basis using data streams, also known as stream processing.

The installation, management, and monitoring of multiple compute and storage units requires a well-designed data pipeline, which is often achieved through a data engineering practice.

Availability of varying datasets

A few years ago, the scope of data analytics was extremely limited. Performing data analytics simply meant reading data from databases and/or files, denormalizing the joins, and making it available for descriptive analysis. The structure of data was largely known and rarely varied over time.

We live in a different world now; not only do we produce more data, but the variety of data has increased over time. In addition to collecting the usual data from databases and files, it is common these days to collect data from social networking, website visits, infrastructure logs' media, and so on, as depicted in the following screenshot:

Figure 1.3 – Variety of data increases the accuracy of data analytics

Important note

More variety of data means that data analysts have multiple dimensions to perform descriptive, diagnostic, predictive, or prescriptive analysis.

Naturally, the varying degrees of datasets injects a level of complexity into the data collection and processing process. On the flip side, it hugely impacts the accuracy of the decision-making process as well as the prediction of future trends. A well-designed data engineering practice can easily deal with the given complexity.

The paradigm shift to distributed computing

The traditional data processing approach used over the last few years was largely singular in nature. To process data, you had to create a program that collected all required data for processing—typically from a database—followed by processing it in a single thread. This type of processing is also referred to as data-to-code processing. Unfortunately, there are several drawbacks to this approach, as outlined here:

Since vast amounts of data travel to the code for processing, at times this causes heavy network congestion. Since a network is a shared resource, users who are currently active may start to complain about network slowness. Being a single-threaded operation means the execution time is directly proportional to the data. Therefore, the growth of data typically means the process will take longer to finish. This could end up significantly impacting and/or delaying the decision-making process, therefore rendering the data analytics useless at times.Something as minor as a network glitch or machine failure requires the entire program cycle to be restarted, as illustrated in the following diagram:

Figure 1.4 – Rise of distributed computing

The distributed processing approach, which I refer to as the paradigm shift, largely takes care of the previously stated problems. Instead of taking the traditional data-to-code route, the paradigm is reversed to code-to-data.

In a distributed processing approach, several resources collectively work as part of a cluster, all working toward a common goal. In simple terms, this approach can be compared to a team model where every team member takes on a portion of the load and executes it in parallel until completion. If a team member falls sick and is unable to complete their share of the workload, some other member automatically gets assigned their portion of the load.

Distributed processing has several advantages over the traditional processing approach, outlined as follows:

The code-to-data paradigm shift ensures the network does not get clogged. The entire idea of distributed processing heavily relies on the assumption that data is stored in a distributed fashion across several machines, also referred to as nodes. At the time of processing, only the code portion (usually a much smaller footprint as compared to actual data) is sent over to each node that stores the portion of the data being processed. This ensures that the processing happens locally on the node where the data is stored. Since several nodes are collectively participating in data processing, the overall completion time is drastically reduced.Program execution is immune to network and node failures. If a node failure is encountered, then a portion of the work is assigned to another available node in the cluster.

Important note

Distributed processing is implemented using well-known frameworks such as Hadoop, Spark, and Flink. Modern massively parallel processing (MPP)-style data warehouses such as Amazon Redshift, Azure Synapse, Google BigQuery, and Snowflake also implement a similar concept.

Since distributed processing is a multi-machine technology, it requires sophisticated design, installation, and execution processes. That makes it a compelling reason to establish good data engineering practices within your organization.

Adoption of cloud computing

The vast adoption of cloud computing allows organizations to abstract the complexities of managing their own data centers. Migrating their resources to the cloud offers faster deployments, greater flexibility, and access to a pricing model that, if used correctly, can result in major cost savings.

In the previous section, we talked about distributed processing implemented as a cluster of multiple machines working as a group. For this reason, deploying a distributed processing cluster is expensive.

In the pre-cloud era of distributed processing, clusters were created using hardware deployed inside on-premises data centers. Very careful planning was required before attempting to deploy a cluster (otherwise, the outcomes were less than desired). You might argue why such a level of planning is essential. Let me address this:

Since the hardware needs to be deployed in a data center, you need to physically procure it. The real question is how many units you would procure, and that is precisely what makes this process so complex.Order more units than required and you'll end up with unused resources, wasting money. Order fewer units than required and you will have insufficient resources, job failures, and degraded performance.

To order the right number of machines, you start the planning process by performing benchmarking of the required data processing jobs.

The results from the benchmarking process are a good indicator of how many machines will be able to take on the load to finish the processing in the desired time. You now need to start the procurement process from the hardware vendors. Keeping in mind the cycle of procurement and shipping process, this could take weeks to months to complete.Once the hardware arrives at your door, you need to have a team of administrators ready who can hook up servers, install the operating system, configure networking and storage, and finally install the distributed processing cluster software—this requires a lot of steps and a lot of planning.

I hope you may now fully agree that the careful planning I spoke about earlier was perhaps an understatement. The complexities of on-premises deployments do not end after the initial installation of servers is completed. You are still on the hook for regular software maintenance, hardware failures, upgrades, growth, warranties, and more.

This is precisely the reason why the idea of cloud adoption is being very well received. Having resources on the cloud shields an organization from many operational issues. Additionally, the cloud provides the flexibility of automating deployments, scaling on demand, load-balancing resources, and security.

Important note

Many aspects of the cloud particularly scale on demand, and the ability to offer low pricing for unused resources is a game-changer for many organizations. If used correctly, these features may end up saving a significant amount of cost. Having a well-designed cloud infrastructure can work miracles for an organization's data engineering and data analytics practice.

Data storytelling

I started this chapter by stating Every byte of data has a story to tell. Data storytelling is a new alternative for non-technical people to simplify the decision-making process using narrated stories of data. Traditionally, decision makers have heavily relied on visualizations such as bar charts, pie charts, dashboarding, and so on to gain useful business insights. These visualizations are typically created using the end results of data analytics. The problem is that not everyone views and understands data in the same way. Let me give you an example to illustrate this further.

Here is a BI engineer sharing stock information for the last quarter with senior management:

Figure 1.5 – Visualizing data using simple graphics

And here is the same information being supplied in the form of data storytelling:

Figure 1.6 – Storytelling approach to data visualization

Important note

Visualizations are effective in communicating why something happened, but the storytelling narrative supports the reasons for it to happen.

Data storytelling tries to communicate the analytic insights to a regular person by providing them with a narration of data in their natural language. This does not mean that data storytelling is only a narrative. It is a combination of narrative data, associated data, and visualizations. With all these combined, an interesting story emerges—a story that everyone can understand.

As data-driven decision-making continues to grow, data storytelling is quickly becoming the standard for communicating key business insights to key stakeholders.

There's another benefit to acquiring and understanding data: financial. Let's look at the monetary power of data next.

The monetary power of data

Modern-day organizations are immensely focused on revenue acceleration. Traditionally, organizations have primarily focused on increasing sales as a method of revenue acceleration… but is there a better method?

Modern-day organizations that are at the forefront of technology have made this possible using revenue diversification. Here are some of the methods used by organizations today, all made possible by the power of data.

Organic growth

During my initial years in data engineering, I was a part of several projects in which the focus of the project was beyond the usual. On several of these projects, the goal was to increase revenue through traditional methods such as increasing sales, streamlining inventory, targeted advertising, and so on. This meant collecting data from various sources, followed by employing the good old descriptive, diagnostic, predictive, or prescriptive analytics techniques.

But what can be done when the limits of sales and marketing have been exhausted? Where does the revenue growth come from?

Some forward-thinking organizations realized that increasing sales is not the only method for revenue diversification. They started to realize that the real wealth of data that has accumulated over several years is largely untapped. Instead of solely focusing their efforts entirely on the growth of sales, why not tap into the power of data and find innovative methods to grow organically?

This innovative thinking led to the revenue diversification method known as organic growth. Subsequently, organizations started to use the power of data to their advantage in several ways. Let's look at several of them.

Customer retention

Data scientists can create prediction models using existing data to predict if certain customers are in danger of terminating their services due to complaints. Based on this list, customer service can run targeted campaigns to retain these customers. By retaining a loyal customer, not only do you make the customer happy, but you also protect your bottom line.

Fraud prevention

Banks and other institutions are now using data analytics to tackle financial fraud. Based on key financial metrics, they have built prediction models that can detect and prevent fraudulent transactions before they happen. These models are integrated within case management systems used for issuing credit cards, mortgages, or loan applications.

Using the same technology, credit card clearing houses continuously monitor live financial traffic and are able to flag and prevent fraudulent transactions before they happen. Detecting and preventing fraud goes a long way in preventing long-term losses.

Problem detection

I was part of an internet of things (IoT) project where a company with several manufacturing plants in North America was collecting metrics from electronic sensors fitted on thousands of machinery parts. The sensor metrics from all manufacturing plants were streamed to a common location for further analysis, as illustrated in the following diagram:

Figure 1.7 – IoT is contributing to a major growth of data

These metrics are helpful in pinpointing whether a certain consumable component such as rubber belts have reached or are nearing their end-of-life (EOL) cycle. Collecting these metrics is helpful to a company in several ways, including the following:

The data indicates the machinery where the component has reached its EOL and needs to be replaced. Having this data on hand enables a company to schedule preventative maintenance on a machine before a component breaks (causing downtime and delays).The data from machinery where the component is nearing its EOL is important for inventory control of standby components. Before this system is in place, a company must procure inventory based on guesstimates. Buy too few and you may experience delays; buy too many, you waste money. At any given time, a data pipeline is helpful in predicting the inventory of standby components with greater accuracy.

The combined power of IoT and data analytics is reshaping how companies can make timely and intelligent decisions that prevent downtime, reduce delays, and streamline costs.

Data monetization

Innovative minds never stop or give up. They continuously look for innovative methods to deal with their challenges, such as revenue diversification. Organizations quickly realized that if the correct use of their data was so useful to themselves, then the same data could be useful to others as well.

As per Wikipedia, data monetization is the "act of generating measurable economic benefits from available data sources".

The following diagram depicts data monetization using application programming interfaces (APIs):

Figure 1.8 – Monetizing data using APIs is the latest trend

In the latest trend, organizations are using the power of data in a fashion that is not only beneficial to themselves but also profitable to others. In a recent project dealing with the health industry, a company created an innovative product to perform medical coding using optical character recognition (OCR) and natural language processing (NLP).

Before the project started, this company made sure that we understood the real reason behind the project—data collected would not only be used internally but would be distributed (for a fee) to others as well. Knowing the requirements beforehand helped us design an event-driven API frontend architecture for internal and external data distribution. At the backend, we created a complex data engineering pipeline using innovative technologies such as Spark, Kubernetes, Docker, and microservices. This is how the pipeline was designed:

Several microservices were designed on a self-serve model triggered by requests coming in from internal users as well as from the outside (public). For external distribution, the system was exposed to users with valid paid subscriptions only. Once the subscription was in place, several frontend APIs were exposed that enabled them to use the services on a per-request model.Each microservice was able to interface with a backend analytics function that ended up performing descriptive and predictive analysis and supplying back the results.

The power of data cannot be underestimated, but the monetary power of data cannot be realized until an organization has built a solid foundation that can deliver the right data at the right time. Data engineering plays an extremely vital role in realizing this objective.

Summary

In this chapter, we went through several scenarios that highlighted a couple of important points.

Firstly, the importance of data-driven analytics is the latest trend that will continue to grow in the future. Data-driven analytics gives decision makers the power to make key decisions but also to back these decisions up with valid reasons.

Secondly, data engineering is the backbone of all data analytics operations. None of the magic in data analytics could be performed without a well-designed, secure, scalable, highly available, and performance-tuned data repository—a data lake.

In the next few chapters, we will be talking about data lakes in depth. We will start by highlighting the building blocks of effective data—storage and compute. We will also look at some well-known architecture patterns that can help you create an effective data lake—one that effectively handles analytical requirements for varying use cases.

Chapter 2: Discovering Storage and Compute Data Lakes

In the previous chapter, we discussed the immense power that data possesses, but with immense power comes increased responsibility. In the past, the key focus of organizations has been to detect trends with data, with the goal of revenue acceleration. Very commonly, however, they have paid less attention to vulnerabilities caused by inconsistent data management and delivery.

In this chapter, we will discuss some ways a data lake can effectively deal with the ever-growing demands of the analytical world.

In this chapter, we will cover the following topics:

Introducing data lakesSegregating storage and compute in a data lakeDiscovering data lake architectures

Introducing data lakes

Over the last few years, the markers for effective data engineering and data analytics have shifted. Up to now, organizational data has been dispersed over several internal systems (silos), each system performing analytics over its own dataset.

Additionally, it has been difficult to interface with external datasets for extending the spectrum of analytic workloads. As a result, it has been difficult for these organizations to get a unified view of their data and gain global insights.

In a world where organizations are seeking revenue diversification by fine-tuning existing processes and generating organic growth, a globally unified repository of data has become a core necessity. Data lakes solve this need by providing a unified view of data into the hands of users who can use this data to devise innovative techniques for the betterment of mankind.

The following diagram outlines the characteristics of a data lake:

Figure 2.1 – Characteristics of a data lake

You will find several varying definitions of a data lake on the internet. Instead of defining what a data lake is, let's focus on the benefits of having a data lake for a modern-day organization.

We will also see how the storage and computations process in a data lake differs from traditional data storage solutions such as databases and data warehouses.

Important note

In the modern world, the typical extract, transform, load (ETL) process (collecting, transforming, and analyzing) is simply not enough. There are other aspects surrounding the ETL process—such as security, orchestration, and delivery—that need to be equally accounted for.

Exploring the benefits of data lakes

In addition to just being a