Description

Azure Data Lake, the modern data warehouse architecture, and related data services on Azure enable organizations to build their own customized analytical platform to fit any analytical requirements in terms of volume, speed, and quality.
This book is your guide to learning all the features and capabilities of Azure data services for storing, processing, and analyzing data (structured, unstructured, and semi-structured) of any size. You will explore key techniques for ingesting and storing data and perform batch, streaming, and interactive analytics. The book also shows you how to overcome various challenges and complexities relating to productivity and scaling. Next, you will be able to develop and run massive data workloads to perform different actions. Using a cloud-based big data and modern data warehouse analytics setup, you will also be able to build secure, scalable data estates for enterprises. Finally, you will not only learn how to develop a data warehouse but also understand how to implement enterprise-grade security and auditing for big data programs.
By the end of this Azure book, you will have learned how to develop a powerful and efficient analytical platform to meet enterprise needs.




Cloud Scale Analytics with Azure Data Services

Build modern data warehouses on Microsoft Azure

Patrik Borosch

BIRMINGHAM—MUMBAI

Cloud Scale Analytics with Azure Data Services

Copyright © 2021 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kunal Parikh

Publishing Product Manager: Reshma Raman

Senior Editor: David Sugarman

Content Development Editor: Joseph Sunil

Technical Editor: Manikandan Kurup

Copy Editor: Safis Editing

Project Coordinator: Aparna Nair

Proofreader: Safis Editing

Indexer: Rekha Nair

Production Designer: Vijay Kamble

First published: July 2021

Production reference: 1250621

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80056-293-6

www.packt.com

A big thank you to Packt Publishing: Reshma Raman, Gebin George, David Sugarman, Aishwarya Mohan and, of course, Joseph Sunil. It was a ride, and your support was vital to finish this book. I am very proud, humbled, and thankful that you approached me and gave me the chance to write this book.

Special thanks go to my manager Zoran Draganic, who encouraged me to take on this challenge and supported me throughout the writing process.

I want to say another huge thank you to my CSA colleague Meinrad Weiss. Your technical expertise, your honest (and always to the point) feedback, and our reviews helped me to learn even more and to improve the quality of the book.

A fourth big thank you to Liviana Zürcher, another CSA alongside Meinrad and myself. Your reviews and your support, especially when I started the writing, were more than important, as you gave me faith and kept me going during the critical starting phase.

Finally, I need to thank my wife, Simone, from the very bottom of my heart. She was with me in this challenge and stayed patient during all those night sessions, and all the times I told her, "I need to finish the next chapter!" Without your support, I wouldn't have been able to research, experiment, and finish this book.

Contributors

About the author

Patrik Borosch is a Cloud Solution Architect for Data and AI at Microsoft Switzerland GmbH. He has more than 25 years of BI and analytics development, engineering, and architecture experience and is a Microsoft Certified Data Engineer and a Microsoft Certified AI Engineer. Patrik has worked on numerous significant international Data Warehouse, Data Integration, and Big Data projects, where he has built and extended his experience in all facets, from requirements engineering through data modeling and ETL all the way to reporting and dashboarding. At Microsoft Switzerland, he supports customers on their journey into the analytical world of the Azure cloud.

About the reviewers

Pradeep Menon is a seasoned data analytics professional with more than 18 years of experience in data and AI. Currently, Pradeep works as a data and AI strategist with Microsoft. In this role, he is responsible for helping Microsoft's strategic customers across Asia to be more data-driven by using cloud, big data, and AI technologies. He is also a distinguished speaker and blogger and has given numerous keynotes on cloud technologies, data, and AI. He has previously worked at Microsoft, Alibaba Group, and IBM.

Liviana Zürcher started her career in Romania as a Technical Sales Consultant for Business Intelligence and Data Warehousing at Oracle. She then worked for several years as a Big Data Warehouse and Business Intelligence consultant and trainer on projects all over the world. In 2018, she joined Microsoft Switzerland as a Cloud Solution Architect for Data and Artificial Intelligence, where she still works today.

First of all, I would like to thank my parents, who have always helped me and encouraged me to step out of my comfort zone. A special thank you to my wonderful husband and to our two amazing children for their support. I thank Patrik for this opportunity; I was flattered by your request.

Meinrad Weiss works as a Senior Cloud Solution Architect in the Data and AI team at Microsoft Switzerland. He is a highly experienced database expert with a long and successful track record in data, BI, and analytics projects. Meinrad's expertise spans from on-premises and cloud RDBMSs all the way up to the most complex analytical architectures built with the Azure Data Services and the field of IoT at Microsoft. He joined Microsoft in 2017 and has become a vital and reliable pillar of the Data and AI architects team.

Table of Contents

Preface

Section 1: Data Warehousing and Considerations Regarding Cloud Computing

Chapter 1: Balancing the Benefits of Data Lakes Over Data Warehouses

Distinguishing between Data Warehouses and Data Lakes

Understanding Data Warehouse patterns

Investigating ETL/ELT

Understanding Data Warehouse layers

Implementing reporting and dashboarding

Loading bigger amounts of data

Starting with Data Lakes

Understanding the Data Lake ecosystem

Comparing Data Lake zones

Discovering caveats

Understanding the opportunities of modern cloud computing

Understanding Infrastructure-as-a-Service

Understanding Platform-as-a-Service

Understanding Software-as-a-Service

Examining the possibilities of virtual machines

Understanding Serverless Functions

Looking at the importance of containers

Exploring the advantages of scalable environments

Implementing elastic storage and compute

Exploring the benefits of AI and ML

Understanding ML challenges

Sorting ML into the Modern Data Warehouse

Understanding responsible ML/AI

Answering the question

Summary

Chapter 2: Connecting Requirements and Technology

Formulating your requirements

Asking in the right direction

Understanding basic architecture patterns

Examining the scalable storage component

Looking at data integration

Sorting in compute

Adding a presentation layer

Planning for dashboard/reporting

Adding APIs/API management

Relying on SSO/MFA/networking

Not forgetting DevOps and CI/CD

Finding the right Azure tool for the right purpose

Understanding Industry Data Models

Thinking about different sizes

Planning for S size

Planning for M size

Planning for L size

Understanding the supporting services

Requiring data governance

Establishing security

Establishing DevOps and CI/CD

Summary

Questions

Section 2: The Storage Layer

Chapter 3: Understanding the Data Lake Storage Layer

Technical requirements

Setting up your Cloud Big Data Storage

Provisioning a standard storage account instead

Creating an Azure Data Lake Gen2 storage account

Organizing your data lake

Talking about zones in your data lake

Creating structures in your data lake

Planning the leaf level

Understanding data life cycles

Investigating storage tiers

Planning for criticality

Setting up confidentiality

Using filetypes

Implementing a data model in your Data Lake

Understanding interconnectivity between your data lake and the presentation layer

Examining key implementation and usage

Monitoring your storage account

Creating alerts for Azure storage accounts

Talking about backups

Configuring delete locks for the storage service

Backing up your data

Implementing access control in your Data Lake

Understanding RBAC

Understanding ACLs

Understanding the evaluation sequence of RBAC and ACLs

Understanding Shared Key authorization

Understanding Shared Access Signature authorization

Setting the networking options

Understanding storage account firewalls

Adding Azure virtual networks

Using private endpoints with Data Lake Storage

Discovering additional knowledge

Summary

Further reading

Chapter 4: Understanding Synapse SQL Pools and SQL Options

Uncovering MPP in the cloud – the power of 60

Understanding the control node

Understanding compute nodes

Understanding the data movement service

Understanding distributions

Provisioning a Synapse dedicated SQL pool

Connecting to your database for the first time

Distributing, replicating, and round-robin

Understanding CCI

Talking about partitioning

Implementing workload management

Understanding concurrency and memory settings

Using resource classes

Implementing workload classification

Adding workload importance

Understanding workload isolation

Scaling the database

Using PowerShell to handle scaling and start/stop

Using T-SQL to scale your database

Loading data

Using the COPY statement

Maintaining statistics

Understanding other SQL options in Azure

Summary

Further reading

Additional links

Static resource classes and concurrency slots

Dynamic resource classes, memory allocation, and concurrency slots

Effective values for REQUEST_MIN_RESOURCE_GRANT_PERCENT

Section 3: Cloud-Scale Data Integration and Data Transformation

Chapter 5: Integrating Data into Your Modern Data Warehouse

Technical requirements

Setting up Azure Data Factory

Creating the Data Factory service

Examining the authoring environment

Understanding the Author section

Understanding the Monitor section

Understanding the Manage section

Understanding the object types

Using wizards

Working with parameters

Using variables

Adding data transformation logic

Understanding mapping flows

Understanding wrangling flows

Understanding integration runtimes

Integrating with DevOps

Summary

Further reading

Chapter 6: Using Synapse Spark Pools

Technical requirements

Setting up a Synapse Spark pool

Bringing your Spark cluster live for the first time

Examining the Synapse Spark architecture

Understanding the Synapse Spark pool and its components

Running a Spark job

Examining Synapse Spark instances

Understanding Spark pools and Spark instances

Understanding resource usage

Programming with Synapse Spark pools

Understanding Synapse Spark notebooks

Running Spark applications

Benefiting from the Synapse metadata exchange

Using additional libraries with your Spark pool

Using public libraries

Adding your own packages

Handling security

Monitoring your Synapse Spark pools

Summary

Further reading

Chapter 7: Using Databricks Spark Clusters

Technical requirements

Provisioning Databricks

Examining the Databricks workspace

Understanding the Databricks components

Creating Databricks clusters

Managing clusters

Using Databricks notebooks

Using Databricks Spark jobs

Adding dependent libraries to a job

Creating Databricks tables

Understanding Databricks Delta Lake

Having a glance at Databricks SQL Analytics

Adding libraries

Adding dashboards

Setting up security

Examining access controls

Understanding secrets

Understanding networking

Monitoring Databricks

Summary

Further reading

Chapter 8: Streaming Data into Your MDWH

Technical requirements

Provisioning ASA

Implementing an ASA job

Integrating sources

Writing to sinks

Understanding ASA SQL

Understanding windowing

Using window functions in your SQL

Delivering to more than one output

Adding reference data to your query

Adding functions to your ASA job

Understanding streaming units

Resuming your job

Using Structured Streaming with Spark

Security in your streaming solution

Connecting to sources and sinks

Understanding ASA clusters

Monitoring your streaming solution

Using Azure Monitor

Summary

Further reading

Chapter 9: Integrating Azure Cognitive Services and Machine Learning

Technical requirements

Understanding Azure Cognitive Services

Examining available Cognitive Services

Getting in touch with Cognitive Services

Using Cognitive Services with your data

Understanding the Azure Text Analytics cognitive service

Implementing the call to your Text Analytics cognitive service with Spark

Examining Azure Machine Learning

Browsing the different Azure ML tools

Examining Azure Machine Learning Studio

Understanding the ML designer

Creating a linear regression model with the designer

Publishing your trained model for usage

Using Azure Machine Learning with your modern data warehouse

Connecting the services

Understanding further options to integrate Azure ML with your modern data warehouse

Summary

Further reading

Chapter 10: Loading the Presentation Layer

Technical requirements

Understanding the loading strategy with Synapse-dedicated SQL pools

Loading data into Synapse-dedicated SQL pools

Examining PolyBase

Loading data into a dedicated SQL pool using COPY

Adding data with Synapse pipelines/Data Factory

Using Synapse serverless SQL pools

Browsing data ad hoc

Using a serverless SQL pool to ELT

Building a virtual data warehouse layer with Synapse serverless SQL pools

Integrating data with Synapse Spark pools

Reading and loading data

Exchanging metadata between computes

Summary

Further reading

Section 4: Data Presentation, Dashboarding, and Distribution

Chapter 11: Developing and Maintaining the Presentation Layer

Developing with Synapse Studio

Integrating Synapse Studio with Azure DevOps

Understanding the development life cycle

Automating deployments

Understanding developer productivity with Synapse Studio

Using the Copy Data Wizard

Integrating Spark notebooks with Synapse pipelines

Analyzing data ad hoc with Azure Synapse Spark pools

Creating Spark tables

Enriching Spark tables

Enriching dedicated SQL pool tables

Creating new integration datasets

Starting serverless SQL analysis

Backing up and DR in Azure Synapse

Backing up data

Backing up dedicated SQL pools

Monitoring your MDWH

Understanding security in your MDWH

Implementing access control

Implementing networking

Summary

Further reading

Chapter 12: Distributing Data

Technical requirements

Building data marts with Power BI

Understanding the Power BI ecosystem

Understanding Power BI object types

Understanding Power BI offerings

Acquiring data

Optimizing the columnstore database in Power BI

Building business logic with Data Analysis Expressions

Visualizing data

Publishing insights

Creating data models with Azure Analysis Services

Developing AAS models

Distributing data using Azure Data Share

Summary

Further reading

Chapter 13: Introducing Industry Data Models

Understanding Common Data Model

Examining the basics of the SDK

Understanding solutions and the manifest file

Examining and leveraging predefined entities

Finding CDM definitions

Using the APIs of CDM

Introducing Dataverse

Discovering Azure Industry Data Workbench

Summary

Further reading

Chapter 14: Establishing Data Governance

Technical requirements

Discovering Azure Purview

Provisioning the service

Connecting to your data

Scanning data

Searching your catalog

Browsing assets

Examining assets

Classifying data

Creating a custom classification

Creating a custom classification rule

Using custom classifications

Integrating with Azure services

Integrating with Synapse

Integrating with Power BI

Integrating with Azure Data Factory

Using data lineage

Discovering Insights

Discovering more Purview

Summary

Further reading

Why subscribe?

Other Books You May Enjoy

Section 1: Data Warehousing and Considerations Regarding Cloud Computing

This section examines the question of whether Data Warehouses are still required given the rise of the enterprise Data Lake and provides a brief overview of the trends and developments in the data and AI market. As cloud computing adds flexible and scalable services to AI, there are no longer limits on the source formats and volumes that can be processed for AI requirements. And given that AI and machine learning are on everybody's mind at the moment, the book asks what all this entails and where we are heading. In addition, we'll take a technology-agnostic look at the components that make up a successful analytical system. From that agnostic viewpoint, we will then try to find the right Azure services to build a Modern Data Warehouse.

This section comprises the following chapters:

Chapter 1, Balancing the Benefits of Data Lakes Over Data Warehouses

Chapter 2, Connecting Requirements and Technology

Chapter 1: Balancing the Benefits of Data Lakes Over Data Warehouses

Is the Data Warehouse dead with the advent of Data Lakes? There is disagreement everywhere about the need for Data Warehousing in a modern data estate. With the rise of Data Lakes and Big Data technology, many people now use newer technologies than databases for their analytical efforts. Establishing a data-driven company seems to be possible without all those narrow definitions and planned structures, the ETL/ELT, and all the indexing for performance. But when we examine the technology carefully, and compare the requirements formulated in analytical projects, free of prejudice, with the functionality that the chosen services or software packages can deliver, we often find gaps at both ends. This chapter discusses the capabilities of Data Warehouses and Data Lakes and introduces the concept of the Modern Data Warehouse.

With all the innovations that have been brought to us in the last few years, such as faster hardware, new technologies, and new dogmas such as the Data Lake, older concepts and methods are being questioned and challenged. In this chapter, I would like to explore the evolution of the analytical world and try to answer the question, is the Data Warehouse really obsolete?

We'll find out by covering the following topics:

Distinguishing between Data Warehouses and Data Lakes

Understanding the opportunities of modern cloud computing

Exploring the benefits of AI and ML

Answering the question

Distinguishing between Data Warehouses and Data Lakes

There are several definitions of Data Warehousing on the internet. The narrower ones characterize a warehouse as the database and the model used in that database; the wider ones treat the term as a method and a collection of all the organizational and technological components that make up a BI solution. They cover everything from the Extract, Transform, Load (ETL) tool to the database, the model, and, of course, the reporting and dashboarding solution.

Understanding Data Warehouse patterns

When we look at the Data Warehousing method in general, at its heart, we find a database that offers a certain table structure. We almost always find two main types of artifacts in the database: Facts and Dimensions.

Facts provide all the measurable information that we want to analyze; for example, the quantities of products sold per customer, per region, per sales representative, and per time period. Facts are normally quite narrow objects, but they store a lot of rows.

In the Dimensions, we find all the descriptive information that can be linked to the Facts for analysis. Every piece of information that a user puts on a report or dashboard to aggregate, group, filter, and view the fact data is collected in the Dimensions. All the data describing entities such as Customer, Product, Contract, Address, and so on that might need to be analyzed and correlated is stored here. Typically, these objects are stored as tables in the database and are joined using their key columns. Dimensions are normally wide objects, sometimes with controlled redundancy, depending on the modeling method used.

Three main methods for modeling the Facts and Dimensions within a Data Warehouse database have crystallized over the years of its evolution:

Star-Join/Snowflake: This is probably the most famous method for Data Warehouse modeling. Fact tables are put in the center of the model, while Dimension tables are arranged around them, lending their Primary Keys to the Fact table (see the sketch after this list). In the Star-Join method, we find a lot of redundancy in the tables, since all the Dimension data, including all hierarchical information (such as Product Group -> Product SubCategory -> Product Category) regarding a certain artifact (Product, Customer, and so on), is stored in one table. In a Snowflake schema, hierarchies are spread over additional tables per hierarchy level and are linked to each other through relationships. This, when expressed as a graph, turns out to show a kind of snowflake pattern.

Data Vault: This is a newer method that reflects the rising structural volatility of data sources and offers higher flexibility and speed for development. Entities that need to be analyzed are stored across Hubs, Satellites, and Links. Hubs simply reflect the presence of an entity by storing its ID and some audit information, such as its data source, create times, and so on. Each Hub can have one or more Satellites. These Satellites store all the descriptive information about the entity. If we need to change the system and add new information about an entity, we can add another Satellite to the model, reflecting just the new data. This brings the benefit of non-destructive deployments to the production system during rollout. In the Data Vault, the Customer data will be stored in one Hub (the CustomerID and audit columns) and one or more Satellites (the rest of the customer information). The structure of the model is finally provided by Links. They store the Primary Keys of the related Hubs and, again, some metadata. Additionally, Links can have Satellites of their own that describe the relationship represented by the Link. Therefore, the connection between a Customer and the Products they bought will be reflected in a Link, where the Customer-Hub-Key and the Product-Hub-Key are stored together with audit columns. A Link Satellite can, for example, reflect some characteristics of the relationship, such as the amount bought, the date, or a discount. Finally, we can even add a Star-Join view schema to abstract all the tables of the Data Vault and make it easier for users to understand.

3rd Normal Form: This is the typical database modeling technique that is also used in (and was first created for) so-called Online Transactional Processing (OLTP) databases. Artifacts are broken up into their atomic information and spread over several tables so that no redundancy is stored in any table. The Product information of a system might be split into separate tables for the product's name, color, size, price information, and many more. To derive all the information of a 3rd Normal Form model, a lot of joining is necessary.
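To make the Star-Join idea more concrete, here is a minimal PySpark sketch (not taken from the book; the table and column names are hypothetical). A narrow fact table is joined to two wide dimension tables over their keys, and a measure is aggregated by dimension attributes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-join-sketch").getOrCreate()

# Hypothetical dimension tables: wide, descriptive attributes keyed by a surrogate key.
dim_customer = spark.createDataFrame(
    [(1, "Contoso AG", "Zurich"), (2, "Fabrikam Ltd", "Geneva")],
    ["CustomerKey", "CustomerName", "Region"],
)
dim_product = spark.createDataFrame(
    [(10, "Bike", "Sports"), (20, "Helmet", "Accessories")],
    ["ProductKey", "ProductName", "ProductCategory"],
)

# Hypothetical fact table: narrow, many rows, foreign keys plus measures.
fact_sales = spark.createDataFrame(
    [(1, 10, 2, 1200.0), (2, 10, 1, 600.0), (2, 20, 3, 150.0)],
    ["CustomerKey", "ProductKey", "Quantity", "Revenue"],
)

# The star join: the fact in the center, the dimensions arranged around it.
(
    fact_sales
    .join(dim_customer, "CustomerKey")
    .join(dim_product, "ProductKey")
    .groupBy("Region", "ProductCategory")
    .sum("Revenue")
    .show()
)
```

The same pattern expressed in SQL against database tables is what reporting tools generate against a Star-Join model all day long.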

Investigating ETL/ELT

But how does data finally land in the Data Warehouse database? The process and the related tools are named Extract, Transform, Load (ETL), but depending on the sequence in which we implement it, it may be referred to as ELT. You'll find several possible ways to implement data loading into a Data Warehouse. This can be done with specialized ETL tools such as Azure Data Factory in the cloud, SQL Server Integration Services (SSIS), Informatica, Talend, or IBM DataStage, for example.

The biggest advantage of these tools is the availability of wide catalogues of ready-to-use source and target connectors. They can connect directly to a source, query the needed data, and even transform it while it is being transported to the target. In the end, data is loaded into the Data Warehouse database. Other advantages include their graphical interfaces, where complex logic can be implemented on a "point-and-click" basis, which makes it very easy to understand and maintain.

There are other options as well. Data is often pushed by source applications and a direct connection for the data extraction process is not wanted at all. Many times, files are provided that are stored somewhere near the Data Warehouse database and then need to be imported. Maybe there is no ETL tool available. Since nearly every database nowadays provides loader tools, the import can be accomplished using those tools in a scripted environment. Once the data has made its way to the database tables, the transformational steps are done using Stored Procedures that then move the data through different DWH stages or layers to the final Core Data Warehouse.
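As an illustration of this scripted ELT pattern, the following is a minimal Python sketch, assuming a pyodbc connection to an Azure Synapse or Azure SQL database. The server, storage URL, staging table, and stored procedure names are hypothetical placeholders, not artifacts from the book:

```python
import pyodbc

# Hypothetical connection; replace server, database, and credentials.
# autocommit=True keeps the load statements out of an explicit transaction.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydwh;"
    "UID=loader;PWD=<secret>",
    autocommit=True,
)
cursor = conn.cursor()

# Step 1 (EL): import the provided files into a staging table.
# Here we assume the files were dropped into Azure storage and use the
# T-SQL COPY statement; any database bulk loader would serve the same purpose.
cursor.execute("""
    COPY INTO stage.Sales
    FROM 'https://mystorage.blob.core.windows.net/landing/sales/2021/*.csv'
    WITH (FILE_TYPE = 'CSV', FIRSTROW = 2)
""")

# Step 2 (T): call a stored procedure that cleanses the staged rows and
# moves them through the layers into the Core Data Warehouse.
cursor.execute("EXEC core.usp_LoadSalesFromStage")

conn.close()
```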

Understanding Data Warehouse layers

Talking about the Data Warehouse layers, we nearly always find several steps that are processed before the data is provided for reporting or dashboarding. Typically, there are at least the following stages or layers:

Landing or staging area: Data is imported into this layer in its rawest format. Nothing is changed on its way into this area, and only audit information is added to be able to track all the loads.

QS, transient, or cleansing area: This is where the work is done. You will only find a few projects where data is consistent. Values, sometimes even mandatory ones from the source, may be missing, content might be formatted incorrectly or even corrupted, and so on. In this zone, all the issues with the source data are taken care of, and the data is curated and cleansed.

Core Data Warehouse: In this part of the database, the clean data is brought into the data model of choice. The model takes care of data historization, if required. This is also one possible point of security – or, even better, the central point of security. Data is secured so that each user can only see what they are allowed to see. The Core Data Warehouse is also the place where performance is boosted. By using all the available database techniques, such as indexes, partitioning, compression, and so on, the Data Warehouse is tuned to suit the performance needs of the reporting users. The Core Data Warehouse is often also seen as the presentation layer of the system, where all kinds of reporting, dashboarding, and self-service BI tools can connect and create their analyses.

Additionally, Data Warehouses have also brought in other types of layers. There might be sublayers for the cleansing area, for example, or Data Marts, which are used to slice data semantically for the needs of certain user groups, or the Operational Data Store (ODS), which has different definitions depending on who you ask. Sometimes, it is used as an intermediary data store that can be used for reporting. Other definitions speak of a transient zone that stores data for integration and further transport to the DWH. Sometimes, it is used as a kind of archived data pool from which the Data Warehouse can always be reproduced. The definitions vary.

Implementing reporting and dashboarding

In terms of the reporting/dashboarding solution, we also find several approaches to visualizing the information that's provided by the Data Warehouse. There are the typical reporting tools, such as SQL Server Reporting Services or Crystal Reports, for example. These are page-oriented report generators that access the underlying database, get the data, and render it according to the template being used.

The more modern approach, however, has resulted in tools that can access a certain dataset, store it in an internal caching database, and allow interactive reporting and dashboarding based on that data. Self-Service BI tools such as Power BI, QLIK, and Tableau allow you to access the data, create your own visuals, and put them together in dashboards. Nowadays, these tools can even correlate data from different data sources and allow you to analyze data that might not have made it to the Data Warehouse database yet.

Loading bigger amounts of data

You can scale databases with newer, faster hardware, more memory, and faster disks. You can also go from Symmetric Multi-Processing (SMP) to Massively Parallel Processing (MPP) databases. However, the usual restrictions still apply: the more data we need to process, the longer it will take to do so. And there are also workloads that databases will not support, such as image processing.

Starting with Data Lakes

It's funny that we find a similar mix, or maybe even confusion, when we examine the term Data Lake. Many people refer to a Data Lake as a Hadoop Big Data implementation that delivers one or more clusters of computers on which a distributed filesystem and computational software are installed. It can deal with distributed Input and Output (I/O) on the one hand but can also do distributed and parallel computation. Such a system adds specialized services for all kinds of workloads, be it just SQL queries against the stored data, in-memory computation, streaming analytics – you name it. Interestingly, as with the Data Warehouse discussion, the narrower definition of the Data Lake refers only to the storage solution, and less to the method and the collection of services. To be honest, I like the wider definition far better, as it describes the method more holistically.

With the Hadoop Distributed File System (HDFS) from the Apache Software Foundation, a system is delivered that is capable of storing data distributed over clusters of cheap commodity hardware. HDFS splits files into blocks and replicates those blocks within the system. This not only delivers a failsafe environment, but it also generates the biggest advantage of HDFS: parallel access to the data blocks for the consuming compute components. This means that a Spark cluster, for example, can read several data blocks in parallel, and with increased speed, because it can access several blocks of a file with several parallel threads.

With the possibility of handing files to a filesystem and starting to analyze them right where they are, we enter another paradigm.

MapReduce, the programming paradigm, supports the parallel processing of many files in this distributed environment. Every participating node runs calculations over a certain subset of all the files that must be analyzed. In the end, the results are aggregated by a driver node and returned to the querying instance. In contrast to the Data Warehouse, we can start analyzing data on a Schema-On-Read basis. This means we decide on the structure of the files that are processed right when they are being processed. The Data Warehouse, in comparison, is based on a Schema-On-Write strategy, where the tables must be defined before the data is loaded into the database.

In the Data Lake world, you, as the developer, do not need to plan structures weeks ahead, as you would in the Data Warehouse. Imagine you have stored a vast quantity of files, where each row in a file reflects an event and consists of, let's say, 100 columns. Using Schema-On-Read, you can just decide to only count the rows of all the files; you won't need to cut the rows into their columns for this. For other purposes, you might need to split the rows into their individual columns during the reading process to access the detailed information, for example, to predict a machine failure based on the content of the columns. Using a database, you can mimic this behavior, but you would need to create different objects and selectively store data for each intention.
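A short PySpark sketch illustrates the Schema-On-Read idea. The file path and the column names (machine_id, temperature) are hypothetical assumptions for this example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Hypothetical landing-zone path holding many raw event files.
raw_path = "abfss://lake@mydatalake.dfs.core.windows.net/landing/events/*.csv"

# Purpose 1: just count rows. No schema is applied; the rows are never
# split into their 100 columns.
row_count = spark.read.text(raw_path).count()
print(f"Total events: {row_count}")

# Purpose 2: the same files, now read with a structure decided at read time,
# because this analysis needs a few specific columns.
events = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
)
events.groupBy("machine_id").agg(F.avg("temperature")).show()
```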

Understanding the Data Lake ecosystem

HDFS is the center of a Data Lake system. But we can't get very far with just a filesystem, even one as sophisticated as this. The open source Hadoop world around HDFS has therefore developed many services to interact with and process the content that's kept in the distributed storage. There are services such as Hive, a SQL-like Data Warehousing service; Pig, for highly parallel analysis jobs; Spark, a large-scale, in-memory analytical engine; Storm, a distributed real-time streaming analysis engine; Mahout, for machine learning; HBase, a database; Oozie, for workflows; and many more.

Spark clusters, with their ability to run distributed, in-memory analytical processes on distributed datasets, have influenced the market heavily in recent years. Scala, one of the main programming languages used with Spark, is a JVM-based language. It can be used to write high-performance routines for ETL/ELT, data cleansing, and transformation, but also for machine learning and artificial intelligence on massive datasets. Python also made it into this technology and is one of the go-to languages here. R, as the statistical programming language of choice for many data scientists, was also a must for the Spark offering. And, since dealing with data has been important for so many years, SQL is, unsurprisingly, a language that could not be skipped in such an environment.

These services and languages make the open source Hadoop ecosystem a rich, complex engine for processing and analyzing massive amounts of data. They are available across the clusters and can interact with HDFS to make use of the distributed files and compute.

Tip

Jumping into a big data challenge should, just like a Data Warehouse project, always be a well-prepared and intensively examined undertaking. Starting it from an "I want that too!" position would be the worst driver you can have, and unfortunately this happens far too often. When you start with a purpose and find the right tool for that purpose, that selection at least creates the right environment for a successful project. Maybe we will find some alternatives for you throughout this book.

Comparing Data Lake zones

Very similar to the Data Warehouse layers we discussed previously, a Data Lake is also structured in different layers. These layers form the stages that the data must go through on its way to being formed into information. The three major zones are pretty much comparable. We can form a landing zone in the Data Lake, where data is written "as is", in the exact format in which it comes from the source. The transient zone then compares to the cleansing area of the Data Warehouse. Data is transformed, wrangled, and massaged to suit the requirements of the analyzing users. As no one should access either the landing zone or the transient zone, the curated zone is where you're heading, and what you, as the developer, will allow your users to access. Other (similar) concepts talk about Bronze, Silver, and Gold areas, or use other terms. Just like in the Data Warehouse concept, Data Lakes can include additional zones: for example, zones for master data, user sandboxes, or logging areas for your routines. We can have several user- or group-related folder structures in the curated zone where data is aggregated, just like the Data Marts of the DWH.
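As a minimal sketch of how data might move between zones (the container and folder layout here is an assumption, not a prescription from the book), a Spark job could read raw files from the landing zone, cleanse them, and persist curated Parquet:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zone-promotion-sketch").getOrCreate()

# Hypothetical zone layout within one Data Lake filesystem.
landing = "abfss://lake@mydatalake.dfs.core.windows.net/landing/sales/2021/"
curated = "abfss://lake@mydatalake.dfs.core.windows.net/curated/sales/"

# Read the data exactly as delivered by the source ("as is").
raw = spark.read.option("header", "true").csv(landing)

# A trivial cleansing step standing in for the transient-zone work:
# drop rows missing a mandatory key and normalize a column name.
clean = raw.dropna(subset=["OrderId"]).withColumnRenamed("ORDER_DT", "OrderDate")

# Persist the curated zone in an analysis-friendly, columnar format.
clean.write.mode("overwrite").parquet(curated)
```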

Discovering caveats

So, yes, you can be flexible and agile. But in a Data Lake with Schema-On-Read, you'll need to decide which attributes you want to analyze or are needed by your machine learning models for training. This will, after all, force you to structure your sources and will therefore force you into a certain project life cycle. You will go through user stories, requirements analysis, structuring and development, versioning, and delivering artifacts.

If you're only analyzing tabular-oriented data, maybe it's worth checking whether your hardware, your ETL tool, and your company databases can scale to the needed volume.

This question, along with the nature, complexity, and format of the source data, should be taken into account when you're deciding on the technology to use. If you need to process sound files, images, PDFs, or similar, then this is no job for a database. SQL does not offer language elements for this (although you can add programming extensions to your database). But here, we have a definitive marker for the possible usage of technologies other than databases.

Once you have analyzed the so-called unstructured data, you will structure the results back into a tabular-oriented result set as an array (a tabular representation) of data. This needs to be delivered somehow to the recipients of your analysis. But how is that done in a Data Lake?

The typical reporting tools still require tables or table-like source data to work with. Often, they will import the source data into their internal database to be able to quickly answer the reporting and dashboarding operations of their users. Experience with these visualization tools has shown that reporting directly against a vast number of files in a Data Lake does not perform well. Data always needs to be condensed into digestible chunks for these purposes and should then also be stored in a suitable and accessible way.

Funnily enough, many of the services in the Hadoop ecosystem are equipped with functionality similar to what databases have offered, and been optimized for, for ages. Data Warehouse databases are more mature in many respects: when we look at Hive as the Data Warehouse service in Hadoop, for example, it still can't update data and can't run queries with nested subqueries in HiveQL. But over the years, it has been extended with all kinds of database-like functionality (views, indexes, and many more).

Important Note

The Data Lake approaches are still missing a fine-grained security model that can deliver centralized security mechanisms to control Row-Level-Security and Column-Level-Security over data while also allowing an easy-to-implement mechanism for Data Masking.

Understanding the opportunities of modern cloud computing

Hyperscale cloud vendors such as Microsoft, AWS, and Google, where you can just swipe your credit card and get storage, a web server, or a VM of your choice, are a real game changer in the IT world and have started an unstoppable revolution. The speed at which new applications are developed and rolled out gets faster and faster, and these cloud technologies reduce time-to-market dramatically. And the cloud vendors don't stop there. Let's examine the different types of offerings they provide you as their customer.

Understanding Infrastructure-as-a-Service

Infrastructure-as-a-Service (IaaS) was only the start. Foremost, we are talking about delivering VMs. You can set up a system of VMs just as you would within your own data center, with domain controllers, database and application servers, and so on. Another example of IaaS would be storage that can be integrated into the customers' workspaces just like a virtual network drive.

Sure, a VM with a pre-installed OS and maybe some other software components is a nice thing to have. And you can even scale it if you need more power. But you would still need to take care of patching the operating system and all the software components that have been installed, the backup and disaster recovery measures, and so on.

Nowadays, cloud vendors are making these activities easier and help automate this process. But still, this is not the end of the speed, automation, and convenience that a cloud offering could bring:

Figure 1.1 – On-premises, IaaS, PaaS, and SaaS compared

We'll now check out PaaS.

Understanding Platform-as-a-Service

Platform-as-a-Service (PaaS) soon gained the attention of cloud users. Need a database to store data from your web app? You go to your cloud portal, click the necessary link, fill in some configuration information and click OK, and there, you are done... your web app now runs on a service that can offer far better Service-Level Agreements (SLAs) than the tower PC under the desk at your home.

For example, databases, as one interesting representative of PaaS, offer automated patching with new versions, backup, and disaster recovery deeply integrated into the service. The same is true for the queue services or streaming components offered in the cloud.

For databases, there are even point-in-time restores from backups that go quite far back in time, and automated geo redundancy and replication are possible, among many more functions. Without PaaS support, an administrator would need to invest a lot of time into providing a similar level of maintenance and service.

No wonder that other use cases have made it into the offerings of the Hyperscalers (the big cloud vendors, which can offer nearly unlimited resources). This is also why they all nowadays offer PaaS capabilities (with differing levels of completeness, of course) to implement a Modern Data Warehouse/Data Lakehouse in their data centers.

Some of these offerings are as follows:

Storage components that can store nearly unlimited volumes of data.

Data integration tools that can access on-premises, cloud, and third-party software.

Compute components with the ability to scale and process data in a timely manner.

Databases that can hold the amount of data the customer needs and answer queries against that data with enough power to fulfill the time requirements set by users.

Reporting/dashboarding tools, or the connectivity for tools of other vendors, to consume the data in an understandable way.

DevOps life cycle support to help you develop, version, deploy, and run an application and gain even more speed with higher quality.

Understanding Software-as-a-Service

The next evolutionary step was Software-as-a-Service (SaaS), where you get an "all-you-can-eat" software service to fulfill your requirements, without implementing the software at all. Only the configuration, security, and connections need to be managed with a SaaS solution. Microsoft Dynamics 365 is an example of such a system. You can use an Enterprise Resource Planning (ERP) suite from the cloud without the need to license and install it in your data center. Sure, the right configuration and development of the ERP in the cloud still need to be done, but the Hyperscaler cloud vendor and/or the SaaS provider relieves you of decisions about server hardware, backup and restore, SLAs, and so on.

However, not all cloud users want to or can rely on predefined software to solve their problems. The need for individual software is still high and, as we discussed previously, the virtualization options that are available on the cloud platforms are very attractive in this case.

Examining the possibilities of virtual machines

Development can follow a few different options. Microsoft Azure, for example, offers different capabilities to support application development on the platform. Starting with VMs, you can make use of any setup and configuration for your application that you can think of. Spin up the VM, install and run the app, and off you go.

But don't forget – everything that you need to run the app needs to be installed and configured upfront. Maybe there are VM images available that come predefined with a certain configuration. Still, after spinning it up, you as the developer are responsible for keeping the operating system and all the installed components up to date and secure.

The cloud vendor, however, supports you with automation features to help you keep up with changes – to back up the VM, for example – but like every individual server or PC, a VM is a complete computer and brings that level of complexity.

Let's assume you need to run a small piece of software to offer a function or one or more microservices; a cloud-based VM may be overkill. Even with all the cloud advantages of having a VM without the need to spin up and maintain your own hardware and data center, this can be quite costly and more complex than what is really needed.

Again, the Hyperscaler can help as it has different offerings to overcome this complexity. Looking at Microsoft's example, there are different approaches to solving this issue.

Understanding Serverless Functions

One of these approaches is Azure Functions, Microsoft's serverless compute offering, which you can use to implement functionality in different languages. You can choose among C#, Java, JavaScript, PowerShell, and Python to create the function. Then, you deploy it and simply rely on the managed runtime environment. Seamless integration with the rest of the services, especially the data services on the platform, is available to you.

Serverless, in this case, means that the underlying infrastructure will scale with increasing and decreasing requests against the function, without the developer or the administrator needing to manually scale the backing components. This can be configured using different hosting plans, with different limits to keep the cost under control.
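To give a feel for how little scaffolding such a function needs, here is a minimal HTTP-triggered sketch, assuming the decorator-based Azure Functions Python programming model; the route name, payload shape, and response text are illustrative assumptions only:

```python
import azure.functions as func

# The FunctionApp object is the entry point of the decorator-based Python model.
app = func.FunctionApp(http_auth_level=func.AuthLevel.ANONYMOUS)


@app.route(route="ingest")
def ingest(req: func.HttpRequest) -> func.HttpResponse:
    # A hypothetical JSON payload handed over by a data pipeline or device.
    body = req.get_json()
    record_count = len(body.get("records", []))

    # The platform scales instances up and down with the request volume;
    # the code itself contains no infrastructure concerns.
    return func.HttpResponse(f"Accepted {record_count} records.", status_code=202)
```

The function body only deals with the payload; everything around it, from hosting to scaling, is handled by the platform.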

Cost, besides implementation speed and flexibility, is one of the most important factors when moving to cloud platforms. With the example of Serverless Functions, we find a great template for the elasticity of cloud services and the possibilities of saving money in comparison to on-premises projects.

Using a seamless scaling service for your functionality gives you the possibility to start small and start experimenting, without spending too much money. Should you find out that your development trajectory points to a dead end, you can directly stop all efforts. You can then archive or even delete the developed artifacts and restart your development with a new approach, without spending a fortune purchasing new hardware and software licenses.

Time is the investment here. Maybe these possibilities will lead to more and better judgment about the status and viability of a project that is underway. And maybe, in the future, less time will be wasted on investments that need to be justified and followed through on. However, this may also lead to quick-started, half-baked artifacts that are rolled out too early just because they can be started easily and cheaply. This can result in expensive replacement efforts and running in circles. Therefore, even with a cost-effective runtime environment and quick, easy development and deployment, solutions should still be approached carefully, and designed and planned to a certain degree.

Looking at the importance of containers

Container technologies are another way to benefit from the cloud's elastic capabilities. They offer more fine-grained possibilities for development and configuration than Serverless Functions. Every Hyperscaler nowadays offers one or more container technologies.

Containers enable you to deploy modularized functionality into a runtime environment. They abstract away the OS, bundle all the necessary libraries and dependencies, and can hold any routine or executable that you might need. In comparison to VMs, containers can easily be ported and deployed from one OS to another. The containers running in a certain environment use the kernel of the host OS and share it with each other.

VMs, in contrast, each run their own OS, so they also need far more maintenance and care. Within a container, you can concentrate on the logic and the code you want to implement, without the need to organize the "computer" around it. This also leads to a far smaller footprint for a container compared to a VM. Containers also boot far quicker than VMs and can be made available quickly in an on-demand fashion. This leads to systems that can react almost instantly to all kinds of events. A failing container, for example, can be mitigated by spinning up the image again and redirecting connections to it internally.

Therefore, in terms of their basic usage, containers are stateless to ensure quick startup and replacement. This adds to the elasticity of cloud services. Developing and versioning container modules eases their usage and increases the stability of applications based on this technology. You can roll back a buggy deployment easily, and deployments can be modularized down to small services (the microservices approach).

Containers can be developed and implemented on your laptop and then deployed to a cloud repository, where they can be instantiated by the container runtime of choice.

In Microsoft Azure, the offering for containers starts with the Azure Container Registry, where container images can be stored and instantiated from. To run containers on Azure, you can use different offerings, such as Azure Container Instances or Azure Kubernetes Service (AKS). Azure Red Hat OpenShift and Azure Batch can also be used to run containers in some situations. It is also worth mentioning Azure Service Fabric and Azure App Service, which can host, use, or orchestrate containers on Azure:

Figure 1.2 – Virtual machines versus containers

Exploring the advantages of scalable environments

Modern Data Warehouse requirements may reach beyond the functionality of the typical out-of-the-box components of ETL/ELT tools. As serverless functions and containers can run a wide variety of programming languages, they can be used to add nearly any function that is not available by default. A typical example is consuming streaming data as it pours into the platform's queuing endpoints. As some of the streaming components might not offer extended programming functionality, serverless functions and/or containers can add the needed features, and can also improve performance while keeping the overall solution simple and easy to implement.

Implementing elastic storage and compute

Talking about the possibilities created by cloud computing in the field of data, analytics, and AI, I need to mention storage and compute and the distinction between the two. When scaling cloud components to fulfill volume or performance requirements, you will often find services where the two are closely coupled. Looking at VMs, for example, higher performance tiers with more virtual CPUs automatically come with more local storage, which will always cause higher costs in both dimensions. This is also true for scaling many databases. Adding vCores to a database will also add memory and scale the disk space ranges. Looking at the databases on Azure, for example, the disk space can be influenced within certain ranges, but these ranges are still coupled to the amount of compute that the database consumes.

The development of serverless services in the data and AI sector is leading to new patterns in storage and compute. Many use cases benefit from the separate scaling possibilities that this distinction offers. There are also complex computations on smaller datasets whose internal complexity needs significant computational power, be it for a high number of iterations or wider join requirements. Elastically increasing compute for a particular calculation is an opportunity to save money. Why?

As you spin up or scale a computational cluster and take it back down once the computation is done, you only pay for what you consume. The rest of the time, you run the clusters in "keep-the-lights-on" mode at a lower level to fulfill routine requests. If you don't need a compute component, you can take it down and switch it off:

Figure 1.3 – Scaling options for Spark pools in Azure Synapse

More and more, we even see automatic scaling for compute components, such as Spark clusters (as shown in the preceding screenshot). You can decide on lower and upper limits for the cluster so that it reacts instantly to computational peaks when they occur during a computation. The clusters will even go into hibernation mode if there is no computation going on. This can also be accomplished with the database technology that makes up the Modern Data Warehouse on Microsoft Azure, for example.

As soon as a database is not experiencing high demand, you can scale it back down to a minimum to keep it working and answering occasional requests. Or, if it is not needed at all, you can just switch it off and avoid any compute costs during that time. This comes in handy when you're working with development or test systems, or with databases that are created for certain purposes and aren't needed 24/7. You can even do this with production systems that are only used during office hours or for an ELT/ETL load.
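As an illustration (a sketch only; the logical server, pool name, and service objective are assumptions), a Synapse dedicated SQL pool can be scaled down with a single T-SQL statement, here issued from Python:

```python
import pyodbc

# Scaling statements cannot run inside a transaction, hence autocommit=True.
# Connect to the logical server's master database, not to the pool itself.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=master;"
    "UID=admin;PWD=<secret>",
    autocommit=True,
)

# Scale the hypothetical pool down to a small service objective outside
# office hours; scale it back up before the nightly ELT load.
conn.cursor().execute(
    "ALTER DATABASE mydwhpool MODIFY (SERVICE_OBJECTIVE = 'DW100c')"
)
conn.close()
```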

Hyperscaler cloud vendors don't limit you in terms of the amount and formats of data you can put into your storage service. They also often offer different storage tiers, where you can save money by deciding how much data needs to be in the hot tier, a slightly more expensive tier for recurring read/write access, and how much can live in the "cool" tier, which is cheaper for data that goes longer periods without access. Even archive tiers are available. They are the cheapest and are intended for data that must be kept but will be accessed only in rare cases.
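For example (a hedged sketch; the account, container, and blob names are placeholders), the Azure Blob Storage SDK for Python lets you move a blob between tiers with one call:

```python
from azure.storage.blob import BlobClient

# Hypothetical blob that has not been read for months.
blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="archive-data",
    blob_name="sales/2018/export.parquet",
)

# Demote the blob from the hot tier to the cheaper archive tier.
blob.set_standard_blob_tier("Archive")
```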

Cloud storage systems act just like the hard drive in your laptop, but with far greater capacity. Looking at a Modern Data Warehouse that wants to access the storage, you need sufficient performance when writing to or reading from such storage. And performance is something the big cloud vendors work on constantly.

Talking about Azure specifically, many measures are taken to improve performance, but also stability and reliability. One example is the fact that files are stored in triplicate and spread over different disks in the data center; they are also split into blocks, as on HDFS, for parallel access and performance gains. This, together with other, far more sophisticated techniques, increases the reading speed for analysis significantly and adds to the stability, availability, and reliability of the system.

The cloud vendors, with their storage, computational capabilities, and the elasticity of their service offerings, lay the foundation for you to build successful and financially competitive systems. The nature of the pay-as-you-go models makes it easy for you to get started with a cloud project, pursue a challenge, and succeed with a good price/performance ratio. And, looking at the PaaS offerings, a project can be equipped with the required components in hours instead of weeks or even months. You can react when the need arises, instead of purchasing new hardware and software through lengthy processes. If a project becomes obsolete or goes down the wrong path for any reason, deleting the related artifacts and eliminating the related cost can be done very easily.

Cloud technology can help you get things done more quickly, efficiently, and at a far lower cost. We will be talking about the advantages of cloud security later in this book.

Exploring the benefits of AI and ML

Companies start