44,39 €
Azure Data Lake, the modern data warehouse architecture, and related data services on Azure enable organizations to build their own customized analytical platform to fit any analytical requirements in terms of volume, speed, and quality.
This book is your guide to learning all the features and capabilities of Azure data services for storing, processing, and analyzing data (structured, unstructured, and semi-structured) of any size. You will explore key techniques for ingesting and storing data and perform batch, streaming, and interactive analytics. The book also shows you how to overcome various challenges and complexities relating to productivity and scaling. Next, you will be able to develop and run massive data workloads to perform different actions. Using a cloud-based big data-modern data warehouse-analytics setup, you will also be able to build secure, scalable data estates for enterprises. Finally, you will not only learn how to develop a data warehouse but also understand how to create enterprise-grade security and auditing big data programs.
By the end of this Azure book, you will have learned how to develop a powerful and efficient analytical platform to meet enterprise needs.
Das E-Book können Sie in Legimi-Apps oder einer beliebigen App lesen, die das folgende Format unterstützen:
Seitenzahl: 553
Build modern data warehouses on Microsoft Azure
Patrik Borosch
BIRMINGHAM—MUMBAI
Copyright © 2021 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Kunal Parikh
Publishing Product Manager: Reshma Raman
Senior Editor: David Sugarman
Content Development Editor: Joseph Sunil
Technical Editor: Manikandan Kurup
Copy Editor: Safis Editing
Project Coordinator: Aparna Nair
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Vijay Kamble
First published: July 2021
Production reference: 1250621
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80056-293-6
www.packt.com
A big thank you to Packt Publishing: Reshma Raman, Gebin George, David Sugarman, Aishwarya Mohan and, of course, Joseph Sunil. It was a ride, and your support was vital to finish this book. I am very proud, humbled, and thankful that you approached me and gave me the chance to write this book.
Special thanks go to my manager Zoran Draganic, who encouraged me to take on this challenge and supported me throughout the writing process.
I want to say another huge thank you to my CSA colleague Meinrad Weiss. Your technical expertise, your honest (and always to the point) feedback, and our reviews helped me to learn even more and to improve the quality of the book.
A fourth big thank you to Liviana Zürcher, another CSA alongside Meinrad and myself. Your reviews and your support, especially when I started the writing, were more than important, as you gave me faith and kept me going during the critical starting phase.
Finally, I need to thank my wife, Simone, from the deepest bottom of my heart. She was with me in this challenge and stayed patient during all those night sessions, and all the times I told her, "I need to finish the next chapter!" Without your support I wouldn't have been able to research, experiment, and finish this book.
Patrik Borosch is a Cloud Solution Architect for Data and AI at Microsoft Switzerland GmbH. He has more than 25 years of BI and analytics development, engineering, and architecture experience and is a Microsoft Certified Data Engineer and a Microsoft Certified AI Engineer. Patrik has worked on numerous significant international Data Warehouse, Data Integration and Big Data projects. There, he has built and extended his experience in all facets from requirement engineering over data modelling and ETL all the way to reporting and dashboarding. At Microsoft Switzerland, he supports customers in their journey into the analytical world of Azure Cloud.
Pradeep Menon is a seasoned data analytics professional with more than 18 years of experience in data and AI. Currently, Pradeep works as a data and AI strategist with Microsoft. In this role, he is responsible for helping Microsoft's strategic customers across Asia to be more data-driven by using cloud, big data, and AI technologies. He is also a distinguished speaker and blogger and has given numerous keynotes on cloud technologies, data, and AI. He has previously worked at Microsoft, Alibaba Group, and IBM.
Liviana Zürcher started her career in Romania as Technical Sales Consultant for Business Intelligence and Data Warehouse at Oracle. She has been working a few years as a Big Data Warehouse and Business Intelligence consultant and trainer, having projects all over the world. Finally, she has joined Microsoft Switzerland team as Cloud Solution Architect for Data and Artificial Intelligence at Microsoft, where she is currently working since 2018.
First of all, I would like to thank my parents that have always helped me and encouraged me to go out from my comfort zone. Special thank you to my wonderful husband and to our two amazing children for their support. I thank Patrik for this opportunity, I was flattered by your request.
Meinrad Weiss works as a Senior Cloud Solution Architect in the Data and AI Team at Microsoft Switzerland. He is a very experienced Database expert and comes with a long and successful track record in Data, BI and analytical projects. Meinrad’s expertise spans from RDMS on-prem and in the cloud up to the most complex analytical architectures with the Azure Data Services and the field of IoT at Microsoft. He joined Microsoft in 2017 and became a vital and reliable pillar of the Data and AI architects team.
This section will examine the question of whether data warehouses are still required given the rise of the enterprise data lake and provides a brief overview of the trends and development on the market of data and AI. As cloud computing adds flexible and scalable services to AI, there are no more limits in terms of source formats and volumes that can be processed for AI requirements, and given that AI and machine learning are on everybody's mind at the moment, the book attempts to ask what all this entails and where we are heading. In addition, we'll take a technology-agnostic look at the components that make up a successful analytical system. From an agnostic viewpoint, we will try to find the right Azure services to build a modern data warehouse.
This section comprises the following chapters:
Chapter 1, Balancing the Benefits of Data Lakes over Data WarehousesChapter 2, Connecting Requirements and TechnologyIs the Data Warehouse dead with the advent of Data Lakes? There is disagreement everywhere about the need for Data Warehousing in a modern data estate. With the rise of Data Lakes and Big Data technology, many people use other, newer technologies compared to databases for their analytical efforts. Establishing a data-driven company seems to be possible without all those narrow definitions and planned structures, the ETL/ELT, and all the indexing for performance. But when we examine the technology carefully, when we compare the requirements that are formulated in analytical projects, free of prejudice to the functionality that the chosen services or software packages can deliver, we often find gaps on both ends. This chapter discusses the capabilities of Data Warehousing and Data Lakes and introduces the concept of the Modern Data Warehouse.
With all the innovations that have been brought to us in the last few years, such as faster hardware, new technologies, and new dogmas such as the Data Lake, older concepts and methods are being questioned and challenged. In this chapter, I would like to explore the evolution of the analytical world and try to answer the question, is the Data Warehouse really obsolete?
We'll find out by covering the following topics:
Distinguishing between Data Warehouses and Data Lakes Understanding the opportunities of modern cloud computing Exploring the benefits of AI and ML Answering the questionThere are several definitions of Data Warehousing on the internet. The narrower ones characterize a warehouse as the database and the model used in the database; the wider descriptions look at the term as a method, and a suitable collection of all organizational and technological components that make up a BI solution. They talk about everything from the Extract Transform Load tool (ETL tool) to the database, the model, and, of course, the reporting and dashboarding solution.
When we look at the Data Warehousing method in general, at its heart, we find a database that offers a certain table structure. We almost always find two main types of artifacts in the database: Facts and Dimensions.
Facts provide all the measurable information that we want to analyze; for example, the quantities of products sold per customer, per region, per sales representative, and per time. Facts are normally quite narrow objects, but with a lot of rows stored.
In the Dimensions, we will find all the descriptive information that can be linked to the Facts for analysis. Every piece of information that a user puts on their report or dashboard to aggregate and group the fact data, filter it, and view it is collected in the Dimensions. All the data related to customer information, such as Product, Contract, Address, and so on, that might need to be analyzed and correlated is stored here. Typically, these objects are stored as tables in the database and are joined using their key columns. Dimensions are normally wide objects, sometimes with controlled redundancy, that look at the given modeling method.
Three main methods for modeling the Facts and Dimensions within a Data Warehouse database have crystalized over the years of its evolution:
Star-Join/Snowflake: This is probably the most famous method for Data Warehouse modeling. Fact tables are put in the center of the model, while Dimension tables are arranged around them, inheriting their Primary Key into the Fact table. In the Star-Join method, we find a lot of redundancy in the tables since as all the Dimension data, including all hierarchical information (such as Product Group -> Product SubCategory -> Product Category) regarding a certain artifact (Product, Customer, and so on), is stored in one table. In a Snowflake schema, hierarchies are spread in additional tables per hierarchy level and are linked over relationships with each other. This, when expressed in a graph, turns out to show a kind of snowflake pattern.Data Vault: This is a newer method that reflects the rising structural volatility of data sources, which offers higher flexibility and speed for developing. Entities that need to be analyzed are stored over Hubs, Satellites, and Links. Hubs simply reflect the presence of an entity by storing its ID and some audit information such as its data source, create times, and so on. Each hub can have one or more Satellite(s). These Satellites store all the descriptive information about the entity. If we need to change the system and add new information about an entity, we can add another Satellite to the model, reflecting just the new data. This will bring the benefit of non-destructive deployments to the productive system in the rollout. In the Data Vault, the Customer data will be stored in one Hub (the CustomerID and audit columns) and one or more Satellites (the rest of the customer information). The structure in the model is finally brought by Links. They are provided with all the Primary Keys of the Hubs and, again, some metadata. Additionally, Links can have Satellites of their own that describe the relationships of the Link content. Therefore, the connection between a Customer and the Products they bought will be reflected in a Link, where the Customer-Hub-Key and the Product-Hub-Key are stored together with audit columns. A Link Satellite can, for example, reflect some characteristics of the relationship, such as the amount bought, the date, or a discount. Finally, we can even add a Star-Join-View schema to abstract all the tables of the Data Vault and make it easier to understand for the users.3rd Normal form: This is the typical database modeling technique that is also used in (and first created for) so-called Online Transactional Processing (OLTP) databases. Artifacts are broken up into their atomic information and are spread over several tables so that no redundancy is stored in either table. The Product information of a system might be split in separate tables for the product's name, color, size, price information, and many more. To derive all the information of a 3rd Normal Form model, a lot of joining is necessary.But how does data finally land in the Data Warehouse database? The process and the related tools are named Extract, Transform, Load (ETL), but depending on the sequence we implement it in, it may be referred to as ELT. You'll find several possible ways to implement data loading into a Data Warehouse. This can be done by implementing specialized ETL tools such as Azure Data Factory in the cloud, SQL Server Integration services (SSIS), Informatica, Talend, or IBM Data Stage, for example.
The biggest advantage of these tools is the availability of wide catalogues of ready-to-use source and target connectors. They can connect directly to a source, query the needed data, and even transform it while being transported to the target. In the end, data is loaded into the Data Warehouse database. Other advantages include its graphical interfaces, where complex logic can be implemented on a "point-and-click" basis, which is very easy to understand and maintain.
There are other options as well. Data is often pushed by source applications and a direct connection for the data extraction process is not wanted at all. Many times, files are provided that are stored somewhere near the Data Warehouse database and then need to be imported. Maybe there is no ETL tool available. Since nearly every database nowadays provides loader tools, the import can be accomplished using those tools in a scripted environment. Once the data has made its way to the database tables, the transformational steps are done using Stored Procedures that then move the data through different DWH stages or layers to the final Core Data Warehouse.
Talking about the Data Warehouse layers, we nearly always find several steps that are processed before the data is provided for reporting or dashboarding. Typically, there are at least the following stages or layers:
Landing or staging area: Data is imported into this layer in its rawest format. Nothing is changed on its way to this area and only audit information is added to track down all the loads.QS, transient, or cleansing area: This is where the work is done. You will only find a few projects where data is consistent. Values, sometimes even mandatory ones from the source, may be missing, content might be formatted incorrectly or even corrupted, and so on. In this zone, all the issues with the source data are taken care of, and the data is curated and cleansed.Core Data Warehouse: In this part of the database, the clean data is brought into the data model of choice. The model takes care of data historization, if required. This is also one possible point of security – or, even better, the central point of security. Data is secured so that all participating users can only see what they can see. The Core Data Warehouse is also the place where performance is boosted. By using all the available database techniques, such as indexes, partitioning, compression, and so on, the Data Warehouse will be tuned to suit the performance needs of the reporting users. The Core Data Warehouse is often also seen as the presentation layer of the system, where all kinds of reporting and dashboarding and self-service BI tools can connect and create their analysis.Additionally, Data Warehouses have also brought in other types of layers. There might be sublayers for the cleansing area, for example, or the Data Marts, which are used to slice data semantically for the needs of certain user groups, or the Operational Data Store (ODS), which has different definitions, depending on who you ask. Sometimes, it is used as an intermediary data store that can be used for reporting. Other definitions speak of a transient zone that stores data for integration and further transport to the DWH. Sometimes, it is also used as a kind of archived data pool that the Data Warehouse can always be reproduced from. The definitions vary.
In terms of the reporting/dashboarding solution, we will also find several approaches to visualize the information that's provided by the Data Warehouse. There are the typical reporting tools, such as SQL Server Reporting Service or Crystal Reports, for example. These are page-oriented report generators that access the underlying database, get the data, and render it according to the template being used.
The more modern approach, however, has resulted in tools that can access a certain dataset, store it in an internal caching database, and allow interactive reporting and dashboarding based on that data. Self-Service BI tools such as Power BI, QLIK, and Tableau allow you to access the data, create your own visuals, and put them together in dashboards. Nowadays, these tools can even correlate data from different data sources and allow you to analyze data that might not have made it to the Data Warehouse database yet.
You can scale databases with newer, faster hardware, more memory, and faster disks. You can also go from Symmetric Multi-Processing (SMP) to Massively Parallel Processing (MPP) databases. However, the usual restrictions still apply: the more data we need to process, the longer it will take to do so. And there are also workloads that databases will not support, such as image processing.
It's funny that we find a similar mix or maybe even call it confusion when we examine the term Data Lake. Many people refer to a Data Lake as a HadoopBig Data implementation that delivers one or more clusters of computers, where a distributed filesystem and computational software is installed. It can deal with a distributed Input and Output (I/O) on the one hand but can also do distributed and parallel computation. Such a system adds specialized services for all kinds of workloads, be it just SQL queries against the stored data, in-memory computation, streaming analytics – you name it. Interestingly, as we mentioned while discussing Data Warehouses discussion, the narrower definition of the Data Lake only refers to the storage solution, and less to the method and the collection of services. To be honest, I like the wider definition far better, as it describes the method more holistically.
With the Hadoop Distributed File System (HDFS) from the Apache Software Foundation, a system is delivered that is even capable of storing data distributed over cheap legacy hardware clusters. HDFS splits the files into blocks and will also replicate those blocks within the system. This not only delivers a failsafe environment, but it also generates the biggest advantage of HDFS: parallel access to the data blocks for the consuming compute components. This means that a Spark cluster, for example, can read several data blocks in parallel, and with increased speed, because it can access several blocks of a file with several parallel threads.
With the possibility to hand files to a filesystem and start analyzing them just where they are, we enter another dogma.
MapReduce, the programming paradigm, supports parallel processing many files in this distributed environment. Every participating node runs calculations over a certain subset of all the files that must be analyzed. In the end, the results are aggregated by a driver node and returned to the querying instance. Different to the Data Warehouse, we can start analyzing data on a Schema-On-Read basis. This means we decide on the structure of the files that are processed, right when they are being processed. The Data Warehouse, in comparison, is based on a Schema-On-Write strategy, where the tables must be defined before the data is loaded into the database.
In the Data Lake world, you, as the developer, do not need to plan structures weeks ahead, as you would in the Data Warehouse. Imagine you have stored a vast quantity of files, where each row in a file will reflect an event, and each row also consists of, let's say, 100 columns. Using Schema-On-Read, you can just decide to only count the rows of all the files. You won't need to cut the rows into the columns for this process. For other purposes, you might need to split the rows during the reading process into their own columns to access the detailed information. You might need to predict a machine failure based on the content of the columns. Using a database, you can also mimic this behavior, but you would need to create different objects and selectively store data for each intention.
HDFS is the center of a Data Lake system. But we can't get further with just a filesystem, even one as sophisticated as this one. The Hadoop open source world around HDFS has therefore developed many services to interact and process the content that's kept in distributed storage. There are services such as Hive, a SQL-like Data Warehousing service; Pig, for highly parallel analysis jobs; Spark as a large-scale, in-memory analytical engine; Storm as a distributed real-time streaming analysis engine; Mahout for machine learning; Hbase as a database; Oozie, for workflows; and many more.
Spark clusters, with their ability to run distributed, in-memory analytical processes on distributed datasets, have influenced the market heavily in the recent years. Scala, one of the main programming languages used with Spark, is a Java-based programming language. It can be used to write high-performance routines for ETL/ELT, data cleansing, and transformation, but also for machine learning and artificial intelligence on massive datasets. Python also made it into this technology and is one of the go-to languages here. R, as the statistical programming language of choice for many data scientists, was also a must for the Spark offering. Dealing with data has been important for so many years and guess what – SQL is still a language that was not possible to skip in such an environment.
These services and languages are making the open source Hadoop ecosystem a rich, complex engine for processing and analyzing excessive amounts of data. They are available all over the clusters and can interact with the HDFS to make use of the distributed files and compute.
Tip
Jumping into a big data challenge should, just like a Data Warehouse project, always be a well-prepared and intensively examined project. Just starting it from an "I want that too!" position would be the worst driver you can have, and unfortunately happens far too often. When you start with purpose and find the right tool for that purpose, at least that selection might create the right environment for a successful project. Maybe we will find some alternatives for you throughout this book.
Very similar to the Data Warehouse layers, as we discussed previously, a Data Lake is also structured in different layers. These layers form the stages that the data must go through on its way to be formed into information. The three major zones are pretty much comparable. We can form a landing zone in the Data Lake, where data is written to "as-is" in the exact format as it comes from the source. The transient zone then compares to the cleansing area of the Data Warehouse. Data is transformed, wrangled, and massaged to suit the requirements of the analyzing users. As no one should access either the landing zone or the transient zone, the curated zone is where you're heading, and what you, as the developer, will allow your users to access. Other (similar) concepts talk about Bronze, Silver, and Gold areas, or use other terms. Just like in the Data Warehouse concept, the Data Lakes can include other additional zones. For example, zones for master data, user sandboxes, or logging areas for your routines. We can have several user- or group-related folder structures in the curated zone where data is aggregated, just like the Data Marts of the DWH.
So, yes, you can be flexible and agile. But in a Data Lake with Schema-On-Read, you'll need to decide which attributes you want to analyze or are needed by your machine learning models for training. This will, after all, force you to structure your sources and will therefore force you into a certain project life cycle. You will go through user stories, requirements analysis, structuring and development, versioning, and delivering artifacts.
If you're only analyzing tabular-oriented data, maybe it's worth checking if your hardware, your ETL-tool, and your company databases can scale to the needed volume.
This question, along with the nature of the source's data, complexity, and format, should be taken into account when you're deciding on the technology to use. If you need to process sound files, images, PDFs, or similar, then this is no job for a database. SQL does not offer language elements for this (you can add programming extensions to your database). But here, we have a definitive marker for the possible usage of technologies other than databases.
Once you have analyzed the so-called unstructured data, you will structure the results back into a tabular-oriented result set as an array (a tabular representation) of data. This needs to be delivered somehow to the recipients of your analysis. But how is that done in a Data Lake?
The typical reporting tools still require tables or table-like source data to work with. Often, they will import the source data into their internal database to be able to quickly answer reporting and dashboarding operations by their users. Experiences with these visualization tools have shown that it is not very performant to directly report from a vast number of files from a Data Lake. Data always needs to be condensed into digestible chunks for these purposes and should then also be stored in a suitable and accessible way.
Funnily enough, many of the services in the Hadoop ecosystem are equipped with similar functionality to the ones that databases have offered for ages now and are optimized for. Data Warehouse databases are more mature in many respects: when we look at Hive as the Data Warehouse service in Hadoop, for example, it still can't update the data and can't run queries with nested subqueries in Hive-QL. But over the years, it has been extended with all kinds of database-like functionality (views, indexes, and many more).
Important Note
The Data Lake approaches are still missing a fine-grained security model that can deliver centralized security mechanisms to control Row-Level-Security and Column-Level-Security over data while also allowing an easy-to-implement mechanism for Data Masking.
Hyperscale cloud vendors such as Microsoft, AWS, and Google, where you can just swipe your credit card and get storage, a web server, or provision a VM of your choice, are a real game changer in the IT world and started an unstoppable revolution. The speed at which new applications are developed and rolled out gets faster and faster, and these cloud technologies reduce time-to-market dramatically. And the cloud vendors don't stop there. Let's examine the different types of offerings they provide you as their customer.
Infrastructure-as-a-Service (IaaS) was only the start. Foremost, we are talking about delivering VMs You can set up a system of VMs just as you would within your own data center, with domain controllers, database and application servers, and so on. Another example for IaaS would be storage that can be integrated just like a virtual network drive into the customers' workspaces.
Sure, a VM with a pre-installed OS and maybe some other software components is a nice thing to have. And you even can scale it if you need more power. But still, you would need to take care of patching the operating system and all the software components that have been installed, the backup and disaster recovery measurements, and so on.
Nowadays, cloud vendors are making these activities easier and help automate this process. But still, this is not the end of the speed, automation, and convenience that a cloud offering could bring:
Figure 1.1 – On-premises, IaaS, PaaS, and SaaS compared
We'll now check out PaaS.
Platform-as-a-Service (PaaS) soon gained the attention of cloud users. Need a database to store data from your web app? You go to your cloud portal, click the necessary link, fill in some configuration information and click OK, and there, you are done... your web app now runs on a service that can offer far better Service-Level Agreements (SLAs) than the tower PC under the desk at your home.
For example, databases, as one interesting representative of a PaaS, offer automated patching with new versions, backup, and disaster recovery deeply integrated into the service. This is also available for a queue service or the streaming component offered on the cloud.
For databases, there are even point-in-time restores from backups that go quite far back in time, and automated geo redundancy and replication are possible, among many more functions. Without PaaS support, an administrator would need to invest a lot of time into providing a similar level of maintenance and service.
No wonder that other use cases made it to the Hyperscalers' (the big cloud vendors, which can offer nearly unlimited resources) offerings. This is also why they all (with differing levels and completeness, of course) offer PaaS capabilities to implement a Modern Data Warehouse/Data Lakehouse in their data centers nowadays.
Some of these offerings are as follows:
Storage components that can store nearly unlimited volumes of data.Data integration tools that can access on-premises, cloud, and third-party software that reaches compute components. The ability to scale and process data in a timely manner. Databases that can hold the amount of data the customer needs and answer the queries against that data with enough power to fulfill the time requirements set by users. Reporting/dashboarding tools, or the connectivity for tools of other vendors, to consume the data in an understandable way. DevOps life cycle support is available to help you develop, version, deploy, and run an application and gain even more speed with a higher quality.The next evolutionary step was into Software-as-a-Service (SaaS), where you can get an "all-you-can-eat" software service to fulfill your requirements, without implementing the software at all. Only the configuration, security, and connection need to be managed with a SaaS solution. Microsoft Dynamics 365 is an example of such a system. You can use an Enterprise Resource Planning (ERP) suite from the cloud without the need to have it licensed and installed in your data center. Sure, the right configuration and the development of the ERP in the cloud still needs to be done, but the Hyperscaler cloud vendor and/or the SaaS provider relieves you of which server hardware, backup and restore, SLAs, and so on to use.
However, not all cloud users want to or can rely on predefined software to solve their problems. The need for individual software is still high and, as we discussed previously, the virtualization options that are available on the cloud platforms are very attractive in this case.
Development can follow a few different options. Microsoft's Azure, for example, offers different capabilities to support application development on this platform. Starting with VMs, you can make use of any setup and configuration for this application that you can think of. Spin the VM, install and run the app, and off you go.
But don't forget – everything that you need to run the app needs to be installed and configured upfront. Maybe there are VM images available that come predefined in a certain configuration. Still, you as the developer are, after spinning it up, responsible for keeping the operating system and all the installed components up-to-date and secure.
The cloud vendor, however, supports you with automation features to help you keep up with changes – to back up the VM, for example – but like every individual server or PC, a VM is a complete computer and brings that level of complexity.
Let's assume you need to run a small piece of software to offer a function or one or more microservices; a cloud-based VM may be overkill. Even with all the cloud advantages of having a VM without the need of spinning up and maintaining your own hardware and data center, this can be quite costly and more complex than what is really needed.
Again, the Hyperscaler can help as it has different offerings to overcome this complexity. Looking at Microsoft's example, there are different approaches to solving this issue.
One of these approaches is Azure Serverless Functions, which you can use to implement functionality in different languages. You can choose between C#, Java, JavaScript, PowerShell, or Python to create the function. Then, you can deploy it to the environment and just rely on the runtime environment. Seamless integration with the rest of the services, especially the data services on the platform, is available to you.
Serverless, in this case, means that the underlying infrastructure will scale with increasing and decreasing requests against the function, without needing the developer or the administrator to manually scale the backing components. This can be configured using different consumption plans, with different limits to keep control of the cost.
Cost, besides implementation speed and flexibility, is one of the most important factors when moving to cloud platforms. With the example of Serverless Functions, we find a great template for the elasticity of cloud services and the possibilities of saving money in comparison to on-premises projects.
Using a seamless scaling service for your functionality gives you the possibility to start small and start experimenting, without spending too much money. Should you find out that your development trajectory points to a dead end, you can directly stop all efforts. You can then archive or even delete the developed artifacts and restart your development with a new approach, without spending a fortune purchasing new hardware and software licenses.
Time is the investment here. Maybe these possibilities will lead to more and better judgement about the status and viability of a project that is underway. And maybe, in the future, less time will be wasted because of investments that need to be justified and followed. However, this may lead to quick-started, half-baked artifacts that are rolled out too early just because they can be started easily and cheaply. This is a situation that might lead to expensive replacement efforts and a "running in circles" situation. Therefore, a cost-effective runtime environment and a quick and easy development and deployment situation should still be approached carefully, and designed and planned to a certain degree.
Container technologies are another chance to benefit from the cloud's elastic capabilities. They come with some more detailed possibilities for development and configuration in comparison to Serverless Functions. Every Hyperscaler nowadays offers one or more container technologies.
Containers enable you to deploy modularized functionalities into a runtime environment. They abstract from the OS, support all the necessary libraries and dependencies, and can hold any routine or executable that you might need. In comparison to VMs, containers can easily be ported and deployed from one OS to another. The containers that are running in a certain environment use the kernel of the OS and share it with each other.
VMs run their own OSes, apart from each other's, so they will also need far more maintenance and care. Within a container, you can concentrate on the logic and the code you want to implement, without the need to organize the "computer" around that. This also leads to a far smaller footprint for a container compared to VMs. Containers also boot far quicker than VMs and can be available quickly in an on-demand fashion. This leads to systems that can react nearly instantly to all kinds of events. A container failing, for example, can be mitigated by spinning up the image and redirecting the connection to it internally.
Therefore, in terms of their basic usage, containers are stateless to ensure quick. This adds up to the elasticity of cloud services. Developing and versioning of container modules eases their usage and increases the stability of applications based on this technology. You can roll back a buggy deployment easily, and deployments can be modularized down to small services (microservices approach).
Containers can be developed and implemented on your laptop and then deployed to a cloud repository, where they can be instantiated by the container runtime of choice.
In Microsoft Azure, the offering for containers comes from the container registry, where the typical container technologies can be placed and instantiated from. To run containers on Azure, you can use different offerings, such as Azure Container Instances or Azure Kubernetes Service (AKS). Azure Red Hat OpenShift and Azure Batch can also be used to run containers in some situations. It is also worth mentioning Azure Service Fabric and Azure App Service, which can host, use, or orchestrate containers on Azure:
Figure 1.2 – Virtual machines versus containers
Modern Data Warehouse requirements may reach beyond the functionalities of the typical out-of-the-box components of the ETL/ELT tools. As serverless functions and containers can run a wide variety of programming languages, they can be used to add nearly any function that is not available by default. A typical example is consuming streaming data as it pours into the platform to the queuing endpoints. As some of the streaming components might not offer extended programming functionality, serverless functions and/or containers can add the needed features, and can also add improved performance while keeping the overall solution simple and easy to implement.
Talking about the possibilities created by cloud computing in the field of data, analytics, and AI, I need to mention storage and compute and the distinction between the two. To scale cloud components to fulfill volume or performance requirements, there are often services where the two are closely coupled. Looking at VMs, for example, higher performance measures such as virtual CPUs automatically require more local storage, which will always cause higher costs in both dimensions. This is also true for scaling many databases. Adding vCores to a database will also add memory and will scale diskspace ranges. Looking at the databases on Azure, for example, the diskspace can be influenced in certain ranges, but these ranges are still coupled to the amount of compute that the database consumes.
The development of serverless services in the data and AI sector is leading to new patterns in storage and compute. Many use cases benefit from the separate scaling possibilities that this distinction offers. There are also complex computations on smaller datasets whose internal complexity needs some significant computational power, be it for a high amount of iterations or wider join requirements. Elastically increasing compute for a particular calculation is an opportunity to save money. Why?
As you spin up or scale a computational cluster and take it back down once the computation is done, you only pay for what you consume. The rest of the time, you run the clusters in "keep-the-lights-on" mode at a lower level to fulfill routine requests. If you don't need a compute component, you can take it down and switch it off:
Figure 1.3 – Scaling options for Spark pools in Azure Synapse
More and more, we even see automatic scaling for compute components, such as in Spark clusters (as shown in the preceding screenshot). You can decide on lower and upper limits for the cluster to instantly react on computational peaks when they occur during a computation. The clusters will even go into hibernation mode if there is no computation going on. This can also be accomplished with the database technology that makes up the Modern Data Warehouse, for example, on Microsoft Azure.
As soon as a database is not experiencing high demand, you can scale it back down to a minimum to keep it working and answer random requests. Or, if it is not needed at all, you just switch it off and avoid any compute costs during that time. This comes in handy when you're working in development or test systems, or databases that are created for certain purposes and aren't needed 24/7. You can even do this with your production systems, which are only used during office times for an ELT/ETL load.
Hyperscaler cloud vendors don't limit you in terms of the amount and formats of data you can put into your storage service. They also often offer different storage tiers, where you can save money by deciding how much data needs to be in hot mode, a slightly more expensive mode for recurring read/write access, or "cool" mode, which is cheaper for longer periods without access. Even archive modes are available. They are the cheapest modes and intended for data that must be kept but will be accessed only in rare cases.
Cloud storage systems act just like a hard drive would in your laptop, but with far greater capacity. Looking at a Modern Data Warehouse that wants to access the storage, you would need sufficient performance when writing to or reading from such a storage. And performance is always worked on at the big cloud vendors.
Talking about Azure specifically, many measures are taken to improve performance, but also stability and reliance. One example is the fact that files are tripled and spread over different discs in the data center; they are also split into blocks, such as on HDFS, for parallel access and performance gains. This, with other, far more sophisticated techniques, increases the reading speed for analysis significantly and adds to the stability, availability, and reliability of the system.
The cloud vendors, with their storage, computational capabilities, and the elasticity of their service offerings, will lay the foundation for you to build successful and financial competitive systems. The nature of the pay-as-you-go models will make it easy for you to get started with a cloud project, pursue a challenge, and succeed with a good price/performance ratio. And, looking at the PaaS offerings, a project can be equipped with the required components in hours instead of weeks or even months. You can react, when the need arises, instead of purchasing new hardware and software over lengthy processes. If a project becomes obsolete or goes down the wrong path for any reason, deleting the related artifacts and eliminating the related cost can be done very easily.
Cloud technology can help you get things done more quickly, efficiently, and at a far lower cost. We will be talking about the advantages of cloud security later in this book.
Companies start