The famous quote 'Data is the new oil' rings truer every day, as the key to most organizations' long-term success lies in extracting insights from raw data. One of the major challenges organizations face in extracting value from data is building performant data engineering pipelines for data visualization, ingestion, storage, and processing. This second edition of the immensely successful book by Ahmad Osama brings you several recent enhancements in Azure data engineering and shares approximately 80 useful recipes covering common scenarios in building data engineering pipelines in Microsoft Azure.
You’ll explore recipes from Azure Synapse Analytics workspaces Gen 2 and get to grips with Synapse Spark pools, SQL Serverless pools, Synapse integration pipelines, and Synapse data flows. You’ll also understand Synapse SQL Pool optimization techniques in this second edition. Besides Synapse enhancements, you’ll discover helpful tips on managing Azure SQL Database and learn about security, high availability, and performance monitoring. Finally, the book takes you through overall data engineering pipeline management, focusing on monitoring using Log Analytics and tracking data lineage using Azure Purview.
By the end of this book, you’ll be able to build superior data engineering pipelines along with having an invaluable go-to guide.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 394
Year of publication: 2022
Get well versed in various data engineering techniques in Azure using this recipe-based guide
Nagaraj Venkatesan
Ahmad Osama
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Reshma Raman
Senior Editor: Nazia Shaikh
Content Development Editor: Manikandan Kurup
Technical Editor: Sweety Pagaria
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Production Designer: Joshua Misquitta
Marketing Coordinators: Priyanka Mhatre and Nivedita Singh
First published: March 2021
Second edition: September 2022
Production reference: 2070922
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80324-678-9
www.packt.com
Nagaraj Venkatesan works as a cloud solution architect at Microsoft. At Microsoft, he works with some of the largest companies in the world, solving their complex data engineering problems and helping them build effective solutions using cutting-edge technologies based on Azure. Nagaraj, based out of Singapore, is a popular member of the data and AI community and is a regular speaker at several international data and AI conferences. He is a two-time Microsoft Most Valuable Professional (MVP) award winner, in 2016 and 2017. Nagaraj shares his technical expertise through his blog and on his YouTube channel called DataChannel. He also holds a master’s degree in computing from the National University of Singapore.
Ahmad Osama works for Pitney Bowes Pvt. Ltd. as a technical architect and is a former Microsoft Data Platform MVP. In his day job, he works on developing and maintaining high-performance, on-premises and cloud SQL Server OLTP environments, as well as on deployment and task automation using PowerShell. When not working, Ahmad blogs at DataPlatformLabs and can be found glued to his Xbox.
Milind Kumar Chaudhari is an experienced cloud data engineer/architect who specializes in designing, implementing, and managing complex data pipelines. He is a data enthusiast and passionate about leveraging the best data engineering practices to solve critical business problems. He is also currently a technical reviewer with O’Reilly Media.
Vijay Kumar Suda is a senior cloud big data architect with over 20 years of experience working with global clients. He has worked in Switzerland, Belgium, Mexico, Bahrain, India, and Canada and helped customers across multiple industries. He has been based out of the USA since 2008. His expertise includes data engineering, architecture, the cloud, AI, and machine learning.
Firstly, I’d like to thank my parents, Sri Koteswara Rao and Rajyalakshmi, for their love and support in every step of my life. I’d like to thank my wife, Radhika, my son, Chandra, and my daughter, Akshaya, for their daily support and patience. I’d like to thank my siblings, Rama, Swarana, and Dr. SriKumar, for their support, and finally, I’d like to thank Packt for the opportunity to review this book.
Data is the new oil and probably the most valuable resource. Data engineering covers how one can gain insights out of data. This book will introduce the key processes in data engineering (ingesting, storing, processing, and consuming) and share a few common recipes that can help us develop data engineering pipelines to gain insights into our data.
The book follows the logical data engineering process by beginning with Azure Data Lake and covering data ingestion using Azure Data Factory into Azure Data Lake and Azure SQL Database in the first few chapters. In these chapters, the book also covers the management of common storage layers such as Azure Data Lake and Azure SQL Database, focusing on topics such as security, high availability, and performance monitoring. The middle chapters focus on data processing using Azure Databricks, Azure Synapse Analytics Spark pools, and Synapse dataflows, and data exploration using Synapse serverless SQL pools. The final few chapters focus on the consumption of the data using Synapse dedicated SQL pool and Synapse Spark lake databases, covering the tips and tricks to optimize and maintain Synapse dedicated SQL pool databases and lake databases. Finally, the book also has a bonus chapter on managing the overall data engineering pipeline, which covers pipeline monitoring using Azure Log Analytics and tracking data lineage using Microsoft Purview.
While the book can be consumed in parts or any sequence, following along sequentially will help the readers experience building an end-to-end data engineering solution on Azure.
The book is for anyone working on data engineering projects in Azure. Azure data engineers, data architects, developers, and database administrators working on Azure will find the book extremely useful.
Chapter 1, Creating and Managing Data in Azure Data Lake, focuses on provisioning, uploading, and managing the data life cycle in Azure Data Lake accounts.
Chapter 2, Securing and Monitoring Data in Azure Data Lake, covers securing an Azure Data Lake account using firewall and private links, accessing data lake accounts using managed identities, and monitoring an Azure Data Lake account using Azure Monitor.
Chapter 3, Building Data Ingestion Pipelines Using Azure Data Factory, covers ingesting data using Azure Data Factory and copying data between Azure SQL Database and Azure Data Lake.
Chapter 4, Azure Data Factory Integration Runtime, focuses on configuring and managing self-hosted integration runtimes and running SSIS packages in Azure using Azure-SSIS integration runtimes.
Chapter 5, Configuring and Securing Azure SQL Database, covers configuring serverless and Hyperscale SQL databases and securing Azure SQL Database using virtual networks and private links.
Chapter 6, Implementing High Availability and Monitoring in Azure SQL Database, explains configuring high availability to Azure SQL Database using auto-failover groups and read replicas, monitoring Azure SQL Database, and the automated scaling of Azure SQL Database during utilization spikes.
Chapter 7, Processing Data Using Azure Databricks, covers integrating Azure Databricks with Azure Data Lake and Azure Key Vault, processing data using Databricks notebooks, working with Delta tables, and visualizing Delta tables using Power BI.
Chapter 8, Processing Data Using Azure Synapse Analytics, covers exploring data using Synapse serverless SQL pools, processing data using Synapse Spark pools, working with Synapse lake databases, and integrating Synapse Analytics with Power BI.
Chapter 9, Transforming Data Using Azure Synapse Dataflows, focuses on performing transformations using Synapse Dataflows, optimizing data flows using partitioning, and managing dynamic source schema changes using schema drifting.
Chapter 10, Building the Serving Layer in Azure Synapse SQL Pools, covers loading processed data into Synapse dedicated SQL pools, performing data archival using partitioning, managing table distributions, and optimizing performance using statistics and workload management.
Chapter 11, Monitoring Synapse SQL and Spark Pools, covers monitoring Synapse dedicated SQL and Spark pools using Azure Log Analytics workbooks, Kusto scripts, and Azure Monitor, and monitoring Synapse dedicated SQL pools using Dynamic Management Views (DMVs).
Chapter 12, Optimizing and Maintaining Synapse SQL and Spark Pools, offers techniques for tuning query performance by optimizing query plans, rebuilding replication caches, using maintenance scripts to optimize Delta tables, and automatically pausing SQL pools during inactivity, among other things.
Chapter 13, Monitoring and Maintaining Azure Data Engineering Pipelines, covers monitoring and managing end-to-end data engineering pipelines, which includes tracking data lineage using Microsoft Purview and improving the observability of pipeline executions using log analytics and query labeling.
Readers with exposure to Azure and a basic understanding of data engineering should easily be able to follow this book.
If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Azure-Data-Engineering-Cookbook-2nd-edition. In case there’s an update to the code, it will be updated in the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/CJshA.
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Observe that the CopyFiles package is now listed under the AzureSSIS | Projects folder."
A block of code is set as follows:
CREATE TABLE dbo.transaction_tbl
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT * FROM dbo.ext_transaction_tbl;
GO
SELECT TOP 100 * FROM dbo.transaction_tbl;
GO

Any command-line input or output is written as follows:
Connect-AzAccount
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "The Configuration section under the Source section of Copy Data tool can remain with defaults."
Tips or important notes
Appear like this.
In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There’s more..., and See also).
To give clear instructions on how to complete a recipe, we use these sections as follows.
This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.
This section contains the steps required to follow the recipe.
This section usually consists of a detailed explanation of what happened in the previous section.
This section consists of additional information about the recipe in order to make you more knowledgeable about the recipe.
This section provides helpful links to other useful information for the recipe.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Azure Data Engineering Cookbook, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Data Lake forms the key storage layer for data engineering pipelines. Security and monitoring of Data Lake accounts are key aspects of Data Lake maintenance. This chapter will focus on configuring security controls such as firewalls and encryption, and on creating private links to a Data Lake account. By the end of this chapter, you will have learned how to configure a firewall, virtual network, and private link to secure a Data Lake account, encrypt a Data Lake account using Azure Key Vault, and monitor key user actions in Data Lake.
We will be covering the following recipes in this chapter:
Configuring a firewall for an Azure Data Lake account using the Azure portal
Configuring virtual networks for an Azure Data Lake account using the Azure portal
Configuring private links for an Azure Data Lake account
Configuring encryption using Azure Key Vault for Azure Data Lake
Accessing Blob storage accounts using managed identities
Creating an alert to monitor an Azure Data Lake account
Securing an Azure Data Lake account with an SAS using PowerShell

Data Lake account access can be restricted to an IP or a range of IPs by whitelisting the allowed IPs in the storage account firewall. In this recipe, we'll learn to restrict access to a Data Lake account using a firewall.
Before you start, perform the following steps:
Open a web browser and go to the Azure portal at https://portal.azure.com.

Make sure you have an existing storage account. If not, create one using the Provisioning an Azure storage account using the Azure portal recipe in Chapter 1, Creating and Managing Data in Azure Data Lake.

To provide access to an IP or range of IPs, follow these steps:
In the Azure portal, locate and open the Azure storage account. In our case, the storage account is packtadestoragev2, created in the Provisioning an Azure storage account using the Azure portal recipe of Chapter 1, Creating and Managing Data in Azure Data Lake.

On the storage account page, in the Security + Networking section, locate and select Firewalls and virtual networks. As the packtadestoragev2 account was created with public access, it can be accessed from all networks.
To allow access from an IP or an IP range, click on the Selected networks option on the Firewalls and virtual networks page of the storage account:

Figure 2.1 – Azure Storage – Firewalls and virtual networks
In the Selected networks option, scroll down to the Firewall section. To give access to your machine only, select the Add your client IP address option. To give access to a different IP or range of IPs, type the IPs into the Address range section:

Figure 2.2 – Whitelisting IPs in the Azure Storage Firewall section
To access storage accounts from Azure services such as Azure Data Factory and Azure Functions, check Allow Azure services on the trusted services list to access this storage account under the Exceptions heading.

Click Save to save the configuration changes.

Firewall settings are used to restrict access to an Azure storage account to an IP or range of IPs. Even if a storage account is public, it will only be accessible to the whitelisted IPs defined in the firewall configuration.
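The whitelist logic the firewall applies can be illustrated with a short Python sketch using the standard ipaddress module. Note that the IPs and CIDR ranges below are hypothetical examples, not values taken from the recipe, and this is only an illustration of the matching rule, not how Azure implements it:

```python
import ipaddress

# Hypothetical firewall rules: individual IPs and CIDR ranges that are allowed.
allowed_rules = ["203.0.113.7", "198.51.100.0/24"]

def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip matches any whitelisted IP or range."""
    ip = ipaddress.ip_address(client_ip)
    # A single IP such as 203.0.113.7 parses as a /32 network, so the
    # same containment check covers both individual IPs and ranges.
    return any(ip in ipaddress.ip_network(rule, strict=False)
               for rule in allowed_rules)

print(is_allowed("198.51.100.42"))  # inside the /24 range
print(is_allowed("192.0.2.1"))      # not whitelisted
```

A request from any address outside the whitelist is rejected even though the account itself is public, which mirrors the behavior described above.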
A storage account can be public (accessible to everyone), public with access restricted to an IP or range of IPs, or private with access limited to selected virtual networks. In this recipe, we'll learn how to restrict access to an Azure storage account to a virtual network.
Before you start, perform the following steps:
Open a web browser and go to the Azure portal at https://portal.azure.com.

Make sure you have an existing storage account. If not, create one using the Provisioning an Azure storage account using the Azure portal recipe in Chapter 1, Creating and Managing Data in Azure Data Lake.

To restrict access to a virtual network, follow the given steps:
In the Azure portal, locate and open the storage account. In our case, it's packtadestoragev2. On the storage account page, in the Security + Networking section, locate and select Firewalls and virtual networks | Selected networks:

Figure 2.3 – Azure Storage – Selected networks
In the Virtual networks section, select + Add new virtual network:

Figure 2.4 – Adding a virtual network
In the Create virtual network blade, provide the virtual network name, Address space details, and Subnet address range. The remainder of the configuration values are pre-filled, as shown in the following screenshot:

Figure 2.5 – Creating a new virtual network
Click on Create to create the virtual network. It is created and listed in the Virtual networks section, as shown in the following screenshot:

Figure 2.6 – Saving a virtual network configuration
Click Save to save the configuration changes.

We first created an Azure virtual network and then added it to the Azure storage account. Creating the Azure virtual network from the storage account page automatically fills in the resource group, location, and subscription information. The virtual network and the storage account should be in the same location.
The address space specifies the number of IP addresses in a given virtual network.
We also need to define the subnet within the virtual network that the storage account will belong to. We can also create a custom subnet. In our case, for the sake of simplicity, we have used the default subnet.
This allows the storage account to only be accessed by resources that belong to the given virtual network. The storage account is inaccessible to any network other than the specified virtual network.
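The relationship between the address space and the subnet can be sketched with Python's standard ipaddress module. The address space and subnet ranges below are hypothetical examples (not the values used in the recipe), chosen only to show that the subnet must fall entirely within the virtual network's address space:

```python
import ipaddress

# Hypothetical values: a virtual network address space and a default subnet.
address_space = ipaddress.ip_network("10.1.0.0/16")   # 65,536 addresses
default_subnet = ipaddress.ip_network("10.1.0.0/24")  # 256 addresses

# The subnet must lie entirely within the virtual network's address space.
print(default_subnet.subnet_of(address_space))  # True
print(address_space.num_addresses)              # 65536

# A resource in the subnet (e.g., a VM network interface) gets an IP from it.
vm_ip = ipaddress.ip_address("10.1.0.5")
print(vm_ip in default_subnet)  # True
```

In the same way, the default subnet chosen in the recipe is simply one carved-out range inside the virtual network's address space, and the storage account is reachable only from resources holding addresses in that network.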
In this recipe, we will be creating a private link to a storage account and using private endpoints to connect to it.
Private links and private endpoints ensure that all communication to the storage account goes through the Azure backbone network. Communications to the storage account don't use a public internet network, which makes them very secure.
Before you start, perform the following steps:
Open a web browser and go to the Azure portal at https://portal.azure.com.

Make sure you have an existing storage account. If not, create one using the Provisioning an Azure storage account using the Azure portal recipe in Chapter 1, Creating and Managing Data in Azure Data Lake.

Make sure you have an existing virtual network configured for the storage account. If not, create one using the Configuring virtual networks for an Azure Data Lake account using the Azure portal recipe in this chapter.

Perform the following steps to configure private links to a Data Lake account:
Log in to the Azure portal and click on the storage account.

Click on Networking and select the Private Endpoints tab.

Click on the + Private endpoint button, as shown here:

Figure 2.7 – Creating a private endpoint to a storage account
Provide an endpoint name, as shown in the following screenshot:

Figure 2.8 – Providing an endpoint name
In the Resource tab, set Target sub-resource to dfs. Distributed File System (DFS) is the sub-resource to select when connecting to Data Lake Storage Gen2. The rest of the fields are auto-populated. Proceed to the Configuration section:

Figure 2.9 – Setting the target resource type to dfs
Create a private Domain Name System (DNS) zone by picking the same resource group in which you created the storage account, as shown in the following screenshot:

Figure 2.10 – Creating a private DNS
Hit the Create button to create the private DNS link.

After the private endpoint is created, open it in the Azure portal. Click on DNS configuration:

Figure 2.11 – Copying the FQDN
Make a note of the FQDN and IP address details. The FQDN is the Fully Qualified Domain Name, which will resolve to the private IP address if, and only if, you are connected to the virtual network.

With the preceding steps, we have created a private endpoint that will use private links to connect to a storage account.
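The resolution behavior described above, where the same FQDN answers differently inside and outside the virtual network, can be sketched in a few lines of Python. The FQDN and both IP addresses here are made-up illustrations, and the dictionaries merely stand in for the private DNS zone and public DNS records; this is a model of the idea, not of how Azure DNS is implemented:

```python
# Hypothetical split-horizon view of a private endpoint's name resolution.
fqdn = "packtadestoragev2.privatelink.dfs.core.windows.net"

private_dns_zone = {fqdn: "10.1.0.4"}   # private endpoint IP, visible in-VNet only
public_dns = {fqdn: "20.60.128.10"}     # public endpoint IP, visible everywhere

def resolve(name: str, inside_vnet: bool) -> str:
    """Mimic name resolution: the private DNS zone wins only inside the VNet."""
    if inside_vnet and name in private_dns_zone:
        return private_dns_zone[name]
    return public_dns[name]

print(resolve(fqdn, inside_vnet=True))   # 10.1.0.4 (private IP)
print(resolve(fqdn, inside_vnet=False))  # 20.60.128.10 (public endpoint)
```

This is exactly what the nslookup checks in the next section demonstrate: the private IP is returned only when the query originates from inside the virtual network.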
We have created a private link to a storage account and ensured that traffic goes through the Microsoft backbone network (and not the public internet), as we will be accessing the storage account via a private endpoint. To show how it works, let's resolve the private link URL from the following locations:
Use nslookup to look up the private link URL from your local machine.

Use nslookup to look up the private link URL from a virtual machine inside the virtual network.

On your machine, open Command Prompt and type nslookup <FQDN of private link>, as shown in the following screenshot:
Figure 2.12 – Testing a private endpoint connection outside of the virtual network
nslookup resolves the private link to the storage account's public IP address rather than the private one, as your machine is not part of the virtual network. To see it working, perform the following instructions:
Create a new virtual machine in the Azure portal. Make sure to allow a remote desktop connection to the virtual machine, as shown in the following screenshot:

Figure 2.13 – Creating a new virtual machine and allowing a remote desktop
Under Networking, select the virtual network in which the storage account resides:

Figure 2.14 – Configuring the virtual machine to use the virtual network