Description

The famous quote 'Data is the new oil' rings truer every day, as the key to most organizations' long-term success lies in extracting insights from raw data. One of the major challenges organizations face in extracting value from data is building performant data engineering pipelines for data visualization, ingestion, storage, and processing. This second edition of the immensely successful book by Ahmad Osama brings you several recent enhancements in Azure data engineering and shares approximately 80 useful recipes covering common scenarios in building data engineering pipelines in Microsoft Azure.
You’ll explore recipes from Azure Synapse Analytics workspaces Gen 2 and get to grips with Synapse Spark pools, serverless SQL pools, Synapse integration pipelines, and Synapse data flows. You’ll also understand Synapse SQL pool optimization techniques in this second edition. Besides the Synapse enhancements, you’ll discover helpful tips on managing Azure SQL Database and learn about security, high availability, and performance monitoring. Finally, the book takes you through overall data engineering pipeline management, focusing on monitoring with Log Analytics and tracking data lineage with Azure Purview.
By the end of this book, you’ll be able to build superior data engineering pipelines and have an invaluable go-to guide at hand.




Azure Data Engineering Cookbook

Second Edition

Get well versed in various data engineering techniques in Azure using this recipe-based guide

Nagaraj Venkatesan

Ahmad Osama

BIRMINGHAM—MUMBAI

Azure Data Engineering Cookbook

Second Edition

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Reshma Raman

Senior Editor: Nazia Shaikh

Content Development Editor: Manikandan Kurup

Technical Editor: Sweety Pagaria

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Sejal Dsilva

Production Designer: Joshua Misquitta

Marketing Coordinators: Priyanka Mhatre and Nivedita Singh

First published: March 2021

Second edition: September 2022

Production reference: 2070922

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80324-678-9

www.packt.com

Contributors

About the authors

Nagaraj Venkatesan works as a cloud solution architect at Microsoft. At Microsoft, he works with some of the largest companies in the world, solving their complex data engineering problems and helping them build effective solutions using cutting-edge technologies based on Azure. Nagaraj, based out of Singapore, is a popular member of the data and AI community and is a regular speaker at several international data and AI conferences. He is a two-time Microsoft Most Valuable Professional (MVP) award winner, in 2016 and 2017. Nagaraj shares his technical expertise through his blog and on his YouTube channel called DataChannel. He also holds a master’s degree in computing from the National University of Singapore.

Ahmad Osama works for Pitney Bowes Pvt. Ltd. as a technical architect and is a former Microsoft Data Platform MVP. In his day job, he works on developing and maintaining high-performance on-premises and cloud SQL Server OLTP environments, as well as on deployment and task automation using PowerShell. When not working, Ahmad blogs at DataPlatformLabs and can be found glued to his Xbox.

About the reviewers

Milind Kumar Chaudhari is an experienced cloud data engineer/architect who specializes in designing, implementing, and managing complex data pipelines. He is a data enthusiast and passionate about leveraging the best data engineering practices to solve critical business problems. He is also currently a technical reviewer with O’Reilly Media.

Vijay Kumar Suda is a senior cloud big data architect with over 20 years of experience working with global clients. He has worked in Switzerland, Belgium, Mexico, Bahrain, India, and Canada and helped customers across multiple industries. He has been based out of the USA since 2008. His expertise includes data engineering, architecture, the cloud, AI, and machine learning.

Firstly, I’d like to thank my parents, Sri Koteswara Rao and Rajyalakshmi, for their love and support in every step of my life. I’d like to thank my wife, Radhika, my son, Chandra, and my daughter, Akshaya, for their daily support and patience. I’d like to thank my siblings, Rama, Swarana, and Dr. SriKumar, for their support, and finally, I’d like to thank Packt for the opportunity to review this book.

Table of Contents

Preface

1

Creating and Managing Data in Azure Data Lake

Technical requirements

Provisioning an Azure storage account using the Azure portal

Getting ready

How to do it…

How it works…

Provisioning an Azure storage account using PowerShell

Getting ready

How to do it…

How it works…

Creating containers and uploading files to Azure Blob storage using PowerShell

Getting ready

How to do it…

How it works…

Managing blobs in Azure Storage using PowerShell

Getting ready

How to do it…

How it works…

Configuring blob lifecycle management for blob objects using the Azure portal

Getting ready

How to do it…

How it works…

2

Securing and Monitoring Data in Azure Data Lake

Configuring a firewall for an Azure Data Lake account using the Azure portal

Getting ready

How to do it…

How it works…

Configuring virtual networks for an Azure Data Lake account using the Azure portal

Getting ready

How to do it…

How it works…

Configuring private links for an Azure Data Lake account

Getting ready

How to do it…

How it works…

Configuring encryption using Azure Key Vault for Azure Data Lake

Getting ready

How to do it…

How it works…

Accessing Blob storage accounts using managed identities

Getting ready

How to do it…

How it works…

Creating an alert to monitor an Azure storage account

Getting ready

How to do it…

How it works…

Securing an Azure storage account with SAS using PowerShell

Getting ready

How to do it…

How it works…

3

Building Data Ingestion Pipelines Using Azure Data Factory

Technical requirements

Provisioning Azure Data Factory

How to do it…

How it works…

Copying files to a database from a data lake using a control flow and copy activity

Getting ready

How to do it…

How it works…

Triggering a pipeline in Azure Data Factory

Getting ready

How to do it…

How it works…

Copying data from a SQL Server virtual machine to a data lake using the Copy data wizard

Getting ready

How to do it…

How it works…

4

Azure Data Factory Integration Runtime

Technical requirements

Configuring a self-hosted IR

Getting ready

How to do it…

How it works…

Configuring a shared self-hosted IR

Getting ready

How to do it…

Configuring high availability for a self-hosted IR

Getting ready

How to do it…

How it works…

Patching a self-hosted IR

Getting ready

How to do it…

How it works…

Migrating an SSIS package to Azure Data Factory

Getting ready

How to do it…

How it works...

5

Configuring and Securing Azure SQL Database

Technical requirements

Provisioning and connecting to an Azure SQL database using PowerShell

Getting ready

How to do it…

How it works…

Implementing an Azure SQL Database elastic pool using PowerShell

Getting ready

How to do it...

How it works…

Configuring a virtual network and private endpoints for Azure SQL Database

Getting ready

How to do it…

How it works…

Configuring Azure Key Vault for Azure SQL Database

Getting ready

How to do it…

How it works…

Provisioning and configuring a wake-up script for a serverless SQL database

Getting ready

How to do it…

How it works…

Configuring the Hyperscale tier of Azure SQL Database

Getting ready

How to do it…

6

Implementing High Availability and Monitoring in Azure SQL Database

Implementing active geo-replication for an Azure SQL database using PowerShell

Getting ready

How to do it…

How it works…

Implementing an auto-failover group for an Azure SQL database using PowerShell

Getting ready

How to do it…

How it works…

Configuring high availability to the Hyperscale tier of Azure SQL Database

Getting ready

How to do it…

How it works…

Implementing vertical scaling for an Azure SQL database using PowerShell

Getting ready

How to do it…

How it works…

Monitoring an Azure SQL database using the Azure portal

Getting ready

How to do it…

Configuring auditing for Azure SQL Database

Getting ready

How to do it…

How it works…

7

Processing Data Using Azure Databricks

Technical requirements

Configuring the Azure Databricks environment

Getting ready

How to do it…

Integrating Databricks with Azure Key Vault

Getting ready

How to do it…

How it works…

Mounting an Azure Data Lake container in Databricks

Getting ready

How to do it…

How it works…

Processing data using notebooks

Getting ready

How to do it…

How it works…

Scheduling notebooks using job clusters

Getting ready

How to do it…

How it works…

Working with Delta Lake tables

Getting ready

How to do it…

How it works…

Connecting a Databricks Delta Lake table to Power BI

Getting ready

How to do it…

How it works…

8

Processing Data Using Azure Synapse Analytics

Technical requirements

Provisioning an Azure Synapse Analytics workspace

Getting ready

How to do it…

Analyzing data using serverless SQL pool

Getting ready

How to do it…

How it works…

Provisioning and configuring Spark pools

Getting ready

How to do it…

How it works…

Processing data using Spark pools and a lake database

Getting ready

How to do it…

How it works…

Querying the data in a lake database from serverless SQL pool

Getting ready

How to do it…

How it works…

Scheduling notebooks to process data incrementally

Getting ready

How to do it…

How it works…

Visualizing data using Power BI by connecting to serverless SQL pool

Getting ready

How to do it…

How it works…

9

Transforming Data Using Azure Synapse Dataflows

Technical requirements

Copying data using a Synapse data flow

Getting ready

How to do it…

How it works…

Performing data transformation using activities such as join, sort, and filter

Getting ready

How to do it…

How it works…

Monitoring data flows and pipelines

Getting ready

How to do it…

How it works…

Configuring partitions to optimize data flows

Getting ready

How to do it…

How it works…

Parameterizing Synapse data flows

Getting ready

How to do it…

How it works…

Handling schema changes dynamically in data flows using schema drift

Getting ready

How to do it…

How it works…

10

Building the Serving Layer in Azure Synapse SQL Pool

Technical requirements

Loading data into dedicated SQL pools using PolyBase and T-SQL

Getting ready

How to do it…

How it works...

Loading data into a dedicated SQL pool using COPY INTO

Getting ready

How to do it...

How it works...

Creating distributed tables and modifying table distribution

Getting ready

How to do it…

How it works...

Creating statistics and automating the update of statistics

Getting ready

How to do it…

How it works…

Creating partitions and archiving data using partitioned tables

Getting ready

How to do it…

How it works...

Implementing workload management in an Azure Synapse dedicated SQL pool

Getting ready

How to do it…

How it works…

Creating workload groups for advanced workload management

Getting ready

How to do it…

How it works...

11

Monitoring Synapse SQL and Spark Pools

Technical requirements

Configuring a Log Analytics workspace for Synapse SQL pools

Getting ready

How to do it…

How it works…

Configuring a Log Analytics workspace for Synapse Spark pools

Getting ready

How to do it…

How it works…

Using Kusto queries to monitor SQL and Spark pools

Getting ready

How to do it…

How it works…

Creating workbooks in a Log Analytics workspace to visualize monitoring data

Getting ready

How to do it…

How it works…

Monitoring table distribution, data skew, and index health using Synapse DMVs

Getting ready

How to do it…

Building monitoring dashboards for Synapse with Azure Monitor

Getting ready

How to do it…

How it works…

12

Optimizing and Maintaining Synapse SQL and Spark Pools

Technical requirements

Analyzing a query plan and fixing table distribution

Getting ready

How to do it…

How it works…

Monitoring and rebuilding a replication table cache

Getting ready

How to do it…

How it works…

Configuring result set caching in Azure Synapse dedicated SQL pool

Getting ready

How to do it…

How it works…

Configuring longer backup retention for a Synapse SQL database

Getting ready

How to do it…

How it works…

Auto pausing Synapse dedicated SQL pool

Getting ready

How to do it…

How it works…

Optimizing Delta tables in a Synapse Spark pool lake database

Getting ready

How to do it…

How it works…

Optimizing query performance in Synapse Spark pools

Getting ready

How to do it…

How it works…

13

Monitoring and Maintaining Azure Data Engineering Pipelines

Technical requirements

Monitoring Synapse integration pipelines using Log Analytics and workbooks

Getting ready

How to do it…

How it works…

Tracing SQL queries for dedicated SQL pool to Synapse integration pipelines

Getting ready

How to do it…

How it works…

Provisioning a Microsoft Purview account and creating a data catalog

Getting ready

How to do it…

How it works…

Integrating a Synapse workspace with Microsoft Purview and tracking data lineage

Getting ready

How to do it…

How it works…

Applying Azure tags using PowerShell to multiple Azure resources

Getting ready

How to do it…

How it works…

Index

Other Books You May Enjoy

Preface

Data is the new oil and probably the most valuable resource. Data engineering covers how one can gain insights from data. This book will introduce the key processes in data engineering (ingesting, storing, processing, and consuming) and share common recipes that will help you develop data engineering pipelines on Azure.

The book follows the logical data engineering process by beginning with Azure Data Lake and covering data ingestion using Azure Data Factory into Azure Data Lake and Azure SQL Database in the first few chapters. In these chapters, the book also covers the management of common storage layers such as Azure Data Lake and Azure SQL Database, focusing on topics such as security, high availability, and performance monitoring. The middle chapters focus on data processing using Azure Databricks, Azure Synapse Analytics Spark pools, and Synapse dataflows, and data exploration using Synapse serverless SQL pools. The final few chapters focus on the consumption of the data using Synapse dedicated SQL pool and Synapse Spark lake databases, covering the tips and tricks to optimize and maintain Synapse dedicated SQL pool databases and lake databases. Finally, the book also has a bonus chapter on managing the overall data engineering pipeline, which covers pipeline monitoring using Azure Log Analytics and tracking data lineage using Microsoft Purview.

While the book can be consumed in parts or any sequence, following along sequentially will help the readers experience building an end-to-end data engineering solution on Azure.

Who this book is for

The book is for anyone working on data engineering projects in Azure. Azure data engineers, data architects, developers, and database administrators working on Azure will find the book extremely useful.

What this book covers

Chapter 1, Creating and Managing Data in Azure Data Lake, focuses on provisioning, uploading, and managing the data life cycle in Azure Data Lake accounts.

Chapter 2, Securing and Monitoring Data in Azure Data Lake, covers securing an Azure Data Lake account using firewall and private links, accessing data lake accounts using managed identities, and monitoring an Azure Data Lake account using Azure Monitor.

Chapter 3, Building Data Ingestion Pipelines Using Azure Data Factory, covers ingesting data using Azure Data Factory and copying data between Azure SQL Database and Azure Data Lake.

Chapter 4, Azure Data Factory Integration Runtime, focuses on configuring and managing self-hosted integration runtimes and running SSIS packages in Azure using Azure-SSIS integration runtimes.

Chapter 5, Configuring and Securing Azure SQL Database, covers configuring a serverless SQL database and a Hyperscale SQL database, and securing Azure SQL Database using virtual networks and private links.

Chapter 6, Implementing High Availability and Monitoring in Azure SQL Database, explains configuring high availability for Azure SQL Database using auto-failover groups and read replicas, monitoring Azure SQL Database, and automatically scaling Azure SQL Database during utilization spikes.

Chapter 7, Processing Data Using Azure Databricks, covers integrating Azure Databricks with Azure Data Lake and Azure Key Vault, processing data using Databricks notebooks, working with Delta tables, and visualizing Delta tables using Power BI.

Chapter 8, Processing Data Using Azure Synapse Analytics, covers exploring data using Synapse serverless SQL pool, processing data using Synapse Spark pools, working with the Synapse lake database, and integrating Synapse Analytics with Power BI.

Chapter 9, Transforming Data Using Azure Synapse Dataflows, focuses on performing transformations using Synapse Dataflows, optimizing data flows using partitioning, and managing dynamic source schema changes using schema drift.

Chapter 10, Building the Serving Layer in Azure Synapse SQL Pools, covers loading processed data into Synapse dedicated SQL pools, performing data archival using partitioning, managing table distributions, and optimizing performance using statistics and workload management.

Chapter 11, Monitoring Synapse SQL and Spark Pools, covers monitoring Synapse dedicated SQL and Spark pools using Azure Log Analytics workbooks, Kusto scripts, and Azure Monitor, and monitoring Synapse dedicated SQL pools using Dynamic Management Views (DMVs).

Chapter 12, Optimizing and Maintaining Synapse SQL and Spark Pools, offers techniques for tuning query performance by optimizing query plans, rebuilding replication caches, running maintenance scripts to optimize Delta tables, and automatically pausing SQL pools during inactivity, among other things.

Chapter 13, Monitoring and Maintaining Azure Data Engineering Pipelines, covers monitoring and managing end-to-end data engineering pipelines, which includes tracking data lineage using Microsoft Purview and improving the observability of pipeline executions using Log Analytics and query labeling.

To get the most out of this book

Readers with exposure to Azure and a basic understanding of data engineering should easily be able to follow this book.

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Azure-Data-Engineering-Cookbook-2nd-edition. In case there’s an update to the code, it will be updated in the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/CJshA.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Observe that the CopyFiles package is now listed under the AzureSSIS | Projects folder."

A block of code is set as follows:

CREATE TABLE dbo.transaction_tbl
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT * FROM dbo.ext_transaction_tbl;
GO
SELECT TOP 100 * FROM dbo.transaction_tbl;
GO

Any command-line input or output is written as follows:

Connect-AzAccount

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "The Configuration section under the Source section of Copy Data tool can remain with defaults."

Tips or important notes

Appear like this.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There’s more..., and See also).

To give clear instructions on how to complete a recipe, use these sections as follows.

Getting ready

This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There’s more…

This section consists of additional information about the recipe in order to make you more knowledgeable about the recipe.

See also

This section provides helpful links to other useful information for the recipe.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share your thoughts

Once you’ve read Azure Data Engineering Cookbook, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

2

Securing and Monitoring Data in Azure Data Lake

Data Lake forms the key storage layer for data engineering pipelines. Security and monitoring of Data Lake accounts are key aspects of Data Lake maintenance. This chapter will focus on configuring security controls such as firewalls, encryption, and private links for a Data Lake account. By the end of this chapter, you will have learned how to configure a firewall, a virtual network, and a private link to secure a Data Lake account, encrypt a Data Lake account using Azure Key Vault, and monitor key user actions in Data Lake.

We will be covering the following recipes in this chapter:

Configuring a firewall for an Azure Data Lake account using the Azure portal
Configuring virtual networks for an Azure Data Lake account using the Azure portal
Configuring private links for an Azure Data Lake account
Configuring encryption using Azure Key Vault for Azure Data Lake
Accessing Blob storage accounts using managed identities
Creating an alert to monitor an Azure Data Lake account
Securing an Azure Data Lake account with an SAS using PowerShell

Configuring a firewall for an Azure Data Lake account using the Azure portal

Data Lake account access can be restricted to an IP or a range of IPs by whitelisting the allowed IPs in the storage account firewall. In this recipe, we'll learn to restrict access to a Data Lake account using a firewall.

Getting ready

Before you start, perform the following steps:

Open a web browser and go to the Azure portal at https://portal.azure.com.
Make sure you have an existing storage account. If not, create one using the Provisioning an Azure storage account using the Azure portal recipe in Chapter 1, Creating and Managing Data in Azure Data Lake.

How to do it…

To provide access to an IP or range of IPs, follow these steps:

In the Azure portal, locate and open the Azure storage account. In our case, the storage account is packtadestoragev2, created in the Provisioning an Azure storage account using the Azure portal recipe of Chapter 1, Creating and Managing Data in Azure Data Lake.
On the storage account page, in the Security + Networking section, locate and select Firewalls and virtual networks.

As the packtadestoragev2 account was created with public access, it can be accessed from all networks.

To allow access from an IP or an IP range, click on the Selected networks option on the storage account's Firewalls and virtual networks page:

Figure 2.1 – Azure Storage – Firewalls and virtual networks

In the Selected networks option, scroll down to the Firewall section. To give access to your machine only, select the Add your client IP address option. To give access to a different IP or range of IPs, type in the IPs in the Address range section:

Figure 2.2 – The whitelist IPs in the Azure Storage Firewall section

To access storage accounts from Azure services such as Azure Data Factory and Azure Functions, check Allow Azure services on the trusted services list to access this storage account under the Exceptions heading.
Click Save to save the configuration changes.

How it works…

Firewall settings are used to restrict access to an Azure storage account to an IP or range of IPs. Even if a storage account is public, it will only be accessible to the whitelisted IPs defined in the firewall configuration.
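The same firewall configuration can also be applied from the command line. The following is a minimal PowerShell sketch rather than the book's own script; the resource group name and the IP range are placeholders, and it assumes the Az.Storage module is installed and that you have signed in with Connect-AzAccount:

# Deny all traffic by default, but keep the trusted Azure services exception
Update-AzStorageAccountNetworkRuleSet -ResourceGroupName "packtade-rg" `
    -Name "packtadestoragev2" -DefaultAction Deny -Bypass AzureServices
# Whitelist a single public IP or a CIDR range
Add-AzStorageAccountNetworkRule -ResourceGroupName "packtade-rg" `
    -Name "packtadestoragev2" -IPAddressOrRange "40.74.28.0/23"
# Verify the effective firewall rules
Get-AzStorageAccountNetworkRuleSet -ResourceGroupName "packtade-rg" `
    -Name "packtadestoragev2"

Setting DefaultAction to Deny before adding IP rules mirrors switching the portal to Selected networks; until a rule is added, only the trusted Azure services exception can reach the account.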

Configuring virtual networks for an Azure Data Lake account using the Azure portal

A storage account can be public (accessible to everyone), public but restricted to an IP or range of IPs, or private, with access limited to selected virtual networks. In this recipe, we'll learn how to restrict access to an Azure storage account to a virtual network.
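If you are unsure which of these access models currently applies to an account, the effective network rules can be read with PowerShell before making changes. A minimal sketch, assuming the Az.Storage module and a placeholder resource group name:

# DefaultAction Allow means the account is open to all networks;
# Deny means only the listed IP rules and virtual network rules apply
Get-AzStorageAccountNetworkRuleSet -ResourceGroupName "packtade-rg" `
    -Name "packtadestoragev2" |
    Select-Object DefaultAction, IpRules, VirtualNetworkRules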

Getting ready

Before you start, perform the following steps:

Open a web browser and go to the Azure portal at https://portal.azure.com.
Make sure you have an existing storage account. If not, create one using the Provisioning an Azure storage account using the Azure portal recipe in Chapter 1, Creating and Managing Data in Azure Data Lake.

How to do it…

To restrict access to a virtual network, follow the given steps:

In the Azure portal, locate and open the storage account. In our case, it's packtadestoragev2. On the storage account page, in the Security + Networking section, locate and select Firewalls and virtual networks | Selected networks:

Figure 2.3 – Azure Storage – Selected networks

In the Virtual networks section, select + Add new virtual network:

Figure 2.4 – Adding a virtual network

In the Create virtual network blade, provide the virtual network name, Address space details, and Subnet address range. The remainder of the configuration values are pre-filled, as shown in the following screenshot:

Figure 2.5 – Creating a new virtual network

Click on Create to create the virtual network. The virtual network is created and listed in the Virtual networks section, as shown in the following screenshot:

Figure 2.6 – Saving a virtual network configuration

Click Save to save the configuration changes.

How it works…

We first created an Azure virtual network and then added it to the Azure storage account. Creating the Azure virtual network from the storage account page automatically fills in the resource group, location, and subscription information. The virtual network and the storage account should be in the same location.

The address space specifies the range of IP addresses available in the virtual network, expressed in CIDR notation.

We also need to define the subnet within the virtual network that the storage account will belong to. We can also create a custom subnet. In our case, for the sake of simplicity, we have used the default subnet.

As a result, the storage account can only be accessed by resources that belong to the given virtual network; it is inaccessible from any other network.
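The same configuration can be scripted end to end. The following is a minimal PowerShell sketch rather than the exact steps shown above; the resource group, virtual network name, location, and address ranges are placeholder values. It enables the Microsoft.Storage service endpoint on the subnet and then registers that subnet as a network rule on the storage account:

# Define a subnet with the Microsoft.Storage service endpoint enabled
$subnet = New-AzVirtualNetworkSubnetConfig -Name "default" `
    -AddressPrefix "10.1.0.0/24" -ServiceEndpoint "Microsoft.Storage"
# Create the virtual network containing that subnet
$vnet = New-AzVirtualNetwork -Name "packtade-vnet" `
    -ResourceGroupName "packtade-rg" -Location "eastus" `
    -AddressPrefix "10.1.0.0/16" -Subnet $subnet
# Allow the subnet on the storage account and deny everything else
Add-AzStorageAccountNetworkRule -ResourceGroupName "packtade-rg" `
    -Name "packtadestoragev2" -VirtualNetworkResourceId $vnet.Subnets[0].Id
Update-AzStorageAccountNetworkRuleSet -ResourceGroupName "packtade-rg" `
    -Name "packtadestoragev2" -DefaultAction Deny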

Configuring private links for an Azure Data Lake account

In this recipe, we will be creating a private link to a storage account and using private endpoints to connect to it.

Private links and private endpoints ensure that all communication to the storage account goes through the Azure backbone network. Traffic to the storage account never traverses the public internet, which makes the connection very secure.

Getting ready

Before you start, perform the following steps:

Open a web browser and go to the Azure portal at https://portal.azure.com.
Make sure you have an existing storage account. If not, create one using the Provisioning an Azure storage account using the Azure portal recipe in Chapter 1, Creating and Managing Data in Azure Data Lake.
Make sure you have an existing virtual network configured to the storage account. If not, create one using the Configuring virtual networks for an Azure Data Lake account using the Azure portal recipe in this chapter.

How to do it…

Perform the following steps to configure private links to a Data Lake account:

Log in to the Azure portal and click on the storage account.
Click on Networking and then the Private endpoints tab.
Click on the + Private endpoint button, as shown here:

Figure 2.7 – Creating a private endpoint to a storage account

Provide an endpoint name, as shown in the following screenshot:

Figure 2.8 – Providing an endpoint name

In the Resource tab, set Target sub-resource to dfs. The dfs sub-resource, short for Distributed File System (DFS), is the one to use when connecting to Data Lake Storage Gen2. The rest of the fields are auto-populated. Proceed to the Configuration section:

Figure 2.9 – Setting the target resource type to dfs

Create a private Domain Name System (DNS) zone by picking the same resource group where you created the storage account, as shown in the following screenshot:

Figure 2.10 – Creating a private DNS

Hit the Create button to create the private endpoint and the private DNS link.
After the private endpoint is created, open it in the Azure portal. Click on DNS configuration:

Figure 2.11 – Copy the FQDN

Make a note of the FQDN and IP address details. The FQDN is the Fully Qualified Domain Name, which will resolve to the private IP address if, and only if, you are connected to the virtual network.

With the preceding steps, we have created a private endpoint that will use private links to connect to a storage account.
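For repeatable deployments, the private endpoint can also be created with PowerShell. This is only a sketch under assumed names; the resource group, virtual network, subnet, and location are placeholders, and the private DNS zone integration that the portal wizard sets up is omitted here:

# Look up the storage account and the subnet that will host the endpoint
$storage = Get-AzStorageAccount -ResourceGroupName "packtade-rg" `
    -Name "packtadestoragev2"
$vnet = Get-AzVirtualNetwork -ResourceGroupName "packtade-rg" `
    -Name "packtade-vnet"
$subnet = $vnet.Subnets | Where-Object { $_.Name -eq "default" }
# Define the connection to the dfs (Data Lake Storage Gen2) sub-resource
$connection = New-AzPrivateLinkServiceConnection -Name "packtade-plsc" `
    -PrivateLinkServiceId $storage.Id -GroupId "dfs"
# Create the private endpoint in the chosen subnet
New-AzPrivateEndpoint -Name "packtade-pe" -ResourceGroupName "packtade-rg" `
    -Location "eastus" -Subnet $subnet `
    -PrivateLinkServiceConnection $connection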

How it works…

We have created a private link to a storage account and ensured that traffic goes through the Microsoft backbone network (and not the public internet), as we will be accessing the storage account via a private endpoint. To show how it works, let's resolve the private link URL from two locations:

Use nslookup to look up the private link URL from your local machine.
Use nslookup to look up the private link URL from a virtual machine inside the virtual network.

On your machine, open Command Prompt and type nslookup <FQDN of private link>, as shown in the following screenshot:

Figure 2.12 – Testing a private endpoint connection outside of the virtual network

nslookup resolves the private link to a public IP address rather than the private IP, as your machine is not part of the virtual network. To see it working, follow these steps:

Create a new virtual machine in the Azure portal. Make sure to allow a remote desktop connection to the virtual machine, as shown in the following screenshot:

Figure 2.13 – Creating a new virtual machine and allowing a remote desktop

Under Networking, select the virtual network in which the storage account resides:

Figure 2.14 – Configuring the virtual machine to use the virtual network