Azure Data Factory Cookbook - Dmitry Foshin - E-Book

Description

This new edition of the Azure Data Factory Cookbook, fully updated to reflect ADF V2, will help you get up and running by showing you how to create and execute your first job in ADF.

You’ll learn how to branch and chain activities, create custom activities, and schedule pipelines, as well as discovering the benefits of Cloud Data Warehousing, Azure Synapse Analytics, and Azure Data Lake Storage Gen2.

With practical recipes, you’ll learn how to actively engage with analytical tools from Azure's data services and leverage your on-premises infrastructure with cloud-native tools to get relevant business insights. As you advance, you’ll be able to integrate the most commonly used Azure services into ADF and understand how Azure services can be useful in designing ETL pipelines. You'll familiarize yourself with the common errors that you may encounter while working with ADF and find out how to use the Azure portal to monitor pipelines. You’ll also understand error messages and resolve problems in Connectors and Data flows with the debugging capabilities of ADF.

Two new chapters covering Azure Data Explorer and key best practices have been added, along with new recipes throughout.

By the end of this book, you’ll be able to use ADF as the main ETL and orchestration tool for your Data Warehouse or Data Platform projects.




Azure Data Factory Cookbook

Second Edition

A data engineer’s guide to building and managing ETL and ELT pipelines with data integration

Dmitry Foshin

Tonya Chernyshova

Dmitry Anoshin

Xenia Hertzenberg

BIRMINGHAM—MUMBAI

Azure Data Factory Cookbook

Second Edition

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Senior Publishing Product Manager: Gebin George

Acquisition Editor – Peer Reviews: Tejas Mhasvekar

Project Editor: Janice Gonsalves

Content Development Editors: Soham Amburle, Elliot Dallow

Copy Editor: Safis Editing

Technical Editor: Anjitha Murali

Proofreader: Safis Editing

Indexer: Tejal Daruwale Soni

Presentation Designer: Ajay Patule

Developer Relations Marketing Executive: Vignesh Raju

First published: December 2020

Second edition: February 2024

Production reference: 1220224

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-80324-659-8

www.packt.com

Contributors

About the authors

Dmitry Foshin is a lead data engineer with 12+ years of experience in IT and Big Data. He focuses on delivering business insights through adept data engineering, analytics, and visualization. He excels in leading and executing full-stack big data analytics solutions, from ETL processes to data warehouses, utilizing Azure cloud technologies and modern BI tools. Along with being a co-author of the Azure Data Factory Cookbook, Dmitry has also launched successful data analytics projects for FMCG corporations across Europe.

I would like to express my heartfelt gratitude to my wife, Mariia, my parents, my brother, Ilya, and all my family and friends who supported me and provided encouragement throughout the journey of producing this book. Your unwavering support has been invaluable, and I am deeply grateful for your presence in my life.

Tonya Chernyshova is an experienced data engineer with a proven track record of successfully delivering scalable, maintainable, and impactful data products.

She is highly proficient in data modeling, automation, cloud computing, and data visualization, consistently driving data-driven insights and business growth.

Dmitry Anoshin is a data engineering leader with 15 years of experience working in business intelligence, data warehousing, data integration, big data, cloud, and machine learning across North America and Europe.

He leads data engineering initiatives on a petabyte-scale data platform that was built using cloud and big data technologies to support machine learning experiments, data science models, business intelligence reporting, and data exchange with internal and external partners. He is also responsible for handling privacy compliance and security-critical datasets.

Besides that work, he teaches a cloud computing course at the University of Victoria, Canada, wherein he mentors high-school students in the CS faculty, and he also teaches people how to land data jobs at Surfalytics.com. In addition, he is an author of analytics books and a speaker at data-related conferences and user groups.

I want to thank my beautiful wife, Lana, and my kids, Vasily, Anna, and Michael, who give me the energy to work, grow, and contribute to the data industry.

Xenia Hertzenberg is a software engineer at Microsoft and has extensive knowledge in the field of data engineering, big data pipelines, data warehousing, and systems architecture.

About the reviewers

Deepak Goyal is a certified Azure cloud solution architect with over fifteen years of expertise in designing, developing, and managing enterprise cloud solutions. He is also a certified big data professional and a passionate cloud advocate.

Saikat Dutta is an Azure data engineer with over 13 years of experience. He has worked extensively with Microsoft data products, from SQL Server 2000 to ADF, Synapse Pipelines, and Microsoft Fabric. Across a career shaped by a variety of employers, his highlights have been adaptability and a commitment to staying at the forefront of technology.

This is the first book he has reviewed; in it, he has tried to provide practical insights into Microsoft data products and to help the book become more than a cookbook. He has also contributed to a popular data newsletter and blog to share knowledge with the tech community.

Excited about the book’s impact, I look forward to continuing my journey in the evolving field of Data Engineering.

I express gratitude to my family for their unwavering support during the review process. Balancing work and family, especially with a younger kid, wouldn’t have been possible without their cooperation.

Join our community on Discord

Join our community’s Discord space for discussions with the authors and other readers:

https://discord.gg/U229qmBmT3

Contents

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

Getting Started with ADF

Introduction to the Azure data platform

Getting ready

How to do it...

How it works...

Creating and executing our first job in ADF

Getting ready

How to do it...

How it works...

There’s more...

See also

Creating an ADF pipeline using the Copy Data tool

Getting ready

How to do it...

How it works...

There’s more...

Creating an ADF pipeline using Python

Getting ready

How to do it...

How it works...

There’s more...

See also

Creating a data factory using PowerShell

Getting ready

How to do it…

How it works...

There’s more...

See also

Using templates to create ADF pipelines

Getting ready

How to do it...

How it works...

See also

Creating an Azure Data Factory using Azure Bicep

Getting ready

How to do it...

How it works...

There’s more...

See also

Orchestration and Control Flow

Technical requirements

Using parameters and built-in functions

Getting ready

How to do it…

How it works…

There’s more…

See also

Using the Metadata and Stored Procedure activities

Getting ready

How to do it…

How it works…

There’s more…

Using the ForEach and Filter activities

Getting ready

How to do it…

How it works…

Chaining and branching activities within a pipeline

Getting ready

How to do it…

There’s more…

Using the Lookup, Web, and Execute Pipeline activities

Getting ready

How to do it…

How it works…

There’s more…

See also

Creating event-based pipeline triggers

Getting ready

How to do it…

How it works…

There’s more…

See also

Setting Up Synapse Analytics

Technical requirements

Creating an Azure Synapse workspace

Getting ready

How to do it…

There’s more…

Loading data to Azure Synapse Analytics using Azure Data Factory

Getting ready

How to do it…

How it works…

There’s more…

Loading data to Azure Synapse Analytics using Azure Data Studio

Getting ready

How to do it…

How it works…

There’s more…

Loading data to Azure Synapse Analytics using bulk load

Getting ready

How to do it…

How it works…

Pausing/resuming an Azure Synapse SQL pool from Azure Data Factory

Getting ready

How to do it…

How it works…

There’s more…

Working with Azure Purview using Azure Synapse

Getting ready

How to do it…

How it works…

There’s more...

Copying data in Azure Synapse Integrate

Getting ready

How to do it…

How it works…

Using a Synapse serverless SQL pool

Getting ready

How to do it…

How it works…

There’s more…

Working with Data Lake and Spark Pools

Technical requirements

Setting up Azure Data Lake Storage Gen2

Getting ready

How to do it...

There’s more...

Creating a Synapse Analytics Spark pool

Getting ready

How to do it...

How it works...

There’s more...

Integrating Azure Data Lake and running Spark pool jobs

Getting ready

How to do it...

How it works...

Building and orchestrating a data pipeline for Data Lake and Spark

Getting ready

How to do it...

How it works...

There’s more...

Working with Big Data and Databricks

Introduction

Technical requirements

Setting up an HDInsight cluster

Getting ready

How to do it…

How it works…

There is more…

Processing data from Azure Data Lake with HDInsight and Hive

Getting ready

How to do it…

How it works…

Building data models in Delta Lake and data pipeline jobs with Databricks

Getting ready

How to do it…

How it works…

There is more…

Ingesting data into Delta Lake using Mapping Data Flows

Getting ready

How to do it…

How it works…

There is more…

External integrations with other compute engines (Snowflake)

Getting ready

How to do it…

How it works…

There is more…

Data Migration – Azure Data Factory and Other Cloud Services

Technical requirements

Copying data from Amazon S3 to Azure Blob storage

Getting ready

How to do it…

How it works…

Copying large datasets from S3 to ADLS

Getting ready

How to do it…

Creating the linked services and dataset for the pipeline

Creating the inner pipeline

Creating the outer pipeline

How it works…

See also

Copying data from Google Cloud Storage to Azure Data Lake

Getting ready

How to do it…

How it works…

See also

Copying data from Google BigQuery to Azure Data Lake Store

Getting ready

How to do it…

Migrating data from Google BigQuery to Azure Synapse

Getting ready

How to do it…

See also

Extending Azure Data Factory with Logic Apps and Azure Functions

Technical requirements

Triggering your data processing with Logic Apps

Getting ready

How to do it…

How it works…

There’s more…

Using the Web activity to call an Azure logic app

Getting ready

How to do it…

How it works…

There’s more…

Adding flexibility to your pipelines with Azure Functions

Getting ready…

How to do it…

How it works…

There’s more…

Microsoft Fabric and Power BI, Azure ML, and Cognitive Services

Technical requirements

Introducing Microsoft Fabric and Data Factory

Getting ready

How to do it...

How it works...

Microsoft Fabric Data Factory: A closer look at the pipelines

Getting ready

How to do it...

How it works...

Loading data with Microsoft Fabric Dataflows

Getting ready

How to do it...

How it works...

There’s more...

Automatically building ML models with speed and scale

Getting ready

How to do it...

How it works…

There’s more...

Analyzing and transforming data with Azure AI and prebuilt ML models

Getting ready

How to do it...

How it works...

There’s more...

Managing Deployment Processes with Azure DevOps

Technical requirements

Setting up Azure DevOps

Getting ready

How to do it...

How it works...

Publishing changes to ADF

Getting ready

How to do it...

How it works...

Deploying your features into the master branch

Getting ready

How to do it...

How it works...

Getting ready for the CI/CD of ADF

Getting ready

How to do it...

How it works...

Creating an Azure pipeline for CD

Getting ready

How to do it...

How it works...

There’s more...

Install and configure Visual Studio to work with ADF deployment

Getting ready

How to do it...

How it works...

Setting up ADF as a Visual Studio project

Getting ready

How to do it...

How it works…

Running Airflow DAGs with ADF

Getting ready

How to do it...

How it works...

Monitoring and Troubleshooting Data Pipelines

Technical requirements

Monitoring pipeline runs and integration runtimes

Getting ready

How to do it…

How it works…

Investigating failures – running pipelines in debug mode

Getting ready

How to do it…

How it works…

There’s more…

See also

Rerunning activities

Getting ready

How to do it…

How it works…

Configuring alerts for your Azure Data Factory runs

Getting ready

How to do it…

How it works…

There’s more…

See also

Working with Azure Data Explorer

Introduction to ADX

See also

Creating an ADX cluster and ingesting data

Getting ready

How to do it...

How it works...

See also

Orchestrating ADX data with Azure Data Factory

Getting ready

How to do it...

How it works...

There’s more...

See also

Ingesting data from Azure Blob storage to ADX in Azure Data Factory using the Copy activity

Getting ready

How to do it...

How it works...

There’s more...

The Best Practices of Working with ADF

Technical requirements

Setting up roles and permissions with access levels in ADF

Getting ready

How to do it…

How it works…

There is more…

See also

Setting up Meta ETL with ADF

Getting ready

How to do it...

How it works…

There’s more...

Leveraging ADF scalability: Performance tuning of an ADF pipeline

Getting ready

How to do it…

How it works…

There is more…

See also

Using ADF disaster recovery built-in features

Getting ready

How to do it...

How it works...

Change Data Capture

Getting ready

How to do it…

How it works…

There is more…

Managing Data Factory costs with FinOps

Getting ready

How to do it...

How it works...

Other Books You May Enjoy

Index


Preface

Azure Data Factory (ADF) is a modern data integration tool available on Microsoft Azure. This Azure Data Factory Cookbook, Second Edition helps you get up and running by showing you how to create and execute your first job in ADF. You’ll learn how to branch and chain activities, create custom activities, and schedule pipelines. This book will help you discover the benefits of cloud data warehousing, Azure Synapse Analytics, Azure Data Lake Storage Gen2, and Databricks, which are frequently used for big data analytics. Through practical recipes, you’ll learn how to actively engage with analytical tools from Azure data services and leverage your on-premises infrastructure with cloud-native tools to get relevant business insights.

As you advance, you’ll be able to integrate the most commonly used Azure services into ADF and understand how Azure services can be useful in designing ETL pipelines. The book will take you through the common errors that you may encounter while working with ADF and guide you in using the Azure portal to monitor pipelines. You’ll also understand error messages and resolve problems in connectors and data flows with the debugging capabilities of ADF.

Additionally, there is a focus on the latest cutting-edge technology, Microsoft Fabric. You’ll explore how Fabric enhances the platform’s capabilities for data integration and orchestration.

By the end of this book, you’ll be able to use ADF as the main ETL and orchestration tool for your data warehouse and data platform projects.

Who this book is for

This book is for ETL developers, data warehouse and ETL architects, software professionals, and anyone who wants to learn about the common and not-so-common challenges that are faced while developing traditional and hybrid ETL solutions using Microsoft’s ADF, Synapse Analytics, and Fabric. You’ll also find this book useful if you are looking for recipes to improve or enhance your existing ETL pipelines. Basic knowledge of data warehousing is expected.

What this book covers

Chapter 1, Getting Started with ADF, will provide a general introduction to the Azure data platform. In this chapter, you will learn about the ADF interface and options as well as common use cases. You will perform hands-on exercises in order to find ADF in the Azure portal and create your first ADF job.

Chapter 2, Orchestration and Control Flow, will introduce you to the building blocks of data processing in ADF. The chapter contains hands-on exercises that show you how to set up linked services and datasets for your data sources, use various types of activities, design data-processing workflows, and create triggers for data transfers.

Chapter 3, Setting Up Synapse Analytics, covers key features and benefits of cloud data warehousing and Azure Synapse Analytics. You will learn how to connect and configure Azure Synapse Analytics, load data, build transformation processes, and operate data flows.

Chapter 4, Working with Data Lake and Spark Pools, will cover the main features of the Azure Data Lake Storage Gen2. It is a multimodal cloud storage solution that is frequently used for big data analytics. We will load and manage the datasets that we will use for analytics in the next chapter.

Chapter 5, Working with Big Data and Databricks, will actively engage with analytical tools from Azure’s data services. You will learn how to build data models in Delta Lake using Azure Databricks and mapping data flows. The chapter will also show you how to set up HDInsight clusters and how to work with Delta tables.

Chapter 6, Data Migration – Azure Data Factory and Other Cloud Services, will walk through several illustrative examples of migrating data from the Amazon Web Services and Google Cloud providers. In addition, you will learn how to use ADF’s custom activities to work with providers that are not supported by Microsoft’s built-in connectors.

Chapter 7, Extending Azure Data Factory with Logic Apps and Azure Functions, will show you how to harness the power of serverless execution by integrating some of the most commonly used Azure services: Azure Logic Apps and Azure Functions. These recipes will help you understand how Azure services can be useful in designing Extract, Transform, Load (ETL) pipelines.

Chapter 8, Microsoft Fabric and Power BI, Azure ML, and Cognitive Services, will teach you how to build an ADF pipeline that operates on a pre-built Azure ML model. You will also create and run an ADF pipeline that leverages Azure AI for text data analysis. In the last three recipes, you’ll familiarize yourself with the primary components of Microsoft Fabric Data Factory.

Chapter 9, Managing Deployment Processes with Azure DevOps, will delve into setting up CI and CD for data analytics solutions in ADF using Azure DevOps. Throughout the process, we will also demonstrate how to use Visual Studio Code to facilitate the deployment of changes to ADF.

Chapter 10, Monitoring and Troubleshooting Data Pipelines, will introduce tools to help you manage and monitor your ADF pipelines. You will learn where and how to find more information about what went wrong when a pipeline failed, how to debug a failed run, how to set up alerts that notify you when there is a problem, and how to identify problems with your integration runtimes.

Chapter 11, Working with Azure Data Explorer, will help you to set up a data ingestion pipeline from ADF to Azure Data Explorer: it includes a step-by-step guide to ingesting JSON data from Azure Storage and will teach you how to transform data in Azure Data Explorer with ADF activities.

Chapter 12, The Best Practices of Working with ADF, will guide you through essential considerations, strategies, and practical recipes that will elevate your ADF projects to new heights of efficiency, security, and scalability.

To get the most out of this book

Basic knowledge of data warehousing is expected. You’ll need an Azure subscription to follow all the recipes given in the book. If you’re using a paid subscription, make sure to pause or delete the services after you are done using them, to avoid high usage costs.

Software/Hardware covered in the book | OS Requirements
Azure subscription (portal.azure.com) | Windows, macOS, or Linux
SQL Server Management Studio | Windows
Azure Data Studio | Windows, macOS, or Linux
Power BI or Microsoft Fabric subscription account | Windows, macOS, or Linux

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Azure-Data-Factory-Cookbook-Second-Edition. In case there’s an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781803246598.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”

A block of code is set as follows:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows:

# cp /usr/src/asterisk-addons/configs/cdr_mysql.conf.sample /etc/asterisk/cdr_mysql.conf

Bold: Indicates a new term, an important word, or words that you see on the screen. For instance, words in menus or dialog boxes also appear in the text like this. For example: “Select System info from the Administration panel.”

Warnings or important notes appear like this.

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected], and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit http://www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

Share your thoughts

Once you’ve read Azure Data Factory Cookbook, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application. 

The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781803246598

Submit your proof of purchase. That’s it! We’ll send your free PDF and other benefits to your email directly.

1

Getting Started with ADF

Microsoft Azure is a public cloud vendor. It offers different services for modern organizations. The Azure cloud has several key components, such as compute, storage, databases, and networks. They serve as building blocks for any organization that wants to reap the benefits of cloud computing. There are many benefits to using the cloud, including utility-style billing, usage metrics, elasticity, and security. Many organizations across the world already benefit from cloud deployment and have fully moved to the Azure cloud. They deploy business applications and run their business on the cloud. As a result, their data is stored in cloud storage and cloud applications.

Microsoft Azure offers a cloud analytics stack that helps us build modern analytics solutions, extract data from on-premises and cloud sources, and use data for decision-making processes, searching for patterns in data, and deploying machine learning applications.

In this chapter, we will meet Azure data platform services and the main cloud data integration service – Azure Data Factory (ADF). We will log in to Azure and navigate to the Data Factory service in order to create the first data pipeline and run the copy activity. Then, we will do the same exercise but will use different methods of data factory management and control by using Python, PowerShell, and the Copy Data tool.

If you don’t have an Azure account, we will cover how you can get a free Azure account.

In this chapter, we will cover the following recipes:

Introduction to the Azure data platform
Creating and executing our first job in ADF
Creating an ADF pipeline using the Copy Data tool
Creating an ADF pipeline using Python
Creating a data factory using PowerShell
Using templates to create ADF pipelines
Creating an Azure Data Factory using Azure Bicep

Introduction to the Azure data platform

The Azure data platform provides us with a number of data services for databases, data storage, and analytics. In Figure 1.1, you can find a list of services and their purpose:

Figure 1.1: Azure data platform services

Using Azure data platform services can help you build a modern analytics solution that is secure and scalable. The following diagram shows an example of a typical modern cloud analytics architecture:

Figure 1.2: Modern analytics solution architecture

You can find most of the Azure data platform services in this architecture. ADF is a core service for data movement and transformation.

Let’s learn more about the reference architecture in Figure 1.2. It starts with source systems. We can collect data from files, databases, APIs, IoT, and so on. Then, we can use Event Hubs for streaming data and ADF for batch operations. ADF will push data into Azure Data Lake as a staging area, and then we can prepare data for analytics and reporting in Azure Synapse Analytics. Moreover, we can use Databricks for big data processing and machine learning models. Power BI is the ultimate data visualization service. Finally, we can push data into Azure Cosmos DB if we want to use data in business applications.

Getting ready

In this recipe, we will create a free Azure account, log in to the Azure portal, and locate ADF services. If you have an Azure account already, you can skip the creation of the account and log straight in to the portal.

How to do it...

Open https://azure.microsoft.com/free/, then take the following steps:

Click Start Free.
You can sign in to your existing Microsoft account or create a new one. Let’s create one as an example.
Enter an email address in the format [email protected] and click Next.
Enter a password of your choice.
Verify your email by entering the code and click Next.
Fill in the information for your profile (Country, Name, and so on). It will also require your credit card information.
After you have finished the account creation, it will bring you to the Microsoft Azure portal, as shown in the following screenshot:

Figure 1.3: Azure portal

Now, we can explore the Azure portal and find Azure data services. Let’s find Azure Synapse Analytics. In the search bar, enter Azure Synapse Analytics and choose Azure Synapse Analytics. It will open the Synapse control panel, as shown in the following screenshot:

Figure 1.4: Azure Synapse Analytics menu

Here, we can launch a new instance of a Synapse Analytics workspace. Or you can find the Data Factories menu and launch a new Data Factory by using the Azure portal:

Figure 1.5: Azure Data factories menu

In the next recipe, we will create a new data factory.

Before doing anything with ADF, though, let’s review what we have covered about an Azure account and the difference between Synapse Analytics and Data Factories.

How it works...

Now that we have created a free Azure account, it gives us the following benefits:

12 months of free access to popular products
$250 worth of credit
25+ always-free products

The Azure account we created is free and you won’t be charged unless you choose to upgrade.

Moreover, we discovered the Azure data platform products, which we will use over the course of the book. The Azure portal has a friendly UI where we can easily locate, launch, pause, or terminate the service. Aside from the UI, Azure offers us other ways of communicating with Azure services, using the Command-line Interface (CLI), APIs, SDKs, and so on.
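To make the API route concrete, the sketch below builds the Azure Resource Manager (ARM) REST endpoint for listing the data factories in a subscription. This is only an illustrative sketch: the subscription ID is a placeholder, and the api-version value is an assumption based on a commonly used ADF REST API version.

```python
# Illustrative sketch: building the Azure Resource Manager (ARM) REST URL
# that lists the data factories in a subscription. The subscription ID is
# a placeholder, and the api-version is an assumption.

API_VERSION = "2018-06-01"  # assumed ADF REST API version


def list_factories_url(subscription_id: str) -> str:
    """Return the ARM endpoint that lists all data factories in a subscription."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        "/providers/Microsoft.DataFactory/factories"
        f"?api-version={API_VERSION}"
    )


url = list_factories_url("00000000-0000-0000-0000-000000000000")
print(url)
```

In practice, you would send an authenticated GET request to this URL; the point here is only the shape of the endpoint, which mirrors what the CLI and SDKs call on your behalf.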

Using the Microsoft Azure portal, you can choose the Analytics category and it will show you all the analytics services, as shown in the following screenshot:

Figure 1.6: Azure analytics services

Azure Synapse Analytics (ASA) and ADF overlap in functionality. ASA workspaces are an integrated analytics service that combines big data and data warehousing. They allow you to perform data integration, data warehousing, and big data analytics using a single service, and they support a wide range of data integration options, including SQL Server Integration Services (SSIS), ADF, and Spark transformations.

So, when should you use which? If you need a simple and cost-effective way to move and transform data between various sources and destinations, ADF is a good choice. However, if you need a more comprehensive analytics solution that can handle both big data and data warehousing, ASA is the way to go. In other words, standalone ADF is good for the orchestration of your data pipelines and workloads in general, but if you plan to leverage Synapse Data Warehouse or big data solutions, you should consider using ADF as part of ASA workspaces. Both have similar interfaces and functionality.

In this chapter, we will be using a standalone ADF.

Creating and executing our first job in ADF

ADF allows us to create workflows for transforming and orchestrating data movement. You may think of ADF as an Extract, Transform, Load (ETL) tool for the Azure cloud and the Azure data platform. ADF is Software as a Service (SaaS). This means that we don’t need to deploy any hardware or software. We pay for what we use. Often, ADF is referred to as code-free ETL as a service or managed service. The key operations of ADF are listed here:

- Ingest: Allows us to collect data and load it into Azure data platform storage or any other target location. ADF has 90+ data connectors.
- Control flow: Allows us to design code-free extracting and loading workflows.
- Data flow: Allows us to design code-free data transformations.
- Schedule: Allows us to schedule ETL jobs.
- Monitor: Allows us to monitor ETL jobs.

We have learned about the key operations of ADF. Next, we should try them.

Getting ready

In this recipe, we will continue from the previous recipe, where we found ASA in the Azure portal. We will create a data factory using the most straightforward method: the ADF User Interface (UI) in the Azure portal. It is important to have the correct permissions to create a new data factory. In our example, we are using a super admin account, so we should be good to go.

During the exercise, we will create a new resource group. A resource group is a collection of resources that share the same life cycle, permissions, and policies.

How to do it...

Let’s get back to our data factory:

1. If you have closed the Data Factory console, open it again: search for Data factories and press Enter.
2. Click Create data factory, or Add if you are on the Data factories screen. This opens the project details, where we will choose a subscription (in our case, Free Trial).
3. We haven’t created a resource group yet. Click Create new and type the name ADFCookbook. Choose East US for Region, give the name as ADFcookbookJob-<YOUR NAME> (in my case, ADFcookbookJob-Dmitry), and leave the version as V2. Then, click Next: Git Configuration.
4. We can use GitHub or Azure DevOps. We won’t configure anything yet, so we will select Configure Git later. Then, click Next: Networking.
5. We have an option to increase the security of our pipelines using Managed Virtual Network and Private endpoint. For this recipe, we will use the default settings. Click Next.
6. Optionally, you can specify tags. Then, click Next: Review + Create. ADF will validate your settings and allow you to click Create.
7. Azure will deploy the data factory. We can choose our data factory and click Launch Studio. This opens the ADF UI home page, where we can find lots of useful tutorials and webinars under Help/Information in the top-right corner.
8. From the left panel, choose the New Pipeline icon, as shown in the following screenshot. It opens a window where we will start the creation of the pipeline. Choose New pipeline and it will open the pipeline1 window, where we must provide the following information: input, output, and compute. Add the name ADF-cookbook-pipeline1 and click Validate All:

Figure 1.7: ADF resources

9. When executing Step 8, you will find that you can’t save the pipeline without an activity. For our new data pipeline, we will do a simple copy data activity: we will copy a file from one blob folder to another. In this chapter, we won’t spend time spinning up resources such as databases, Synapse, or Databricks; later in this book, you will learn about using ADF with other data platform services. In order to copy data from Blob storage, we should create an Azure storage account and a Blob container.
10. Let’s create the Azure storage account. Go to All Services | Storage | Storage Accounts.
11. Click + Add.
12. Use our Free Trial subscription. For the resource group, we will use ADFCookbook. Give the storage account a name, such as adfcookbookstoragev2 (the name should be unique to you), then click Review and Create.
13. Click Go to Resource and select Containers on the left sidebar:

Figure 1.8: Azure storage account UI

14. Click + Container and enter the name adfcookbook.
15. Now, we want to upload the SalesOrders.txt data file. You can get this file from the book’s GitHub account at https://github.com/PacktPublishing/Azure-Data-Factory-Cookbook-Second-Edition/Chapter01/. Go to the adfcookbook container and click Upload. We will specify the folder name as input. We just uploaded the file to the cloud! You can find it under the container/folder/file path adfcookbook/input/SalesOrders.txt.
16. Next, we can go back to ADF. In order to finish the pipeline, we should add an input dataset and create a new linked service.
17. In ADF Studio, click the Manage icon on the left sidebar. This will open the linked services. Click + New and choose Azure Blob Storage, then click Continue.
18. We can optionally change the name or leave it as the default, but we have to select From Azure Subscription and choose the Azure subscription and the storage account name that we just created.
19. Click Test Connection and, if all is good, click Create.
20. Next, we will add a dataset. Go to our pipeline and click New dataset, as shown in the following screenshot:

Figure 1.9: ADF resources

21. Choose Azure Blob Storage and click Continue. Choose the Binary format type for our text file and click Continue.
22. Now, we can specify the AzureBlobStorage1 linked service, set the path to the adfcookbook/input/SalesOrders.txt file, and click Create.
23. We can give the dataset a name in Properties. Type in SalesOrdersDataset and click Validate all. We shouldn’t encounter any issues with the data.
24. We should add one more dataset as the output for our job. Let’s create a new dataset with the name SalesOrdersDatasetOutput and the path adfcookbook/output/SalesOrders.txt.
25. Now, we can go back to our data pipeline. We couldn’t save it when we created it without a proper activity; now, we have all that we need in order to finish the pipeline. Add the new pipeline and give it the name ADF-cookbook-pipeline1. Then, from the activity list, expand Move & transform and drag and drop the Copy Data step onto the canvas.
26. We have to specify the parameters of the step – the source and sink information. Click the Source tab and choose our dataset, SalesOrdersDataset.
27. Click the Sink tab and choose SalesOrdersDatasetOutput. This will be our output folder.
28. Now, we can publish the two datasets and one pipeline. Click Publish All.
29. Then, we can trigger our pipeline manually. Click Add trigger, as shown in the following screenshot:

Figure 1.10: ADF canvas with the Copy Data activity

30. Select Trigger Now. It will launch our job.
31. We can click on Monitor from the left sidebar and find the pipeline runs. In the case of failure, we can pick up the logs here and find the root cause. In our case, the ADF-cookbook-pipeline1 pipeline succeeds. In order to see the outcome, we should go to Azure Storage and open our container. You can find the additional output folder and a file named SalesOrders.txt there.

We have just created our first job using the UI. Let’s learn more about ADF.

How it works...

Using the ADF UI, we created a new pipeline – an ETL job. We specified input and output datasets and used Azure Blob storage as a linked service. A linked service is essentially a connection string: ADF uses it to connect to external resources. Datasets, on the other hand, represent the structure of the data within the data stores. We performed a simple activity: copying data from one folder to another. After the job ran, we reviewed the Monitor section with the job run logs.

There’s more...

An ADF pipeline is a set of JSON config files. You can also view the JSON for each pipeline, dataset, and so on in the portal by clicking the three dots in the top-right corner. We are using the UI to create the configuration file and run the job. You can review the JSON config file by clicking on Download support files to download a JSON file, as shown in the following figure:

Figure 1.11: Downloading the pipeline config files

This will save the archive file. Extract it and you will find a folder with the following subfolders:

- Dataset
- LinkedService
- Pipeline

Each folder has a corresponding JSON config file.
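To make the references between these files concrete, the following Python sketch models the three definitions as plain dictionaries. The property names follow the ADF JSON schema, but the values are illustrative examples based on this recipe, not the exact files your factory will generate:

```python
import json

# A linked service holds the connection information (essentially a connection string).
linked_service = {
    "name": "AzureBlobStorage1",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {"connectionString": "<storage-connection-string>"},
    },
}

# A dataset describes where and how the data is stored.
# It points at the linked service by name, not at storage directly.
dataset = {
    "name": "SalesOrdersDataset",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage1",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "adfcookbook",
                "folderPath": "input",
                "fileName": "SalesOrders.txt",
            }
        },
    },
}

# The pipeline's Copy activity, in turn, references datasets by name.
pipeline = {
    "name": "ADF-cookbook-pipeline1",
    "properties": {
        "activities": [
            {
                "name": "Copy data1",
                "type": "Copy",
                "inputs": [{"referenceName": "SalesOrdersDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SalesOrdersDatasetOutput", "type": "DatasetReference"}],
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

Notice the chain of references: the pipeline refers to datasets, and each dataset refers to a linked service. This is why the support files are split into exactly those three folders.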

See also

You can find more information about ADF in this Microsoft video, Introduction to Azure Data Factory: https://azure.microsoft.com/en-us/resources/videos/detailed-introduction-to-azure-data-factory/.

Creating an ADF pipeline using the Copy Data tool

We just reviewed how to create an ADF job using the UI. However, we can also use the Copy Data tool (CDT). The CDT allows us to load data into Azure storage faster: we don’t need to set up linked services, pipelines, and datasets manually as we did in the previous recipe. In other words, depending on your task, you can use either the ADF UI or the CDT. Typically, we will use the CDT for simple load operations, when we have lots of data files and would like to ingest them into the data lake as fast as possible.

Getting ready

In this recipe, we will use the CDT in order to do the same task of copying data from one folder to another.

How to do it...

We already created the ADF job with the UI. Let’s review the CDT:

1. In the previous recipe, we created the Azure Blob storage instance and container. We will use the same file and the same container. However, we have to delete the file from the output location first.
2. Go to Azure Storage Accounts, choose adfcookbookstorage, and click Containers. Choose adfcookbook. Go to the output folder and delete the SalesOrders.txt file.
3. Now, we can go back to Data Factory Studio. On the home page, we can see the Ingest tile. Click on it; this will open the CDT wizard.
4. Click Built-in copy task. Choose Run once now. Click Next.
5. We should choose the data source – AzureBlobStorage1 – and specify the folder and file. You can browse the blob storage and you will find the filename. The path should look like adfcookbook/input/SalesOrders.txt. Mark Binary copy. When we choose the binary option, the file is treated as binary and the schema is not enforced; this is a great option to just copy the file as is. Click Next.
6. Next, we will choose the destination. Choose AzureBlobStorage2 and click Next. Enter the output path adfcookbook/output and click Next until you reach the end.
7. Give it the task name CDT-copy-job and click Next. As a result, you should get a similar output to mine, as you can see in the following screenshot:

Figure 1.12: CDT UI

If we go to the storage account, we will find that CDT copied data into the Output folder.

We have created a copy job using CDT.

How it works...

The CDT essentially created the data pipeline for us. If you go to the Author section of ADF Studio, you will find a new job and new datasets.

There’s more...

You can learn more about the CDT at the Microsoft documentation page: https://docs.microsoft.com/en-us/azure/data-factory/copy-data-tool.

Creating an ADF pipeline using Python

We can use PowerShell, .NET, and Python for ADF deployment and data integration automation. Here is an extract from the Microsoft documentation:

”Azure Automation delivers a cloud-based automation and configuration service that provides consistent management across your Azure and non-Azure environments. It consists of process automation, update management, and configuration features. Azure Automation provides complete control during deployment, operations, and decommissioning of workloads and resources.”

In this recipe, we want to cover the Python scenario because Python is one of the most popular languages for analytics and data engineering. We will use Jupyter Notebook with example code.

You can use Jupyter notebooks or Visual Studio Code notebooks.

Getting ready

For this exercise, we will use Python in order to create a data pipeline and copy our file from one folder to another. We need to use the azure-mgmt-datafactory and azure-mgmt-resource Python packages as well as some other libraries that we will cover in the example.

How to do it...

We will create an ADF pipeline using Python. We will start with some preparatory steps:

1. We will start with the deletion of our file in the output directory. Go to Azure Storage Accounts, choose adfcookbookstorage, and click Containers. Choose adfcookbook. Go to the output folder and delete the SalesOrders.txt file.
2. We will install the Azure management resources Python package by running this command from the CLI (in my example, I used Terminal on macOS):
   pip install azure-mgmt-resource
3. Next, we will install the ADF Python package by running this command from the CLI:
   pip install azure-mgmt-datafactory
4. Also, I installed these packages to run the code from Jupyter:
   pip install msrestazure
   pip install azure.mgmt.datafactory
   pip install azure.identity

Once we have installed the Python packages, we can use them to create the data pipeline, datasets, and linked service, as well as to run the job. Python gives us flexibility, and we could embed this code into our analytics application or into Spark/Databricks.

The code itself is quite long; you can find it in the Git repo for this chapter as ADF_Python_Run.ipynb.

1. In order to control Azure resources from the Python code, we have to register an app with Azure Active Directory and assign the Contributor role to this app in Identity and Access Management (IAM) under our subscription. We have to get tenant_id, client_id, and client_secret. You can learn more about this process in the official Microsoft documentation: https://learn.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal. We will provide brief steps here.
2. Go to Azure Active Directory and click App registrations. Click + New registration. Enter the name ADFcookbookapp and click Register. From the app properties, copy Application (client) ID and Directory (tenant) ID.
3. Still in ADFcookbookapp, go to Certificates & secrets on the left sidebar. Click + New client secret and add a new client secret. Copy its value.
4. Next, we should give permissions to our app. Go to the subscriptions and choose Free Trial. Click on IAM, then Add role assignments. Select the Contributor role under Privileged administrator roles and click Next. Assign access to a user, group, or service principal. Finally, search for our app, ADFcookbookapp, and click Save. We have just granted access to the app and can use these credentials in our Python code. If you don’t give permission, you will get the following error message: AuthorizationFailed.
5. Open ADF_Python_Run.ipynb and make sure that you have all the libraries in place by executing the first code block. You can open the file in Jupyter Notebook:
   from azure.identity import ClientSecretCredential
   from azure.mgmt.resource import ResourceManagementClient
   from azure.mgmt.datafactory import DataFactoryManagementClient
   from azure.mgmt.datafactory.models import *
   from datetime import datetime, timedelta
   import time
   You should run this piece without any problems. If you encounter an issue, it means you are missing a Python package. Make sure that you have installed all of the packages.
6. Run section 2 in the notebook. You can find the notebook in the GitHub repository with the book files.
7. In section 3, Authenticate Azure, you have to enter the user_name, subscription_id, tenant_id, client_id, and client_secret values. The resource group and data factory names we can leave as is. Then, run section 4, Created Data Factory.
8. The Python code will also interact with the Azure storage account, and we should provide the storage account name and key. For this chapter, we are using the adfcookbookstorage storage account; you can find the key under the Access keys section of this storage account’s menu. Copy the key value, paste it into section 5, Created a Linked Service, and run it.
9. In sections 6 and 7, we create the input and output datasets. You can run the code as is. In section 8, we create the data pipeline and specify the CopyActivity activity.
10. Finally, we run the pipeline in section 9, Create a pipeline run.
11. In section 10, Monitor a pipeline run, we check the output of the run. We should get the following: Pipeline run status: Succeeded
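The final monitoring step boils down to polling the run status until it leaves the in-progress states. The following self-contained sketch shows that loop; get_pipeline_run_status is a stand-in for the SDK call adf_client.pipeline_runs.get(rg_name, df_name, run_id).status, simulated here so the example runs without Azure credentials:

```python
import time

def get_pipeline_run_status(run_id, _state={"calls": 0}):
    """Stand-in for adf_client.pipeline_runs.get(rg_name, df_name, run_id).status.
    Simulates a run that reports 'InProgress' twice, then 'Succeeded'."""
    _state["calls"] += 1
    return "InProgress" if _state["calls"] < 3 else "Succeeded"

def wait_for_pipeline_run(run_id, poll_seconds=0.01, timeout_seconds=5):
    """Poll until the run reaches a terminal state, as in section 10 of the notebook."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_pipeline_run_status(run_id)
        # 'Queued' and 'InProgress' are the non-terminal ADF run statuses.
        if status not in ("Queued", "InProgress"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Pipeline run {run_id} did not finish in time")

print("Pipeline run status:", wait_for_pipeline_run("example-run-id"))
# → Pipeline run status: Succeeded
```

In the real notebook, the same loop simply swaps the stub for the authenticated DataFactoryManagementClient call and uses a longer polling interval.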

We just created an ADF job with Python. Let’s add more details.

How it works...