This new edition of the Azure Data Factory Cookbook, fully updated to reflect ADF V2, will help you get up and running by showing you how to create and execute your first job in ADF.
You’ll learn how to branch and chain activities, create custom activities, and schedule pipelines, as well as discovering the benefits of Cloud Data Warehousing, Azure Synapse Analytics, and Azure Data Lake Storage Gen2.
With practical recipes, you’ll learn how to actively engage with analytical tools from Azure's data services and leverage your on-premises infrastructure with cloud-native tools to get relevant business insights. As you advance, you’ll be able to integrate the most commonly used Azure services into ADF and understand how Azure services can be useful in designing ETL pipelines. You'll familiarize yourself with the common errors that you may encounter while working with ADF and find out how to use the Azure portal to monitor pipelines. You’ll also understand error messages and resolve problems in Connectors and Data flows with the debugging capabilities of ADF.
Two new chapters covering Azure Data Explorer and key best practices have been added, along with new recipes throughout.
By the end of this book, you’ll be able to use ADF as the main ETL and orchestration tool for your Data Warehouse or Data Platform projects.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 401
Year of publication: 2024
Azure Data Factory Cookbook
Second Edition
A data engineer’s guide to building and managing ETL and ELT pipelines with data integration
Dmitry Foshin
Tonya Chernyshova
Dmitry Anoshin
Xenia Hertzenberg
BIRMINGHAM—MUMBAI
Azure Data Factory Cookbook
Second Edition
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Senior Publishing Product Manager: Gebin George
Acquisition Editor – Peer Reviews: Tejas Mhasvekar
Project Editor: Janice Gonsalves
Content Development Editors: Soham Amburle, Elliot Dallow
Copy Editor: Safis Editing
Technical Editor: Anjitha Murali
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Presentation Designer: Ajay Patule
Developer Relations Marketing Executive: Vignesh Raju
First published: December 2020
Second edition: February 2024
Production reference: 1220224
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80324-659-8
www.packt.com
Dmitry Foshin is a lead data engineer with 12+ years of experience in IT and Big Data. He focuses on delivering business insights through adept data engineering, analytics, and visualization. He excels in leading and executing full-stack big data analytics solutions, from ETL processes to data warehouses, utilizing Azure cloud technologies and modern BI tools. Along with being a co-author of the Azure Data Factory Cookbook, Dmitry has also launched successful data analytics projects for FMCG corporations across Europe.
I would like to express my heartfelt gratitude to my wife, Mariia, my parents, my brother, Ilya, and all my family and friends who supported me and provided encouragement throughout the journey of producing this book. Your unwavering support has been invaluable, and I am deeply grateful for your presence in my life.
Tonya Chernyshova is an experienced data engineer with a proven track record of successfully delivering scalable, maintainable, and impactful data products.
She is highly proficient in data modeling, automation, cloud computing, and data visualization, consistently driving data-driven insights and business growth.
Dmitry Anoshin is a data engineering leader with 15 years of experience working in business intelligence, data warehousing, data integration, big data, cloud, and machine learning space across North America and Europe.
He leads data engineering initiatives on a petabyte-scale data platform that was built using cloud and big data technologies to support machine learning experiments, data science models, business intelligence reporting, and data exchange with internal and external partners. He is also responsible for handling privacy compliance and security-critical datasets.
Besides that work, he teaches a cloud computing course at the University of Victoria, Canada, mentors high school computer science students, and teaches people how to land data jobs at Surfalytics.com. In addition, he is an author of analytics books and a speaker at data-related conferences and user groups.
I want to thank my beautiful wife, Lana, and my kids, Vasily, Anna, and Michael, who give me the energy to work, grow, and contribute to the data industry.
Xenia Hertzenberg is a software engineer at Microsoft and has extensive knowledge in the field of data engineering, big data pipelines, data warehousing, and systems architecture.
Deepak Goyal is a certified Azure Cloud Solution Architect with over fifteen years of expertise in designing, developing, and managing enterprise cloud solutions. He is also a certified Big Data professional and a passionate cloud advocate.
Saikat Dutta is an Azure Data Engineer with over 13 years of experience. He has worked extensively with Microsoft data products, from SQL Server 2000 to ADF, Synapse Pipelines, and Microsoft Fabric. The highlights of his career, shaped by a variety of employers, have been adaptability and a commitment to staying at the forefront of technology.
This is the first book he has reviewed; he has tried to provide practical insights into Microsoft data products and to help the book become more than just a cookbook. He has also contributed to a popular data newsletter and blog to share knowledge with the tech community.
Excited about the book’s impact, I look forward to continuing my journey in the evolving field of Data Engineering.
I express gratitude to my family for their unwavering support during the review process. Balancing work and family, especially with a younger kid, wouldn’t have been possible without their cooperation.
Join our community’s Discord space for discussions with the authors and other readers:
https://discord.gg/U229qmBmT3
Preface
Who this book is for
What this book covers
To get the most out of this book
Get in touch
Getting Started with ADF
Introduction to the Azure data platform
Getting ready
How to do it...
How it works...
Creating and executing our first job in ADF
Getting ready
How to do it...
How it works...
There’s more...
See also
Creating an ADF pipeline using the Copy Data tool
Getting ready
How to do it...
How it works...
There’s more...
Creating an ADF pipeline using Python
Getting ready
How to do it...
How it works...
There’s more...
See also
Creating a data factory using PowerShell
Getting ready
How to do it…
How it works...
There’s more...
See also
Using templates to create ADF pipelines
Getting ready
How to do it...
How it works...
See also
Creating an Azure Data Factory using Azure Bicep
Getting ready
How to do it...
How it works...
There’s more...
See also
Orchestration and Control Flow
Technical requirements
Using parameters and built-in functions
Getting ready
How to do it…
How it works…
There’s more…
See also
Using the Metadata and Stored Procedure activities
Getting ready
How to do it…
How it works…
There’s more…
Using the ForEach and Filter activities
Getting ready
How to do it…
How it works…
Chaining and branching activities within a pipeline
Getting ready
How to do it…
There’s more…
Using the Lookup, Web, and Execute Pipeline activities
Getting ready
How to do it…
How it works…
There’s more…
See also
Creating event-based pipeline triggers
Getting ready
How to do it…
How it works…
There’s more…
See also
Setting Up Synapse Analytics
Technical requirements
Creating an Azure Synapse workspace
Getting ready
How to do it…
There’s more…
Loading data to Azure Synapse Analytics using Azure Data Factory
Getting ready
How to do it…
How it works…
There’s more…
Loading data to Azure Synapse Analytics using Azure Data Studio
Getting ready
How to do it…
How it works…
There’s more…
Loading data to Azure Synapse Analytics using bulk load
Getting ready
How to do it…
How it works…
Pausing/resuming an Azure Synapse SQL pool from Azure Data Factory
Getting ready
How to do it…
How it works…
There’s more…
Working with Azure Purview using Azure Synapse
Getting ready
How to do it…
How it works…
There’s more...
Copying data in Azure Synapse Integrate
Getting ready
How to do it…
How it works…
Using a Synapse serverless SQL pool
Getting ready
How to do it…
How it works…
There’s more…
Working with Data Lake and Spark Pools
Technical requirements
Setting up Azure Data Lake Storage Gen2
Getting ready
How to do it...
There’s more...
Creating a Synapse Analytics Spark pool
Getting ready
How to do it...
How it works...
There’s more...
Integrating Azure Data Lake and running Spark pool jobs
Getting ready
How to do it...
How it works...
Building and orchestrating a data pipeline for Data Lake and Spark
Getting ready
How to do it...
How it works...
There’s more...
Working with Big Data and Databricks
Introduction
Technical requirements
Setting up an HDInsight cluster
Getting ready
How to do it…
How it works…
There is more…
Processing data from Azure Data Lake with HDInsight and Hive
Getting ready
How to do it…
How it works…
Building data models in Delta Lake and data pipeline jobs with Databricks
Getting ready
How to do it…
How it works…
There is more…
Ingesting data into Delta Lake using Mapping Data Flows
Getting ready
How to do it…
How it works…
There is more…
External integrations with other compute engines (Snowflake)
Getting ready
How to do it…
How it works…
There is more…
Data Migration – Azure Data Factory and Other Cloud Services
Technical requirements
Copying data from Amazon S3 to Azure Blob storage
Getting ready
How to do it…
How it works…
Copying large datasets from S3 to ADLS
Getting ready
How to do it…
Creating the linked services and dataset for the pipeline
Creating the inner pipeline
Creating the outer pipeline
How it works…
See also
Copying data from Google Cloud Storage to Azure Data Lake
Getting ready
How to do it…
How it works…
See also
Copying data from Google BigQuery to Azure Data Lake Store
Getting ready
How to do it…
Migrating data from Google BigQuery to Azure Synapse
Getting ready
How to do it…
See also
Extending Azure Data Factory with Logic Apps and Azure Functions
Technical requirements
Triggering your data processing with Logic Apps
Getting ready
How to do it…
How it works…
There’s more…
Using the Web activity to call an Azure logic app
Getting ready
How to do it…
How it works…
There’s more…
Adding flexibility to your pipelines with Azure Functions
Getting ready…
How to do it…
How it works…
There’s more…
Microsoft Fabric and Power BI, Azure ML, and Cognitive Services
Technical requirements
Introducing Microsoft Fabric and Data Factory
Getting ready
How to do it...
How it works...
Microsoft Fabric Data Factory: A closer look at the pipelines
Getting ready
How to do it...
How it works...
Loading data with Microsoft Fabric Dataflows
Getting ready
How to do it...
How it works...
There’s more...
Automatically building ML models with speed and scale
Getting ready
How to do it...
How it works…
There’s more...
Analyzing and transforming data with Azure AI and prebuilt ML models
Getting ready
How to do it...
How it works...
There’s more...
Managing Deployment Processes with Azure DevOps
Technical requirements
Setting up Azure DevOps
Getting ready
How to do it...
How it works...
Publishing changes to ADF
Getting ready
How to do it...
How it works...
Deploying your features into the master branch
Getting ready
How to do it...
How it works...
Getting ready for the CI/CD of ADF
Getting ready
How to do it...
How it works...
Creating an Azure pipeline for CD
Getting ready
How to do it...
How it works...
There’s more...
Installing and configuring Visual Studio to work with ADF deployment
Getting ready
How to do it...
How it works...
Setting up ADF as a Visual Studio project
Getting ready
How to do it...
How it works…
Running Airflow DAGs with ADF
Getting ready
How to do it...
How it works...
Monitoring and Troubleshooting Data Pipelines
Technical requirements
Monitoring pipeline runs and integration runtimes
Getting ready
How to do it…
How it works…
Investigating failures – running pipelines in debug mode
Getting ready
How to do it…
How it works…
There’s more…
See also
Rerunning activities
Getting ready
How to do it…
How it works…
Configuring alerts for your Azure Data Factory runs
Getting ready
How to do it…
How it works…
There’s more…
See also
Working with Azure Data Explorer
Introduction to ADX
See also
Creating an ADX cluster and ingesting data
Getting ready
How to do it...
How it works...
See also
Orchestrating ADX data with Azure Data Factory
Getting ready
How to do it...
How it works...
There’s more...
See also
Ingesting data from Azure Blob storage to ADX in Azure Data Factory using the Copy activity
Getting ready
How to do it...
How it works...
There’s more...
The Best Practices of Working with ADF
Technical requirements
Setting up roles and permissions with access levels in ADF
Getting ready
How to do it…
How it works…
There is more…
See also
Setting up Meta ETL with ADF
Getting ready
How to do it...
How it works…
There’s more...
Leveraging ADF scalability: Performance tuning of an ADF pipeline
Getting ready
How to do it…
How it works…
There is more…
See also
Using ADF disaster recovery built-in features
Getting ready
How to do it...
How it works...
Change Data Capture
Getting ready
How to do it…
How it works…
There is more…
Managing Data Factory costs with FinOps
Getting ready
How to do it...
How it works...
Other Books You May Enjoy
Index
Azure Data Factory (ADF) is a modern data integration tool available on Microsoft Azure. This Azure Data Factory Cookbook, Second Edition helps you get up and running by showing you how to create and execute your first job in ADF. You’ll learn how to branch and chain activities, create custom activities, and schedule pipelines. This book will help you discover the benefits of cloud data warehousing, Azure Synapse Analytics, Azure Data Lake Storage Gen2, and Databricks, which are frequently used for big data analytics. Through practical recipes, you’ll learn how to actively engage with analytical tools from Azure Data Services and leverage your on-premises infrastructure with cloud-native tools to get relevant business insights.
As you advance, you’ll be able to integrate the most commonly used Azure services into ADF and understand how Azure services can be useful in designing ETL pipelines. The book will take you through the common errors that you may encounter while working with ADF and guide you in using the Azure portal to monitor pipelines. You’ll also understand error messages and resolve problems in connectors and data flows with the debugging capabilities of ADF.
Additionally, there is a focus on the latest cutting-edge technology in Microsoft Fabric. You’ll explore how this technology enhances ADF’s capabilities for data integration and orchestration.
By the end of this book, you’ll be able to use ADF as the main ETL and orchestration tool for your data warehouse and data platform projects.
This book is for ETL developers, data warehouse and ETL architects, software professionals, and anyone who wants to learn about the common and not-so-common challenges that are faced while developing traditional and hybrid ETL solutions using Microsoft’s ADF, Synapse Analytics, and Fabric. You’ll also find this book useful if you are looking for recipes to improve or enhance your existing ETL pipelines. Basic knowledge of data warehousing is expected.
Chapter 1, Getting Started with ADF, will provide a general introduction to the Azure data platform. In this chapter, you will learn about the ADF interface and options as well as common use cases. You will perform hands-on exercises in order to find ADF in the Azure portal and create your first ADF job.
Chapter 2, Orchestration and Control Flow, will introduce you to the building blocks of data processing in ADF. The chapter contains hands-on exercises that show you how to set up linked services and datasets for your data sources, use various types of activities, design data-processing workflows, and create triggers for data transfers.
Chapter 3, Setting Up Synapse Analytics, covers key features and benefits of cloud data warehousing and Azure Synapse Analytics. You will learn how to connect and configure Azure Synapse Analytics, load data, build transformation processes, and operate data flows.
Chapter 4, Working with Data Lake and Spark Pools, will cover the main features of the Azure Data Lake Storage Gen2. It is a multimodal cloud storage solution that is frequently used for big data analytics. We will load and manage the datasets that we will use for analytics in the next chapter.
Chapter 5, Working with Big Data and Databricks, will actively engage with analytical tools from Azure’s data services. You will learn how to build data models in Delta Lake using Azure Databricks and mapping data flows. This chapter will also show you how to set up HDInsight clusters and how to work with Delta tables.
Chapter 6, Data Migration – Azure Data Factory and Other Cloud Services, will walk through several illustrative examples of migrating data from the Amazon Web Services and Google Cloud platforms. In addition, you will learn how to use ADF’s custom activities to work with providers that are not supported by Microsoft’s built-in connectors.
Chapter 7, Extending Azure Data Factory with Logic Apps and Azure Functions, will show you how to harness the power of serverless execution by integrating some of the most commonly used Azure services: Azure Logic Apps and Azure Functions. These recipes will help you understand how Azure services can be useful in designing Extract, Transform, Load (ETL) pipelines.
Chapter 8, Microsoft Fabric and Power BI, Azure ML, and Cognitive Services, will teach you how to build an ADF pipeline that operates on a pre-built Azure ML model. You will also create and run an ADF pipeline that leverages Azure AI for text data analysis. In the last three recipes, you’ll familiarize yourself with the primary components of Microsoft Fabric Data Factory.
Chapter 9, Managing Deployment Processes with Azure DevOps, will delve into setting up CI and CD for data analytics solutions in ADF using Azure DevOps. Throughout the process, we will also demonstrate how to use Visual Studio Code to facilitate the deployment of changes to ADF.
Chapter 10, Monitoring and Troubleshooting Data Pipelines, will introduce tools to help you manage and monitor your ADF pipelines. You will learn where and how to find more information about what went wrong when a pipeline failed, how to debug a failed run, how to set up alerts that notify you when there is a problem, and how to identify problems with your integration runtimes.
Chapter 11, Working with Azure Data Explorer, will help you to set up a data ingestion pipeline from ADF to Azure Data Explorer: it includes a step-by-step guide to ingesting JSON data from Azure Storage and will teach you how to transform data in Azure Data Explorer with ADF activities.
Chapter 12, The Best Practices of Working with ADF, will guide you through essential considerations, strategies, and practical recipes that will elevate your ADF projects to new heights of efficiency, security, and scalability.
Basic knowledge of data warehousing is expected. You’ll need an Azure subscription to follow all the recipes given in the book. If you’re using a paid subscription, make sure to pause or delete the services after you are done using them, to avoid high usage costs.
Software/Hardware covered in the book             | OS Requirements
------------------------------------------------- | -------------------------
Azure subscription (portal.azure.com)             | Windows, macOS, or Linux
SQL Server Management Studio                      | Windows
Azure Data Studio                                 | Windows, macOS, or Linux
Power BI or Microsoft Fabric subscription account | Windows, macOS, or Linux
If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Azure-Data-Factory-Cookbook-Second-Edition. In case there’s an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781803246598.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”
A block of code is set as follows:
[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows:

# cp /usr/src/asterisk-addons/configs/cdr_mysql.conf.sample /etc/asterisk/cdr_mysql.conf

Bold: Indicates a new term, an important word, or words that you see on the screen. For instance, words in menus or dialog boxes also appear in the text like this. For example: “Select System info from the Administration panel.”
Warnings or important notes appear like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome.
General feedback: Email [email protected], and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you reported it to us. Please visit http://www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.
Once you’ve read Azure Data Factory Cookbook, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:

https://packt.link/free-ebook/9781803246598
Submit your proof of purchase. That’s it! We’ll send your free PDF and other benefits to your email directly.

Microsoft Azure is a public cloud vendor that offers a wide range of services for modern organizations. The Azure cloud has several key components, such as compute, storage, databases, and networking, which serve as building blocks for any organization that wants to reap the benefits of cloud computing, including utility-style pay-as-you-go billing, elasticity, and security. Many organizations across the world already benefit from cloud deployment and have moved fully to the Azure cloud: they deploy and run their business applications there, and, as a result, their data is stored in cloud storage and cloud applications.
Microsoft Azure offers a cloud analytics stack that helps us build modern analytics solutions: extract data from on-premises systems and the cloud, use data for decision-making, search for patterns in data, and deploy machine learning applications.
In this chapter, we will meet the Azure data platform services and the main cloud data integration service – Azure Data Factory (ADF). We will log in to Azure and navigate to the Data Factory service in order to create our first data pipeline and run a copy activity. Then, we will repeat the exercise using different methods of managing and controlling a data factory: Python, PowerShell, and the Copy Data tool.
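Whichever method you use (the portal UI, Python, PowerShell, or the Copy Data tool), every client ultimately talks to the same Azure Resource Manager REST endpoint. The sketch below only builds the request URL and body without sending anything; the subscription ID, resource group, and factory name are placeholders, and the API version shown is the one commonly documented for ADF:

```python
import json

# Hypothetical identifiers - substitute your own subscription, resource
# group, and factory name before using a URL like this for real.
subscription_id = "00000000-0000-0000-0000-000000000000"
resource_group = "ADFCookbook"
factory_name = "ADFcookbookJob-Dmitry"

# An ADF factory is an Azure Resource Manager (ARM) resource, so creating
# one boils down to a PUT against the management endpoint.
url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}"
    f"/providers/Microsoft.DataFactory/factories/{factory_name}"
    "?api-version=2018-06-01"
)

# A minimal request body: a region plus a system-assigned managed identity.
body = json.dumps({"location": "eastus", "identity": {"type": "SystemAssigned"}})

print(url)
print(body)
```

The Python and PowerShell recipes later in this chapter wrap this kind of call in the azure-mgmt-datafactory SDK and the Az.DataFactory module, so you rarely need to craft the request by hand.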
If you don’t have an Azure account, we will cover how you can get a free Azure account.
In this chapter, we will cover the following recipes:
- Introduction to the Azure data platform
- Creating and executing our first job in ADF
- Creating an ADF pipeline using the Copy Data tool
- Creating an ADF pipeline using Python
- Creating a data factory using PowerShell
- Using templates to create ADF pipelines
- Creating an Azure Data Factory using Azure Bicep

The Azure data platform provides us with a number of data services for databases, data storage, and analytics. In Table 1.1, you can find a list of services and their purpose:
Figure 1.1: Azure data platform services
Using Azure data platform services can help you build a modern analytics solution that is secure and scalable. The following diagram shows an example of a typical modern cloud analytics architecture:
Figure 1.2: Modern analytics solution architecture
You can find most of the Azure data platform services in this reference architecture; ADF is a core service for data movement and transformation.
Let’s learn more about the reference architecture in Figure 1.2. It starts with source systems. We can collect data from files, databases, APIs, IoT, and so on. Then, we can use Event Hubs for streaming data and ADF for batch operations. ADF will push data into Azure Data Lake as a staging area, and then we can prepare data for analytics and reporting in Azure Synapse Analytics. Moreover, we can use Databricks for big data processing and machine learning models. Power BI is the ultimate data visualization service. Finally, we can push data into Azure Cosmos DB if we want to use data in business applications.
In this recipe, we will create a free Azure account, log in to the Azure portal, and locate ADF services. If you have an Azure account already, you can skip the creation of the account and log straight in to the portal.
Open https://azure.microsoft.com/free/, then take the following steps:
1. Click Start Free.
2. You can sign in to your existing Microsoft account or create a new one. Let’s create one as an example.
3. Enter an email address in the format [email protected] and click Next.
4. Enter a password of your choice.
5. Verify your email by entering the code and click Next.
6. Fill in the information for your profile (Country, Name, and so on). It will also require your credit card information.
7. After you have finished the account creation, it will bring you to the Microsoft Azure portal, as shown in the following screenshot:

Figure 1.3: Azure portal
Now, we can explore the Azure portal and find Azure data services. Let’s find Azure Synapse Analytics. In the search bar, enter Azure Synapse Analytics and choose Azure Synapse Analytics. It will open the Synapse control panel, as shown in the following screenshot:

Figure 1.4: Azure Synapse Analytics menu
Here, we can launch a new instance of a Synapse Analytics workspace. Or you can find the Data Factories menu and launch a new Data Factory by using the Azure portal:
Figure 1.5: Azure Data factories menu
In the next recipe, we will create a new data factory.
Before doing anything with ADF, though, let’s review what we have covered about an Azure account and the difference between Synapse Analytics and Data Factories.
Now that we have created a free Azure account, it gives us the following benefits:
- 12 months of free access to popular products
- $250 worth of credit
- 25+ always-free products

The Azure account we created is free, and you won’t be charged unless you choose to upgrade.
Moreover, we discovered the Azure data platform products, which we will use over the course of the book. The Azure portal has a friendly UI where we can easily locate, launch, pause, or terminate the service. Aside from the UI, Azure offers us other ways of communicating with Azure services, using the Command-line Interface (CLI), APIs, SDKs, and so on.
Using the Microsoft Azure portal, you can choose the Analytics category and it will show you all the analytics services, as shown in the following screenshot:
Figure 1.6: Azure analytics services
Azure Synapse Analytics (ASA) and ADF have overlapping functionality. An ASA workspace is an integrated analytics service that combines big data and data warehousing: it allows you to perform data integration, data warehousing, and big data analytics from a single service, and it supports a wide range of data integration options, including SQL Server Integration Services (SSIS), ADF, and Spark transformations.
So, when to use what? If you need a simple and cost-effective way to move and transform data from various sources to various destinations, ADF is a good choice. However, if you need a more comprehensive analytics solution that can handle both big data and data warehousing, ASA is the way to go. In other words, standalone ADF is good for the orchestration of your data pipelines and workloads in general. But if you are willing to leverage Synapse Data Warehouse or big data solutions, you should consider using ADF as a part of ASA workspaces. Both have similar interfaces and functionality.
In this chapter, we will be using a standalone ADF.
ADF allows us to create workflows for transforming and orchestrating data movement. You may think of ADF as an Extract, Transform, Load (ETL) tool for the Azure cloud and the Azure data platform. ADF is Software as a Service (SaaS). This means that we don’t need to deploy any hardware or software. We pay for what we use. Often, ADF is referred to as code-free ETL as a service or managed service. The key operations of ADF are listed here:
- Ingest: Allows us to collect data and load it into Azure data platform storage or any other target location. ADF has 90+ data connectors.
- Control flow: Allows us to design code-free extracting and loading workflows.
- Data flow: Allows us to design code-free data transformations.
- Schedule: Allows us to schedule ETL jobs.
- Monitor: Allows us to monitor ETL jobs.

We have learned about the key operations of ADF. Next, we should try them out.
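To make these operations more tangible, here is a minimal pipeline definition of the kind ADF stores as JSON behind its visual designer: a single Copy activity (the ingest operation) wired between two dataset references. The dataset names are hypothetical placeholders; a real pipeline would reference datasets and linked services already defined in your factory:

```json
{
  "name": "CopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToBlob",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SinkDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "BlobSink" }
        }
      }
    ]
  }
}
```

Control flow, schedules (triggers), and data flows are expressed as further JSON documents in the same factory, which is what makes ADF pipelines easy to put under version control.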
In this recipe, we will continue from the previous recipe, where we found ASA in the Azure console. We will create a data factory using the most straightforward method: through the ADF User Interface (UI) in the Azure portal. It is important to have the correct permissions to create a new data factory. In our example, we are using a super admin account, so we should be good to go.
During the exercise, we will create a new resource group. It is a collection of resources that share the same life cycle, permissions, and policies.
Let’s get back to our data factory:
1. If you have closed the Data Factory console, open it again. Search for Data factories and press Enter.
2. Click Create data factory, or Add if you are on the Data factories screen, and it will open the project details, where we will choose a subscription (in our case, Free Trial).
3. We haven't created a resource group yet. Click Create new and type the name ADFCookbook. Choose East US for Region, give the name as ADFcookbookJob-<YOUR NAME> (in my case, ADFcookbookJob-Dmitry), and leave the version as V2. Then, click Next: Git Configuration.
4. We can use GitHub or Azure DevOps. We won't configure anything yet, so we will select Configure Git later. Then, click Next: Networking.
5. We have the option to increase the security of our pipelines using Managed Virtual Network and Private endpoint. For this recipe, we will use the default settings. Click Next.
6. Optionally, you can specify tags. Then, click Next: Review + Create. ADF will validate your settings and allow you to click Create.
7. Azure will deploy the data factory. We can choose our data factory and click Launch Studio. This will open the ADF UI home page, where we can find lots of useful tutorials and webinars under Help/Information in the top-right corner.
8. From the left panel, choose the New Pipeline icon, as shown in the following screenshot, and it will open a window where we will start the creation of the pipeline. Choose New pipeline and it will open the pipeline1 window, where we must provide the following information: input, output, and compute. Add the name ADF-cookbook-pipeline1 and click Validate All:
Figure 1.7: ADF resources
When executing Step 8, you will find that you can't save the pipeline without an activity. For our new data pipeline, we will perform a simple copy data activity: we will copy a file from one blob folder to another. In this chapter, we won't spend time on spinning up resources such as databases, Synapse, or Databricks. Later in this book, you will learn about using ADF with other data platform services. In order to copy data from Blob storage, we should create an Azure storage account and a Blob container:
1. Let's create the Azure storage account. Go to All Services | Storage | Storage Accounts.
2. Click + Add.
3. Use our Free Trial subscription. For the resource group, we will use ADFCookbook. Give the storage account a name, such as adfcookbookstoragev2, then click Review and Create. The name should be unique to you.
4. Click Go to Resource and select Containers on the left sidebar:
Figure 1.8: Azure storage account UI
5. Click + Container and enter the name adfcookbook.
6. Now, we want to upload a data file, SalesOrders.txt. You can get this file from the book's GitHub account at https://github.com/PacktPublishing/Azure-Data-Factory-Cookbook-Second-Edition/Chapter01/. Go to the adfcookbook container and click Upload. We will specify the folder name as input. We just uploaded the file to the cloud! You can find it at the container/folder/file path: adfcookbook/input/SalesOrders.txt.
7. Next, we can go back to ADF. In order to finish the pipeline, we should add an input dataset and create a new linked service.
8. In the ADF studio, click the Manage icon on the left sidebar. This will open the linked services. Click + New and choose Azure Blob Storage, then click Continue.
9. We can optionally change the name or leave it as the default, but we have to specify the subscription in From Azure subscription and choose the subscription and Storage account name that we just created.
10. Click Test connection and, if all is good, click Create.
11. Next, we will add a dataset. Go to our pipeline and click New dataset, as shown in the following screenshot:
Figure 1.9: ADF resources
12. Choose Azure Blob Storage and click Continue. Choose the Binary format type for our text file and click Continue.
13. Now, we can specify the AzureBlobStorage1 linked service, specify the path to the adfcookbook/input/SalesOrders.txt file, and click Create.
14. We can give the dataset a name in Properties. Type in SalesOrdersDataset and click Validate all. We shouldn't encounter any issues with the data.
15. We should add one more dataset as the output for our job. Let's create a new dataset with the name SalesOrdersDatasetOutput and the path adfcookbook/output/SalesOrders.txt.
16. Now, we can go back to our data pipeline. We couldn't save it when we created it without a proper activity; now, we have everything we need to finish the pipeline. Add the new pipeline and give it the name ADF-cookbook-pipeline1. Then, from the activity list, expand Move & transform and drag and drop the Copy Data step onto the canvas.
17. We have to specify the parameters of the step: the source and sink information. Click the Source tab and choose our dataset, SalesOrdersDataset.
18. Click the Sink tab and choose SalesOrdersDatasetOutput. This will be our output folder.
19. Now, we can publish the two datasets and one pipeline. Click Publish All.
20. Then, we can trigger our pipeline manually. Click Add trigger, as shown in the following screenshot:
Figure 1.10: ADF canvas with the Copy Data activity
21. Select Trigger Now. It will launch our job.
22. We can click Monitor on the left sidebar and find the pipeline runs. In the case of failure, we can pick up the logs here and find the root cause. In our case, the ADF-cookbook-pipeline1 pipeline succeeds. To see the outcome, go to Azure Storage and open our container. You will find an additional output folder and a file named SalesOrders.txt there.
We have just created our first job using the UI. Let's learn more about ADF.
Using the ADF UI, we created a new pipeline: an ETL job. We specified input and output datasets and used Azure Blob storage as a linked service. The linked service itself is a kind of connection string. ADF uses the linked service to connect to external resources. On the other hand, we have datasets; they represent the data structure within the data stores. We performed the simple activity of copying data from one folder to another. After the job ran, we reviewed the Monitor section with the job run logs.
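To make the connection-string analogy concrete, the following sketch builds the kind of JSON definition ADF stores for an Azure Blob Storage linked service. This is an illustrative approximation of the shape, not the exact file from this recipe; the account name and key are placeholders:

```python
import json

# A hypothetical Azure Blob Storage linked service definition, shaped like
# the JSON that ADF stores. The connection string values are placeholders;
# in a real factory, the account key is kept secret.
linked_service = {
    "name": "AzureBlobStorage1",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": (
                "DefaultEndpointsProtocol=https;"
                "AccountName=<storage-account-name>;"
                "AccountKey=<storage-account-key>"
            )
        },
    },
}

# ADF consumes these definitions as JSON documents.
print(json.dumps(linked_service, indent=2))
```

Datasets then reference this linked service by name, which is why the connection details live in one place while many datasets can reuse them.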
An ADF pipeline is a set of JSON config files. You can also view the JSON for each pipeline, dataset, and so on in the portal by clicking the three dots in the top-right corner. We are using the UI to create the configuration file and run the job. You can review the JSON config file by clicking on Download support files to download a JSON file, as shown in the following figure:
Figure 1.11: Downloading the pipeline config files
This will save the archive file. Extract it and you will find a folder with the following subfolders:
Dataset
LinkedService
Pipeline
Each folder has a corresponding JSON config file.
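As an illustration of what lives in the Pipeline subfolder, the following sketch builds a trimmed-down config with a single Copy activity referencing the two datasets from this recipe by name. The real exported file contains additional properties, so treat this as an approximation of the shape rather than the exact contents:

```python
import json

# A hypothetical, trimmed-down pipeline config with one Copy activity,
# roughly the shape of the JSON in the Pipeline subfolder. The dataset
# names match the ones created in this recipe.
pipeline = {
    "name": "ADF-cookbook-pipeline1",
    "properties": {
        "activities": [
            {
                "name": "Copy data1",
                "type": "Copy",
                "inputs": [
                    {"referenceName": "SalesOrdersDataset",
                     "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "SalesOrdersDatasetOutput",
                     "type": "DatasetReference"}
                ],
                "typeProperties": {
                    "source": {"type": "BinarySource"},
                    "sink": {"type": "BinarySink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

Notice that the pipeline does not embed connection details or file paths directly: it points at datasets, which in turn point at a linked service, which is what makes the pieces reusable.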
You can find more information about ADF in this Microsoft video, Introduction to Azure Data Factory: https://azure.microsoft.com/en-us/resources/videos/detailed-introduction-to-azure-data-factory/.
We just reviewed how to create the ADF job using the UI. However, we can also use the Copy Data tool (CDT). The CDT allows us to load data into Azure storage faster. We don’t need to set up linked services, pipelines, and datasets as we did in the previous recipe. In other words, depending on your activity, you can use the ADF UI or the CDT. Usually, we will use the CDT for simple load operations, when we have lots of data files and we would like to ingest them into Data Lake as fast as possible.
In this recipe, we will use the CDT in order to do the same task of copying data from one folder to another.
We already created the ADF job with the UI. Let’s review the CDT:
1. In the previous recipe, we created the Azure Blob storage instance and container. We will use the same file and the same container. However, we have to delete the file from the output location.
2. Go to Azure Storage Accounts, choose adfcookbookstoragev2, and click Containers. Choose adfcookbook. Go to the output folder and delete the SalesOrders.txt file.
3. Now, we can go back to the Data Factory Studio. On the home page, we can see the Ingest tile. Click on it. This opens the CDT wizard.
4. Click Built-in copy task. Choose Run once now. Click Next.
5. We should choose the data source, AzureBlobStorage1, and specify the folder and file. You can browse the blob storage and you will find the filename. The path should look like adfcookbook/input/SalesOrders.txt. Mark Binary copy. When we choose the binary option, the file will be treated as binary and the schema won't be enforced. This is a great option for copying the file as is. Click Next.
6. Next, we will choose the destination. Choose AzureBlobStorage2 and click Next. Enter the output path adfcookbook/output and click Next until you reach the end.
7. Give the task the name CDT-copy-job and click Next. As a result, you should get an output similar to mine, as you can see in the following screenshot:
Figure 1.12: CDT UI
8. If we go to the storage account, we will find that the CDT copied the data into the output folder.
We have created a copy job using the CDT.
The CDT basically created the data pipeline for us. If you go to the Author tab in ADF, you will find a new job and new datasets.
You can learn more about the CDT at the Microsoft documentation page: https://docs.microsoft.com/en-us/azure/data-factory/copy-data-tool.
We can use PowerShell, .NET, and Python for ADF deployment and data integration automation. Here is an extract from the Microsoft documentation:
"Azure Automation delivers a cloud-based automation and configuration service that provides consistent management across your Azure and non-Azure environments. It consists of process automation, update management, and configuration features. Azure Automation provides complete control during deployment, operations, and decommissioning of workloads and resources."
In this recipe, we want to cover the Python scenario because Python is one of the most popular languages for analytics and data engineering. We will use Jupyter Notebook with example code.
You can use Jupyter notebooks or Visual Studio Code notebooks.
For this exercise, we will use Python in order to create a data pipeline and copy our file from one folder to another. We need to use the azure-mgmt-datafactory and azure-mgmt-resource Python packages as well as some other libraries that we will cover in the example.
We will create an ADF pipeline using Python. We will start with some preparatory steps:
1. We will start by deleting our file in the output directory. Go to Azure Storage Accounts, choose adfcookbookstoragev2, and click Containers. Choose adfcookbook. Go to the output folder and delete the SalesOrders.txt file.
2. We will install the Azure management resources Python package by running this command from the CLI. In my example, I used Terminal on macOS:
pip install azure-mgmt-resource
3. Next, we will install the ADF Python package by running this command from the CLI:
pip install azure-mgmt-datafactory
4. Also, I installed these packages to run code from Jupyter:
pip install msrestazure
pip install azure.mgmt.datafactory
pip install azure.identity
5. When we finish installing the Python packages, we should use them to create the data pipeline, datasets, and linked service, as well as to run the code. Python gives us flexibility, and we could embed this into our analytics application or into Spark/Databricks.
The code itself is quite big; you can find it in the Git repo for this chapter, in ADF_Python_Run.ipynb.
1. In order to control Azure resources from the Python code, we have to register an app with Azure Active Directory and assign the Contributor role to this app in Identity and Access Management (IAM) under our subscription. We have to get tenant_id, client_id, and client_secret. You can learn more about this process in the official Microsoft documentation: https://learn.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal. We will provide brief steps here.
2. Go to Azure Active Directory and click App registrations. Click + New registration. Enter the name ADFcookbookapp and click Register. From the app properties, copy Application (client) ID and Directory (tenant) ID.
3. Still in ADFcookbookapp, go to Certificates & secrets on the left sidebar. Click + New client secret and add a new client secret. Copy the value.
4. Next, we should give permissions to our app. Go to the subscriptions and choose Free Trial. Click IAM, then Add role assignments. Select the Contributor role under Privileged administrator roles and click Next. Assign access to a user, group, or service principal. Finally, search for our app, ADFcookbookapp, and click Save. As a result, we have granted access to the app and can use these credentials in our Python code. If you don't give permission, you will get the following error message: AuthorizationFailed.
5. Open ADF_Python_Run.ipynb and make sure that you have all the libraries in place by executing the first code block. You can open the file in Jupyter Notebook:
from azure.identity import ClientSecretCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import *
from datetime import datetime, timedelta
import time
You should run this piece without any problems. If you encounter an issue, it means you are missing a Python package. Make sure that you have installed all of the packages.
6. Run section 2 in the notebook. You can find the notebook in the GitHub repository with the book files.
7. In section 3, Authenticate Azure, you have to enter the user_name, subscription_id, tenant_id, client_id, and client_secret values. We can leave the resource group and data factory name as they are. Then, run section 4, Created Data Factory.
8. The Python code will also interact with the Azure storage account, so we should provide the storage account name and key. For this chapter, we are using the adfcookbookstoragev2 storage account; you can find the key under the Access keys section of the storage account menu. Copy the key value, paste it into section 5, Created a Linked Service, and run it.
9. In sections 6 and 7, we create the input and output datasets. You can run the code as is.
10. In section 8, we create the data pipeline and specify the CopyActivity activity.
11. Finally, we run the pipeline in section 9, Create a pipeline run.
12. In section 10, Monitor a pipeline run, we check the output of the run. We should get the following: Pipeline run status: Succeeded.
We just created an ADF job with Python. Let's add more details.
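The monitoring step boils down to polling the run status until it reaches a terminal state. The following is a minimal sketch of that pattern; the function name wait_for_pipeline_run is ours, not part of the SDK, and the only assumption about the client is that it exposes pipeline_runs.get(...) returning an object with a status attribute, as azure-mgmt-datafactory's DataFactoryManagementClient does:

```python
import time

def wait_for_pipeline_run(adf_client, resource_group, factory_name, run_id,
                          poll_seconds=10, timeout_seconds=600):
    """Poll a pipeline run until it reaches a terminal state.

    `adf_client` is assumed to expose `pipeline_runs.get(...)` returning an
    object with a `status` attribute, as azure-mgmt-datafactory's
    DataFactoryManagementClient does.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        run = adf_client.pipeline_runs.get(resource_group, factory_name, run_id)
        # Terminal statuses: stop polling and report the outcome.
        if run.status in ("Succeeded", "Failed", "Cancelled"):
            return run.status
        time.sleep(poll_seconds)
    raise TimeoutError(
        f"Pipeline run {run_id} did not finish within {timeout_seconds}s")
```

A call such as wait_for_pipeline_run(adf_client, "ADFCookbook", "ADFcookbookJob-Dmitry", run_response.run_id) would block until the copy finishes, which is handy when the pipeline run is embedded in a larger script rather than monitored through the portal.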
