MCA Microsoft Certified Associate Azure Data Engineer Study Guide - Benjamin Perkins - E-Book

MCA Microsoft Certified Associate Azure Data Engineer Study Guide E-Book

Benjamin Perkins

Description

Prepare for the Azure Data Engineering certification, and an exciting new career in analytics, with this must-have study aid.

In the MCA Microsoft Certified Associate Azure Data Engineer Study Guide: Exam DP-203, accomplished data engineer and tech educator Benjamin Perkins delivers a hands-on, practical guide to preparing for the challenging Azure Data Engineer certification and for a new career in an exciting and growing field of tech.

In the book, you'll explore all the objectives covered on the DP-203 exam while learning the job roles and responsibilities of a newly minted Azure data engineer. You'll learn to integrate, transform, and consolidate data from various structured and unstructured data systems into a structure suitable for building analytics solutions, and you'll get up to speed quickly and efficiently with Sybex's easy-to-use study aids and tools.

This Study Guide also offers:

* Career-ready advice for anyone hoping to ace their first data engineering job interview and excel on their first day in the field
* Indispensable tips and tricks to familiarize yourself with the DP-203 exam structure and help reduce test anxiety
* Complimentary access to Sybex's expansive online study tools, accessible across multiple devices, and offering access to hundreds of bonus practice questions, electronic flashcards, and a searchable, digital glossary of key terms

A one-of-a-kind study aid designed to help you get straight to the crucial material you need to succeed on the exam and on the job, the MCA Microsoft Certified Associate Azure Data Engineer Study Guide: Exam DP-203 belongs on the bookshelves of anyone hoping to increase their data analytics skills, advance their data engineering career with an in-demand certification, or make a career change into a popular new area of tech.

Page count: 1502

Year of publication: 2023

Table of Contents

Cover

Title Page

Copyright

Acknowledgments

About the Author

About the Technical Editor

Table of Exercises

Introduction

Who This Book Is For

What This Book Covers

How This Book Is Structured

What You Need to Use This Book

Interactive Online Learning Environment and TestBank

DP‐203 Exam Objectives

Reader Support for This Book

Assessment Test

Answers to Assessment Test

PART I: Azure Data Engineer Certification and Azure Products

Chapter 1: Gaining the Azure Data Engineer Associate Certification

The Journey to Certification

How to Pass Exam DP‐203

Azure Product Name Recognition

Azure Data Analytics

Azure Storage Products

Azure Databases

Azure Security

Azure Networking

Azure Compute

Azure Management and Governance

Summary

Exam Essentials

Review Questions

Chapter 2: CREATE DATABASE dbName; GO

The Brainjammer

A Historical Look at Data

Data Structures, Types, and Concepts

Data Programming and Querying for Data Engineers

Understanding Big Data Processing

Summary

Exam Essentials

Review Questions

PART II: Design and Implement Data Storage

Chapter 3: Data Sources and Ingestion

Where Does Data Come From?

Design a Data Storage Structure

Design a Partition Strategy

Design the Serving/Data Exploration Layer

The Ingestion of Data into a Pipeline

Migrating and Moving Data

Summary

Exam Essentials

Review Questions

Chapter 4: The Storage of Data

Implement Physical Data Storage Structures

Implement Logical Data Structures

Implement a Partition Strategy

Design and Implement the Data Exploration Layer

Additional Data Storage Topics

Summary

Exam Essentials

Review Questions

PART III: Develop Data Processing

Chapter 5: Transform, Manage, and Prepare Data

Ingest and Transform Data

Transformation and Data Management Concepts

Data Modeling and Usage

Summary

Exam Essentials

Review Questions

Chapter 6: Create and Manage Batch Processing and Pipelines

Design and Develop a Batch Processing Solution

Manage Batches and Pipelines

Summary

Exam Essentials

Review Questions

Chapter 7: Design and Implement a Data Stream Processing Solution

Develop a Stream Processing Solution

Ingest and Transform Data

Monitor Data Storage and Data Processing

Summary

Exam Essentials

Review Questions

PART IV: Secure, Monitor, and Optimize Data Storage and Data Processing

Chapter 8: Keeping Data Safe and Secure

Design Security for Data Policies and Standards

Implement Data Security

Develop a Batch Processing Solution

Design and Implement the Data Exploration Layer

Summary

Exam Essentials

Review Questions

Chapter 9: Monitoring Azure Data Storage and Processing

Monitoring Data Storage and Data Processing

Develop a Batch Processing Solution

Develop a Stream Processing Solution

Azure Monitoring Overview

Summary

Exam Essentials

Review Questions

Chapter 10: Troubleshoot Data Storage Processing

Optimize and Troubleshoot Data Storage and Data Processing

Design and Develop a Batch Processing Solution

Monitor Batches and Pipelines

Design and Develop a Stream Processing Solution

Summary

Exam Essentials

Review Questions

Appendix: Answers to Review Questions

Chapter 1: Gaining the Azure Data Engineer Associate Certification

Chapter 2: CREATE DATABASE dbName; GO

Chapter 3: Data Sources and Ingestion

Chapter 4: The Storage of Data

Chapter 5: Transform, Manage, and Prepare Data

Chapter 6: Create and Manage Batch Processing and Pipelines

Chapter 7: Design and Implement a Data Stream Processing Solution

Chapter 8: Keeping Data Safe and Secure

Chapter 9: Monitoring Azure Data Storage and Processing

Chapter 10: Troubleshoot Data Storage Processing

Index

End User License Agreement

List of Tables

Chapter 1

TABLE 1.1 Azure certifications

TABLE 1.2 Popular cloud service offerings

TABLE 1.3 Technical terms and definitions

TABLE 1.4 ADLS‐supported platforms

TABLE 1.5 File/folder access permission levels

TABLE 1.6 Azure storage redundancy

TABLE 1.7 Azure Cosmos DB APIs

TABLE 1.8 NSG example

Chapter 2

TABLE 2.1 File comparison

TABLE 2.2 Common data types

TABLE 2.3 Table category distribution matrix

TABLE 2.4 Wildcard location examples

TABLE 2.5 Spark pool magic commands

TABLE 2.6 PySpark vs. Spark

TABLE 2.7 Azure SDKs packages

TABLE 2.8 Aggregate and mathematical functions

TABLE 2.9 JOIN types

TABLE 2.10 Big Data processing stages

TABLE 2.11 Azure products and Big Data stages

Chapter 3

TABLE 3.1 Types and tools for ingestion

TABLE 3.2 Analytical datastores

TABLE 3.3 File type use cases

TABLE 3.4 Data landing zones

TABLE 3.5 Slowly changing dimension types

TABLE 3.6 Hot path serving layer and MPP products

TABLE 3.7 Dedicated vs. serverless SQL pools

TABLE 3.8 Dedicated SQL pool performance level

TABLE 3.9 Spark pool node sizes

TABLE 3.10 Apache Spark components

TABLE 3.11 Data Explorer pool workload size

TABLE 3.12 Integration runtimes core count

TABLE 3.13 Azure Databricks cluster modes

TABLE 3.14 Databricks runtime versions

TABLE 3.15 Azure Databricks worker types

TABLE 3.16 Azure Databricks environments

TABLE 3.17 Azure Databricks job types

TABLE 3.18 Azure Databricks user entitlements

TABLE 3.19 Event Hubs vs. IoT Hub

TABLE 3.20 Azure Stream Analytics built‐in functions

TABLE 3.21 Azure Stream Analytics data types

TABLE 3.22 Apache Kafka vs. Event Hubs terminology

Chapter 4

TABLE 4.1 Supported codecs by file format

TABLE 4.2 Cross‐region replication pairings, paired datacenters

TABLE 4.3 ADLS archiving actions

TABLE 4.4 Data flow schema modifiers

TABLE 4.5 Data flow transformation features

TABLE 4.6 Slowly changing dimension types

TABLE 4.7 External location endpoints and protocols

Chapter 5

TABLE 5.1 Data file split recommendation

TABLE 5.2 Brainjammer brain wave values

Chapter 6

TABLE 6.1 Azure Batch resource components

TABLE 6.2 Azure Storage limits

TABLE 6.3 Exercise 6.6 pipeline parameters

TABLE 6.4 Types of pipeline triggers

TABLE 6.5 Copy Data activity—verification results

TABLE 6.6 Copy Data activity—inconsistent data results

TABLE 6.7 Azure DevOps components

Chapter 7

TABLE 7.1 Streaming product capabilities

TABLE 7.2 Additional streaming product capabilities

TABLE 7.3 Streaming scalability by product

TABLE 7.4 Azure streaming products' pricing units

TABLE 7.5 Azure Event Hubs tiers

TABLE 7.6 Stream Analytics input/output partitioning

TABLE 7.7 Data stream illustration

TABLE 7.8 Azure Stream Analytics exceptions

Chapter 8

TABLE 8.1 Azure data product security support

TABLE 8.2 Azure storage account authorization methods

TABLE 8.3 Managed identity types

Chapter 9

TABLE 9.1 Logging verbosity and severity

TABLE 9.2 Synapse platform system dynamic management views

TABLE 9.3 database_transaction_state column description

TABLE 9.4 DMVs for troubleshooting PolyBase

TABLE 9.5 Azure Synapse Analytics workspace metrics

TABLE 9.6 Dedicated SQL pool metrics

TABLE 9.7 Apache Spark pool metrics

TABLE 9.8 Different types of testing

TABLE 9.9 Azure Stream Analytics metrics

Chapter 10

TABLE 10.1 Performance and troubleshooting antipatterns

TABLE 10.2 Database partition analysis features

TABLE 10.3 Dedicated SQL pool indexes

TABLE 10.4 Index‐related Dynamic Management Views

TABLE 10.5 Query Store stored procedures

TABLE 10.6 Transaction and HTAP dynamic management views

TABLE 10.7 Data Flow Compute size

List of Illustrations

Chapter 1

FIGURE 1.1 Comparing the Azure data scientist and analyst roles

FIGURE 1.2 The Azure data engineer role

FIGURE 1.3 The Azure database administrator associate role

FIGURE 1.4 A path to the Azure Data Engineer Associate certification

FIGURE 1.5 The extract, transform, and load (ETL) approach

FIGURE 1.6 A data streaming pipeline

FIGURE 1.7 Azure portal security product and feature security hierarchy

FIGURE 1.8 An Azure data security diagram with products and features

FIGURE 1.9 Using Azure Key Vault and MI

FIGURE 1.10 Azure privacy and governance products

FIGURE 1.11 Azure health and monitoring products

FIGURE 1.12 Azure Synapse pools, performance, and debugging

FIGURE 1.13 Azure feature updates

FIGURE 1.14 Azure Analytics product documentation

FIGURE 1.15 Azure products in preview

FIGURE 1.16 Azure Synapse Analytics services

FIGURE 1.17 Azure Synapse Analytics Studio

FIGURE 1.18 Azure Databricks workspace

FIGURE 1.19 Azure HDInsight most popular supported open source frameworks

FIGURE 1.20 Azure Analysis Services

FIGURE 1.21 Azure Data Factory Studio

FIGURE 1.22 Azure Stream Analytics data flow

FIGURE 1.23 Azure Cosmos DB and supported APIs

FIGURE 1.24 Azure Active Directory portal

FIGURE 1.25 Role‐based access control scope

FIGURE 1.26 Azure App Service Managed Identity

FIGURE 1.27 Azure Managed Identity in Azure Active Directory

FIGURE 1.28 Azure Managed Identity in Azure Key Vault

FIGURE 1.29 Azure Monitor

FIGURE 1.30 Tags for Azure products

Chapter 2

FIGURE 2.1 Big Data characteristics

FIGURE 2.2 Tables in a relational database

FIGURE 2.3 The Select SQL Deployment Option blade

FIGURE 2.4 Azure Data Studio

FIGURE 2.5 A view of data tables in Azure Data Studio

FIGURE 2.6 Azure Cosmos DB APIs

FIGURE 2.7 Azure Cosmos Data Explorer

FIGURE 2.8 Azure Cosmos Data Explorer SQL query

FIGURE 2.9 Azure Synapse Analytics Sharding example

FIGURE 2.10 Azure Synapse Analytics hash table distribution

FIGURE 2.11 Azure Synapse Analytics replicated table distribution

FIGURE 2.12 Azure Synapse Analytics external tables

FIGURE 2.13 Azure Synapse Analytics external tables example

FIGURE 2.14 ADLS directory hierarchy example

FIGURE 2.15 Schemas, views, and users as seen in SSMS

FIGURE 2.16 A star schema example

FIGURE 2.17 A snowflake schema example

FIGURE 2.18 SQL Query from view table and new schema

FIGURE 2.19 A data skew example

FIGURE 2.20 Data streaming solution using Apache Kafka and Apache Spark

FIGURE 2.21 Visual Studio data workloads

FIGURE 2.22 Visual Studio C# code example

FIGURE 2.23 Where to implement and use an SDK

FIGURE 2.24 Output of running the PDW_SHOWSPACEUSED command

FIGURE 2.25 Output of running the SHOW_STATISTICS command

FIGURE 2.26 Using CONVERT and CAST SQL commands

FIGURE 2.27 SQL query output from FIRST_VALUE and LAST_VALUE

FIGURE 2.28 Representation of SQL JOINs

FIGURE 2.29 OPENJSON query

FIGURE 2.30 Big Data stages and Azure products

FIGURE 2.31 Pipeline data flow Azure Synapse transform stage

Chapter 3

FIGURE 3.1 Data producers and processing services

FIGURE 3.2 An Azure storage account Overview blade

FIGURE 3.3 The Upload folder in Azure Storage Explorer

FIGURE 3.4 Files uploaded to ADLS using Azure Storage Explorer

FIGURE 3.5 Storing Parquet files in an ADLS container

FIGURE 3.6 The brainjammer directory structure

FIGURE 3.7 The brainjammer raw‐files directory

FIGURE 3.8 The brainjammer cleansed‐data directory

FIGURE 3.9 The brainjammer business‐data directory

FIGURE 3.10 Changing the access tier of files in an ADLS container

FIGURE 3.11 Configuring a new dedicated SQL pool

FIGURE 3.12 Partitioning files

FIGURE 3.13 The lambda architecture serving layer

FIGURE 3.14 A relational star schema

FIGURE 3.15 Type 1 SCD

FIGURE 3.16 Type 2 SCD

FIGURE 3.17 Type 3 SCD

FIGURE 3.18 Type 6 SCD

FIGURE 3.19 A dimensional hierarchy

FIGURE 3.20 A temporal table

FIGURE 3.21 A Hive metadata metastore

FIGURE 3.22 A Hive metadata metastore database

FIGURE 3.23 A Hive metadata metastore table

FIGURE 3.24 A Hive metadata metastore spark pool

FIGURE 3.25 Querying metadata in a SQL database

FIGURE 3.26 An overview of Azure Synapse Analytics

FIGURE 3.27 The Azure Synapse Analytics Manage hub

FIGURE 3.28 Creating an Azure Synapse Analytics dedicated SQL pool

FIGURE 3.29 Azure Synapse Analytics Apache Spark pool Basics tab

FIGURE 3.30 Azure Synapse Analytics Apache Spark pool Additional Settings ta...

FIGURE 3.31 Azure Synapse Analytics Data Explorer pool Basics tab

FIGURE 3.32 Azure Synapse Analytics Data Explorer pool Additional Settings t...

FIGURE 3.33 Azure Synapse Analytics External connections Linked services

FIGURE 3.34 Azure Synapse Analytics Linked Azure SQL Database

FIGURE 3.35 Azure Synapse Analytics linked services

FIGURE 3.36 Azure Synapse Analytics integration triggers

FIGURE 3.37 Choosing an Azure Synapse Analytics integration runtime

FIGURE 3.38 Creating an Azure Synapse Analytics integration runtime

FIGURE 3.39 The Access Control page in Azure Synapse Analytics

FIGURE 3.40 Adding a role assignment in Azure Synapse Analytics

FIGURE 3.41 Azure Synapse Analytics private endpoints

FIGURE 3.42 Adding a workspace package in Azure Synapse Analytics

FIGURE 3.43 Consuming a workspace package in Azure Synapse Analytics

FIGURE 3.44 Setting up a code repository in Azure Synapse Analytics

FIGURE 3.45 Azure Synapse Analytics configure GitHub

FIGURE 3.46 Azure Synapse Analytics configure GitHub repository

FIGURE 3.47 Azure Synapse Analytics configure GitHub saved

FIGURE 3.48 Azure Synapse Analytics Data SQL database

FIGURE 3.49 Azure Synapse Analytics Data external data

FIGURE 3.50 Azure Synapse Analytics Data connect Azure Cosmos DB

FIGURE 3.51 How a dataset fits in the data ingestion scheme

FIGURE 3.52 Azure Synapse Analytics data integration dataset formats

FIGURE 3.53 Azure Synapse Analytics data integration dataset linked service...

FIGURE 3.54 Azure Synapse Analytics data integration dataset properties

FIGURE 3.55 The Azure Synapse Analytics Browse Gallery

FIGURE 3.56 Azure Synapse Analytics Integrate Pipeline

FIGURE 3.57 Azure Synapse Analytics Pipeline Copy data Source tab

FIGURE 3.58 Azure Synapse Analytics Pipeline Copy data Sink tab

FIGURE 3.59 Azure Synapse Analytics Pipeline Copy data Mapping tab

FIGURE 3.60 Azure Synapse Analytics Copy Data tool

FIGURE 3.61 Azure Data Factory Manage Git configuration example

FIGURE 3.62 Azure Data Factory Manage global parameters

FIGURE 3.63 Azure Data Factory Author dataset

FIGURE 3.64 The Azure Databricks platform

FIGURE 3.65 An Azure Databricks cluster

FIGURE 3.66 An Azure Databricks notebook

FIGURE 3.67 Azure Databricks Notebook command

FIGURE 3.68 Azure Databricks Settings Admins Console Users

FIGURE 3.69 Azure Databricks Workspace User notebook Permission

FIGURE 3.70 Azure Databricks Workspace User notebook revision history

FIGURE 3.71 Azure Databricks brain wave charting example

FIGURE 3.72 Azure Databricks workspace jobs

FIGURE 3.73 Azure Databricks Repos

FIGURE 3.74 Azure Databricks Jobs

FIGURE 3.75 Azure Databricks data ingestion

FIGURE 3.76 A data lake

FIGURE 3.77 Event Hubs data ingestion

FIGURE 3.78 Provisioning an Azure Stream Analytics job

FIGURE 3.79 An Azure Stream Analytics job query

FIGURE 3.80 An Azure Stream Analytics hopping window

FIGURE 3.81 An Azure Stream Analytics session window

FIGURE 3.82 An Azure Stream Analytics sliding window

FIGURE 3.83 An Azure Stream Analytics snapshot window

FIGURE 3.84 An Azure Stream Analytics tumbling window

FIGURE 3.85 Choosing an Apache Kafka for HDInsight cluster type

FIGURE 3.86 Apache Kafka for HDInsight Kafka nodes

FIGURE 3.87 Azure Monitor and Azure Data Box

Chapter 4

FIGURE 4.1 Azure Synapse Analytics compression

FIGURE 4.2 Azure Synapse Analytics SQL partitioning

FIGURE 4.3 Azure Synapse Analytics Spark partitioning from Storage explorer...

FIGURE 4.4 Azure Synapse Analytics Spark partitioning

FIGURE 4.5 Sharding original table

FIGURE 4.6 Sharding sharded table

FIGURE 4.7 Data redundancy snapshots

FIGURE 4.8 Data redundancy dedicated SQL pool

FIGURE 4.9 Data redundancy restore dedicated SQL pool

FIGURE 4.10 Data redundancy restore ADLS

FIGURE 4.11 Data redundancy restore ADLS container

FIGURE 4.12 Data redundancy ADLS replication options

FIGURE 4.13 Data redundancy backups Azure Synapse Analytics regional redunda...

FIGURE 4.14 Data redundancy Azure Synapse Analytics single redundancy model ...

FIGURE 4.15 Data redundancy storage account for Azure Databricks

FIGURE 4.16 Implementing table distributions

FIGURE 4.17 Implement data archiving access tier

FIGURE 4.18 Implement data archiving lifecycle management

FIGURE 4.19 Azure Synapse Analytics Data hub ADLS directory

FIGURE 4.20 Azure Synapse Analytics Data hub ADLS directory

FIGURE 4.21 Azure Synapse Analytics Develop hub load Notebook

FIGURE 4.22 Azure Synapse Analytics Develop hub write Notebook Parquet files...

FIGURE 4.23 Azure Synapse Analytics data flow

FIGURE 4.24 Azure Synapse Analytics data flow transformations

FIGURE 4.25 Azure Synapse Analytics Develop hub, data flow Source Settings t...

FIGURE 4.26 Azure Synapse Analytics Develop hub, data flow Optimize tab

FIGURE 4.27 Azure Synapse Analytics Develop hub, data flow Derived Column sc...

FIGURE 4.28 Azure Synapse Analytics Develop hub, Visual Expression Builder

FIGURE 4.29 Azure Synapse Analytics Develop hub, Apache Spark job definition...

FIGURE 4.30 Logical data structure

FIGURE 4.31 Finding the history table

FIGURE 4.32 A temporal data solution

FIGURE 4.33 Building an external table

FIGURE 4.34 An efficient file and folder structure

FIGURE 4.35 Serving layer using star schema distribution types

FIGURE 4.36 Serving layer using star schema integration dataset

FIGURE 4.37 Serving layer using star schema tmp integration dataset

FIGURE 4.38 Serving layer using star schema data flow

FIGURE 4.39 Serving layer using star schema pipeline

FIGURE 4.40 Storing raw data in Azure Databricks

FIGURE 4.41 Storing data using Azure HDInsight

FIGURE 4.42 Storing prepared, trained, and modeled data

Chapter 5

FIGURE 5.1 A Big Data pipeline process example

FIGURE 5.2 Azure Synapse Analytics—ingesting Brainjammer brain waves

FIGURE 5.3 Azure Synapse Analytics—transforming Brainjammer brain waves

FIGURE 5.4 Azure Synapse Analytics—monitoring Brainjammer brain wave transfo...

FIGURE 5.5 Azure Synapse Analytics—SQL pool vs. a linked service stored proc...

FIGURE 5.6 Azure Data Factory Synapse—linked service

FIGURE 5.7 Azure Data Factory—Synapse dataset

FIGURE 5.8 Azure Data Factory—Synapse pipeline

FIGURE 5.9 Azure Data Factory Synapse—pipeline transformation

FIGURE 5.10 Transforming data using an Apache Spark Azure Synapse Spark pool...

FIGURE 5.11 Transforming data using Apache Spark Azure Synapse Analytics

FIGURE 5.12 Transforming data using an Apache Spark Azure Databricks workspa...

FIGURE 5.13 Transforming data using Apache Spark Azure Databricks configurat...

FIGURE 5.14 Jupyter notebooks—Azure Databricks Import option

FIGURE 5.15 Jupyter notebooks—Azure Databricks imported

FIGURE 5.16 Transforming data using Apache Spark Jupyter notebooks Azure HDI...

FIGURE 5.17 Transforming data using Apache Spark Jupyter notebooks Azure HDI...

FIGURE 5.18 Transforming data using Apache Spark Jupyter notebooks Azure Dat...

FIGURE 5.19 Transforming data using Apache Spark Jupyter notebooks Azure Dat...

FIGURE 5.20 Transforming data using Azure Stream Analytics JSON

FIGURE 5.21 Splitting the data source—Projection tab

FIGURE 5.22 Splitting the data sink—Optimize tab

FIGURE 5.23 Shredding JSON with Azure Cosmos DB

FIGURE 5.24 Encoding and decoding data Unicode

FIGURE 5.25 Encoding and decoding data—VARCHAR

FIGURE 5.26 Encoding and decoding data—NVARCHAR

FIGURE 5.27 Configuring error handling for the transformation

FIGURE 5.28 Normalizing and denormalizing brainjammer values

FIGURE 5.29 Not normalized brain waves data

FIGURE 5.30 Normalized brain waves data

FIGURE 5.31 Performing exploratory data analysis—visualizing data in Power B...

FIGURE 5.32 Performing exploratory data analysis—visualizing data in Power B...

FIGURE 5.33 Performing exploratory data analysis—Power BI workspace

FIGURE 5.34 Performing exploratory data analysis—visualizing brain waves in ...

FIGURE 5.35 Performing exploratory data analysis—visualizing brain waves alp...

FIGURE 5.36 Performing exploratory data analysis—visualizing brain waves gam...

FIGURE 5.37 The data transformation process

FIGURE 5.38 Transforming and enriching the data pipeline

FIGURE 5.39 Transforming and enriching data—pipeline drop script

FIGURE 5.40 Transforming and enriching data—filter transformation data flow...

FIGURE 5.41 Data management disciplines

FIGURE 5.42 Azure Databricks—configuring a box plot chart

FIGURE 5.43 Azure Databricks—configuring a box plot chart (2)

FIGURE 5.44 Azure Machine Learning—brainjammer contributor access

FIGURE 5.45 Azure Machine Learning—brainjammer table

FIGURE 5.46 Azure Machine Learning—brainjammer job

FIGURE 5.47 Azure Machine Learning—Access Control (IAM)

FIGURE 5.48 Azure Machine Learning—VotingEnsemble algorithm

FIGURE 5.49 Azure Machine Learning—usage prediction with a model workspace

FIGURE 5.50 Azure Machine Learning Usage—predict with a model

Chapter 6

FIGURE 6.1 The role of Azure Batch processing in a data analytics solution

FIGURE 6.2 Azure Batch processing—dependencies

FIGURE 6.3 Azure Batch processing—many‐to‐many dependency

FIGURE 6.4 Batch processing in Big Data architecture

FIGURE 6.5 An Azure Batch workflow

FIGURE 6.6 Azure Batch account and pool configuration

FIGURE 6.7 Azure Batch linked service configuration

FIGURE 6.8 Azure Batch Custom pipeline activity

FIGURE 6.9 Azure Batch task details

FIGURE 6.10 Azure Batch task output

FIGURE 6.11 Azure Batch—Azure Synapse Analytics batch service pipeline

FIGURE 6.12 Azure Synapse Analytics—Apache Spark job definition

FIGURE 6.13 Azure Synapse Analytics—Apache Spark job scenario result

FIGURE 6.14 Azure Synapse Analytics—Apache Spark job diagnostics and Monitor...

FIGURE 6.15 Azure Batch—Azure Synapse Analytics batch service pipeline with ...

FIGURE 6.16 Azure Databricks batch job

FIGURE 6.17 Linked service configuration for the Azure Databricks batch job...

FIGURE 6.18 Azure Databricks batch job pipeline configuration

FIGURE 6.19 Azure Databricks batch job pipeline status

FIGURE 6.20 Azure Batch custom pipeline activity Azure Data Factory

FIGURE 6.21 Azure Batch Explorer

FIGURE 6.22 Azure HDInsight batch processing

FIGURE 6.23 Azure Synapse Analytics parameters

FIGURE 6.24 Azure Synapse Analytics parameters as command arguments

FIGURE 6.25 Azure Synapse Analytics parameter input and output run details

FIGURE 6.26 Azure Synapse Analytics passing parameter between pipeline activ...

FIGURE 6.27 Azure Synapse Analytics pipeline activity with no dependencies

FIGURE 6.28 Azure Synapse Analytics pipeline activity with no dependencies (...

FIGURE 6.29 Azure Synapse Analytics pipeline variables

FIGURE 6.30 Azure Synapse Analytics pipeline dynamic arguments

FIGURE 6.31 Running an Azure Synapse Analytics pipeline

FIGURE 6.32 Azure Synapse Analytics New/Edit trigger

FIGURE 6.33 Azure Synapse Analytics daily scheduled trigger

FIGURE 6.34 Azure Synapse Analytics weekly scheduled trigger

FIGURE 6.35 Azure Synapse Analytics monthly scheduled trigger

FIGURE 6.36 Azure Synapse Analytics many‐to‐many scheduled trigger

FIGURE 6.37 Azure Synapse Analytics tumbling window trigger

FIGURE 6.38 Azure Synapse Analytics storage event trigger

FIGURE 6.39 Azure Synapse Analytics custom event notification flow

FIGURE 6.40 Azure Synapse Analytics custom event trigger

FIGURE 6.41 Azure Databricks scheduled trigger

FIGURE 6.42 Azure Databricks scheduled trigger log

FIGURE 6.43 Handle duplicate data—data flow source

FIGURE 6.44 Handle duplicate data—data flow aggregate group by

FIGURE 6.45 Handle duplicate data—data flow aggregate

FIGURE 6.46 Handle duplicate data—data flow select

FIGURE 6.47 Handle duplicate data—data flow sink

FIGURE 6.48 Upsert data, batching flow diagram

FIGURE 6.49 Upsert data—update methods

FIGURE 6.50 Upsert data—sink data preview

FIGURE 6.51 Upsert data—Delete If

FIGURE 6.52 Upsert data—MD5 Derived Column row hash

FIGURE 6.53 Upsert data—MD5 Exists row hash

FIGURE 6.54 Configure batch retention

FIGURE 6.55 Incremental data loads—Get Metadata activity

FIGURE 6.56 Incremental data loads—ForEach activity

FIGURE 6.57 Incremental data loads—Execute Pipeline activity

FIGURE 6.58 Incremental data loads—pipeline output

FIGURE 6.59 Incremental data loads—ForEach activity settings

FIGURE 6.60 Incremental data loads—Copy Data activity

FIGURE 6.61 Managing batches and pipelines’ triggers

FIGURE 6.62 Validate batch loads with the Copy Data activity

FIGURE 6.63 Validate batch loads with the Validation activity

FIGURE 6.64 Validate batch loads with Validation activity failure

FIGURE 6.65 Validate batch loads with the Lookup activity

FIGURE 6.66 Implementing version control for pipeline artifacts

FIGURE 6.67 Implementing version control for pipeline artifacts, Azure DevOp...

FIGURE 6.68 Manage data pipeline annotations

FIGURE 6.69 Managing data pipeline annotations using Azure PowerShell

FIGURE 6.70 Managing data pipeline annotations using Azure PowerShell data f...

FIGURE 6.71 Managing data pipeline annotations using Azure PowerShell Spark ...

FIGURE 6.72 Handling failed batch loads

Chapter 7

FIGURE 7.1 Azure stream processing

FIGURE 7.2 Azure real‐time stream processing

FIGURE 7.3 Azure near real‐time stream processing

FIGURE 7.4 Input interoperability in Azure products

FIGURE 7.5 Sink interoperability in Azure products

FIGURE 7.6 Azure Stream Analytics scaling

FIGURE 7.7 Lambda architecture speed layer, near real‐time processing

FIGURE 7.8 Azure Stream Analytics ADLS output

FIGURE 7.9 Azure Stream Analytics ADLS container path and file

FIGURE 7.10 Test sample data upload in Azure Stream Analytics

FIGURE 7.11 The result of test data uploaded in Azure Stream Analytics

FIGURE 7.12 Develop a stream processing solution Azure Stream Analytics outp...

FIGURE 7.13 Develop a stream processing solution Azure Stream Analytics simu...

FIGURE 7.14 Develop a stream processing solution Azure Stream Analytics sent...

FIGURE 7.15 Develop a stream processing solution Azure Stream Analytics sent...

FIGURE 7.16 Configure reference data for Azure Stream Analytics use.

FIGURE 7.17 Use reference data with Azure Stream Analytics.

FIGURE 7.18 Use reference data with Azure Stream Analytics.

FIGURE 7.19 Power BI Azure Stream Analytics output configuration

FIGURE 7.20 The brainjammer streaming dataset in Power BI

FIGURE 7.21 Adding a real‐time data tile to the Power BI dashboard

FIGURE 7.22 Configuring a real‐time data tile to the Power BI dashboard

FIGURE 7.23 Viewing a real‐time data tile to the Power BI dashboard

FIGURE 7.24 Azure Databricks stream processing

FIGURE 7.25 Azure Databricks Spark Structured Streaming

FIGURE 7.26 Installing the Event Hubs library on an Azure Databricks cluster...

FIGURE 7.27 The installed Event Hubs library on an Azure Databricks cluster...

FIGURE 7.28 Streamed Event Hubs messages displayed in the Azure Databricks n...

FIGURE 7.29 A brain wave time series chart

FIGURE 7.30 Windowed aggregates output

FIGURE 7.31 Partition key mapping to Azure Stream Analytics partitions

FIGURE 7.32 The Azure Stream Analytics Compatibility Level blade

FIGURE 7.33 Upsert on streamed data using an Azure function, connection stri...

FIGURE 7.34 Upserting streamed data on Azure Cosmos DB—configuring output

FIGURE 7.35 Streaming data into Azure Cosmos DB using the command console

FIGURE 7.36 Inserting streamed data on Azure Cosmos DB, initial load

FIGURE 7.37 Handling schema drift in a stream processing solution

FIGURE 7.38 Handling schema drift in a stream processing solution in ADLS

FIGURE 7.39 Handling schema drift in a stream processing solution in Azure C...

FIGURE 7.40 A data stream with event messages and a watermark

FIGURE 7.41 Watermark progression example

FIGURE 7.42 The EventEnqueuedUtcTime and EventProcessedUtcTime columns on th...

FIGURE 7.43 Event ordering for a late‐arriving streamed event message

FIGURE 7.44 Azure Stream Analytics monitoring metrics

FIGURE 7.45 An archived data stream solution

FIGURE 7.46 Configuring an archive input alias

FIGURE 7.47 Archive replay data result

FIGURE 7.48 Azure Stream Analytics job metrics, CPU at 99 percent utilizatio...

FIGURE 7.49 Azure Stream Analytics job scaling

FIGURE 7.50 Azure Stream Analytics Diagnostics Setting

FIGURE 7.51 Azure Stream Analytics Activity log warnings and errors

Chapter 8

FIGURE 8.1 Layered security

FIGURE 8.2 Creating an Azure Key Vault key

FIGURE 8.3 Creating an Azure Key Vault secret

FIGURE 8.4 Creating an Azure Key Vault certificate

FIGURE 8.5 Vault access policy operations

FIGURE 8.6 Azure Key Vault x509 certificate details

FIGURE 8.7 Microsoft Purview default root collection

FIGURE 8.8 Microsoft Purview Map view

FIGURE 8.9 The Azure Policy Overview blade

FIGURE 8.10 Data Discovery & Classification

FIGURE 8.11 Data Discovery & Classification, Add Classification window

FIGURE 8.12 Azure storage account encryption type

FIGURE 8.13 Dynamic Data Masking dedicated SQL pool

FIGURE 8.14 ADLS access control access keys

FIGURE 8.15 ADLS Access control shared access signature

FIGURE 8.16 RBAC and ACL permission evaluation

FIGURE 8.17 RBAC Access Control (IAM) Azure storage account

FIGURE 8.18 RBAC role and ACL permission evaluation

FIGURE 8.19 The Manage ACL blade

FIGURE 8.20 Connecting Microsoft Purview to Azure Synapse Analytics workspac...

FIGURE 8.21 Configuring scanning in Microsoft Purview

FIGURE 8.22 The result of a Microsoft Purview scan

FIGURE 8.23 Dedicated SQL pool auditing configuration

FIGURE 8.24 Dedicated SQL pool Diagnostic setting configuration

FIGURE 8.25 View dedicated SQL pool audit logs in Log Analytics.

FIGURE 8.26 Scanning a dedicated SQL pool with Microsoft Purview

FIGURE 8.27 Microsoft Purview Data estate insights schema data classificatio...

FIGURE 8.28 SQL Information Protection policy classification recommendations...

FIGURE 8.29 Data Discovery & Classification, Add classification 2

FIGURE 8.30 Data Discovery & Classification overview

FIGURE 8.31 Protecting sensitive data in files

FIGURE 8.32 Implement a data retention policy in Azure Synapse Analytics.

FIGURE 8.33 Implement a data retention policy schedule pipeline trigger.

FIGURE 8.34 Encrypt data at rest, TDE, dedicated SQL pool.

FIGURE 8.35 Row‐level security

FIGURE 8.36 Column‐level security

FIGURE 8.37 Column‐level security enforcement exception

FIGURE 8.38 Role and membership details on a dedicated SQL pool database

FIGURE 8.39 Implement data masking and masking rule.

FIGURE 8.40 Creating and applying a user‐assigned managed identity

FIGURE 8.41 Creating a shared, credential passthrough spark cluster

FIGURE 8.42 Adding a user to an Azure Databricks workspace using RBAC

FIGURE 8.43 Access key in Key Vault from a blob linked service failure

FIGURE 8.44 Access key from Key Vault to blob linked service failure

FIGURE 8.45 Enabling Microsoft Defender for Storage

FIGURE 8.46 Azure Active Directory created group

FIGURE 8.47 Add role assignment access control Synapse Contributor.

FIGURE 8.48 Add role assignment access control Synapse Contributor Parquet f...

FIGURE 8.49 Managing ACLs for an ADLS folder

FIGURE 8.50 Adding an ACL to allow write access

FIGURE 8.51 Adding an Azure storage account with an ADLS container to a VNet...

FIGURE 8.52 The Azure storage account VNet configuration

FIGURE 8.53 Azure storage account private endpoint configuration

FIGURE 8.54 Network security group rules

FIGURE 8.55 Generating an access token

FIGURE 8.56 Azure Batch networking restrictions

FIGURE 8.57 Configuring a custom Azure Batch RBAC role using the Azure porta...

FIGURE 8.58 Browsing assets in the data catalog

FIGURE 8.59 Browsing assets based on source type

FIGURE 8.60 Browsing assets based on source type

FIGURE 8.61 Viewing Microsoft Purview data lineage

Chapter 9

FIGURE 9.1 The Azure Synapse Analytics Logs blade

FIGURE 9.2 Azure Synapse Analytics dedicated SQL pool metrics

FIGURE 9.3 Azure Event Hub diagnostic settings

FIGURE 9.4 Creating an Azure Synapse Analytics alert condition

FIGURE 9.5 The Azure Synapse Analytics Alerts blade

FIGURE 9.6 The Azure Monitor activity log

FIGURE 9.7 The Azure Storage Account Insights Overview tab

FIGURE 9.8 A summary of the Azure Storage Account Insights blade

FIGURE 9.9 Azure storage account Workbooks

FIGURE 9.10 The Azure Synapse Analytics Monitor hub

FIGURE 9.11 Azure Synapse Analytics integration runtimes

FIGURE 9.12 An Azure Synapse Analytics sample pipeline to generate monitor l...

FIGURE 9.13 Azure Synapse Analytics pipeline runs filtered by annotations

FIGURE 9.14 Azure Synapse Analytics activity runs

FIGURE 9.15 Azure Synapse Analytics data flow modifiers

FIGURE 9.16 Azure Synapse Analytics Apache Spark applications

FIGURE 9.17 Azure Synapse Analytics dedicated SQL pool metrics

FIGURE 9.18 Azure Stream Analytics job diagram

FIGURE 9.19 The Azure Stream Analytics Metrics hub

FIGURE 9.20 Azure Databricks Apache Spark cluster logging

FIGURE 9.21 Azure Databricks cluster metrics

FIGURE 9.22 Monitoring data pipeline performance Gantt chart

FIGURE 9.23 Monitoring and update statistics execution plan

FIGURE 9.24 Monitoring and update statistics view statistics

FIGURE 9.25 Apache Spark application details

FIGURE 9.26 DAG Visualization

FIGURE 9.27 Azure DevOps Azure Test Plans New Test Plan

FIGURE 9.28 Azure DevOps Azure Test Plans New Test Cases

FIGURE 9.29 Azure DevOps Azure Test Plans Execute Test Cases

FIGURE 9.30 Azure Stream Analytics Tools extension in the Visual Studio Code...

FIGURE 9.31 An Azure Stream Analytics job query in the Visual Studio Code bl...

FIGURE 9.32 The Azure Batch Metrics blade

FIGURE 9.33 The Azure Key Vault Metrics blade

FIGURE 9.34 The Azure SQL Metrics blade

Chapter 10

FIGURE 10.1 Azure Advisor score

FIGURE 10.2 Azure Cost Management

FIGURE 10.3 Compacting small files—Source and Sink tabs

FIGURE 10.4 Handling data spill memory capacity

FIGURE 10.5 Finding shuffling in a pipeline—explain plan with shuffle cost

FIGURE 10.6 Tuning queries by using indexer's indexes

FIGURE 10.7 Tuning queries with the Top Resource Consuming Queries report

FIGURE 10.8 Tuning queries with a nonclustered index

FIGURE 10.9 Optimizing pipelines for analytics or transactional purposes

FIGURE 10.10 Optimizing pipelines for analytics or transactional purposes: d...

FIGURE 10.11 Optimizing pipelines for analytics or transactional purposes: d...

FIGURE 10.12 Optimizing pipelines for analytics or transactional purposes: d...

FIGURE 10.13 Optimizing pipelines for analytics or transactional purposes: d...

FIGURE 10.14 Optimizing pipelines for analytics or transactional purposes: d...

FIGURE 10.15 Troubleshooting a failed Spark job: stderr

FIGURE 10.16 Troubleshooting a failed Spark job: scaling Apache Spark workfl...

FIGURE 10.17 Troubleshooting a failed Spark job: scaling Apache Spark pool j...

FIGURE 10.18 Troubleshooting a failed pipeline run

FIGURE 10.19 Troubleshooting a failed pipeline run: scaling a dedicated SQL ...

FIGURE 10.20 Troubleshooting a failed pipeline run: scaling an Apache Spark ...

FIGURE 10.21 Troubleshooting a failed pipeline run: enabling Data Flow Debug...

FIGURE 10.22 Troubleshooting a failed pipeline run: debug settings

FIGURE 10.23 Troubleshooting a failed pipeline run: breakpoints

FIGURE 10.24 Troubleshooting a failed pipeline run: dependency conditions

FIGURE 10.25 Troubleshooting a failed pipeline run: retries

FIGURE 10.26 Troubleshooting a failed pipeline run: reruns

FIGURE 10.27 Rewriting Azure Stream Analytics user‐defined functions

FIGURE 10.28 Scaling resources: Azure Batch pool

FIGURE 10.29 Handling interruptions: dedicated Azure Stream Analytics cluste...

FIGURE 10.30 Scaling resources: custom autoscale rule

MCA Microsoft Certified Associate Azure® Data Engineer Study Guide

Exam DP-203

Benjamin Perkins


Copyright © 2023 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada and the United Kingdom.

ISBNs: 9781119885429 (paperback), 9781119885443 (ePDF), 9781119885436 (ePub)

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at www.wiley.com/go/permission.

Trademarks: WILEY, the Wiley logo, and the Sybex logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Microsoft and Azure are registered trademarks of Microsoft Corporation. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Control Number: 2023941199

Cover image: © Jeremy Woodhouse/Getty Images

Cover design: Wiley

Acknowledgments

Creating a book starts first as an idea, which then iterates through many versions, until it takes the form of something consumable. Many people helped to progress this book from idea to final product. Here is a list of those who played a significant role in the creation of this book and the organization of its content:

Ken Brown, senior acquisitions editor

Robyn Alvarez, project manager

Heini Ilmarinen, technical editor

John Sleeva, copyeditor

Nancy Carrasco, proofreader

Writing this book—and writing in general—has become something I enjoy. Writing gives me the opportunity to share some of my technical knowledge and experiences so that others can gain some knowledge and insights. In addition to sharing my words, I gain an even greater understanding of the topic, as I structure the content, conduct research, and create hands‐on exercises. Writing a book requires a huge effort, but there are many reasons to do it. I'd like to thank my family for their support while I was writing this book. I know it took hours away from them. Thanks, Andrea, Lea, and Noa. You are the reason and my purpose.

About the Author

Benjamin Perkins is currently employed at Microsoft in Munich, Germany, as a Senior Escalation Engineer on the Azure team. He has been working professionally in the IT industry for close to three decades. He started computer programming with QBasic at the age of 11 on an Atari 1200XL desktop computer. He takes pleasure in the challenges that troubleshooting technical issues has to offer and savors the rewards of a well‐written program. After completing high school, he joined the United States Army. After successfully completing his military service, he attended Texas A&M University in College Station, Texas, where he received a Bachelor of Business Administration in Management Information Systems. He also received a Master of Business Administration from the European University.

His roles in the IT industry have spanned the entire spectrum, including programmer, system architect, technical support engineer, team leader, and mid‐level manager. While employed at Hewlett‐Packard and Compaq Computer Corporation, he received numerous awards, degrees, and certifications. He has a passion for technology and customer service and looks forward to troubleshooting and writing more world‐class technical solutions: “My approach is to write code with support in mind, and to write it once correctly and completely so we do not have to come back to it again, except to enhance it.”

Benjamin has written numerous magazine articles and training courses and is an active blogger. His catalog of books covers C# programming, IIS, NHibernate, and Microsoft Azure.

Benjamin is married to Andrea and has two wonderful children, Lea and Noa.

About the Technical Editor

Heini Ilmarinen is a data enthusiast with a passion for architecture and DevOps. Heini currently works as Azure Lead and DevOps Consultant at Polar Squad, helping customers bring their data platforms to life in Azure.

Heini initially studied to become a mathematics teacher, graduating from Helsinki University with a Master of Science. After graduating, she transitioned to the IT industry, leveraging her skills for problem‐solving and making complex topics easy to understand. In IT, Heini started her career working in infrastructure architecture development projects in hybrid environments. With architecture as a starting point, her career developed from working with Azure to getting deeper into data projects to topics related to DevOps.

Over the years, Heini has worked in a multitude of Azure projects, from application development to data projects, gaining a broad understanding of the requirements for creating functional, production‐ready solutions. For the past two years, she has also engaged in community events and public speaking, gaining the Data Platform MVP award.

Heini can often be found riding her snowboard and enjoying the fresh air, or riding up and down hills on her mountain bike.

Table of Exercises

Exercise 2.1

   Create an Azure SQL DB

Exercise 2.2

   Create an Azure Cosmos DB

Exercise 2.3

   Create a Schema and a View in Azure SQL

Exercise 3.1

   Create an Azure Data Lake Storage Container

Exercise 3.2

   Upload Data to an ADLS Container

Exercise 3.3

   Create an Azure Synapse Analytics Workspace

Exercise 3.4

   Create an Azure Synapse Analytics Linked Service

Exercise 3.5

   Configure an Azure Synapse Analytics Workspace Package

Exercise 3.6

   Configure an Azure Synapse Analytics Workspace with GitHub

Exercise 3.7

   Configure Azure Synapse Analytics Data Hub SQL Pool Staging Tables

Exercise 3.8

   Configure Azure Synapse Analytics Data Hub with Azure Cosmos DB

Exercise 3.9

   Configure an Azure Synapse Analytics Integrated Dataset

Exercise 3.10

   Create an Azure Data Factory

Exercise 3.11

   Create a Linked Service in Azure Data Factory

Exercise 3.12

   Create a Dataset in Azure Data Factory

Exercise 3.13

   Create a Pipeline to Convert XLSX to Parquet

Exercise 3.14

   Create an Azure Databricks Workspace with an External Hive Metastore

Exercise 3.15

   Configure Delta Lake

Exercise 3.16

   Create an Azure Event Namespace and Hub

Exercise 3.17

   Create an Azure Stream Analytics Job

Exercise 4.1

   Implement Compression

Exercise 4.2

   Implement Partitioning

Exercise 4.3

   Implement Data Redundancy

Exercise 4.4

   Implement Distributions

Exercise 4.5

   Implement Data Archiving

Exercise 4.6

   Azure Synapse Analytics Data Hub SQL Script

Exercise 4.7

   Azure Synapse Analytics Develop Hub Notebook

Exercise 4.8

   Azure Synapse Analytics Develop Hub Data Flow

Exercise 4.9

   Build a Temporal Data Solution

Exercise 4.10

   Azure Synapse Analytics Data Hub Data Flow

Exercise 4.11

   Build External Tables on a Serverless SQL Pool

Exercise 4.12

   Implement Efficient File and Folder Structures

Exercise 4.13

   Implement a Serving Layer with a Star Schema

Exercise 4.14

   Implement a Dimensional Hierarchy

Exercise 5.1

   Transform Data Using Azure Synapse Pipeline

Exercise 5.2

   Transform Data Using Azure Data Factory

Exercise 5.3

   Transform Data Using Apache Spark—Azure Synapse Analytics

Exercise 5.4

   Transform Data Using Apache Spark—Azure Databricks

Exercise 5.5

   Cleanse Data

Exercise 5.6

   Split Data

Exercise 5.7

   Azure Cosmos DB—Shred JSON

Exercise 5.8

   Flatten, Explode, and Shred JSON

Exercise 5.9

   Encode and Decode Data

Exercise 5.10

   Normalize and Denormalize Values

Exercise 5.11

   Perform Exploratory Data Analysis—Transform

Exercise 5.12

   Perform Exploratory Data Analysis—Visualize

Exercise 5.13

   Transform and Enrich Data

Exercise 5.14

   Transform Data by Using Apache Spark—Azure Databricks

Exercise 5.15

   Predict Data Using Azure Machine Learning

Exercise 6.1

   Create an Azure Batch Account and Pool

Exercise 6.2

   Develop a Batch Processing Solution Using an Azure Synapse Analytics Pipeline

Exercise 6.3

   Develop a Batch Processing Solution Using an Azure Synapse Analytics Apache Spark

Exercise 6.4

   Develop a Batch Processing Solution Using Azure Databricks

Exercise 6.5

   Develop a Batch Processing Solution Using an Azure Data Factory Pipeline

Exercise 6.6

   Create Data Pipelines—Advanced

Exercise 6.7

   Create a Scheduled Trigger

Exercise 6.8

   Create and Schedule an Azure Databricks Workflow Job

Exercise 6.9

   Handle Duplicate Data with a Data Flow

Exercise 6.10

   Upsert Data

Exercise 6.11

   Implement Incremental Data Loads

Exercise 6.12

   Validate Batch Loads by Using a Validation Activity

Exercise 6.13

   Validate Batch Loads by Using a Lookup Activity

Exercise 7.1

   Add an Output ADLS Container to an Azure Stream Analytics Job

Exercise 7.2

   Develop a Stream Processing Solution with Azure Stream Analytics—Testing the Data

Exercise 7.3

   Develop a Stream Processing Solution with Azure Stream Analytics

Exercise 7.4

   Use Reference Data with Azure Stream Analytics

Exercise 7.5

   Stream Data to Power BI from Azure Stream Analytics

Exercise 7.6

   Stream Data with Azure Databricks

Exercise 7.7

   Develop and Create Windowed Aggregates

Exercise 7.8

   Upsert Stream Processed Data in Azure Cosmos DB

Exercise 7.9

   Handle Schema Drift in Azure Stream Analytics

Exercise 7.10

   Replay an Archived Stream Data in Azure Stream Analytics

Exercise 8.1

   Create an Azure Key Vault Resource

Exercise 8.2

   Create a Microsoft Purview Account

Exercise 8.3

   Configure and Perform a Data Asset Scan Using Microsoft Purview

Exercise 8.4

   Audit an Azure Synapse Analytics Dedicated SQL Pool

Exercise 8.5

   Apply Sensitivity Labels and Data Classifications Using Microsoft Purview and Data Discovery

Exercise 8.6

   Implement a Data Retention Policy

Exercise 8.7

   Implement Column-Level Security

Exercise 8.8

   Implement Data Masking

Exercise 8.9

   Create a User-Assigned Managed Identity

Exercise 8.10

   Connect to an ADLS Container from Azure Databricks Cluster Using ABFSS

Exercise 8.11

   Use an Azure Key Vault Secret to Store an Authentication Key for a Linked Service

Exercise 8.12

   Implement Azure RBAC for ADLS

Exercise 8.13

   Implement POSIX-Like ACLs for ADLS

Exercise 8.14

   Create an Azure Storage Account and ADLS Container with a VNet

Exercise 8.15

   Create an Azure Synapse Analytics Workspace with a VNet

Exercise 9.1

   Create an Azure Monitor Workspace

Exercise 9.2

   Create an Azure Synapse Analytics Alert

Exercise 9.3

   Monitor and Manage Azure Synapse Analytics Logs

Exercise 10.1

   Compact Small Files

Introduction

A long time ago, I was sitting at my desk happily coding my Active Server Page (ASP) and COM component when someone approached me and asked if I knew anything about databases. Without even a pause, I answered with a confident yes; most people in IT know "something" about databases, right? Well, it turned out that a big project was starting, and they needed someone to create and manage a database. I acquired a server, installed a relational database management system (RDBMS), and executed CREATE DATABASE dbName; GO. And the rest is history. I like to call that out because these days, most of the data storage architecture already exists when you start the job. You must learn what someone else created. You experience problems but do not know why, because a lot happened before you started.

Big data, an emerging technology, is providing a rare opportunity, much like the one I had: to build, or be involved in creating, an IT data analytics solution from the beginning. Being the person or the team who builds the framework and foundation of what could become a system that shapes the future of a company is career‐altering. The experience is a differentiator that stays with you for the rest of your career, as it has with mine. But it could also be a catastrophe for numerous reasons, such as a solution that cannot scale, is too hard to change, or is not reliable.