46,99 €
Prepare for the Azure Data Engineering certification, and an exciting new career in analytics, with this must-have study aid.

In the MCA Microsoft Certified Associate Azure Data Engineer Study Guide: Exam DP-203, accomplished data engineer and tech educator Benjamin Perkins delivers a hands-on, practical guide to preparing for the challenging Azure Data Engineer certification and for a new career in an exciting and growing field of tech. In the book, you'll explore all the objectives covered on the DP-203 exam while learning the job roles and responsibilities of a newly minted Azure data engineer. You'll learn to integrate, transform, and consolidate data from various structured and unstructured data systems into a structure that is suitable for building analytics solutions, and you'll get up to speed quickly and efficiently with Sybex's easy-to-use study aids and tools.

This Study Guide also offers:
* Career-ready advice for anyone hoping to ace their first data engineering job interview and excel on their first day in the field
* Indispensable tips and tricks to familiarize yourself with the DP-203 exam structure and help reduce test anxiety
* Complimentary access to Sybex's expansive online study tools, accessible across multiple devices, and offering access to hundreds of bonus practice questions, electronic flashcards, and a searchable, digital glossary of key terms

A one-of-a-kind study aid designed to help you get straight to the crucial material you need to succeed on the exam and on the job, the MCA Microsoft Certified Associate Azure Data Engineer Study Guide: Exam DP-203 belongs on the bookshelves of anyone hoping to increase their data analytics skills, advance their data engineering career with an in-demand certification, or make a career change into a popular new area of tech.
Page count: 1502
Year of publication: 2023
Cover
Title Page
Copyright
Acknowledgments
About the Author
About the Technical Editor
Table of Exercises
Introduction
Who This Book Is For
What This Book Covers
How This Book Is Structured
What You Need to Use This Book
Interactive Online Learning Environment and TestBank
DP‐203 Exam Objectives
Reader Support for This Book
Assessment Test
Answers to Assessment Test
PART I: Azure Data Engineer Certification and Azure Products
Chapter 1: Gaining the Azure Data Engineer Associate Certification
The Journey to Certification
How to Pass Exam DP‐203
Azure Product Name Recognition
Azure Data Analytics
Azure Storage Products
Azure Databases
Azure Security
Azure Networking
Azure Compute
Azure Management and Governance
Summary
Exam Essentials
Review Questions
Chapter 2: CREATE DATABASE dbName; GO
The Brainjammer
A Historical Look at Data
Data Structures, Types, and Concepts
Data Programming and Querying for Data Engineers
Understanding Big Data Processing
Summary
Exam Essentials
Review Questions
PART II: Design and Implement Data Storage
Chapter 3: Data Sources and Ingestion
Where Does Data Come From?
Design a Data Storage Structure
Design a Partition Strategy
Design the Serving/Data Exploration Layer
The Ingestion of Data into a Pipeline
Migrating and Moving Data
Summary
Exam Essentials
Review Questions
Chapter 4: The Storage of Data
Implement Physical Data Storage Structures
Implement Logical Data Structures
Implement a Partition Strategy
Design and Implement the Data Exploration Layer
Additional Data Storage Topics
Summary
Exam Essentials
Review Questions
PART III: Develop Data Processing
Chapter 5: Transform, Manage, and Prepare Data
Ingest and Transform Data
Transformation and Data Management Concepts
Data Modeling and Usage
Summary
Exam Essentials
Review Questions
Chapter 6: Create and Manage Batch Processing and Pipelines
Design and Develop a Batch Processing Solution
Manage Batches and Pipelines
Summary
Exam Essentials
Review Questions
Chapter 7: Design and Implement a Data Stream Processing Solution
Develop a Stream Processing Solution
Ingest and Transform Data
Monitor Data Storage and Data Processing
Summary
Exam Essentials
Review Questions
PART IV: Secure, Monitor, and Optimize Data Storage and Data Processing
Chapter 8: Keeping Data Safe and Secure
Design Security for Data Policies and Standards
Implement Data Security
Develop a Batch Processing Solution
Design and Implement the Data Exploration Layer
Summary
Exam Essentials
Review Questions
Chapter 9: Monitoring Azure Data Storage and Processing
Monitoring Data Storage and Data Processing
Develop a Batch Processing Solution
Develop a Stream Processing Solution
Azure Monitoring Overview
Summary
Exam Essentials
Review Questions
Chapter 10: Troubleshoot Data Storage Processing
Optimize and Troubleshoot Data Storage and Data Processing
Design and Develop a Batch Processing Solution
Monitor Batches and Pipelines
Design and Develop a Stream Processing Solution
Summary
Exam Essentials
Review Questions
Appendix: Answers to Review Questions
Chapter 1: Gaining the Azure Data Engineer Associate Certification
Chapter 2: CREATE DATABASE dbName; GO
Chapter 3: Data Sources and Ingestion
Chapter 4: The Storage of Data
Chapter 5: Transform, Manage, and Prepare Data
Chapter 6: Create and Manage Batch Processing and Pipelines
Chapter 7: Design and Implement a Data Stream Processing Solution
Chapter 8: Keeping Data Safe and Secure
Chapter 9: Monitoring Azure Data Storage and Processing
Chapter 10: Troubleshoot Data Storage Processing
Index
End User License Agreement
Chapter 1
TABLE 1.1 Azure certifications
TABLE 1.2 Popular cloud service offerings
TABLE 1.3 Technical terms and definitions
TABLE 1.4 ADLS‐supported platforms
TABLE 1.5 File/folder access permission levels
TABLE 1.6 Azure storage redundancy
TABLE 1.7 Azure Cosmos DB APIs
TABLE 1.8 NSG example
Chapter 2
TABLE 2.1 File comparison
TABLE 2.2 Common data types
TABLE 2.3 Table category distribution matrix
TABLE 2.4 Wildcard location examples
TABLE 2.5 Spark pool magic commands
TABLE 2.6 PySpark vs. Spark
TABLE 2.7 Azure SDKs packages
TABLE 2.8 Aggregate and mathematical functions
TABLE 2.9 JOIN types
TABLE 2.10 Big Data processing stages
TABLE 2.11 Azure products and Big Data stages
Chapter 3
TABLE 3.1 Types and tools for ingestion
TABLE 3.2 Analytical datastores
TABLE 3.3 File type use cases
TABLE 3.4 Data landing zones
TABLE 3.5 Slowly changing dimension types
TABLE 3.6 Hot path serving layer and MPP products
TABLE 3.7 Dedicated vs. serverless SQL pools
TABLE 3.8 Dedicated SQL pool performance level
TABLE 3.9 Spark pool node sizes
TABLE 3.10 Apache Spark components
TABLE 3.11 Data Explorer pool workload size
TABLE 3.12 Integration runtimes core count
TABLE 3.13 Azure Databricks cluster modes
TABLE 3.14 Databricks runtime versions
TABLE 3.15 Azure Databricks worker types
TABLE 3.16 Azure Databricks environments
TABLE 3.17 Azure Databricks job types
TABLE 3.18 Azure Databricks user entitlements
TABLE 3.19 Event Hubs vs. IoT Hub
TABLE 3.20 Azure Stream Analytics built‐in functions
TABLE 3.21 Azure Stream Analytics data types
TABLE 3.22 Apache Kafka vs. Event Hubs terminology
Chapter 4
TABLE 4.1 Supported codecs by file format
TABLE 4.2 Cross‐region replication pairings, paired datacenters
TABLE 4.3 ADLS archiving actions
TABLE 4.4 Data flow schema modifiers
TABLE 4.5 Data flow transformation features
TABLE 4.6 Slowly changing dimension types
TABLE 4.7 External location endpoints and protocols
Chapter 5
TABLE 5.1 Data file split recommendation
TABLE 5.2 Brainjammer brain wave values
Chapter 6
TABLE 6.1 Azure Batch resource components
TABLE 6.2 Azure Storage limits
TABLE 6.3 Exercise 6.6 pipeline parameters
TABLE 6.4 Types of pipeline triggers
TABLE 6.5 Copy Data activity—verification results
TABLE 6.6 Copy Data activity—inconsistent data results
TABLE 6.7 Azure DevOps components
Chapter 7
TABLE 7.1 Streaming product capabilities
TABLE 7.2 Additional streaming product capabilities
TABLE 7.3 Streaming scalability by product
TABLE 7.4 Azure streaming products' pricing units
TABLE 7.5 Azure Event Hubs tiers
TABLE 7.6 Stream Analytics input/output partitioning
TABLE 7.7 Data stream illustration
TABLE 7.8 Azure Stream Analytics exceptions
Chapter 8
TABLE 8.1 Azure data product security support
TABLE 8.2 Azure storage account authorization methods
TABLE 8.3 Managed identity types
Chapter 9
TABLE 9.1 Logging verbosity and severity
TABLE 9.2 Synapse platform system dynamic management views
TABLE 9.3 database_transaction_state column description
TABLE 9.4 DMVs for troubleshooting PolyBase
TABLE 9.5 Azure Synapse Analytics workspace metrics
TABLE 9.6 Dedicated SQL pool metrics
TABLE 9.7 Apache Spark pool metrics
TABLE 9.8 Different types of testing
TABLE 9.9 Azure Stream Analytics metrics
Chapter 10
TABLE 10.1 Performance and troubleshooting antipatterns
TABLE 10.2 Database partition analysis features
TABLE 10.3 Dedicated SQL pool indexes
TABLE 10.4 Index‐related Dynamic Management Views
TABLE 10.5 Query Store stored procedures
TABLE 10.6 Transaction and HTAP dynamic management views
TABLE 10.7 Data Flow Compute size
Chapter 1
FIGURE 1.1 Comparing the Azure data scientist and analyst roles
FIGURE 1.2 The Azure data engineer role
FIGURE 1.3 The Azure database administrator associate role
FIGURE 1.4 A path to the Azure Data Engineer Associate certification
FIGURE 1.5 The extract, transform, and load (ETL) approach
FIGURE 1.6 A data streaming pipeline
FIGURE 1.7 Azure portal security product and feature security hierarchy
FIGURE 1.8 An Azure data security diagram with products and features
FIGURE 1.9 Using Azure Key Vault and MI
FIGURE 1.10 Azure privacy and governance products
FIGURE 1.11 Azure health and monitoring products
FIGURE 1.12 Azure Synapse pools, performance, and debugging
FIGURE 1.13 Azure feature updates
FIGURE 1.14 Azure Analytics product documentation
FIGURE 1.15 Azure products in preview
FIGURE 1.16 Azure Synapse Analytics services
FIGURE 1.17 Azure Synapse Analytics Studio
FIGURE 1.18 Azure Databricks workspace
FIGURE 1.19 Azure HDInsight most popular supported open source frameworks
FIGURE 1.20 Azure Analysis Services
FIGURE 1.21 Azure Data Factory Studio
FIGURE 1.22 Azure Stream Analytics data flow
FIGURE 1.23 Azure Cosmos DB and supported APIs
FIGURE 1.24 Azure Active Directory portal
FIGURE 1.25 Role‐based access control scope
FIGURE 1.26 Azure App Service Managed Identity
FIGURE 1.27 Azure Managed Identity in Azure Active Directory
FIGURE 1.28 Azure Managed Identity in Azure Key Vault
FIGURE 1.29 Azure Monitor
FIGURE 1.30 Tags for Azure products
Chapter 2
FIGURE 2.1 Big Data characteristics
FIGURE 2.2 Tables in a relational database
FIGURE 2.3 The Select SQL Deployment Option blade
FIGURE 2.4 Azure Data Studio
FIGURE 2.5 A view of data tables in Azure Data Studio
FIGURE 2.6 Azure Cosmos DB APIs
FIGURE 2.7 Azure Cosmos Data Explorer
FIGURE 2.8 Azure Cosmos Data Explorer SQL query
FIGURE 2.9 Azure Synapse Analytics Sharding example
FIGURE 2.10 Azure Synapse Analytics hash table distribution
FIGURE 2.11 Azure Synapse Analytics replicated table distribution
FIGURE 2.12 Azure Synapse Analytics external tables
FIGURE 2.13 Azure Synapse Analytics external tables example
FIGURE 2.14 ADLS directory hierarchy example
FIGURE 2.15 Schemas, views, and users as seen in SSMS
FIGURE 2.16 A star schema example
FIGURE 2.17 A snowflake schema example
FIGURE 2.18 SQL Query from view table and new schema
FIGURE 2.19 A data skew example
FIGURE 2.20 Data streaming solution using Apache Kafka and Apache Spark
FIGURE 2.21 Visual Studio data workloads
FIGURE 2.22 Visual Studio C# code example
FIGURE 2.23 Where to implement and use an SDK
FIGURE 2.24 Output of running the PDW_SHOWSPACEUSED command
FIGURE 2.25 Output of running the SHOW_STATISTICS command
FIGURE 2.26 Using CONVERT and CAST SQL commands
FIGURE 2.27 SQL query output from FIRST_VALUE and LAST_VALUE
FIGURE 2.28 Representation of SQL JOINs
FIGURE 2.29 OPENJSON query
FIGURE 2.30 Big Data stages and Azure products
FIGURE 2.31 Pipeline data flow Azure Synapse transform stage
Chapter 3
FIGURE 3.1 Data producers and processing services
FIGURE 3.2 An Azure storage account Overview blade
FIGURE 3.3 The Upload folder in Azure Storage Explorer
FIGURE 3.4 Files uploaded to ADLS using Azure Storage Explorer
FIGURE 3.5 Storing Parquet files in an ADLS container
FIGURE 3.6 The brainjammer directory structure
FIGURE 3.7 The brainjammer raw‐files directory
FIGURE 3.8 The brainjammer cleansed‐data directory
FIGURE 3.9 The brainjammer business‐data directory
FIGURE 3.10 Changing the access tier of files in an ADLS container
FIGURE 3.11 Configuring a new dedicated SQL pool
FIGURE 3.12 Partitioning files
FIGURE 3.13 The lambda architecture serving layer
FIGURE 3.14 A relational star schema
FIGURE 3.15 Type 1 SCD
FIGURE 3.16 Type 2 SCD
FIGURE 3.17 Type 3 SCD
FIGURE 3.18 Type 6 SCD
FIGURE 3.19 A dimensional hierarchy
FIGURE 3.20 A temporal table
FIGURE 3.21 A Hive metadata metastore
FIGURE 3.22 A Hive metadata metastore database
FIGURE 3.23 A Hive metadata metastore table
FIGURE 3.24 A Hive metadata metastore spark pool
FIGURE 3.25 Querying metadata in a SQL database
FIGURE 3.26 An overview of Azure Synapse Analytics
FIGURE 3.27 The Azure Synapse Analytics Manage hub
FIGURE 3.28 Creating an Azure Synapse Analytics dedicated SQL pool
FIGURE 3.29 Azure Synapse Analytics Apache Spark pool Basics tab
FIGURE 3.30 Azure Synapse Analytics Apache Spark pool Additional Settings ta...
FIGURE 3.31 Azure Synapse Analytics Data Explorer pool Basics tab
FIGURE 3.32 Azure Synapse Analytics Data Explorer pool Additional Settings t...
FIGURE 3.33 Azure Synapse Analytics External connections Linked services
FIGURE 3.34 Azure Synapse Analytics Linked Azure SQL Database
FIGURE 3.35 Azure Synapse Analytics linked services
FIGURE 3.36 Azure Synapse Analytics integration triggers
FIGURE 3.37 Choosing an Azure Synapse Analytics integration runtime
FIGURE 3.38 Creating an Azure Synapse Analytics integration runtime
FIGURE 3.39 The Access Control page in Azure Synapse Analytics
FIGURE 3.40 Adding a role assignment in Azure Synapse Analytics
FIGURE 3.41 Azure Synapse Analytics private endpoints
FIGURE 3.42 Adding a workspace package in Azure Synapse Analytics
FIGURE 3.43 Consuming a workspace package in Azure Synapse Analytics
FIGURE 3.44 Setting up a code repository in Azure Synapse Analytics
FIGURE 3.45 Azure Synapse Analytics configure GitHub
FIGURE 3.46 Azure Synapse Analytics configure GitHub repository
FIGURE 3.47 Azure Synapse Analytics configure GitHub saved
FIGURE 3.48 Azure Synapse Analytics Data SQL database
FIGURE 3.49 Azure Synapse Analytics Data external data
FIGURE 3.50 Azure Synapse Analytics Data connect Azure Cosmos DB
FIGURE 3.51 How a dataset fits in the data ingestion scheme
FIGURE 3.52 Azure Synapse Analytics data integration dataset formats
FIGURE 3.53 Azure Synapse Analytics data integration dataset linked service...
FIGURE 3.54 Azure Synapse Analytics data integration dataset properties
FIGURE 3.55 The Azure Synapse Analytics Browse Gallery
FIGURE 3.56 Azure Synapse Analytics Integrate Pipeline
FIGURE 3.57 Azure Synapse Analytics Pipeline Copy data Source tab
FIGURE 3.58 Azure Synapse Analytics Pipeline Copy data Sink tab
FIGURE 3.59 Azure Synapse Analytics Pipeline Copy data Mapping tab
FIGURE 3.60 Azure Synapse Analytics Copy Data tool
FIGURE 3.61 Azure Data Factory Manage Git configuration example
FIGURE 3.62 Azure Data Factory Manage global parameters
FIGURE 3.63 Azure Data Factory Author dataset
FIGURE 3.64 The Azure Databricks platform
FIGURE 3.65 An Azure Databricks cluster
FIGURE 3.66 An Azure Databricks notebook
FIGURE 3.67 Azure Databricks Notebook command
FIGURE 3.68 Azure Databricks Settings Admins Console Users
FIGURE 3.69 Azure Databricks Workspace User notebook Permission
FIGURE 3.70 Azure Databricks Workspace User notebook revision history
FIGURE 3.71 Azure Databricks brain wave charting example
FIGURE 3.72 Azure Databricks workspace jobs
FIGURE 3.73 Azure Databricks Repos
FIGURE 3.74 Azure Databricks Jobs
FIGURE 3.75 Azure Databricks data ingestion
FIGURE 3.76 A data lake
FIGURE 3.77 Event Hubs data ingestion
FIGURE 3.78 Provisioning an Azure Stream Analytics job
FIGURE 3.79 An Azure Stream Analytics job query
FIGURE 3.80 An Azure Stream Analytics hopping window
FIGURE 3.81 An Azure Stream Analytics session window
FIGURE 3.82 An Azure Stream Analytics sliding window
FIGURE 3.83 An Azure Stream Analytics snapshot window
FIGURE 3.84 An Azure Stream Analytics tumbling window
FIGURE 3.85 Choosing an Apache Kafka for HDInsight cluster type
FIGURE 3.86 Apache Kafka for HDInsight Kafka nodes
FIGURE 3.87 Azure Monitor and Azure Data Box
Chapter 4
FIGURE 4.1 Azure Synapse Analytics compression
FIGURE 4.2 Azure Synapse Analytics SQL partitioning
FIGURE 4.3 Azure Synapse Analytics Spark partitioning from Storage explorer...
FIGURE 4.4 Azure Synapse Analytics Spark partitioning
FIGURE 4.5 Sharding original table
FIGURE 4.6 Sharding sharded table
FIGURE 4.7 Data redundancy snapshots
FIGURE 4.8 Data redundancy dedicated SQL pool
FIGURE 4.9 Data redundancy restore dedicated SQL pool
FIGURE 4.10 Data redundancy restore ADLS
FIGURE 4.11 Data redundancy restore ADLS container
FIGURE 4.12 Data redundancy ADLS replication options
FIGURE 4.13 Data redundancy backups Azure Synapse Analytics regional redunda...
FIGURE 4.14 Data redundancy Azure Synapse Analytics single redundancy model ...
FIGURE 4.15 Data redundancy storage account for Azure Databricks
FIGURE 4.16 Implementing table distributions
FIGURE 4.17 Implement data archiving access tier
FIGURE 4.18 Implement data archiving lifecycle management
FIGURE 4.19 Azure Synapse Analytics Data hub ADLS directory
FIGURE 4.20 Azure Synapse Analytics Data hub ADLS directory
FIGURE 4.21 Azure Synapse Analytics Develop hub load Notebook
FIGURE 4.22 Azure Synapse Analytics Develop hub write Notebook Parquet files...
FIGURE 4.23 Azure Synapse Analytics data flow
FIGURE 4.24 Azure Synapse Analytics data flow transformations
FIGURE 4.25 Azure Synapse Analytics Develop hub, data flow Source Settings t...
FIGURE 4.26 Azure Synapse Analytics Develop hub, data flow Optimize tab
FIGURE 4.27 Azure Synapse Analytics Develop hub, data flow Derived Column sc...
FIGURE 4.28 Azure Synapse Analytics Develop hub, Visual Expression Builder
FIGURE 4.29 Azure Synapse Analytics Develop hub, Apache Spark job definition...
FIGURE 4.30 Logical data structure
FIGURE 4.31 Finding the history table
FIGURE 4.32 A temporal data solution
FIGURE 4.33 Building an external table
FIGURE 4.34 An efficient file and folder structure
FIGURE 4.35 Serving layer using star schema distribution types
FIGURE 4.36 Serving layer using star schema integration dataset
FIGURE 4.37 Serving layer using star schema tmp integration dataset
FIGURE 4.38 Serving layer using star schema data flow
FIGURE 4.39 Serving layer using star schema pipeline
FIGURE 4.40 Storing raw data in Azure Databricks
FIGURE 4.41 Storing data using Azure HDInsight
FIGURE 4.42 Storing prepared, trained, and modeled data
Chapter 5
FIGURE 5.1 A Big Data pipeline process example
FIGURE 5.2 Azure Synapse Analytics—ingesting Brainjammer brain waves
FIGURE 5.3 Azure Synapse Analytics—transforming Brainjammer brain waves
FIGURE 5.4 Azure Synapse Analytics—monitoring Brainjammer brain wave transfo...
FIGURE 5.5 Azure Synapse Analytics—SQL pool vs. a linked service stored proc...
FIGURE 5.6 Azure Data Factory Synapse—linked service
FIGURE 5.7 Azure Data Factory—Synapse dataset
FIGURE 5.8 Azure Data Factory—Synapse pipeline
FIGURE 5.9 Azure Data Factory Synapse—pipeline transformation
FIGURE 5.10 Transforming data using an Apache Spark Azure Synapse Spark pool...
FIGURE 5.11 Transforming data using Apache Spark Azure Synapse Analytics
FIGURE 5.12 Transforming data using an Apache Spark Azure Databricks workspa...
FIGURE 5.13 Transforming data using Apache Spark Azure Databricks configurat...
FIGURE 5.14 Jupyter notebooks—Azure Databricks Import option
FIGURE 5.15 Jupyter notebooks—Azure Databricks imported
FIGURE 5.16 Transforming data using Apache Spark Jupyter notebooks Azure HDI...
FIGURE 5.17 Transforming data using Apache Spark Jupyter notebooks Azure HDI...
FIGURE 5.18 Transforming data using Apache Spark Jupyter notebooks Azure Dat...
FIGURE 5.19 Transforming data using Apache Spark Jupyter notebooks Azure Dat...
FIGURE 5.20 Transforming data using Azure Stream Analytics JSON
FIGURE 5.21 Splitting the data source—Projection tab
FIGURE 5.22 Splitting the data sink—Optimize tab
FIGURE 5.23 Shredding JSON with Azure Cosmos DB
FIGURE 5.24 Encoding and decoding data Unicode
FIGURE 5.25 Encoding and decoding data—VARCHAR
FIGURE 5.26 Encoding and decoding data—NVARCHAR
FIGURE 5.27 Configuring error handling for the transformation
FIGURE 5.28 Normalizing and denormalizing brainjammer values
FIGURE 5.29 Not normalized brain waves data
FIGURE 5.30 Normalized brain waves data
FIGURE 5.31 Performing exploratory data analysis—visualizing data in Power B...
FIGURE 5.32 Performing exploratory data analysis—visualizing data in Power B...
FIGURE 5.33 Performing exploratory data analysis—Power BI workspace
FIGURE 5.34 Performing exploratory data analysis—visualizing brain waves in ...
FIGURE 5.35 Performing exploratory data analysis—visualizing brain waves alp...
FIGURE 5.36 Performing exploratory data analysis—visualizing brain waves gam...
FIGURE 5.37 The data transformation process
FIGURE 5.38 Transforming and enriching the data pipeline
FIGURE 5.39 Transforming and enriching data—pipeline drop script
FIGURE 5.40 Transforming and enriching data—filter transformation data flow...
FIGURE 5.41 Data management disciplines
FIGURE 5.42 Azure Databricks—configuring a box plot chart
FIGURE 5.43 Azure Databricks—configuring a box plot chart (2)
FIGURE 5.44 Azure Machine Learning—brainjammer contributor access
FIGURE 5.45 Azure Machine Learning—brainjammer table
FIGURE 5.46 Azure Machine Learning—brainjammer job
FIGURE 5.47 Azure Machine Learning—Access Control (IAM)
FIGURE 5.48 Azure Machine Learning—VotingEnsemble algorithm
FIGURE 5.49 Azure Machine Learning—usage prediction with a model workspace
FIGURE 5.50 Azure Machine Learning Usage—predict with a model
Chapter 6
FIGURE 6.1 The role of Azure Batch processing in a data analytics solution
FIGURE 6.2 Azure Batch processing—dependencies
FIGURE 6.3 Azure Batch processing—many‐to‐many dependency
FIGURE 6.4 Batch processing in Big Data architecture
FIGURE 6.5 An Azure Batch workflow
FIGURE 6.6 Azure Batch account and pool configuration
FIGURE 6.7 Azure Batch linked service configuration
FIGURE 6.8 Azure Batch Custom pipeline activity
FIGURE 6.9 Azure Batch task details
FIGURE 6.10 Azure Batch task output
FIGURE 6.11 Azure Batch—Azure Synapse Analytics batch service pipeline
FIGURE 6.12 Azure Synapse Analytics—Apache Spark job definition
FIGURE 6.13 Azure Synapse Analytics—Apache Spark job scenario result
FIGURE 6.14 Azure Synapse Analytics—Apache Spark job diagnostics and Monitor...
FIGURE 6.15 Azure Batch—Azure Synapse Analytics batch service pipeline with ...
FIGURE 6.16 Azure Databricks batch job
FIGURE 6.17 Linked service configuration for the Azure Databricks batch job...
FIGURE 6.18 Azure Databricks batch job pipeline configuration
FIGURE 6.19 Azure Databricks batch job pipeline status
FIGURE 6.20 Azure Batch custom pipeline activity Azure Data Factory
FIGURE 6.21 Azure Batch Explorer
FIGURE 6.22 Azure HDInsight batch processing
FIGURE 6.23 Azure Synapse Analytics parameters
FIGURE 6.24 Azure Synapse Analytics parameters as command arguments
FIGURE 6.25 Azure Synapse Analytics parameter input and output run details
FIGURE 6.26 Azure Synapse Analytics passing parameter between pipeline activ...
FIGURE 6.27 Azure Synapse Analytics pipeline activity with no dependencies
FIGURE 6.28 Azure Synapse Analytics pipeline activity with no dependencies (...
FIGURE 6.29 Azure Synapse Analytics pipeline variables
FIGURE 6.30 Azure Synapse Analytics pipeline dynamic arguments
FIGURE 6.31 Running an Azure Synapse Analytics pipeline
FIGURE 6.32 Azure Synapse Analytics New/Edit trigger
FIGURE 6.33 Azure Synapse Analytics daily scheduled trigger
FIGURE 6.34 Azure Synapse Analytics weekly scheduled trigger
FIGURE 6.35 Azure Synapse Analytics monthly scheduled trigger
FIGURE 6.36 Azure Synapse Analytics many‐to‐many scheduled trigger
FIGURE 6.37 Azure Synapse Analytics tumbling window trigger
FIGURE 6.38 Azure Synapse Analytics storage event trigger
FIGURE 6.39 Azure Synapse Analytics custom event notification flow
FIGURE 6.40 Azure Synapse Analytics custom event trigger
FIGURE 6.41 Azure Databricks scheduled trigger
FIGURE 6.42 Azure Databricks scheduled trigger log
FIGURE 6.43 Handle duplicate data—data flow source
FIGURE 6.44 Handle duplicate data—data flow aggregate group by
FIGURE 6.45 Handle duplicate data—data flow aggregate
FIGURE 6.46 Handle duplicate data—data flow select
FIGURE 6.47 Handle duplicate data—data flow sink
FIGURE 6.48 Upsert data, batching flow diagram
FIGURE 6.49 Upsert data—update methods
FIGURE 6.50 Upsert data—sink data preview
FIGURE 6.51 Upsert data—Delete If
FIGURE 6.52 Upsert data—MD5 Derived Column row hash
FIGURE 6.53 Upsert data—MD5 Exists row hash
FIGURE 6.54 Configure batch retention
FIGURE 6.55 Incremental data loads—Get Metadata activity
FIGURE 6.56 Incremental data loads—ForEach activity
FIGURE 6.57 Incremental data loads—Execute Pipeline activity
FIGURE 6.58 Incremental data loads—pipeline output
FIGURE 6.59 Incremental data loads—ForEach activity settings
FIGURE 6.60 Incremental data loads—Copy Data activity
FIGURE 6.61 Managing batches and pipelines’ triggers
FIGURE 6.62 Validate batch loads with the Copy Data activity
FIGURE 6.63 Validate batch loads with the Validation activity
FIGURE 6.64 Validate batch loads with Validation activity failure
FIGURE 6.65 Validate batch loads with the Lookup activity
FIGURE 6.66 Implementing version control for pipeline artifacts
FIGURE 6.67 Implementing version control for pipeline artifacts, Azure DevOp...
FIGURE 6.68 Manage data pipeline annotations
FIGURE 6.69 Managing data pipeline annotations using Azure PowerShell
FIGURE 6.70 Managing data pipeline annotations using Azure PowerShell data f...
FIGURE 6.71 Managing data pipeline annotations using Azure PowerShell Spark ...
FIGURE 6.72 Handling failed batch loads
Chapter 7
FIGURE 7.1 Azure stream processing
FIGURE 7.2 Azure real‐time stream processing
FIGURE 7.3 Azure near real‐time stream processing
FIGURE 7.4 Input interoperability in Azure products
FIGURE 7.5 Sink interoperability in Azure products
FIGURE 7.6 Azure Stream Analytics scaling
FIGURE 7.7 Lambda architecture speed layer, near real‐time processing
FIGURE 7.8 Azure Stream Analytics ADLS output
FIGURE 7.9 Azure Stream Analytics ADLS container path and file
FIGURE 7.10 Test sample data upload in Azure Stream Analytics
FIGURE 7.11 The result of test data uploaded in Azure Stream Analytics
FIGURE 7.12 Develop a stream processing solution Azure Stream Analytics outp...
FIGURE 7.13 Develop a stream processing solution Azure Stream Analytics simu...
FIGURE 7.14 Develop a stream processing solution Azure Stream Analytics sent...
FIGURE 7.15 Develop a stream processing solution Azure Stream Analytics sent...
FIGURE 7.16 Configure reference data for Azure Stream Analytics use.
FIGURE 7.17 Use reference data with Azure Stream Analytics.
FIGURE 7.18 Use reference data with Azure Stream Analytics.
FIGURE 7.19 Power BI Azure Stream Analytics output configuration
FIGURE 7.20 The brainjammer streaming dataset in Power BI
FIGURE 7.21 Adding a real‐time data tile to the Power BI dashboard
FIGURE 7.22 Configuring a real‐time data tile to the Power BI dashboard
FIGURE 7.23 Viewing a real‐time data tile to the Power BI dashboard
FIGURE 7.24 Azure Databricks stream processing
FIGURE 7.25 Azure Databricks Spark Structured Streaming
FIGURE 7.26 Installing the Event Hubs library on an Azure Databricks cluster...
FIGURE 7.27 The installed Event Hubs library on an Azure Databricks cluster...
FIGURE 7.28 Streamed Event Hubs messages displayed in the Azure Databricks n...
FIGURE 7.29 A brain wave time series chart
FIGURE 7.30 Windowed aggregates output
FIGURE 7.31 Partition key mapping to Azure Stream Analytics partitions
FIGURE 7.32 The Azure Stream Analytics Compatibility Level blade
FIGURE 7.33 Upsert on streamed data using an Azure function, connection stri...
FIGURE 7.34 Upserting streamed data on Azure Cosmos DB—configuring output
FIGURE 7.35 Streaming data into Azure Cosmos DB using the command console
FIGURE 7.36 Inserting streamed data on Azure Cosmos DB, initial load
FIGURE 7.37 Handling schema drift in a stream processing solution
FIGURE 7.38 Handling schema drift in a stream processing solution in ADLS
FIGURE 7.39 Handling schema drift in a stream processing solution in Azure C...
FIGURE 7.40 A data stream with event messages and a watermark
FIGURE 7.41 Watermark progression example
FIGURE 7.42 The EventEnqueuedUtcTime and EventProcessedUtcTime columns on th...
FIGURE 7.43 Event ordering for a late‐arriving streamed event message
FIGURE 7.44 Azure Stream Analytics monitoring metrics
FIGURE 7.45 An archived data stream solution
FIGURE 7.46 Configuring an archive input alias
FIGURE 7.47 Archive replay data result
FIGURE 7.48 Azure Stream Analytics job metrics, CPU at 99 percent utilizatio...
FIGURE 7.49 Azure Stream Analytics job scaling
FIGURE 7.50 Azure Stream Analytics Diagnostics Setting
FIGURE 7.51 Azure Stream Analytics Activity log warnings and errors
Chapter 8
FIGURE 8.1 Layered security
FIGURE 8.2 Creating an Azure Key Vault key
FIGURE 8.3 Creating an Azure Key Vault secret
FIGURE 8.4 Creating an Azure Key Vault certificate
FIGURE 8.5 Vault access policy operations
FIGURE 8.6 Azure Key Vault x509 certificate details
FIGURE 8.7 Microsoft Purview default root collection
FIGURE 8.8 Microsoft Purview Map view
FIGURE 8.9 The Azure Policy Overview blade
FIGURE 8.10 Data Discovery & Classification
FIGURE 8.11 Data Discovery & Classification, Add Classification window
FIGURE 8.12 Azure storage account encryption type
FIGURE 8.13 Dynamic Data Masking dedicated SQL pool
FIGURE 8.14 ADLS access control access keys
FIGURE 8.15 ADLS Access control shared access signature
FIGURE 8.16 RBAC and ACL permission evaluation
FIGURE 8.17 RBAC Access Control (IAM) Azure storage account
FIGURE 8.18 RBAC role and ACL permission evaluation
FIGURE 8.19 The Manage ACL blade
FIGURE 8.20 Connecting Microsoft Purview to Azure Synapse Analytics workspac...
FIGURE 8.21 Configuring scanning in Microsoft Purview
FIGURE 8.22 The result of a Microsoft Purview scan
FIGURE 8.23 Dedicated SQL pool auditing configuration
FIGURE 8.24 Dedicated SQL pool Diagnostic setting configuration
FIGURE 8.25 View dedicated SQL pool audit logs in Log Analytics.
FIGURE 8.26 Scanning a dedicated SQL pool with Microsoft Purview
FIGURE 8.27 Microsoft Purview Data estate insights schema data classificatio...
FIGURE 8.28 SQL Information Protection policy classification recommendations...
FIGURE 8.29 Data Discovery & Classification, Add classification 2
FIGURE 8.30 Data Discovery & Classification overview
FIGURE 8.31 Protecting sensitive data in files
FIGURE 8.32 Implement a data retention policy in Azure Synapse Analytics.
FIGURE 8.33 Implement a data retention policy schedule pipeline trigger.
FIGURE 8.34 Encrypt data at rest, TDE, dedicated SQL pool.
FIGURE 8.35 Row‐level security
FIGURE 8.36 Column‐level security
FIGURE 8.37 Column‐level security enforcement exception
FIGURE 8.38 Role and membership details on a dedicated SQL pool database
FIGURE 8.39 Implement data masking and masking rule.
FIGURE 8.40 Creating and applying a user‐assigned managed identity
FIGURE 8.41 Creating a shared, credential passthrough spark cluster
FIGURE 8.42 Adding a user to an Azure Databricks workspace using RBAC
FIGURE 8.43 Access key in Key Vault from a blob linked service failure
FIGURE 8.44 Access key from Key Vault to blob linked service failure
FIGURE 8.45 Enabling Microsoft Defender for Storage
FIGURE 8.46 Azure Active Directory created group
FIGURE 8.47 Add role assignment access control Synapse Contributor.
FIGURE 8.48 Add role assignment access control Synapse Contributor Parquet f...
FIGURE 8.49 Managing ACLs for an ADLS folder
FIGURE 8.50 Adding an ACL to allow write access
FIGURE 8.51 Adding an Azure storage account with an ADLS container to a VNet...
FIGURE 8.52 The Azure storage account VNet configuration
FIGURE 8.53 Azure storage account private endpoint configuration
FIGURE 8.54 Network security group rules
FIGURE 8.55 Generating an access token
FIGURE 8.56 Azure Batch networking restrictions
FIGURE 8.57 Configuring a custom Azure Batch RBAC role using the Azure porta...
FIGURE 8.58 Browsing assets in the data catalog
FIGURE 8.59 Browsing assets based on source type
FIGURE 8.60 Browsing assets based on source type
FIGURE 8.61 Viewing Microsoft Purview data lineage
Chapter 9
FIGURE 9.1 The Azure Synapse Analytics Logs blade
FIGURE 9.2 Azure Synapse Analytics dedicated SQL pool metrics
FIGURE 9.3 Azure Event Hub diagnostic settings
FIGURE 9.4 Creating an Azure Synapse Analytics alert condition
FIGURE 9.5 The Azure Synapse Analytics Alerts blade
FIGURE 9.6 The Azure Monitor activity log
FIGURE 9.7 The Azure Storage Account Insights Overview tab
FIGURE 9.8 A summary of the Azure Storage Account Insights blade
FIGURE 9.9 Azure storage account Workbooks
FIGURE 9.10 The Azure Synapse Analytics Monitor hub
FIGURE 9.11 Azure Synapse Analytics integration runtimes
FIGURE 9.12 An Azure Synapse Analytics sample pipeline to generate monitor l...
FIGURE 9.13 Azure Synapse Analytics pipeline runs filtered by annotations
FIGURE 9.14 Azure Synapse Analytics activity runs
FIGURE 9.15 Azure Synapse Analytics data flow modifiers
FIGURE 9.16 Azure Synapse Analytics Apache Spark applications
FIGURE 9.17 Azure Synapse Analytics dedicated SQL pool metrics
FIGURE 9.18 Azure Stream Analytics job diagram
FIGURE 9.19 The Azure Stream Analytics Metrics hub
FIGURE 9.20 Azure Databricks Apache Spark cluster logging
FIGURE 9.21 Azure Databricks cluster metrics
FIGURE 9.22 Monitoring data pipeline performance Gantt chart
FIGURE 9.23 Monitoring and update statistics execution plan
FIGURE 9.24 Monitoring and update statistics view statistics
FIGURE 9.25 Apache Spark application details
FIGURE 9.26 DAG Visualization
FIGURE 9.27 Azure DevOps Azure Test Plans New Test Plan
FIGURE 9.28 Azure DevOps Azure Test Plans New Test Cases
FIGURE 9.29 Azure DevOps Azure Test Plans Execute Test Cases
FIGURE 9.30 Azure Stream Analytics Tools extension in the Visual Studio Code...
FIGURE 9.31 An Azure Stream Analytics job query in the Visual Studio Code bl...
FIGURE 9.32 The Azure Batch Metrics blade
FIGURE 9.33 The Azure Key Vault Metrics blade
FIGURE 9.34 The Azure SQL Metrics blade
Chapter 10
FIGURE 10.1 Azure Advisor score
FIGURE 10.2 Azure Cost Management
FIGURE 10.3 Compacting small files—Source and Sink tabs
FIGURE 10.4 Handling data spill memory capacity
FIGURE 10.5 Finding shuffling in a pipeline—explain plan with shuffle cost
FIGURE 10.6 Tuning queries by using indexer's indexes
FIGURE 10.7 Tuning queries with the Top Resource Consuming Queries report
FIGURE 10.8 Tuning queries with a nonclustered index
FIGURE 10.9 Optimizing pipelines for analytics or transactional purposes
FIGURE 10.10 Optimizing pipelines for analytics or transactional purposes: d...
FIGURE 10.11 Optimizing pipelines for analytics or transactional purposes: d...
FIGURE 10.12 Optimizing pipelines for analytics or transactional purposes: d...
FIGURE 10.13 Optimizing pipelines for analytics or transactional purposes: d...
FIGURE 10.14 Optimizing pipelines for analytics or transactional purposes: d...
FIGURE 10.15 Troubleshooting a failed Spark job: stderr
FIGURE 10.16 Troubleshooting a failed Spark job: scaling Apache Spark workfl...
FIGURE 10.17 Troubleshooting a failed Spark job: scaling Apache Spark pool j...
FIGURE 10.18 Troubleshooting a failed pipeline run
FIGURE 10.19 Troubleshooting a failed pipeline run: scaling a dedicated SQL ...
FIGURE 10.20 Troubleshooting a failed pipeline run: scaling an Apache Spark ...
FIGURE 10.21 Troubleshooting a failed pipeline run: enabling Data Flow Debug...
FIGURE 10.22 Troubleshooting a failed pipeline run: debug settings
FIGURE 10.23 Troubleshooting a failed pipeline run: breakpoints
FIGURE 10.24 Troubleshooting a failed pipeline run: dependency conditions
FIGURE 10.25 Troubleshooting a failed pipeline run: retries
FIGURE 10.26 Troubleshooting a failed pipeline run: reruns
FIGURE 10.27 Rewriting Azure Stream Analytics user‐defined functions
FIGURE 10.28 Scaling resources: Azure Batch pool
FIGURE 10.29 Handling interruptions: dedicated Azure Stream Analytics cluste...
FIGURE 10.30 Scaling resources: custom autoscale rule
Benjamin Perkins
Copyright © 2023 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada and the United Kingdom.
ISBNs: 9781119885429 (paperback), 9781119885443 (ePDF), 9781119885436 (ePub)
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at www.wiley.com/go/permission.
Trademarks: WILEY, the Wiley logo, and the Sybex logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Microsoft and Azure are registered trademarks of Microsoft Corporation. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Control Number: 2023941199
Cover image: © Jeremy Woodhouse/Getty Images
Cover design: Wiley
Creating a book starts first as an idea, which then iterates through many versions, until it takes the form of something consumable. Many people helped to progress this book from idea to final product. Here is a list of those who played a significant role in the creation of this book and the organization of its content:
Ken Brown, senior acquisitions editor
Robyn Alvarez, project manager
Heini Ilmarinen, technical editor
John Sleeva, copyeditor
Nancy Carrasco, proofreader
Writing this book—and writing in general—has become something I enjoy. Writing gives me the opportunity to share some of my technical knowledge and experiences so that others can gain some knowledge and insights. In addition to sharing my words, I gain an even greater understanding of the topic, as I structure the content, conduct research, and create hands‐on exercises. Writing a book requires a huge effort, but there are many reasons to do it. I'd like to thank my family for their support while I was writing this book. I know it took hours away from them. Thanks, Andrea, Lea, and Noa. You are the reason and my purpose.
Benjamin Perkins is currently employed at Microsoft in Munich, Germany, as a Senior Escalation Engineer on the Azure team. He has been working professionally in the IT industry for close to three decades. He started computer programming with QBasic at the age of 11 on an Atari 1200XL desktop computer. He takes pleasure in the challenges that troubleshooting technical issues has to offer and savors the rewards of a well‐written program. After completing high school, he joined the United States Army. After successfully completing his military service, he attended Texas A&M University in College Station, Texas, where he received a Bachelor of Business Administration in Management Information Systems. He also received a Master of Business Administration from the European University.
His roles in the IT industry have spanned the entire spectrum, including programmer, system architect, technical support engineer, team leader, and mid‐level manager. While employed at Hewlett‐Packard and Compaq Computer Corporation, he received numerous awards, degrees, and certifications. He has a passion for technology and customer service and looks forward to troubleshooting and writing more world‐class technical solutions: “My approach is to write code with support in mind, and to write it once correctly and completely so we do not have to come back to it again, except to enhance it.”
Benjamin has written numerous magazine articles and training courses and is an active blogger. His catalog of books covers C# programming, IIS, NHibernate, and Microsoft Azure.
Benjamin is married to Andrea and has two wonderful children, Lea and Noa.
Heini Ilmarinen is a data enthusiast with a passion for architecture and DevOps. Heini currently works as Azure Lead and DevOps Consultant at Polar Squad, helping customers bring their data platforms to life in Azure.
Heini initially studied to become a mathematics teacher, graduating from Helsinki University with a Master of Science. After graduating, she transitioned to the IT industry, leveraging her skills for problem‐solving and making complex topics easy to understand. In IT, Heini started her career working in infrastructure architecture development projects in hybrid environments. With architecture as a starting point, her career developed from working with Azure to getting deeper into data projects to topics related to DevOps.
Over the years, Heini has worked on a multitude of Azure projects, from application development to data projects, gaining a broad understanding of the requirements for creating functional, production‐ready solutions. For the past two years, she has also engaged in community events and public speaking, earning the Data Platform MVP award.
Heini can often be found riding her snowboard and enjoying the fresh air, or riding up and down hills on her mountain bike.
Exercise 2.1 Create an Azure SQL DB
Exercise 2.2 Create an Azure Cosmos DB
Exercise 2.3 Create a Schema and a View in Azure SQL
Exercise 3.1 Create an Azure Data Lake Storage Container
Exercise 3.2 Upload Data to an ADLS Container
Exercise 3.3 Create an Azure Synapse Analytics Workspace
Exercise 3.4 Create an Azure Synapse Analytics Linked Service
Exercise 3.5 Configure an Azure Synapse Analytics Workspace Package
Exercise 3.6 Configure an Azure Synapse Analytics Workspace with GitHub
Exercise 3.7 Configure Azure Synapse Analytics Data Hub SQL Pool Staging Tables
Exercise 3.8 Configure Azure Synapse Analytics Data Hub with Azure Cosmos DB
Exercise 3.9 Configure an Azure Synapse Analytics Integrated Dataset
Exercise 3.10 Create an Azure Data Factory
Exercise 3.11 Create a Linked Service in Azure Data Factory
Exercise 3.12 Create a Dataset in Azure Data Factory
Exercise 3.13 Create a Pipeline to Convert XLSX to Parquet
Exercise 3.14 Create an Azure Databricks Workspace with an External Hive Metastore
Exercise 3.15 Configure Delta Lake
Exercise 3.16 Create an Azure Event Namespace and Hub
Exercise 3.17 Create an Azure Stream Analytics Job
Exercise 4.1 Implement Compression
Exercise 4.2 Implement Partitioning
Exercise 4.3 Implement Data Redundancy
Exercise 4.4 Implement Distributions
Exercise 4.5 Implement Data Archiving
Exercise 4.6 Azure Synapse Analytics Data Hub SQL Script
Exercise 4.7 Azure Synapse Analytics Develop Hub Notebook
Exercise 4.8 Azure Synapse Analytics Develop Hub Data Flow
Exercise 4.9 Build a Temporal Data Solution
Exercise 4.10 Azure Synapse Analytics Data Hub Data Flow
Exercise 4.11 Build External Tables on a Serverless SQL Pool
Exercise 4.12 Implement Efficient File and Folder Structures
Exercise 4.13 Implement a Serving Layer with a Star Schema
Exercise 4.14 Implement a Dimensional Hierarchy
Exercise 5.1 Transform Data Using Azure Synapse Pipeline
Exercise 5.2 Transform Data Using Azure Data Factory
Exercise 5.3 Transform Data Using Apache Spark—Azure Synapse Analytics
Exercise 5.4 Transform Data Using Apache Spark—Azure Databricks
Exercise 5.5 Cleanse Data
Exercise 5.6 Split Data
Exercise 5.7 Azure Cosmos DB—Shred JSON
Exercise 5.8 Flatten, Explode, and Shred JSON
Exercise 5.9 Encode and Decode Data
Exercise 5.10 Normalize and Denormalize Values
Exercise 5.11 Perform Exploratory Data Analysis—Transform
Exercise 5.12 Perform Exploratory Data Analysis—Visualize
Exercise 5.13 Transform and Enrich Data
Exercise 5.14 Transform Data by Using Apache Spark—Azure Databricks
Exercise 5.15 Predict Data Using Azure Machine Learning
Exercise 6.1 Create an Azure Batch Account and Pool
Exercise 6.2 Develop a Batch Processing Solution Using an Azure Synapse Analytics Pipeline
Exercise 6.3 Develop a Batch Processing Solution Using an Azure Synapse Analytics Apache Spark
Exercise 6.4 Develop a Batch Processing Solution Using Azure Databricks
Exercise 6.5 Develop a Batch Processing Solution Using an Azure Data Factory Pipeline
Exercise 6.6 Create Data Pipelines—Advanced
Exercise 6.7 Create a Scheduled Trigger
Exercise 6.8 Create and Schedule an Azure Databricks Workflow Job
Exercise 6.9 Handle Duplicate Data with a Data Flow
Exercise 6.10 Upsert Data
Exercise 6.11 Implement Incremental Data Loads
Exercise 6.12 Validate Batch Loads by Using a Validation Activity
Exercise 6.13 Validate Batch Loads by Using a Lookup Activity
Exercise 7.1 Add an Output ADLS Container to an Azure Stream Analytics Job
Exercise 7.2 Develop a Stream Processing Solution with Azure Stream Analytics—Testing the Data
Exercise 7.3 Develop a Stream Processing Solution with Azure Stream Analytics
Exercise 7.4 Use Reference Data with Azure Stream Analytics
Exercise 7.5 Stream Data to Power BI from Azure Stream Analytics
Exercise 7.6 Stream Data with Azure Databricks
Exercise 7.7 Develop and Create Windowed Aggregates
Exercise 7.8 Upsert Stream Processed Data in Azure Cosmos DB
Exercise 7.9 Handle Schema Drift in Azure Stream Analytics
Exercise 7.10 Replay an Archived Stream Data in Azure Stream Analytics
Exercise 8.1 Create an Azure Key Vault Resource
Exercise 8.2 Create a Microsoft Purview Account
Exercise 8.3 Configure and Perform a Data Asset Scan Using Microsoft Purview
Exercise 8.4 Audit an Azure Synapse Analytics Dedicated SQL Pool
Exercise 8.5 Apply Sensitivity Labels and Data Classifications Using Microsoft Purview and Data Discovery
Exercise 8.6 Implement a Data Retention Policy
Exercise 8.7 Implement Column-Level Security
Exercise 8.8 Implement Data Masking
Exercise 8.9 Create a User-Assigned Managed Identity
Exercise 8.10 Connect to an ADLS Container from Azure Databricks Cluster Using ABFSS
Exercise 8.11 Use an Azure Key Vault Secret to Store an Authentication Key for a Linked Service
Exercise 8.12 Implement Azure RBAC for ADLS
Exercise 8.13 Implement POSIX-Like ACLs for ADLS
Exercise 8.14 Create an Azure Storage Account and ADLS Container with a VNet
Exercise 8.15 Create an Azure Synapse Analytics Workspace with a VNET
Exercise 9.1 Create an Azure Monitor Workspace
Exercise 9.2 Create an Azure Synapse Analytics Alert
Exercise 9.3 Monitor and Manage Azure Synapse Analytics Logs
Exercise 10.1 Compact Small Files
A long time ago, I was sitting at my desk happily coding my Active Server Page (ASP) and COM component, when someone approached me and asked if I knew anything about databases. Without even a pause, I answered with a confident yes. Most people in IT know "something" about databases, right? Well, it turned out that a big project was starting, and they needed someone to create and manage a database. I acquired a server, installed a relational database management system (RDBMS), and executed CREATE DATABASE dbName; GO. And the rest is history. I like to call that out because these days, most of the data storage architecture already exists when you start the job. You must learn what someone else created. You experience problems but do not know why, because a lot happened before you started.
The emerging technology called big data is providing a rare opportunity, much like the one I had: the chance to build, or be involved in creating, an IT data analytics solution from the beginning. Being the person or the team who builds the framework and foundation of what could become a system that shapes the future of a company is career‐altering. The experience is a differentiator that stays with you for the rest of your career, as it has with me. But it could also be a catastrophe for numerous reasons, such as a solution that cannot scale, is too hard to change, or is not reliable.