Hands-On Data Warehousing with Azure Data Factory

Christian Coté
Description

Leverage the power of Microsoft Azure Data Factory v2 to build hybrid data solutions

Key Features

  • Combine the power of Azure Data Factory v2 and SQL Server Integration Services
  • Design and enhance performance and scalability of a modern ETL hybrid solution
  • Interact with the loaded data in data warehouse and data lake using Power BI

Book Description

ETL is one of the essential techniques in data processing. Given that data is everywhere, ETL will always be a vital process for handling data from different sources.

Hands-On Data Warehousing with Azure Data Factory starts with the basic concepts of data warehousing and the ETL process. You will learn how Azure Data Factory and SSIS can be used to understand the key components of an ETL solution. You will go through the different services offered by Azure that can be used by ADF and SSIS, such as Azure Data Lake Analytics, Machine Learning, and Databricks Spark, with the help of practical examples. You will explore how to design and implement ETL hybrid solutions using different integration services with a step-by-step approach. Once you get to grips with all this, you will use Power BI to interact with data coming from different sources in order to reveal valuable insights.

By the end of this book, you will not only learn how to build your own ETL solutions but also address the key challenges that are faced while building them.

What you will learn

  • Understand the key components of an ETL solution using Azure Data Factory and Integration Services
  • Design the architecture of a modern ETL hybrid solution
  • Implement ETL solutions for both on-premises and Azure data
  • Improve the performance and scalability of your ETL solution
  • Gain thorough knowledge of new capabilities and features added to Azure Data Factory and Integration Services

Who this book is for

This book is for you if you are a software professional who develops and implements ETL solutions using Microsoft SQL Server or Azure cloud. It will be an added advantage if you are a software engineer, DW/ETL architect, or ETL developer, and know how to create a new ETL implementation or enhance an existing one with ADF or SSIS.





Hands-On Data Warehousing with Azure Data Factory

 

 

ETL techniques to load and transform data from various sources, both on-premises and on cloud

 

 

 

 

 

 

 

 

Christian Coté
Michelle Gutzait
Giuseppe Ciaburro

 

 

 

 

 

 

 

 

 

 

 

BIRMINGHAM - MUMBAI

Hands-On Data Warehousing with Azure Data Factory

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty
Acquisition Editor: Vinay Argekar
Content Development Editor: Mayur Pawanikar
Technical Editor: Dinesh Pawar
Copy Editors: Vikrant Phadkay, Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tania Dutta
Production Coordinator: Deepika Naik

First published: May 2018

Production reference: 1300518

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-78913-762-0

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the authors

Christian Coté has been in IT for more than 12 years. He is an MS-certified technical specialist in business intelligence (MCTS-BI). For about 10 years, he has been a consultant in ETL/BI projects. His ETL projects have used various ETL tools and plain code with various RDBMSes (such as Oracle and SQL Server). He is currently working on his sixth SSIS implementation in 4 years.

 

Michelle Gutzait has been in IT for 30 years as a developer, business analyst, and database consultant. She has worked with MS SQL Server for 20 years. Her skills include infrastructure and database design, performance tuning, security, HADR solutions, consolidation, very large databases, replication, T-SQL coding and optimization, SSIS, SSRS, SSAS, admin and infrastructure tools development, cloud services, training developers, DBAs, and more. She has been an Oracle developer, business analyst, and development team lead.

Giuseppe Ciaburro holds a PhD in environmental technical physics and two master's degrees. His research is on machine learning applications in the study of urban sound environments. He works at Built Environment Control Laboratory, Università degli Studi della Campania Luigi Vanvitelli (Italy). He has over 15 years' experience in programming Python, R, and MATLAB, first in the field of combustion, and then in acoustics and noise control. He has several publications to his credit.

About the reviewer

Chirag Nayyar helps organizations to initiate their digital transformation using the public cloud. He has been actively working on cloud platforms since 2013, providing consultancy services to many organizations, ranging from SMBs to Enterprises. He holds a wide range of certifications from all major public cloud platforms. He also runs meetups and is a regular speaker at various cloud events. 

He has also reviewed a few books published by Packt.

 

 

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Hands-On Data Warehousing with Azure Data Factory

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the authors

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

The Modern Data Warehouse

The need for a data warehouse

Driven by IT

Self-service BI

Cloud-based BI – big data and artificial intelligence

The modern data warehouse

Main components of a data warehouse

Staging area

Data warehouse

Cubes

Consumption layer – BI and analytics

What is Azure Data Factory

Limitations of ADF V1.0

What's new in V2.0?

Integration runtime

Linked services

Datasets

Pipelines

Activities

Parameters

Expressions

Controlling the flow of activities

SSIS package deployment in Azure

Spark cluster data store

Summary

Getting Started with Our First Data Factory

Resource group

Azure Data Factory

Datasets

Linked services

Integration runtimes

Activities

Monitoring the data factory pipeline runs

Azure Blob storage

Blob containers

Types of blobs

Block blobs

Page blobs

Replication of storage

Creating an Azure Blob storage account

SQL Azure database

Creating the Azure SQL Server

Attaching the BACPAC to our database

Copying data using our data factory

Summary

SSIS Lift and Shift

SSIS in ADF

Sample setup

Sample databases

SSIS components

Integration services catalog setup

Sample solution in Visual Studio

Deploying the project on-premises

Leveraging our package in ADF V2

Integration runtimes

Azure integration runtime

Self-hosted runtime

SSIS integration runtime

Adding an SSIS integration runtime to the factory

SSIS execution from a pipeline

Summary

Azure Data Lake

Creating and configuring Data Lake Store

Next Steps

Ways to copy/import data from a database to the Data Lake

Ways to store imported data in files in the Data Lake

Easily moving data to the Data Lake Store

Ways to directly copy files into the Data Lake

Prerequisites for the next steps

Creating a Data Lake Analytics resource

Using the data factory to manipulate data in the Data Lake

Task 1 – copy/import data from SQL Server to a blob storage file using data factory

Task 2 – run a U-SQL task from the data factory pipeline to summarize data

Service principal authentication

Run U-SQL from a job in the Data Lake Analytics

Summary

Machine Learning on the Cloud

Machine learning overview

Machine learning algorithms

Supervised learning

Unsupervised learning

Reinforcement learning

Machine learning tasks

Making predictions with regression algorithms

Automated classification using machine learning

Identifying groups using clustering methods

Dimensionality reduction to improve performance

Feature selection

Feature extraction

Azure Machine Learning Studio

Azure Machine Learning Studio account

Azure Machine Learning Studio experiment

Dataset

Module

Work area

Breast cancer detection

Get the data

Prepare the data

Train the model

Score and evaluate the model

Summary

Introduction to Azure Databricks

Azure Databricks setup

Prepare the data to ingest

Setting up the folder in the Azure storage account

Self-hosted integration runtime

Linked service setup

Datasets setup

SQL Server dataset

Blob storage dataset

Linked service

Dataset

Copy data from SQL Server to sales-data

Publish and trigger the copy activity

Databricks notebook

Calling Databricks notebook execution in ADF

Summary

Reporting on the Modern Data Warehouse

Different types of BI

Self-service – personal

Team BI – sharing personal BI data

Corporate BI

Power BI Premium

Power BI Report Server

Power BI consumption

Creating our Power BI reports

Reporting with on-premise data sources

Incorporating Spark data

Summary

Preface

Extract, Transform, and Load (ETL) is one of the essential techniques in data processing. Given that data is everywhere, ETL will always be the best way to handle data from different sources.

This book starts with the basic concepts of data warehousing and ETL. You will learn how Azure Data Factory and SSIS can be used to understand the key components of an ETL solution. You will go through different services offered by Azure that can be used by ADF and SSIS, such as Azure Data Lake Analytics, machine learning, and Databrick's Spark, with the help of practical examples. You will explore how to design and implement ETL hybrid solutions using different integration services in a step-by-step approach. Once you get to grips with all this, you will use Power BI to interact with data coming from different sources in order to reveal valuable insights.

By the end of this book, you will not only know how to build your own ETL solutions, but will also be able to address the key challenges that are faced while building them.

Who this book is for

This book is for you if you are a software professional who develops and implements ETL solutions using Microsoft SQL Server or Azure Cloud. It will be an added advantage if you are a software engineer, DW/ETL architect, or ETL developer and know how to create a new ETL implementation or enhance an existing one with Azure Data Factory or SSIS.

What this book covers

Chapter 1, The Modern Data Warehouse, covers the various storage options available in Microsoft Azure that will help us set up our data factory.

Chapter 2, Getting Started with Our First Data Factory, uses the data factory to move data from Azure SQL to Azure storage.

Chapter 3, SSIS Lift and Shift, digs further into the various services available in Azure, as well as how we can integrate an existing SSIS solution into the factory.

Chapter 4, Azure Data Lake, primarily focuses on the components of Azure Data Lake and provides a basic implementation of those components.

Chapter 5, Machine Learning on the Cloud, covers the different machine learning algorithms and the tools that Microsoft Azure Machine Learning Studio provides to handle them.

Chapter 6, Introduction to Azure Databricks, shows how Azure Data Factory can trigger a Databricks notebook.

Chapter 7, Reporting on the Modern Data Warehouse, explains how we can integrate this data into a Power BI report.

To get the most out of this book

An Azure subscription. If you don't have a subscription, you can create a free trial account in just a couple of minutes at http://azure.microsoft.com/pricing/free-trial/.

An Azure storage account. You use blob storage as a source data store in this tutorial. If you don't have an Azure storage account, see how to create a storage account at https://docs.microsoft.com/en-us/azure/storage/common/storage-create-storage-account#create-a-storage-account.

An Azure SQL database. You'll use an Azure SQL database as a destination data store in this tutorial. If you don't have an Azure SQL database that you can use, see how to create and configure one at https://docs.microsoft.com/en-us/azure/sql-database/sql-database-get-started.

SQL Server 2017 Developer Edition, SQL Server Management Studio, or Visual Studio 2015 or 2017. You can use any of these to create a sample database and to view the resulting data in the database.

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Data-Warehousing-with-Azure-Data-Factory. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/HandsOnDataWarehousingwithAzureDataFactory_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "When we click on it, the adfv2book blade opens."

A block of code is set as follows:

SELECT [CustomerID]
      ,[CustomerName]
      ,[CustomerCategoryName]
      ,[PrimaryContact]
      ,[AlternateContact]
      ,[PhoneNumber]
FROM [Website].[Customers]

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select Databases and choose SQL Database, as shown in the following screenshot."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

The Modern Data Warehouse

Azure Data Factory (ADF) is a service available in the Microsoft Azure ecosystem. This service allows the orchestration of different data loads and transfers in Azure.

Back in 2014, there were hardly any easy ways to schedule data transfers in Azure. There were a few open source solutions available, such as Apache Falcon and Oozie, but nothing was easily available as a service in Azure. Microsoft introduced ADF in public preview in October 2014, and the service went to general availability in July 2015.

The service allows the following actions:

Copying data from various sources and destinations

Calling various computation services, such as HDInsight and Azure SQL Data Warehouse, for data transformations

Orchestrating the preceding activities using time slices and retrying the activities when there is an error

All these activities were available via the Azure portal at first, and in Visual Studio 2013 before general availability (GA).
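To make the shape of these pipelines concrete, here is a minimal sketch of an ADF V2 pipeline definition with a single copy activity and a retry policy, expressed as a Python dictionary mirroring the JSON payload the service expects. The pipeline and dataset names are invented for this illustration; it is not taken from the book's examples.

```python
import json

# Hypothetical ADF V2 pipeline: one copy activity moving data from a blob
# dataset to a SQL dataset, with retry-on-error behavior in its policy.
pipeline = {
    "name": "CopySalesPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSql",
                "type": "Copy",
                # Datasets (defined elsewhere in the factory) that the
                # activity reads from and writes to.
                "inputs": [{"referenceName": "BlobSalesDataset",
                            "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SqlSalesDataset",
                             "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"}
                },
                # Retrying the activity when there is an error, as listed above.
                "policy": {"retry": 2, "timeout": "01:00:00"}
            }
        ]
    }
}

# The JSON document that would be submitted to the factory.
print(json.dumps(pipeline, indent=2))
```

The `inputs` and `outputs` entries reference datasets defined separately in the factory, which is why a pipeline on its own says nothing about connection strings or credentials; those live in linked services.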

The need for a data warehouse

A data warehouse is a repository of enterprise data used for reporting and analysis. There have been three waves of data warehouses so far, which we will cover in the upcoming subsections.

Driven by IT

This is the first wave of business intelligence (BI). IT needed to separate operational data and databases from its origin for the following reasons:

Keep a history of data changes. Some operational applications purge the data after a while.

When users wanted to report on the application's data, they were often affecting the performance of the system. IT replicated the operational data to another server to avoid any performance impact on applications.

Things got more complex when users wanted to do analysis and reporting on databases from multiple enterprise applications. IT had to replicate all the needed systems and make them speak together. This implied that new structures had to be built, and new patterns emerged from there: star schemas, decision support systems (DSS), OLAP cubes, and so on.

Self-service BI

Analysts and users always need data warehouses to evolve at a faster pace. This is the second wave of BI, and it happened when major BI players such as Microsoft and Qlik came out with tools that enabled users to merge data with or without a data warehouse. In many enterprises, this is used as a temporary source of analytics or for proofs of concept. On the other hand, not all data could fit in data warehouses at that time. Many ad hoc reports were, and still are, built using self-service BI tools. Here is a short list of such tools:

Microsoft Power Pivot

Microsoft Power BI

Qlik

Cloud-based BI – big data and artificial intelligence

This is the third wave of BI. Cloud capabilities enable enterprises to do more accurate analysis. Big data technologies allow users to base their analysis on much bigger data volumes. This helps them derive patterns from the data and have technologies that incorporate and modify these patterns. This leads to artificial intelligence (AI).

Technologies used in big data are not that new. They were used by many search engines, such as Yahoo! and Google, in the early 21st century. They have also been used quite a lot by research departments in different enterprises. The third wave of BI broadened the usage of these technologies. Vendors such as Microsoft, Amazon, and Google make them available to almost everyone with their cloud offerings.

The modern data warehouse

Microsoft, as well as many other service providers, has described the concepts of the modern data warehouse. Here are some of the many features a modern data warehouse should have:

Integration of relational as well as non-relational sources: The data warehouse should be able to ingest data that is not easily integrated into the traditional data warehouse, such as big data, non-relational crunched data, and so on.

Hybrid deployment: The data warehouse should be able to extend from on-premises storage to the cloud.

Advanced analytics: The data warehouse should be able to analyze the data from all kinds of datasets using different modern machine learning tools.

In-database analytics: The data warehouse should be able to use Microsoft software that is integrated with some very powerful open analytics tools, such as R and Python, in its database. Also, with PolyBase integration, the data warehouse can integrate more data sources when it's based on SQL Server.

Main components of a data warehouse

This section will discuss the various parts of a data warehouse.

Staging area

In a classic data warehouse, this zone is usually a database, and/or a schema in it, that is used to hold a copy of the data from the source systems. The staging area is necessary because, most of the time, data sources are not stored on the same server as the data warehouse. Even if they are on the same server, we prefer to work with a copy of them for the following reasons:

Preserve data integrity. All data is copied over from a specific point in time. This ensures that we have consistency between tables.

We might need specific indexes that we could not create in the source system. When we query the data, we're not necessarily making the same links (joins) in the source system. Therefore, we might have to create indexes to increase query performance.

Querying the source might have an impact on the performance of the source application. Usually, the staging area is used to bring just the changes from the source systems. This prevents processing too much data from the data source.

Not to mention that the data source might be files: CSV, XML, and so on. It's much easier to bring their content into relational tables. From a modern data warehouse perspective, this means storing the files in HDFS and separating them by date.

In a modern data warehouse, if we’re in the cloud only, relational data can still be stored in databases. The only difference might be in the location of the databases. In Azure, we can use Azure SQL tables or Azure data warehouse.
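The incremental-load idea above — bringing just the changes from the source and adding indexes the source doesn't have — can be sketched with a small, self-contained example. This is a hypothetical illustration using SQLite in place of real source and staging databases; the table names and the watermark column are invented:

```python
import sqlite3

# A stand-in "source system" with an orders table and modification dates.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, modified TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "Contoso", "2018-01-01"), (2, "Fabrikam", "2018-02-01")])

# The staging database: a plain copy-shaped table, separate from the source.
stg = sqlite3.connect(":memory:")
stg.execute("CREATE TABLE stg_orders (id INTEGER, customer TEXT, modified TEXT)")

# Bring only the changes since the last load (watermark pattern), so we
# don't process too much data from the source.
last_watermark = "2018-01-15"
changed = src.execute(
    "SELECT id, customer, modified FROM orders WHERE modified > ?",
    (last_watermark,)).fetchall()
stg.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", changed)

# An index we could not create on the source system can be created on the
# staging copy to speed up downstream queries.
stg.execute("CREATE INDEX ix_stg_orders_customer ON stg_orders (customer)")

rows = stg.execute("SELECT id, customer FROM stg_orders").fetchall()
print(rows)  # → [(2, 'Fabrikam')] — only rows modified after the watermark
```

In a real solution the watermark would be persisted between loads (often in a control table), and the extract would run against the operational system during a low-activity window.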

Data warehouse

This is where the data is copied over from the staging area. There are several schools of thought that define the data warehouse:

Kimball group data warehouse bus: Ralph Kimball was a pioneer in data warehousing. He and his colleagues wrote many books and articles on their method. It consists of conformed dimensions that can be used by many business processes. For example, if we have a dimension named DimCustomer, we should link it to all fact tables that store customers. We should not create another dimension that redefines our customers. The following link gives more information on the Kimball group method: https://www.kimballgroup.com.

Inmon CIF: Bill Inmon and his colleagues defined the Corporate Information Factory (CIF) at the end of the 1990s. It consists of modeling the source systems, commonly using the third normal form. All the data in the tables is dated, which means that any changes in the data sources are inserted into the data warehouse tables. The following link gives more information on CIF: http://www.inmoncif.com.

Data Vault: Created by Dan Linstedt in the 21st century, this is the most recent modeling method in data warehousing. It consists of breaking down the source data into many different entities, which gives a lot of flexibility when the data is consumed: we reconstruct the data and use only the pieces necessary for our analysis. Here is a link that gives more information on Data Vault: http://learndatavault.com.

Cubes

In addition to the relational data warehouse, we might have a cube, such as SQL Server Analysis Services. Cubes don't replace the relational data warehouse; they extend it. They can also connect to the parts of the warehouse that are not necessarily stored in a relational database. By doing this, they become a semantic layer that can be used by the consumption layer described next.

Consumption layer – BI and analytics

This area is where the data is consumed from the data warehouse and/or the data lake. This book has a chapter dedicated to the data lake. In short, the data lake is composed of several areas (data ponds) that classify the data inside it. The data warehouse is a part of the data lake; it contains the certified data. The data outside the data warehouse in the data lake is, most of the time, not certified. It is used for ad hoc analysis or data discovery.

The BI part can be stored in relational databases, analytic cubes, or models. It can also consist of views on top of the data warehouse when the data is suitable for it.

What is Azure Data Factory

Azure data factories are composed of the following components:

Linked services: Connectors to the various storage and compute services. For example, we can have a pipeline that will use the following artifacts:

HDInsight cluster on demand