Data Modeling for Azure Data Services - Peter ter Braake - E-Book


Description

Data is at the heart of all applications and forms the foundation of modern data-driven businesses. With the multitude of data-related use cases and the availability of different data services, choosing the right service and implementing the right design becomes paramount to successful implementation.
Data Modeling for Azure Data Services starts with an introduction to databases, entity analysis, and normalizing data. The book then shows you how to design a NoSQL database for optimal performance and scalability and covers how to provision and implement Azure SQL DB, Azure Cosmos DB, and Azure Synapse SQL Pool. As you progress through the chapters, you'll learn about data analytics, Azure Data Lake, and Azure SQL Data Warehouse, explore dimensional modeling and data vault modeling, and design and implement a data lake using Azure Storage. You'll also learn how to implement ETL with Azure Data Factory.
By the end of this book, you'll have a solid understanding of which Azure data services are the best fit for your model and how to implement the best design for your solution.

You can read this e-book in the Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 619

Year of publication: 2021




Data Modeling for Azure Data Services

Implement professional data design and structures in Azure

Peter ter Braake

BIRMINGHAM—MUMBAI

Data Modeling for Azure Data Services

Copyright © 2021 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kunal Parikh

Publishing Product Manager: Aditi Gour

Senior Editor: Roshan Kumar

Content Development Editors: Sean Lobo, Priyanka Soam

Technical Editor: Arjun Varma

Copy Editor: Safis Editing

Language Support Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Rekha Nair

Production Designer: Alishon Mendonca

First published: July 2021

Production reference: 1240621

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN: 978-1-80107-734-7

www.packt.com

Contributors

About the author

Peter ter Braake started working as a developer in 1996 after studying physics in Utrecht, the Netherlands. Databases and business intelligence piqued his interest the most, leading to him specializing in SQL Server and its business intelligence components. He has worked with Power BI from the tool's very beginnings. Peter started working as an independent contractor in 2008. This has enabled him to divide his time between teaching data-related classes, consulting with customers, and writing articles and books.

About the reviewers

Marcel Alsdorf has been working for 4 years as a cloud solution architect at Microsoft. In his role, he advises companies on cloud computing fundamentals, building modern data warehouse pipelines, event systems, and machine learning applications and tooling. In addition, he lectures on big data fundamentals at a German university. Previously, he worked as an FPGA engineer on the LHC project at CERN and as a software engineer in the banking industry.

I would like to thank Christoph Körner for suggesting me as a reviewer and the team at Packt for giving me the opportunity to help Peter ter Braake create a good starting point for understanding the fundamentals of data modeling in Azure.

Alexey Bokov is an experienced Azure architect and has been a Microsoft technical evangelist since 2011 – he works closely with Microsoft top-tier customers all around the world to develop applications based on the Azure cloud platform. Building cloud-based applications in the most challenging scenarios is his passion, as well as helping the development community to upskill and learn new things by working hands-on and hacking. He has been a long-time contributor, as a co-author and reviewer, to many Azure books and is an occasional speaker at Kubernetes events.

I'd like to thank my family – my beautiful wife, Yana, and my amazing son, Kostya, who have supported my efforts to help the author and publisher of this book.

Table of Contents

Preface

Section 1 – Operational/OLTP Databases

Chapter 1: Introduction to Databases

Overview of relational databases

Files

Relational databases

Introduction to Structured Query Language

Different categories of SQL

Understanding the database schema

Impact of intended usage patterns on database design

Understanding relational theory

Pillar 1 – Elements of a set are not ordered

Pillar 2 – All elements in a set are unique

Keys

Types of keys

Choosing the primary key

Integrity

The Check and Unique constraints

Types of workload

OLTP

OLAP

Summary

Chapter 2: Entity Analysis

Scope

Project scope

Product scope

Understanding entity relationship diagrams

Entities

Understanding super- and sub-entities

Naming entities

Relationships

Types of relationships

Drawing conventions

Recap

Creating your first ERD

Context of an ERD

Summary

Exercises

Exercise 1 – student registration

Exercise 2 – airline

Chapter 3: Normalizing Data

When to use normalization as a design strategy

Considering all the details

Preventing redundancy

How to avoid redundancy

The normalization steps

Step zero

First normal form

Second normal form

Third normal form

Boyce-Codd and the fourth normal form

Normalizing – a recap

An alternative approach to normalizing data

Step 1

Step 2

Step 3

Step 4

Integrating separate results

Entity relationship diagram

Summary

Exercises

Exercise 1 – Stock management of a bicycle shop

Chapter 4: Provisioning and Implementing an Azure SQL DB

Technical requirements

Understanding SQL Server data types

Numerical data

Alphanumerical data

Varying-length data types

Dates

Other data types

Quantifying the data model

Estimating the database size

Analyzing expected usage patterns

Provisioning an Azure SQL database

Provisioned versus serverless

vCores versus DTU

Hyperscale and Business Critical

Elastic pool

Networking

Additional settings

Tags

Review + create

Connecting to the database

Azure portal

Azure Data Studio

Data definition language

Creating a table

Altering a table

Dropping a table

Inserting data

Indexing

Clustered index

Nonclustered index

Automatic tuning

Summary

Chapter 5: Designing a NoSQL Database

Understanding big data

Understanding big data clusters

Partitioning

Getting to know Cosmos DB

JSON

Modeling JSON

Using embedding versus referencing

Referring to objects

Cosmos DB partitioning

Putting it together

Key-value databases

Modeling key-value databases

Other NoSQL databases

Gremlin

Cassandra

Extra considerations

Polyglot persistence

Concurrency

Summary

Exercise

Chapter 6: Provisioning and Implementing an Azure Cosmos DB Database

Technical requirements

Provisioning a Cosmos DB database

Basics

Networking

Backup policy

Encryption

Creating a container

Uploading documents to a container

Cosmos DB container settings

Importing data using the Azure Cosmos DB Data Migration tool

Summary

Section 2 – Analytics with a Data Lake and Data Warehouse

Chapter 7: Dimensional Modeling

Background to dimensional modeling

Performance

Consistency

Data quality

The complexity of normalized database schemas

Lack of historical data

Understanding dimensional modeling

Minimizing redundancy

Using dependencies between attributes

Understanding star schemas

Understanding fact tables

Understanding dimension tables

Steps in dimensional modeling

Choosing a process and defining the scope

Determining the needed grain

Determining the dimensions

Determining the facts

Designing dimensions

Defining the primary key of a dimension table

Adding an unknown member

Creating star schemas versus creating snowflake schemas

Implementing a date dimension

Slowly changing dimensions

Junk dimension

Degenerate dimension

Designing fact tables

Understanding additive facts

Understanding semi-additive facts

Understanding non-additive facts

Understanding transactional fact tables

Understanding periodic snapshot fact tables

Understanding accumulating snapshot fact tables

Understanding the roleplaying dimension

Using a coverage fact table

Using a Kimball data warehouse versus data marts

Summary

Exercise

Chapter 8: Provisioning and Implementing an Azure Synapse SQL Pool

Overview of Synapse Analytics

Introducing SQL pools

Introducing Spark pools

Introducing data integration

Provisioning a Synapse Analytics workspace

Creating a dedicated SQL pool

Implementing tables in Synapse SQL pools

Using hash distribution

Using replicated distribution

Using ROUND_ROBIN distribution

Implementing columnstore indexes

Understanding workload management

Creating a workload group

Creating a workload classifier

Using PolyBase to load data

Enabling a SQL pool to access a data lake account

Configuring and using PolyBase

Using CTAS to import data

Using COPY to import data

Connecting to and using a dedicated SQL pool

Working with Azure Data Studio

Working with Power BI

Summary

Chapter 9: Data Vault Modeling

Background to Data Vault modeling

Designing Hub tables

Defining the business key

Implementing a hash key

Adding the load date

Adding the name of the source system

Adding optional columns

Designing Link tables

Designing Satellite tables

Adding optional columns to a Satellite

Choosing the number of Satellites to use

Using hash keys

Designing a Data Vault structure

Choosing the Hubs

Choosing the Links

Choosing the Satellites

Designing business vaults

Adding a Meta Mart

Adding a Metrics Vault

Adding an Error Mart

Using Point-in-Time tables

Adding Bridge tables

Adding a hierarchical link

Implementing a Data Vault

Summary

Exercise

Chapter 10: Designing and Implementing a Data Lake Using Azure Storage

Technical requirements

Background of data lakes

Modeling a data lake

Defining data lake zones

Defining a data lake folder structure

Designing time slices

Using different file formats

AVRO file format

Parquet file format

ORC file format

Choosing the proper file size

Provisioning an Azure storage account

Locally redundant storage (LRS)

Zone-redundant storage (ZRS)

Geo-redundant storage (GRS) and geo-zone-redundant storage (GZRS)

Read-access geo-redundant storage (RA-GRS) and read-access geo-zone-redundant storage (RA-GZRS)

Creating a data lake filesystem

Creating multiple storage accounts

Considering DTAP

Considering data diversity

Considering cost sensitivity

Considering management overhead

Summary

Section 3 – ETL with Azure Data Factory

Chapter 11: Implementing ETL Using Azure Data Factory

Technical requirements

Introducing Azure Data Factory

Introducing the main components of Azure Data Factory

Understanding activities

Understanding datasets

Understanding linked services

Understanding pipelines

Understanding triggers and integration runtimes

Using the copy activity

Copying a single table to the data lake

Copying all tables to the data lake

Implementing a data flow

Executing SQL code from Data Factory

Summary

Why subscribe?

Other Books You May Enjoy

Preface

Databases play an important role in almost all the applications we use. A database has a direct impact on the performance and scalability of the application it supports, which makes choosing the right type of database and designing it correctly a vital part of any development effort. With databases hosted in Azure, the design may also have a direct impact on costs. The first part of this book teaches you when to use a relational database (Azure SQL Database) and when a NoSQL database (Cosmos DB) is the better option. You will also learn how to design that database and, finally, how to implement the chosen design.

All the data gathered by applications can be used for Business Intelligence (BI). A crucial part of BI is creating a central repository for your data to build your BI solution on. This can be a data lake with data marts, or it may be a data warehouse. In the second part of the book, you will learn to design data warehouses according to the theory of dimensional modeling or by designing a data vault. You will also learn how to design a data lake. You will then learn how to apply what you have learned by creating a data lake in Azure Storage and creating data marts using Azure Synapse Analytics.

The book ends with a chapter on Azure Data Factory. Data Factory is used to get data from the source databases you created in part one and store that data in the data platforms you implemented in part two.

After reading this book, you will have a solid understanding of data modeling and of how to implement a database schema in Azure. This will help you build scalable and cost-effective data solutions in Azure.

Who this book is for

This book is for application developers, BI developers, and data engineers who build databases and data solutions in the Microsoft Azure cloud. This book assumes that you have basic knowledge of Azure and a basic understanding of what a database is and what it is for. The book does not assume any specific development or database knowledge or experience. However, you should have an affinity for working with data. Some T-SQL experience will also be advantageous.

This book will also help data analysts gain an understanding of how databases are set up and why they are created in the way they are. This understanding will help data analysts to get the most out of the different data stores that they may need to use.

What this book covers

Chapter 1, Introduction to Databases, explains what a database is and what it is used for. It also explains the common concepts of databases.

Chapter 2, Entity Analysis, explains how to create and read Entity Relationship Diagrams (ERDs) from a top-down perspective.

Chapter 3, Normalizing Data, teaches you how to create a normalized database design and when to use this design technique.

Chapter 4, Provisioning and Implementing an Azure SQL Database, shows hands-on how to create and implement a normalized design in Azure using Microsoft's PaaS offering – Azure SQL Database.

Chapter 5, Designing a NoSQL Database, explains when to choose Cosmos DB over a relational database. It also teaches you how to design different types of NoSQL databases.

Chapter 6, Provisioning and Implementing an Azure Cosmos DB Database, shows hands-on how to create and implement a document database in Azure using Microsoft's NoSQL database, Cosmos DB.

Chapter 7, Dimensional Modeling, teaches you how to create a star schema database design according to the rules of dimensional modeling.

Chapter 8, Provisioning and Implementing an Azure Synapse SQL Pool, shows hands-on how to create and implement a star schema database using an Azure Synapse dedicated SQL pool.

Chapter 9, Data Vault Modeling, explains when to use a data vault and how to design a data warehouse using the data vault modeling technique.

Chapter 10, Designing and Implementing a Data Lake Using Azure Storage, discusses when implementing a data lake is a better option than creating a data warehouse. It also shows how to implement a data lake using Azure Storage.

Chapter 11, Implementing ETL Using Azure Data Factory, shows how to create pipelines to automate the process of getting data out of production databases and into data lakes and data marts.

To get the most out of this book

Since this book implements all database services in Azure, you will need a computer with a modern browser and an Azure subscription with permission to create resources in Azure.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-modelling-for-Azure-Data-Services. If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801077347_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Column names such as PatientID, Patient_FirstName, and PostalCode would already render it more readable."

A block of code is set as follows:

CREATE TABLE [dbo].[OrderDetail]
(
    OrderID      INT NOT NULL,
    ProductID    INT NOT NULL,
    UnitPrice    MONEY NOT NULL,
    Quantity     SMALLINT NOT NULL,
    Discount     NUMERIC(5, 4) NOT NULL
);
GO

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]

exten => s,1,Dial(Zap/1|30)

exten => s,2,Voicemail(u100)

exten => s,102,Voicemail(b100)

exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows:

$ mkdir css

$ cd css

Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Click, in Azure Data Studio, on the New Query button to open a new query file."

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Data Modeling for Azure Data Services, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.

Section 1 – Operational/OLTP Databases

Databases are used in everyday applications such as webshops, CRM systems, and financial systems. These are our Line Of Business (LOB) applications. In this section, we will learn how to choose between a SQL database and a Cosmos DB database. We will also learn how to design a database for optimal performance and scalability.

This section comprises the following chapters:

Chapter 1, Introduction to Databases
Chapter 2, Entity Analysis
Chapter 3, Normalizing Data
Chapter 4, Provisioning and Implementing an Azure SQL DB
Chapter 5, Designing a NoSQL Database
Chapter 6, Provisioning and Implementing an Azure Cosmos DB Database