
Scalable Data Analytics with Azure Data Explorer

Modern ways to query, analyze, and perform real-time data analysis on large volumes of data

Jason Myerscough

BIRMINGHAM—MUMBAI

Scalable Data Analytics with Azure Data Explorer

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Sunith Shetty

Senior Editor: Roshan Kumar

Content Development Editor: Shreya Moharir

Technical Editor: Sonam Pandey

Copy Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Manju Arasan

Production Designer: Aparna Bhagat

Marketing Coordinator: Priyanka Mhatre

First published: March 2022

Production reference: 2150322

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80107-854-2

www.packt.com

To Harrison and James, my two boys. I love you more than you will ever know. The stories and comics you both write inspired me to write this book.

To my parents, Charlotte and Dave. Thank you for getting Damian and me a PC and letting us watch Hackers. That was the catalyst for everything.

To my mentors, John Vasicek and Arunee Singhchawla. Thank you, John, for starting my Azure journey, supporting the original DevOps movement, and trusting me to lead it. Thank you, Arunee, for your mentorship, for encouraging me to write this book, and, last but not least, for your friendship. I learn from you every day.

Thank you, Shreya Moharir, for all your help and feedback. I learned so much from you during this project.

Foreword

It has been almost three decades since I started my career in technology. It started from the fact that I am a curious person who always wants to find various ways to solve problems. I love the challenge of finding new ways to do things more efficiently, and I tend to use the information around me to decide on the best approach to solving a problem. I entered the IT world as a quality assurance analyst, and that role flourished into site reliability engineering. What I have learned over the years is that those positions share a common property: analysis. My passion is in using the data and information at hand to aid the decision-making process, and that passion for analysis continues to thrive. Throughout those years, I have been blessed to have met and learned from many credible people.

In 2016, I had the privilege of meeting Jason Myerscough. He joined with the purpose of transforming and modernizing the team. Jason pioneered the DevOps culture and was tasked with leading one of our biggest and most notable projects: migrating one of the flagship products to the Azure Cloud. I had the honor of joining his team to help form this initiative. I remember how excited and nervous I was to be part of such an initiative for our team and company. I remember watching him sift through and analyze all the data. I recall him walking back and forth across the hall, looking at the data and discussing with our data scientists how to formulate patterns from the large amount of data he had collected from production. He would then come up with a monitoring and alerting strategy, as well as performance tuning. I wish I could be like him, I thought to myself. With his strong development background combined with his curiosity, drive for excellence, passion for the Azure Cloud, and analysis skills, I knew we had the perfect person to bring us to the cloud.

These days, we live in a data-rich age. It has become much more critical to be able to extract and digest data in a meaningful way in order to make business-critical decisions. For a site reliability engineer, key responsibilities include analyzing logs, monitoring production environments, and responding to issues. It is even more critical for us to have in-depth insight into production data, so we can digest it and set up proactive alerting.

In Jason's book, he shares how you can use Azure Data Explorer to quickly identify patterns, anomalies, and trends. His book walks you through what Azure Data Explorer is, how to set it up, and how to use Kusto Query Language to run as many queries as you need to quickly answer your questions.

With Jason's in-depth experience as a developer, a site reliability engineer, and an architect, this book truly reflects his experience and passion. It captures perfectly how you can use Azure Data Explorer to analyze data-rich environments in a meaningful way and help make the right business decisions. I highly recommend this book, as it covers many important aspects of Azure Data Explorer and is a delightful, easy-to-follow read. If you have a passion for analysis or are in a position where you must make business decisions based on large volumes of data, then you should put this book on your required reading list. It will empower you and your team to improve efficiency and productivity.

– Arunee Singhchawla

Director of Site Reliability Engineering at Nuance Communications

Contributors

About the author

Jason Myerscough is a director of Site Reliability Engineering and cloud architect at Nuance Communications. He has been working with Azure daily since 2015. He has migrated his company's flagship product to Azure and designed the environments to be secure and scalable across 16 different Azure regions by applying cloud best practices and governance. He is currently certified as an Azure Administrator (AZ-103) and an Azure DevOps Expert (AZ-400). He holds a first-class bachelor's degree with honors in software engineering and a first-class master's degree in computing.

A special thanks to all the Microsoft team members—Vladik Branevich, Tzvia Gitlin Troyna, Adi Eldar, Pankaj Suri, Dany Hoter, Oren Hasbani, Guy Reginiano, Slavik Neimer, Michal Bar, Rony Liderman, and Gabi Lehner for contributing to this book with your valuable insights.

About the reviewers

Diana Widjaja has been a technical writer with Salesforce for over 10 years, where she writes developer documentation for teams across the company's UI platform. Previously, she also had the opportunity to work with CBS, IBM, and Google. Diana received a bachelor of science in technical communication from the University of Washington and a master of science in information security policy and management from Carnegie Mellon University. When she's not working with words, pixels, and code, Diana likes to explore the San Francisco Bay Area with her energetic kids and husband, and keep in touch with family in England and Singapore.

Sibelius dos Santos Segala is a software engineer. He has been involved in IT projects for over 25 years in diverse areas such as university logistics, HR and grade systems, mainframe application integration to intranet, multimedia application streaming in consumer devices, support application for automotive tracking, and lately, banking-related software.

Table of Contents

Preface

Section 1: Introduction to Azure Data Explorer

Chapter 1: Introducing Azure Data Explorer

Technical requirements

Introducing the data analytics pipeline

Overview of Azure data analytics services

What is Azure Data Explorer?

ADX features

Introducing Azure Data Explorer architecture

Azure Data Explorer use cases

IoT monitoring and telemetry

Log analysis

Running your first query

Summary

Chapter 2: Building Your Azure Data Explorer Environment

Technical requirements

Creating an Azure subscription

Introducing Azure Cloud Shell

Creating and configuring ADX instances in the Azure portal

Introducing Infrastructure as Code

Creating and configuring ADX instances with PowerShell

Creating ADX clusters with ARM templates

ARM template structure

Parameters

Variables

Resources

Deploying our templates

Summary

Questions

Chapter 3: Exploring the Azure Data Explorer UI

Technical requirements

Ingesting the StormEvents sample dataset

Querying data in the Azure portal

Exploring the ADX Web UI

Summary

Section 2: Querying and Visualizing Your Data

Chapter 4: Ingesting Data in Azure Data Explorer

Technical requirements

Understanding data ingestion

Introducing schema mapping

Ingesting data using one-click ingestion

Ingesting data using KQL management commands

Ingesting data from Blob storage using Azure Event Grid

Enabling streaming on ADX

Creating our table and JSON mapping schema

Creating our storage account

Creating our event hub

Creating our Event Grid

Ingesting data in ADX

Summary

Questions

Chapter 5: Introducing the Kusto Query Language

Technical requirements

What is KQL?

Introducing the basics of KQL

Introducing predicates

Searching and filtering data

Aggregating data and tables

Formatting output

Generating graphs in the ADX Web UI

Converting SQL to KQL

Introducing KQL's scalar operators

Arithmetic operators

Logical operators

Relational operators

String operators

Date and time operators

Joining tables in KQL

Introducing KQL's management commands

Cluster management

Database and table management

Summary

Questions

Chapter 6: Introducing Time Series Analysis

Technical requirements

What is time series analysis?

Creating a time series with KQL

Introducing the helper operators and functions

Generating time series data

Calculating statistics for time series data

Summary

Questions

Chapter 7: Identifying Patterns, Anomalies, and Trends in your Data

Technical requirements

Calculating moving averages with KQL

Trend analysis with KQL

Applying linear regression with KQL

Applying segmented regression with KQL

Anomaly detection and forecasting with KQL

Anomaly detection

Forecasting for the future

Summary

Questions

Chapter 8: Data Visualization with Azure Data Explorer and Power BI

Technical requirements

Introducing data visualization

Creating dashboards with Azure Data Explorer

Navigating the dashboard window

Building our first Data Explorer dashboard

Sharing dashboards

Creating dashboard filters

Connecting Power BI to Azure Data Explorer

Summary

Questions

Section 3: Advanced Azure Data Explorer Topics

Chapter 9: Monitoring and Troubleshooting Azure Data Explorer

Technical requirements

Introducing monitoring and troubleshooting

Monitoring ADX

Azure Service Health

ADX metrics

ADX diagnostics

Alerting in Azure

Troubleshooting ADX

Creating a new data connection

Ingesting data to simulate an error

Observing and troubleshooting ADX

Configuring alerts for ingestion failures

Summary

Questions

Chapter 10: Azure Data Explorer Security

Technical requirements

Introducing identity management

Introducing RBAC and the management and data planes

Granting access to the management plane

Granting access to the data plane

Introducing virtual networking and subnet delegation

Creating a new resource group

Deploying the NSG

Deploying the route table

Deploying the virtual network

Deploying the public IP addresses

Deploying the ADX cluster

Filtering traffic with NSGs

Introducing NSGs

Creating inbound security rules

Summary

Questions

Chapter 11: Performance Tuning in Azure Data Explorer

Technical requirements

Introducing performance tuning

Introducing workload groups

How workload groups work

Creating custom workload groups

Introducing policy management

Managing the cache policy

Managing retention policies

Monitoring queries

KQL best practices

Version controlling your queries

Prioritizing time filtering

Best practices for string operators

Summary

Questions

Chapter 12: Cost Management in Azure Data Explorer

Technical requirements

Scaling and cost management

Selecting the correct ADX cluster SKU

Introducing dev/test clusters

Introducing production clusters

Introducing Azure Advisor

Introducing Cost Management + Billing

Accessing invoices

Configuring budget alerts

Summary

Chapter 13: Assessment

Other Books You May Enjoy

Preface

Azure Data Explorer (ADX) enables developers and data scientists to make data-driven business decisions. This book will help you rapidly get insights from your applications by querying data at scale and implementing best practices for securing your ADX clusters.

The book begins by introducing ADX and discussing its architecture, core features, and benefits. You'll learn how to securely deploy ADX instances and become comfortable navigating and using the ADX Web UI. You'll focus on data ingestion and on querying and visualizing your data using the powerful Kusto Query Language (KQL). You'll cover KQL operators and functions to efficiently query and explore your data, and learn to perform time series analysis and search for anomalies and trends in your data. Later, you'll focus on advanced ADX topics, starting with deploying your ADX instances using Infrastructure as Code (IaC). You will manage your cluster performance and monthly ADX costs by handling cluster scaling and data retention periods. Finally, you will learn how to secure your ADX environment by restricting access using subnet delegation, and explore some best practices for improving your KQL query performance.

By the end of this book, you will be able to securely deploy your own ADX instance, ingest data from multiple sources, rapidly query your data, and produce reports with KQL and Power BI.

Who this book is for

This book is for data analysts, data engineers, and data scientists who are responsible for analyzing and querying their team's large volumes of data on Azure. It will also be helpful for SRE and DevOps engineers who are responsible for deploying, maintaining, and securing the infrastructure. Some previous Azure experience and basic data querying knowledge will be beneficial.

What this book covers

Chapter 1, Introducing Azure Data Explorer, covers what ADX is, the core features of ADX, and where ADX fits in Microsoft's suite of data services. The chapter then discusses some of the use cases where ADX is a good fit and demonstrates how to execute your first KQL query.

Chapter 2, Building Your Azure Data Explorer Environment, explains how to quickly deploy and configure ADX clusters and databases using the Azure portal, PowerShell, and Azure ARM templates. By the end of this chapter, you will be ready to start ingesting and analyzing your data.

Chapter 3, Exploring the Azure Data Explorer UI, introduces the ADX Web UI, where you will spend the majority of your time querying and analyzing your data. By the end of this chapter, you will be familiar with the windows and panes in the ADX Web UI.

Chapter 4, Ingesting Data in Azure Data Explorer, discusses the concept of data ingestion, demonstrates how to ingest data from multiple data sources, such as Blob storage and Azure Event Hubs, shows how to create new table schemas, and explains how data maps to those tables. By the end of this chapter, you will understand how ADX ingests data and how to configure data ingestion.

Chapter 5, Introducing the Kusto Query Language, introduces you to KQL and demonstrates how to query data. The chapter begins by introducing the language and then explains the basics of KQL, such as searching, filtering, aggregating, and joining tables. By the end of the chapter, you will know enough KQL to comfortably query data.

Chapter 6, Introducing Time Series Analysis, introduces you to ADX's time series features, beginning by defining what time series analysis is, and then demonstrating how to query your time series data using the make-series operator. Finally, we discuss some of the most important and useful time series functions provided by ADX.

Chapter 7, Identifying Patterns, Anomalies, and Trends in Your Data, builds on the previous chapter by discussing how to detect anomalies and trends in your data. The chapter begins by introducing some of the anomaly detection functions available within ADX and then covers some of ADX's machine learning capabilities.

Chapter 8, Data Visualization with Azure Data Explorer and Power BI, explains and demonstrates how to integrate ADX with Power BI. Power BI is a powerful reporting tool used to share rich graphs and reports. By the end of the chapter, you will know how to integrate ADX with Power BI and how to create reports in Power BI powered by ADX datasets.

Chapter 9, Monitoring and Troubleshooting Azure Data Explorer, teaches you how to monitor your ADX clusters using Azure Monitor and ADX Insights. The chapter teaches you how to configure alerts using KQL and action groups and explains how to troubleshoot issues by enabling the ADX diagnostics and examining those logs using Log Analytics. In the troubleshooting section, we will demonstrate how to troubleshoot and resolve a data ingestion problem.

Chapter 10, Azure Data Explorer Security, discusses how to secure your ADX instances using both identity management and virtual networks with subnet delegation. We begin by explaining why security is important on the public cloud and then we discuss identity management at the management and data plane. Next, we will introduce securing ADX instances using virtual networks and subnet delegation and demonstrate how to filter network traffic using network security groups (NSGs).

Chapter 11, Performance Tuning in Azure Data Explorer, begins by explaining why performance matters, and then discusses the KQL best practices and revisits the ADX architecture to explain how time filtering can provide performance improvements. You will also learn how to monitor the performance of your clusters, queries, and external applications.

Chapter 12, Cost Management in Azure Data Explorer, explains how to plan and manage production deployments. The chapter first covers how to manage your clusters and what requirements to consider when planning your deployment, and finally shows how to estimate your Azure costs.

To get the most out of this book

To get the most out of the book, we recommend that you create an Azure account and take advantage of Microsoft's 30-day free trial to follow along with the practical examples. We will spend most of our time in the Azure portal, Azure Cloud Shell, and the Data Explorer Web UI. We also recommend that you clone the repository to your local machine and use Visual Studio Code to experiment and modify the code samples.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Please remember to turn off/deallocate your resources in Azure to avoid incurring extra charges.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Scalable-Data-Analytics-with-Azure-Data-Explorer. If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Code in Action

The Code in Action videos for this book can be viewed at https://bit.ly/3uw1w2U.

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801078542_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Since we know there is a failed ingestion, the table we are interested in is aptly called FailedIngestion."

A block of code is set as follows:

StormEvents
| where State =~ "California"
| summarize event=count() by EventType
| render columnchart

Any command-line input or output is written as follows:

Get-AzRoleDefinition | Select-Object Name, Description

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Next, click Review + create. Finally, click Create once the validation is complete."

Tips and Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Scalable Data Analytics with Azure Data Explorer, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.

Section 1: Introduction to Azure Data Explorer

This section introduces you to Azure Data Explorer (ADX) by discussing the core features and benefits of ADX, such as low-latency data ingestion, the ADX architecture, and how to quickly deploy your instance of ADX via the Azure portal, PowerShell, and ARM templates. The final chapter of this section presents an overview of the ADX web UI, where you will spend most of your time analyzing your data. By the end of this section, you will understand the core features of ADX, be able to deploy your own ADX instances, and be comfortable navigating and using the ADX web UI. This section sets the foundations for Section 2, where you will begin to ingest and analyze the data.

This section consists of the following chapters:

Chapter 1, Introducing Azure Data Explorer

Chapter 2, Building Your Azure Data Explorer Environment

Chapter 3, Exploring the Azure Data Explorer UI

Chapter 1: Introducing Azure Data Explorer

Welcome to Scalable Data Analytics with Azure Data Explorer! More than 90% of today's data is digital and most of that data is considered unstructured, such as text messages and other forms of free text. So how can we analyze all our data? The answer is data analytics and Azure Data Explorer (ADX). Data analytics is a complex topic and Microsoft Azure provides a comprehensive selection of data analytics services, which can seem overwhelming when you are first starting your journey into data analytics.

In this chapter, we begin by introducing the data analytics pipeline and learning about each of its steps. These steps take raw data and produce reports and visuals as the result of your analysis, and understanding them will help you understand the workflow used by ADX.

Next, we will introduce some of the popular Azure data services and understand where they fit in the data analytics pipeline. Some of these services, such as Azure Event Hubs, will be used in later chapters when we learn about data ingestion.

We will also learn what ADX is, the features that make it a powerful data exploration platform, its architecture and key components, such as the engine cluster, and some of its use cases, for example, IoT monitoring, telemetry, and log analysis. Finally, we will get our feet wet and dive right into running your first Kusto Query Language (KQL) query using the Data Explorer UI.

In this chapter, we are going to cover the following main topics:

Introducing the data analytics pipeline

What is Azure Data Explorer?

Azure Data Explorer use cases

Running your first query

Technical requirements

If you do not already have an Azure account, head over to https://azure.microsoft.com/en-us/free/search/ and sign up. Microsoft provides 12 months of popular free services and $200 of credit, which is enough to cover the cost of our Azure Data Explorer journey through this book. Microsoft also provides a free-to-use cluster (https://help.kusto.windows.net/) that is already populated with data. We will use this free cluster and also create our own clusters throughout this book.
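
Once you can reach the free cluster, you can try a query straight away. The following is a minimal KQL sketch, assuming the free cluster's sample database exposes the StormEvents table used throughout this book:

```
// Preview a handful of rows from the sample StormEvents table
StormEvents
| take 10
```

The take operator returns an arbitrary subset of rows, which makes it a cheap way to confirm you are connected and to inspect a table's columns before writing anything more elaborate.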

Please remember to clone or download the Git repository that accompanies the book from https://github.com/PacktPublishing/Scalable-Data-Analytics-with-Azure-Data-Explorer. All the code and query samples listed in the book are available in our repository. Download the latest version of Git from https://git-scm.com if you have not already installed the command-line tools.

Important Note

When developing and cloning repositories, I create a development folder in my home directory. On Windows, this is C:\Users\jason\development. On macOS, this is /Users/jason/development. When referencing specific code examples, I will refer to the repository's parent directory as ${HOME}, for example, ${HOME}/Scalable-Data-Analytics-with-Azure-Data-Explorer/Chapterxx/file.kql.

Introducing the data analytics pipeline

Before diving into ADX, it is worth spending some time to understand the data analytics pipeline. Whenever I am learning something new that is large and complex in scope, such as data analytics, I break the topic down into smaller chunks to help with learning and measuring my progress. Therefore, an understanding of the various stages of the data analytics pipeline will help you understand how ADX takes raw data and generates reports and visuals as a result of our analytical tasks, such as time series analysis.

Figure 1.1 illustrates the stages of the data analytics pipeline required to take data from a data source, perform some analysis, and produce the result of the analysis in the form of a visual, such as tables, reports, and graphs:

Figure 1.1 – Data analytics pipeline

In the spirit of breaking a complex subject into smaller chunks, let's look at each stage in detail:

Data: The first step in the pipeline is the data sources. In Chapter 4, Ingesting Data in Azure Data Explorer, we will discuss the different types of data. For now, suffice it to say there are three categories of data: structured, semi-structured, and unstructured. Data can range from structured, such as tables, to unstructured, such as free-form text.

Ingestion: Once the data sources have been identified, the data needs to be ingested by the pipeline. The primary purpose of the ingestion stage is to take the raw data, perform some Extract-Transform-Load (ETL) operations to format the data in a way that helps with your analysis, and send the data to the storage stage. The data can be ingested using tools and services such as Apache Kafka, Azure Event Hubs, and IoT Hub. Chapter 4, Ingesting Data in Azure Data Explorer, discusses the different ingestion methods, such as streaming versus batch, and demonstrates how to ingest data using multiple services, such as Azure Event Hubs and Azure Blob storage.

Store: Once ingested, ADX natively compresses and stores the data in a proprietary format. The data is then cached locally on the cluster based on the hot cache settings and phased out of the cluster based on the retention settings. We will discuss these terms a little later in the chapter.

Analyze: At this stage, we can start to query the data, apply machine learning to detect anomalies, and predict trends. We will see examples of anomaly detection and trend prediction in Chapter 7, Identifying Patterns, Anomalies, and Trends in Your Data. In this book, we will perform most of our analysis in the ADX Web UI using Kusto Query Language (KQL).

Visualize: The final stage of the pipeline is visualization. Once you have ingested your data and performed your analysis, chances are you will want to share and present your findings. We will present our findings using the ADX Web UI's dashboards and Power BI.
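
The hot cache and retention settings mentioned in the Store stage are configured with KQL management commands, which are covered in detail later in the book. As a preview, a sketch of what these commands look like (the table name is illustrative, and the exact values should come from your own workload requirements):

```
// Keep the most recent 7 days of data in the hot (local SSD) cache
.alter table StormEvents policy caching hot = 7d

// Retain data for 365 days before it is phased out of the cluster
.alter-merge table StormEvents policy retention softdelete = 365d
```

The caching policy trades query latency against cluster cost, while the retention policy controls how long data is kept at all; tuning both is the subject of the performance and cost management chapters.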

In the next section, we will look at some of the services Azure provides for the different stages of the analytics pipeline.

Overview of Azure data analytics services

You may have noticed that I referenced a few of Azure's data services previously, and you may be wondering what they are used for. Although this book is about Azure Data Explorer, it is worth understanding what some of the common data services are, since some of the services, such as Event Hubs and Blob storage, will be discussed and used in later chapters.

To help map the different data services to the analytics pipeline, Figure 1.2 illustrates an updated pipeline, with the Azure data services mapped to the respective pipeline stages:

Figure 1.2 – Azure data services

Important Note

The list of services depicted in Figure 1.2 is by no means an exhaustive list of Azure data analytics services. For a complete and accurate list, please see https://azure.microsoft.com/en-us/services/#analytics.

The following list provides a short description of the services shown in Figure 1.2:

Event Hubs: This is an event and streaming Platform as a Service (PaaS). Event Hubs allows us to stream data, which we will demonstrate and use in Chapter 4, Ingesting Data in Azure Data Explorer.

Data Factory: This is a PaaS offering that allows us to transform data from one format to another. These transformations are commonly referred to as Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT).

HDInsight: This is a PaaS offering that appears twice in Figure 1.2 and could technically appear in other stages. HDInsight is quite possibly one of the most misunderstood analytics services with regard to what it does. It is a PaaS version of the Hortonworks Hadoop framework, which includes a wide range of ingestion, analytics, and storage services, such as Apache Kafka, Hive, HBase, Spark, and the Hadoop Distributed File System (HDFS).

Azure Data Lake Storage Gen2: This is a storage solution based on Azure Blob storage that implements HDFS.

Blob Storage: This is Azure's object storage service, on which the other storage services are based.

Azure Databricks: This is Azure's PaaS implementation of Apache Spark.

Power BI: Technically not an Azure service, Power BI is a rich reporting product that is commonly integrated with Azure.

You may be wondering where ADX would fit in Figure 1.2. The answer is ingestion, store, analyze, and visualize. In the next section, you will learn how this is possible by understanding what Azure Data Explorer is.

What is Azure Data Explorer?

There is a good chance you have already used ADX to some degree without realizing it. If you have used Azure Security Center, Azure Sentinel, Application Insights, Resource Graph Explorer, or enabled diagnostics on your Azure resources, then you have used ADX. All these services rely on Log Analytics, which is built on top of ADX.

Like many tools and products, ADX began circa 2015 with a small group of engineers trying to solve a problem. Developers on Microsoft's Power BI team needed a high-performing big data solution to ingest and analyze their logging and telemetry data, and, being engineers, they built their own when they could not find a service that met their needs. The result was Azure Data Explorer, also known as Kusto.

So, what is ADX? It is a fully managed, append-only, columnar-store big data service capable of elastic scaling and of ingesting hundreds of billions of records daily!

Before moving on to the ADX features, it is important to understand what is meant by PaaS and the other cloud offerings referred to as as a service. Understanding the different cloud offerings will help with understanding what you and the cloud provider – in our case, Microsoft – are responsible for.

When you strip away the marketing terms, cloud computing is essentially a data center that is managed for you and has the same layers or elements as an on-premises data center, for example, hardware, storage, and networking.

Figure 1.3 shows the common layers and elements of a data center. The items in white are managed by you, the customer, and the items in gray are managed by the cloud provider:

Figure 1.3 – Cloud offerings

In the case of on-premises, you are responsible for everything, from renting the building and ventilation to physical networking and running your applications. Public cloud providers offer three fundamental cloud offerings, known as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). The provider typically offers a lot more services, such as Azure App Service, but these additional services are built on top of the aforementioned fundamental services.

In the case of ADX, which is a PaaS service, Microsoft manages all layers except the data and application layers. You are responsible for the data layer, that is, data ingestion, and the application layer, that is, writing your KQL queries and creating dashboards.

ADX features

Let's look at some of the key features ADX provides. Most of the features will be discussed in detail in later chapters:

Low-latency ingestion and elastic scaling: ADX nodes are capable of ingesting structured, semi-structured, and unstructured data at speeds of up to 200 MBps (megabytes per second). The vertical and horizontal scaling capabilities of ADX enable it to ingest petabytes of data.

Time series analysis: As we will see in Chapter 7, Identifying Patterns, Anomalies, and Trends in Your Data, ADX supports near real-time monitoring, and combined with the powerful KQL, we can search for anomalies and trends within our data.

Fully managed (PaaS): All the infrastructure, operating system patching, and software updates are taken care of by Microsoft. You can focus on developing your product rather than running a big data platform. You can be up and running in three steps:

1. Create a cluster and database (more details in Chapter 2, Building Your Azure Data Explorer Environment).
2. Ingest data (more details in Chapter 4, Ingesting Data in Azure Data Explorer).
3. Explore your data using KQL (more details in Chapter 5, Introducing the Kusto Query Language).

Cost-efficient: Like other Azure services, ADX offers a pay-as-you-consume model. For more advanced use cases, there is also the option of purchasing reserved instances, which require upfront payment.

High availability: Microsoft provides an uptime SLA of 99.9% and supports Availability Zones, which ensure your infrastructure is deployed across multiple physical data centers within an Azure region.

Rapid ad hoc query performance: Due to some of the architecture decisions discussed in the next section, ADX is capable of querying billions of records containing structured, semi-structured, and unstructured data and returning results within seconds. ADX is also designed to execute distributed queries across multiple clusters, which we will see later in the book.

Security: We will cover security in depth in Chapter 10, Azure Data Explorer Security. For now, suffice it to say that ADX supports encryption at rest and in transit and role-based access control (RBAC), and it allows you to restrict public access to your clusters by deploying them into virtual networks (VNets) and blocking traffic using network security groups (NSGs).

Enables custom solutions: ADX allows developers to build analytics services on top of it.
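As a small taste of that ad hoc and distributed query capability, the following KQL sketch addresses a database on another cluster inline using the cluster() and database() functions. It queries the publicly accessible help cluster and its Samples database, which we will use throughout the book; against your own cluster, you would substitute your own names:

```kql
// Cross-cluster query: address another cluster's database inline.
// 'help' is the public demo cluster; Samples is one of its databases.
cluster('help').database('Samples').US_States
| count
```

When you are already connected to the help cluster and the Samples database, the same result comes from simply running US_States | count.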

If you are familiar with database products such as MySQL, MS SQL Server, and Azure SQL, then the core components will be familiar to you. ADX uses the concept of clusters, which can be considered equivalent to an Azure SQL server and are essentially the compute, or virtual machines. Next, we have databases and tables; these concepts are the same as in a SQL database.

Figure 1.4 shows the hierarchical structure that is shown in the Data Explorer UI. In this example, help is the ADX cluster and Samples is the database, which contains multiple tables, such as US_States:

Figure 1.4 – Cluster, database, and tables hierarchy

A cluster or SQL server can host multiple databases, which in turn can contain multiple tables (see Figure 1.4). We will discuss tables in Chapter 4, Ingesting Data in Azure Data Explorer, when we will demonstrate how to create tables and data mappings.
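The hierarchy in Figure 1.4 can also be explored from the query window using KQL's built-in control commands (run one command at a time):

```kql
// List the databases hosted on the connected cluster
.show databases
```

```kql
// List the tables in the currently selected database
.show tables
```

Control commands, which begin with a dot, manage and inspect the cluster rather than query data; we will meet more of them when creating tables.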

Introducing Azure Data Explorer architecture

PaaS services are great because they allow developers to get started quickly and focus on their product rather than managing complex infrastructure. Being fully managed can also be a disadvantage, especially when you experience issues and need to troubleshoot, and as engineers, we tend to be curious and want to understand how things work.

As depicted in Figure 1.5, ADX comprises two key services: the data management service and the engine service. Both services are clusters of compute resources that can be scaled horizontally and vertically, either automatically or manually. At the time of writing, Microsoft had recently (March 2021) announced its V3 engine, which contains some significant performance improvements:

Figure 1.5 – Azure Data Explorer architecture

Now, let's learn more about the data management and the engine service depicted in the preceding diagram:

Data management service: The data management service is primarily responsible for metadata management and managing the data ingestion pipelines. It ensures data is properly ingested and sent to the engine service. Data that is streamed to the cluster is sent to the row store, whereas data that is batched is sent to the column stores.

Engine service: The engine service, which is a cluster of compute resources, is responsible for processing the ingested data, managing the hot cache and long-term storage, and executing queries. Each engine node uses its local SSD as the hot cache and ensures the cache is used as much as possible.

ADX is often referred to as an append-only analytics service, since the data that is ingested is stored in immutable shards and each shard is compressed for performance reasons. Data sharding is a method of splitting data into smaller chunks. Since the data is immutable, the engine nodes can safely read the data shards, knowing they do not have to worry about other nodes in the cluster making changes to the data.

Since the storage and the compute are decoupled, ADX can scale the cluster both vertically and horizontally without worrying too much about data management.

This brief overview of the architecture only scratches the surface; there is a lot more happening, such as the indexing of columns and the maintenance of those indexes. Still, having an overview helps you appreciate what ADX is doing under the hood.

Important Note

I recommend reading the Azure Data Explorer white paper (https://azure.microsoft.com/mediahandler/files/resourcefiles/azure-data-explorer/Azure_Data_Explorer_white_paper.pdf) if you are interested in learning more about the architecture.

Azure Data Explorer use cases

Whenever someone asks what they should focus on when learning how to use Azure, I immediately say KQL. I use KQL daily, from managing cost and inventory to security and troubleshooting. It is not uncommon for relatively small environments to generate hundreds of GB of data per day, such as infrastructure diagnostics, Azure Resource Manager (ARM) audit logs, user audit logs, application logs, and application performance data. This may seem small in the grand scheme of things when, in 2021, the world is generating quintillions of bytes of data per day, but it is still enough data to require a dedicated service such as ADX to analyze it.

IoT monitoring and telemetry

Look around at your environment: how many appliances and devices can you see that are connected to the network? I see light bulbs, sensors, thermostats, and fire alarms, and there are billions of Internet of Things (IoT) devices in the world, all of which are constantly generating data. Together with Azure's IoT services, ADX can ingest these high volumes of data and enable us to monitor our things and perform complex time series analysis, so that we can identify anomalies and trends in our data.
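As a hedged sketch of what such monitoring can look like in KQL (the table and column names here are hypothetical), the following query bins a week of device telemetry into an hourly time series per device and flags anomalous readings with KQL's built-in decomposition function:

```kql
// Hypothetical IoT telemetry table: SensorTelemetry(Timestamp, DeviceId, Temperature)
SensorTelemetry
| where Timestamp > ago(7d)
// Build one hourly temperature series per device
| make-series AvgTemperature = avg(Temperature) default = 0
    on Timestamp step 1h by DeviceId
// Flag anomalies against the decomposed seasonal baseline
| extend (Anomalies, Scores, Baseline) =
    series_decompose_anomalies(AvgTemperature)
```

We will work through make-series and anomaly detection in detail in Chapter 7, Identifying Patterns, Anomalies, and Trends in Your Data.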

Log analysis

Imagine this scenario: you have just performed a lift-and-shift migration of your on-premises product to Azure, and since the application is not a true cloud-native solution, you are constrained in which Azure services you can use, load balancing being one example. Azure Application Gateway, which is a load-balancing service, supports cookie-based session affinity, with the cookies completely managed by Application Gateway. The application we migrated to Azure required specific values to be written in the cookie, which is not possible with the current version of Application Gateway, so we used HAProxy running on Linux virtual machines. The security team requires all products to support only TLS 1.2 and above. The problem is that not all of our clients support TLS 1.2, and if we simply disabled TLS 1.0 and 1.1, we would essentially break the service for those clients, which we do not want to do. Add to the equation the server-side product, which is distributed across 15 Azure regions worldwide, with each region containing hundreds of HAProxy servers and no central logging! How can we analyze all this data to identify the clients that are not using TLS 1.2? The answer is Kusto.

We ingested the HAProxy log files and used KQL to analyze the log files and capture insights on TLS versioning and cipher information in seconds. With the queries, we were able to build near real-time dashboards for the support teams so they could reach out to clients and inform them when they would need to upgrade their software. With these insights, we were able to coordinate the TLS deprecation activities and execute them with no customer impact.
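To give a flavor of those queries (the table and column names below are illustrative, not the ones from the real project), a TLS version report along these lines can be produced with a single summarize:

```kql
// Illustrative schema: HAProxyLogs(Timestamp, ClientIp, TlsVersion, Cipher)
HAProxyLogs
| where Timestamp > ago(1d)
| where TlsVersion in ('TLSv1', 'TLSv1.1')   // clients still on legacy TLS
| summarize Requests = count() by ClientIp, TlsVersion, Cipher
| order by Requests desc
```

A query like this identifies exactly which client IPs are still negotiating legacy TLS versions, and pinning it to a dashboard gives support teams the near real-time view described above.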

Most of the examples in this book focus on logging scenarios, and in Chapter 7, Identifying Patterns, Anomalies, and Trends in Your Data, we will learn about ADX's time series analysis features to identify patterns, anomalies, and trends in our data.

Running your first query