


Business Intelligence with Databricks SQL

Concepts, tools, and techniques for scaling business intelligence on the data lakehouse

Vihag Gupta

BIRMINGHAM—MUMBAI

Business Intelligence with Databricks SQL

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Reshma Raman

Publishing Product Manager: Heramb Bhavsar

Senior Editor: Nazia Shaikh

Technical Editor: Sweety Pagaria

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Rekha Nair

Production Designer: Vijay Kamble

Marketing Coordinator: Nivedita Singh

First published: September 2022

Production reference: 1260822

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80323-533-2

www.packt.com

This book came into life with the eternal blessings from Dada Ji & Dadi Ji & Love from Mom, Dad, Utkarsh, Suteja, Toffee & Love for the One who is arriving.

– Vihag Gupta

Contributors

About the author

Vihag Gupta is a solutions architect with a specialization in cloud data platform architecture and design. He has a background in data engineering and a professional interest in machine learning. He loves getting hands-on and solving real business problems with technology. He graduated with a degree in information technology from PES University, Bengaluru, in 2011 and earned a degree in information systems management from Carnegie Mellon University, Pittsburgh, in 2016. He has worked at companies including Deloitte Consulting, DataSpark, and Qubole. He currently works at Databricks, helping clients bring their lakehouse platforms for analytics to life.

Originally from Jharkhand, India, Vihag currently lives in Singapore with his wife and dog.

About the reviewers

Jerrold Law currently works at Databricks as a specialist solutions architect, specializing in data engineering and streaming. He helps organizations all over the world to adopt scalable big data processing use cases with the Lakehouse architecture. Prior to Databricks, he worked for Confluent, helping enterprise organizations to enable their real-time streaming use cases as part of the solutions engineering field team.

Ivan Tang is a solutions architect at Databricks, helping enterprises in the ASEAN region in their cloud transformation journey. Previously, he worked with Confluent Professional Services, where he helped enterprises from the APAC region to design, build, operate, and maintain event streaming platforms using Apache Kafka and Confluent. Before that, he wrangled data and built models as an architect, builder, and user of distributed systems, including Hadoop, Storm, Kafka, and Spark. Earlier in his career, he worked for Confluent, GIC (Singapore’s sovereign wealth fund), and other firms.

With over 15 years of experience working as a solutions architect for some of the largest and most innovative companies in the world, including Amazon, Bridgewater, and Bloomberg, Tomasz Kurzydym is a strong supporter and advocate for constant learning, experimentation, and curiosity. His work has taken him from his native country of Poland to the USA and UK, and he now lives with his family in Singapore working for Databricks – a data and AI company. In his spare time, Tomasz enjoys traveling, skiing, and photography.

Mahdi Askari is a lead solutions architect at Databricks. He is a software engineer at heart, and has been working in the data and big data space since 2010. He has worked across different industries with customers and prospects using Databricks Lakehouse to derive insight from their data.

Roja Boina is an engineering senior advisor. She provides data-driven solutions to enterprises. She takes pride in her get-it-done attitude. She enjoys defining requirements, designing, developing, testing, and delivering backend applications, and communicating data through visualizations. She is very passionate about being a Women in Tech (WIT) advocate and being a part of WIT communities. Outside of her 9-5, she loves to volunteer for non-profits and mentor fellow women in STEM. She believes in having a growth mindset.

Table of Contents

Preface

Part 1: Databricks SQL on the Lakehouse

1

Introduction to Databricks

Technical requirements

An overview of Databricks, the company

An overview of the Lakehouse architecture

An overview of the Databricks Lakehouse platform

Summary

2

The Databricks Product Suite – A Visual Tour

Technical requirements

Basic navigation with the sidebar

The top of the sidebar

The bottom of the sidebar

The middle of the sidebar

The SQL persona view

The Machine Learning persona view

The Data Science and Engineering persona view

Summary

3

The Data Catalog

Technical requirements

Understanding the data organization model in Databricks SQL

Apache Hive Metastore

Unity Catalog

Implications of choosing a cataloging technology

An example of the data organization model

Exploring data visually with the Data Catalog

Exploring the data programmatically with SQL statements

Summary

4

The Security Model

Technical requirements

The Databricks SQL security model

Access control with Apache Hive Metastore

Access control with Unity Catalog

Query execution model

User-facing table access control

Users, groups, and service principals

Securable objects

Operations

Privileges

Bringing everything together

The security model in practice

Ownership

Sharing the database

Exploring the database

Exploring asset metadata

Revoking access

Denying access

Going beyond read access – part 1

Going beyond read access – part 2

Going beyond read access – part 3

Summarizing the security model

UI-based user-facing table access control

The internals of cloud storage access

Cloud storage access in Microsoft Azure

Cloud storage access in Amazon Web Services

Summary

5

The Workbench

Technical requirements

Working with queries

Developing queries

Visualizing query results

Creating and publishing dashboards

Administering and governing artifacts

Summary

6

The SQL Warehouses

Technical requirements

Understanding the SQL Warehouse architecture

Creating and configuring SQL Warehouses

Cluster size

Scaling

Spot instance policy

Auto Stop

The art of SQL Warehouse sizing

Rules for query routing, queuing, and cluster autoscaling

Sizing the SQL Warehouse

Organizing and governing SQL Warehouses

SQL Warehouse assignment strategy

Access control in SQL Warehouses

Chargeback

Using Serverless SQL

Summary

7

Using Business Intelligence Tools with Databricks SQL

Technical requirements

Connecting from validated BI tools

SQL Warehouse details

Authentication details

Connecting from non-validated BI tools

Step 1 – download the driver

Step 2 – install the driver

Step 3 – configure the SQL Warehouse connection

Connecting programmatically

Databricks Partner Connect

The internals of a data source file

Summary

Part 2: Internals of Databricks SQL

8

The Delta Lake

Technical requirements

Fundamentals of the Delta Lake storage format

Data engineering before Delta Lake

The Delta Lake storage format

Data engineering after Delta Lake

Built-in performance-boosting features of Delta Lake

Automatic statistics collection

Automatic Compaction and Optimized Writes

Automatic caching

Configurable performance-boosting features of Delta Lake

Z-ordering

Bloom filter indexes

CACHE SELECT

Summary

9

The Photon Engine

Technical requirements

Understanding Photon Engine

What is Photon?

The Apache Spark execution model

Understanding vectorization

Volcano model

Code generation

Vectorization

Discussing the Photon product roadmap

Summary

Further reading

10

Warehouse on the Lakehouse

Technical requirements

Organizing data on the Lakehouse

Components of a warehouse system

The Medallion architecture

Implementing data modeling techniques

The bronze layer

The silver layer

The gold layer

Summary

Part 3: Databricks SQL Commands

11

SQL Commands – Part 1

Technical requirements

Working with data definition language commands

DDL for catalogs

DDL for external locations

DDL for Delta Sharing

Working with data manipulation language commands

MERGE INTO

COPY INTO

Working with the inbuilt functions in Databricks SQL

JSON

Lambda functions

Summary

12

SQL Commands – Part 2

Technical requirements

Working with Delta Lake maintenance commands

Vacuuming your Delta Lake

Time-traveling in your Delta Lake

Repairing your Delta Lake

Optimizing your Delta Lake

Working with data security commands

Dynamic view functions

Controlling access to columns

Controlling access to rows

Working with metadata commands

Listing data assets

Describing data assets

Analyzing Delta tables

Summary

Part 4: TPC-DS, Experiments, and Frequently Asked Questions

13

Playing with the TPC-DS Dataset

Technical requirements

Understanding the TPC-DS dataset

Generating TPC-DS data

Building the spark-sql-perf library

Installing the spark-sql-perf library

Creating a data generation cluster

Importing the spark-sql-perf repository

Running the data generation notebook

Running automated benchmarks

Experimenting with TPC-DS in Databricks SQL

Case study 1 – the effect of file formats

Case study 2 – the effect of specialized data types

Case study 3 – the effect of NULLs

Case study 4 – ZORDER and partitions

Case study 5 – Bloom filter indexes

Summary

14

Ask Me Anything

Frequently asked questions

Summary

Index

Other Books You May Enjoy

Preface

It is a new era in the design of data platform systems. Disparate data lakes and data warehouses are giving way to a new type of data platform system – the lakehouse. It promises to unify all data analytics into a single platform. Databricks, with its Databricks SQL product suite, is the hottest lakehouse platform out there. It harnesses the power of Apache Spark™, Delta Lake™, and other innovations that enable data warehousing capabilities on the lakehouse with data lake economics.

This book is a comprehensive hands-on guide that lets you explore all the advanced features, use cases, and technology components of Databricks SQL. You will start with the fundamentals of the lakehouse architecture and how Databricks SQL fits into it. Next, you will learn how to use the platform – exploring data, executing queries, and building reports and dashboards. Moving on, you will learn about the administrative aspects of the lakehouse – data security, governance, and managing the computation power of the lakehouse. You will delve into the core technology enablers of Databricks SQL – Delta Lake™ and Photon. Finally, you will get hands-on with advanced SQL commands for ingesting data and maintaining the lakehouse.

By the end of this book, you will have mastered Databricks SQL and be able to deploy and deliver fast, scalable business intelligence on the lakehouse.

Who this book is for

This book is for business intelligence practitioners, data warehouse administrators, and data engineers who are new to Databricks SQL and want to learn how to deliver high-quality insights, unhindered by the scale of data or infrastructure. This book is also perfect for anyone who wants to study the advanced technologies that power Databricks SQL.

Basic knowledge of data warehouses, SQL-based analytics, and, optionally, ETL processes is recommended to effectively learn the concepts introduced in this book and appreciate the innovation in the platform.

What this book covers

Chapter 1, Introduction to Databricks, introduces Databricks along three dimensions. First, it will introduce Databricks, the company. Second, it will introduce the Data Lakehouse architecture – the core data platform design pattern enabled by Databricks. Third, it will introduce the Databricks Lakehouse Platform. Essentially, this is the platform that Databricks provides for your organization to implement the data lakehouse architecture.

Chapter 2, The Databricks Product Suite – A Visual Tour, presents a visual tour of Databricks SQL and the rest of the Databricks platform. It will teach you how to navigate the platform and locate features of interest with ease.

Chapter 3, The Data Catalog, introduces the data catalog of the Databricks Lakehouse platform. It will teach you how the data objects – catalogs, schemas, tables, and views – are represented in the data catalog. Finally, it will teach you how to navigate and explore the data catalog with UI interfaces and SQL commands. Generated and populated by data engineers and consumed by data analysts, the data catalog is the central pillar of all your data operations.

Chapter 4, The Security Model, discusses the Databricks data security model and teaches how to use it to secure the data. Databricks provides a very fine-grained, yet easily programmable data security model to secure all data and data-related assets.

Chapter 5, The Workbench, introduces the Databricks workbench. The workbench is a set of capabilities that enable a simple, intuitive, and intelligent experience in query building and dashboarding. The Databricks SQL workbench provides users on the unified lakehouse platform an instant way to query the data and extract insights from it.

Chapter 6, The SQL Warehouses, introduces the compute power behind Databricks SQL. SQL Warehouses provide the elastic, scalable compute power that can execute Business Intelligence (BI) queries with ease, no matter the scale of the data. The cloud philosophy says storage and compute power should scale independently so that we can drive the maximum Return on Investment (ROI). This is exactly what the SQL Warehouses in Databricks SQL do.

Chapter 7, Using Business Intelligence Tools with Databricks SQL, teaches you how to connect your business intelligence tool of choice to Databricks SQL. This allows you to harness the power of Databricks SQL from the comfort of your favorite business intelligence tool.

Chapter 8, The Delta Lake, deep dives into the default storage format of Databricks – Delta Lake. It adds a layer of transactional intelligence to the otherwise simple data lake. This chapter will discuss the Delta Lake storage format and how it enables superior out-of-the-box query performance.

Chapter 9, The Photon Engine, deep dives into the Photon engine. It is the query engine that powers Databricks SQL. It is written from the ground up in native C++ and uses the Apache Spark API. This chapter deep dives into what makes Photon so fast.

Chapter 10, Warehouse on the Lakehouse, addresses one of the biggest mental leaps that must be taken when adopting the data lakehouse architecture. This chapter discusses how to implement popular warehousing patterns on the lakehouse.

Chapter 11, SQL Commands – Part 1, introduces Databricks-specific SQL commands that are used for data definition and data manipulation operations.

Chapter 12, SQL Commands – Part 2, introduces Databricks-specific SQL commands that are used for data security and metadata operations.

Chapter 13, Playing with the TPC-DS Dataset, introduces the TPC-DS dataset. It is a popular dataset for benchmarking decision support systems such as data warehouses. The chapter shows how to generate the TPC-DS dataset in Databricks and test the various concepts learned in the previous chapters at scale.

Chapter 14, Ask Me Anything, presents and answers the frequently asked questions about Databricks SQL.

To get the most out of this book

This book is about Databricks SQL as an enabler for your business intelligence practice. Hence, the book assumes knowledge of standard SQL and business intelligence concepts. Basic knowledge of data warehouses and data lakes is recommended, but it is not mandatory. Some chapters, such as Chapter 8, The Delta Lake, and Chapter 9, The Photon Engine, assume some familiarity with Apache Spark and some data engineering constructs. That said, they are optional, deep-dive chapters and do not affect your learning of Databricks SQL.

Software/hardware covered in the book | Operating system requirements
Databricks SQL | Windows, macOS, or Linux

Databricks SQL is a Platform-as-a-Service (PaaS) offering and is accessible via any modern internet browser. It does not require any installation on your machine.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Business-Intelligence-with-Databricks-SQL-Analytics. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/vXWLg.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”

A block of code is set as follows:

OPTIMIZE table_name [WHERE predicate] [ZORDER BY (col_name1 [, ...])]

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “To do so, we can click on the SQL Editor icon on the left-hand sidebar to bring up the SQL Editor page.”

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Business Intelligence with Databricks SQL, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Part 1: Databricks SQL on the Lakehouse

This part focuses on the features and functions of the Databricks SQL product suite, which enables the day-to-day workflows of business intelligence practitioners and warehouse administrators.

This part comprises the following chapters:

Chapter 1, Introduction to Databricks
Chapter 2, The Databricks Product Suite – A Visual Tour
Chapter 3, The Data Catalog
Chapter 4, The Security Model
Chapter 5, The Workbench
Chapter 6, The SQL Warehouses
Chapter 7, Using Business Intelligence Tools with Databricks SQL

1

Introduction to Databricks

Databricks is one of the most recognizable names in the big data industry. They are the providers of the lakehouse platform for data analytics and artificial intelligence (AI). This book is about Databricks SQL, a product within the Databricks Lakehouse platform that powers data analytics and business intelligence.

Databricks SQL is a rapidly evolving product. It is not a traditional data warehouse, yet its users are the traditional data warehouse and business intelligence users. It claims to provide all the functionality of data warehouses on what is essentially a data lake. This concept can be a bit jarring. It can create resistance to adoption, as you might wonder whether your skills are transferable or whether your work might be disrupted by a new learning curve.

Hence, I am writing this book.

The primary intent of this book is to help you learn the fundamental concepts of Databricks SQL in a fun, follow-along interactive manner. My aim is that by the time you complete this book, you will be confident in your adoption of Databricks SQL as the enabler of your business intelligence.

This book does not intend to be a definitive guide or a complete reference, nor does it intend to be a replacement for the official documentation. It is too early for either of those. This book is your initiation into business intelligence on the data lakehouse, the Databricks SQL way.

Let’s begin!

In this chapter, we’ll cover the following topics:

An overview of Databricks, the company
An overview of the Lakehouse architecture
An overview of the Databricks Lakehouse platform

Technical requirements

There are no technical requirements for this chapter. However, familiarity with the concepts of databases, data warehouses, and data lakes will help.

An overview of Databricks, the company

Databricks was founded in 2013 by seven researchers at the University of California, Berkeley.

This was the time when the world was learning how the Meta, Amazon, Netflix, Google, and Apple (MANGA) companies had built their success by scaling up their use of AI techniques in all aspects of their operations. Of course, they could do this because they invested heavily in talent and infrastructure to build their data and AI systems. Databricks was founded with the mission to enable everyone else to do the same – use data and AI in service of their business, irrespective of their size, scale, or technological prowess.

The mission was to democratize AI. What started as a simple platform, leveraging the open source technologies that the co-founders of Databricks had created, has now evolved into the lakehouse platform, which unifies data, analytics, and AI in one place.

As an interesting side note, and my opinion: To this day, I meet people and organizations that equate Databricks with Apache Spark. This is not correct. The platform indeed debuted with a cloud service for running Apache Spark. However, it is important to understand that Apache Spark was the enabling technology for the big data processing platform. It was not the product. The product is a simple platform that enables the democratization of data and AI.

Databricks is a strong proponent of the open source community. A lot of popular open source projects trace their roots to Databricks, including MLflow, Koalas, and Delta Lake. The profile of these innovations demonstrates the commitment to Databricks’s mission statement of democratizing data and AI. MLflow is an open source technology that enables machine learning (ML) operations or MLOps. Delta Lake is the key innovation that brings reliability, governance, and simplification to data engineering and business intelligence operations on the data lake. It is the key to building the lakehouse on top of cloud storage systems such as Amazon Web Services’ Simple Storage Service (S3), Microsoft Azure’s Azure Data Lake Storage (ADLS), and Google Cloud Storage (GCS), as well as on-premises HDFS systems.

Within the Databricks platform, these open source technologies are firmed up for enterprise readiness. They are blended with platform innovations for various data personas such as data engineers, data scientists, and data analysts. This means that MLflow within the Databricks Lakehouse platform powers enterprise-grade MLOps. Delta Lake within the Databricks Lakehouse platform powers enterprise-grade data engineering and data governance. With the Databricks SQL product, the Databricks Lakehouse platform can power all the business intelligence needs for the enterprise as well!

Technologies and Trademarks

Throughout this book we will refer to trademarked technologies and products. Some notable examples are Apache Spark™, Hive™, Delta Lake™, Power BI™, Tableau™, and others that are mentioned in passing.

All such trademarks are implied whenever we mention them in the book. For the sake of brevity and readability, I will omit the use of the ™ symbol in the rest of the book.

An overview of the Lakehouse architecture

If, at this point, you are a bit confused by so many terms such as Databricks, lakehouse, Databricks SQL, and more – worry not. We are just at the beginning of our learning journey. We will unpack all of these throughout this book.

First, what is Databricks?

Databricks is a platform that enables enterprises to quickly build their Data Lakehouse infrastructure and enable all data personas – data engineers, data scientists, and business intelligence personnel – in their organization to extract and deliver insights from the data. The platform provides a curated experience for each data persona, enabling them to execute their daily workflows. The foundational technologies that enable these experiences are open source – Apache Spark, Delta Lake, MLflow, and more.

So, what is the Lakehouse architecture and why do we need it?

The Lakehouse architecture was formally presented at the Conference on Innovative Data Systems Research (CIDR) in January 2021. You can download it from https://databricks.com/research/lakehouse-a-new-generation-of-open-platforms-that-unify-data-warehousing-and-advanced-analytics. This is an easily digestible paper that I encourage you to read for the full details. That said, I will now summarize the salient points from this paper.

Attribution, Where it is Due

In my summary of the said research paper, I am recreating the images that were originally provided. Therefore, they are the intellectual property of the authors of the research paper.

According to the paper, most of the present-day data analytics infrastructures look like a two-tier system, as shown in the following diagram:

Figure 1.1 – Two-tier data analytics infrastructures

In this two-tier system, first, data from source systems is brought onto a data lake. Examples of source systems could be your web or mobile application, transactional databases, ERP systems, social media data, and more. The data lake is typically an on-premises HDFS system or cloud object storage. Data lakes allow you to store data in big data-optimized file formats such as Apache Parquet, ORC, and Avro. The use of these open file formats enables flexibility in writing to the data lake (due to schema-on-read semantics). This flexibility enables faster ingestion of data, which, in turn, enables faster access to data for end users. It also enables more advanced analytics use cases in ML and AI.

Of course, this architecture still needs to support the traditional BI workloads and decision support systems. Hence, a second process, typically in the form of Extract, Transform, and Load (ETL), is built to copy data from the data lake to a dedicated data warehouse.

Close inspection of the two-tier architecture reveals several systemic problems:

Duplication of data: This architecture requires the same data to be present in two different systems. This results in an increased cost of storage. Constant reconciliation between these two systems is of utmost importance. This results in increased ETL operations and their associated costs.
Security and governance: Data lakes and data warehouses have very different approaches to the security of data. This results in different security mechanisms for the same data that must always be in synchronization to avoid data security violations.
Latency in data availability: In the two-tier architecture, the data is only moved to the warehouse by a secondary process, which introduces latency. This means analysts do not get access to fresh data. This also makes it unsuitable for tactical decision support such as operations.
Total cost of ownership: Enterprises end up paying double for the same data. There are two storage systems, two ETL processes, two engineering debts, and more.

As you can see, this is unintuitive and unsustainable.

Hence, the paper presents the Lakehouse architecture as the way forward.

Simply put, the data lakehouse architecture is a data management system that implements all the features of data warehouses on data lakes. This makes the data lakehouse a single unified platform for business intelligence and advanced analytics.

This means that the lakehouse platform will implement data management features such as security controls, ACID transaction guarantees, data versioning, and auditing. It will implement query performance features such as indexing, caching, and query optimizations. These features are table stakes for data warehouses. The Lakehouse architecture brings these features to you in the flexible, open format data storage of data lakes. A Lakehouse is a platform that provides data warehousing capabilities and advanced analytics capabilities on the same platform, with cloud data lake economics.

What is the Formal Definition of the Lakehouse?

Section 3 in the CIDR paper officially defines the Lakehouse. Check it out.

The following is a visual depiction of the Lakehouse:

Figure 1.2 – Lakehouse architecture

The idea of the Lakehouse is deceptively simple – as all good things in life are! The Lakehouse architecture immediately solves the problems we highlighted about present-day two-tier architectures:

A single storage layer means no duplication of data and no extra effort to reconcile data. Reduced ETL requirements and ACID guarantees equate to the stability and reliability of the system.
A single storage layer means a single model of security and governance for all data assets. This reduces the risk of security breaches.
A single storage layer means the availability of the freshest data possible for the consumers of the data.
Cheap cloud storage with elastic, on-demand cloud compute reduces the total cost of ownership.
Open source technologies in the storage layer reduce the chances of vendor lock-in and make it easy to integrate with other tools.

Of course, any implementation of the Lakehouse will have to ensure the following:

Reliable data management: The Lakehouse proposes to eliminate (or reduce) data warehouses. Hence, the Lakehouse implementation must efficiently implement data management and governance – features that are table stakes in data warehouses.
SQL performance: The Lakehouse will have to provide state-of-the-art SQL performance on top of the open-access filesystems and file formats typical in data lakes.

This is where the Databricks Lakehouse platform, and within it, the Databricks SQL product, comes in.

An overview of the Databricks Lakehouse platform

The Databricks Lakehouse platform enables enterprises to build their Lakehouse by providing simplified data engineering and data management techniques. The Databricks Lakehouse platform also provides one of the best ML experiences for data scientists and ML engineers.

Finally, Databricks SQL brings in the last piece of the puzzle – a home for the business intelligence and data analyst personas with a first-class workbench that allows query editing, building visualizations, and publishing dashboards. It also allows plug-and-play with downstream business intelligence tools such as Power BI, Tableau, Looker, and more. All of this is backed by state-of-the-art SQL query performance.

The following diagram represents the Databricks Lakehouse platform:

Figure 1.3 – The Databricks Lakehouse platform

The lakehouse platform by Databricks is a simple, open, and collaborative platform that combines the reliability, performance, and governance capabilities of data warehouses with the openness, flexibility, and economies of cloud data lakes.

Summary

In this chapter, we learned about Databricks as a company and the Databricks Lakehouse platform as the product of this company, which enables the democratization of data and AI for all organizations. We are now ready to begin exploring Databricks SQL.

In the next chapter, Chapter 2, The Databricks Product Suite – A Visual Tour, we will start with a tour of the Databricks Lakehouse platform.

2

The Databricks Product Suite – A Visual Tour

The Databricks Lakehouse platform is an expansive product that caters to personas such as data engineers, data scientists, and business analysts. It also caters to administrator personas such as platform administrators and database administrators. As we discuss core concepts in subsequent chapters, we will have to refer to other persona-specific features. Familiarity with the platform will allow us to navigate to the correct feature or function when it is referenced.

In this chapter, we will take a visual tour of the Databricks Lakehouse platform, with an emphasis on the Databricks SQL product. We will use the official Lakehouse infographic from Figure 1.3 as our guiding light to navigate through the Databricks Lakehouse platform and introduce, at a very high level, the features and functionalities and how they fit together.

We will use screenshots throughout, and you are encouraged to navigate accordingly in your instance of Databricks for the best orientation possible. Please note that the Databricks platform is regularly updated and refined, so the visuals may have evolved by the time you read this book. Once we have this orientation and context, we will be better placed to dive into the details of the Databricks SQL product suite.

In this chapter, we will cover the following topics:

Basic navigation with the sidebar
The SQL persona view
The Machine Learning persona view
The Data Science and Engineering persona view

Technical requirements

To get the most out of this chapter, you will need to have access to the following resources:

A Databricks workspace. Databricks SQL requires a Premium or higher level of subscription. Databricks SQL is currently available on Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). Provisioning a Databricks workspace is beyond the scope of this book. You can get started here: https://databricks.com/try-databricks.
Any modern internet browser. Databricks workspaces are accessed via browsers. Command-line interface and REST API-based access is also possible, but these are beyond the scope of this book. You can explore the CLI and REST API at your own pace by following the guide at https://docs.databricks.com/reference/command.html.

Note

The documentation links contain a switcher that allows you to navigate to the documentation links for the cloud of your choice. The core functionality of Databricks SQL is the same across all clouds. I will call out any cloud-specific nuances as required.

Basic navigation with the sidebar

In this section, we will learn how to navigate the Databricks platform using the left-hand sidebar. The left-hand sidebar contains links to all the user-facing features and functions on the platform.

First things first, we must log in to our Databricks platform. Whether you created a Databricks account for yourself or are using a Databricks account that your organization uses, you should have a workspace URL. Visit the workspace URL and log in with your credentials. For Azure Databricks users, the URL should be of the https://adb-<an_alphanumeric_string>.azuredatabricks.net/ form. For Databricks on AWS, the URL should be of the https://dbc-<an_alphanumeric_string>.cloud.databricks.com/ form. Finally, for Databricks on GCP, the URL should be of the https://<an_alphanumeric_string>.gcp.databricks.com/ form.

Upon successfully logging in, you should see a sidebar on the left-hand side. This sidebar is the key to navigating the Databricks platform. The sidebar can be divided into three main sections. Let’s have a look.

The top of the sidebar

When a Databricks workspace is launched, the user lands on their pinned persona experience. There are three persona experiences:

Data Science & Engineering
Machine Learning
SQL

Each of these persona experiences enables access to tools and features relevant to that persona’s daily workflow. The top of the sidebar primarily allows you to switch between these persona experiences. The following screenshot shows the persona dropdown:

Figure 2.1 – The persona experience dropdown

Note

The landing page may differ based on how you have gained access to the Databricks workspace. If you have created a new trial account on either AWS or Azure, you will land on the Data Science & Engineering experience. If you are part of an organization that already has Databricks, you may have to ask them for access to the SQL experience. Once enabled, you can pin it so that you always land on the SQL experience upon logging in.

Since we will be primarily working with the SQL experience, it is a good idea to go ahead and pin the SQL persona experience as our default landing page.

The bottom of the sidebar

Let’s discuss the options available at the bottom of the left-hand sidebar, as shown in the following screenshot:

Figure 2.2 – The Settings section

As shown in the preceding screenshot, the following options are available:

Partner Connect: This provides a portal where you can select the partner tool or solution you want to connect to. Partner Connect simplifies and accelerates the integration. We will cover this in Chapter 7, Using Business Intelligence Tools with Databricks SQL.
Help: This provides navigation to the documentation, release notes, service status page, knowledge base, and more.
Settings: This provides navigation to various management capabilities, depending on your privilege level:
User Settings: This is where you can generate access tokens, toggle visual settings, and configure personal Git integrations. We will cover this in future chapters when we discuss programmatic access to Databricks SQL.
Admin Console: This is where users and groups can be created. It also provides access to self-service customization of the Databricks workspace. Any discussion on this is beyond the scope of this book, so you are encouraged to go through the options with the help of the documentation from the Help Center (https://help.databricks.com/).
Manage Account: This leads to global settings that can be used to manage all the workspaces under your Databricks account. A discussion on the enterprise deployment architecture of Databricks is beyond the scope of this book. However, as a quick note – once you’ve purchased Databricks, you can create an account, and within that account, you can launch any number of workspaces. Workspaces can be aligned to projects, teams, sandboxes, business units, or any logical division that the organization may employ. You can learn more about enterprise deployment patterns at https://docs.databricks.com/getting-started/overview.html.
Workspace selector: This is the part in Figure 2.2 that says databricks-sql-book. This allows the user to seamlessly switch between various workspaces that they are part of.

The middle of the sidebar

At this point, you must be wondering why we skipped the middle of the sidebar. This is because the middle of the sidebar is where the persona experience really comes in. The options that we see in this section of the sidebar are dependent on the persona we are currently in. Hence, we will explore it as we deep dive into the different personas. We will start with the SQL persona and experience.

The SQL persona view

The SQL persona is the focus of this book, so let’s start with that first. The following screenshot shows the home page of the SQL persona. The now-familiar left-hand sidebar should be visible, with SQL persona-specific options in the middle section. The landing page itself should show quick links to daily workflow features such as Query Editor, BI Connectors, and Data Explorer. Once the Databricks SQL product has seen some usage, links to recent work such as queries, dashboards, and others will also be populated on the main page for fast access:

Figure 2.3 – The SQL persona home page

Note for New Workspaces

The interface shown in the preceding screenshot has an additional sidebar called Get started. You will see this if this is the first time Databricks SQL has been accessed in the workspace. Depending on the user’s privileges, certain options will be provided to help them get started with Databricks SQL from an admin or end user perspective.

Feel free to explore the guided Get started journeys; we promise that those journeys will be explored in detail in subsequent chapters.

Recall the Lakehouse infographic that we introduced in Chapter 1, Introduction to Databricks (Figure 1.3). To begin the visual tour, let’s start from the Open Data Lake layer with Delta Lake on top of it for Data Management & Governance. This is highlighted in the following diagram:

Figure 2.4 – The data layer

The data in the Lakehouse is sourced from various sources, including structured, relational data sources, semi-structured data sources, unstructured data sources, and streaming data sources. This is the Open Data Lake layer. This is powered by the Delta Lake technology (https://delta.io/), which abstracts the nuances and complexity of handling data on cloud object stores such as AWS S3, Azure ADLS, and GCP GCS.

The Data Explorer, which we will cover in the following subsections, is a visual tool that we can use to discover the data in the lakehouse platform.

Data Explorer

To navigate to Data Explorer, click the Data icon on the left-hand sidebar to bring up the Data Explorer page. The following screenshot shows a section of the Data Explorer page:

Figure 2.5 – The Data Explorer page

The Data Explorer page has a three-level hierarchy – the catalog, the database, and the tables and views within the database. Here, database is synonymous with schema. Traditional databases or warehouses have a hierarchy of database.schema.table. However, in Databricks, the top level is a container of schemas, hence we have catalog.database/schema.table. Data Explorer lets us visually explore these objects. For example, in the preceding screenshot, we have navigated to the nyctaxi database in samples, where we are viewing the trips table.
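The same three-level namespace is used when you reference objects programmatically. As a minimal sketch, assuming the samples catalog shown in the preceding screenshot is available in your workspace, the following statements explore it with SQL (we will cover this in detail in Chapter 3, The Data Catalog):

SHOW SCHEMAS IN samples;
SHOW TABLES IN samples.nyctaxi;
DESCRIBE TABLE samples.nyctaxi.trips;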

Note on Incidental Charges

Upon opening the Data Explorer page, the catalog will be set to Hive Metastore by default and the database will be set to default by default.

Databricks provides sample datasets for getting started. They are available in the samples catalog. The preceding screenshot shows how to explore the sample data.

Bear in mind that at the time of writing, Data Explorer requires a SQL Warehouse to be up and running. This will incur cloud costs, including Databricks costs and cloud compute costs (AWS EC2 or Azure VM). This will change soon with the introduction of a new cataloging technology called Unity Catalog. We will discuss Unity Catalog in Chapter 3, The Data Catalog.

Data Explorer is the portal to visually exploring the data in a Lakehouse. As we can also see, it provides a visual view of data management and governance as well – with Access Control Lists (ACLs), comments, and more. We will explore these in detail in Chapter 3, The Data Catalog.

SQL Warehouses

Data Explorer provides features and functions for visually exploring the data available to analysts. Since Databricks leverages the economies and elasticity of the cloud, the storage and compute layers are decoupled. This means that the data in the Lakehouse (visible in Data Explorer) is sitting on the cloud object storage, so computation resources must be acquired to execute queries against the data when needed. In Databricks SQL, this compute power is acquired with SQL Warehouses. They run a query processing engine called the Photon engine that works with the Delta Lake layer to provide a seamless query experience in the Databricks SQL product.

Note

The Photon Engine and the Delta Lake are transparent to the end user/analyst. They can query the data as if it were just another relational database.

If we refer to our guiding reference architecture, we are at the bounding box, as shown here:

Figure 2.6 – The compute layer

To configure a SQL Warehouse, click on the SQL Warehouses icon on the left-hand sidebar to bring up the SQL Warehouses page:

Figure 2.7 – The SQL Warehouses page

Here, we can see a warehouse named sql-warehouse, which is in a Stopped state. We can have multiple SQL Warehouses in our instance of Databricks. We will learn how to create and configure them in Chapter 6, The SQL Warehouses.

Query Editor

So far, we have learned how to explore the data visually and acquire the computation power to run queries against the data. The logical next step is to write some queries.

Going back to our reference architecture, we are at the consumption layer, where analysts and end users have access to features and tools that enable their daily workflows. The consumption layer is highlighted in red in the following diagram:

Note

The rest of the visual tour of the Databricks SQL product suite will be concerned with the consumption layer, and hence the bounding box highlighted in the diagram.

Figure 2.8 – The consumption layer

To do so, we can click on the SQL Editor icon on the left-hand sidebar to bring up the SQL Editor page:

Figure 2.9 – The SQL Editor page

The SQL Editor page provides an intelligent workbench to quickly author queries with typeahead and other features. It also provides built-in visualization capabilities for query results. Finally, it has an integrated view of the Schema browser, which provides a view of the catalog namespace for faster exploratory data analysis.

In the preceding screenshot, we are sampling some data from the trips table in the nyctaxi database. This Query Editor is running queries on the SQL Warehouse called sql-warehouse, which we visited in the previous section (not visible in the preceding screenshot).
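The exact query from the screenshot is not reproduced here, but a minimal sampling query along the same lines, assuming the samples catalog is available in your workspace and a running SQL Warehouse is selected, would look as follows:

SELECT * FROM samples.nyctaxi.trips LIMIT 10;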

The SQL Editor provides many more daily workflow features and functions, all of which we will explore in Chapter 5, The Workbench.

The Queries page

In the previous section, you may have noticed we had created a “named” query called Sample trips Taxi, as is visible in the Query tab. In Databricks SQL, queries can be named and saved for use with other features, such as dashboards and alerts.

The named, saved queries can be explored on the Queries page as follows. Click the Queries icon in the left-hand sidebar to bring up the Queries page. The following screenshot shows a portion of the Queries page:

Figure 2.10 – The Queries view

The Queries view helps you discover existing queries authored by the user or by others in the team, hence promoting reuse and collaboration. The Queries view also provides metadata about the queries for better search capabilities. More on this in Chapter 5, The Workbench.

The Query History page

The Queries view allows the user to discover or revisit named/saved queries. However, it is unreasonable to expect that all the queries that are written or executed will be named or saved. Hence, the Query History page fills the gap in discovering past query executions. To visit the Query History page, click on the Query History icon, as shown in the following screenshot:

Figure 2.11 – The Query History page

Here, we can see that the Query History page maintains a history of all the queries that have been executed in the SQL persona view. There are smart filters that can be applied to sift through the history. We will cover practical applications of this feature in Chapter 5, The Workbench.

The Dashboards page

The queries that we author in the Databricks SQL product can be composed into dashboards. These dashboards can then be shared with other teams and stakeholders for their consumption. Click on the Dashboards icon on the left-hand sidebar to bring up the Dashboards page, as shown in the following screenshot:

Figure 2.12 – The Dashboards page

The preceding screenshot shows a section of the Dashboards landing page where we have two dashboards. The first one, NYC Taxi Trip Analysis, is a sample dashboard that you can import from the Dashboards Gallery.

Users can create a new dashboard using the Create Dashboard button in the top right-hand corner. The following screenshot shows a section of the Creation view of the NYC Taxi Trip Analysis sample dashboard:

Figure 2.13 – Sample dashboard (Databricks Dashboard Gallery)

In the Dashboard Composer, we can add visualizations based on existing queries, parameterize them for interactivity, and compose them into useful dashboards.
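As a small preview of that workflow (covered in detail in Chapter 5, The Workbench), a query can expose a dashboard filter through a query parameter written in double curly braces. The following is only a sketch; the column names are assumptions based on the public NYC taxi sample and may differ in your data:

SELECT count(*) AS trips, avg(fare_amount) AS average_fare
FROM samples.nyctaxi.trips
WHERE pickup_zip = {{ pickup_zip }};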

More on building dashboards will be covered in Chapter 5, The Workbench.

The Alerts page

Queries are