Description

Data professionals encounter a multitude of challenges such as handling large volumes of data, dealing with data silos, and the lack of appropriate tools. Datasets often arrive in different conditions and formats, demanding considerable time from analysts, engineers, and scientists to process and uncover insights. The complexity of the data life cycle often hinders teams and organizations from extracting the desired value from their data assets. Data Exploration and Preparation with BigQuery offers a holistic solution to these challenges.
The book begins with the basics of BigQuery while covering the fundamentals of data exploration and preparation. It then progresses to demonstrate how to use BigQuery for these tasks and explores the array of big data tools at your disposal within the Google Cloud ecosystem.
The book doesn’t merely offer theoretical insights; it’s a hands-on companion that walks you through properly structuring your tables for query efficiency and ensures adherence to data preparation best practices. You’ll also learn when to use Dataflow, BigQuery, and Dataprep for ETL and ELT workflows. The book will skillfully guide you through various case studies, demonstrating how BigQuery can be used to solve real-world data problems.
By the end of this book, you’ll have mastered the use of SQL to explore and prepare datasets in BigQuery, unlocking deeper insights from data.




Data Exploration and Preparation with BigQuery

A practical guide to cleaning, transforming, and analyzing data for business insights

Mike Kahn

BIRMINGHAM—MUMBAI

Data Exploration and Preparation with BigQuery

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kaustubh Manglurkar

Publishing Product Manager: Arindam Majumder

Content Development Editor: Shreya Moharir

Technical Editor: Sweety Pagaria

Copy Editor: Safis Editing

Project Coordinator: Shambhavi Mishra

Proofreader: Safis Editing

Indexer: Tejal Soni

Production Designer: Aparna Bhagat

Marketing Coordinator: Nivedita Singh

First published: November 2023

Production reference: 1251123

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN 978-1-80512-526-6

www.packtpub.com

To my wife, Jordana, for her endless support, love, and dedication to our family. Your encouragement, patience, and love have sustained me through the challenges and triumphs of writing this book. To my sons, Parker and Harrison, your curiosity and energy motivate me daily. To my parents, for always supporting me in my passions.

This book is dedicated to my family, whose love and encouragement make every achievement more meaningful and every challenge more surmountable.

– Mike Kahn

Contributors

About the author

Mike Kahn is a data and infrastructure enthusiast who currently leads a Customer Engineering team at Google Cloud. Prior to Google, Mike spent five years in solution architecture roles and worked in operations and leadership roles in the data center industry. His more than 15 years of experience have given him deep knowledge of data and infrastructure engineering, operations, strategy, and leadership. Mike holds multiple Google Cloud certifications and is a lifelong learner.

He is based in Boca Raton, Florida, in the US and holds a Bachelor of Science degree in Management Information Systems (MIS) from the University of Central Florida and a Master of Science degree in MIS from Florida International University.

I want to thank the people who have been close to me and supported me, especially my wife, Jordana, and my parents, Abby and Steven Kahn, and my in-laws, Donna and Udy Nahum.

To all my colleagues and the customers that I have worked with, thank you for your inspiration and partnership.

About the reviewers

Suddhasatwa Bhaumik is an experienced data professional with 16+ years of experience in the industry. His primary expertise lies in data engineering and machine learning, with background experience in software development, cloud computing, and cloud architecture. He is currently working as an AI engineer at Google in Google Cloud Consulting, helping customers build their data and ML systems on Google Cloud Platform (GCP). He has previously worked at industry giants including Vodafone, Infosys, and Tech Mahindra in various IT roles.

Guy Shilo is a data expert with more than 20 years of experience working with data, from relational database administration to big data analytics and data engineering platforms such as Hadoop, Splunk, Kafka, and NiFi, to cloud data platforms (especially Google Cloud). Guy is a technology and data enthusiast and loves investigating, learning, and mastering new tools and technologies. Today, Guy works as a data and analytics customer engineer at Google Cloud, specializing in all Google Cloud data products. In this role, he helps identify customers' needs and translates objectives into action through product and service strategy, implementation, optimization, and architecture efforts. In the past, Guy worked with Israeli military intelligence and various telecom companies as a data consultant.

Julio Quinteros is a co-founder of Axmos Technologies, a Google Cloud Premier Partner located in Latin America. Julio is a Google Cloud Authorized Trainer and works in learning, research, and training with a focus on machine learning and data. Julio is also a co-founder and active organizer of the Google Developers Group in Santiago, Chile.

Nick Orlove is a product manager for BigQuery. He works with customers, engineering, and other teams to launch new products and features.

Kiran Narang is a big data analytics consultant at Google. She provides solutions and guidance to enterprises on how Google Cloud’s data analytics products can solve the latest business challenges.

Table of Contents

Preface

Part 1: Introduction to BigQuery

1

Introducing BigQuery and Its Components

Technical requirements

What is BigQuery?

Understanding how BigQuery works

Dremel, the execution engine

Colossus distributed storage

The Borg compute platform

Jupiter network infrastructure

BigQuery administration and access

Tools for administration

Understanding identity and access management

BigQuery best practices and cost management

Best practices

Understanding and controlling costs

Extending your data

BigQuery ML

External datasets

External connections

Summary

References

2

BigQuery Organization and Design

Technical requirements

Understanding BigQuery’s resource hierarchy

Organizations, folders, and projects

BigQuery-specific resources

BigQuery storage

Exploring architecture patterns

The centralized enterprise data warehouse

The decentralized data warehouse

The cross-org data exchange

Schema design

Table design

Summary

Part 2: Data Exploration with BigQuery

3

Exploring Data in BigQuery

Technical requirements

What is data exploration?

Fundamentals

Data life cycle

Common challenges and solutions

Introduction to exploring data in BigQuery

Exploring data in the BigQuery SQL workspace

Exploring schema and table structure

Exploring data using SQL

Exploring data using the bq command-line interface

Exploring data with visualization tools

Enhancing data exploration in BigQuery

Advanced approaches

Best practices

Summary

4

Loading and Transforming Data

Technical requirements

Exploring data loading techniques

Batch loading data

Streaming ingestion of data

Scheduled loading of data

Situations where you do not need to load data

Data transformation with BigQuery

Evaluating ETL and ELT approaches for data integration

Hands-on exercise – data loading and transformation in BigQuery

Understanding the scenario

Loading data from a local file

Preparing and transforming data

Summary

5

Querying BigQuery Data

Technical requirements

Understanding query structure

Action command – the SELECT clause

Location command – the FROM clause

Filtering command – the WHERE clause

Selection handling commands – the GROUP BY, HAVING, ORDER BY, and LIMIT clauses

Understanding data types

Using expressions and aggregations

Expressions

Aggregations

Joining tables

Inner joins

Outer joins

Using functions

Advanced querying techniques

Subqueries

Window functions

Common table expressions

Array functions

Saving, sharing, and scheduling queries

Optimizing queries

Troubleshooting queries

Summary

Further reading

6

Exploring Data with Notebooks

Technical requirements

Understanding the value of using notebooks

Jupyter notebooks

Using Workbench notebook instances in Vertex AI

Creating a managed notebook

Executions and schedules

Hands-on exercise – analyzing Google Trends data with Workbench

Using Colab notebooks

Comparing Workbench instances and Colab

Summary

Further reading

7

Further Exploring and Visualizing Data

Technical requirements

Understanding data distributions

Why is it important to understand data distributions?

Uncovering relationships in data

Exploring BigQuery data with Google Sheets

Connecting to Sheets from BigQuery using Explore with Sheets

Connecting to BigQuery using Connected Sheets

Column statistics

Collaboration with BigQuery data in Sheets

Visualizing BigQuery data using Looker Studio

Creating the right visualizations

Hands-on exercise – creating visualizations with Looker Studio

Commonly created charts

Calculations in visualization tools

Data quality discovery while creating visualizations

Filtering data in visualizations

Integrating other visualization tools with BigQuery

Summary

Further reading

Part 3: Data Preparation with BigQuery

8

An Overview of Data Preparation Tools

Technical requirements

Getting started with data preparation

Clearly defining your data preparation goals

Evaluating your current data quality

Data cleansing and transformation

Validating prepared data

Data preparation approaches

Data preparation tools

Visual data preparation tools

Query and code-based tools

Automated data preparation

Summary

Further reading

9

Cleansing and Transforming Data

Technical requirements

Using ELT for cleansing and transforming data

Assessing dataset integrity

The shape of the dataset

Skew of the dataset

Data profiling

Data validation

Data visualization

Using SQL for data cleansing and transformation

SQL data cleansing strategies and examples

SQL data transformation strategies and examples

Writing query results

Using Cloud Dataprep for visual cleansing and transformation

Summary

Further reading

10

Best Practices for Data Preparation, Optimization, and Cost Control

Technical requirements

Data preparation best practices

Understanding your data and business requirements

Denormalizing your data

Optimizing schema design

Considering nested and repeated fields

Using correct data types

Data cleansing and validation

Partitioning and clustering

Optimizing data loading

Best practices for optimizing storage

Long-term and compressed storage

Cross-cloud data analytics with federated access model and BigQuery Omni

Backup and recovery

Best practices for optimizing compute

Analysis cost options

Query optimization

Query optimization cheat sheet

Monitoring and controlling costs

Query plan and query performance insights

Monitoring, estimating, and optimizing costs

Controlling costs

Summary

Further reading

Part 4: Hands-On and Conclusion

11

Hands-On Exercise – Analyzing Advertising Data

Technical requirements

Exercise and use case overview

Loading CSV data files from local upload

Data preparation

Standardizing date formats

Data exploration, analysis, and visualization

Analyzing ads and sales data

Return on ad spend

Visualizations

Summary

References

12

Hands-On Exercise – Analyzing Transportation Data

Technical requirements

Exercise and use case overview

Loading data from GCS to BigQuery

Uploading data files to Google Cloud Storage

Loading data into BigQuery

Data preparation

Data exploration and analysis

Visualizing data with BigQuery geography functions

Summary

Further reading

13

Hands-On Exercise – Analyzing Customer Support Data

Technical requirements

Exercise and use case overview

Data loading from CSV upload

Data preparation

Data exploration and analysis

Count of ticket_type across both datasets

The most common support issues using ticket_subject data

Average resolution time per ticket_type

Customer demographics using customer_age and customer_gender

Analyzing emotions with sentiment analysis

Creating a connection

Granting access to the external connection service account

Creating a model

Querying the model

Summary

References and further reading

14

Summary and Future Directions

Summary of key points

Chapter 1, Introducing BigQuery and Its Components

Chapter 2, BigQuery Organization and Design

Chapter 3, Exploring Data in BigQuery

Chapter 4, Loading and Transforming Data

Chapter 5, Querying BigQuery Data

Chapter 6, Exploring Data with Notebooks

Chapter 7, Further Exploring and Visualizing Data

Chapter 8, An Overview of Data Preparation Tools

Chapter 9, Cleansing and Transforming Data

Chapter 10, Best Practices for Data Preparation, Optimization, and Cost Control

Chapter 11, Hands-On Exercise – Analyzing Advertising Data

Chapter 12, Hands-On Exercise – Analyzing Transportation Data

Chapter 13, Hands-On Exercise – Analyzing Customer Support Data

Future directions

More integration with AI and ML

Generative AI

Natural language queries

DataOps

Hybrid and multi-cloud data analysis

Zero-ETL and real-time analytics

Data governance and privacy

Federated learning

Data clean rooms

Data monetization

Additional resources

Final words

Index

Other Books You May Enjoy

Preface

In the ever-expanding universe of data, the ability to explore and prepare information for meaningful insights is paramount. Welcome to Data Exploration and Preparation with BigQuery. This book is a guided journey into the world of leveraging Google BigQuery, a powerful and scalable data warehouse, to unravel the potential within your data. Whether you’re an analyst embarking on a quest for patterns or a seasoned data engineer forging pathways to actionable intelligence, this book is designed to be your companion.

In these pages, we will cover the art and science of data exploration and preparation, the foundational steps that begin the practice of data analytics. From structuring complex datasets to optimizing queries and using best practices, each chapter will direct you through the landscape of data analysis and transformation.

The book extends beyond technical nuances and dives into the philosophy of data, recognizing that behind every query, there is a story waiting to be unveiled. It goes beyond official documentation, offering strategies and approaches drawn from years of customer interactions to accelerate your journey with data in BigQuery.

Google BigQuery was named a Leader in the 2022 Gartner Magic Quadrant for Cloud Database Management Systems, which recognizes its exceptional capabilities and industry impact. BigQuery has earned its place by demonstrating excellence as a data warehouse with an innovative architecture designed for scalable, high-performance analytics. BigQuery empowers organizations to unlock the full potential of their data through swift and sophisticated querying and seamless integrations with other Google Cloud services.

Demand for data analysts and data engineers continues to grow, and professionals in these roles will be able to apply the strategies and approaches outlined in this book.

Who this book is for

This book is for data analysts who want to learn how to explore and prepare data using BigQuery. If you are a data analyst who is experienced in SQL, reporting, data modeling, and transformations, this book is for you.

Business users who want to understand how to use BigQuery to make better data-driven decisions will also benefit from this book, as will program and project managers and other data professionals, thanks to its easy-to-follow approach. This book is excellent for anyone planning to use BigQuery as a data warehouse to provide business insights from large datasets.

What this book covers

Chapter 1, Introducing BigQuery and Its Components, teaches how BigQuery operates so that you can use it more effectively. We will take an “under the hood” look at the technologies that deliver BigQuery and understand data exploration and preparation goals.

Chapter 2, BigQuery Organization and Design, teaches how to build a secure and collaborative BigQuery environment. You will gain a strong understanding of all the services that deliver BigQuery beyond the SQL query, as well as design patterns for deploying BigQuery resources.

Chapter 3, Exploring Data in BigQuery, reviews various ways to explore data in BigQuery along with the process and steps of data exploration. You will learn about the different methods to access data in BigQuery and best practices to get started.

Chapter 4, Loading and Transforming Data, explores the techniques and best practices for loading data into BigQuery, and reviews the tools and methodologies for transforming and processing data with BigQuery. This chapter includes Hands-on exercise – data loading and transformation in BigQuery.

Chapter 5, Querying BigQuery Data, familiarizes you with the structure of a query and gives you a strong foundation in crafting queries. More complex querying practices will be reviewed as well. This chapter will give you the skills to begin writing queries.

Chapter 6, Exploring Data with Notebooks, helps you understand the value of using notebooks for data exploration and the notebook options available in Google Cloud. This chapter includes Hands-on exercise – analyzing Google Trends data in Workbench.

Chapter 7, Further Exploring and Visualizing Data, helps you better understand data attributes, discover patterns, and communicate findings effectively. You will learn about common practices for exploring data and review techniques and tools to analyze and visualize your data. This chapter includes Hands-on exercise – creating visualizations with Looker Studio.

Chapter 8, An Overview of Data Preparation Tools, explores approaches and tools that can be used with BigQuery for data preparation tasks to improve data quality.

Chapter 9, Cleansing and Transforming Data, reviews cleansing and transforming data in greater detail to optimize table data after loading and initial exploration. You will learn the skills to handle situations you will encounter as you refine query results and reporting accuracy.

Chapter 10, Best Practices for Data Preparation, Optimization, and Cost Control, introduces the cost control and optimization features of BigQuery. You will learn how to use BigQuery in a cost-effective way.

Chapter 11, Hands-On Exercise – Analyzing Advertising Data, presents a use case including sales, marketing, and advertising data. Follow along with the exercise to learn how to analyze and prepare advertising data and utilize the steps as a repeatable process with your real data.

Chapter 12, Hands-On Exercise – Analyzing Transportation Data, presents a use case with vehicle data. Follow along with the exercise to learn how to analyze and prepare transportation data; the steps presented can be replicated with real data.

Chapter 13, Hands-On Exercise – Analyzing Customer Support Data, presents a use case with customer support data. Two different customer support data sources will be used, as well as BigQuery ML sentiment analysis, to better understand customer service data.

Chapter 14, Summary and Future Directions, recaps the key points discussed throughout the book. We will look into the future and learn about emerging trends and transformative directions that will shape the landscape of data exploration, preparation, and analytics with BigQuery.

To get the most out of this book

To get the most out of this book, you will want to have an active account with Google Cloud. The Technical requirements sections of each chapter provide specific directions and setup instructions.

Software/hardware covered in the book | Operating system requirements
Google Cloud | Windows, macOS, or ChromeOS
GoogleSQL, SQL |
Dataprep by Trifacta |

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

The book can be read from front to back or as individual chapters on desired topics. For example, if you want to get right to loading data, start with Chapter 4, Loading and Transforming Data.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Exploration-and-Preparation-with-BigQuery. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “The following SQL query will modify the time column name to date in our jewelry_ads_data table.”

A block of code is set as follows:

CREATE TABLE `ch11.jewelry_sales_data2` (
  date DATE,
  order_id INT,
  product_id INT,
  quantity INT,
  category_id INT,
  category_name STRING,
  brand_id INT,
  price FLOAT64,
  gender STRING,
  metal STRING,
  stone STRING
)

Any command-line input or output is written as follows:

cd Data-Exploration-and-Preparation-with-BigQuery/ch4/

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Click BROWSE to select a file from your GCS buckets. Find the file in your newly created bucket and click SELECT.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Data Exploration and Preparation with BigQuery, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application. 

The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781805125266

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly.

Part 1: Introduction to BigQuery

In the first part of this book, you will learn the mechanics behind BigQuery and how it operates. These chapters will cover BigQuery resource organization and design best practices. You will also understand the goals of data exploration and preparation after completing these chapters.

This part has the following chapters:

Chapter 1, Introducing BigQuery and Its Components

Chapter 2, BigQuery Organization and Design

1

Introducing BigQuery and Its Components

BigQuery is a fully managed serverless data warehouse that helps users and enterprises manage and analyze data with SQL. It includes features for data sharing, machine learning, geospatial analysis, and business intelligence.

In this chapter, you will be introduced to BigQuery and how the service operates. By gaining an understanding of how this service is delivered and how it functions, you will be able to use BigQuery more effectively.

This chapter will introduce and describe the first two, and most critical, steps of data analysis: data exploration and data preparation. By the end of this chapter, you will have a solid foundational understanding of BigQuery and how to begin preparing data to make it more useful and insightful.

In this chapter, we’re going to cover the following main topics:

What is BigQuery?

How does BigQuery work?

BigQuery administration and access

BigQuery best practices and cost management

Extending your data

Technical requirements

To get the most out of this chapter and the book, you will want to set up access to the Google Cloud console (https://console.cloud.google.com/). At the time of writing this publication, there is a free trial that will allow you to explore and build resources. Many services are also eligible for the Free tier (https://cloud.google.com/free), including BigQuery, which includes 1 TB of querying per month and 10 GB of storage each month. All you will need to get started is a Google account (most people know this as their Gmail account). If you wish to use the Cloud console with your work email, you may be prompted to contact your organization administrator.

If you wish to explore BigQuery at no cost to determine whether BigQuery fits your needs, you can also utilize the BigQuery sandbox (https://cloud.google.com/bigquery/docs/sandbox). The sandbox lets you experience BigQuery and the Google Cloud console without enabling billing for your project. To access the BigQuery sandbox, follow these steps:

1. Visit https://console.cloud.google.com/ and go to the BigQuery page. You can also open BigQuery directly by accessing https://console.cloud.google.com/bigquery.
2. Authenticate with your Google account or create one. Here is where you can utilize your existing personal Gmail account or attempt to use your organization or work credentials.
3. Follow the prompts to create a new Google Cloud project. Bypass setting up billing.
4. After you create a Cloud project, the Google Cloud console displays the sandbox banner.

Let’s get started!

What is BigQuery?

BigQuery is Google Cloud’s enterprise data warehouse, designed to help users ingest, store, visualize, and analyze data. BigQuery enables users to unlock insights across petabyte-scale datasets and supports business agility. In this chapter, we will explore BigQuery to give you a complete understanding of how the service operates so you can best utilize BigQuery to prepare and explore your data.

BigQuery’s managed serverless infrastructure frees users from managing resources and allows for greater focus on data. BigQuery is a serverless enterprise data warehouse with analytics tools, machine learning services, and multi-cloud capabilities. BigQuery stores data in a columnar format that is optimized for analytics queries on structured and semi-structured data.

There are several ways to get data into BigQuery. You can ingest data into BigQuery by uploading it in one-time or scheduled batch loads or by streaming data to enable real-time insights. You can use queries to generate new data and append or overwrite the results to a table. Alternatively, you can use a third-party application or service to load data from external sources.
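For example, writing query results to a table or appending them to an existing one can be done directly in GoogleSQL. The following is a minimal sketch; the project, dataset, table, and column names are placeholders rather than examples from this book:

-- Create (or replace) a table from the results of a query
CREATE OR REPLACE TABLE `my_project.sales.order_totals` AS
SELECT order_id, SUM(amount) AS total_amount
FROM `my_project.sales.order_items`
GROUP BY order_id;

-- Append new query results to the same table
INSERT INTO `my_project.sales.order_totals` (order_id, total_amount)
SELECT order_id, SUM(amount)
FROM `my_project.sales.new_order_items`
GROUP BY order_id;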

BigQuery supports a standard Structured Query Language (SQL) dialect that is American National Standards Institute (ANSI) compliant, which makes it easy for SQL practitioners and data analysts to use this service with a low learning curve and quick time to value. According to the documentation, BigQuery can be accessed via the Google Cloud console and through the BigQuery command-line tool. Developers and data scientists can use notebooks and client libraries as well as BigQuery’s REST API and Remote Procedure Call (RPC) API to transform and manage data. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) driver access allows interaction with existing applications and third-party tools.

BigQuery addresses the needs of data professionals including data analysts, data engineers, data scientists, and data developers. It allows these individuals to store and analyze large amounts of data to get the most out of their datasets and enable data-driven decision-making and rich insights across data sources. Enterprises can easily turn complex data landscapes into compelling stories through analysis, reporting, and machine learning.

Understanding how BigQuery works

BigQuery is known for being a serverless, highly scalable, cost-effective cloud data warehouse. The power of this cloud-native service lies in its decoupled storage and compute resources. Unlike other data warehouse software or services, BigQuery’s service architecture has independent storage and compute infrastructure layers. This allows each layer to scale independently on demand. This decoupled architecture offers high flexibility and cost control for data analytics and data science workloads.

Underneath the user interface, BigQuery is powered by several Google technologies that have been in use since well before the service’s general availability launch in 2011. In this section, we will go over the primary technologies behind BigQuery (Dremel, Colossus, Borg, and Jupiter) so you can better understand how BigQuery differs from other enterprise data warehouse services.

Dremel, the execution engine

Dremel is the service that turns SQL queries into execution trees. BigQuery slots, which are compute reservations used by BigQuery to execute SQL queries, are the leaves of the tree and they read data from storage. The branches of the execution tree are called “mixers” and they perform aggregation.

The following diagram shows user data stored in Colossus distributed storage. The leaves are BigQuery on-demand or reserved slots for compute. The mixers perform different levels of aggregation, and in between the mixers is “shuffle,” which moves data from one place to another on the Jupiter network. The mixers and slots are run by Borg.

Figure 1.1 – BigQuery execution tree example

Dremel is a query system for the analysis of nested data that combines multi-level execution trees with BigQuery’s columnar data layout. Dremel has tremendous power and can run aggregation queries over trillion-row tables in seconds. BigQuery offers query visualization through a query execution graph, which allows you to view performance insights for queries. The value Dremel delivers in BigQuery is that it enables users to run extremely complex queries over multiple large tables, better analyze query effectiveness, and optimize cost management. We will go over queries in detail in Chapter 5, Querying BigQuery Data. To learn more about Dremel, read the 2010 paper published by Googlers, linked at the end of this chapter [5].
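To get a feel for the execution tree in action, you can run a simple aggregation against one of BigQuery’s public sample tables. The query below is only illustrative; the slots scan the table while the mixers aggregate the grouped results:

SELECT
  corpus,
  SUM(word_count) AS total_words
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY
  corpus
ORDER BY
  total_words DESC
LIMIT 10;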

Colossus distributed storage

BigQuery stores data in Colossus using a columnar storage format called Capacitor, which is optimized for reading large amounts of structured data. Colossus also handles replication, recovery, and distribution management, so there is no single point of failure. The distributed storage system that BigQuery uses allows users to scale to hundreds of petabytes of stored data without the challenge, common in most data warehouses, of adding attached compute resources. Herein lies one of BigQuery’s differentiating features: decoupled storage, which reduces the burden of managing the infrastructure required for your cloud data warehouse.

The Borg compute platform

Borg is the service that paved the way for Kubernetes at Google. To give you hundreds of thousands of CPU cores or slots to process your task, BigQuery leverages Borg, Google’s cluster management system. Borg is the platform that runs the slots and mixers and allocates compute resources for BigQuery jobs. Borg clusters run up to tens of thousands of machines.

Jupiter network infrastructure

Google has been a leader in developing distributed computing and data processing infrastructure since the early 2000s. From the early days of the company, it was known that great computing infrastructure requires great data center networking technology [1].

In order to support billion-user properties, such as YouTube and Gmail, that have been successful over the past two decades, Google developed Jupiter fabrics, networks that can deliver more than 1 petabit/second of total bisection bandwidth. To put this into perspective, such capacity would be enough for 100,000 servers to exchange information at 10 Gb/s each [1].

BigQuery uses Jupiter to move data extremely quickly between storage systems. Jupiter makes it appear that compute and storage systems are on the same physical machine, yet they are physically separated across machines in zones within a region.

Now that you have learned more about BigQuery and how the service is delivered, let’s go over some foundational elements such as administration and access. The next section will help you understand how to access and use data in BigQuery. Get ready to gain an understanding of the different access features as well as the organization and team setup features available beyond a user account.

BigQuery administration and access

In this section, we will discuss how to administer and secure resources in BigQuery. We will describe access methods for users and services to access data. By the end of the section, you will have a basic understanding of the different approaches to securely allocate resources in BigQuery and how to enable access to users.

Tools for administration

BigQuery has several access methods for administration tasks. Most administrative functions can be performed with more than one tool, enabling both point-and-click and automated task flows.

The Google Cloud console is the main user interface for the centralized management of data and compute resources. Users access the Google Cloud console to run SQL queries and create and manage BigQuery resources. In the BigQuery navigation menu, you can access options for analysis, migration, and administration.

When you access the SQL workspace in the BigQuery section of the Cloud console, you will be presented with an explorer tree view that will show all the projects, connections, datasets, tables, and shared queries available to your user account. Within the SQL workspace, you can open tables in different tabs and views. You can also collapse menus and star or favorite your datasets and tables for quicker access.

The bq command-line tool lets you perform administrative tasks using bq console commands. The bq command-line tool is a Python-based tool for BigQuery. You can create resources, manage resources, and run queries from the bq command-line tool. You can also run the bq command-line tool in a script or from a service account.
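As a small illustration of command-line input (the project, dataset, table, and file names here are placeholders), the following commands create a dataset, load a local CSV file, list the dataset’s tables, and run a query from the terminal:

bq mk --dataset my_project:analytics
bq load --source_format=CSV --autodetect my_project:analytics.events ./events.csv
bq ls my_project:analytics
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS row_count FROM `my_project.analytics.events`'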

Understanding identity and access management

In Google Cloud, Cloud Identity is a built-in identity as a service (IDaaS) offering that works across the Cloud console and with Google Workspace. If your organization already uses Workspace, you may already have access to the Cloud console.

Fun fact

The identity management service used at Google is called Gaia internally, which is an acronym for Google Accounts and ID Administration. Gaia is the same identity management system used across many commonly used Google products.

To manage individual accounts, you can set up users and groups in Cloud Identity to assign roles and permissions in the Cloud console. A user in Google Cloud is an individual user account. This may be an individual email address associated with a Google domain (for example, [email protected]) or an email address for another domain that has already been configured by a Cloud Identity or Google Workspace domain administrator (https://developers.google.cn/issue-tracker/concepts/access-control).

A group is a Google group that contains individual users, all of whom need to share the same level of permissions. Many organizations will create groups for finance teams, data engineering teams, infrastructure teams, and network teams, to name a few. This allows organizations to assign permissions to a group, and those privileges cascade across the multiple users who perform similar job functions.

Organizations with existing identity providers such as Active Directory can federate with Cloud Identity. This makes it easier to integrate and maintain governance with your primary identity provider. You can continue to use your existing identity provider (IdP) to handle tasks such as provisioning and managing users and groups, managing single sign-on (SSO), and configuring multi-factor authentication (MFA). Federating an existing IdP with Cloud Identity can simplify access and user management with Google Cloud.

Identity and access management (IAM) helps you secure data and compute resources with an access model that is used across Google Cloud. IAM lets you grant granular access to resources, prevents access to other resources, and helps you adopt the principle of least privilege [2]. When designing access management, the security principle of least privilege ensures that no one should have more permissions than they need to do their job. Learn more about IAM roles and permissions in the docs: https://cloud.google.com/iam/docs/.

In the context of Google Cloud permissions, a resource is an entity or container. You will use the resource hierarchy for access control, meaning you can set IAM policies at different levels of the Google Cloud resource hierarchy. Resources inherit the policies of their parent resource, so you will want to give users roles and permissions at the most granular levels (e.g., provide dataset-level permissions instead of project-level permissions). IAM permissions determine the operations that are allowed on a resource. Permissions cannot be assigned directly to users; users are assigned permissions through policies or roles. Roles are collections of permissions; they can be assigned to Google identities and bound, through a policy, to the following Google Cloud resource levels:

Organization

Folder

Project

Service (e.g., datasets, tables, and views)

BigQuery allows you to go even more granular, assigning roles to resources within datasets, such as tables and views, without providing full access to the dataset’s resources. Authorized views are another option for granting specific users access to query results without giving them access to underlying table data. You can create a dataset just for authorized views and assign access to that dataset or the view. We will go over resource hierarchy, roles, design, and real-life scenarios and examples in Chapter 2, BigQuery Organization and Design.
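As a rough, hypothetical sketch of this idea (the project, dataset, and group names are invented, and marking a view as authorized against its source dataset is done separately through the console, bq tool, or API rather than in SQL), you could keep reporting views in their own dataset and grant read access at that level using GoogleSQL DCL statements:

-- Dataset that will hold only views intended for report consumers
CREATE SCHEMA IF NOT EXISTS `my_project.reporting_views`;

-- A view that exposes aggregated results, not the underlying rows
CREATE OR REPLACE VIEW `my_project.reporting_views.orders_summary` AS
SELECT customer_id, COUNT(*) AS order_count
FROM `my_project.sales.orders`
GROUP BY customer_id;

-- Grant read access to the views dataset only
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `my_project.reporting_views`
TO "group:[email protected]";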

Now that you have learned about administration, access, and resources, let’s discuss best practices and cost management. This will begin to introduce best practices for using BigQuery so you can start off using this service in the most economical and efficient manner.

BigQuery best practices and cost management

In this section, we will go over general best practices for using BigQuery. We will also touch on cost management and the different approaches to optimizing your spending on BigQuery. By the end of this section, you will have a good understanding of how to efficiently use BigQuery to control costs and optimize storage and queries.

Best practices

There are many best practices for using BigQuery. The following are a few best practices that are foundational for using this service efficiently. Take note and implement these approaches to get the most performance and benefit out of BigQuery:

Use the right data type for your data: BigQuery supports many data types, each with its own storage cost. Keep in mind that a large string column takes up more space than an integer column. When designing your table schema, make sure to choose the right data type for your data.