Machine Learning with BigQuery ML - Alessandro Marrandino - E-Book

Machine Learning with BigQuery ML E-Book

Alessandro Marrandino

Description

BigQuery ML enables you to easily build machine learning (ML) models with SQL without much coding. This book will help you to accelerate the development and deployment of ML models with BigQuery ML.

The book starts with a quick overview of Google Cloud and BigQuery architecture. You'll then learn how to configure a Google Cloud project, understand the architectural components and capabilities of BigQuery, and find out how to build ML models with BigQuery ML using SQL. You'll analyze the key phases of an ML model's lifecycle and get to grips with the SQL statements used to train, evaluate, test, and use a model. As you advance, you'll work through a series of practical use cases that apply different ML techniques such as linear regression, binary and multiclass logistic regression, k-means, ARIMA time series, deep neural networks, and XGBoost. Moving on, you'll cover matrix factorization and deep neural networks using BigQuery ML's capabilities. Finally, you'll explore the integration of BigQuery ML with other Google Cloud Platform components such as AI Platform Notebooks and TensorFlow, and discover best practices, tips, and tricks for hyperparameter tuning and performance enhancement.

By the end of this BigQuery book, you'll be able to build and evaluate your own ML models with BigQuery ML.

Page count: 319




Machine Learning with BigQuery ML

Create, execute, and improve machine learning models in BigQuery using standard SQL queries

Alessandro Marrandino

BIRMINGHAM—MUMBAI

Machine Learning with BigQuery ML

Copyright © 2021 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kunal Parikh

Publishing Product Manager: Sunith Shetty

Senior Editor: David Sugarman

Content Development Editor: Nathanya Dias

Technical Editor: Manikandan Kurup

Copy Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Rekha Nair

Production Designer: Prashant Ghare

First published: June 2021

Production reference: 1120521

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80056-030-7

www.packt.com

Contributors

About the author

Alessandro Marrandino is a Google Cloud customer engineer. He helps enterprises adopt cloud technologies as part of their digital transformation. He is actively focused on and experienced in data management and smart analytics solutions. He has spent his entire career on data and artificial intelligence projects for global companies in different industries.

I want to thank the people who have been close to me and supported me, especially my wife, Federica. Thanks to her love and availability, I was able to dedicate most of my free time to writing this book, while we were waiting for the most important person in our life: Eva. Special thanks go to all my family. They have always believed in me and in my passion for technology and data. Just a final remark for my mom: The internet has had some success and there are people working on it!

About the reviewers

Marijan Milovec currently works as a software developer. He is highly ambitious and interested in software development, DevOps, and software architecture. He is also the lead organizer of the Google Developer Group Zagreb, which focuses on software development, software architecture, artificial intelligence, machine learning, deep learning, data science, DevOps, Docker, Kubernetes, Google Cloud, and more.

Sathish VJ is a software architect, technology trainer, and angel investor. He has all the open certifications on Google Cloud, including Google Cloud Machine Learning Engineer, and is also a Google Cloud Authorized Trainer. He runs a YouTube channel, called AwesomeGCP, where he teaches people how to apply Google Cloud to their projects and prepare for certifications.

Sharmistha Chatterjee is a data science evangelist with 15+ years of professional experience in the field of machine learning (AI research and productionizing scalable solutions) and cloud applications. She has worked in both Fortune 500 companies and very early-stage startups. She is currently working as a Senior Manager of Data Sciences at Publicis Sapient, where she leads the digital transformation of clients across industry verticals. She is an active blogger, an international speaker at various tech conferences, and a 2X Google Developer Expert in Machine Learning and Google Cloud. She is also the Hackernoon Tech award winner for 2020 and has been listed among the 40 Under 40 Data Scientists by AIM and the 21 tech trailblazers of 2021 by Google.

Table of Contents

Preface

Section 1: Introduction and Environment Setup

Chapter 1: Introduction to Google Cloud and BigQuery

Introducing Google Cloud Platform

Interacting with GCP

Discovering GCP's key differentiators

Exploring AI and ML services on GCP

Core platform services

Building blocks

Solutions

Introducing BigQuery

BigQuery architecture

BigQuery's advantages over traditional data warehouses

Interacting with BigQuery

BigQuery data structures

Discovering BigQuery ML

BigQuery ML benefits

BigQuery ML algorithms

Understanding BigQuery pricing

BigQuery pricing

BigQuery ML pricing

Free operations and free tiers

Pricing calculator

Summary

Further resources

Chapter 2: Setting Up Your GCP and BigQuery Environment

Technical requirements

Creating your GCP account and project

Registering a GCP account

Exploring Google Cloud Console

Creating a GCP project

Activating BigQuery

Discovering the BigQuery web UI

Exploring the BigQuery public datasets

Searching for a public dataset

Analyzing a table

Summary

Further reading

Chapter 3: Introducing BigQuery Syntax

Technical requirements

Creating a BigQuery dataset

Discovering BigQuery SQL

CRUD operations

Diving into BigQuery ML

Summary

Further resources

Section 2: Deep Learning Networks

Chapter 4: Predicting Numerical Values with Linear Regression

Technical requirements

Introducing the business scenario

Discovering linear regression

Exploring and understanding the dataset

Understanding the data

Checking the data's quality

Segmenting the dataset

Training the linear regression model

Evaluating the linear regression model

Utilizing the linear regression model

Drawing business conclusions

Summary

Further reading

Chapter 5: Predicting Boolean Values Using Binary Logistic Regression

Technical requirements

Introducing the business scenario

Discovering binary logistic regression

Exploring and understanding the dataset

Understanding the data

Segmenting the dataset

Training the binary logistic regression model

Evaluating the binary logistic regression model

Using the binary logistic regression model

Drawing business conclusions

Summary

Further resources

Chapter 6: Classifying Trees with Multiclass Logistic Regression

Technical requirements

Introducing the business scenario

Discovering multiclass logistic regression

Exploring and understanding the dataset

Understanding the data

Checking the data quality

Segmenting the dataset

Training the multiclass logistic regression model

Evaluating the multiclass logistic regression model

Using the multiclass logistic regression model

Drawing business conclusions

Summary

Further resources

Section 3: Advanced Models with BigQuery ML

Chapter 7: Clustering Using the K-Means Algorithm

Technical requirements

Introducing the business scenario

Discovering K-Means clustering

Exploring and understanding the dataset

Understanding the data

Checking the data quality

Creating the training datasets

Training the K-Means clustering model

Evaluating the K-Means clustering model

Using the K-Means clustering model

Drawing business conclusions

Summary

Further resources

Chapter 8: Forecasting Using Time Series

Technical requirements

Introducing the business scenario

Discovering time series forecasting

Exploring and understanding the dataset

Understanding the data

Checking the data quality

Creating the training dataset

Training the time series forecasting model

Evaluating the time series forecasting model

Using the time series forecasting model

Presenting the forecast

Summary

Further resources

Chapter 9: Suggesting the Right Product by Using Matrix Factorization

Technical requirements

Introducing the business scenario

Discovering matrix factorization

Configuring BigQuery Flex Slots

Exploring and preparing the dataset

Understanding the data

Creating the training dataset

Training the matrix factorization model

Evaluating the matrix factorization model

Using the matrix factorization model

Drawing business conclusions

Summary

Further resources

Chapter 10: Predicting Boolean Values Using XGBoost

Technical requirements

Introducing the business scenario

Discovering the XGBoost Boosted Tree classification model

Exploring and understanding the dataset

Checking the data quality

Segmenting the dataset

Training the XGBoost classification model

Evaluating the XGBoost classification model

Using the XGBoost classification model

Drawing business conclusions

Summary

Further resources

Chapter 11: Implementing Deep Neural Networks

Technical requirements

Introducing the business scenario

Discovering DNNs

DNNs in BigQuery ML

Preparing the dataset

Training the DNN models

Evaluating the DNN models

Using the DNN models

Drawing business conclusions

Deep neural networks versus linear models

Summary

Further resources

Section 4: Further Extending Your ML Capabilities with GCP

Chapter 12: Using BigQuery ML with AI Notebooks

Technical requirements

Discovering AI Platform Notebooks

AI Platform Notebooks pricing

Configuring the first notebook

Implementing BigQuery ML models within notebooks

Compiling the AI notebook

Running the code in the AI notebook

Summary

Further resources

Chapter 13: Running TensorFlow Models with BigQuery ML

Technical requirements

Introducing TensorFlow

Discovering the relationship between BigQuery ML and TensorFlow

Understanding commonalities and differences

Collaborating with BigQuery ML and TensorFlow

Converting BigQuery ML models into TensorFlow

Training the BigQuery ML to export it

Exporting the BigQuery ML model

Running TensorFlow models with BigQuery ML

Summary

Further resources

Chapter 14: BigQuery ML Tips and Best Practices

Choosing the right BigQuery ML algorithm

Preparing the datasets

Working with high-quality data

Segmenting the datasets

Understanding feature engineering

Tuning hyperparameters

Using BigQuery ML for online predictions

Summary

Further resources

Other Books You May Enjoy

Preface

Machine Learning (ML) democratization is one of the fastest growing trends in the AI industry. In this field, BigQuery ML represents a fundamental tool for bridging the gap between data analysis and the implementation of innovative ML models. Through this book, you will have the opportunity to learn how to use BigQuery and BigQuery ML with an incremental approach that combines technical explanations with hands-on exercises. Following a brief introduction, you will immediately be able to build ML models on concrete use cases using BigQuery ML. By the end of this book, you will be able to choose the right ML algorithm to train, evaluate, and use advanced ML models.

Who this book is for

This book is for data scientists, data analysts, data engineers, and anyone looking to get started with Google's BigQuery ML. You'll also find this book useful if you want to accelerate the development of ML models or if you are a business user who wants to apply ML in an easy way using SQL. A basic knowledge of BigQuery and SQL is required.

What this book covers

Chapter 1, Introduction to Google Cloud and BigQuery, provides an overview of the Google Cloud Platform and of the BigQuery analytics database.

Chapter 2, Setting Up Your GCP and BigQuery Environment, explains the configuration of your first Google Cloud account, project, and BigQuery environment.

Chapter 3, Introducing BigQuery Syntax, covers the main SQL operations for working on BigQuery.

Chapter 4, Predicting Numerical Values with Linear Regression, explains the development of a linear regression ML model to predict the trip durations of a bike rental service.

Chapter 5, Predicting Boolean Values Using Binary Logistic Regression, explains the implementation of a binary logistic regression ML model to predict the behavior of a taxi company's customers.

Chapter 6, Classifying Trees with Multiclass Logistic Regression, explains the development of a multiclass logistic ML algorithm to automatically classify species of trees according to their natural characteristics.

Chapter 7, Clustering Using the K-Means Algorithm, covers the implementation of a clustering system to identify the best-performing drivers in a taxi company.

Chapter 8, Forecasting Using Time Series, outlines the design and implementation of a forecasting tool to predict and present the sales of specific products.

Chapter 9, Suggesting the Right Product by Using Matrix Factorization, explains how to build a recommendation engine, using the matrix factorization algorithm, that suggests the best product to each customer.

Chapter 10, Predicting Boolean Values Using XGBoost, covers the implementation of a boosted tree ML model to predict the behavior of a taxi company's customers.

Chapter 11, Implementing Deep Neural Networks, covers the design and implementation of a Deep Neural Network (DNN) to predict the trip durations of a bike rental service.

Chapter 12, Using BigQuery ML with AI Notebooks, explains how AI Platform Notebooks can be integrated with BigQuery ML to develop and share ML models.

Chapter 13, Running TensorFlow Models with BigQuery ML, explains how BigQuery ML and TensorFlow can work together.

Chapter 14, BigQuery ML Tips and Best Practices, covers ML best practices and tips that can be applied during the development of a BigQuery ML model.

To get the most out of this book

You will need to have a basic knowledge of SQL syntax and some experience of using databases.

A knowledge of the fundamentals of ML is not mandatory but is advised.

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you to avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Machine-Learning-with-BigQuery-ML. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Code in Action

Code in Action videos for this book can be viewed at https://bit.ly/3f11XbU.

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781800560307_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Sort the results of a query according to a specific list of fields with the ORDER BY clause."

A block of code is set as follows:

UPDATE
    `bigqueryml-packt.03_bigquery_syntax.first_table`
SET
    description = 'This is my updated description'
WHERE
    id_key = 1;

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "BigQuery supports two different SQL dialects: standard SQL and legacy SQL."

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Section 1: Introduction and Environment Setup

This section provides an introduction to machine learning and an overview of the technical tools that will be used in the next sections of the book: Google Cloud Platform, BigQuery, and BigQuery ML, as well as the SQL syntax related to it.

This section comprises the following chapters:

Chapter 1, Introduction to Google Cloud and BigQuery

Chapter 2, Setting Up Your GCP and BigQuery Environment

Chapter 3, Introducing BigQuery Syntax

Chapter 1: Introduction to Google Cloud and BigQuery

The adoption of the public cloud enables companies and users to access innovative and cost-effective technologies. This is particularly valuable in the big data and Artificial Intelligence (AI) areas, where new solutions are providing possibilities that seemed impossible to achieve with on-premises systems only a few years ago. In order to be effective in the day-to-day business of a company, the new AI capabilities need to be shared between different roles and not concentrated only with technicians. Most cloud providers are currently addressing the challenge of democratizing AI across different departments and employees with different skills.

In this context, Google Cloud provides several services to accelerate the processing of large amounts of data and build Machine Learning (ML) applications that can make better decisions.

In this chapter, we'll gradually introduce the main concepts that will be useful in the upcoming hands-on activities. Using an incremental approach, we'll go through the following topics:

Introducing Google Cloud Platform

Exploring AI and ML services on GCP

Introducing BigQuery

Discovering BigQuery ML

Understanding BigQuery pricing

Introducing Google Cloud Platform

Since the launch of Google Search in 1998, Google has developed one of the largest and most powerful IT infrastructures in the world. Today, this infrastructure serves billions of users through services such as Gmail, YouTube, Google Photos, and Maps. Ten years later, in 2008, Google decided to open its network and IT infrastructure to business customers, turning an infrastructure that was initially developed for consumer applications into a public service and launching Google Cloud Platform (GCP).

The 90+ services that Google currently provides to large enterprises and small- and medium-sized businesses cover the following categories:

Compute: Used to support workloads or applications with virtual machines such as Google Compute Engine, containers with Google Kubernetes Engine, or platforms such as App Engine.

Storage and databases: Used to store datasets and objects in an easy and convenient way. Some examples are Google Cloud Storage, Cloud SQL, and Spanner.

Networking: Used to easily connect different locations and data centers across the globe with Virtual Private Clouds (VPCs), firewalls, and fully managed global routers.

Big data: Used to store and process large amounts of information in a structured, semi-structured, or unstructured format. Among these services are Google Dataproc, the Hadoop service offered by GCP, and BigQuery, which is the main focus of this book.

AI and machine learning: This product area provides various tools for different kinds of users, enabling them to leverage AI and ML in their everyday business. Some examples are TensorFlow, AutoML, Vision APIs, and BigQuery ML, the main focus of this book.

Identity, security, and management tools: This area includes all the services that are necessary to prevent unauthorized access, ensure security, and monitor all other cloud infrastructure. Identity and Access Management, Key Management Service, Cloud Logging, and Cloud Audit Logs are just some of these tools.

Internet of Things (IoT): Used to connect plants, vehicles, or any other objects to the GCP infrastructure, enabling the development of modern IoT use cases. The core component of this area is Google IoT Core.

API management: Tools to expose services to customers and partners through REST APIs, providing the ability to fully leverage the benefits of interconnectivity. In this pillar, Google Apigee is one of the most famous products and is recognized as the leader of this market segment.

Productivity: Used to improve productivity and collaboration for all companies that want to start working with Google and embracing its way of doing business through the powerful tools of Google Workspace (previously G Suite).

Interacting with GCP

All the services just mentioned can be accessed through four different interfaces:

Google Cloud Console: The web-based user interface of GCP, easily accessible from compatible web browsers such as Google Chrome, Edge, or Firefox. For the hands-on exercises in this book, we'll mainly use Google Cloud Console:

Figure 1.1 – Screenshot of Google Cloud Console

Google Cloud SDK: The client SDK can be installed in order to interact with GCP services through the command line. It can be very useful to automate tasks and operations by scheduling them into scripts.

Client libraries: The SDK also includes some client libraries to interact with GCP using the most common programming languages, such as Python, Java, and Node.js.

REST APIs: Any task or operation performed on GCP can be executed by invoking a specific REST API from any compatible software.

Now that we've learned how to interact with GCP, let's discover how GCP is different from other cloud providers.

Discovering GCP's key differentiators

GCP is not the only public cloud provider on the market. Other providers compete in the same space, for example, Amazon Web Services (AWS), Microsoft Azure, IBM, and Oracle. For this reason, before we get too deep into this book, it is worth understanding how GCP differs from the other offerings in the cloud market.

Each cloud provider has its own mission, strategy, history, and strengths. Let's take a look at why Google Cloud can be considered different from all the other cloud providers.

Security

Google provides an end-to-end security model for its data centers across the globe, using customized hardware developed and used by Google, and application encryption is enabled by default. The security best practices adopted by Google for GCP are the same as those developed to run applications with more than 1 billion users, such as Gmail and Google Maps.

Global network and infrastructure

At the time of writing, Google's infrastructure is available in 24 different regions, 74 availability zones, and 144 network edge locations, enabling customers to connect to Google's network and ensuring the best experience in terms of bandwidth, network latency, and security. This network allows GCP users to move data across different regions without leaving Google's proprietary network, minimizing the risk of sending information across the public internet. As of today, it is estimated that about 40% of internet traffic goes through Google's proprietary network.

In the following figure, we can see how GCP regions are distributed across the globe:

Figure 1.2 – A map of Google's global availability

The latest version of the map can be seen at the following URL: https://cloud.google.com/about/locations.

Serverless and fully managed approach

Google provides a lot of fully managed and serverless services to allow its customers to focus on high-value activities rather than maintenance operations. A great example is BigQuery, the serverless data warehouse that will be introduced in the next section of this chapter.

Environmental sustainability

100% of the energy used for Google's data centers comes from renewable energy sources. Furthermore, Google has committed to being the first major company to operate carbon-free for all its operations, such as its data centers and campuses, by 2030.

Pervasive AI

Google is a pioneer of the AI industry and leverages AI and ML not only to improve its consumer products, such as Google Photos, but also to improve the performance and efficiency of its data centers. Customers can draw on all of Google's AI and ML expertise by adopting GCP services such as AutoML and BigQuery ML, the latter being the main focus of this book.

Now that we have discussed some of the key elements of GCP as a service, let's look at AI and ML more specifically.

Exploring AI and ML services on GCP

Before we get too deep into our look at all of the AI and ML tools of GCP, it is very important to remember that Google is an AI company and embeds AI and ML features within many of its other products, providing the best user experience to its customers. Simply looking at Google's products, we can easily perceive how AI can be a key asset for a business. Some examples follow:

Gmail Smart Reply allows users to quickly reply to emails, providing meaningful suggestions according to the context of the conversation.

Google Maps is able to precisely predict our time of arrival when we move from one place to another by combining different data sources.

Google Translate provides translation services for more than one hundred languages.

YouTube and the Google Play Store are able to recommend the best video to watch or the most useful mobile application to install according to user preferences.

Google Photos recognizes people, animals, and places in our pictures, simplifying the job of archiving and organizing our photos.

Google proves that leveraging AI and ML capabilities in our business opens new opportunities for us, increases our revenue, saves money and time, and provides better experiences to our customers.

To better understand the richness of the GCP portfolio in terms of AI and ML services, it is important to emphasize that GCP services are able to address all the needs that emerge in a typical life cycle of an ML model:

Ingestion and preparation of the datasets

Building and training of the model

Evaluation and validation

Deployment

Maintenance and further improvements of the model
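In BigQuery ML, the main focus of this book, the training, evaluation, and usage stages of this life cycle can each be expressed as a SQL statement. The following is a minimal sketch under stated assumptions, not an example taken from a later chapter: the project, dataset, table, and column names (my-project, my_dataset, bike_trips_training, trip_duration, and so on) are hypothetical placeholders, and a linear regression model is assumed purely for illustration.

```sql
-- Building and training: create a model from a prepared training table.
-- All object and column names below are hypothetical placeholders.
CREATE OR REPLACE MODEL `my-project.my_dataset.trip_duration_model`
OPTIONS (
  model_type = 'linear_reg',              -- assumed model type for this sketch
  input_label_cols = ['trip_duration']    -- the column the model learns to predict
) AS
SELECT
  start_station_name,
  hour_of_day,
  trip_duration
FROM
  `my-project.my_dataset.bike_trips_training`;

-- Evaluation and validation: compute metrics on rows not used for training.
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `my-project.my_dataset.trip_duration_model`,
    (
      SELECT
        start_station_name,
        hour_of_day,
        trip_duration
      FROM
        `my-project.my_dataset.bike_trips_evaluation`
    )
  );

-- Using the model: generate predictions for new, unlabeled rows.
SELECT
  *
FROM
  ML.PREDICT(
    MODEL `my-project.my_dataset.trip_duration_model`,
    (
      SELECT
        start_station_name,
        hour_of_day
      FROM
        `my-project.my_dataset.bike_trips_new`
    )
  );
```

ML.EVALUATE returns the evaluation metrics for the model, while ML.PREDICT returns the input columns together with a prediction column whose name is the label column prefixed with predicted_ (here, predicted_trip_duration). Keeping the training, evaluation, and prediction rows in separate tables is one simple way to avoid evaluating the model on data it has already seen.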

In the following figure, you can see the entire AI and ML GCP portfolio:

Figure 1.3 – GCP AI and ML services represented by their icons

Each one of the previously mentioned five stages can be fully managed by the user or delegated to the automation capabilities of GCP, according to the customer's needs and skills. For this reason, it is possible to divide the AI and ML services provided by GCP into three subcategories:

Core platform services

Building blocks

Solutions

For each of these subcategories, we'll go through the most important services currently available and some typical users that could benefit from them.

Core platform services

The core AI and ML services are the most granular items that a customer can use on GCP to develop AI and ML use cases. They provide the most control and flexibility to their users in exchange for less automation; users will also need to have more expertise in ML.

Processing units (CPU, GPU, and TPU)

With a traditional Infrastructure-as-a-Service (IaaS) approach, developers can equip their Google Compute Engine instances with powerful processing units to accelerate the training phases of ML models that might otherwise take a long time to run, particularly if complex contexts or large amounts of data need to be processed. Beyond the Central Processing Units (CPUs) that are also found in our laptops, GCP offers high-performance Graphics Processing Units (GPUs) made by NVIDIA and available in the cloud to speed up computationally heavy jobs. In addition, there are Tensor Processing Units (TPUs), which are specifically designed to support ML workloads and perform matrix calculations.

Deep Learning VM Image

One of the biggest challenges for data scientists is quickly provisioning environments to develop their ML models. For this reason, Google Cloud provides pre-configured Google Compute Engine (GCE) images that can be easily provisioned with a pre-built set of components and libraries dedicated to ML.

In the following screenshot, you can see how these Virtual Machines (VMs) are presented in the GCP marketplace:

Figure 1.4 – Deep Learning VM in the GCP marketplace

Deep Learning VM Image is also optimized for ML workloads and is already pre-configured to use GPUs. When a GCE image is provisioned from the GCP marketplace, it is already configured with the most common ML frameworks and programming languages, such as Python, TensorFlow, scikit-learn, and others. This allows data scientists to focus on the development of the model rather than on the provisioning and configuration of the development environment.

TensorFlow

TensorFlow is an open source framework for math, statistics, and ML. It was launched by Google Brain for internal use at Google and then released under the Apache License 2.0. It is still at the core of the most successful Google products. The framework natively supports Python but can also be used with other programming languages such as Java, C++, and Go. It requires ML expertise, but in return it gives users a high degree of customization and flexibility when developing an ML model.

AI Platform

AI Platform is an integrated service of GCP that provides serverless tools to train, evaluate, deploy, and maintain ML models. With this service, data scientists are able to focus only on their code, simplifying all the side activities of ML development, such as provisioning, maintenance, and scalability.

AI Platform Notebooks

AI Platform Notebooks is a fully managed service that provides data scientists with a JupyterLab environment already integrated and connected with all other GCP resources. Similar to Deep Learning VM Image, AI Platform Notebooks instances come pre-configured with the latest versions of the AI and ML frameworks and allow you to develop an ML model with diagrams and written explanations.

All the services described so far require good knowledge of ML and proven experience in hand-coding with the most common programming languages. The core platform services address the needs of data scientists and ML engineers who need full control over and flexibility with the solutions that they're building and who already have strong technical skills.

Building blocks

On top of the core platform services, Google Cloud provides pre-built components that can be used to accelerate the development of new ML use cases. This category encompasses the following aspects:

AutoML

Unlike the services outlined in the previous section, AutoML offers the ability to build ML models even if you have limited expertise in the field. It leverages Google's ML capabilities and allows users to provide their data to train customized versions of algorithms already developed by Google. AutoML currently provides the ability to train models for images (AutoML Vision), video (AutoML Video Intelligence), free text (AutoML Natural Language), translation (AutoML Translation), and structured data (AutoML Tables). When the ML model is trained and ready to use, it is automatically deployed and made available through a REST endpoint.

Pre-built APIs

Google Cloud provides pre-built APIs that leverage ML technology under the surface but are already trained and ready to use. The APIs are exposed through a standard REST interface that can be easily integrated into applications to work with images (Vision API), videos (Video API), free text (Natural Language API), translations (Translation API), e-commerce data (Recommendations AI), and conversational scenarios (Speech-to-Text API, Text-to-Speech API, and Dialogflow). Using a pre-built ML API is the best choice for general-purpose applications where generic training datasets can be used.

BigQuery ML

As BigQuery ML will be discussed in detail in the following sections of this chapter, for the moment you just need to know that this component enables users to build ML models with SQL, using structured data stored in BigQuery and a list of supported algorithms.
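To give a first impression of what building a model with SQL looks like, the following minimal sketch composes a BigQuery ML CREATE MODEL statement as a plain string. The dataset, table, model, and label names are hypothetical placeholders, and the helper function create_model_sql is introduced here only for illustration; it is not part of any Google library. The sketch does not connect to BigQuery; it only prints the statement.

```python
def create_model_sql(model, label, source_table, model_type="linear_reg"):
    """Compose a BigQuery ML CREATE MODEL statement as a string.

    Illustrative helper only: values are interpolated directly, so it
    should not be used with untrusted input.
    """
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', "
        f"input_label_cols = ['{label}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


# All names below are hypothetical placeholders.
statement = create_model_sql(
    model="my_dataset.trip_duration_model",
    label="duration",
    source_table="my_dataset.training_data",
)
print(statement)
```

The printed statement could then be pasted into the BigQuery query editor, provided that a dataset and a training table with those names exist in your project.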

None of the building blocks described here requires any specific knowledge of ML or any proven coding experience with programming languages. In fact, these services are intended for developers or business analysts who are not very familiar with ML but want to start using it quickly and with little effort. On the other hand, a data scientist with ML expertise can also leverage the building blocks to accelerate the development of a model, reducing the time to market of a solution.

To see a summary of the building blocks, their usage, and their target users, let's take a look at the following table:

Figure 1.5 – Building blocks summary table

Now that we've learned the basics of building blocks, let's take a look at the solutions offered by GCP.

Solutions

Following the incremental approach, building blocks and core platform services are also bundled to provide out-of-the-box solutions. These pre-built modules can be adopted by companies and immediately used to improve their business. These solutions are covered in this section.

AI Hub

Google Cloud's AI Hub acts as a marketplace for AI components. It can be used in public mode to share and use assets developed by the community, which actively works on GCP, or it can be used privately to share ML assets inside your company. The goal of this service is to simplify the sharing of valuable assets across different users, favoring re-use and accelerating the deployment of new use cases.

In the following screenshot, you can see AI Hub's home page:

Figure 1.6 – Screenshot of AI Hub on GCP

Now that we've understood the role of AI Hub, let's look at Cloud Talent Solution.

Cloud Talent Solution

Cloud Talent Solution is basically a solution for HR offices that improves the candidate discovery and hiring processes using AI. We will not go any further with the description of this solution, but there will be a link in the Further resources section at the end of this chapter.

Contact Center AI

Contact Center AI is a solution that can be used to improve the effectiveness of the customer experience with a contact center powered by AI and automation. The solution is based on Dialogflow and the Text-to-Speech and Speech-to-Text APIs.

Document AI

This solution is focused on document processing to extract relevant information and streamline business processes that usually require manual effort. The solution is able to parse PDF files, images, and handwritten text and convert this information into a digitally structured format, making it accessible and searchable.

As can be easily seen from their descriptions, the AI solutions provided by Google are more business-oriented and designed to solve specific challenges. They can be configured and customized but are basically dedicated to business users.

Before going on, let's take a look at the following table, which summarizes the concepts explained in this section and provides a clear overview of the different AI and ML service categories:

Figure 1.7 – Summary of GCP AI and ML services

Tip

When you need to develop a new use case, we recommend using pre-built solutions and building blocks before trying to reinvent the wheel. If a building block already satisfies all the requirements of your use case, it can be extremely valuable to use it. It will save time and effort during the development and maintenance phases. Start considering the use of core services only if the use case is complex or so particular that it cannot be addressed with building blocks or solutions.

As we've seen in this section, GCP's AI and ML services are extensive. Now, let's take a closer look at the main topic of this book: Google BigQuery.

Introducing BigQuery

Google BigQuery is a highly scalable, serverless, distributed data warehouse technology built internally by Google in 2006 and then released for public use on GCP in 2010. Thanks to its architecture, it can store petabytes of data and query them with high performance and on-demand scaling. Due to its serverless nature, users who store and query data on BigQuery don't have to manage the underlying infrastructure and can focus on implementing the logic that brings the business value, saving time and resources.

BigQuery is currently used by many large enterprises that leverage it to make data-driven decisions, including Twitter, The Home Depot, and Dow Jones.

BigQuery architecture

BigQuery has a distributed architecture running on thousands of nodes across Google's data centers. Your datasets are not stored on a single server but are chunked and replicated across different regions to guarantee maximum performance and availability.

The storage and compute layers are fully decoupled in BigQuery. This means that the query engine runs on different servers from the servers where the data is stored. This feature enables BigQuery to provide great scalability both in terms of data volume and query execution. This decoupled paradigm is only possible thanks to Google's Petabit network, which moves data very quickly from one server to another, leveraging Google's proprietary fiber cables across the globe.

Now let's look deeper into how BigQuery manages storage and the compute engine.

Storage layer

Unlike traditional data warehouses, BigQuery stores data in columnar format rather than in row format. This approach enables you to do the following:

Achieve a better compression ratio for each column, because the data in a column is typically homogeneous and simpler to compress.

Reduce the amount of data to read and get the best possible performance for data warehouse use cases that are usually based on a small selection of columns in a table and aggregating operations such as sums, average, and maximum.
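The two benefits of a columnar layout can be illustrated with a small, self-contained sketch. It is not how BigQuery stores data internally: the sample table, the JSON encoding, and the use of zlib are assumptions chosen only to make the idea tangible. The same rows are laid out once row by row and once column by column, and the sketch then compares how many bytes an aggregation on a single column would need to read.

```python
import json
import zlib

# Hypothetical sample table: 1,000 rows with three columns.
rows = [
    {"order_id": i, "country": "IT" if i % 2 == 0 else "DE", "amount": 10.0 + (i % 5)}
    for i in range(1000)
]

# Row-oriented layout: values of different columns are interleaved.
row_layout = json.dumps([list(r.values()) for r in rows]).encode()

# Column-oriented layout: the values of each column are stored together.
columns = {name: [r[name] for r in rows] for name in rows[0]}
column_layout = {name: json.dumps(values).encode() for name, values in columns.items()}

# Compressed size of each layout (zlib is an assumption, used for illustration).
row_compressed = len(zlib.compress(row_layout))
column_compressed = {name: len(zlib.compress(b)) for name, b in column_layout.items()}

# SELECT SUM(amount): a columnar reader only needs the "amount" column,
# while a row-oriented reader has to scan every row in full.
bytes_read_columnar = len(column_layout["amount"])
bytes_read_row = len(row_layout)
total_amount = sum(columns["amount"])

print("compressed bytes, row layout:", row_compressed)
print("compressed bytes per column:", column_compressed)
print("bytes read for SUM(amount):", bytes_read_columnar, "vs", bytes_read_row)
print("SUM(amount):", total_amount)
```

In this toy table, the country and amount columns contain only a few distinct values, which is the kind of homogeneity that makes per-column compression effective.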

All the data is stored in Google's proprietary distributed filesystem named Google File System (codename Colossus). The distribution of the data allows it to guarantee faster I/O performance and better availability of data in the case of failures. Google File System is based on two different server types:

Master servers: Nodes that don't store data but are responsible for managing the metadata of each file, such as the location and available number of replicas of each chunk that composes a file.

Chunk servers: Nodes that actually store the chunks of files that are replicated across different servers.
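The split between master servers and chunk servers can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Google's implementation: the class names, the tiny chunk size, the replica count, and the round-robin placement are all hypothetical, and for brevity the master object also routes the data, whereas the description above only assigns it the metadata. The point is that the master keeps only the mapping from chunks to servers, while the bytes live on the chunk servers.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # bytes per chunk; tiny on purpose, purely illustrative
REPLICAS = 2    # copies kept of each chunk; assumed value


@dataclass
class ChunkServer:
    name: str
    chunks: dict = field(default_factory=dict)  # chunk_id -> bytes


@dataclass
class Master:
    servers: list
    # file name -> list of (chunk_id, [names of servers holding a replica])
    metadata: dict = field(default_factory=dict)

    def write(self, filename, data):
        """Split data into chunks and place each chunk on REPLICAS servers."""
        entries = []
        for n, offset in enumerate(range(0, len(data), CHUNK_SIZE)):
            chunk_id = f"{filename}#{n}"
            # Round-robin placement (an assumption made for this sketch).
            targets = [self.servers[(n + r) % len(self.servers)] for r in range(REPLICAS)]
            for server in targets:
                server.chunks[chunk_id] = data[offset:offset + CHUNK_SIZE]
            entries.append((chunk_id, [server.name for server in targets]))
        self.metadata[filename] = entries

    def read(self, filename, unavailable=()):
        """Reassemble a file, skipping servers listed as unavailable."""
        by_name = {server.name: server for server in self.servers}
        parts = []
        for chunk_id, holders in self.metadata[filename]:
            holder = next(h for h in holders if h not in unavailable)
            parts.append(by_name[holder].chunks[chunk_id])
        return b"".join(parts)


servers = [ChunkServer("cs-a"), ChunkServer("cs-b"), ChunkServer("cs-c")]
master = Master(servers)
master.write("table.col", b"hello bigquery")

print(master.metadata["table.col"])
print(master.read("table.col"))
print(master.read("table.col", unavailable={"cs-a"}))
```

Because every chunk is held by two different servers, the file can still be read in full when one chunk server is marked as unavailable.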

In the following diagram, you can see how Google File System manages data:

Figure 1.8 – Google File System (Colossus) storage strategy

Now that we've learned how BigQuery handles large volumes of data, let's see how this data can be accessed by the compute layer.

Compute (query) layer

Fully decoupled from storage, the compute layer is responsible for receiving query statements from BigQuery users and executing them in the fastest way. The query engine is based on Dremel, a technology developed by Google and then published in a paper in 2010. This engine leverages a multi-level tree architecture:

The root node of the tree receives the query to execute.

The root node splits and distributes the query to other intermediate nodes named mixers.

Mixer nodes have the task of rewriting queries before passing them to the leaf nodes or to other intermediate mixer nodes.

Leaf nodes are responsible for parallelizing the reading of the chunks of data from Google File System.

When the right chunks of data are extracted from the filesystem, leaf nodes perform computations on the data and eventually shuffle them across other leaf nodes.

At the end of the computation, each leaf node produces a result that is returned to the parent node.

When all the results are returned to the root node, the outcome of the query is sent to the user or application that requested the execution.
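The flow through the tree can be sketched for a simple aggregation such as a sum of amounts grouped by country. This is a toy model, not the Dremel engine: the sample chunks, the function names, and the fan-out value are assumptions, query rewriting and shuffling are left out, and the nodes run sequentially here, whereas a real engine would run them in parallel on different machines.

```python
from collections import Counter

# Hypothetical chunks of (country, amount) rows, one list per storage chunk.
chunks = [
    [("IT", 10), ("DE", 5)],
    [("IT", 7), ("FR", 3)],
    [("DE", 1), ("FR", 4)],
    [("IT", 2), ("DE", 8)],
]


def leaf(chunk):
    """Leaf node: read one chunk and compute a partial SUM(amount) per country."""
    partial = Counter()
    for country, amount in chunk:
        partial[country] += amount
    return partial


def mixer(assigned_chunks):
    """Mixer node: hand each chunk to a leaf and merge the partial results."""
    merged = Counter()
    for chunk in assigned_chunks:
        merged.update(leaf(chunk))  # Counter.update adds the counts together
    return merged


def root(all_chunks, fan_out=2):
    """Root node: split the work across mixers and combine what they return."""
    result = Counter()
    for start in range(0, len(all_chunks), fan_out):
        result.update(mixer(all_chunks[start:start + fan_out]))
    return dict(result)


result = root(chunks)
print(result)
```

Each level only ever handles partial results that are much smaller than the raw data, which is what allows the final answer to be assembled at the root without moving every row up the tree.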

The execution process of a query on BigQuery based on the multi-level tree is represented in the following diagram:

Figure 1.9 – The BigQuery engine is a multi-level tree

Each node provides a number of processing units called BigQuery slots to execute the business logic of the query. A BigQuery slot can be considered a virtual CPU on a Dremel node. The calculation of the slots needed to perform a specific query is automatically managed by BigQuery depending on the complexity of the query and the impacted data volumes.

BigQuery's advantages over traditional data warehouses

Now that we've learned about the technical architecture underneath BigQuery, let's take a look at how this architecture translates into benefits for the enterprises that use it to become data-driven companies compared to other traditional on-premises data warehouses.

Serverless

As we have