Description

While AI has become an integral part of every organization today, the development of large-scale ML solutions and management of complex ML workflows in production continue to pose challenges for many. Google’s unified data and AI platform, Vertex AI, directly addresses these challenges with its array of MLOps tools designed for overall workflow management.
This book is a comprehensive guide that lets you explore Google Vertex AI’s easy-to-advanced level features for end-to-end ML solution development. Throughout this book, you’ll discover how Vertex AI empowers you by providing essential tools for critical tasks, including data management, model building, large-scale experimentations, metadata logging, model deployments, and monitoring. You’ll learn how to harness the full potential of Vertex AI for developing and deploying no-code, low-code, or fully customized ML solutions. This book takes a hands-on approach to developing and deploying some real-world ML solutions on Google Cloud, leveraging key technologies such as Vision, NLP, generative AI, and recommendation systems. Additionally, this book covers pre-built and turnkey solution offerings as well as guidance on seamlessly integrating them into your ML workflows.
By the end of this book, you’ll have the confidence to develop and deploy large-scale production-grade ML solutions using the MLOps tooling and best practices from Google.




The Definitive Guide to Google Vertex AI

Accelerate your machine learning journey with Google Cloud Vertex AI and MLOps best practices

Jasmeet Bhatia

Kartik Chaudhary

BIRMINGHAM – MUMBAI

The Definitive Guide to Google Vertex AI

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Niranjan Naikwadi

Publishing Product Manager: Sanjana Gupta

Book Project Manager: Hemangi Lotlikar

Senior Editor: Gowri Rekha

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Project Coordinator: Shambhavi Mishra

Proofreader: Safis Editing

Indexer: Pratik Shirodkar

Production Designer: Shankar Kalbhor

DevRel Marketing Executive (DRME): Vinishka Kalra

First published: December 2023

Production reference: 1211223

Published by

Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN 978-1-80181-526-0

www.packtpub.com

To my incredible wife, who gracefully navigates our cloud rivalry with love and patience. This book wouldn’t be possible without your constant support and encouragement. Thank you for tolerating my late-night writing and even later-night snacking. Your patience is my favorite superpower.

To my parents, who still can’t fully explain what I do for a living but are proud nonetheless – you’re the original algorithms of my life. Thank you for programming me with constant love, support, and the occasional necessary reboot!

To my wonderful daughters, without whom I would never have really understood why so many authors joke about their kids delaying their books. Now I do. Thank you for bringing immense joy and well-needed system shutdowns to my life.

And to my colleagues, the wizards of Google Cloud, who speak fluent Python and dream in code – without you, this book would just be a collection of funny error messages.

This book is dedicated to all of you. May our models always converge, and may we all never run out of GPUs!

– Jasmeet Bhatia

To my mother, Smt. Sarita Devi, and my father, Mr. Inderpal Singh, for their sacrifices, constant love, and never-ending support. Thank you for teaching me to believe in myself, in God, and in my dreams.

To my little brother, Chakit Gill, for continuous encouragement, support, and love. Thanks for being my best friend; I am really proud of you.

To my friends and colleagues for their inspiration, motivation, and always being there for me.

And, most importantly, to all the readers – I hope this book helps you with your goals, because that’s the real motivation behind writing this book and every single technical article that I share publicly on my blog.

– Kartik Chaudhary

Contributors

About the authors

Jasmeet Bhatia is a machine learning solution architect with over 18 years of industry experience, with the last 10 years focused on global-scale data analytics and machine learning solutions. In his current role at Google, he works closely with key GCP enterprise customers to provide guidance on how best to use Google’s cutting-edge machine learning products. At Google, he has also worked as part of the Area 120 incubator on building innovative data products such as Demand Signals, and he has been involved in the launch of Google products such as Time Series Insights. Before Google, he worked in similar roles at Microsoft and Deloitte.

When not immersed in technology, he loves spending time with his wife and two daughters, reading books, watching movies, and exploring the scenic trails of southern California.

He holds a bachelor’s degree in electronics engineering from Jamia Millia Islamia University in India and an MBA from the University of California Los Angeles (UCLA) Anderson School of Management.

Kartik Chaudhary is an AI enthusiast, educator, and ML professional with 6+ years of industry experience. He currently works as a senior AI engineer at Google, designing and architecting ML solutions for Google’s strategic customers, leveraging core Google products, frameworks, and AI tools. He previously worked at UHG as a data scientist, helping make the healthcare system work better for everyone. Kartik has filed nine patents at the intersection of AI and healthcare.

Kartik loves sharing knowledge and runs his own blog on AI, titled Drops of AI.

Away from work, he loves watching anime and movies and capturing the beauty of sunsets.

I would like to thank my parents, my brother, and my friends for their constant love and support.

About the reviewers

Surya Tripathi is a seasoned data scientist with nearly nine years of expertise in data science, analysis, and data engineering. He holds a bachelor’s degree in electronics and communications engineering and a master’s in applied mathematics from Liverpool John Moores University. He is proficient in cloud platforms (GCP, Azure, AWS, and IBM) and has extensive GCP experience, delivering ML solutions in CPG, healthcare, banking, and supply chain. Involved in the full data science life cycle, he excels in requirement gathering, data analysis, model development, and MLOps. With experience in both consulting and product companies, he is currently affiliated with a top consulting firm, where his primary focus areas include generative AI and demand forecasting.

Gopala Dhar has worked with and implemented state-of-the-art technology in the field of AI to solve real-world business use cases at scale. He has four published patents to his name, ranging from the field of software design to hardware manufacturing, including embedded systems. His latest stint is at Google as an AI engineer. His areas of expertise include ML, ML system design, reinforcement learning, and, most recently, generative AI. He shares what he learns frequently through blog posts and open source contributions. He has won several awards from various academic as well as professional institutions, including the Indian Institute of Technology in Mumbai, the Indian Institute of Management in Bangalore, Texas Instruments, and Google.

Lakshmanan Sethu Sankaranarayanan is an award-winning AI/ML cloud industry leader in the fields of data, AI, and ML. He helps enterprise customers migrate to Google Cloud, Azure, and AWS. He serves on the Technical Advisory Board for AI/ML solutions at Packt, and he is also a technical editor for Packt and O’Reilly. He has been honored with three LinkedIn Top Voice awards for his contributions to AI/ML and cloud computing, and he has earned four Microsoft Most Valuable Professional awards for his outstanding contributions to the cloud community.

I would like to acknowledge my wife, Dhivya, and my kids, Sanjana and Saisasthik, for being a constant source of support and encouragement throughout this book-reviewing journey.

Chetan Apsunde is an experienced software engineer, specializing in conversational AI and machine learning with a robust nine-year IT background. He works with Google to build cloud solutions using CCAI and GenAI. He is passionate about creating intelligent, user-centric solutions at the intersection of technology and human interaction.

Table of Contents

Preface

Part 1: The Importance of MLOps in a Real-World ML Deployment

1

Machine Learning Project Life Cycle and Challenges

ML project life cycle

Common challenges in developing real-world ML solutions

Data collection and security

Non-representative training data

Poor quality of data

Underfitting the training dataset

Overfitting the training dataset

Infrastructure requirements

Limitations of ML

Data-related concerns

Deterministic nature of problems

Lack of interpretability and reproducibility

Concerns related to cost and customizations

Ethical concerns and bias

Summary

2

What Is MLOps, and Why Is It So Important for Every ML Team?

Why is MLOps important?

Implementing different MLOps maturity levels

MLOps maturity level 0

MLOps maturity level 1 – automating basic ML steps

MLOps maturity level 2 – automated model deployments

How can Vertex AI help with implementing MLOps?

Summary

Part 2: Machine Learning Tools for Custom Models on Google Cloud

3

It’s All About Data – Options to Store and Transform ML Datasets

Moving data to Google Cloud

Google Cloud Storage Transfer tools

BigQuery Data Transfer Service

Storage Transfer Service

Transfer Appliance

Where to store data

GCS – object storage

BQ – data warehouse

Transforming data

Ad hoc transformations within Jupyter Notebook

Cloud Data Fusion

Dataflow pipelines for scalable data transformations

Summary

4

Vertex AI Workbench – a One-Stop Tool for AI/ML Development Needs

What is Jupyter Notebook?

Getting started with Jupyter Notebook

Vertex AI Workbench

Getting started with Vertex AI Workbench

Custom containers for Vertex AI Workbench

Scheduling notebooks in Vertex AI

Configuring notebook executions

Summary

5

No-Code Options for Building ML Models

ML modeling options in Google Cloud

What is AutoML?

Vertex AI AutoML

How to create a Vertex AI AutoML model using tabular data

Importing data to use with Vertex AI AutoML

Training the AutoML model for tabular/structured data

Generating predictions using the recently trained model

Deploying a model in Vertex AI

Generating predictions

Generating predictions programmatically

Summary

6

Low-Code Options for Building ML Models

What is BQML?

Getting started with BigQuery

Using BQML for feature transformations

Manual preprocessing

Building ML models with BQML

Creating BQML models

Hyperparameter tuning with BQML

Evaluating trained models

Doing inference with BQML

User exercise

Summary

7

Training Fully Custom ML Models with Vertex AI

Technical requirements

Building a basic deep learning model with TensorFlow

Experiment – converting black-and-white images into color images

Packaging a model to submit it to Vertex AI as a training job

Monitoring model training progress

Evaluating trained models

Summary

8

ML Model Explainability

What is Explainable AI and why is it important for MLOps practitioners?

Building trust and confidence

Explainable AI techniques

Global versus local explainability

Techniques for image data

Techniques for tabular data

Techniques for text data

Explainable AI features available in Google Cloud Vertex AI

Feature-based explanation techniques available on Vertex AI

Using the model feature importance (SHAP-based) capability with AutoML for tabular data

Exercise 1

Exercise 2

Example-based explanations

Key steps to use example-based explanations

Exercise 3

Summary

References

9

Model Optimizations – Hyperparameter Tuning and NAS

Technical requirements

What is HPT and why is it important?

What are hyperparameters?

Why HPT?

Search algorithms

Setting up HPT jobs on Vertex AI

What is NAS and how is it different from HPT?

Search space

Optimization method

Evaluation method

NAS on Vertex AI overview

NAS best practices

Summary

10

Vertex AI Deployment and Automation Tools – Orchestration through Managed Kubeflow Pipelines

Technical requirements

Orchestrating ML workflows using Vertex AI Pipelines (managed Kubeflow pipelines)

Developing Vertex AI Pipeline using Python

Pipeline components

Orchestrating ML workflows using Cloud Composer (managed Airflow)

Creating a Cloud Composer environment

Vertex AI Pipelines versus Cloud Composer

Getting predictions on Vertex AI

Getting online predictions

Getting batch predictions

Managing deployed models on Vertex AI

Multiple models – single endpoint

Single model – multiple endpoints

Compute resources and scaling

Summary

11

MLOps Governance with Vertex AI

What is MLOps governance and what are its key components?

Data governance

Model governance

Enterprise scenarios that highlight the importance of MLOps governance

Scenario 1 – limiting bias in AI solutions

Scenario 2 – the need to constantly monitor shifts in feature distributions

Scenario 3 – the need to monitor costs

Scenario 4 – monitoring how the training data is sourced

Tools in Vertex AI that can help with governance

Model Registry

Metadata Store

Feature Store

Vertex AI pipelines

Model Monitoring

Billing monitoring

Summary

References

Part 3: Prebuilt/Turnkey ML Solutions Available in GCP

12

Vertex AI – Generative AI Tools

GenAI fundamentals

GenAI versus traditional AI

Types of GenAI models

Challenges of GenAI

LLM evaluation

GenAI with Vertex AI

Understanding foundation models

What is a prompt?

Using Vertex AI GenAI models through GenAI Studio

Example 1 – using GenAI Studio language models to generate text

Example 2 – submitting examples along with the text prompt in structured format to get generated output in a specific format

Example 3 – generating images using GenAI Studio (Vision)

Example 4 – generating code samples

Building and deploying GenAI applications with Vertex AI

Enhancing GenAI performance with model tuning in Vertex AI

Using Vertex AI supervised tuning

Safety filters for generated content

Summary

References

13

Document AI – An End-to-End Solution for Processing Documents

Technical requirements

What is Document AI?

Document AI processors

Overview of existing Document AI processors

Using Document AI processors

Creating custom Document AI processors

Summary

14

ML APIs for Vision, NLP, and Speech

Vision AI on Google Cloud

Vision AI

Video AI

Translation AI on Google Cloud

Cloud Translation API

AutoML Translation

Translation Hub

Natural Language AI on Google Cloud

AutoML for Text Analysis

Natural Language API

Healthcare Natural Language API

Speech AI on Google Cloud

Speech-to-Text

Text-to-Speech

Summary

Part 4: Building Real-World ML Solutions with Google Cloud

15

Recommender Systems – Predict What Movies a User Would Like to Watch

Different types of recommender systems

Real-world evaluation of recommender systems

Deploying a movie recommender system on Vertex AI

Data preparation

Model building

Local model testing

Deploying the model on Google Cloud

Using the model for inference

Summary

References

16

Vision-Based Defect Detection System – Machines Can See Now!

Technical requirements

Vision-based defect detection

Dataset

Importing useful libraries

Loading and verifying data

Checking a few samples

Data preparation

Splitting data into train and test

Final preparation of training and testing data

TF model architecture

Compiling the model

Training the model

Plotting the training progress

Results

Deploying a vision model to a Vertex AI endpoint

Saving model to Google Cloud Storage (GCS)

Uploading the TF model to the Vertex Model Registry

Creating a Vertex AI endpoint

Deploying a model to the Vertex AI endpoint

Getting online predictions from a vision model

Summary

17

Natural Language Models – Detecting Fake News Articles!

Technical requirements

Detecting fake news using NLP

Fake news classification with random forest

About the dataset

Importing useful libraries

Reading and verifying the data

NULL value check

Combining title and text into a single column

Cleaning and pre-processing data

Separating the data and labels

Converting text into numeric data

Splitting the data

Defining the random forest classifier

Training the model

Predicting the test data

Checking the results/metrics on the test dataset

Confusion matrix

Launching model training on Vertex AI

Setting configurations

Initializing the Vertex AI SDK

Defining the Vertex AI training job

Running the Vertex AI job

BERT-based fake news classification

BERT for fake news classification

Importing useful libraries

The dataset

Data preparation

Splitting the data

Creating data loader objects for batching

Loading the pre-trained BERT model

Scheduler

Training BERT

Loading model weights for evaluation

Calculating the accuracy of the test dataset

Classification report

Summary

Index

Other Books You May Enjoy

Preface

Hello there! The Definitive Guide to Google Vertex AI is a comprehensive guide on accelerating the development and deployment of real-world ML solutions, with the help of the frameworks and best practices offered by Google as part of Vertex AI within Google Cloud.

Developing large-scale ML solutions and managing ML workflows in production are important for every business nowadays. Google has developed a unified data and AI platform, called Google Vertex AI, to help accelerate your ML journey, along with MLOps tools for workflow management.

This book is a complete guide that lets you explore all the features of Google Vertex AI, from beginner to advanced level, for end-to-end ML solution development. Starting from data management, model building, and experimentation to deployment, the Vertex AI platform provides you with tooling for no-code and low-code as well as fully customized approaches.

This book also provides a hands-on guide to developing and deploying some real-world applications on Google Cloud Platform, using technologies such as computer vision, NLP, and generative AI. Additionally, this book discusses some prebuilt/turnkey solution offerings from Google and shows you how to quickly integrate them into ML projects.

Who this book is for

If you are a machine learning practitioner who wants to learn about the end-to-end ML solution development journey on Google Cloud Platform, using the MLOps best practices and tools offered by Google Vertex AI, this book is for you. Starting from data storage and data management, this book takes you through the Vertex AI offerings to build, experiment, optimize, and deploy ML solutions in a fast and scalable way. It also covers topics related to scaling, monitoring, and governing your ML workloads with the help of MLOps tooling on Google Cloud.

What this book covers

Chapter 1, Machine Learning Project Life Cycle and Challenges, provides an introduction to a typical ML project’s life cycle. It also highlights the common challenges and limitations of developing ML solutions for real-world use cases.

Chapter 2, What Is MLOps, and Why Is It So Important for Every ML Team? covers a set of practices usually known as MLOps that mature ML teams use as part of their ML development life cycle.

Chapter 3, It’s All about Data – Options to Store and Transform ML Datasets, provides an overview of the different options available for storing data and analyzing data in Google Cloud. It also helps you to choose the best option based on your requirements.

Chapter 4, Vertex AI Workbench – a One-Stop Tool for AI/ML Development Needs, demonstrates the use of a Vertex AI Workbench-based notebook environment for end-to-end ML solution development.

Chapter 5, No-Code Options for Building ML Models, covers GCP AutoML capabilities that can help users build state-of-the-art ML models, without the need for code or deep data science knowledge.

Chapter 6, Low-Code Options for Building ML Models, covers how to use BigQuery ML (BQML) to build and evaluate ML models using just SQL.

Chapter 7, Training Fully Custom ML Models with Vertex AI, explores how to develop fully customized ML solutions using the Vertex AI tooling available on Google Cloud. This chapter also shows you how to monitor training progress and evaluate ML models.

Chapter 8, ML Model Explainability, discusses concepts around ML model explainability and describes how to effectively incorporate explainable models into your ML solutions, using Vertex AI.

Chapter 9, Model Optimizations – Hyperparameter Tuning and NAS, explains the need for model optimization. It also covers two model optimization frameworks in detail – hyperparameter tuning and Neural Architecture Search (NAS).

Chapter 10, Vertex AI Deployment and Automation Tools – Orchestration through Managed Kubeflow Pipelines, provides an overview of ML orchestrations and automation tools. This chapter further covers the implementation examples of ML workflow orchestration, using Cloud Composer and Vertex AI pipelines.

Chapter 11, MLOps Governance with Vertex AI, describes the different Google Cloud ML tools that can be used to deploy governance and monitoring controls.

Chapter 12, Vertex AI – Generative AI Tools, provides an overview of Vertex AI’s recently launched generative AI features, such as Model Garden and Generative AI Studio.

Chapter 13, Document AI – an End-to-End Solution for Processing Documents, provides an overview of the document processing-related offerings on Google Cloud, such as OCR and Form Parser. This chapter also shows how to combine prebuilt and custom document processing solutions to develop a custom document processor.

Chapter 14, ML APIs for Vision, NLP, and Speech, provides an overview of the prebuilt state-of-the-art solutions from Google for computer vision, NLP, and speech-related use cases. It also shows you how to integrate them to solve real-world problems.

Chapter 15, Recommender Systems – Predict What Movies a User Would Like to Watch, provides an overview of popular approaches to building recommender systems and how to deploy one using Vertex AI.

Chapter 16, Vision-Based Defect Detection System – Machines Can See Now, shows you how to develop end-to-end computer vision-based custom solutions using Vertex AI tooling on Google Cloud, enabling you to solve real-world use cases.

Chapter 17, Natural Language Models – Detecting Fake News Articles, shows you how to develop NLP-related, end-to-end custom ML solutions on Google Cloud. This chapter explores a classical as well as a deep learning-based approach to solving the problem of detecting fake news articles.

To get the most out of this book

You will need to have a basic understanding of machine learning and deep learning techniques. You should also have beginner-level experience with the Python programming language.

Software/hardware for the coding exercises | Operating system requirements
Python 3.8 or later | Windows, macOS, or Linux
Google Cloud SDK | Windows, macOS, or Linux
A Google Cloud Platform account | N/A

To ensure that you are using the correct Python library versions while executing the code samples, you can check out the GitHub repository of this book, where the code example notebooks also contain the version information.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/The-Definitive-Guide-to-Google-Vertex-AI. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “By default, the Jupyter server starts on port 8888, but if this port is unavailable, it finds the next available port.”

A block of code is set as follows:

export PROJECT=$(gcloud config list project --format "value(core.project)")
docker build . -f Dockerfile.example -t "gcr.io/${PROJECT}/tf-custom:latest"
docker push "gcr.io/${PROJECT}/tf-custom:latest"

Any command-line input or output is written as follows:

$ mkdir css
$ cd css

Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “In the Environment field, select Custom Container.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read The Definitive Guide to Google Vertex AI, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there – you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/978-1-80181-526-0

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly.

Part 1:The Importance of MLOps in a Real-World ML Deployment

In this part, you will get an overview of the life cycle of a typical real-world machine learning (ML) project. You will also learn about common challenges encountered during the development of ML applications and some key limitations of an ML framework. Finally, you will learn about the machine learning operations (MLOps) practice and its importance in ML deployments.

This part has the following chapters:

Chapter 1, Machine Learning Project Life Cycle and Challenges

Chapter 2, What Is MLOps, and Why Is It So Important for Every ML Team?

1

Machine Learning Project Life Cycle and Challenges

Today, machine learning (ML) and artificial intelligence (AI) are integral parts of business strategy for many organizations, and more organizations are using them every year. The major reason for this adoption is the power of ML and AI solutions to garner more revenue, brand value, and cost savings. This increase in the adoption of AI and ML demands more skilled data and ML specialists and technical leaders. If you are an ML practitioner or beginner, this book will help you become a confident ML engineer or data scientist with knowledge of Google’s best practices. In this chapter, we will discuss the basics of the life cycle and the challenges and limitations of ML when developing real-world applications.

ML projects often involve a defined set of steps from problem statements to deployments. It is essential to understand the importance and common challenges involved with these steps to complete a successful and impactful project. In this chapter, we will discuss the importance of understanding the business problem, the common steps involved in a typical ML project life cycle, and the challenges and limitations of ML in detail. This will help new ML practitioners understand the basic project flow; plus, it will help create a foundation for forthcoming chapters in this book.

This chapter covers the following topics:

ML project life cycle
Common challenges in developing real-world ML solutions
Limitations of ML

ML project life cycle

In this section, we will learn about the typical life cycle of an ML project, from defining the problem to model development, and finally, to the operationalization of the model. Figure 1.1 shows the high-level steps almost every ML project goes through. Let’s go through all these steps in detail.

Figure 1.1 – Life cycle of a typical ML project

Just like the Software Development Life Cycle (SDLC), the Machine Learning Project/Development Lifecycle (MDLC) guides the end-to-end process of ML model development and operationalization. At a high level, the life cycle of a typical ML project in an enterprise setting remains somewhat consistent and includes eight key steps:

Define the ML use case: The first step of any ML project is where the ML team works with business stakeholders to assess the business needs around predictive analytics and identifies a use case where ML can be used, along with some success criteria, performance metrics, and possible datasets that can be used to build the models.

For example, if the sales/marketing department of an insurance company called ABC Insurance Inc. wants to better utilize its resources to target customers who are more likely to buy a certain product, they might approach the ML team to build a solution that can sift through all possible leads/customers and, based on the data points for each lead (age, prior purchase, length of policy history, income level, etc.), identify the customers who are most likely to buy a policy. Then the sales team can ask their customer representatives to prioritize reaching out to these customers instead of calling all possible customers blindly. This can significantly improve the outcome of outbound calls by the reps and improve the sales-related KPIs.

Once the use case is defined, the next step is to define a set of KPIs to measure the success of the solution. For this sales use case, this could be the customer sign-up rate—what percentage of the customers whom sales reps talk to sign up for a new insurance policy?

To measure the effectiveness of the ML solution, the sales team and the ML team might agree to measure the increase or decrease in customer sign-up rate once the ML model is live and iteratively improve on the model to optimize the sign-up rate.

At this stage, there will also be a discussion about the possible datasets that can be utilized for the model training. These could include the following:

Internal customer/product datasets being generated by marketing and sales teams, for example, customer metadata, such as their age, education profile, income level, prior purchase behavior, number and type of vehicles they own, etc.

External datasets that can be acquired through third parties; for example, an external marketing consultancy might have collected data about the insurance purchase behavior of customers based on the car brand they own. This additional data can be used to predict how likely they are to purchase the insurance policy being sold by ABC Insurance Inc.

Explore/analyze data: The next step is to do a detailed analysis of the datasets. This is usually an iterative process in which the ML team works closely with the data and business SMEs to better understand the nuances of the available datasets, including the following:

Data sources
Data granularity
Update frequency
Description of individual data points and their business meaning

This is a key step where data scientists/ML engineers analyze the available data and decide what datasets might be relevant to the ML solution being considered, analyze the robustness of the data, and identify any gaps. Issues that the team might identify at this stage could relate to the cleanliness and completeness of data or problems with the timely availability of the data in production. For example, the age of the customer could be a great indicator of their purchase behavior, but if it’s an optional field in the customer profile, only a handful of customers might have provided their date of birth or age.

So, the team would need to figure out if they want to use the field and, if so, how to handle the samples where age is missing. They could also work with sales and marketing teams to make the field a required field whenever a new customer requests an insurance quote online and generates a lead in the system.

Select ML model type: Once the use case has been identified along with the datasets that can possibly be used to train the model, the next step is to consider the types of models that can be used to achieve the requirements. We won’t go too deep into the topic of general model selection here since entire books could be written on the topic, but in the next few chapters, you will see what different model types can be built for specific use cases in Vertex AI. At a very high level, the key considerations at this stage are as follows:

Type of model: For example, for the insurance customer/lead ranking example, we could build a classification model that will predict whether a new customer is high/medium/low in terms of their likelihood to purchase a policy. Or a regression model could be built to output a sales probability number for each likely customer.

Does a conventional ML model satisfy our requirements, or do we need a deep learning model?

Explainability requirements: Does the use case require an explanation for each prediction as to why the sample was classified a certain way?

Single versus ensemble model: Do we need a single model to give us the final prediction, or do we need to employ a set of interconnected models that feed into each other? For example, a first model might assign a customer to a particular customer group, and the next model might use that grouping to identify the final likelihood of purchase.

Separation of models: For example, sometimes we might build a single global model for the entire customer base, or we might need separate models for each region due to significant differences in products and user behavior in different regions.

Feature engineering: This process is usually the most time-consuming and involves several steps:

Data cleanup – Imputing missing values where possible, dropping fields with too many missing values
Data and feature augmentation – Joining datasets to bring in additional fields, and cross-joining existing features to generate new features
Feature analysis – Calculating feature correlation and analyzing collinearity, checking for data leakage in features
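To make these feature engineering steps concrete, here is a minimal pandas sketch for the hypothetical ABC Insurance lead data; the column names, values, and thresholds are illustrative assumptions, not code from this book:

import pandas as pd

# Hypothetical lead/customer data for the ABC Insurance example;
# the column names and values are illustrative only.
leads = pd.DataFrame({
    "age": [34, None, 52, 41, None, 29],
    "income_level": [55000, 72000, None, 61000, 48000, 39000],
    "prior_purchases": [1, 0, 3, 2, 0, 1],
    "purchased_policy": [1, 0, 1, 1, 0, 0],
})

# Data cleanup: impute missing ages with the median, and drop columns
# that are mostly empty (the 70% threshold is an arbitrary choice here).
leads["age"] = leads["age"].fillna(leads["age"].median())
leads = leads.dropna(axis="columns", thresh=int(0.7 * len(leads)))

# Feature augmentation: derive a new feature by crossing existing ones.
leads["purchases_per_decade"] = leads["prior_purchases"] / (leads["age"] / 10)

# Feature analysis: inspect correlation with the target to spot
# collinearity or potential data leakage.
print(leads.corr(numeric_only=True)["purchased_policy"])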

Again, since this is an extremely broad topic, we are not diving too deep into it and suggest you refer to other books on this topic.

Iterate over the model design/build: The actual design and build of the ML model is an iterative process involving the following key steps:

Select model architecture
Split acquired data into train/validation/test subsets
Run model training experiments, tune hyperparameters
Evaluate trained models with the test dataset
Rank and select the best models
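As a minimal illustration of this loop, the following sketch (using scikit-learn and a synthetic dataset as stand-ins) splits the data, sweeps one hyperparameter, ranks candidates on the validation set, and reports the winner’s score on the held-out test set:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the acquired dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split into train/validation/test subsets (60/20/20 here; ratios vary by project).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# Run training experiments over a small hyperparameter grid,
# ranking candidates by validation accuracy.
results = []
for n_trees in [50, 100, 200]:
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    model.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    results.append((val_acc, n_trees, model))

# Select the best model and report its final score on the held-out test set.
best_acc, best_n, best_model = max(results, key=lambda r: r[0])
print(f"best n_estimators={best_n}, test accuracy="
      f"{accuracy_score(y_test, best_model.predict(X_test)):.3f}")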

Figure 1.2 shows the typical ML model development life cycle:

Figure 1.2 – ML model development life cycle

Consensus on results: Once a satisfactory model has been obtained, the ML team shares the results with the business stakeholders to ensure the results fully align with the business needs and performs additional optimizations and post-processing steps to make the model predictions usable by the business. To assure business stakeholders that the ML solution is aligned with the business goals and is accurate enough to drive value, ML teams could use one of a number of approaches:

Evaluate using historical test datasets: ML teams can run historical data through the new ML models and evaluate the predictions against the ground truth values. For example, in the insurance use case discussed previously, the ML team can take last month’s data on customer leads and use the ML model to predict which customers are most likely to purchase a new insurance policy. Then they can compare the model’s predictions against the actual purchase history from the previous month and see how accurate the model’s predictions were. If the model’s output is close to the real purchase behavior of customers, then the model is working as desired, and this information can be presented to business stakeholders to convince them of the ML solution’s efficacy in driving additional revenue. On the contrary, if the model’s output significantly deviates from the customers’ behavior, the ML team needs to go back and work on improving the model. This usually is an iterative process and can take a number of iterations, depending on the complexity of the model; a minimal backtest sketch follows this discussion.

Evaluate with live data: In some scenarios, an organization might decide to conduct a small pilot in a production environment with real-time data to assess the performance of the new ML model. This is usually done in the following scenarios:

When there is no historical data available to conduct the evaluation, or where testing with historical data is not expected to be accurate; for example, during the onset of COVID, customer behavior patterns abruptly changed to the extent that testing with any historical data became nearly useless

When there is an existing model in production being used for critical real-time predictions, and the sanity check for the new model needs to be performed not just in terms of its accuracy but also its subtle impact on downstream KPIs such as revenue per user session

In such cases, teams might deploy the model in production, divert a small number of prediction requests to the newer model, and periodically compare the overall impact on the KPIs. For example, in the case of a recommendation model deployed on an e-commerce website, the new model might start recommending products that are comparatively cheaper than the predictions from the older model already live in production. In this scenario, the likelihood of a customer completing a purchase would go up, but at the same time, the revenue generated per user session would decrease, impacting overall revenue for the organization. So, although it might seem like the ML model is working as designed, it might not be considered a success by the business/sales stakeholders, and more discussions would be required to optimize it.
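Mechanically, the historical evaluation described above boils down to comparing stored predictions against ground truth labels. Here is a minimal sketch, assuming hypothetical data for the insurance example:

import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical backtest for the insurance example: the model's predicted
# "likely to purchase" flags versus what the leads actually did last month.
history = pd.DataFrame({
    "predicted_purchase": [1, 0, 1, 1, 0, 1, 0, 0],
    "actual_purchase":    [1, 0, 0, 1, 0, 1, 0, 1],
})

# If the predictions track last month's real behavior closely, the model is
# working as desired; large deviations send the team back to iterate.
print(classification_report(history["actual_purchase"],
                            history["predicted_purchase"]))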

Operationalize model: Once the model has been approved for deployment in production, the ML team will work with their organization’s IT and data engineering teams to deploy the model so that other applications can start utilizing it to generate insights. Depending on the size of the organization, there can be significant overlap in the roles these teams play.

The actual deployment architecture would depend on the following:

Prediction SLAs – Ranging from periodic batch jobs to solutions that require sub-second prediction performance
Compliance requirements – Can the user data be sent to third-party cloud providers, or does it need to always reside within an organization’s data centers?
Infrastructure requirements – This depends on the size of the model and its compute requirements. Small models can be served from a shared compute node. Some large models might need a large GPU-connected node.
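For a taste of what this looks like in practice on Google Cloud, here is a minimal, illustrative sketch using the Vertex AI Python SDK (google-cloud-aiplatform); the project ID, bucket path, and display name are placeholders, and the serving container shown is just one of the prebuilt images:

from google.cloud import aiplatform

# Placeholder project and region; substitute your own values.
aiplatform.init(project="my-gcp-project", location="us-central1")

# Register a trained model artifact, then deploy it behind an endpoint
# with autoscaling bounds sized to the expected prediction traffic.
model = aiplatform.Model.upload(
    display_name="insurance-lead-ranker",
    artifact_uri="gs://my-bucket/models/lead-ranker/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)
endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=3,
)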

We will discuss this topic in detail in later chapters, but the following figure shows some key components you might consider as part of your deployment architecture.

Figure 1.3 – Key components of ML model training and deployment

Monitor and retrain: It might seem as if the ML team’s job is done once the model has been operationalized, but in real-world deployments, most models require periodic or sometimes constant monitoring to ensure the model is operating within the required performance thresholds. Model performance could become sub-optimal for several reasons:

Data drift: The data being used to generate predictions could change significantly and impact the model’s performance. As we discussed before, during COVID, customer behavior changed significantly. Models that were trained on pre-COVID customer behavior data were not equipped to handle this sudden change in usage patterns. Change due to the pandemic was relatively rare but high-impact, but there are plenty of other smaller changes in prediction input data that might impact your model’s performance adversely. The impact could range from a subtle drop in accuracy to a model generating erroneous responses. So, it is important to keep an eye on the key performance metrics of your ML solution.

Changes in prediction request volume: If your solution was designed to handle 100 requests per second but is now seeing periodic bursts in traffic of around 1,000 requests per second, your solution might not be able to keep up with the demand, or latency might go above acceptable levels. So, your solution also needs to have monitoring and certain levels of auto-scaling built in to handle such scenarios. For larger changes in traffic volumes, you might even need to completely rethink the serving architecture.

There will be scenarios where, through monitoring, you discover that your ML model no longer meets the required prediction accuracy and needs retraining. If the change in data patterns is expected, the ML team should design the solution to support automatic periodic retraining. For example, in the retail industry, product catalogs, pricing, and promotions constantly evolve, requiring regular retraining of the models. In other scenarios, the change might be gradual or unexpected, and when the monitoring system alerts the ML team of model performance degradation, they need to decide whether to retrain the model with more recent data, or maybe even completely rebuild the model with new features.
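One simple, illustrative drift signal is a statistical comparison of a feature’s training distribution against its recent serving distribution. The sketch below uses a two-sample Kolmogorov-Smirnov test on synthetic data; the alert threshold is an assumption, and production systems typically rely on dedicated tooling such as Vertex AI Model Monitoring:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical distributions of one feature (e.g., customer age):
# what the model was trained on versus recent prediction traffic.
training_ages = rng.normal(loc=45, scale=10, size=5000)
serving_ages = rng.normal(loc=38, scale=12, size=5000)  # shifted population

# A two-sample Kolmogorov-Smirnov test is one simple drift signal;
# a small p-value suggests the serving data no longer matches training data.
stat, p_value = ks_2samp(training_ages, serving_ages)
if p_value < 0.01:  # the alert threshold is a judgment call per use case
    print(f"Possible data drift detected (KS statistic={stat:.3f}); "
          "consider retraining on recent data.")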

Now that we have a good idea of the life cycle of an ML project, let’s learn about some of the common challenges faced by ML developers when creating and deploying ML solutions.

Common challenges in developing real-world ML solutions

A real-world ML project is always filled with unexpected challenges that surface at different stages. The main reason for this is that neither real-world data nor ML algorithms are perfect. Though these challenges hamper the performance of the overall ML setup, they don’t prevent us from creating a valuable ML application. In a new ML project, it is difficult to know the challenges up front; they are often discovered during different stages of the project. Some of these challenges are not obvious and require skilled or experienced ML practitioners (or data scientists) to identify them and apply countermeasures to reduce their effect.

In this section, we will understand some of the common challenges encountered during the development of a typical ML solution. The following list shows some common challenges we will discuss in more detail:

Data collection and security
Non-representative training data
Poor quality of data
Underfitting the training dataset
Overfitting the training dataset
Infrastructure requirements

Now, let’s learn about each of these common challenges in detail.

Data collection and security

One of the most common challenges that organizations face is data availability. ML algorithms require a large amount of good-quality data in order to provide quality results. Thus, the availability of raw data is critical for a business if it wants to implement ML. Even when raw data is available, gathering it is not the only concern; we often need to transform or process the data into a form that our ML algorithm supports.

Data security is another important challenge that ML developers face very frequently. When we get data from a company, it is essential to differentiate between sensitive and non-sensitive information to implement ML correctly and efficiently. The sensitive part of the data needs to be stored in fully secured servers (storage systems) and should always be kept encrypted. For security purposes, direct use of sensitive data should be avoided, and trusted team members working on the project should be given access only to the less-sensitive data. If the data contains Personally Identifiable Information (PII), it can still be used after proper anonymization.
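As a minimal sketch of one anonymization approach, the snippet below replaces a direct identifier with a salted one-way hash before the data is shared more widely; the column names are hypothetical, and a real project should use vetted de-identification tooling and proper secret management:

import hashlib
import pandas as pd

# The salt must come from a secret manager in practice, never source code.
SALT = "load-from-a-secret-manager-not-source-code"

def pseudonymize(value: str) -> str:
    # Replace a direct identifier with a salted one-way hash.
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

customers = pd.DataFrame({"email": ["ann@example.com", "bob@example.com"],
                          "age": [34, 52]})
customers["customer_key"] = customers["email"].map(pseudonymize)
customers = customers.drop(columns=["email"])  # keep only non-sensitive fields
print(customers)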

Non-representative training data

A good ML model is one that performs equally well on unseen data and training data. This is only possible when your training data is a good representation of most possible business scenarios. Sometimes, when the dataset is small, it may not be a true representative of the inherent distribution, and the resulting model may provide inaccurate predictions on unseen datasets despite having high-quality results on the training dataset. This kind of non-representative data is either the result of sampling bias or the unavailability of data. Thus, an ML model trained on such a non-representative dataset may have less value when it is deployed in production.

If it is impossible to get a true representative training dataset for a business problem, then it’s better to limit the scope of the problem to only the scenarios for which we have a sufficient amount of training samples. In this way, we will only get known scenarios in the unseen dataset, and the model should provide quality predictions. Sometimes, the data related to a business problem keeps changing with time, and it may not be possible to develop a single static model that works well; in such cases, continuous retraining of the model on the latest data becomes essential.

Poor quality of data

The performance of ML algorithms is very sensitive to the quality of training samples. A small number of outliers, missing data cases, or some abnormal scenarios can affect the quality of the model significantly. So, it is important to treat such scenarios carefully while analyzing the data before training any ML algorithm. There are multiple methods for identifying and treating outliers; the best method depends upon the nature of the problem and the data itself. Similarly, there are multiple ways of treating the missing values as well. For example, mean, median, mode, and so on are some frequently used methods to fill in missing data. If the training data size is sufficiently large, dropping a small number of rows with missing values is also a good option.
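The following pandas sketch illustrates these treatments on a toy table; the clipping quantile and the choice of median versus mean imputation are arbitrary examples, and the right treatment always depends on the data:

import pandas as pd

# Illustrative cleanup of a noisy training table; values are made up.
df = pd.DataFrame({"income": [48000, 52000, None, 61000, 9_900_000],
                   "age": [29, None, 41, 52, 34]})

# Treat extreme outliers by clipping to an upper quantile.
upper = df["income"].quantile(0.95)
df["income"] = df["income"].clip(upper=upper)

# Fill missing values: median for income, mean for age
# (mode is another common choice for categorical fields).
df["income"] = df["income"].fillna(df["income"].median())
df["age"] = df["age"].fillna(df["age"].mean())

# With a large enough dataset, simply dropping sparse rows is also viable:
# df = df.dropna()
print(df)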

As discussed, the quality of the training dataset is important if we want our ML system to learn accurately and provide quality results on the unseen dataset. It means that the data pre-processing part of the ML life cycle should be taken very seriously.

Underfitting the training dataset

Underfitting an ML model means that the model is too simple to learn the inherent information or structure of the training dataset. It may occur when we try to fit a non-linear distribution using a linear ML algorithm such as linear regression. Underfitting may also occur when we utilize only a minimal set of features (that may not have much information about the target distribution) while training the model. This type of model can be too simple to learn the target distribution. An underfitted model learns too little from the training data and, thus, makes mistakes on unseen or test datasets.

There are multiple ways to tackle the problem of underfitting. Here is a list of some common methods:

Feature engineering – Add more features that represent the target distribution
Non-linear algorithms – Switch to a non-linear algorithm if the target distribution is not linear
Removing noise from the data
Add more power to the model – Increase trainable parameters, and increase the depth or number of trees in tree-based ensembles
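The first two remedies can be illustrated with a small scikit-learn sketch: on a non-linear synthetic dataset, a plain linear model underfits, while adding polynomial feature crosses gives it enough capacity. This is a sketch, not a recipe; the right fix depends on the data:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_friedman1

# Non-linear target: a plain linear model underfits it,
# so even the training score stays low.
X, y = make_friedman1(n_samples=500, random_state=0)

linear = LinearRegression().fit(X, y)
print(f"linear fit R^2: {linear.score(X, y):.3f}")

# Adding non-linear feature crosses gives the model more capacity.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print(f"polynomial fit R^2: {poly.score(X, y):.3f}")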

Just like underfitting the model on training data, overfitting is also a big issue. Let’s deep dive into it.

Overfitting the training dataset

The overfitting problem is the opposite of the underfitting problem. Overfitting is the scenario when the ML model learns too much unnecessary information from the training data and fails to generalize on a test or unseen dataset. In this case, the model performs extremely well on the training dataset, but the metric value (such as accuracy) is very low on the test set. Overfitting usually occurs when we implement a very complex algorithm on simple datasets.

Some common methods to address the problem of overfitting are as follows:

Increase training data size – ML models often overfit on small datasets
Use simpler models – When problems are simple or linear in nature, choose simple ML algorithms
Regularization – There are multiple regularization methods that prevent complex models from overfitting on the training dataset
Reduce model complexity – Use a smaller number of trainable parameters, train for a smaller number of epochs, and reduce the depth of tree-based models
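For instance, the regularization remedy can be sketched with scikit-learn on a small synthetic dataset, where a large gap between training and test scores is the telltale sign of overfitting; the dataset sizes and alpha value here are arbitrary illustrations:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Small, noisy dataset with many features: easy to overfit.
X, y = make_regression(n_samples=60, n_features=40, noise=25.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for name, model in [("unregularized", LinearRegression()),
                    ("ridge (L2 regularization)", Ridge(alpha=10.0))]:
    model.fit(X_train, y_train)
    # A large train/test gap is the signature of overfitting.
    print(f"{name}: train R^2={model.score(X_train, y_train):.2f}, "
          f"test R^2={model.score(X_test, y_test):.2f}")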

Overfitting and underfitting are common challenges and should be addressed carefully, as discussed earlier. Now, let’s discuss some infrastructure-related challenges.

Infrastructure requirements

ML is expensive. A typical ML project often involves crunching large datasets with millions or billions of samples. Slicing and dicing such datasets requires a lot of memory and high-end multi-core processors. Additionally, once the development of the project is complete, dedicated servers are required to deploy the models and match the scale of consumers. Thus, business organizations willing to practice ML need some dedicated infrastructure to implement and consume ML efficiently. This requirement increases further when working with large deep learning models such as transformers, large language models (LLMs), and so on. Such models usually require a set of accelerators, such as graphics processing units (GPUs) or tensor processing units (TPUs), for training, fine-tuning, and deployment.

As we have discussed, infrastructure is critical for practicing ML. Companies that lack such infrastructure can consult with other firms or adopt cloud-based offerings to start developing ML-based applications.

Now that we understand the common challenges faced during the development of an ML project, we should be able to make more informed decisions about them. Next, let’s learn about some of the limitations of ML.

Limitations of ML

ML is very powerful, but it’s not the answer to every single problem. There are problems that ML is just not suitable for, and there are some cases where ML can’t be applied due to technical or business constraints. As an ML practitioner, it is important to develop the ability to find relevant business problems where ML can provide significant value instead of applying it blindly everywhere. Additionally, there are algorithm-specific limitations that can render an ML solution not applicable in some business applications. In this section, we will learn about some common limitations of ML that should be kept in mind while finding relevant use cases.

Keep in mind that the limitations we are discussing in this section are very general. In real-world applications, there are more limitations possible due to the nature of the problem we are solving. Some common limitations that we will discuss in detail are as follows:

Data-related concerns
Deterministic nature of problems
Lack of interpretability and reproducibility
Concerns related to cost and customizations
Ethical concerns and bias

Let’s now deep dive into each of these common limitations.

Data-related concerns

The quality of an ML model highly depends upon the quality of the training data it is provided with. Data present in the real world is often noisy, incomplete, unlabeled, and sometimes unusable. Moreover, most supervised learning algorithms require large amounts of properly labeled training data to produce good results. The training data requirements of some algorithms (e.g., deep learning) are so high that even manually labeling data is not an option. And even if we manage to label the data manually, it is often error-prone due to human bias.

Another major issue is incompleteness or missing data. For example, consider the problem of automatic speech recognition. In this case, model results are highly biased toward the accent present in the training dataset. A model that is trained on the American accent doesn’t produce good results on other accented speech. Since accents change significantly as we travel to different parts of the world, it is hard to gather and label relevant amounts of training data for every possible accent. For this reason, developing a single speech recognition model that works for everyone is not yet feasible, and thus, the tech giants providing speech recognition solutions often develop accent-specific models. Developing a new model for each new accent is not very scalable.

Deterministic nature of problems

ML has achieved great success in solving some highly complex problems, such as numerical weather prediction. One problem with most of the current ML algorithms is that they are stochastic in nature and thus cannot be trusted blindly when the problem is deterministic. Considering the case of numerical weather prediction, today we have ML models that can predict rain, wind speed, air pressure, and so on, with acceptable accuracy, but they completely fail to understand the physics behind real weather systems. For example, an ML model might provide negative value estimations of parameters such as density.

However, it is very likely that these kinds of limitations can be overcome in the near future. Future research in the field of ML might discover new algorithms that are smart enough to understand the physics of our world. Such models will open infinite possibilities in the future.

Lack of interpretability and reproducibility

One major issue with many ML algorithms (and often with neural networks) is the lack of interpretability of results. Many business applications, such as fraud detection and disease prediction, require a justification for model results. If an ML model classifies a financial transaction as fraud, it should also provide solid evidence for the decision; otherwise, this output may not be useful for the business. Deep learning or neural network models often lack interpretability, and the explainability of such models is an active area of research. Multiple methods have been developed for model interpretability or explainability purposes. Though these methods can provide some insights into the results, they are still far from the actual requirements.

Reproducibility is another complex and growing issue with ML solutions. The latest research papers may show great improvements in results on a fixed set of benchmark datasets, but the same methods may not work in real-world scenarios. Furthermore, ML models are often unstable, meaning they produce different results when trained on different partitions of the dataset. This is challenging because a model developed for one business segment may be completely useless for another, even when the underlying problem statement is similar. This makes such models less reusable.
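
One inexpensive way to surface such instability is to train and score the same model on several different random partitions of the data and inspect the spread of the results. Below is a minimal sketch using scikit-learn and synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Train and score the same model on five different partitions of the data
scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)

# A large standard deviation hints that the model is sensitive to the split
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```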

Concerns related to cost and customizations

Developing and maintaining ML solutions is often expensive, more so in the case of deep learning algorithms. Development costs include employing highly skilled developers as well as the infrastructure needed for data analytics and ML experimentation. Deep learning models usually require powerful compute resources such as GPUs and TPUs for training and experimentation, and running a hyperparameter tuning job with such models is even more costly and time-consuming. Once a model is ready for production, it requires dedicated resources for deployment, monitoring, and maintenance. This cost grows further as deployments scale to serve large numbers of customers, and more still when strict low-latency requirements apply. It is therefore very important to understand the value a solution will bring, and whether it is worth the investment, before jumping into the development phase.

Another concern with ML solutions is their limited customizability. ML models are often very difficult to customize, meaning it can be hard to change their parameters or adapt them to new datasets. Pre-built, general-purpose ML solutions often do not work well on specific business use cases, leaving organizations with two choices: develop the solution from scratch or customize a pre-built, general-purpose solution. Although customizing a pre-built model may seem like the better choice, even that is not easy with ML models; it requires skilled data engineers and ML specialists with a deep understanding of technical concepts such as deep learning, predictive modeling, and transfer learning.
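
To give a flavor of what such customization looks like in practice, the following is a rough sketch of transfer learning in Keras, where a general-purpose pre-trained image model is adapted to a new task; the input shape and the number of target classes are assumptions made for illustration.

```python
import tensorflow as tf

NUM_CLASSES = 3  # hypothetical number of classes in the new business dataset

# Reuse a general-purpose model pre-trained on ImageNet as a feature extractor
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # keep the pre-trained weights frozen

# Attach a small task-specific head that will be trained from scratch
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) would then be called on the custom, labeled dataset
```

Even this "easy" path presupposes labeled, task-specific data and an understanding of which layers to freeze, which is exactly the expertise gap described above.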

Ethical concerns and bias

ML is quite powerful and is adopted today by many organizations to guide their business strategy and decisions. As noted earlier, some ML algorithms are black boxes: they may not provide the reasons behind their decisions. ML systems are trained on a finite set of data and may not generalize to every real-world scenario; if an unseen scenario is encountered, we cannot tell in advance what decision the system will make. Such black-box decisions raise ethical concerns. For example, if a self-driving car is involved in a road accident, whom should we blame: the driver, the team that developed the AI system, or the car manufacturer? It is clear that current advancements in ML and AI are not suitable for ethical or moral decision-making, and we still need frameworks for resolving the ethical concerns that ML and AI systems raise.

The accuracy and speed of ML solutions are often commendable, but these solutions cannot always be trusted to be fair and unbiased. Consider AI software that recognizes faces or objects in an image: such a system may fail on photos of people whose skin tones are poorly captured by the camera or underrepresented in its data, or it may classify a breed of dog that happens to resemble a cat as a cat. This kind of bias often originates in the biased training and test datasets used to develop AI systems. Real-world data is usually collected and labeled by humans, so the biases that exist in humans carry over into AI systems. Avoiding bias completely is impossible, since we are all human and therefore biased, but there are measures that can reduce it: establishing a culture of ethics and building teams from diverse backgrounds are good first steps.
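
On the measurement side, a simple first step is to evaluate model performance per subgroup rather than only in aggregate, since overall metrics can hide large gaps between groups. Here is a minimal, hypothetical sketch with pandas; the subgroup labels and outcomes are made up for illustration.

```python
import pandas as pd

# Hypothetical evaluation log: one row per prediction, with a subgroup label
results = pd.DataFrame({
    "subgroup": ["A", "A", "A", "B", "B", "B"],
    "correct":  [1,   1,   0,   1,   0,   0],
})

# Aggregate accuracy hides subgroup gaps; per-group accuracy exposes them
print("overall accuracy:", results["correct"].mean())
print(results.groupby("subgroup")["correct"].mean())
```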

Summary

ML is an integral part of business strategy and decision-making for many organizations today, so it is very important to do it right. In this chapter, we learned about the general steps involved in a typical ML project development life cycle and their significance, and about the importance of choosing the right business problem in order to deliver the maximum impact with ML. We also highlighted some common challenges that ML practitioners face during project development. Finally, we discussed the common limitations of ML in real-world scenarios, which will help us choose the right business problem and a fitting ML algorithm to solve it. We should now be confident in identifying the ML-related challenges in a business process and making informed decisions about them.

Developing a high-performing ML model is not enough on its own; the real value comes when the model is deployed and used in real-world applications. Taking an ML model to production is not trivial and must be done the right way. The next chapter covers the guidelines and best practices to follow when operationalizing an ML model, and understanding it thoroughly is essential before moving on to the later chapters of this book.

2

What Is MLOps, and Why Is It So Important for Every ML Team?

Machine learning operations (MLOps) is a pivotal practice for modern ML teams, blending technological and operational best practices. At its heart, MLOps seeks to address the challenges of productionizing ML models and to foster better collaboration between data scientists and IT teams. With the rapid advancements in technology and the increasing reliance on ML solutions, MLOps is becoming the backbone of a sustainable and scalable ML strategy. This chapter will delve deep into the essence of MLOps, detailing its significance, its various maturity levels, and the role of Google's Vertex AI in facilitating MLOps. By the end of this chapter, you will have a robust understanding of MLOps principles and of the tools in Vertex AI that can be used to implement them.

In this chapter, we will cover the following topics:

Why is MLOps important?
MLOps maturity levels
How can Vertex AI help with implementing MLOps?

Let’s embark on this enlightening journey to master MLOps on Vertex AI.

Why is MLOps important?

As the development and integration of ML models become more and more common in today's world, the need for a robust operational framework has become more critical than ever. MLOps aims to meet this need by streamlining the entire process of developing, deploying, and monitoring ML models. In this section, we will discuss why MLOps is important by examining the following aspects:

Standardizing and automating ML workflows

MLOps aims to standardize and automate various stages of the ML life cycle, from data