


Getting Started with Amazon SageMaker Studio

Learn to build end-to-end machine learning projects in the SageMaker machine learning IDE

Michael Hsieh

BIRMINGHAM—MUMBAI

Getting Started with Amazon SageMaker Studio

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Dhruv Jagdish Kataria

Senior Editor: David Sugarman

Content Development Editor: Priyanka Soam

Technical Editor: Devanshi Ayare

Copy Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Pratik Shirodkar

Production Designer: Roshan Kawale

Marketing Coordinators: Abeer Riyaz Dawe, Shifa Ansari

First published: March 2022

Production reference: 1240222

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80107-015-7

www.packt.com

This is for my wife, and my parents. Thank you for unconditionally supporting me and being there for me.

Contributors

About the author

Michael Hsieh is a senior AI/machine learning (ML) solutions architect at Amazon Web Services. He creates and evangelizes ML solutions centered on Amazon SageMaker, and works with enterprise customers to advance their ML journeys.

Prior to working at AWS, Michael was an advanced analytics consultant creating ML solutions and enterprise-level ML strategies at Slalom Consulting in Philadelphia, PA. Before consulting, he was a data scientist at the University of Pennsylvania Health System, focusing on personalized medicine and ML research.

Michael has two master's degrees, one in applied physics and one in robotics.

Originally from Taipei, Taiwan, Michael currently lives in Sammamish, WA, but still roots for the Philadelphia Eagles.

About the reviewers

Brent Rabowsky is a principal data science consultant at AWS with over 10 years' experience in the field of ML. At AWS, he leverages his expertise to help AWS customers with their data science projects. Prior to AWS, he joined Amazon.com on an ML and algorithms team, and he has previously worked on conversational AI agents for a government contractor and a research institute. He also served as a technical reviewer of the books Data Science on AWS by Chris Fregly and Antje Barth, published by O'Reilly, and SageMaker Best Practices and Learn Amazon SageMaker, published by Packt.

Ankit Sirmorya is an ML lead/senior ML engineer at Amazon who has led several ML initiatives across the Amazon ecosystem. Ankit works on applying ML to solve ambiguous business problems and improve the customer experience. For instance, he created a platform for experimenting with different hypotheses on Amazon product pages using reinforcement learning techniques. Currently, he works in the Alexa Shopping organization, where he is developing ML-based solutions that send personalized reorder hints to customers to improve their experience.

Table of Contents

Contributors

About the author

About the reviewers

Preface

Who this book is for

What this book covers

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Share Your Thoughts

Part 1 – Introduction to Machine Learning on Amazon SageMaker Studio

Chapter 1: Machine Learning and Its Life Cycle in the Cloud

Technical requirements

Understanding ML and its life cycle

An ML life cycle

Building ML in the cloud

Exploring AWS essentials for ML

Compute

Storage

Database and analytics

Security

Setting up an AWS environment

Summary

Chapter 2: Introducing Amazon SageMaker Studio

Technical requirements

Introducing SageMaker Studio and its components

Prepare

Build

Training and tuning

Deploy

MLOps

Setting up SageMaker Studio

Setting up a domain

Walking through the SageMaker Studio UI

The main work area

The sidebar

"Hello world!" in SageMaker Studio

Demystifying SageMaker Studio notebooks, instances, and kernels

Using the SageMaker Python SDK

Summary

Part 2 – End-to-End Machine Learning Life Cycle with SageMaker Studio

Chapter 3: Data Preparation with SageMaker Data Wrangler

Technical requirements

Getting started with SageMaker Data Wrangler for customer churn prediction

Preparing the use case

Launching SageMaker Data Wrangler

Importing data from sources

Importing from S3

Importing from Athena

Editing the data type

Joining tables

Exploring data with visualization

Understanding the frequency distribution with a histogram

Scatter plots

Previewing ML model performance with Quick Model

Revealing target leakage

Creating custom visualizations

Applying transformation

Exploring performance while wrangling

Exporting data for ML training

Summary

Chapter 4: Building a Feature Repository with SageMaker Feature Store

Technical requirements

Understanding the concept of a feature store

Understanding an online store

Understanding an offline store

Getting started with SageMaker Feature Store

Creating a feature group

Ingesting data to SageMaker Feature Store

Ingesting from SageMaker Data Wrangler

Accessing features from SageMaker Feature Store

Accessing a feature group in the Studio UI

Accessing an offline store – building a dataset for analysis and training

Accessing an online store – low-latency feature retrieval

Summary

Chapter 5: Building and Training ML Models with SageMaker Studio IDE

Technical requirements

Training models with SageMaker's built-in algorithms

Training an NLP model easily

Managing training jobs with SageMaker Experiments

Training with code written in popular frameworks

TensorFlow

PyTorch

Hugging Face

MXNet

Scikit-learn

Developing and collaborating using SageMaker Notebook

Summary

Chapter 6: Detecting ML Bias and Explaining Models with SageMaker Clarify

Technical requirements

Understanding bias, fairness in ML, and ML explainability

Detecting bias in ML

Detecting pretraining bias

Mitigating bias and training a model

Detecting post-training bias

Explaining ML models using SHAP values

Summary

Chapter 7: Hosting ML Models in the Cloud: Best Practices

Technical requirements

Deploying models in the cloud after training

Inferencing in batches with batch transform

Hosting real-time endpoints

Optimizing your model deployment

Hosting multi-model endpoints to save costs

Optimizing instance type and autoscaling with load testing

Summary

Chapter 8: Jumpstarting ML with SageMaker JumpStart and Autopilot

Technical requirements

Launching a SageMaker JumpStart solution

Solution catalog for industries

Deploying the Product Defect Detection solution

SageMaker JumpStart model zoo

Model collection

Deploying a model

Fine-tuning a model

Creating a high-quality model with SageMaker Autopilot

Wine quality prediction

Setting up an Autopilot job

Understanding an Autopilot job

Evaluating Autopilot models

Summary

Further reading

Part 3 – The Production and Operation of Machine Learning with SageMaker Studio

Chapter 9: Training ML Models at Scale in SageMaker Studio

Technical requirements

Performing distributed training in SageMaker Studio

Understanding the concept of distributed training

The data parallel library with TensorFlow

Model parallelism with PyTorch

Monitoring model training and compute resources with SageMaker Debugger

Managing long-running jobs with checkpointing and spot training

Summary

Chapter 10: Monitoring ML Models in Production with SageMaker Model Monitor

Technical requirements

Understanding drift in ML

Monitoring data and performance drift in SageMaker Studio

Training and hosting a model

Creating inference traffic and ground truth

Creating a data quality monitor

Creating a model quality monitor

Reviewing model monitoring results in SageMaker Studio

Summary

Chapter 11: Operationalize ML Projects with SageMaker Projects, Pipelines, and Model Registry

Technical requirements

Understanding ML operations and CI/CD

Creating a SageMaker project

Orchestrating an ML pipeline with SageMaker Pipelines

Running CI/CD in SageMaker Studio

Summary

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Preface

Amazon SageMaker Studio is the first integrated development environment (IDE) for machine learning (ML) and is designed to integrate ML workflows: data preparation, feature engineering, statistical bias detection, Automated Machine Learning (AutoML), training, hosting, ML explainability, monitoring, and MLOps in one environment.

In this book, you'll start by exploring the features available in Amazon SageMaker Studio to analyze data, develop ML models, and productionize models to meet your goals. As you progress, you will learn how these features work together to address common challenges when building ML models in production. After that, you'll understand how to effectively scale and operationalize the ML life cycle using SageMaker Studio.

By the end of this book, you'll have learned ML best practices regarding Amazon SageMaker Studio, as well as being able to improve productivity in the ML development life cycle and build and deploy models easily for your ML use cases.

Who this book is for

This book is for data scientists and ML engineers who are looking to become well versed in Amazon SageMaker Studio and gain hands-on ML experience to handle every step in the ML life cycle, including data preparation, model training, and model hosting. Although basic knowledge of ML and data science is necessary, no previous knowledge of SageMaker Studio or cloud experience is required.

What this book covers

Chapter 1, Machine Learning and Its Life Cycle in the Cloud, describes how cloud technology has democratized the field of ML and how ML is being deployed in the cloud. It introduces the fundamentals of the AWS services that are used in the book.

Chapter 2, Introducing Amazon SageMaker Studio, covers an overview of Amazon SageMaker Studio, including its features and functionalities and user interface components. You will set up a SageMaker Studio domain and get familiar with basic operations.

Chapter 3, Data Preparation with SageMaker Data Wrangler, looks at how, with SageMaker Data Wrangler, you can perform exploratory data analysis and data preprocessing for ML modeling with a point-and-click experience (that is, without any coding). You will be able to quickly iterate through data transformation and modeling to see whether your transform recipe helps increase model performance, learn whether there is implicit bias in the data against sensitive groups, and keep a clear record of what transformations have been applied to produce the processed data.

Chapter 4, Building a Feature Repository with SageMaker Feature Store, looks at SageMaker Feature Store, which allows storing features for ML training and inferencing. Feature Store serves as a central repository for teams collaborating on ML use cases to avoid duplicating and confusing efforts in creating features. SageMaker Feature Store makes storing and accessing training and inferencing data easier and faster.

Chapter 5, Building and Training ML Models with SageMaker Studio IDE, looks at how building and training an ML model can be made easy. No more frustration in provisioning and managing compute infrastructure. SageMaker Studio is an IDE designed for ML developers. In this chapter, you will learn how to use the SageMaker Studio IDE, notebooks, and SageMaker-managed training infrastructure.

Chapter 6, Detecting ML Bias and Explaining Models with SageMaker Clarify, covers the ability to detect and remediate bias in data and models during the ML life cycle, which is critical in creating an ML model with social fairness. You will learn how to apply SageMaker Clarify to detect bias in your data and how to read the metrics in SageMaker Clarify.

Chapter 7, Hosting ML Models in the Cloud: Best Practices, looks at how, after successfully training a model, you can make it available for inference: SageMaker has several options depending on your use case. You will learn how to host models for batch inference, perform online real-time inference, and use multi-model endpoints for cost savings, as well as a resource optimization strategy for your inference needs.

Chapter 8, Jumpstarting ML with SageMaker JumpStart and Autopilot, looks at SageMaker JumpStart, which offers complete solutions for select use cases as a starter kit to the world of ML with Amazon SageMaker, without any code development. SageMaker JumpStart also catalogs popular pretrained computer vision (CV) and natural language processing (NLP) models for you to easily deploy or fine-tune to your dataset. SageMaker Autopilot is an AutoML solution that explores your data, engineers features on your behalf, and trains an optimal model from various algorithms and hyperparameters. You don't have to write any code because Autopilot does it for you, returning notebooks that show how it was done.

Chapter 9, Training ML Models at Scale in SageMaker Studio, discusses how a typical ML life cycle starts with prototyping and then transitions to production scale, where the data is going to be much larger, models are much more complicated, and the number of experiments grows exponentially. SageMaker Studio makes this transition easier than before. You will learn how to run distributed training, how to monitor the compute resources and modeling status of a training job, and how to manage training experiments with SageMaker Studio.

Chapter 10, Monitoring ML Models in Production with SageMaker Model Monitor, discusses how having a model put into production for inferencing isn't the end of the life cycle. It is just the beginning of an important topic: how do we make sure the model is performing as it is designed and as expected in real life? Monitoring how the model performs in production, especially on data that the model has never seen before, is made easy with SageMaker Studio. You will learn how to set up model monitoring for models deployed in SageMaker, detect data drift and performance drift, and visualize feature importance and bias in the inferred data in real time.

Chapter 11, Operationalize ML Projects with SageMaker Projects, Pipelines, and Model Registry, looks at how data scientists used to spend too much time and effort maintaining and manually managing ML pipelines, a process that starts with data processing, training, and evaluation and ends with model hosting and ongoing maintenance. SageMaker Studio provides features that aim to streamline this operation with continuous integration and continuous delivery (CI/CD) best practices. You will learn how to implement SageMaker Projects, Pipelines, and the model registry, which will help operationalize the ML life cycle with CI/CD.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Getting-Started-with-Amazon-SageMaker-Studio. If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at

https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801070157_ColorImages.pdf.

Conventions used

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "The two types of metadata that need to be cataloged include Functional and Technical."

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Share Your Thoughts

Once you've read Getting Started with Amazon SageMaker Studio, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.

Part 1 – Introduction to Machine Learning on Amazon SageMaker Studio

In this section, we will cover an introduction to machine learning (ML), the ML life cycle in the cloud, and Amazon SageMaker Studio. This section also includes a level set on the domain terminology in ML with example use cases.

This section comprises the following chapters:

Chapter 1, Machine Learning and Its Life Cycle in the Cloud

Chapter 2, Introducing Amazon SageMaker Studio

Chapter 1: Machine Learning and Its Life Cycle in the Cloud

Machine Learning (ML) is a technique that has been around for decades, yet it is hard to believe how ubiquitous ML now is in our daily lives. The road to the mainstream was a rocky one for the field of ML until the recent major leaps in computer technology. Today's computer hardware is faster, smaller, and smarter. Internet connections are faster and more reliable. Storage is cheaper and physically smaller. It is now rather easy to collect, store, and process massive amounts of data. We can create sizeable datasets that were not possible before, train ML models using compute resources that were not previously available, and make use of ML models in every corner of our lives.

For example, media streaming companies can now build ML recommendation engines at a global scale using their title collections and customer activity data on their websites to provide the most relevant content in real time in order to optimize the customer experience. The size of the data for both the titles and customer preferences and activity is on a scale that wasn't possible 20 years ago, considering how many of us are currently using a streaming service.

Training an ML model at this scale, using ML algorithms that are becoming increasingly more complex, requires a robust and scalable solution. After a model is trained, companies are able to serve the model at a global scale where millions of users visit the application from web and mobile devices at the same time.

Companies are also creating more and more models for each segment of customers or even one model for one customer. There is another dimension to this – companies are rolling out new models at a pace that would not have been possible to manage without a pipeline that trains, evaluates, tests, and deploys a new model automatically. Cloud computing has provided a perfect foundation for the streaming service provider to perform these ML activities to increase customer satisfaction.

If ML is something that interests you, or if you are already working in the field of ML in any capacity, this book is the right place for you. Throughout the book, you will be learning all things ML along with me: how to build, train, host, and manage ML models in the cloud, with actual use cases and datasets. I assume you come to this book with a good understanding of ML and cloud computing. The purpose of this first chapter is to set the level of the concepts and terminology of the two technologies, to define the ML life cycle that is going to be the core of this book, and to provide a crash course on Amazon Web Services and its core services, which will be mentioned throughout the book.

In this chapter, we will cover the following:

Understanding ML and its life cycle

Building ML in the cloud

Exploring AWS essentials for ML

Setting up an AWS environment

Technical requirements

For this chapter, you will need a computer with an internet connection and a browser to perform the basic AWS account setup in order to run Amazon SageMaker setup and code samples in the following chapters.

Understanding ML and its life cycle

At its core, ML is a process that uses computer algorithms to automatically discover the underlying patterns and trends in a dataset (a collection of observations with features, also known as variables), make a prediction, obtain an error measure against a ground truth (if provided), and "learn" from that error through an optimization process in order to make a better prediction next time. At the end of the process, an ML model is fitted, or trained, and can apply the knowledge it has learned to make a decision based on the features of a new observation. The first part, generating a model, is called training, while the second part is called prediction or inference.

There are three basic types of ML algorithms based on the way the training process takes place – supervised learning, unsupervised learning, and reinforcement learning. A supervised learning algorithm is given a set of observations with a ground truth from the past. A ground truth is a key ingredient in training a supervised learning algorithm, as it drives how the model learns and makes future predictions – hence the "supervised" in the name, as the learning is supervised by the ground truth. Unsupervised learning, on the other hand, does not require a ground truth for the observations in order to learn how to make predictions; it finds patterns and relationships solely based on the features of the observations. However, a ground truth, if it exists, would still help us validate and understand the accuracy of an unsupervised model. Reinforcement learning, often abbreviated as RL, has quite a different learning paradigm compared to the previous two. RL consists of an agent interacting with an environment through a set of actions, with corresponding rewards and states. The learning is guided not by a ground truth but by optimizing cumulative rewards through actions. The trained model in the end is able to act autonomously in an environment to achieve the best rewards.
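The difference between the first two paradigms is easiest to see in code. The following minimal scikit-learn sketch (the dataset and model choices here are illustrative assumptions, not code from this book) fits a supervised classifier against a ground truth and an unsupervised clustering model without one:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    X, y = load_iris(return_X_y=True)  # features X and ground truth y

    # Supervised learning: the ground truth y guides the optimization
    classifier = LogisticRegression(max_iter=1000).fit(X, y)
    print(classifier.predict(X[:3]))   # predictions for observations

    # Unsupervised learning: patterns are found from the features alone
    clustering = KMeans(n_clusters=3, n_init=10).fit(X)
    print(clustering.labels_[:3])      # cluster assignments, no ground truth used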

An ML life cycle

Now that we have a basic understanding of what ML is, we can go broader and see what a typical ML life cycle looks like, as illustrated in the following figure:

Figure 1.1 – The ML life cycle

Problem framing

The first step in a successful ML life cycle is framing the business problem into an ML problem. Business problems come in all shapes and forms. For example, "How do we increase sales of a newly released product?" and "How do we improve the Quality Assessment (QA) throughput on the assembly line?" Business problems such as these, usually qualitative, are not something ML can be directly applied to. But looking at the business problem statement, we should think about how it can be translated into an ML problem. We should ask questions like the following:

"What are the key factors to the success of product sales?""Who are the people that are most likely to purchase the product?""What is the bottleneck in throughput in the assembly line?""How do we know whether an item is defective? What differentiates a defective one from a normal one?"

By asking questions like these, we start to dig into the realm of pattern recognition, the process of recognizing patterns in the data at hand. With the right questions, ones that can be formulated as pattern recognition, we are a step closer to framing an ML problem. We also need to understand what the key metric is to gauge the success of an approach, regardless of whether we use ML or other approaches. It is quite straightforward to measure, for example, daily product sales. We can also improve sales by targeting advertisements to the people who are most likely to convert. Then, we get questions like the following:

"How do we measure the conversion?""What are the common characteristics of the consumers who have bought this product?"

More importantly, we need to find out whether there is even a target metric for us to predict! If there are targets, we can frame the problem as an ML problem, such as predicting future sales (supervised learning and regression), predicting whether a customer is going to buy a certain product or not (supervised learning and classification), or identifying defective items (supervised learning and classification). Questions that do not have a clear target to predict would fall into an unsupervised learning task in order to apply the pattern discovered in the data to future data points. Use cases where the target is dynamic and of high uncertainty, such as autonomous driving, robotic control, and stock price prediction, are good candidates for RL.

Data exploration and engineering

Sourcing data is the first step of a successful ML modeling journey. Once we have clearly defined both our business problem and ML problem, with a basic understanding of the scope of the problem – meaning what the metrics and factors are – we can start gathering the data needed for ML. Data scientists explore the data sources to find relevant information that could support the modeling. Sometimes, the data being captured and collected within the organization is easily accessible; sometimes, the data is available outside your organization and requires you to reach out and ask for data-sharing permission.

Sometimes, datasets can be sourced from the public internet and institutions that focus on creating and sharing standardized datasets for ML purposes, which is especially true for computer vision and natural language understanding use cases. Furthermore, data can arrive through streaming from websites and applications. Connections to a database, data lake, data warehouse, and streaming source need to be set up. Data needs to be integrated into the ML platform for processing and engineering before an ML model can be trained.

Managing data irregularity and heterogeneity is the second step in the ML life cycle. Data needs to be processed to remove irregularities such as missing values, incorrect data entries, and outliers, because many ML algorithms have statistical assumptions that these irregularities would violate, rendering the modeling ineffective (if not invalid). For example, the linear regression model assumes that the error, or residual, is normally distributed; it is therefore important to check whether there are outliers that could contribute to such a violation. If so, we must perform the necessary preprocessing tasks to remedy it. Common preprocessing approaches include, but are not limited to, the removal of invalid entries, the removal of extreme data points (also known as outliers), and filling in missing values. Data also needs to be processed to remove heterogeneity across features and normalize them to the same scale, as some ML algorithms are sensitive to the scale of the features and would develop a bias towards features with a larger scale. Common approaches include min-max scaling and z-standardization (z-score).
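A minimal sketch of these preprocessing steps in pandas and scikit-learn might look like the following (the tiny dataset here is a hypothetical illustration, not one of this book's datasets):

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # A hypothetical numeric feature with a missing value and an extreme entry
    df = pd.DataFrame({"income": [42000.0, 55000.0, None, 61000.0, 1000000.0]})

    # Fill in missing values with the median
    df["income"] = df["income"].fillna(df["income"].median())

    # Remove extreme data points (outliers) beyond 3 standard deviations
    z = (df["income"] - df["income"].mean()) / df["income"].std()
    df = df[z.abs() < 3].copy()

    # Min-max scaling to [0, 1] and z-standardization (z-score)
    df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
    df["income_zscore"] = StandardScaler().fit_transform(df[["income"]]).ravel()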

Visualization and data analysis is the third step in the ML life cycle. Data visualization allows data scientists to easily understand visually how data is distributed and what the trends are in the data. Exploratory Data Analysis (EDA) allows data scientists to understand the statistical behavior of the data at hand, figure out the information that has predictive power to be included in the modeling process, and eliminate any redundancy in the data, such as duplicated entries, multicollinearity, and unimportant features.
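A few lines of pandas are often enough to start this exploration. The following sketch assumes a hypothetical customers.csv file; the calls shown are standard pandas EDA steps rather than anything specific to this chapter:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("customers.csv")      # hypothetical dataset

    print(df.describe())                   # summary statistics per feature
    print(df.isna().mean())                # fraction of missing values per column
    print(df.corr(numeric_only=True))      # pairwise correlations reveal multicollinearity

    df = df.drop_duplicates()              # remove duplicated entries
    df.hist(figsize=(10, 8))               # distribution of each numeric feature
    plt.show()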

Feature engineering is the fourth step in the ML life cycle. Even with the various sources from which we collect data, ML models often benefit from engineered features that are calculated from existing features. For example, Body Mass Index (BMI) is a well-known engineered feature, calculated using the height and weight of a person, and is an established feature (or risk factor, in clinical terms) that predicts certain diseases better than height or weight alone. Feature engineering often requires extensive domain experience and experimentation to find out which recipes add predictive power to the modeling.
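To make the BMI example concrete, here is a one-line engineered feature in pandas (the column names and values are hypothetical):

    import pandas as pd

    # Hypothetical patient data: height in meters, weight in kilograms
    df = pd.DataFrame({"height_m": [1.70, 1.62, 1.85],
                       "weight_kg": [68.0, 75.5, 90.2]})

    # BMI is engineered from two existing features: weight / height^2
    df["bmi"] = df["weight_kg"] / df["height_m"] ** 2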

Modeling and evaluation

For a data scientist, ML modeling is the most exciting part of the life cycle (I think so; I hope you agree with me). You've formulated the problem in the language of ML. You've collected and processed the data and looked at the underlying trends, which give you enough hints to build an ML model. Now, it's time to build your first model for the dataset, but wait – what model, what algorithm, and what metric do we use to evaluate the performance? Well, that's the core of modeling and evaluation.

The goal is to explore and find a satisfactory ML model, with an objective metric, from all possible algorithms, feature sets, and hyperparameters. This is definitely not an easy task and requires extensive experience. Depending on the problem type (classification, regression, or reinforcement learning), data type (tabular, text, or image data), data distribution (is there class imbalance or are there outliers?), and domain (medical, financial, or industrial), you can narrow the choice of algorithms down to a handful. Each of these algorithms has hyperparameters that control its behavior and performance on the provided data. What is also needed is the definition of an objective metric, and a threshold that meets the business requirement, using the metric to guide you toward the best model. You may blindly choose one or two algorithm-hyperparameter combinations for your project, but you are unlikely to reach the optimal solution in just one or two trials. It is rather typical for a data scientist to try out hundreds if not thousands of combinations. How is that possible?

This is why establishing a streamlined model training and evaluation process is such a critical step in the process. Once the model training and evaluation is automated, you can simply launch the process that helps you automatically iterate through the experimentations among algorithms and hyperparameters, and compare the metric performance to find out the optimal solution. This process is called hyperparameter tuning or hyperparameter optimization. If multiple algorithms are the subject of tuning, it can also be called multi-algorithm hyperparameter tuning.
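At a small scale, the same idea can be sketched with scikit-learn's grid search; SageMaker also offers this as a managed capability (automatic model tuning). The dataset and hyperparameter grid below are illustrative assumptions:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_breast_cancer(return_X_y=True)

    # Try every algorithm-hyperparameter combination in the grid, score each
    # with cross-validated AUC (the objective metric), and keep the best one
    param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, scoring="roc_auc", cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)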

Production – predicting, monitoring, and retraining

An ML model needs to be put to use in order to have an impact on the business. However, the production process is different from that of a typical software application. Unlike other software applications, where business logic can be pre-written and tested exhaustively with edge cases before production, there is no guarantee that once a model is trained and evaluated, it will perform at the same level in production as in the testing environment. This is because ML models use probabilistic, statistical, and fuzzy logic to infer an outcome for each incoming data point, and the testing, that is, the model evaluation, is typically done without true prior knowledge of production data. The best a data scientist can do prior to production is to create training data from a sample that closely represents real-world data and evaluate the model with an out-of-sample strategy in order to get an unbiased idea of how the model would perform on unseen data. While in production, the incoming data is completely unseen by the model; how to evaluate live model performance, and how to take action on that evaluation, are critical topics for productionizing ML models.
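An out-of-sample evaluation can be as simple as holding out a test split that the model never sees during training. Here is a minimal sketch, with a synthetic dataset standing in for real training data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)

    # Hold out 20% of the data; the model never sees it during training
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # An unbiased estimate of how the model may perform on unseen data
    print(accuracy_score(y_test, model.predict(X_test)))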

Model performance can be monitored with two approaches. The more straightforward one is to capture the ground truth for the unseen data and compare the predictions against it. The second approach is to use drift in the data as a proxy to determine whether the model is going to behave in an expected way. In some use cases, the first approach is not feasible, as the true outcome (the ground truth) may lag behind the event for a long time. For example, in a disease prediction use case, where the purpose of ML modeling is to help a healthcare provider find a likely outcome some time in the future, say three months, based on current health metrics, a true ground truth cannot be gathered until three months have passed, or even later, depending on the onset of the disease. It is therefore impractical to wait for the ground truth and only then fix the model, should it prove ineffective.

The second approach rests on the premise that an ML model learns statistically and probabilistically from the training data and will behave differently when given a new dataset with different statistical characteristics. A model is likely to return gibberish when the data does not come from the same statistical distribution as the training data. Therefore, detecting drift in the data gives a more real-time estimate of how the model is going to perform. Take the disease prediction use case once again as an example: when data about a group of patients in their 30s is sent for prediction to an ML model that was trained on data with an average age of 65, the model is likely to be clueless about these new patients. So we need to take action.
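A simple statistical check can make this concrete. The sketch below, a generic two-sample test rather than SageMaker Model Monitor's exact method, compares the age distribution the model was trained on with the ages arriving in production:

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    training_age = rng.normal(65, 10, 1000)    # ages seen during training
    production_age = rng.normal(33, 5, 1000)   # much younger production cohort

    # Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
    # production feature has drifted from the training distribution
    statistic, p_value = ks_2samp(training_age, production_age)
    if p_value < 0.01:
        print("Data drift detected - consider retraining")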

Retraining and updating the model ensures that it stays performant on future data. Being able to capture the ground truth and detect data drift helps create a retraining strategy at the right time. The drifted data and the ground truth are great inputs to the retraining process, as they help the model cover a wider statistical distribution.

Now that we have a clear idea of the basics of the uses and life cycle of ML development, let's take the next step and investigate how it can work with the cloud.

Building ML in the cloud

Cloud computing is a technology that delivers on-demand IT resources that can grow and shrink at any time, depending on the need. There is no more buying and maintaining computer servers or data centers. It is much like utilities in your home, such as water, which is there when you turn on the faucet. If you turn it all the way, you get a high-pressure water stream. If you turn it down, you conserve water. If you don't need it anymore, you turn it off completely. With this model, developers and teams get the following benefits from on-demand cloud computing:

Agility: Quickly spin up resources as you need them. Develop and roll out new apps, experiment with new ideas, and fail quickly without risks.

Elasticity: Scale your resources as you need them. Cloud computing takes away "undifferentiated heavy lifting" – racking up additional servers and planning capacity for the future. These are things that don't help address your core business problems.

Global availability: With a click of a button, you can spin up resources that are closest to your customers/users without relocating your physical compute resources.

How does this impact the field of ML? As compute resources become easier to acquire, information exchange becomes much more frequent. As that happens, more data is generated and stored, and more data means more opportunities to train more accurate ML models. The agility, elasticity, and scale that cloud computing provides accelerate the development and application of ML models, shrinking cycles from weeks or months down to something much shorter, so that developers can generate and improve ML models faster than ever. Developers are no longer constrained by the physical compute resources available to them. With better ML models, businesses can make better decisions and provide better product experiences to customers.

For cloud computing, we will be using Amazon Web Services, which is the provider of Amazon SageMaker Studio, throughout the book.

Exploring AWS essentials for ML

Amazon Web Services (AWS) offers cloud computing resources to developers of all kinds to create applications and solutions for their businesses. AWS manages the technology and infrastructure in a secure environment and a scalable fashion, taking away the undifferentiated heavy lifting of infrastructure management from developers. AWS provides a broad range of services, including ML, artificial intelligence, the internet of things, analytics, and application development tools. These are built on top of the following key areas – compute, storage, databases, and security. Before we start our journey with Amazon SageMaker Studio, which is one of the ML offerings from AWS, it is important to know the core services that are commonly used while developing your ML projects on Amazon SageMaker Studio.

Compute

For ML in the cloud, developers need computational resources in all aspects of the life cycle. Amazon Elastic Compute Cloud (Amazon EC2) is the most fundamental cloud computing environment for developers to process, train, and host ML models. Amazon EC2 provides a wide range of compute instance types for many purposes, such as compute-optimized instances for compute-intensive work, memory-optimized instances for applications that have a large memory footprint, and Graphics Processing Unit (GPU)-accelerated instances for deep learning training.

Amazon SageMaker also offers on-demand compute resources for ML developers to run processing, training, and model hosting. Amazon SageMaker's ML instances are built on top of Amazon EC2 instances and come with fully managed versions of popular ML frameworks, such as TensorFlow, PyTorch, MXNet, and scikit-learn, that are optimized for Amazon EC2 compute instances. Developers do not need to manage the provisioning and patching of the ML instances, so they can focus on the ML life cycle.
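As a preview of how little infrastructure code this requires, here is a hedged sketch using the SageMaker Python SDK; the training script name and S3 path are hypothetical placeholders, and the workflow itself is covered in later chapters:

    import sagemaker
    from sagemaker.sklearn import SKLearn

    # train.py and the S3 path below are hypothetical placeholders
    estimator = SKLearn(
        entry_point="train.py",
        framework_version="1.0-1",
        instance_type="ml.m5.xlarge",   # a SageMaker ML instance type
        instance_count=1,
        role=sagemaker.get_execution_role(),
    )

    # SageMaker provisions the instance, runs the job, and tears it down
    estimator.fit({"train": "s3://my-bucket/churn/train/"})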

Storage

While conducting an ML project, developers need to be able to access files and store code and artifacts. Reliable storage is crucial to an ML project. AWS provides several storage options for ML development. Amazon Simple Storage Service (Amazon S3) and Amazon Elastic File System (Amazon EFS) are the two most relevant to the development of ML projects in Amazon SageMaker Studio.

Amazon S3 is an object storage service that allows developers to store any amount of data with high security, availability, and scalability. ML developers can store structured and unstructured data, and ML models with versioning on Amazon S3. Amazon S3 can also be used to build a data lake for analytics and to store backups and archives.
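Interacting with Amazon S3 from Python typically goes through the boto3 SDK. A minimal sketch follows; the bucket and key names are hypothetical:

    import boto3

    s3 = boto3.client("s3")

    # Upload a local training file to an S3 bucket (names are placeholders)
    s3.upload_file("churn.csv", "my-ml-bucket", "data/churn.csv")

    # Read the object back, for example from a processing or training script
    response = s3.get_object(Bucket="my-ml-bucket", Key="data/churn.csv")
    data = response["Body"].read()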

Amazon EFS provides a fully managed, serverless filesystem that allows developers to store and share files across users without any storage provisioning, as the filesystem increases and decreases its capacity automatically when you add or delete files. It is often used in High-Performance Computing (HPC) settings and in applications that require parallel or simultaneous high-throughput data access across threads, processing tasks, compute instances, and users. As Amazon SageMaker Studio embeds an Amazon EFS filesystem, each user on Amazon SageMaker Studio gets a home directory for storing and accessing data, code, and notebooks.

Database and analytics

Besides