The Machine Learning Solutions Architect Handbook - David Ping - E-Book

The Machine Learning Solutions Architect Handbook E-Book

David Ping

Description

David Ping, Head of GenAI and ML Solution Architecture for global industries at AWS, provides expert insights and practical examples to help you become a proficient ML solutions architect, linking technical architecture to business-related skills.
You'll learn about ML algorithms, cloud infrastructure, system design, MLOps, and how to apply ML to solve real-world business problems. David explains the generative AI project lifecycle and examines Retrieval Augmented Generation (RAG), an effective architecture pattern for generative AI applications. You’ll also learn about open-source technologies, such as Kubernetes/Kubeflow, for building a data science environment and ML pipelines before building an enterprise ML architecture using AWS. Alongside coverage of ML risk management and the different stages of AI/ML adoption, the biggest new addition to the handbook is a deep exploration of generative AI.
By the end of this book, you’ll have gained a comprehensive understanding of AI/ML across all key aspects, including business use cases, data science, real-world solution architecture, risk management, and governance. You’ll possess the skills to design and construct ML solutions that effectively cater to common use cases and follow established ML architecture patterns, enabling you to excel as a true professional in the field.

You can read the e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 857

The Machine Learning Solutions Architect Handbook

Second Edition

Practical strategies and best practices on the ML lifecycle, system design, MLOps, and generative AI

David Ping

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Bhavesh Amin

Acquisition Editor – Peer Reviews: Gaurav Gavas

Project Editor: Amisha Vathare

Content Development Editor: Tanya D’cruz

Copy Editor: Safis Editing

Technical Editor: Anjitha Murali

Proofreader: Safis Editing

Indexer: Hemangini Bari

Presentation Designer: Ajay Patule

Developer Relations Marketing Executive: Monika Sangwan

First published: January 2022

Second edition: April 2024

Production reference: 1080424

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-80512-250-0

www.packt.com

Contributors

About the author

David Ping is a seasoned technology executive with over 25 years of experience in the technology and financial services sectors. Specializing in cloud architecture, AI/ML, generative AI, ML platforms, and data analytics, he currently leads a global AI/ML solutions architecture team for industries at AWS, guiding companies worldwide in deploying cutting-edge AI/ML solutions. Previously holding executive roles at Credit Suisse and JPMorgan, David began his career as a software engineer at Intel after graduating with an engineering degree from Cornell University.

About the reviewers

Sepehr Pakbaz has been developing software since 2000 and has experience in full-stack software development, working with a variety of programming languages such as Python, JavaScript, .NET, and recently Golang. He has also worked as a product owner, consultant, and cloud solution architect. He has worked for companies like IBM and Microsoft in the past and is currently a Solutions Architect at Amazon Web Services. Additionally, he works as a consultant for his own company, Starspak LLC, as a side hustle.

Chakravarthy Nagarajan is a technology evangelist with 23 years of industry experience in ML, big data, and high-performance computing. He is currently working as a Principal AI/ML Specialist Solutions Architect at Amazon Web Services based in the Bay Area, USA. He helps customers solve real-world complex business problems by building prototypes with end-to-end AI/ML solutions on cloud and edge devices. His specialization includes generative AI, computer vision, natural language processing, time series forecasting, and personalization. In his current role, Chakravarthy helps customers across start-ups, enterprises, and ISVs to solve their business problems using AI and ML solutions across North America.

Amit Nandi is a Solutions and Enterprise Architect specializing in driving innovation across diverse industries, including financial, pharmaceutical, manufacturing, and retail. He is recognized for architecting and implementing groundbreaking business paradigms through the integration of big data technologies, real-time streaming, and cutting-edge ML and AI solutions. He built an ML/AI-powered cybersecurity platform and enabled MLOps for the research team of a large pharmaceutical company.

Join our community on Discord

Join our community’s Discord space for discussions with the author and other readers:

https://packt.link/mlsah

Contents

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

Navigating the ML Lifecycle with ML Solutions Architecture

ML versus traditional software

ML lifecycle

Business problem understanding and ML problem framing

Data understanding and data preparation

Model training and evaluation

Model deployment

Model monitoring

Business metric tracking

ML challenges

ML solutions architecture

Business understanding and ML transformation

Identification and verification of ML techniques

System architecture design and implementation

ML platform workflow automation

Security and compliance

Summary

Exploring ML Business Use Cases

ML use cases in financial services

Capital market front office

Sales trading and research

Investment banking

Wealth management

Capital market back office operations

Net Asset Value review

Post-trade settlement failure prediction

Risk management and fraud

Anti-money laundering

Trade surveillance

Credit risk

Insurance

Insurance underwriting

Insurance claim management

ML use cases in media and entertainment

Content development and production

Content management and discovery

Content distribution and customer engagement

ML use cases in healthcare and life sciences

Medical imaging analysis

Drug discovery

Healthcare data management

ML use cases in manufacturing

Engineering and product design

Manufacturing operations – product quality and yield

Manufacturing operations – machine maintenance

ML use cases in retail

Product search and discovery

Targeted marketing

Sentiment analysis

Product demand forecasting

ML use cases in the automotive industry

Autonomous vehicles

Perception and localization

Decision and planning

Control

Advanced driver assistance systems (ADAS)

Summary

Exploring ML Algorithms

Technical requirements

How machines learn

Overview of ML algorithms

Consideration for choosing ML algorithms

Algorithms for classification and regression problems

Linear regression algorithms

Logistic regression algorithms

Decision tree algorithms

Random forest algorithm

Gradient boosting machine and XGBoost algorithms

K-nearest neighbor algorithm

Multi-layer perceptron (MLP) networks

Algorithms for clustering

Algorithms for time series analysis

ARIMA algorithm

DeepAR algorithm

Algorithms for recommendation

Collaborative filtering algorithm

Multi-armed bandit/contextual bandit algorithm

Algorithms for computer vision problems

Convolutional neural networks

ResNet

Algorithms for natural language processing (NLP) problems

Word2Vec

BERT

Generative AI algorithms

Generative adversarial network

Generative pre-trained transformer (GPT)

Large Language Model

Diffusion model

Hands-on exercise

Problem statement

Dataset description

Setting up a Jupyter Notebook environment

Running the exercise

Summary

Data Management for ML

Technical requirements

Data management considerations for ML

Data management architecture for ML

Data storage and management

AWS Lake Formation

Data ingestion

Kinesis Firehose

AWS Glue

AWS Lambda

Data cataloging

AWS Glue Data Catalog

Custom data catalog solution

Data processing

ML data versioning

S3 partitions

Versioned S3 buckets

Purpose-built data version tools

ML feature stores

Data serving for client consumption

Consumption via API

Consumption via data copy

Special databases for ML

Vector databases

Graph databases

Data pipelines

Authentication and authorization

Data governance

Data lineage

Other data governance measures

Hands-on exercise – data management for ML

Creating a data lake using Lake Formation

Creating a data ingestion pipeline

Creating a Glue Data Catalog

Discovering and querying data in the data lake

Creating an Amazon Glue ETL job to process data for ML

Building a data pipeline using Glue workflows

Summary

Exploring Open-Source ML Libraries

Technical requirements

Core features of open-source ML libraries

Understanding the scikit-learn ML library

Installing scikit-learn

Core components of scikit-learn

Understanding the Apache Spark ML library

Installing Spark ML

Core components of the Spark ML library

Understanding the TensorFlow deep learning library

Installing TensorFlow

Core components of TensorFlow

Hands-on exercise – training a TensorFlow model

Understanding the PyTorch deep learning library

Installing PyTorch

Core components of PyTorch

Hands-on exercise – building and training a PyTorch model

How to choose between TensorFlow and PyTorch

Summary

Kubernetes Container Orchestration Infrastructure Management

Technical requirements

Introduction to containers

Overview of Kubernetes and its core concepts

Namespaces

Pods

Deployment

Kubernetes Job

Kubernetes custom resources and operators

Services

Networking on Kubernetes

Security and access management

API authentication and authorization

Hands-on – creating a Kubernetes infrastructure on AWS

Problem statement

Lab instruction

Summary

Open-Source ML Platforms

Core components of an ML platform

Open-source technologies for building ML platforms

Implementing a data science environment

Building a model training environment

Registering models with a model registry

Serving models using model serving services

The Gunicorn and Flask inference engine

The TensorFlow Serving framework

The TorchServe serving framework

KFServing framework

Seldon Core

Triton Inference Server

Monitoring models in production

Managing ML features

Automating ML pipeline workflows

Apache Airflow

Kubeflow Pipelines

Designing an end-to-end ML platform

ML platform-based strategy

ML component-based strategy

Summary

Building a Data Science Environment Using AWS ML Services

Technical requirements

SageMaker overview

Data science environment architecture using SageMaker

Onboarding SageMaker users

Launching Studio applications

Preparing data

Preparing data interactively with SageMaker Data Wrangler

Preparing data at scale interactively

Processing data as separate jobs

Creating, storing, and sharing features

Training ML models

Tuning ML models

Deploying ML models for testing

Best practices for building a data science environment

Hands-on exercise – building a data science environment using AWS services

Problem statement

Dataset description

Lab instructions

Setting up SageMaker Studio

Launching a JupyterLab notebook

Training the BERT model in the Jupyter notebook

Training the BERT model with the SageMaker Training service

Deploying the model

Building ML models with SageMaker Canvas

Summary

Designing an Enterprise ML Architecture with AWS ML Services

Technical requirements

Key considerations for ML platforms

The personas of ML platforms and their requirements

ML platform builders

Platform users and operators

Common workflow of an ML initiative

Platform requirements for the different personas

Key requirements for an enterprise ML platform

Enterprise ML architecture pattern overview

Model training environment

Model training engine using SageMaker

Automation support

Model training lifecycle management

Model hosting environment

Inference engines

Authentication and security control

Monitoring and logging

Adopting MLOps for ML workflows

Components of the MLOps architecture

Monitoring and logging

Model training monitoring

Model endpoint monitoring

ML pipeline monitoring

Service provisioning management

Best practices in building and operating an ML platform

ML platform project execution best practices

ML platform design and implementation best practices

Platform use and operations best practices

Summary

Advanced ML Engineering

Technical requirements

Training large-scale models with distributed training

Distributed model training using data parallelism

Parameter server overview

AllReduce overview

Distributed model training using model parallelism

Naïve model parallelism overview

Tensor parallelism/tensor slicing overview

Implementing model-parallel training

Achieving low-latency model inference

How model inference works and opportunities for optimization

Hardware acceleration

Central processing units (CPUs)

Graphics processing units (GPUs)

Application-specific integrated circuit

Model optimization

Quantization

Pruning (also known as sparsity)

Graph and operator optimization

Graph optimization

Operator optimization

Model compilers

TensorFlow XLA

PyTorch Glow

Apache TVM

Amazon SageMaker Neo

Inference engine optimization

Inference batching

Enabling parallel serving sessions

Picking a communication protocol

Inference in large language models

Text Generation Inference (TGI)

DeepSpeed-Inference

FastTransformer

Hands-on lab – running distributed model training with PyTorch

Problem statement

Dataset description

Modifying the training script

Modifying and running the launcher notebook

Summary

Building ML Solutions with AWS AI Services

Technical requirements

What are AI services?

Overview of AWS AI services

Amazon Comprehend

Amazon Textract

Amazon Rekognition

Amazon Transcribe

Amazon Personalize

Amazon Lex V2

Amazon Kendra

Amazon Q

Evaluating AWS AI services for ML use cases

Building intelligent solutions with AI services

Automating loan document verification and data extraction

Loan document classification workflow

Loan data processing flow

Media processing and analysis workflow

E-commerce product recommendation

Customer self-service automation with intelligent search

Designing an MLOps architecture for AI services

AWS account setup strategy for AI services and MLOps

Code promotion across environments

Monitoring operational metrics for AI services

Hands-on lab – running ML tasks using AI services

Summary

AI Risk Management

Understanding AI risk scenarios

The regulatory landscape around AI risk management

Understanding AI risk management

Governance oversight principles

AI risk management framework

Applying risk management across the AI lifecycle

Business problem identification and definition

Data acquisition and management

Risk considerations

Risk mitigations

Experimentation and model development

Risk considerations

Risk mitigations

AI system deployment and operations

Risk considerations

Risk mitigations

Designing ML platforms with governance and risk management considerations

Data and model documentation

Lineage and reproducibility

Observability and auditing

Scalability and performance

Data quality

Summary

Bias, Explainability, Privacy, and Adversarial Attacks

Understanding bias

Understanding ML explainability

LIME

SHAP

Understanding security and privacy-preserving ML

Differential privacy

Understanding adversarial attacks

Evasion attacks

PGD attacks

HopSkipJump attacks

Data poisoning attacks

Clean-label backdoor attack

Model extraction attack

Attacks against generative AI models

Defense against adversarial attacks

Robustness-based methods

Detector-based method

Open-source tools for adversarial attacks and defenses

Hands-on lab – detecting bias, explaining models, training privacy-preserving models, and simulating adversarial attacks

Problem statement

Detecting bias in the training dataset

Explaining feature importance for a trained model

Training privacy-preserving models

Simulate a clean-label backdoor attack

Summary

Charting the Course of Your ML Journey

ML adoption stages

Exploring AI/ML

Disjointed AI/ML

Integrated AI/ML

Advanced AI/ML

AI/ML maturity and assessment

Technical maturity

Business maturity

Governance maturity

Organization and talent maturity

Maturity assessment and improvement process

AI/ML operating models

Centralized model

Decentralized model

Hub and spoke model

Solving ML journey challenges

Developing the AI vision and strategy

Getting started with the first AI/ML initiative

Solving scaling challenges with AI/ML adoption

Solving ML use case scaling challenges

Solving technology scaling challenges

Solving governance scaling challenges

Summary

Navigating the Generative AI Project Lifecycle

The advancement and economic impact of generative AI

What industries are doing with generative AI

Financial services

Healthcare and life sciences

Media and entertainment

Automotive and manufacturing

The lifecycle of a generative AI project and the core technologies

Business use case selection

FM selection and evaluation

Initial screening via manual assessment

Automated model evaluation

Human evaluation

Assessing AI risks for FMs

Other evaluation consideration

Building FMs from scratch via pre-training

Adaptation and customization

Domain adaptation pre-training

Fine-tuning

Reinforcement learning from human feedback

Prompt engineering

Model management and deployment

The limitations, risks, and challenges of adopting generative AI

Summary

Designing Generative AI Platforms and Solutions

Operational considerations for generative AI platforms and solutions

New generative AI workflow and processes

New technology components

New roles

Exploring generative AI platforms

The prompt management component

FM benchmark workbench

Supervised fine-tuning and RLHF

FM monitoring

The retrieval-augmented generation pattern

Open-source frameworks for RAG

LangChain

LlamaIndex

Evaluating a RAG pipeline

Advanced RAG patterns

Designing a RAG architecture on AWS

Choosing an LLM adaptation method

Response quality

Cost of the adaptation

Implementation complexity

Bringing it all together

Considerations for deploying generative AI applications in production

Model readiness

Decision-making workflow

Responsible AI assessment

Guardrails in production environments

External knowledge change management

Practical generative AI business solutions

Generative AI-powered semantic search engine

Financial data analysis and research workflow

Clinical trial recruiting workflow

Media entertainment content creation workflow

Car design workflow

Contact center customer service operation

Are we close to having artificial general intelligence?

The symbolic approach

The connectionist/neural network approach

The neural-symbolic approach

Summary

Other Books You May Enjoy

Index


Share your thoughts

Once you’ve read The Machine Learning Solutions Architect Handbook, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there. You can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below:

https://packt.link/free-ebook/9781805122500

Submit your proof of purchase.

That’s it! We’ll send your free PDF and other benefits to your email directly.

1

Navigating the ML Lifecycle with ML Solutions Architecture

The field of artificial intelligence (AI) and machine learning (ML) has a long history. Over the last 70+ years, ML has evolved from checkers-playing computer programs in the 1950s to advanced AI capable of beating the human world champion in the game of Go. More recently, generative AI (GenAI) technology such as ChatGPT has taken the industry by storm, generating huge interest among company executives and consumers alike and promising new ways to transform business activities such as drug discovery, media content creation, financial report analysis, and consumer product design. Along the way, the technology infrastructure for ML has also evolved from a single machine/server for small experiments and models to highly complex end-to-end ML platforms capable of training, managing, and deploying tens of thousands of ML models. The hyper-growth in the AI/ML field has resulted in the creation of many new professional roles, such as MLOps engineer, AI/ML product manager, ML software engineer, AI risk manager, and AI strategist, across a range of industries.

Machine learning solutions architecture (ML solutions architecture) is another relatively new discipline that is playing an increasingly critical role in the full end-to-end ML lifecycle as ML projects become increasingly complex in terms of business impact, scientific sophistication, and the technology landscape.

This chapter will help you understand where ML solutions architecture fits in the full data science lifecycle. We will discuss the steps it takes to get an ML project from the ideation stage to production and the challenges organizations face when implementing an ML initiative, such as use case identification, data quality issues, and a shortage of ML talent. Finally, we will finish the chapter by briefly discussing the core focus areas of ML solutions architecture, including system architecture, workflow automation, and security and compliance.

In this chapter, we are going to cover the following main topics:

ML versus traditional software

The ML lifecycle and its key challenges

What is ML solutions architecture, and where does it fit in the overall lifecycle?

Upon completing this chapter, you will understand the role of an ML solutions architect and what business and technology areas you need to focus on to support end-to-end ML initiatives. The intent of this chapter is to offer a fundamental introduction to the ML lifecycle for those in the early stages of their exploration in the field. Experienced ML practitioners may wish to skip this foundational overview and proceed directly to more advanced content.

The more advanced content commences in Chapter 4. However, many technical practitioners may find Chapter 2 helpful, as they often need a better business understanding of where ML can be applied in different businesses and workflows. Additionally, Chapter 3 could prove beneficial for certain practitioners, as it provides an introduction to ML algorithms for those new to the topic and can also serve as a refresher for those who practice these concepts regularly.

ML versus traditional software

Before I started working in the field of AI/ML, I spent many years building computer software platforms for large financial services institutions. Some of the business problems I worked on had complex rules, such as identifying companies for comparable analysis for investment banking deals or creating a master database for all the different companies’ identifiers from the different data providers. We had to implement hardcoded rules in database-stored procedures and application server backends to solve these problems. We often debated if certain rules made sense or not for the business problems we tried to solve.

As rules changed, we had to reimplement the rules and make sure the changes did not break anything. To test new releases or changes, we often relied on human experts to exhaustively test and validate all the implemented business logic before the production release. It was a very time-consuming and error-prone process that required a significant amount of engineering, testing against the documented specification, and rigorous change management for deployment every time new rules were introduced or existing rules needed to be changed. We often relied on users to report business logic issues in production, and when an issue was reported, we sometimes had to open up the source code to troubleshoot or explain how the logic worked. I remember often asking myself if there were better ways to do this.

After I started working in the field of AI/ML, I started to solve many similar challenges using ML techniques. With ML, I did not need to come up with complex decision-making rules, which often require deep data and domain expertise to create and maintain. Instead, I focused on collecting high-quality data and used ML algorithms to learn the rules and patterns from the data directly. This new approach eliminated many of the challenging aspects of creating new rules (for example, the need for deep domain expertise and the risk of human bias) and maintaining existing rules. To validate the model before the production release, we could examine model performance metrics such as accuracy. While it still required data science expertise to interpret the model metrics against the nature of the business problems and dataset, it did not require exhaustive manual testing of all the different scenarios. When a model was deployed into production, we would check that it performed as expected by monitoring any significant changes in production data versus the data we had collected for model training. We would also collect new production data and labels and test the model performance periodically to ensure that its predictive accuracy remained robust when faced with previously unseen data. To explain why a model made a decision the way it did, we did not need to open up the source code to re-examine the hardcoded logic. Instead, we would rely on ML techniques to help explain the relative importance of different input features to understand what factors were most influential in the decision-making by the ML models.
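The contrast described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the transaction amounts, labels, the 1000 cut-off, and the function names are all hypothetical, the "learning algorithm" is a one-feature decision stump, and the drift signal is a crude mean-shift measure rather than anything a production monitoring system would rely on.

```python
from statistics import mean, stdev


def hardcoded_rule(amount):
    """A rule an engineer might write by hand (hypothetical cut-off)."""
    return amount > 1000.0


def learn_threshold(amounts, labels):
    """Pick the cut-off that maximizes accuracy on labeled examples.

    This is a one-feature decision stump: the rule is derived from
    the data instead of being written down by a domain expert.
    """
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted(set(amounts)):
        predictions = [a >= candidate for a in amounts]
        accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold, best_accuracy


def accuracy_of(rule, amounts, labels):
    """Fraction of examples on which a rule agrees with the labels."""
    return sum(rule(a) == y for a, y in zip(amounts, labels)) / len(labels)


def mean_shift_in_std(train_values, prod_values):
    """Crude drift signal: shift of the production mean, measured in
    standard deviations of the training data."""
    return abs(mean(prod_values) - mean(train_values)) / stdev(train_values)


# Made-up labeled history: amount and whether experts flagged it.
train_amounts = [120.0, 250.0, 400.0, 640.0, 700.0, 820.0, 950.0, 1500.0]
train_labels = [False, False, False, True, True, True, True, True]

threshold, train_accuracy = learn_threshold(train_amounts, train_labels)
hardcoded_accuracy = accuracy_of(hardcoded_rule, train_amounts, train_labels)

# Made-up production data that looks very different from training data.
production_amounts = [1800.0, 2000.0, 2200.0, 2400.0]
drift = mean_shift_in_std(train_amounts, production_amounts)

print(f"learned threshold: {threshold}, accuracy: {train_accuracy:.2f}")
print(f"hardcoded rule accuracy: {hardcoded_accuracy:.2f}")
print(f"production mean shift: {drift:.2f} training standard deviations")
```

On this made-up data the learned cut-off fits the labels better than the hand-written one, and the large mean shift is the kind of signal that would prompt re-evaluating the model on freshly labeled production data.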

The following figure shows a graphical view of the process differences between developing a piece of software and training an ML model:

Figure 1.1: ML and computer software

Now that you know the difference between ML and traditional software, it is time to dive deep into understanding the different stages in an ML lifecycle.

ML lifecycle

One of the early ML projects that I worked on was a fascinating yet daunting sports predictive analytics problem for a major league brand. I was given a list of predictive analytics outcomes to think about to see if there were ML solutions for the problems. I was a casual viewer of the sport; I didn’t know anything about the analytics to be generated, nor the rules of the games in the detail that was needed. I was provided with some sample data but had no idea what to do with it.

The first thing I started to work on was an immersion in the sport itself. I delved into the intricacies of the game, studying the different player positions and events that make up each game and play. Only after being armed with the newfound domain knowledge did the data start to make sense. Working with the stakeholder, we evaluated the impact of the different analytics outcomes and assessed the modeling feasibility based on the data we had. With a clear understanding of the data, we settled on a couple of ML analytics with the highest business impact to focus on. We also decided how they would be integrated into the existing business workflow and how their impact would be measured.

Subsequently, I delved deeper into the data to ascertain what information was available and what was lacking. The raw dataset had a lot of irrelevant data points that needed to be removed while the relevant data points needed to be transformed to provide the strongest signals for model training. I processed and prepared the dataset based on a few of the ML algorithms I had considered and conducted experiments to determine the best approach. I lacked a tool to track the different experiment results, so I had to document what I had done manually. After some initial rounds of experimentation, it became evident that the existing data was not sufficient to train a high-performance model. Hence, I decided to build a custom deep learning model to incorporate data of different modalities as the data points had temporal dependencies and required additional spatial information for the modeling. The data owner was able to provide the additional datasets I required, and after more experiments with custom algorithms and significant data preparations and feature engineering, I eventually trained a model that met the business objectives.

After completing the model, another hard challenge began – deploying and operationalizing the model in production and integrating it into the existing business workflow and system architecture. We engaged in many architecture and engineering discussions and eventually built out a deployment architecture for the model.

As you can see from my personal experience, the journey from business idea to ML production deployment involved many steps. A typical lifecycle of an ML project follows a formal structure, which includes several essential stages like business understanding, data acquisition and understanding, data preparation, model building, model evaluation, and model deployment. Since a big component of the lifecycle is experimentation with different datasets, features, and algorithms, the whole process is highly iterative. Furthermore, it is essential to note that there is no guarantee of a successful outcome. Factors such as the availability and quality of data, feature engineering techniques (the process of using domain knowledge to extract useful features from raw data), and the capability of the learning algorithms, among others, can all affect the final results.

Figure 1.2: ML lifecycle

The preceding figure illustrates the key steps in ML projects, and in the subsequent sections, we will delve into each of these steps in greater detail.

Business problem understanding and ML problem framing

The first stage in the lifecycle is business understanding. This stage involves understanding the business goals and defining business metrics that can measure the project’s success. The following are some examples of business goals:

Cost reduction for operational processes, such as document processing.
Mitigation of business or operational risks, such as fraud and compliance.
Product or service revenue improvements, such as better target marketing, new insight generation for better decision making, and increased customer satisfaction.

To measure success, you may use specific business metrics such as the number of hours saved in a business process, an increase in the number of true positive frauds detected, a conversion rate improvement from target marketing, or a reduction in the churn rate. This is an essential step to get right to ensure there is sufficient justification for an ML project and that the outcome of the project can be successfully measured.

After you have defined the business goals and business metrics, you need to evaluate if there is an ML solution for the business problem. While ML has a wide scope of applications, it is not always an optimal solution for every business problem.

Data understanding and data preparation

The saying that “data is the new oil” holds particularly true for ML. Without the required data, you cannot move forward with an ML project. That’s why the next step in the ML lifecycle is data acquisition, understanding, and preparation.

Based on the business problems and ML approach, you will need to gather and comprehend the available data to determine if you have the right data and data volume to solve the ML problem. For example, suppose the business problem to address is credit card fraud detection. In that case, you will need datasets such as historical credit card transaction data, customer demographics, account data, device usage data, and networking access data. Detailed data analysis is then necessary to determine if the dataset features and quality are sufficient for the modeling tasks. You also need to decide if the data needs labeling, such as fraud or not-fraud. During this step, depending on the data quality, a significant amount of data wrangling might be performed to prepare and clean the data and to generate the datasets for model training and model evaluation.
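As a minimal sketch of the last step, generating datasets for training and evaluation, the following splits a labeled dataset into training, validation, and test (holdout) sets. The record layout, the 70/15/15 proportions, and the function name `split_dataset` are illustrative assumptions, not something prescribed by the text.

```python
import random

def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle a copy of the records and cut it into train/validation/test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # holdout set: whatever remains
    return train, validation, test

# Hypothetical labeled transactions: (transaction_id, label)
records = [(i, "fraud" if i % 20 == 0 else "not-fraud") for i in range(100)]
train, validation, test = split_dataset(records)
print(len(train), len(validation), len(test))
```

With a rare label such as fraud, a purely random split can leave very few positive examples in the smaller sets, so in practice a stratified split is often preferred.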

Model training and evaluation

Using the training and validation datasets established, a data scientist must run a number of experiments using different ML algorithms and dataset features for feature selection and model development. This is a highly iterative process and could require numerous runs of data processing and model development to find the right algorithm and dataset combination for optimal model performance. In addition to model performance, factors such as data bias and model explainability may need to be considered to comply with internal or regulatory requirements.

Prior to deployment into production, the model quality must be validated using the relevant technical metrics, such as the accuracy score. This is usually accomplished using a holdout dataset, also known as a test dataset, to gauge how the model performs on unseen data. It is crucial to understand which metrics are appropriate for model validation, as they vary depending on the ML problems and the dataset used. For example, model accuracy would be a suitable validation metric for a document classification use case if the number of document types is relatively balanced. However, model accuracy would not be a good metric to evaluate the model performance for a fraud detection use case – this is because the number of frauds is small and even if the model predicts not-fraud all the time, the model accuracy could still be very high.
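The fraud example can be made concrete with a small calculation. The sketch below uses made-up counts (10,000 transactions, 50 of them fraudulent) and a trivial "model" that always predicts not-fraud. Recall is shown alongside accuracy only as one possible alternative metric, not as the metric the text prescribes.

```python
# Hypothetical label counts: 1 = fraud, 0 = not-fraud.
y_true = [1] * 50 + [0] * 9950

# A trivial "model" that predicts not-fraud for every transaction.
y_pred = [0] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model identified."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.3f} recall={rec:.3f}")
```

The always-not-fraud model is right 99.5% of the time yet catches none of the 50 frauds, which is why accuracy alone is misleading on heavily imbalanced data.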

Model deployment

After the model is fully trained and validated to meet the expected performance metric, it can be deployed into production and the business workflow. There are two main deployment concepts here. The first involves the deployment of the model itself to be used by a client application to generate predictions. The second concept is to integrate this prediction workflow into a business workflow application. For example, the credit card fraud model could be deployed either behind an API for real-time prediction or as a package that can be loaded dynamically to support batch predictions. Moreover, this prediction workflow also needs to be integrated into business workflow applications for fraud detection, which might include fraud detection for real-time transactions, decision automation based on prediction output, and detailed fraud analytics.
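The difference between the real-time and batch paths can be sketched as follows. Everything here is a hypothetical stand-in: `FraudModel` is a hard-coded rule rather than a trained model, and `handle_request` only mimics what an API handler would do, without any web framework.

```python
import json

class FraudModel:
    """Stand-in for a trained model: a fixed rule, not a real classifier."""

    def __init__(self, amount_threshold=1000.0):
        self.amount_threshold = amount_threshold

    def predict(self, transaction):
        is_fraud = transaction["amount"] > self.amount_threshold
        return "fraud" if is_fraud else "not-fraud"

MODEL = FraudModel()  # loaded once, then reused by both paths

def handle_request(body: str) -> str:
    """Real-time path: one JSON request in, one JSON response out."""
    transaction = json.loads(body)
    return json.dumps({"prediction": MODEL.predict(transaction)})

def score_batch(transactions):
    """Batch path: score a whole collection of records in one pass."""
    return [MODEL.predict(t) for t in transactions]

print(handle_request('{"amount": 2500.0}'))
print(score_batch([{"amount": 20.0}, {"amount": 5000.0}]))
```

Both paths call the same `predict` method. What differs is how requests arrive and how results are returned, which is the part that has to fit the surrounding business workflow.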

Model monitoring

The ML lifecycle does not end with model deployment. Unlike software, whose behavior is highly deterministic since developers explicitly code its logic, an ML model could behave differently in production from its behavior in model training and validation. This could be caused by changes in the production data characteristics, data distribution, or the potential manipulation of request data. Therefore, model monitoring is an important post-deployment step for detecting model performance degradation (a.k.a. model drift) or dataset distribution change in the production environment (a.k.a. data drift).
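One simple way to quantify data drift for a single numeric feature is the population stability index (PSI), which compares the binned distribution of production values against the training values. PSI is a technique chosen here for illustration and is not named in the text. The bin count, the `eps` floor for empty bins, and the sample data are all assumptions.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index of `actual` relative to `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature values seen in training and in production.
training = [i / 100 for i in range(1000)]
production_same = list(training)
production_shifted = [v + 3 for v in training]

print(psi(training, production_same))
print(psi(training, production_shifted))
```

An identical distribution yields a PSI of zero, and the value grows as the production distribution moves away from the training one. The alerting threshold is a judgment call that depends on the feature and the use case.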

Business metric tracking

The actual business impact should be tracked and measured as an ongoing process to ensure the model delivers the expected business benefits. This may involve comparing the business metrics before and after the model deployment, or A/B testing where a business metric is compared between workflows with or without the ML model. If the model does not deliver the expected benefits, it should be re-evaluated for improvement opportunities. This could also mean framing the business problem as a different ML problem. For example, if churn prediction does not help improve customer satisfaction, then consider a personalized product/service offering to solve the problem.
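For the A/B testing case, a conversion-style business metric can be compared between the workflow without the model (control) and the workflow with it (treatment). The sketch below uses a two-proportion z-test, which is one common choice rather than the method the text prescribes. The counts are invented for illustration.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (lift, z, two-sided p-value) for group B versus group A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value

# Hypothetical results: 200 of 5,000 conversions without the model,
# 260 of 5,000 conversions with the model.
lift, z, p_value = two_proportion_z(200, 5000, 260, 5000)
print(f"lift={lift:.4f} z={z:.2f} p={p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be noise, but whether the lift is large enough to justify the ML solution remains a business decision.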

ML challenges

Over the years, I have worked on many real-world problems using ML solutions and encountered different challenges faced by different industries during ML adoption.

I often get the same question when working on ML projects: We have a lot of data – can you help us figure out what insights we can generate using ML? I refer to companies with this question as having a business use case challenge. Not being able to identify business use cases for ML is a very big hurdle for many companies. Without a properly identified business problem and its value proposition and benefit, it becomes difficult to initiate an ML project.

In my conversations with companies across different industries, data-related challenges emerge as a frequent issue. These include data quality, data inventory, data accessibility, data governance, and data availability. These problems affect both data-poor and data-rich companies and are often exacerbated by data silos, data security, and industry regulations.

The shortage of data science and ML talent is another major challenge I have heard from many companies. Companies, in general, are having a tough time attracting and retaining top ML talent, which is a common problem across all industries. As ML platforms become more complex and the scope of ML projects increases, the need for other ML-related functions starts to surface. Nowadays, in addition to data scientists, an organization also needs functional roles for ML product management, ML infrastructure engineering, and ML operations management.

Based on my experiences, I have observed that cultural acceptance of ML-based solutions is another significant challenge for broad adoption. There are individuals who perceive ML as a threat to their job functions, and their lack of knowledge in ML makes them hesitant to adopt these new methods in their business workflows.

The practice of ML solutions architecture aims to help solve some of the challenges in ML. In the next section, we will explore ML solutions architecture and its role in the ML lifecycle.

ML solutions architecture

When I initially worked with companies as an ML solutions architect, the landscape was quite different from what it is now. The focus was mainly on data science and modeling, and the problems at hand were small in scope. Back then, most of the problems could be solved using simple ML techniques. The datasets were small, and the infrastructure required was not too demanding. The scope of the ML initiative at these companies was limited to a few data scientists or teams. As an ML architect at that time, I primarily needed to have solid data science skills and general cloud architecture knowledge to get the job done.

In more recent years, the landscape of ML initiatives has become more intricate and multifaceted, necessitating involvement from a broader range of functions and personas at companies. My engagement has expanded to include discussions with business executives about ML strategies and organizational design to facilitate the broad adoption of AI/ML throughout their enterprises. I have been tasked with designing more complex ML platforms, utilizing a diverse range of technologies for large enterprises to meet stringent security and compliance requirements. ML workflow orchestration and operations have become increasingly crucial topics of discussion, and more and more companies are looking to train large ML models with enormous amounts of training data. The number of ML models trained and deployed by some companies has skyrocketed to tens of thousands from a few dozen models in just a few years. Furthermore, sophisticated and security-sensitive customers have sought guidance on topics such as ML privacy, model explainability, and data and model bias. As an ML solutions architect, I’ve noticed that the skills and knowledge required to be successful in this role have evolved significantly.

Trying to navigate the complexities of a business, data, science, and technology landscape can be a daunting task. As an ML solutions architect, I have seen firsthand the challenges that companies face in bringing all these pieces together. In my view, ML solutions architecture is an essential discipline that serves as a bridge connecting the different components of an ML initiative. Drawing on my years of experience working with companies of all sizes and across diverse industries, I believe that an ML solutions architect plays a pivotal role in identifying business needs, developing ML solutions to address these needs, and designing the technology platforms necessary to run these solutions. By collaborating with various business and technology partners, an ML solutions architect can help companies unlock the full potential of their data and realize tangible benefits from their ML initiatives.

The following figure illustrates the core functional areas covered by the ML solutions architecture:

Figure 1.3: ML solutions architecture coverage

In the following sections, we will explore each of these areas in greater detail:

Business understanding: Business problem understanding and transformation using AI and ML.
Identification and verification of ML techniques: Identification and verification of ML techniques for solving specific ML problems.
System architecture of the ML technology platform: System architecture design and implementation of the ML technology platforms.
MLOps: ML platform automation technical design.
Security and compliance: Security, compliance, and audit considerations for the ML platform and ML models.

So, let’s dive in!

Business understanding and ML transformation

The goal of the business workflow analysis is to identify inefficiencies in the workflows and determine if ML can be applied to help eliminate pain points, improve efficiency, or even create new revenue opportunities.

Picture this: you are tasked with improving a call center’s operations. You know there are inefficiencies that need to be addressed, but you’re not sure where to start. That’s where business workflow analysis comes in. By analyzing the call center’s workflows, you can identify pain points such as long customer wait times, knowledge gaps among agents, and the inability to extract customer insights from call recordings. Once you have identified these issues, you can determine what data is available and which business metrics need to be improved. This is where ML comes in. You can use ML to create virtual assistants for common customer inquiries, transcribe audio recordings to allow for text analysis, and detect customer intent for product cross-sell and up-sell. But sometimes, you need to modify the business process to incorporate ML solutions. For example, if you want to use call recording analytics to generate insights for cross-selling or up-selling products, but there’s no established process to act on those insights, you may need to introduce an automated target marketing process or a proactive outreach process by the sales team.

Identification and verification of ML techniques

Once you have come up with a list of ML options, the next step is to determine if the assumption behind the ML approach is valid. This could involve conducting a simple proof of concept (POC) modeling to validate the available dataset and modeling approach, or technology POC using pre-built AI services, or testing of ML frameworks. For example, you might want to test the feasibility of text transcription from audio files using an existing text transcription service or build a customer propensity model for a new product conversion from a marketing campaign.

It is worth noting that ML solutions architecture does not focus on developing new machine learning algorithms, a job best suited for applied data scientists or research data scientists. Instead, ML solutions architecture focuses on identifying and applying ML algorithms to address a range of ML problems such as predictive analytics, computer vision, or natural language processing. Also, the goal of any modeling task here is not to build production-quality models but rather to validate the approach for further experimentation by full-time applied data scientists.

System architecture design and implementation

The most important aspect of the ML solutions architect’s role is the technical architecture design of the ML platform. The platform will need to provide the technical capability to support the different phases of the ML cycle and personas, such as data scientists and operations engineers. Specifically, an ML platform needs to have the following core functions:

Data exploration and experimentation: Data scientists use ML platforms for data exploration, experimentation, model building, and model evaluation. ML platforms need to provide capabilities such as data science development tools for model authoring and experimentation, data wrangling tools for data exploration and wrangling, source code control for code management, and a package repository for library package management.
Data management and large-scale data processing: Data scientists or data engineers will need the technical capability to ingest, store, access, and process large amounts of data for cleansing, transformation, and feature engineering.
Model training infrastructure management: ML platforms will need to provide model training infrastructure for different types of model training using different types of computing resources, storage, and networking configurations. They also need to support different types of ML libraries or frameworks, such as scikit-learn, TensorFlow, and PyTorch.
Model hosting/serving: ML platforms will need to provide the technical capability to host and serve the model for prediction generation, whether real-time, batch, or both.
Model management: Trained ML models will need to be managed and tracked for easy access and lookup, with relevant metadata.
Feature management: Common and reusable features will need to be managed and served for model training and model serving purposes.

ML platform workflow automation

A key aspect of ML platform design is workflow automation and continuous integration/continuous deployment (CI/CD), also known as MLOps. ML is a multi-step workflow, and its steps, including data processing, model training, model validation, and model hosting, need to be automated. Infrastructure provisioning automation and self-service is another aspect of automation design. Key components of workflow automation include the following:

Pipeline design and management: The ability to create different automation pipelines for various tasks, such as model training and model hosting.
Pipeline execution and monitoring: The ability to run different pipelines and monitor the pipeline execution status for the entire pipeline and each of the steps in the ML cycle, such as data processing and model training.
Model monitoring configuration: The ability to monitor the model in production for various metrics, such as data drift (where the distribution of data used in production deviates from the distribution of data used for model training), model drift (where the performance of the model degrades in production compared with training results), and bias detection (the ML model replicating or amplifying bias towards certain individuals).
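Pipeline design and per-step execution monitoring can be sketched in a few lines. The `Pipeline` class and the three step functions below are hypothetical and heavily simplified. Real orchestration tools add scheduling, retries, and persistence, none of which is modeled here.

```python
from dataclasses import dataclass, field

@dataclass
class Pipeline:
    """A sequential pipeline that records the status of each step."""
    name: str
    steps: list = field(default_factory=list)
    status: dict = field(default_factory=dict)

    def add_step(self, step_name, func):
        self.steps.append((step_name, func))
        self.status[step_name] = "pending"
        return self  # allow chained add_step calls

    def run(self, payload):
        for step_name, func in self.steps:
            self.status[step_name] = "running"
            try:
                payload = func(payload)  # output of one step feeds the next
            except Exception:
                self.status[step_name] = "failed"
                raise
            self.status[step_name] = "succeeded"
        return payload

# Hypothetical steps standing in for real data processing and training.
def process_data(raw):
    return [x for x in raw if x is not None]

def train_model(clean):
    return {"mean": sum(clean) / len(clean), "n": len(clean)}

def validate_model(model):
    if model["n"] < 3:
        raise ValueError("not enough training data")
    return model

pipeline = (
    Pipeline("training")
    .add_step("data_processing", process_data)
    .add_step("model_training", train_model)
    .add_step("model_validation", validate_model)
)
result = pipeline.run([1.0, None, 2.0, 3.0])
print(result, pipeline.status)
```

Because every step reports its own status, a monitoring view can show both the overall pipeline state and the exact step at which a run failed.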

Security and compliance

Another important aspect of ML solutions architecture is the security and compliance consideration in a sensitive or enterprise setting:

Authentication and authorization: The ML platform needs to provide authentication and authorization mechanisms to manage access to the platform and different resources and services.
Network security: The ML platform needs to be configured for different network security controls such as a firewall and an IP address access allowlist to prevent unauthorized access.
Data encryption: For security-sensitive organizations, data encryption is another important aspect of the design consideration for the ML platform.
Audit and compliance: Audit and compliance staff need the information to help them understand how decisions are made by the predictive models if required, the lineage of a model from data to model artifacts, and any bias exhibited in the data and model. The ML platform will need to provide model explainability, bias detection, and model traceability across the various datastore and service components, among other capabilities.
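As one small illustration of the network security item, an IP address allowlist check can be expressed with Python's standard `ipaddress` module. The CIDR ranges and the function name `is_allowed` are made up for this sketch. In practice such rules usually live in firewall or cloud network configuration rather than in application code.

```python
import ipaddress

# Hypothetical allowlisted networks in CIDR notation.
ALLOWLIST = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.1.0/24"),
]

def is_allowed(client_ip: str) -> bool:
    """Return True only if the address parses and falls in an allowed network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed input is rejected rather than trusted
    return any(addr in network for network in ALLOWLIST)

print(is_allowed("10.1.2.3"), is_allowed("8.8.8.8"))
```

Denying by default, including for input that cannot be parsed, keeps the check aligned with the goal of preventing unauthorized access.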

Various industry technology providers have established best practices to guide the design and implementation of ML infrastructure, which is part of the ML solutions architect’s practices. Amazon Web Services, for example, created Machine Learning Lens to provide architectural best practices across crucial domains like operational excellence, security, reliability, performance, cost optimization, and sustainability. Following these published guidelines can help practitioners implement robust and effective ML solutions.

Summary

In this chapter, I have shared some of my personal experience as an ML solutions architect and provided an overview of core concepts and components involved in the ML lifecycle. We discussed the key responsibilities of the ML solutions architect role throughout the lifecycle. This chapter aimed to give you an understanding of the technical and business domains required to work effectively as an ML solutions architect. With this foundational knowledge, you should now have an appreciation for the breadth of this role and its integral part in delivering successful ML solutions.

In the upcoming chapter, we will dive into various ML use cases across different industries, such as financial services and media and entertainment, to gain further insights into the practical applications of ML.

Join our community on Discord

Join our community’s Discord space for discussions with the author and other readers:

https://packt.link/mlsah

2

Exploring ML Business Use Cases

As an ML practitioner, it is essential for me to develop a deep understanding of different businesses to have effective conversations with business and technology leaders. This should not come as a surprise since the ultimate goal of any ML solutions architecture is to solve practical business problems with science and technology solutions. Therefore, one of the main areas of focus for ML solutions architecture is to develop a broad understanding of different business domains, workflows, and relevant data. Without this understanding, it would be challenging to make sense of the data and design and develop practical ML solutions for business problems.

In this chapter, we will explore various real-world ML use cases across several industry verticals. We will examine the key business workflows and challenges faced by industries such as financial services and retail, and how ML technologies can help solve these challenges. The aim of this chapter is not to make you an expert in any particular industry or its ML use cases and techniques but rather to expose you to real-world ML use cases in business contexts and workflows. After reading this chapter, you will be equipped to apply similar analytical thinking regarding ML solutions in your own line of business. You will gain perspective on identifying and evaluating where ML technology can provide value in your workflows, processes, and objectives. The cross-industry examples and scenarios are intended to spark ideas for how ML could address your unique business challenges, and broaden your thinking about ML opportunities.

Specifically, we will cover the following topics in this chapter:

ML use cases in financial services
ML use cases in media and entertainment
ML use cases in healthcare and life sciences
ML use cases in manufacturing
ML use cases in retail
ML use cases in the automotive industry

If you already have extensive experience as an ML practitioner with an in-depth understanding of your industry’s use cases and solutions, and you are not interested in learning about other industries, you may wish to skip this chapter and proceed directly to the next chapter where we introduce ML algorithms.

ML use cases in financial services

The Financial Services Industry (FSI) has always been at the forefront of technological innovation, and ML adoption is no exception. In recent years, we have seen a range of ML solutions being implemented across different business functions within financial services. For example, in capital markets, ML is being used across front, middle, and back offices to aid investment decisions, trade optimization, risk management, and transaction settlement processing. In insurance, companies are using ML to streamline underwriting, prevent fraud, and automate claim management. In banking, ML is being used to improve customer experience, combat fraud, and facilitate loan approval decisions. In the following sections, we will explore different core business areas within financial services and how ML can be applied to overcome some of their business challenges.

Capital market front office

In finance, the front office is the revenue-generating