David Ping, Head of GenAI and ML Solution Architecture for global industries at AWS, provides expert insights and practical examples to help you become a proficient ML solutions architect, linking technical architecture to business-related skills.
You'll learn about ML algorithms, cloud infrastructure, system design, MLOps, and how to apply ML to solve real-world business problems. David explains the generative AI project lifecycle and examines Retrieval Augmented Generation (RAG), an effective architecture pattern for generative AI applications. You’ll also learn about open-source technologies, such as Kubernetes/Kubeflow, for building a data science environment and ML pipelines before building an enterprise ML architecture using AWS. As well as ML risk management and the different stages of AI/ML adoption, the biggest new addition to the handbook is the deep exploration of generative AI.
By the end of this book, you’ll have gained a comprehensive understanding of AI/ML across all key aspects, including business use cases, data science, real-world solution architecture, risk management, and governance. You’ll possess the skills to design and construct ML solutions that effectively cater to common use cases and follow established ML architecture patterns, enabling you to excel as a true professional in the field.
Page count: 857
The Machine Learning Solutions Architect Handbook
Second Edition
Practical strategies and best practices on the ML lifecycle, system design, MLOps, and generative AI
David Ping
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Bhavesh Amin
Acquisition Editor – Peer Reviews: Gaurav Gavas
Project Editor: Amisha Vathare
Content Development Editor: Tanya D’cruz
Copy Editor: Safis Editing
Technical Editor: Anjitha Murali
Proofreader: Safis Editing
Indexer: Hemangini Bari
Presentation Designer: Ajay Patule
Developer Relations Marketing Executive: Monika Sangwan
First published: January 2022
Second edition: April 2024
Production reference: 1080424
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80512-250-0
www.packt.com
David Ping is a seasoned technology executive with over 25 years of experience in the technology and financial services sectors. Specializing in cloud architecture, AI/ML, generative AI, ML platforms, and data analytics, he currently leads a global AI/ML solutions architecture team for industries at AWS, guiding companies worldwide in deploying cutting-edge AI/ML solutions. Previously holding executive roles at Credit Suisse and JPMorgan, David began his career as a software engineer at Intel after graduating with an engineering degree from Cornell University.
Sepehr Pakbaz has been developing software since 2000 and has experience in full-stack software development, working with a variety of programming languages such as Python, JavaScript, .NET, and recently Golang. He has also worked as a product owner, consultant, and cloud solution architect. He has worked for companies like IBM and Microsoft in the past and is currently a Solutions Architect at Amazon Web Services. Additionally, he works as a consultant for his own company, Starspak LLC, as a side hustle.
Chakravarthy Nagarajan is a technology evangelist with 23 years of industry experience in ML, big data, and high-performance computing. He is currently working as a Principal AI/ML Specialist Solutions Architect at Amazon Web Services based in the Bay Area, USA. He helps customers solve real-world complex business problems by building prototypes with end-to-end AI/ML solutions on cloud and edge devices. His specialization includes generative AI, computer vision, natural language processing, time series forecasting, and personalization. In his current role, Chakravarthy helps customers across start-ups, enterprises, and ISVs to solve their business problems using AI and ML solutions across North America.
Amit Nandi is a Solutions and Enterprise Architect specializing in driving innovation across diverse industries, including financial, pharmaceutical, manufacturing, and retail. He is recognized for architecting and implementing groundbreaking business paradigms through the integration of big data technologies, real-time streaming, and cutting-edge ML and AI solutions. He built an ML/AI-powered cybersecurity platform and enabled MLOps for the research team of a large pharmaceutical company.
Join our community’s Discord space for discussions with the author and other readers:
https://packt.link/mlsah
Preface
Who this book is for
What this book covers
To get the most out of this book
Get in touch
Navigating the ML Lifecycle with ML Solutions Architecture
ML versus traditional software
ML lifecycle
Business problem understanding and ML problem framing
Data understanding and data preparation
Model training and evaluation
Model deployment
Model monitoring
Business metric tracking
ML challenges
ML solutions architecture
Business understanding and ML transformation
Identification and verification of ML techniques
System architecture design and implementation
ML platform workflow automation
Security and compliance
Summary
Exploring ML Business Use Cases
ML use cases in financial services
Capital market front office
Sales trading and research
Investment banking
Wealth management
Capital market back office operations
Net Asset Value review
Post-trade settlement failure prediction
Risk management and fraud
Anti-money laundering
Trade surveillance
Credit risk
Insurance
Insurance underwriting
Insurance claim management
ML use cases in media and entertainment
Content development and production
Content management and discovery
Content distribution and customer engagement
ML use cases in healthcare and life sciences
Medical imaging analysis
Drug discovery
Healthcare data management
ML use cases in manufacturing
Engineering and product design
Manufacturing operations – product quality and yield
Manufacturing operations – machine maintenance
ML use cases in retail
Product search and discovery
Targeted marketing
Sentiment analysis
Product demand forecasting
ML use cases in the automotive industry
Autonomous vehicles
Perception and localization
Decision and planning
Control
Advanced driver assistance systems (ADAS)
Summary
Exploring ML Algorithms
Technical requirements
How machines learn
Overview of ML algorithms
Consideration for choosing ML algorithms
Algorithms for classification and regression problems
Linear regression algorithms
Logistic regression algorithms
Decision tree algorithms
Random forest algorithm
Gradient boosting machine and XGBoost algorithms
K-nearest neighbor algorithm
Multi-layer perceptron (MLP) networks
Algorithms for clustering
Algorithms for time series analysis
ARIMA algorithm
DeepAR algorithm
Algorithms for recommendation
Collaborative filtering algorithm
Multi-armed bandit/contextual bandit algorithm
Algorithms for computer vision problems
Convolutional neural networks
ResNet
Algorithms for natural language processing (NLP) problems
Word2Vec
BERT
Generative AI algorithms
Generative adversarial network
Generative pre-trained transformer (GPT)
Large Language Model
Diffusion model
Hands-on exercise
Problem statement
Dataset description
Setting up a Jupyter Notebook environment
Running the exercise
Summary
Data Management for ML
Technical requirements
Data management considerations for ML
Data management architecture for ML
Data storage and management
AWS Lake Formation
Data ingestion
Kinesis Firehose
AWS Glue
AWS Lambda
Data cataloging
AWS Glue Data Catalog
Custom data catalog solution
Data processing
ML data versioning
S3 partitions
Versioned S3 buckets
Purpose-built data version tools
ML feature stores
Data serving for client consumption
Consumption via API
Consumption via data copy
Special databases for ML
Vector databases
Graph databases
Data pipelines
Authentication and authorization
Data governance
Data lineage
Other data governance measures
Hands-on exercise – data management for ML
Creating a data lake using Lake Formation
Creating a data ingestion pipeline
Creating a Glue Data Catalog
Discovering and querying data in the data lake
Creating an Amazon Glue ETL job to process data for ML
Building a data pipeline using Glue workflows
Summary
Exploring Open-Source ML Libraries
Technical requirements
Core features of open-source ML libraries
Understanding the scikit-learn ML library
Installing scikit-learn
Core components of scikit-learn
Understanding the Apache Spark ML library
Installing Spark ML
Core components of the Spark ML library
Understanding the TensorFlow deep learning library
Installing TensorFlow
Core components of TensorFlow
Hands-on exercise – training a TensorFlow model
Understanding the PyTorch deep learning library
Installing PyTorch
Core components of PyTorch
Hands-on exercise – building and training a PyTorch model
How to choose between TensorFlow and PyTorch
Summary
Kubernetes Container Orchestration Infrastructure Management
Technical requirements
Introduction to containers
Overview of Kubernetes and its core concepts
Namespaces
Pods
Deployment
Kubernetes Job
Kubernetes custom resources and operators
Services
Networking on Kubernetes
Security and access management
API authentication and authorization
Hands-on – creating a Kubernetes infrastructure on AWS
Problem statement
Lab instruction
Summary
Open-Source ML Platforms
Core components of an ML platform
Open-source technologies for building ML platforms
Implementing a data science environment
Building a model training environment
Registering models with a model registry
Serving models using model serving services
The Gunicorn and Flask inference engine
The TensorFlow Serving framework
The TorchServe serving framework
KFServing framework
Seldon Core
Triton Inference Server
Monitoring models in production
Managing ML features
Automating ML pipeline workflows
Apache Airflow
Kubeflow Pipelines
Designing an end-to-end ML platform
ML platform-based strategy
ML component-based strategy
Summary
Building a Data Science Environment Using AWS ML Services
Technical requirements
SageMaker overview
Data science environment architecture using SageMaker
Onboarding SageMaker users
Launching Studio applications
Preparing data
Preparing data interactively with SageMaker Data Wrangler
Preparing data at scale interactively
Processing data as separate jobs
Creating, storing, and sharing features
Training ML models
Tuning ML models
Deploying ML models for testing
Best practices for building a data science environment
Hands-on exercise – building a data science environment using AWS services
Problem statement
Dataset description
Lab instructions
Setting up SageMaker Studio
Launching a JupyterLab notebook
Training the BERT model in the Jupyter notebook
Training the BERT model with the SageMaker Training service
Deploying the model
Building ML models with SageMaker Canvas
Summary
Designing an Enterprise ML Architecture with AWS ML Services
Technical requirements
Key considerations for ML platforms
The personas of ML platforms and their requirements
ML platform builders
Platform users and operators
Common workflow of an ML initiative
Platform requirements for the different personas
Key requirements for an enterprise ML platform
Enterprise ML architecture pattern overview
Model training environment
Model training engine using SageMaker
Automation support
Model training lifecycle management
Model hosting environment
Inference engines
Authentication and security control
Monitoring and logging
Adopting MLOps for ML workflows
Components of the MLOps architecture
Monitoring and logging
Model training monitoring
Model endpoint monitoring
ML pipeline monitoring
Service provisioning management
Best practices in building and operating an ML platform
ML platform project execution best practices
ML platform design and implementation best practices
Platform use and operations best practices
Summary
Advanced ML Engineering
Technical requirements
Training large-scale models with distributed training
Distributed model training using data parallelism
Parameter server overview
AllReduce overview
Distributed model training using model parallelism
Naïve model parallelism overview
Tensor parallelism/tensor slicing overview
Implementing model-parallel training
Achieving low-latency model inference
How model inference works and opportunities for optimization
Hardware acceleration
Central processing units (CPUs)
Graphics processing units (GPUs)
Application-specific integrated circuit
Model optimization
Quantization
Pruning (also known as sparsity)
Graph and operator optimization
Graph optimization
Operator optimization
Model compilers
TensorFlow XLA
PyTorch Glow
Apache TVM
Amazon SageMaker Neo
Inference engine optimization
Inference batching
Enabling parallel serving sessions
Picking a communication protocol
Inference in large language models
Text Generation Inference (TGI)
DeepSpeed-Inference
FastTransformer
Hands-on lab – running distributed model training with PyTorch
Problem statement
Dataset description
Modifying the training script
Modifying and running the launcher notebook
Summary
Building ML Solutions with AWS AI Services
Technical requirements
What are AI services?
Overview of AWS AI services
Amazon Comprehend
Amazon Textract
Amazon Rekognition
Amazon Transcribe
Amazon Personalize
Amazon Lex V2
Amazon Kendra
Amazon Q
Evaluating AWS AI services for ML use cases
Building intelligent solutions with AI services
Automating loan document verification and data extraction
Loan document classification workflow
Loan data processing flow
Media processing and analysis workflow
E-commerce product recommendation
Customer self-service automation with intelligent search
Designing an MLOps architecture for AI services
AWS account setup strategy for AI services and MLOps
Code promotion across environments
Monitoring operational metrics for AI services
Hands-on lab – running ML tasks using AI services
Summary
AI Risk Management
Understanding AI risk scenarios
The regulatory landscape around AI risk management
Understanding AI risk management
Governance oversight principles
AI risk management framework
Applying risk management across the AI lifecycle
Business problem identification and definition
Data acquisition and management
Risk considerations
Risk mitigations
Experimentation and model development
Risk considerations
Risk mitigations
AI system deployment and operations
Risk considerations
Risk mitigations
Designing ML platforms with governance and risk management considerations
Data and model documentation
Lineage and reproducibility
Observability and auditing
Scalability and performance
Data quality
Summary
Bias, Explainability, Privacy, and Adversarial Attacks
Understanding bias
Understanding ML explainability
LIME
SHAP
Understanding security and privacy-preserving ML
Differential privacy
Understanding adversarial attacks
Evasion attacks
PGD attacks
HopSkipJump attacks
Data poisoning attacks
Clean-label backdoor attack
Model extraction attack
Attacks against generative AI models
Defense against adversarial attacks
Robustness-based methods
Detector-based method
Open-source tools for adversarial attacks and defenses
Hands-on lab – detecting bias, explaining models, training privacy-preserving models, and simulating adversarial attacks
Problem statement
Detecting bias in the training dataset
Explaining feature importance for a trained model
Training privacy-preserving models
Simulate a clean-label backdoor attack
Summary
Charting the Course of Your ML Journey
ML adoption stages
Exploring AI/ML
Disjointed AI/ML
Integrated AI/ML
Advanced AI/ML
AI/ML maturity and assessment
Technical maturity
Business maturity
Governance maturity
Organization and talent maturity
Maturity assessment and improvement process
AI/ML operating models
Centralized model
Decentralized model
Hub and spoke model
Solving ML journey challenges
Developing the AI vision and strategy
Getting started with the first AI/ML initiative
Solving scaling challenges with AI/ML adoption
Solving ML use case scaling challenges
Solving technology scaling challenges
Solving governance scaling challenges
Summary
Navigating the Generative AI Project Lifecycle
The advancement and economic impact of generative AI
What industries are doing with generative AI
Financial services
Healthcare and life sciences
Media and entertainment
Automotive and manufacturing
The lifecycle of a generative AI project and the core technologies
Business use case selection
FM selection and evaluation
Initial screening via manual assessment
Automated model evaluation
Human evaluation
Assessing AI risks for FMs
Other evaluation consideration
Building FMs from scratch via pre-training
Adaptation and customization
Domain adaptation pre-training
Fine-tuning
Reinforcement learning from human feedback
Prompt engineering
Model management and deployment
The limitations, risks, and challenges of adopting generative AI
Summary
Designing Generative AI Platforms and Solutions
Operational considerations for generative AI platforms and solutions
New generative AI workflow and processes
New technology components
New roles
Exploring generative AI platforms
The prompt management component
FM benchmark workbench
Supervised fine-tuning and RLHF
FM monitoring
The retrieval-augmented generation pattern
Open-source frameworks for RAG
LangChain
LlamaIndex
Evaluating a RAG pipeline
Advanced RAG patterns
Designing a RAG architecture on AWS
Choosing an LLM adaptation method
Response quality
Cost of the adaptation
Implementation complexity
Bringing it all together
Considerations for deploying generative AI applications in production
Model readiness
Decision-making workflow
Responsible AI assessment
Guardrails in production environments
External knowledge change management
Practical generative AI business solutions
Generative AI-powered semantic search engine
Financial data analysis and research workflow
Clinical trial recruiting workflow
Media entertainment content creation workflow
Car design workflow
Contact center customer service operation
Are we close to having artificial general intelligence?
The symbolic approach
The connectionist/neural network approach
The neural-symbolic approach
Summary
Other Books You May Enjoy
Index
Once you’ve read The Machine Learning Solutions Architect Handbook, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there, you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:
https://packt.link/free-ebook/9781805122500
Submit your proof of purchase.
That’s it! We’ll send your free PDF and other benefits to your email directly.
The field of artificial intelligence (AI) and machine learning (ML) has had a long history. Over the last 70+ years, ML has evolved from checker game-playing computer programs in the 1950s to advanced AI capable of beating the human world champion in the game of Go. More recently, Generative AI (GenAI) technology such as ChatGPT has been taking the industry by storm, generating huge interest among company executives and consumers alike, promising new ways to transform businesses such as drug discovery, new media content, financial report analysis, and consumer product design. Along the way, the technology infrastructure for ML has also evolved from a single machine/server for small experiments and models to highly complex end-to-end ML platforms capable of training, managing, and deploying tens of thousands of ML models. The hyper-growth in the AI/ML field has resulted in the creation of many new professional roles, such as MLOps engineering, AI/ML product management, ML software engineering, AI risk manager, and AI strategist across a range of industries.
Machine learning solutions architecture (ML solutions architecture) is another relatively new discipline that is playing an increasingly critical role in the full end-to-end ML lifecycle as ML projects become increasingly complex in terms of business impact, science sophistication, and the technology landscape.
This chapter will help you understand where ML solutions architecture fits in the full data science lifecycle. We will discuss the different steps it will take to get an ML project from the ideation stage to production and the challenges faced by organizations, such as use case identification, data quality issues, and shortage of ML talent when implementing an ML initiative. Finally, we will finish the chapter by briefly discussing the core focus areas of ML solutions architecture, including system architecture, workflow automation, and security and compliance.
In this chapter, we are going to cover the following main topics:
ML versus traditional software
The ML lifecycle and its key challenges
What is ML solutions architecture, and where does it fit in the overall lifecycle?
Upon completing this chapter, you will understand the role of an ML solutions architect and what business and technology areas you need to focus on to support end-to-end ML initiatives. The intent of this chapter is to offer a fundamental introduction to the ML lifecycle for those in the early stages of their exploration in the field. Experienced ML practitioners may wish to skip this foundational overview and proceed directly to more advanced content.
The more advanced material begins in Chapter 4. However, many technical practitioners may still find Chapter 2 helpful, as they often need a better business understanding of where ML can be applied across different businesses and workflows. Chapter 3 could also prove beneficial: it introduces ML algorithms for those new to the topic and can serve as a refresher for those who practice these concepts regularly.
Before I started working in the field of AI/ML, I spent many years building computer software platforms for large financial services institutions. Some of the business problems I worked on had complex rules, such as identifying companies for comparable analysis for investment banking deals or creating a master database for all the different companies’ identifiers from the different data providers. We had to implement hardcoded rules in database-stored procedures and application server backends to solve these problems. We often debated if certain rules made sense or not for the business problems we tried to solve.
As rules changed, we had to reimplement the rules and make sure the changes did not break anything. To test new releases or changes, we often relied on human experts to exhaustively test and validate all the business logic implemented before the production release. It was a very time-consuming and error-prone process and required a significant amount of engineering, testing against the documented specification, and rigorous change management for deployment every time new rules were introduced or existing rules needed to be changed. We often relied on users to report business logic issues in production, and when an issue was reported in production, we sometimes had to open up the source code to troubleshoot or explain the logic of how it worked. I remember I often asked myself if there were better ways to do this.
After I started working in the field of AI/ML, I began to solve many similar challenges using ML techniques. With ML, I did not need to come up with complex decision-making rules, which often require deep data and domain expertise to create and maintain. Instead, I focused on collecting high-quality data and used ML algorithms to learn the rules and patterns from the data directly. This new approach eliminated many of the challenging aspects of creating new rules (for example, the need for deep domain expertise or the need to avoid human bias) and of maintaining existing rules. To validate the model before the production release, we could examine model performance metrics such as accuracy. While it still required data science expertise to interpret the model metrics against the nature of the business problems and dataset, it did not require exhaustive manual testing of all the different scenarios. When a model was deployed into production, we would monitor whether it performed as expected by watching for any significant changes in production data versus the data we had collected for model training. We would also collect new data and labels from production and test the model performance periodically to ensure that its predictive accuracy remained robust when faced with previously unseen production data. To explain why a model made a decision the way it did, we did not need to open up the source code to re-examine the hardcoded logic. Instead, we would rely on ML techniques to help explain the relative importance of different input features to understand what factors were most influential in the decision-making by the ML models.
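As a rough illustration of the production monitoring idea described above, the following minimal sketch compares one numeric feature in production against the same feature in the training data. The function names, the sample values, and the threshold of 2.0 standard deviations are all assumptions made for this example; real monitoring setups typically use more robust statistical tests and track many features.

```python
from statistics import mean, stdev


def mean_shift_in_std_units(training_values, production_values):
    """Return how far the production mean has moved from the training mean,
    measured in units of the training standard deviation."""
    train_mean = mean(training_values)
    train_std = stdev(training_values)
    if train_std == 0:
        raise ValueError("training values have no variation")
    return abs(mean(production_values) - train_mean) / train_std


def has_drifted(training_values, production_values, threshold=2.0):
    """Flag drift when the mean shift exceeds the threshold.
    The default threshold is an illustrative assumption, not a standard."""
    return mean_shift_in_std_units(training_values, production_values) > threshold


# Hypothetical transaction amounts seen during training and in production.
training_amounts = [20.0, 22.0, 19.0, 21.0, 18.0, 20.0]
similar_production = [21.0, 19.0, 20.0, 22.0]
shifted_production = [40.0, 42.0, 39.0, 41.0]

print(has_drifted(training_amounts, similar_production))
print(has_drifted(training_amounts, shifted_production))
```

A check like this only signals that the input data looks different from the training data. Whether the model's accuracy has actually degraded still has to be confirmed with newly labeled production data, as described above.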
The following figure shows a graphical view of the process differences between developing a piece of software and training an ML model:
Figure 1.1: ML and computer software
Now that you know the difference between ML and traditional software, it is time to dive deep into understanding the different stages in an ML lifecycle.
One of the early ML projects that I worked on was a fascinating yet daunting sports predictive analytics problem for a major league brand. I was given a list of predictive analytics outcomes to think about to see if there were ML solutions for the problems. I was a casual viewer of the sport; I didn’t know anything about the analytics to be generated, nor the rules of the games in the detail that was needed. I was provided with some sample data but had no idea what to do with it.
The first thing I started to work on was an immersion in the sport itself. I delved into the intricacies of the game, studying the different player positions and events that make up each game and play. Only after being armed with the newfound domain knowledge did the data start to make sense. Together with the stakeholder, we evaluated the impact of the different analytics outcomes and assessed the modeling feasibility based on the data we had. With a clear understanding of the data, we came up with a couple of top ML analytics with the most business impact to focus on. We also decided how they would be integrated into the existing business workflow, and how they would be measured on their impacts.
Subsequently, I delved deeper into the data to ascertain what information was available and what was lacking. The raw dataset had a lot of irrelevant data points that needed to be removed while the relevant data points needed to be transformed to provide the strongest signals for model training. I processed and prepared the dataset based on a few of the ML algorithms I had considered and conducted experiments to determine the best approach. I lacked a tool to track the different experiment results, so I had to document what I had done manually. After some initial rounds of experimentation, it became evident that the existing data was not sufficient to train a high-performance model. Hence, I decided to build a custom deep learning model to incorporate data of different modalities as the data points had temporal dependencies and required additional spatial information for the modeling. The data owner was able to provide the additional datasets I required, and after more experiments with custom algorithms and significant data preparations and feature engineering, I eventually trained a model that met the business objectives.
After completing the model, another hard challenge began – deploying and operationalizing the model in production and integrating it into the existing business workflow and system architecture. We engaged in many architecture and engineering discussions and eventually built out a deployment architecture for the model.
As you can see from my personal experience, the journey from business idea to ML production deployment involved many steps. A typical lifecycle of an ML project follows a formal structure, which includes several essential stages like business understanding, data acquisition and understanding, data preparation, model building, model evaluation, and model deployment. Since a big component of the lifecycle is experimentation with different datasets, features, and algorithms, the whole process is highly iterative. Furthermore, it is essential to note that there is no guarantee of a successful outcome. Factors such as the availability and quality of data, feature engineering techniques (the process of using domain knowledge to extract useful features from raw data), and the capability of the learning algorithms, among others, can all affect the final results.
Figure 1.2: ML lifecycle
The preceding figure illustrates the key steps in ML projects, and in the subsequent sections, we will delve into each of these steps in greater detail.
The first stage in the lifecycle is business understanding. This stage involves understanding the business goals and defining business metrics that can measure the project’s success. The following are some examples of business goals:
Cost reduction for operational processes, such as document processing.
Mitigation of business or operational risks, such as fraud and compliance.
Product or service revenue improvements, such as better target marketing, new insight generation for better decision making, and increased customer satisfaction.
To measure success, you may use specific business metrics such as the number of hours saved in a business process, an increase in the number of true positive frauds detected, a conversion rate improvement from target marketing, or a reduction in the churn rate. This is an essential step to get right to ensure there is sufficient justification for an ML project and that the outcome of the project can be successfully measured.
After you have defined the business goals and business metrics, you need to evaluate if there is an ML solution for the business problem. While ML has a wide scope of applications, it is not always an optimal solution for every business problem.
The saying that “data is the new oil” holds particularly true for ML. Without the required data, you cannot move forward with an ML project. That’s why the next step in the ML lifecycle is data acquisition, understanding, and preparation.
Based on the business problems and ML approach, you will need to gather and comprehend the available data to determine if you have the right data and data volume to solve the ML problem. For example, suppose the business problem to address is credit card fraud detection. In that case, you will need datasets such as historical credit card transaction data, customer demographics, account data, device usage data, and networking access data. Detailed data analysis is then necessary to determine if the dataset features and quality are sufficient for the modeling tasks. You also need to decide if the data needs labeling, such as fraud or not-fraud. During this step, depending on the data quality, a significant amount of data wrangling might be performed to prepare and clean the data and to generate the datasets for model training and model evaluation.
Using the training and validation datasets established, a data scientist must run a number of experiments using different ML algorithms and dataset features for feature selection and model development. This is a highly iterative process and could require numerous runs of data processing and model development to find the right algorithm and dataset combination for optimal model performance. In addition to model performance, factors such as data bias and model explainability may need to be considered to comply with internal or regulatory requirements.
Prior to deployment into production, the model quality must be validated using the relevant technical metrics, such as the accuracy score. This is usually accomplished using a holdout dataset, also known as a test dataset, to gauge how the model performs on unseen data. It is crucial to understand which metrics are appropriate for model validation, as they vary depending on the ML problems and the dataset used. For example, model accuracy would be a suitable validation metric for a document classification use case if the number of document types is relatively balanced. However, model accuracy would not be a good metric to evaluate the model performance for a fraud detection use case – this is because the number of frauds is small and even if the model predicts not-fraud all the time, the model accuracy could still be very high.
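To make the fraud example concrete, the following minimal sketch compares accuracy with recall for a model that always predicts not-fraud. The transaction counts are hypothetical and chosen only for illustration; recall is used here as one possible alternative metric, not as the only suitable one.

```python
# Hypothetical, illustrative data: 10,000 transactions, 50 of them fraudulent.
labels = [1] * 50 + [0] * 9950  # 1 = fraud, 0 = not-fraud


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual fraud cases that the model flagged as fraud."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_positives = sum(y_true)
    return true_positives / actual_positives


# A "model" that predicts not-fraud for every transaction.
always_not_fraud = [0] * len(labels)

acc = accuracy(labels, always_not_fraud)  # 0.995
rec = recall(labels, always_not_fraud)    # 0.0

print(f"accuracy={acc:.3f}, recall={rec:.3f}")
```

The accuracy looks excellent even though the model never detects a single fraud, which is why metrics that focus on the minority class tend to be more informative for heavily imbalanced problems.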
After the model is fully trained and validated to meet the expected performance metric, it can be deployed into production and the business workflow. There are two main deployment concepts here. The first involves the deployment of the model itself to be used by a client application to generate predictions. The second concept is to integrate this prediction workflow into a business workflow application. For example, deploying the credit fraud model could mean either hosting the model behind an API for real-time prediction or packaging it so that it can be loaded dynamically to support batch predictions. Moreover, this prediction workflow also needs to be integrated into business workflow applications for fraud detection, which might include fraud detection for real-time transactions, decision automation based on prediction output, and detailed fraud analytics.
The ML lifecycle does not end with model deployment. Unlike software, whose behavior is highly deterministic since developers explicitly code its logic, an ML model could behave differently in production from its behavior in model training and validation. This could be caused by changes in the production data characteristics, data distribution, or the potential manipulation of request data. Therefore, model monitoring is an important post-deployment step for detecting model performance degradation (a.k.a. model drift) or dataset distribution change in the production environment (a.k.a. data drift).
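One way to quantify data drift for a single numeric feature is the Population Stability Index (PSI), which compares the binned distribution of a reference (training) sample with a production sample. The sketch below is only an illustration under stated assumptions: the function name, the equal-width binning, the `eps` smoothing value, and the simulated transaction amounts are all choices made for this example, and PSI is one of several possible drift measures.

```python
import math
import random


def population_stability_index(reference, production, n_bins=10, eps=1e-6):
    """PSI between two numeric samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample must contain more than one distinct value")
    width = (hi - lo) / n_bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped into the first or last bin.
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    prod = bin_fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


# Simulated (hypothetical) transaction amounts.
rng = random.Random(0)
training_amounts = [rng.gauss(100, 20) for _ in range(5000)]
same_distribution = [rng.gauss(100, 20) for _ in range(5000)]
shifted_distribution = [rng.gauss(130, 20) for _ in range(5000)]

psi_stable = population_stability_index(training_amounts, same_distribution)
psi_shifted = population_stability_index(training_amounts, shifted_distribution)
print(f"PSI (no drift): {psi_stable:.4f}, PSI (shifted mean): {psi_shifted:.4f}")
```

A commonly cited rule of thumb treats PSI values below 0.1 as stable and values above 0.25 as a significant shift, but any alerting threshold should be validated against the specific feature and use case.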
The actual business impact should be tracked and measured as an ongoing process to ensure the model delivers the expected business benefits. This may involve comparing the business metrics before and after the model deployment, or A/B testing where a business metric is compared between workflows with or without the ML model. If the model does not deliver the expected benefits, it should be re-evaluated for improvement opportunities. This could also mean framing the business problem as a different ML problem. For example, if churn prediction does not help improve customer satisfaction, then consider a personalized product/service offering to solve the problem.
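As a rough illustration of the A/B testing idea, the sketch below compares a conversion rate between a workflow without the ML model (control) and one with it (treatment) using a two-proportion z-test. The conversion counts are hypothetical, and the z-test is just one reasonable statistical choice for this kind of comparison.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return the z statistic and two-sided p-value for the difference in two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical results: 10,000 customers in each workflow.
control_conversions, control_size = 400, 10_000      # workflow without the ML model
treatment_conversions, treatment_size = 480, 10_000  # workflow with the ML model

z, p_value = two_proportion_z_test(
    control_conversions, control_size, treatment_conversions, treatment_size
)
print(f"z={z:.2f}, p-value={p_value:.4f}")
```

With these illustrative numbers, the lift from 4.0% to 4.8% is unlikely to be explained by chance alone. In practice, the sample size and the business metric being compared should be agreed upon before the experiment starts.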
Over the years, I have worked on many real-world problems using ML solutions and encountered different challenges faced by different industries during ML adoptions.
I often get the same question when working on ML projects: We have a lot of data – can you help us figure out what insights we can generate using ML? I refer to companies with this question as having a business use case challenge. Not being able to identify business use cases for ML is a very big hurdle for many companies. Without a properly identified business problem and its value proposition and benefit, it becomes difficult to initiate an ML project.
In my conversations with companies across different industries, data-related challenges emerge as a frequent issue. These include data quality, data inventory, data accessibility, data governance, and data availability. This problem affects both data-poor and data-rich companies and is often exacerbated by data silos, data security, and industry regulations.
The shortage of data science and ML talent is another major challenge I have heard from many companies. Companies, in general, are having a tough time attracting and retaining top ML talent, which is a common problem across all industries. As ML platforms become more complex and the scope of ML projects increases, the need for other ML-related functions starts to surface. Nowadays, in addition to data scientists, an organization also needs functional roles for ML product management, ML infrastructure engineering, and ML operations management.
Based on my experiences, I have observed that cultural acceptance of ML-based solutions is another significant challenge for broad adoption. There are individuals who perceive ML as a threat to their job functions, and their lack of knowledge in ML makes them hesitant to adopt these new methods in their business workflows.
The practice of ML solutions architecture aims to help solve some of the challenges in ML. In the next section, we will explore ML solutions architecture and its role in the ML lifecycle.
When I initially worked with companies as an ML solutions architect, the landscape was quite different from what it is now. The focus was mainly on data science and modeling, and the problems at hand were small in scope. Back then, most of the problems could be solved using simple ML techniques. The datasets were small, and the infrastructure required was not too demanding. The scope of the ML initiative at these companies was limited to a few data scientists or teams. As an ML architect at that time, I primarily needed to have solid data science skills and general cloud architecture knowledge to get the job done.
In more recent years, the landscape of ML initiatives has become more intricate and multifaceted, necessitating involvement from a broader range of functions and personas at companies. My engagement has expanded to include discussions with business executives about ML strategies and organizational design to facilitate the broad adoption of AI/ML throughout their enterprises. I have been tasked with designing more complex ML platforms, utilizing a diverse range of technologies for large enterprises to meet stringent security and compliance requirements. ML workflow orchestration and operations have become increasingly crucial topics of discussion, and more and more companies are looking to train large ML models with enormous amounts of training data. The number of ML models trained and deployed by some companies has skyrocketed to tens of thousands from a few dozen models in just a few years. Furthermore, sophisticated and security-sensitive customers have sought guidance on topics such as ML privacy, model explainability, and data and model bias. As an ML solutions architect, I’ve noticed that the skills and knowledge required to be successful in this role have evolved significantly.
Trying to navigate the complexities of a business, data, science, and technology landscape can be a daunting task. As an ML solutions architect, I have seen firsthand the challenges that companies face in bringing all these pieces together. In my view, ML solutions architecture is an essential discipline that serves as a bridge connecting the different components of an ML initiative. Drawing on my years of experience working with companies of all sizes and across diverse industries, I believe that an ML solutions architect plays a pivotal role in identifying business needs, developing ML solutions to address these needs, and designing the technology platforms necessary to run these solutions. By collaborating with various business and technology partners, an ML solutions architect can help companies unlock the full potential of their data and realize tangible benefits from their ML initiatives.
The following figure illustrates the core functional areas covered by the ML solutions architecture:
Figure 1.3: ML solutions architecture coverage
In the following sections, we will explore each of these areas in greater detail:
Business understanding: Business problem understanding and transformation using AI and ML.
Identification and verification of ML techniques: Identification and verification of ML techniques for solving specific ML problems.
System architecture of the ML technology platform: System architecture design and implementation of the ML technology platforms.
MLOps: ML platform automation technical design.
Security and compliance: Security, compliance, and audit considerations for the ML platform and ML models.
So, let’s dive in!
The goal of the business workflow analysis is to identify inefficiencies in the workflows and determine if ML can be applied to help eliminate pain points, improve efficiency, or even create new revenue opportunities.
Picture this: you are tasked with improving a call center’s operations. You know there are inefficiencies that need to be addressed, but you’re not sure where to start. That’s where business workflow analysis comes in. By analyzing the call center’s workflows, you can identify pain points such as long customer wait times, knowledge gaps among agents, and the inability to extract customer insights from call recordings. Once you have identified these issues, you can determine what data is available and which business metrics need to be improved. This is where ML comes in. You can use ML to create virtual assistants for common customer inquiries, transcribe audio recordings to allow for text analysis, and detect customer intent for product cross-sell and up-sell. But sometimes, you need to modify the business process to incorporate ML solutions. For example, if you want to use call recording analytics to generate insights for cross-selling or up-selling products, but there’s no established process to act on those insights, you may need to introduce an automated target marketing process or a proactive outreach process by the sales team.
Once you have come up with a list of ML options, the next step is to determine if the assumption behind the ML approach is valid. This could involve a simple proof-of-concept (POC) modeling exercise to validate the available dataset and modeling approach, a technology POC using pre-built AI services, or testing of ML frameworks. For example, you might want to test the feasibility of text transcription from audio files using an existing text transcription service or build a customer propensity model for a new product conversion from a marketing campaign.
It is worth noting that ML solutions architecture does not focus on developing new machine learning algorithms, a job best suited for applied data scientists or research data scientists. Instead, ML solutions architecture focuses on identifying and applying ML algorithms to address a range of ML problems such as predictive analytics, computer vision, or natural language processing. Also, the goal of any modeling task here is not to build production-quality models but rather to validate the approach for further experimentation by full-time applied data scientists.
The most important aspect of the ML solutions architect’s role is the technical architecture design of the ML platform. The platform will need to provide the technical capability to support the different phases of the ML cycle and personas, such as data scientists and operations engineers. Specifically, an ML platform needs to have the following core functions:
Data exploration and experimentation: Data scientists use ML platforms for data exploration, experimentation, model building, and model evaluation. ML platforms need to provide capabilities such as data science development tools for model authoring and experimentation, data wrangling tools for data exploration and wrangling, source code control for code management, and a package repository for library package management.
Data management and large-scale data processing: Data scientists or data engineers will need the technical capability to ingest, store, access, and process large amounts of data for cleansing, transformation, and feature engineering.
Model training infrastructure management: ML platforms will need to provide model training infrastructure for different model training needs using different types of computing resources, storage, and networking configurations. The infrastructure also needs to support different types of ML libraries or frameworks, such as scikit-learn, TensorFlow, and PyTorch.
Model hosting/serving: ML platforms will need to provide the technical capability to host and serve the model for generating predictions in real time, in batch, or both.
Model management: Trained ML models will need to be managed and tracked for easy access and lookup, with relevant metadata.
Feature management: Common and reusable features will need to be managed and served for model training and model serving purposes.
A key aspect of ML platform design is workflow automation and continuous integration/continuous deployment (CI/CD), also known as MLOps. ML is a multi-step workflow that needs to be automated, including data processing, model training, model validation, and model hosting. Infrastructure provisioning automation and self-service is another aspect of automation design. Key components of workflow automation include the following:
Pipeline design and management: The ability to create different automation pipelines for various tasks, such as model training and model hosting.
Pipeline execution and monitoring: The ability to run different pipelines and monitor the execution status of the entire pipeline and of each step in the ML cycle, such as data processing and model training.
Model monitoring configuration: The ability to monitor the model in production for various metrics, such as data drift (where the distribution of data used in production deviates from the distribution of data used for model training), model drift (where the performance of the model degrades in production compared with training results), and bias detection (where the ML model replicates or amplifies bias towards certain individuals).
Another important aspect of ML solutions architecture is the security and compliance consideration in a sensitive or enterprise setting:
Authentication and authorization: The ML platform needs to provide authentication and authorization mechanisms to manage access to the platform and different resources and services.
Network security: The ML platform needs to be configured for different network security controls such as a firewall and an IP address access allowlist to prevent unauthorized access.
Data encryption: For security-sensitive organizations, data encryption is another important aspect of the design consideration for the ML platform.
Audit and compliance: Audit and compliance staff need the information to help them understand how decisions are made by the predictive models if required, the lineage of a model from data to model artifacts, and any bias exhibited in the data and model. The ML platform will need to provide model explainability, bias detection, and model traceability across the various datastore and service components, among other capabilities.
Various industry technology providers have established best practices to guide the design and implementation of ML infrastructure, which is part of the ML solutions architect’s practices. Amazon Web Services, for example, created Machine Learning Lens to provide architectural best practices across crucial domains like operational excellence, security, reliability, performance, cost optimization, and sustainability. Following these published guidelines can help practitioners implement robust and effective ML solutions.
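To illustrate the workflow automation ideas discussed in this section, namely pipeline design, pipeline execution, and per-step status monitoring, here is a minimal sketch in plain Python. The `Pipeline` class and the toy step functions are hypothetical names invented for this example; a real platform would typically rely on a dedicated workflow orchestration tool rather than code like this.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """A toy automation pipeline that runs steps in order and tracks their status."""
    name: str
    steps: list = field(default_factory=list)   # (step_name, function) pairs
    status: dict = field(default_factory=dict)  # step_name -> pending/succeeded/failed

    def add_step(self, step_name, func):
        self.steps.append((step_name, func))
        self.status[step_name] = "pending"
        return self  # allow chained calls

    def run(self, payload):
        for step_name, func in self.steps:
            try:
                payload = func(payload)
                self.status[step_name] = "succeeded"
            except Exception:
                self.status[step_name] = "failed"
                break  # later steps keep the "pending" status
        return payload


def process_data(raw):
    # Toy data processing: drop records with missing values.
    return [value for value in raw if value is not None]


def train_model(rows):
    # Toy "model": the mean of the training values.
    return {"rows": rows, "model": sum(rows) / len(rows)}


def validate_model(state):
    # Toy validation gate: fail the pipeline if the model is outside an expected range.
    if not 0 <= state["model"] <= 10:
        raise ValueError("model failed validation")
    return state


pipeline = (
    Pipeline("training")
    .add_step("data_processing", process_data)
    .add_step("model_training", train_model)
    .add_step("model_validation", validate_model)
)
result = pipeline.run([1, None, 3])
print(pipeline.status, result["model"])
```

Tracking the status of each step separately is what makes it possible to monitor both the overall pipeline and the individual stages of the ML cycle.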
In this chapter, I have shared some of my personal experience as an ML solutions architect and provided an overview of core concepts and components involved in the ML lifecycle. We discussed the key responsibilities of the ML solutions architect role throughout the lifecycle. This chapter aimed to give you an understanding of the technical and business domains required to work effectively as an ML solutions architect. With this foundational knowledge, you should now have an appreciation for the breadth of this role and its integral part in delivering successful ML solutions.
In the upcoming chapter, we will dive into various ML use cases across different industries, such as financial services and media and entertainment, to gain further insights into the practical applications of ML.
As an ML practitioner, it is essential for me to develop a deep understanding of different businesses to have effective conversations with business and technology leaders. This should not come as a surprise since the ultimate goal of any ML solutions architecture is to solve practical business problems with science and technology solutions. Therefore, one of the main areas of focus for ML solutions architecture is to develop a broad understanding of different business domains, workflows, and relevant data. Without this understanding, it would be challenging to make sense of the data and design and develop practical ML solutions for business problems.
In this chapter, we will explore various real-world ML use cases across several industry verticals. We will examine the key business workflows and challenges faced by industries such as financial services and retail, and how ML technologies can help solve these challenges. The aim of this chapter is not to make you an expert in any particular industry or its ML use cases and techniques but rather to expose you to real-world ML use cases in business contexts and workflows. After reading this chapter, you will be equipped to apply similar analytical thinking regarding ML solutions in your own line of business. You will gain perspective on identifying and evaluating where ML technology can provide value in your workflows, processes, and objectives. The cross-industry examples and scenarios are intended to spark ideas for how ML could address your unique business challenges, and broaden your thinking about ML opportunities.
Specifically, we will cover the following topics in this chapter:
ML use cases in financial services
ML use cases in media and entertainment
ML use cases in healthcare and life sciences
ML use cases in manufacturing
ML use cases in retail
ML use cases in the automotive industry
If you already have extensive experience as an ML practitioner with an in-depth understanding of your industry’s use cases and solutions, and you are not interested in learning about other industries, you may wish to skip this chapter and proceed directly to the next chapter where we introduce ML algorithms.
The Financial Services Industry (FSI) has always been at the forefront of technological innovation, and ML adoption is no exception. In recent years, we have seen a range of ML solutions being implemented across different business functions within financial services. For example, in capital markets, ML is being used across front, middle, and back offices to aid investment decisions, trade optimization, risk management, and transaction settlement processing. In insurance, companies are using ML to streamline underwriting, prevent fraud, and automate claim management. In banking, ML is being used to improve customer experience, combat fraud, and facilitate loan approval decisions. In the following sections, we will explore different core business areas within financial services and how ML can be applied to overcome some of the business challenges in these areas.
In finance, the front office is the revenue-generating