Machine Learning Upgrade: A Data Scientist's Guide to MLOps, LLMs, and ML Infrastructure

Kristen Kehrer

Description

A much-needed guide to implementing new technology in workspaces

From experts in the field comes Machine Learning Upgrade: A Data Scientist's Guide to MLOps, LLMs, and ML Infrastructure, a book that provides data scientists and managers with best practices at the intersection of management, large language models (LLMs), machine learning, and data science. This groundbreaking book will change the way you view the data science pipeline. The authors provide an introduction to modern machine learning, showing you how it can be viewed as a holistic, end-to-end system—not just a shiny new gadget in an otherwise unchanged operational structure. By adopting a data-centric view of the world, you can begin to see unstructured data and LLMs as the foundation upon which you can build countless applications and business solutions. This book explores a whole world of decision making that hasn't been codified yet, enabling you to forge the future using emerging best practices.

  • Gain an understanding of the intersection between large language models and unstructured data
  • Follow the process of building an LLM-powered application while leveraging MLOps techniques such as data versioning and experiment tracking
  • Discover best practices for training, fine-tuning, and evaluating LLMs
  • Integrate LLM applications within larger systems, monitor their performance, and retrain them on new data

This book is indispensable for data professionals and business leaders looking to understand LLMs and the entire data science pipeline.

Page count: 225

Publication year: 2024




Table of Contents

Cover

Table of Contents

Title Page

Introduction

Acknowledgments

About the Authors

About the Technical Editor

Chapter 1: A Gentle Introduction to Modern Machine Learning

Data Science Is Diverging from Business Intelligence

From CRISP-DM to Modern, Multicomponent ML Systems

The Emergence of LLMs Has Increased ML's Power and Complexity

What You Can Expect from This Book

Chapter 2: An End-to-End Approach

Components of a YouTube Search Agent

Principles of a Production Machine Learning System

Chapter 3: A Data-Centric View

The Emergence of Foundation Models

The Role of Off-the-Shelf Components

The Data-Driven Approach

A Note on Data Ethics

Building the Dataset

Knowing “Just Enough” Engineering

Notes

Chapter 4: Standing Up Your LLM

Selecting Your LLM

Experiment Management with LLMs

LLM Inference

Optimizing LLM Inference with Experiment Management

Fine-Tuning LLMs

Wrapping Things Up

Notes

Chapter 5: Putting Together an Application

Prototyping with Gradio

Creating Graphics with Plotnine

Deploying Models as APIs

Monitoring an LLM

Wrapping Things Up

Chapter 6: Rounding Out the ML Life Cycle

Deploying a Simple Random Forest Model

An Introduction to Model Monitoring

Model Monitoring with Evidently AI

Building a Model Monitoring System

Final Thoughts on Monitoring

Chapter 7: Review of Best Practices

Step 1: Understand the Problem

Step 2: Model Selection and Training

Step 3: Deploy and Maintain

Step 4: Collaborate and Communicate

Emerging Trends in LLMs

Next Steps in Learning

Appendix: Additional LLM Example

Index

Copyright

End User License Agreement

List of Illustrations

Chapter 2

Figure 2.1 YouTube search query

Figure 2.2 Components of a YouTube search

Chapter 3

Figure 3.1 Vector database options in Azure

Figure 3.2 Scalability concerns

Figure 3.3 Functionality

Figure 3.4 Features beyond vector search

Figure 3.5 Click Content Creation

Figure 3.6 Creating a project

Figure 3.7 machine-learning-upgrade cluster

Figure 3.8 Connecting youtube

Figure 3.9 Data Preview tab

Figure 3.10 Load Collection button to preview data

Figure 3.11 Viewing YouTube data with Zilliz

Figure 3.12 Data lineage model

Figure 3.13 Comet dashboard

Figure 3.14 Data stored as an artifact

Figure 3.15 Data artifact as JSON

Figure 3.16 JSON for data artifact

Chapter 4

Figure 4.1 Experiment dashboard

Figure 4.2 Top 10 probable next tokens

Figure 4.3 Top 10 probable next tokens based on math problem

Figure 4.4 Top 10 probable next tokens based on context

Figure 4.5 “ReAct: Synergizing Reasoning and Acting in Language Models” illu...

Figure 4.6 Tracking training runs with Comet

Chapter 5

Figure 5.1 Prototype using Gradio

Figure 5.2 Gradio app with more features

Figure 5.3 Adding the author selector

Figure 5.4 Adding tabs

Figure 5.5 Changing button color

Figure 5.6 Adding a download button

Figure 5.7 Deployed model

Chapter 6

Figure 6.1 Model monitoring

Figure 6.2 Evidently: Built-in metrics

Chapter 7

Figure 7.1 Google trends for LLM

Figure 7.2 Most popular LLM model downloads


MACHINE LEARNING UPGRADE

A Data Scientist’s Guide to MLOps, LLMs, and ML Infrastructure

KRISTEN KEHRER AND CALEB KAISER


Introduction

Welcome to a journey through the dynamic world of modern machine learning (ML)! In this book, we'll guide you from the data scientist's role, with its historical roots in business intelligence, to the forefront of today's cutting-edge, multicomponent systems. You can find a GitHub repository with code examples from the book at https://github.com/machine-learning-upgrade so you can follow along.

We intend this book to be something you can read all the way through. This is not an index of methods or a comprehensive book on machine learning. Our aim is to cover the challenges associated with modern-day machine learning, with a particular focus on data versioning, experiment tracking, post-production model monitoring, and deployment, and to equip you with the code and examples to start leveraging best practices immediately.

Chapter 1 lays the groundwork, revealing how the workflow for managing machine learning has evolved from traditional, more linear frameworks for data science, like CRISP-DM, to the advent of applications powered by large language models (LLMs). We set the stage by emphasizing the need for a unified framework and hint at the thrilling path ahead—building an LLM-powered application together.

As we delve into Chapter 2, prepare to witness an end-to-end approach to machine learning as we explore its life cycle, the principles of a production machine learning system, and the core of our LLM application.

Chapter 3 zooms in on the data-centric view, emphasizing the role of data in modern ML. This is a hands-on chapter, where we create embeddings and harness the power of vector databases for text similarity searches. We couple ethical guidelines and data versioning strategies to ensure a responsible and comprehensive approach.

Then comes Chapter 4, where we guide you through selecting the right LLM, leveraging LangChain, and fine-tuning LLM performance.

With each part seamlessly connected, we venture into Chapter 5 to assemble our components and transition from prototype to application. We also demonstrate how to build dashboarding and application programming interfaces (APIs) to make your model results available to end users.

But it doesn't stop there. Chapter 6 completes the ML life cycle, tackling model monitoring, retraining pipelines, and envisioning future deployment strategies and stakeholder communication.

Finally, in Chapter 7, we recap the best practices uncovered throughout this journey, explore emerging trends in LLMs, and provide resources for further learning.

This book is more than a guide—it's an adventure, an invitation to traverse the landscapes of modern machine learning, and an opportunity to equip yourself with the tools and knowledge to navigate it. So, fasten your seatbelt, friends, and let's get going!

Acknowledgments

Writing this book has been a collaborative journey filled with shared vision and support from an incredible network of individuals who have made this possible.

We are immensely grateful to the team at Wiley, particularly James Minatel and Gus Miklos, whose dedication and expertise transformed our manuscript into a polished book. We appreciate your commitment to excellence.

Our profound appreciation goes to the technical editor, Harpreet Sahota, who provided invaluable feedback and challenged us to refine our ideas and improve the manuscript. Your insights and guidance were crucial in shaping the final book.

We extend our heartfelt thanks to the readers who will engage with our collective work. We hope this book offers valuable insights and sparks new ideas in your explorations.

To each person who has contributed, directly or indirectly, to this collaborative effort, thank you for being part of this journey.

With gratitude,

Kristen Kehrer

Caleb Kaiser

About the Authors

Kristen Kehrer has been a builder and tinkerer delivering machine learning models since 2010 for e-commerce, healthcare, and utility companies. Ranked as a global LinkedIn Top Voice in data science and analytics in 2018, with 95,000 followers, Kristen is the creator of Data Moves Me. She was previously a faculty member and subject-matter expert at Emeritus Institute of Management. Kristen earned an MS in applied statistics from Worcester Polytechnic Institute and a BS in mathematics.

Caleb Kaiser is a full-stack engineer at Comet. Caleb was previously on the founding team at Cortex Labs. He also worked at Scribe Media on the author platform team and earned a bachelor of fine arts in writing from the School of the Art Institute of Chicago.

About the Technical Editor

Harpreet Sahota is a self-described generative AI hacker. He earned undergraduate and graduate degrees in statistics and mathematics and has been in the “data world” since 2013, working as an actuary, biostatistician, data scientist, and machine learning engineer, with expertise in statistics, machine learning, MLOps, LLMOps, and generative AI (with a focus on multimodal retrieval augmented generation). He loves tinkering with new technology and spending time with his wife, Romie, and kids, Jugaad and Jind. His book Practical Retrieval Augmented Generation will be published by Wiley in 2025.

Chapter 1: A Gentle Introduction to Modern Machine Learning

Over the last 20 years, data science has largely been focused on using data to inform business decisions. Typical data science projects have centered around gathering, cleaning, and modeling data or creating a dashboard before finally producing a presentation to share results with stakeholders. This pipeline has been the backbone of many important business decisions. It has driven quite a bit of revenue. There have been many, many dashboards.

Traditionally, we might refer to projects where we perform descriptive analysis to inform decisions as business intelligence (BI). In theory, BI is a specific field within the data sciences: data science refers broadly to the practice of applying statistical methods (including modeling), coding, and domain knowledge to data, whereas business intelligence applies more narrowly to taking a data-driven approach to business decisions, focusing on descriptive and diagnostic analytics rather than the predictive analytics you might see from data scientists. However, we consider all analysts and BI professionals to be working in the “data sciences.”

In practice, if you've had a job as an analyst or a data scientist in the last decade, you've probably spent a lot of time on business intelligence, in one way or another.

Many people might cry foul on this claim, pointing out that business intelligence, as we traditionally think of it, falls under the domain of roles with titles like “BI analyst,” while data scientists tend to have more varied and research-focused responsibilities. While that might be true on some level, breaking down the responsibilities and functions of different roles in an average analytics organization makes it difficult to neatly separate data science from BI, and there will always be overlap when working with data.

For example, as an analyst at an average company, you'd likely be responsible for answering “what happened?” questions, using descriptive analytics to provide a snapshot of past performance. You might use Excel, SQL, and visualization software to generate reports and dashboards. You would likely monitor key performance indicators (KPIs) and help make strategic decisions based on historical data. There is also a chance that, as a BI analyst or a business analyst, the KPIs, data sources, and machine learning models (if any) used in this process are set up for you before you start the project—you manage them.
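
To make this concrete, here is a minimal sketch of the kind of descriptive, “what happened?” report a BI analyst might produce with pandas. The file and column names (orders.csv, order_date, revenue) are hypothetical stand-ins for a real sales table.

```python
# A minimal sketch of a descriptive "what happened?" BI report.
# The file and column names (orders.csv, order_date, revenue) are hypothetical.
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# A classic KPI: monthly revenue, plus month-over-month growth
monthly = (
    orders.assign(month=orders["order_date"].dt.to_period("M"))
    .groupby("month")["revenue"]
    .sum()
    .to_frame("revenue")
)
monthly["mom_growth"] = monthly["revenue"].pct_change()
print(monthly.tail(6))
```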

Now, when the company has completely new data to process, it is often made available via self-service for nontechnical stakeholders through yet another dashboard (this is where those excruciating debates about Tableau versus Power BI can crop up).

In general, this is a useful way to think about the distinction between data scientists and explicit BI roles at average companies. A data scientist will usually be responsible for the more technical, research-intensive projects in an analytics organization: exploring new data sources, implementing predictive analytics, performing hypothesis tests, researching new machine learning models, etc. However, much of what they do is still focused on, or in service to, what we can broadly define as business intelligence.

Or at least, this used to be the case.

Data Science Is Diverging from Business Intelligence

We should underline how research-focused data science work often is, especially when it comes to machine learning. In advanced analytics, where many data scientists' roles now sit, if you were to build a machine learning model, there is a good chance that the model would only ever be used for research. Years ago, in particular, it would be unlikely that you'd ever work on a model that was deployed or implemented in the product. You might build a behavioral customer segmentation or a predictive model with the goal of understanding your customers better. The new information about your customers might drive new features or changes to the product through further hypothesis testing. Still, the model itself might not be used anymore once the research is communicated to the business.

This isn't because data scientists don't want to train models that have huge impacts; it's because this has historically just been really hard. In 2022, Gartner published a survey of companies using machine learning (ML) across the United States, Germany, and the United Kingdom, which found that only 54% of the models their data scientists developed ever made it to production, and that is after years of development within the ML ecosystem.

So instead, much of the actual business value delivered by data scientists has come from less “flashy” BI work, while their model-building research has served primarily as résumé padding. But finally, this is starting to change.

Modeling itself has become more feasible, and it's becoming more common for models to be used in products. New tools like AutoML and improvements to existing ML frameworks have made it much easier for data scientists to train useful models on virtually any type of data. Transfer learning, popularized by computer vision and further spurred on by the explosion of large language models, has made training impactful models even easier in many respects. Even deployment has become more approachable as the ML infrastructure ecosystem has had time to develop.

As a result, data scientists are increasingly working on modeling use cases that add value to the business, and this is moving them away from traditional BI work. More and more, data scientists are responsible for building and monitoring machine learning models. This brave new world, however, has introduced an entirely new set of challenges and responsibilities, and this is the core motivation of this book—to help you transition from the BI-focused world of data science to the new world, in which data scientists engineer production-focused, multicomponent machine learning systems.

Many of the principles covered in this book can be broadly referred to as MLOps (or LLMOps), but to be clear, our goal is not to turn you into an MLOps engineer; managing tooling and infrastructure is not a data scientist's job. The hope is instead that by understanding MLOps principles, data scientists can build reliable, scalable, and reproducible models that have real-world impact.

Before digging deeper into these principles, we should take some time to explore the changes in machine learning that have powered this transition.

From CRISP-DM to Modern, Multicomponent ML Systems

Machine learning has come a long way since its inception. In the 1990s, the Cross-Industry Standard Process for Data Mining (CRISP-DM) was introduced to describe the typical phases of a data modeling project. For a long time, it was the dominant framework for managing the machine learning workflow. CRISP-DM was instrumental in promoting the idea that data science should not be ad hoc but rather an organized process. The framework features six key steps:

Business understanding:

Defining the objectives and requirements of the project from a business perspective.

Data understanding:

Collecting and exploring the data to understand its quality and structure.

Data preparation:

Cleaning, transforming, and organizing the data to make it suitable for machine learning.

Modeling:

Building and evaluating machine learning models to solve the business problem.

Evaluation:

Assessing the performance of the models in achieving the business objectives.

Deployment:

Integrating the model into the production environment.

This neat, linear progression is a perfect fit for some of the data projects we previously described. In this context, a model is just a part of a pipeline, built to generate a particular report.
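
To make the linearity concrete, here is a minimal sketch of such a one-shot workflow in scikit-learn, with synthetic data standing in for the business dataset: prepare the data, fit a model, evaluate it, and hand the numbers off to a report. Nothing about it runs continuously or serves live traffic.

```python
# A one-shot, CRISP-DM-style workflow: prepare -> model -> evaluate -> report.
# Synthetic data stands in for the business dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data preparation
X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Modeling
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluation -- the accuracy number goes into a report, and the project ends here
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```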

However, modern machine learning systems are much more akin to products than pipelines. They have many interconnected components and are applied to a much wider range of problems. On a basic technical level, they introduce several difficult challenges:

Data size and variety:

With the advent of big data, machine learning projects often involve massive datasets with diverse data types, such as text, images, and structured data. New approaches are needed to handle these different data sources.

Complex algorithms:

Machine learning algorithms have become more sophisticated, particularly with the emergence of deep learning, reinforcement learning, etc. These algorithms require specialized tools and frameworks for implementation and training.

Model deployment:

Modern machine learning systems require continuous model updates and monitoring, making deployment a complex, ongoing process.

Scalability:

Generating reports once a month is one thing, but performing real-time inference on demand for thousands of concurrent users is a massive challenge.

Collaboration:

Machine learning often involves a team of data scientists, data engineers, and domain experts. Collaboration tools and platforms have become essential to managing these cross-functional teams.

Ethical considerations and governance:

While not strictly a technical concern, the increasing impact of machine learning on society has made ethical concerns and governance practices a higher priority. Ensuring fairness, transparency, and compliance has become an essential component of machine learning systems.

Luckily, some very talented data scientists and engineers have been working on these problems for years, and the modern ML ecosystem is full of tools that make these challenges significantly easier. We have solutions for data versioning, experiment tracking, and collaborating across teams. We have model registries and tools for monitoring models in production. Data lakes and warehouses enable us to manage, store, and query large volumes of data effectively. Machine learning frameworks have made modeling much easier. There are even open-source libraries for ensuring ethical and responsible use of machine learning models.
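
As a small taste of one of these tools, here is a minimal experiment-tracking sketch with Comet, which we use later in this book. It assumes a COMET_API_KEY environment variable is set; the project name and logged values are placeholders.

```python
# A minimal experiment-tracking sketch with Comet.
# Assumes a COMET_API_KEY environment variable; the project name is a placeholder.
from comet_ml import Experiment

experiment = Experiment(project_name="machine-learning-upgrade")

# Log the hyperparameters that define this run...
experiment.log_parameters({"learning_rate": 1e-4, "batch_size": 32})

# ...and the metrics produced during training (values here are placeholders)
for step, loss in enumerate([0.9, 0.6, 0.4]):
    experiment.log_metric("train_loss", loss, step=step)

experiment.end()
```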

Throughout this book, we'll explore all of these tools and actually build projects on top of them. But, before we go any deeper, we should take a moment to discuss perhaps the biggest shift in the machine learning ecosystem over the last decade: large language models.

The Emergence of LLMs Has Increased ML's Power and Complexity

Now, we turn our attention to one of the most transformative developments in the field of machine learning: the emergence of large language models (LLMs), which has significantly increased both the power and complexity of ML systems. Models like GPT-3, BERT, and their successors have redefined the limits of what's possible in natural language understanding and generation.

These models, characterized by their immense size and pretrained parameters, have proven versatile and capable of a wide variety of tasks. For example, LLMs have achieved unprecedented performance on natural language understanding (NLU) tasks such as sentiment analysis, text summarization, language translation, and question-answering. They have fundamentally changed the landscape of natural language processing (NLP). LLMs can generate coherent, context-aware text across diverse styles and domains and are the driving force behind the proliferation of content-generation tools, chatbots, and AI-assisted writing.

A pretrained AI model has typically been trained on a large dataset to perform a certain task. The training data can be of any type, including images, text, audio, and tabular data, depending on the use case. Pretrained models are typically state-of-the-art deep learning models.
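
For example, a pretrained model can be pulled off the shelf in a few lines with Hugging Face's transformers library; this sketch uses the library's default sentiment-analysis checkpoint.

```python
# Using a pretrained, state-of-the-art model off the shelf.
# Downloads the default sentiment-analysis checkpoint from the Hugging Face Hub.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("This book changed how I think about ML systems."))
# e.g., [{'label': 'POSITIVE', 'score': 0.99...}]
```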

In addition, pretrained LLMs have become powerful tools for transfer learning. By fine-tuning on domain-specific data, these models can be adapted to a variety of applications, reducing the need for extensive labeled data for each new use case. They can even be applied to tasks involving modalities beyond text, such as images and audio. This convergence of modalities opens new avenues for complex, multimodal applications such as image captioning, visual question-answering, and more.
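
As a rough sketch of what that fine-tuning looks like in practice, the following adapts a small pretrained model to a toy classification task using the transformers Trainer API. The checkpoint choice, labels, and two-example dataset are all placeholders; a real project would use a sizable labeled, domain-specific dataset.

```python
# A compact fine-tuning sketch: adapting a small pretrained model to a
# domain-specific classification task. The dataset here is a toy placeholder.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A two-example toy dataset standing in for your labeled, domain-specific data
data = Dataset.from_dict({
    "text": ["the deployment failed again", "latency is back to normal"],
    "label": [0, 1],
})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=32),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-out", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to="none"),
    train_dataset=data,
)
trainer.train()  # the fine-tuned weights now live in `model`
```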

Increased capabilities, however, come with increased complexity. For example, consider the following common LLM tasks:

Language understanding:

LLMs can understand nuanced language and context, enabling more sophisticated and context-aware AI systems, such as chat agents. However, chat agents require frequent inference (i.e., running a trained AI model to make a prediction or solve a task). How can we handle 1,000 concurrent users, each requiring inference every few seconds, with a 17-billion-parameter model?

Knowledge extraction:

LLMs can extract structured knowledge from unstructured text, which has broad applications in data mining, information retrieval, and content curation. But how do we ingest and store this knowledge? How do we make it possible for our system to even find it dynamically? (One common answer, embeddings and vector search, is sketched after this list.)

Content generation:

LLMs can generate content that is highly creative and contextually relevant, such as poetry, code, and even entire articles. This complexity extends to AI-generated art, music, and literature. How do we generate content while ensuring that we respect copyright law? How do we protect against racist or otherwise toxic output? What about misinformation?

Multimodal AI:

Integrating LLMs with other deep learning models, such as convolutional neural networks (CNNs) for tasks like image processing, leads to complex, powerful multimodal AI systems. These systems can understand and generate content that combines text, images, and other data types. However, all of these models must be deployed and managed in concert. How can we do this effectively?
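
One common answer to the knowledge-extraction questions above, and the one this book builds toward, is to store text as embeddings and search by similarity. Here is a minimal sketch with the sentence-transformers library; the model name is one popular default, and in production the embeddings would live in a vector database (see Chapter 3).

```python
# A minimal text-similarity sketch: embed documents, then find the one
# closest to a query. In production, the embeddings would live in a vector database.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # one popular default checkpoint

docs = [
    "Our API supports batch inference for large workloads.",
    "The quarterly report shows revenue growth in Europe.",
]
doc_embeddings = model.encode(docs, convert_to_tensor=True)

query = "How do I run inference on many inputs at once?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and each document
scores = util.cos_sim(query_embedding, doc_embeddings)
print(docs[int(scores.argmax())])  # -> the batch-inference document
```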

Research and development efforts are making these models more efficient, interpretable, and less resource-intensive. As LLMs become more accessible, they have the potential to democratize AI and empower even more innovation in diverse fields.

Throughout this book, as we explore the new world of machine learning, we will use an LLM-focused project to demonstrate these ideas. For completeness, we'll also present examples along the way with tabular data.

What You Can Expect from This Book

This book is principally a guide to MLOps and LLMOps for data scientists from a more traditional BI-focused or research-heavy background. We will be building projects of our own using the tools and principles we explore throughout the following chapters. We hope you will use these techniques in your future projects. Broadly, we will cover the following:

Data pipeline and version control:

We will introduce the concept of data pipelines and version control for data preparation. This ensures that data is consistently processed and that changes to the pipeline can be tracked and managed. (A minimal data-versioning example follows this list.)

Experimentation and model versioning:

We will expand the modeling phase to include things such as architecture search and model versioning.

Continuous evaluation:

We will extend the evaluation phase to involve continuous model performance monitoring and implement automated checks to detect performance degradation over time.

Continuous deployment:

We will update the deployment phase to include continuous deployment, enabling the rapid deployment of updated models.

Monitoring and model retraining:

We will add a maintenance phase to monitor models in production, retraining our models on new data as needed.

Collaborative tools and workflow:

We will promote collaborative tools and workflows that facilitate cross-functional cooperation between data scientists, data engineers, and operations teams.
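
As a first taste of the data versioning item above, here is a minimal sketch using Comet Artifacts, one of the tools we rely on later in the book. The project name and file path are placeholders, and a COMET_API_KEY environment variable is assumed.

```python
# A minimal data-versioning sketch using Comet Artifacts.
# The project name and file path are placeholders; COMET_API_KEY is assumed set.
from comet_ml import Artifact, Experiment

experiment = Experiment(project_name="machine-learning-upgrade")

# Each logged artifact gets an immutable version, so any model run can be
# traced back to the exact dataset it was trained on.
artifact = Artifact(name="training-data", artifact_type="dataset")
artifact.add("data/train.csv")  # hypothetical local file
experiment.log_artifact(artifact)
experiment.end()
```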

MLOps principles are essential to navigate the evolving landscape of data science and machine learning. The challenges we've discussed, from data versioning to model evaluation and deployment, highlight the need for a systematic and scalable approach. By incorporating data pipelines, version control, experimentation, continuous evaluation, and collaborative workflows into your projects, you'll enhance reproducibility and scalability and ensure that your models remain effective and adaptive over time.

As we've already discussed, we'll use an LLM-based project to demonstrate these principles. Our application will perform question-answering using YouTube videos as an external data source. In the next chapter, we'll look at the philosophy behind our project (and modern ML systems in general) before diving into our project in Chapter 3.

Chapter 2: An End-to-End Approach

The focus of this book is on building end-to-end, production machine learning systems. With that in mind, we should begin by defining what these terms mean. We promise—this isn't just pedantry. Over the last 20 years, terms like end-to-end and production have been thrown around a lot in the world of data science, and depending on the time period, their definitions may vary wildly.

Imagine working on a business intelligence team at a shoe retailer in 2015, when working with data looked quite different than it does today. Your team is tasked with sales forecasting for the next quarter. What would your end-to-end system look like?