Machine Learning Engineering with Python - Andrew P. McMahon - E-Book

Machine Learning Engineering with Python E-Book

Andrew P. McMahon

Description

The Second Edition of Machine Learning Engineering with Python is the practical guide that MLOps and ML engineers need to build solutions to real-world problems. It will provide you with the skills you need to stay ahead in this rapidly evolving field.

The book takes an examples-based approach to help you develop your skills and covers the technical concepts, implementation patterns, and development methodologies you need. You'll explore the key steps of the ML development lifecycle and create your own standardized "model factory" for training and retraining of models. You'll learn to employ concepts like CI/CD and how to detect different types of drift.

Get hands-on with the latest in deployment architectures and discover methods for scaling up your solutions. This edition goes deeper in all aspects of ML engineering and MLOps, with emphasis on the latest open-source and cloud-based technologies. This includes a completely revamped approach to advanced pipelining and orchestration techniques.

With a new chapter on deep learning, generative AI, and LLMOps, you will learn to use tools like LangChain, PyTorch, and Hugging Face to leverage LLMs for supercharged analysis. You will explore AI assistants like GitHub Copilot to become more productive, then dive deep into the engineering considerations of working with deep learning.

Page count: 614

Publication year: 2023




Machine Learning Engineering with Python

Second Edition

Manage the lifecycle of machine learning models using MLOps with practical examples

Andrew P. McMahon

BIRMINGHAM—MUMBAI

Machine Learning Engineering with Python

Second Edition

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Bhavesh Amin

Acquisition Editor – Peer Reviews: Gaurav Gavas

Project Editor: Amisha Vathare

Content Development Editor: Elliot Dallow

Copy Editor: Safis Editing

Technical Editor: Anjitha Murali

Proofreader: Safis Editing

Indexer: Subalakshmi Govindhan

Presentation Designer: Rajesh Shirsath

Developer Relations Marketing Executive: Monika Sangwan

First published: November 2021

Second edition: August 2023

Production reference: 2280823

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-83763-196-4

www.packt.com

Foreword

Throughout my career in Fortune 500 companies and startups, I have witnessed the various cycles of machine learning momentum firsthand. With each cycle, our economy has broadened, opportunities have evolved, and the world has moved forward with innovations. This is machine learning; welcome! It is an ever-evolving field that has witnessed unprecedented growth over the past decade. As artificial intelligence has become increasingly integrated into our lives, it is essential to grasp the principles of building robust, scalable, and efficient machine learning systems.

It is spectacular to see how the world has changed since I released my book, Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch (O’Reilly, March 2023).

As an industry, we have seen massive adoption of OpenAI services, evolving GPT models, and growing industry openness. This is a phenomenal time to dive deeper, take notes, and learn from hands-on practitioners.

Machine Learning Engineering with Python is your gateway to mastering the art of turning machine learning models into real-world applications, including all the bits and pieces of building pipelines with Airflow, data processing with Spark, LLMs, CI/CD for machine learning, and working with AWS services.

Andy breaks the space into tangible, easy-to-read, and compelling chapters, each focusing on specific use cases and dedicated technologies to pass his wisdom to you, including how to approach projects as an ML engineer working with deep learning, large-scale serving and training, and LLMs.

This fantastic resource bridges the gap between theory and practice, offering a hands-on, Python-focused approach to ML engineering. Whether you are an aspiring data scientist, a seasoned machine learning practitioner, or a software engineer venturing into the realm of AI for the first time, this book equips you with the knowledge and tools to thrive in the ML engineering domain.

What I love about this book is that it is very practical. It covers the space well, from material for complete beginners through to hands-on depth. You can use it to get a bigger picture of the space and then dive deep with code samples and real-world projects.

Another distinguishing aspect of this book is its focus on Python, a popular language in data science and ML engineering. Python’s versatility, ease of use, and vast ecosystem of libraries make it the ideal choice for turning machine learning models into tangible solutions. Throughout the book, Andy leverages powerful Python libraries like NumPy, pandas, scikit-learn, TensorFlow, and an array of open-source solutions.

Whether you are an individual learner seeking to enhance your skills or an organization striving to harness the potential of machine learning, Machine Learning Engineering with Python is your ultimate companion.

Adi Polak

Author of Scaling Machine Learning with Spark

Contributors

About the author

Andrew Peter McMahon has spent years building high-impact ML products across a variety of industries. He is currently Head of MLOps for NatWest Group in the UK and has a PhD in theoretical condensed matter physics from Imperial College London. He is an active blogger, speaker, podcast guest, and leading voice in the MLOps community. He is co-host of the AI Right podcast and was named ‘Rising Star of the Year’ at the 2022 British Data Awards and ‘Data Scientist of the Year’ by the Data Science Foundation in 2019.

It takes a village to do anything in this life, and I wish to thank my Mum, Dad, and brothers, David and Paul, wider family and friends for being that village for me. But most of all I want to thank my wife Hayley for being my rock, my best friend, and the only person I could go through this crazy life with; and my sons, Teddy and Alfie, for being all the inspiration I’ll ever need. This book is dedicated to you three.

About the reviewers

Hamza Tahir is a software developer turned ML engineer. An indie hacker at heart, he loves ideating, implementing, and launching data-driven products. He has previously worked on creating machine learning-based products like PicHance, Scrilys, BudgetML, and you-tldr. Based on his learnings from deploying ML in production for predictive maintenance use cases in his previous startup, he co-created ZenML, an open-source MLOps framework for easily creating production-grade ML pipelines on any infrastructure stack.

I’d like to give my appreciation to all my ZenML colleagues, who are working hard every day to make the most accessible MLOps product on the market. I’d also like to thank Isabel, for her consistent and unwavering support.

Prince Canuma is an accomplished ML engineer with an extensive background in MLOps, ML, data science, computer vision, and NLP. Born in Maputo, Mozambique, Prince embarked on his journey into the realm of ML after earning a diploma in IT and telecommunications from the Institute of Transports and Communications (ITC) in 2017. At present, Prince is an integral part of Neptune.ai. In his tenure, he has excelled as an ML engineer and carved a niche for himself in developer relations.

Beyond his professional commitments, Prince’s expertise shines brightly in his significant research contributions, especially in the realm of Automatic Speech Recognition (ASR). His most groundbreaking work revolved around benchmarking and enhancing OpenAI’s Whisper ASR system by introducing speaker diarization and a targeted language model.

On a personal note, I extend my deepest gratitude to Eliseu Canuma and the entire Canuma family for their unwavering support and encouragement throughout the process of producing this book. Your faith in my capabilities has been the cornerstone of my involvement in this project.

Jonah Turner is a student at Toulouse 3 Paul Sabatier University pursuing a master’s in artificial intelligence. He was previously a data scientist at Sandhills Global working in computer vision, NLP, and DataOps. He enjoys traveling in southern France, browsing GitHub, and contributing to open-source machine learning projects.

Join our community on Discord

Join our community’s Discord space for discussion with the author and other readers:

https://packt.link/mle

Contents

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

Introduction to ML Engineering

Technical requirements

Defining a taxonomy of data disciplines

Data scientist

ML engineer

ML operations engineer

Data engineer

Working as an effective team

ML engineering in the real world

What does an ML solution look like?

Why Python?

High-level ML system design

Example 1: Batch anomaly detection service

Example 2: Forecasting API

Example 3: Classification pipeline

Summary

The Machine Learning Development Process

Technical requirements

Setting up our tools

Setting up an AWS account

Concept to solution in four steps

Comparing this to CRISP-DM

Discover

Using user stories

Play

Develop

Selecting a software development methodology

Package management (conda and pip)

Poetry

Code version control

Git strategies

Model version control

Deploy

Knowing your deployment options

Understanding DevOps and MLOps

Building our first CI/CD example with GitHub Actions

Continuous model performance testing

Continuous model training

Summary

From Model to Model Factory

Technical requirements

Defining the model factory

Learning about learning

Defining the target

Cutting your losses

Preparing the data

Engineering features for machine learning

Engineering categorical features

Engineering numerical features

Designing your training system

Training system design options

Train-run

Train-persist

Retraining required

Detecting data drift

Detecting concept drift

Setting the limits

Diagnosing the drift

Remediating the drift

Other tools for monitoring

Automating training

Hierarchies of automation

Optimizing hyperparameters

Hyperopt

Optuna

AutoML

auto-sklearn

AutoKeras

Persisting your models

Building the model factory with pipelines

Scikit-learn pipelines

Spark ML pipelines

Summary

Packaging Up

Technical requirements

Writing good Python

Recapping the basics

Tips and tricks

Adhering to standards

Writing good PySpark

Choosing a style

Object-oriented programming

Functional programming

Packaging your code

Why package?

Selecting use cases for packaging

Designing your package

Building your package

Managing your environment with Makefiles

Getting all poetic with Poetry

Testing, logging, securing, and error handling

Testing

Securing your solutions

Analyzing your own code for security issues

Analyzing dependencies for security issues

Logging

Error handling

Not reinventing the wheel

Summary

Deployment Patterns and Tools

Technical requirements

Architecting systems

Building with principles

Exploring some standard ML patterns

Swimming in data lakes

Microservices

Event-based designs

Batching

Containerizing

Hosting your own microservice on AWS

Pushing to ECR

Hosting on ECS

Building general pipelines with Airflow

Airflow

Airflow on AWS

Revisiting CI/CD for Airflow

Building advanced ML pipelines

Finding your ZenML

Going with the Kubeflow

Selecting your deployment strategy

Summary

Scaling Up

Technical requirements

Scaling with Spark

Spark tips and tricks

Spark on the cloud

AWS EMR example

Spinning up serverless infrastructure

Containerizing at scale with Kubernetes

Scaling with Ray

Getting started with Ray for ML

Scaling your compute for Ray

Scaling your serving layer with Ray

Designing systems at scale

Summary

Deep Learning, Generative AI, and LLMOps

Going deep with deep learning

Getting started with PyTorch

Scaling and taking deep learning into production

Fine-tuning and transfer learning

Living it large with LLMs

Understanding LLMs

Consuming LLMs via API

Coding with LLMs

Building the future with LLMOps

Validating LLMs

PromptOps

Summary

Building an Example ML Microservice

Technical requirements

Understanding the forecasting problem

Designing our forecasting service

Selecting the tools

Training at scale

Serving the models with FastAPI

Response and request schemas

Managing models in your microservice

Pulling it all together

Containerizing and deploying to Kubernetes

Containerizing the application

Scaling up with Kubernetes

Deployment strategies

Summary

Building an Extract, Transform, Machine Learning Use Case

Technical requirements

Understanding the batch processing problem

Designing an ETML solution

Selecting the tools

Interfaces and storage

Scaling of models

Scheduling of ETML pipelines

Executing the build

Building an ETML pipeline with advanced Airflow features

Summary

Other Books You May Enjoy

Index


Preface

“Software is eating the world, but AI is going to eat software.”

— Jensen Huang, CEO of Nvidia

Machine learning (ML), part of the wider field of Artificial Intelligence (AI), is rightfully recognized as one of the most powerful tools available for organizations to extract value from their data. As the capabilities of ML algorithms have grown over the years, it has become increasingly obvious that implementing such algorithms in a scalable, fault-tolerant, and automated way requires the creation of new disciplines. These disciplines, ML Engineering (MLE) and ML Operations (MLOps), are the focus of this book.

The book covers a wide variety of topics in order to help you understand the tools, techniques, and processes you can apply to engineer your ML solutions, with an emphasis on introducing the key concepts so that you can build on them in your future work. The aim is to develop fundamentals and a broad understanding that will stand the test of time, rather than just provide a series of introductions to the latest tools, although we do cover a lot of the latest tools!

All of the code examples are given in Python, the most popular programming language in the world (at the time of writing) and the lingua franca for data applications. Python is a high-level and object-oriented language with a rich ecosystem of tools focused on data science and ML. For example, packages such as scikit-learn and pandas have become part of the standard lexicon for data science teams across the world. The central tenet of this book is that knowing how to use packages like these is not enough. In this book, we will use these tools, and many, many more, but focus on how to wrap them up in production-grade pipelines and deploy them using appropriate cloud and open-source tools.

We will cover everything from how to organize your ML team, to software development methodologies and best practices, to automating model building through to packaging your ML code, how to deploy your ML pipelines to a variety of different targets and then on to how to scale your workloads for large batch runs. We will also discuss, in an entirely new chapter for this second edition, the exciting world of applying ML engineering and MLOps to deep learning and generative AI, including how to start building solutions using Large Language Models (LLMs) and the new field of LLM Operations (LLMOps).

The second edition of Machine Learning Engineering with Python goes into a lot more depth than the first edition in almost every chapter, with updated examples and more discussion of core concepts. There is also a far wider selection of tools covered and a lot more focus on open-source tools and development. The ethos of focusing on core concepts remains, but it is my hope that this wider view means that the second edition will be an excellent resource for those looking to gain practical knowledge of machine learning engineering.

Although a far greater emphasis has been placed on using open-source tooling, many examples will also leverage services and solutions from Amazon Web Services (AWS). I believe that the accompanying explanations and discussions will, however, mean that you can still apply everything you learn here to any cloud provider or even in an on-premises setting.

Machine Learning Engineering with Python, Second Edition will help you to navigate the challenges of taking ML to production and give you the confidence to start applying MLOps in your projects. I hope you enjoy it!

Who this book is for

This book is for machine learning engineers, data scientists, and software developers who want to build robust software solutions with ML components. It is also relevant to anyone who manages or wants to understand the production life cycle of these systems. The book assumes intermediate-level knowledge of Python and some basic exposure to the concepts of machine learning. Some basic knowledge of AWS and the use of Unix tools like bash or zsh will also be beneficial.

What this book covers

Chapter 1, Introduction to ML Engineering, explains the core concepts of machine learning engineering and machine learning operations. Roles within ML teams are discussed in detail and the challenges of ML engineering and MLOps are laid out.

Chapter 2, The Machine Learning Development Process, explores how to organize and successfully execute an ML engineering project. This includes a discussion of development methodologies like Agile, Scrum, and CRISP-DM, before sharing a project methodology developed by the author that is referenced throughout the book. This chapter also introduces continuous integration/continuous deployment (CI/CD) and developer tools.

Chapter 3, From Model to Model Factory, shows how to standardize, systematize and automate the process of training and deploying machine learning models. This is done using the author’s concept of the model factory, a methodology for repeatable model creation and validation. This chapter also discusses key theoretical concepts important for understanding machine learning models and covers different types of drift detection and model retrain triggering criteria.

Chapter 4, Packaging Up, discusses best practices for coding in Python and how this relates to building your own packages, libraries, and components for reuse in multiple projects. This chapter covers fundamental Python programming concepts before moving on to more advanced concepts, and then discusses package and environment management, testing, logging, error handling, and security.

Chapter 5, Deployment Patterns and Tools, teaches you some of the standard ways you can design your ML system and then get it into production. This chapter focuses on architectures, system design, and deployment patterns first before moving on to using more advanced tools to deploy microservices, including containerization and AWS Lambda. The popular ZenML and Kubeflow pipelining and deployment platforms are then reviewed in detail with examples.

Chapter 6, Scaling Up, is all about developing with large datasets in mind. For this, the Apache Spark and Ray frameworks are discussed in detail with worked examples. The focus for this chapter is on scaling up batch workloads where massive compute is required.

Chapter 7, Deep Learning, Generative AI, and LLMOps, covers the latest concepts and techniques for training and deploying deep learning models for production use cases. This chapter includes material discussing the new wave of generative models, with a particular focus on Large Language Models (LLMs) and the challenges for ML engineers looking to productionize these models. This leads us on to defining the core elements of LLM Operations (LLMOps).

Chapter 8, Building an Example ML Microservice, walks through the building of a machine learning microservice that serves a forecasting solution using FastAPI, Docker and Kubernetes. This pulls together many of the previous concepts developed throughout the book.

Chapter 9, Building an Extract, Transform, Machine Learning Use Case, builds out an example of a batch processing ML system that leverages standard ML algorithms and augments these with the use of LLMs. This shows a concrete application of LLMs and LLMOps, as well as providing a more advanced discussion of Airflow DAGs.

To get the most out of this book

In this book, some previous exposure to Python development is assumed. Many introductory concepts are covered for completeness, but in general it will be easier to get through the examples if you have already written at least some Python programs before. The book also assumes some exposure to the main concepts from machine learning, such as what a model is, what training and inference refer to, and an understanding of similar concepts. Several of these are recapped in the text, but again it will be a smoother ride if you have previously been acquainted with the main ideas behind building a machine learning model, even at a rudimentary level.

On the technical side, to get the most out of the examples in the book, you will need access to a computer or server where you have privileges to install and run Python and other software packages and applications. For many of the examples, access to a UNIX-type terminal, such as bash or zsh, is assumed. The examples in this book were written and tested on both a Linux machine running Ubuntu LTS and an M2 Macbook Pro running macOS. If you use a different setup, for example Windows, the examples may require some translation in order to work on your system. Note that the use of the M2 Macbook Pro means several examples show some additional information to get the examples working on Apple Silicon devices. These sections can comfortably be skipped if your system does not require this extra setup.

Many of the cloud-based examples leverage Amazon Web Services (AWS) and so require an AWS account with billing set up. Most of the examples will use the free-tier services available from AWS, but this is not always possible. Caution is advised in order to avoid large bills. If in doubt, it is recommended you consult the AWS documentation for more information. As a concrete example of this, in Chapter 5, Deployment Patterns and Tools, we use the Managed Workflows for Apache Airflow (MWAA) service from AWS. There is no free-tier option for MWAA, so as soon as you spin up the example, you will be charged for the environment and any instances. Ensure you are happy to do this before proceeding, and I recommend tearing down your MWAA instances when finished.

Conda and pip are used for package and environment management throughout this book, but Poetry is also used in many cases. To facilitate easy reproduction of development environments, each chapter in the book’s GitHub repository (https://github.com/PacktPublishing/Machine-Learning-Engineering-with-Python-Second-Edition) has a corresponding folder, and within that folder are requirements.txt and Conda environment.yml files, as well as helpful README files. The commands for replicating the environments and any other requirements are given at the beginning of each chapter within the book.

If you are using the digital version of this book, I still advise you to type the code yourself or access the code from the book’s GitHub repository. Doing so will help you avoid any potential errors related to the copying and pasting of code.
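For orientation, a Conda environment.yml of the kind described usually looks like the following minimal sketch. This is illustrative only: the environment name, channel, and version pin here are assumptions, so always use the actual files in each chapter folder of the repository.

```yaml
# Illustrative sketch only; the real files live in each chapter folder.
name: mlewp-chapter01        # environment name assumed to match the chapter
channels:
  - conda-forge
dependencies:
  - python=3.10.8            # the Python version used throughout the book
  - pip
  - pip:
      - -r requirements.txt  # pull in any pip-only packages alongside Conda ones
```

Nesting a `-r requirements.txt` under the `pip:` key is a standard way to keep Conda-managed and pip-managed dependencies in one reproducible file.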

Download the example code files

As mentioned above, the code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-Engineering-with-Python-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/LMqir.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: “First, we must import the TabularDrift detector from the alibi-detect package, as well as the relevant packages for loading and splitting the data.”

A block of code is set as follows:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
import alibi
from alibi_detect.cd import TabularDrift
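To make the drift-detection idea concrete, here is a minimal, standard-library-only sketch of the kind of per-feature two-sample test that detectors such as TabularDrift run under the hood. This is not the alibi-detect implementation; the function names and the large-sample critical-value approximation are illustrative choices.

```python
import bisect
import math
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    distance between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of points in sorted_sample that are <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in set(a) | set(b))

def is_drifted(reference, current, alpha=0.05):
    """Flag drift if the KS statistic exceeds the large-sample
    critical value c(alpha) * sqrt((n + m) / (n * m))."""
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))  # ~1.358 for alpha=0.05
    return ks_statistic(reference, current) > c_alpha * math.sqrt((n + m) / (n * m))

random.seed(42)
reference = [random.gauss(0.0, 1.0) for _ in range(500)]  # training-time data
shifted = [random.gauss(2.0, 1.0) for _ in range(500)]    # the mean has drifted
print(is_drifted(reference, shifted))  # True: a 2-sigma mean shift is easily caught
```

In practice, prefer the library: TabularDrift also handles categorical features and applies multiple-testing corrections across features, which this sketch omits.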

Any command-line input or output is written as follows and is indicated as a command-line command in the main body of the text:

pip install tensorflow-macos

Bold: Indicates a new term, an important word, or words that you see on the screen. For instance, words in menus or dialog boxes appear in the text like this. For example: “Select the Deploy button. This will provide a dropdown where you can select Create service.”

References to additional resources or background information appear like this.

Helpful tips and important caveats appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you reported this to us. Please visit http://www.packtpub.com/submit-errata, click Submit Errata, and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

Share your thoughts

Once you’ve read Machine Learning Engineering with Python - Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application. 

The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781837631964

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly

1

Introduction to ML Engineering

Welcome to the second edition of Machine Learning Engineering with Python, a book that aims to introduce you to the exciting world of making machine learning (ML) systems production-ready.

In the two years since the first edition of this book was released, the world of ML has moved on substantially. There are now far more powerful modeling techniques available, more complex technology stacks, and a whole host of new frameworks and paradigms to keep up to date with. To help extract the signal from the noise, the second edition of this book covers a far larger variety of topics in more depth than the first edition, while still focusing on the critical tools and techniques you will need to build up your ML engineering expertise. This edition will cover the same core topics, such as how to manage your ML projects, how to create your own high-quality Python ML packages, and how to build and deploy reusable training and monitoring pipelines, while adding discussion around more modern tooling. It will also showcase and dissect different deployment architectures in more depth and discuss more ways to scale your applications using AWS and cloud-agnostic tooling. This will all be done using a variety of the most popular and latest open-source packages and frameworks, from classics like Scikit-Learn and Apache Spark to Kubeflow, Ray, and ZenML. Excitingly, this edition also has new sections dedicated entirely to Transformers and Large Language Models (LLMs) like ChatGPT and GPT-4, including examples using Hugging Face and OpenAI APIs to fine-tune and build pipelines using these extraordinary new models. As in the first edition, the focus is on equipping you with the solid foundation you need to go far deeper into each of these components of ML engineering. The aim is that by the end of this book, you will be able to confidently build, scale, and deploy production-grade ML systems in Python using these latest tools and concepts.

You will get a lot from this book even if you do not run the technical examples, or even if you try to apply the main points in other programming languages or with different tools. As already mentioned, the aim is to create a solid conceptual foundation you can build on. In covering the key principles, the aim is that you come away from this book feeling more confident in tackling your own ML engineering challenges, whatever your chosen toolset.

In this first chapter, you will learn about the different types of data roles relevant to ML engineering and why they are important, how to use this knowledge to build and work within appropriate teams, some of the key points to remember when building working ML products in the real world, how to start to isolate appropriate problems for engineered ML solutions, and how to create your own high-level ML system designs for a variety of typical business problems.

We will cover these topics in the following sections:

Defining a taxonomy of data disciplines

Assembling your team

ML engineering in the real world

What does an ML solution look like?

High-level ML system design

Now that we have explained what we are going after in this first chapter, let’s get started!

Technical requirements

Throughout the book, all code examples will assume the use of Python 3.10.8 unless specified otherwise. Examples in this edition have been run on a 2022 MacBook Pro with an M2 Apple silicon chip, with Rosetta 2 installed to allow backward compatibility with Intel-based applications and packages. Most examples have also been tested on a Linux machine running Ubuntu 22.04 LTS. The required Python packages for each chapter are stored in conda environment .yml files in the appropriate chapter folder in the book’s Git repository. We will discuss package and environment management in detail later in the book. But in the meantime, assuming you have a GitHub account and have configured your environment to be able to pull and push from GitHub remote repositories, to get started you can clone the book repository from the command line:

git clone https://github.com/PacktPublishing/Machine-Learning-Engineering-with-Python-Second-Edition.git

Assuming you have Anaconda or Miniconda installed, you can then navigate to the Chapter01 folder of the Git repository for this book and run:

conda env create -f mlewp-chapter01.yml

This will set up the environment you can use to run the examples given in this chapter. A similar procedure can be followed for each chapter, but each section will also call out any installation requirements specific to those examples.

Now we have done some setup, we will start to explore the world of ML engineering and how it fits into a modern data ecosystem. Let’s begin our exploration of the world of ML engineering!

Note: Before running the conda commands given in this section you may have to install a specific library manually. Some versions of the Facebook Prophet library require versions of PyStan that can struggle to build on Macbooks running Apple silicon. If you run into this issue, then you should try to install the package httpstan manually. First, go to https://github.com/stan-dev/httpstan/tags and select a version of the package to install. Download the tar.gz or .zip of that version and extract it. Then you can navigate to the extracted folder and run the following commands:

make
python3 -m pip install poetry
python3 -m poetry build
python3 -m pip install dist/*.whl

You may also run into an error like the following when you call model.fit() in the later example:

dyld[29330]: Library not loaded: '@rpath/libtbb.dylib'

If this is the case you will have to run the following commands, substituting in the correct path for your Prophet installation location in the Conda environment:

cd /opt/homebrew/Caskroom/miniforge/base/envs/mlewp-chapter01/lib/python3.10/site-packages/prophet/stan_model/
install_name_tool -add_rpath @executable_path/cmdstan-2.26.1/stan/lib/stan_math/lib/tbb prophet_model.bin

Oh, the joys of doing ML on Apple silicon!

Defining a taxonomy of data disciplines

The explosion of data and the potential applications of it over the past few years have led to a proliferation of job roles and responsibilities. The debate that once raged over how a data scientist was different from a statistician has now become extremely complex. I would argue, however, that it does not have to be so complicated. The activities that have to be undertaken to get value from data are pretty consistent, no matter what business vertical you are in, so it should be reasonable to expect that the skills and roles you need to perform these steps will also be relatively consistent. In this chapter, we will explore some of the main data disciplines that I think you will always need in any data project. As you can guess, given the name of this book, I will be particularly keen to explore the notion of ML engineering and how this fits into the mix.

Let’s now look at some of the roles involved in using data in the modern landscape.

Data scientist

After the Harvard Business Review declared that being a data scientist was The Sexiest Job of the 21st Century (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century), this job role became one of the most sought after, but also hyped, in the mix. Its popularity remains high, but the challenges of taking advanced analytics and ML into production have meant there has been more and more of a shift toward engineering roles within data-driven organizations. The traditional data scientist role can cover an entire spectrum of duties, skills, and responsibilities depending on the business vertical, the organization, or even just personal preference. No matter how this role is defined, however, there are some key areas of focus that should always be part of the data scientist’s job profile:

- Analysis: A data scientist should be able to wrangle, munge, manipulate, and consolidate datasets before performing calculations on the data that help us to understand it. Analysis is a broad term, but it’s clear that the end result is knowledge of your dataset that you didn’t have before you started, no matter how basic or complex.
- Modeling: The thing that gets everyone excited (potentially including you, dear reader) is the idea of modeling phenomena found in your data. A data scientist usually has to be able to apply statistical, mathematical, and ML techniques to data, in order to explain processes or relationships contained within it and to perform some sort of prediction.
- Working with the customer or user: The data scientist role usually has some more business-directed elements so that the results of the previous two points can support decision-making in an organization. This could be done by presenting the results of the analysis in PowerPoint presentations or Jupyter notebooks, or even sending an email with a summary of the key results. It involves communication and business acumen in a way that goes beyond classic tech roles.

ML engineer

The gap between creating ML proof-of-concept and building robust software, what I often refer to in talks as “the chasm,” has led to the rise of what I would now argue is one of the most important roles in technology. The ML engineer serves an acute need to translate the world of data science modeling and exploration into the world of software products and systems engineering. Since this is no easy feat, the ML engineer has become increasingly sought after and is now a critical piece of the data-driven software value chain. If you cannot get things into production, you are not generating value, and if you are not generating value, well we know that’s not great!

You can articulate the need for this type of role quite nicely by considering a classic voice assistant. In this case, a data scientist would usually focus on translating the business requirements into a working speech-to-text model, potentially a very complex neural network, and showing that it can perform the desired voice transcription task in principle. ML engineering is then all about how you take that speech-to-text model and build it into a product, service, or tool that can be used in production. Here, it may mean building some software to train, retrain, deploy, and track the performance of the model as more transcription data is accumulated, or as user preferences are understood. It may also involve understanding how to interface with other systems and provide results from the model in the appropriate formats. For example, the results of the model may need to be packaged into a JSON object and sent via a REST API call to an online supermarket, in order to fulfill an order.
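To make that last point concrete, here is a minimal sketch of how a transcription result might be packaged as JSON for a downstream system. The field names and endpoint are purely hypothetical and not taken from any real API:

```python
import json


def package_transcription(order_id: str, transcript: str, confidence: float) -> str:
    """Package a speech-to-text result as a JSON payload for a downstream system.

    The schema here is invented for illustration; a real integration would
    follow the consuming system's API contract.
    """
    payload = {
        "order_id": order_id,
        "transcript": transcript,
        "confidence": confidence,
    }
    return json.dumps(payload)


payload_json = package_transcription("ord-123", "add two pints of milk", 0.97)
print(payload_json)

# In production, this payload might then be sent to the consuming service,
# for example with an HTTP POST using a library such as requests:
# requests.post("https://api.example.com/orders", data=payload_json,
#               headers={"Content-Type": "application/json"})
```

The exact transport (REST call, message queue, database insert) will depend on the consuming system, but the engineering concern is the same: a well-defined, validated output format.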

Data scientists and ML engineers have a lot of overlapping skillsets and competencies but have different areas of focus and strengths (more on this later), so they will usually be part of the same project team and may have either title, but it will be clear what hat they wear from what they do in that project.

Similar to the data scientist, we can define the key areas of focus for the ML engineer:

- Translation: Taking models and research code in a variety of formats and translating them into slicker, more robust pieces of code. This can be done using OO programming, functional programming, a mix, or something else, but it basically helps to take the proof-of-concept work of the data scientist and turn it into something that is far closer to being trusted in a production environment.
- Architecture: Deployments of any piece of software do not occur in a vacuum and will always involve lots of integrated parts. This is true of ML solutions as well. The ML engineer has to understand how the appropriate tools and processes link together so that the models built with the data scientist can do their job and do it at scale.
- Productionization: The ML engineer is focused on delivering a solution and so should understand the customer’s requirements inside out, as well as be able to understand what that means for the project development. The end goal of the ML engineer is not to provide a good model (though that is part of it), nor is it to provide something that basically works. Their job is to make sure that the hard work on the data science side of things generates the maximum potential value in a real-world setting.

ML operations engineer

ML engineering will be the focus of this book, but there is an important role now emerging with the aim of enabling ML engineers to do their work with higher quality, at greater pace, and at a larger scale. These are the Machine Learning Operations (MLOps) engineers. This role is all about building out the tools and capabilities that enable the tasks of the ML engineer and data scientists. This role focuses more on building out the tooling, platforms, and automation used by the other roles, and so connects them nicely. That is not to say MLOps engineers will not be used in specific projects or builds; it is just that their main value-add comes not from this but from enabling the capabilities used during a specific project or build. If we revisit the example of the speech-to-text solution described in the ML engineer section, we can get a flavor of this. Where the ML engineer will be worried about building out a solution that works seamlessly in production, the MLOps engineer will work hard to build out the platform or toolset that the ML engineer uses to do this. The ML engineer will build pipelines, but the MLOps engineer may build pipeline templates; the ML engineer may use continuous integration/continuous deployment (CI/CD) practices (more on this later), but the MLOps engineer will enable that capability and define the best practice to use CI/CD smoothly. Finally, where the ML engineer thinks “How do I solve this specific problem robustly using the proper tools and techniques?”, the MLOps engineer asks “How do I make sure that the ML engineers and data scientists will be able to, in general, solve the types of problems they need to, and how can I continually update and improve that setup?”

As we did with the data scientist and ML engineer, let us define some of the key areas of focus for the MLOps engineer:

- Automation: Increasing the level of automation across the data science and ML engineering workflows through the use of techniques such as CI/CD and Infrastructure-as-Code (IaC). Pre-packaged software can be deployed through these capabilities and more, such as automation scripts or standardized templates, to allow for smoother deployments of solutions.
- Platform engineering: Working to integrate a series of useful services together in order to build out the ML platform for the different data-driven teams to use. This can include developing integrations across orchestration tools, compute, and more data-driven services until they become a holistic whole for use by ML engineers and data scientists.
- Enabling key MLOps capabilities: MLOps consists of a set of practices and techniques that enable the productionization of ML models by the other engineers in the team. Capabilities such as model management and model monitoring should be enabled by the MLOps engineers in a way that can be used at scale across multiple projects.

It should be noted that some of the topics covered in this book could be carried out by an MLOps engineer and that there is naturally some overlap. This should not concern us too much, as MLOps is based on quite a generic set of practices and capabilities that can be encompassed by multiple roles (see Figure 1.1).

Data engineer

The data engineers are the people responsible for getting the commodity that everything else in the preceding sections is based on from A to B with high fidelity, appropriate latency, and as little effort on the part of the other team members as possible. You cannot create any type of software product, never mind an ML product, without data.

The key areas of focus for a data engineer are as follows:

- Quality: Getting data from A to B is a pointless exercise if the data is garbled, fields are missing, or IDs are screwed up. The data engineer cares about avoiding this and uses a variety of techniques and tools, generally to ensure that the data that left the source system is what lands in your data storage layer.
- Stability: Similar to the previous point on quality, if the data comes from A to B but it only arrives every second Wednesday if it’s not a rainy day, then what’s the point? Data engineers spend a lot of time and effort and use their considerable skills to ensure that data pipelines are robust, reliable, and can be trusted to deliver when promised.
- Access: Finally, the aim of getting data from A to B is for it to be used by applications, analyses, and ML models, so the nature of B is important. The data engineer will have a variety of technologies to hand to surface data and should work with the data consumers (our data scientists and ML engineers, among others) to define and create appropriate data models within these solutions.

Figure 1.1: A diagram showing the relationships between data science, ML engineering, and data engineering.

As mentioned previously, this book focuses on the work of the ML engineer and how you can learn some of the skills useful for that role, but it is important to remember that you will not be working in a vacuum. Always keep in mind the profiles of the other roles (and many more not covered here that will exist in your project team) so that you work most effectively together. Data is a team sport after all!

Now that you understand the key roles in a modern data team and how they cover the spectrum of activities required to build successful ML products, let’s look at how you can put them together to work efficiently and effectively.

Working as an effective team

In modern software organizations, there are many different methodologies to organize teams and get them to work effectively together. We will cover some of the project management methodologies that are relevant in Chapter 2, The Machine Learning Development Process, but in the meantime, this section will discuss some important points you should consider if you are ever involved in forming a team, or even if you just work as part of a team, that will help you become an effective teammate or lead.

First, always bear in mind that nobody can do everything. You can find some very talented people out there, but do not ever think one person can do everything you will need to the level you require. This is not just unrealistic; it is bad practice and will negatively impact the quality of your products. Even when you are severely resource-constrained, the key is for your team members to have a laser-like focus to succeed.

Second, blended is best. We all know the benefits of diversity for organizations and teams in general and this should, of course, apply to your ML team as well. Within a project, you will need mathematics, code, engineering, project management, communication, and a variety of other skills to succeed. So, given the previous point, make sure you cover this in at least some sense across your team.

Third, tie your team structure to your projects in a dynamic way. If you work on a project that is mostly about getting the data in the right place and the actual ML models are really simple, focus your team profile on the engineering and data modeling aspects. If the project requires a detailed understanding of the model, and it is quite complex, then reposition your team to make sure this is covered. This is just sensible and frees up team members who would otherwise have been underutilized to work on other projects.

As an example, suppose that you have been tasked with building a system that classifies customer data as it comes into your shiny new data lake, and the decision has been taken that this should be done at the point of ingestion via a streaming application. The classification model has already been built as part of another project. It is already clear that this solution will heavily involve the skills of the data engineer and the ML engineer, but not so much the data scientist, since that portion of the work will have been completed in another project.

In the next section, we will look at some important points to consider when deploying your team on a real-world business problem.

ML engineering in the real world

The majority of us who work in ML, analytics, and related disciplines do so for organizations with a variety of different structures and motives. These could be for-profit corporations, not-for-profits, charities, or public sector organizations like governments or universities. In pretty much all of these cases, we do not do this work in a vacuum, nor with an infinite budget of time or resources. It is important, therefore, that we consider some of the important aspects of doing this type of work in the real world.

First of all, the ultimate goal of your work is to generate value. This can be calculated and defined in a variety of ways, but fundamentally your work has to improve something for the company or its customers in a way that justifies the investment put in. This is why most companies will not be happy for you to take a year to play with new tools and then generate nothing concrete to show for it, or to spend your days only reading the latest papers. Yes, these things are part of any job in technology, and they can definitely be super-fun, but you have to be strategic about how you spend your time and always be aware of your value proposition.

Secondly, to be a successful ML engineer in the real world, you cannot just understand the technology; you must understand the business. You will have to understand how the company works day to day, you will have to understand how the different pieces of the company fit together, and you will have to understand the people of the company and their roles. Most importantly, you have to understand the customer, both of the business and your work. If you do not know the motivations, pains, and needs of the people you build for, then how can you be expected to build the right thing?

Finally, and this may be controversial, the most important skill for you to become a successful ML engineer in the real world is one that this book will not teach you, and that is the ability to communicate effectively. You will have to work in a team, with a manager, with the wider community and business, and, of course, with your customers, as mentioned above. If you can do this and you know the technology and techniques (many of which are discussed in this book), then what can stop you?

But what kinds of problems can you solve with ML when you work in the real world? Well, let’s start with another potentially controversial statement: a lot of the time, ML is not the answer. This may seem strange given the title of this book, but it is just as important to know when not to apply ML as when to apply it. This will save you tons of expensive development time and resources.

ML is ideal for cases when you want to do a semi-routine task faster, with more accuracy, or at a far larger scale than is possible with other solutions.

Some typical examples are given in the following table, along with some discussion as to whether or not ML would be an appropriate tool to solve the problem:

Requirement | Is ML Appropriate? | Details
--- | --- | ---
Anomaly detection of energy pricing signals. | Yes | You will want to do this on large numbers of points on potentially varying time signals.
Improving data quality in an ERP system. | No | This sounds more like a process problem. You can try and apply ML to this but often it is better to make the data entry process more automated or the process more robust.
Forecasting item consumption for a warehouse. | Yes | ML will be able to do this more accurately than a human can, so this is a good area of application.
Summarizing data for business reviews. | Maybe | This can be required at scale but it is not an ML problem – simple queries against your data will do.

Table 1.1: Potential use cases for ML.

As this table of simple examples hopefully starts to make clear, the cases where ML is the answer are ones that can usually be very well framed as a mathematical or statistical problem. After all, this is what ML really is – a series of algorithms rooted in mathematics that can iterate some internal parameters based on data. Where the lines start to blur in the modern world are through advances in areas such as deep learning or reinforcement learning, where problems that we previously thought would be very hard to phrase appropriately for standard ML algorithms can now be tackled.
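That definition of ML, an algorithm iterating internal parameters based on data, can be made concrete with a few lines of plain Python: a one-parameter model fitted by gradient descent. The toy data and learning rate are invented for illustration:

```python
# Fit y = w * x to toy data by gradient descent: the "internal parameter"
# w is iteratively updated based on the data, which is the essence of what
# an ML algorithm does.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x

w = 0.0  # initial parameter value
learning_rate = 0.01
for _ in range(500):
    # Gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

print(round(w, 2))  # converges to the least-squares value, approximately 1.99
```

Everything from linear regression to deep neural networks follows this same pattern; the differences lie in the number of parameters, the form of the model, and the sophistication of the update rule.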

The other tendency to watch out for in the real world (to go along with let’s use ML for everything) is the worry that people have about ML coming for their job and that it should not be trusted. This is understandable: a report by PwC in 2018 suggested that 30% of UK jobs will be impacted by automation by the 2030s (Will Robots Really Steal Our Jobs?: https://www.pwc.co.uk/economic-services/assets/international-impact-of-automation-feb-2018.pdf). What you have to try and make clear when working with your colleagues and customers is that what you are building is there to supplement and augment their capabilities, not to replace them.

Let’s conclude this section by revisiting an important point: the fact that you work for a company means, of course, that the aim of the game is to create value appropriate to the investment. In other words, you need to show a good Return on Investment (ROI). This means a couple of things for you practically:

- You have to understand how different designs require different levels of investment. If you can solve your problem by training a deep neural net on a million images with a GPU running 24/7 for a month, or you know you can solve the same problem with some basic clustering and a few statistics on some standard hardware in a few hours, which should you choose?
- You have to be clear about the value you will generate. This means you need to work with experts and try to translate the results of your algorithm into actual dollar values. This is so much more difficult than it sounds, so you should take the time you need to get it right. And never, ever over-promise. You should always under-promise and over-deliver.

Adoption is not guaranteed. Even when building products for your colleagues within a company, it is important to understand that your solution will be tested every time someone uses it post-deployment. If you build shoddy solutions, then people will not use them, and the value proposition of what you have done will start to disappear.

Now that you understand some of the important points when using ML to solve business problems, let’s explore what these solutions can look like.

What does an ML solution look like?

When you think of ML engineering, you would be forgiven for defaulting to imagining working on voice assistance and visual recognition apps (I fell into this trap in previous pages – did you notice?). The power of ML, however, lies in the fact that wherever there is data and an appropriate problem, it can help and be integral to the solution.

Some examples might help make this clearer. When you type a text message and your phone suggests the next words, it can very often be using a natural language model under the hood. When you scroll any social media feed or watch a streaming service, recommendation algorithms are working double time. If you take a car journey and an app forecasts when you are likely to arrive at your destination, there is going to be some kind of regression at work. Your loan application often results in your characteristics and application details being passed through a classifier. These applications are not the ones shouted about on the news (perhaps with the exception of when they go horribly wrong), but they are all examples of brilliantly put-together ML engineering.

In this book, the examples we will work through will be more like these – typical scenarios for ML encountered in products and businesses every day. These are solutions that, if you can build them confidently, will make you an asset to any organization.

We should start by considering the broad elements that should constitute any ML solution, as indicated in the following diagram:

Figure 1.2: Schematic of the general components or layers of any ML solution and what they are responsible for.

Your storage layer constitutes the endpoint of the data engineering process and the beginning of the ML one. It includes your data for training, your results from running your models, your artifacts, and important metadata. We can also consider this as the layer including your stored code.

The compute layer is where the magic happens and where most of the focus of this book will be. It is where training, testing, prediction, and transformation all (mostly) happen. This book is all about making this layer as well engineered as possible and interfacing with the other layers.

You can break this layer down to incorporate these pieces as shown in the following workflow:

Figure 1.3: The key elements of the compute layer.

IMPORTANT NOTE

The details are discussed later in the book, but this highlights the fact that at a fundamental level, your compute processes for any ML solution are really just about taking some data in and pushing some data out.
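At its simplest, that data-in/data-out view of the compute layer can be sketched as a single function. The threshold “model” below is a trivial placeholder, not a real trained model, and the record schema is invented for illustration:

```python
from typing import Any


def compute_layer(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """A caricature of the compute layer: take data in, transform it,
    apply a model, and push results out.
    """
    results = []
    for record in records:
        features = {"value": record["value"]}     # transformation step
        prediction = int(features["value"] > 10)  # placeholder "model" step
        results.append({**record, "prediction": prediction})
    return results


print(compute_layer([{"id": 1, "value": 5}, {"id": 2, "value": 15}]))
```

In a real system, the transformation and model steps would be substantial pipelines in their own right, but the overall shape of the compute layer stays the same.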

The application layer is where you share your ML solution’s results with other systems. This could be through anything from application database insertion to API endpoints, message queues, or visualization tools. This is the layer through which your customer eventually gets to use the results, so you must engineer your system to provide clean and understandable outputs, something we will discuss later.

And that is it in a nutshell. We will go into detail about all of these layers and points later, but for now, just remember these broad concepts and you will start to understand how all the detailed technical pieces fit together.

Why Python?

Before moving on to more detailed topics, it is important to discuss why Python has been selected as the programming language for this book. Everything that follows that pertains to higher-level topics, such as architecture and system design, can be applied to solutions using any or multiple languages, but Python has been singled out here for a few reasons.

Python is colloquially known as the lingua franca of data. It is an interpreted, dynamically typed, multi-paradigm programming language with a clear and simple syntax. Its tooling ecosystem is also extensive, especially in the analytics and ML space.

Packages such as scikit-learn, numpy, scipy, and a host of others form the backbone of a huge amount of technical and scientific development across the world. Almost every major new software library for use in the data world has a Python API. It is the most popular programming language in the world, according to the TIOBE index (https://www.tiobe.com/tiobe-index/) at the time of writing (August 2023).

Given this, being able to build your systems using Python means you will be able to leverage all of the excellent ML and data science tools available in this ecosystem, while also ensuring that you build applications that can play nicely with other software.

High-level ML system design

When you get down to the nuts and bolts of building your solution, there are so many options for tools, tech, and approaches that it can be very easy to be overwhelmed. However, as alluded to in the previous sections, a lot of this complexity can be abstracted to understand the bigger picture via some back-of-the-envelope architecture and designs. This is always a useful exercise once you know what problem you will try and solve, and it is something I recommend doing before you make any detailed choices about implementation.

To give you an idea of how this works in practice, what follows are a few worked-through examples where a team has to create a high-level ML systems design for some typical business problems. These problems are similar to ones I have encountered before and will likely be similar to ones you will encounter in your own work.

Example 1: Batch anomaly detection service

You work for a tech-savvy taxi ride company with a fleet of thousands of cars. The organization wants to start making ride times more consistent and understand longer journeys in order to improve the customer experience and, thereby, increase retention and return business. Your ML team is employed to create an anomaly detection service to find rides that have unusual ride time or ride length behaviors. You all get to work, and your data scientists find that if you perform clustering on sets of rides using the features of ride distance and time, you can clearly identify outliers worth investigating by the operations team. The data scientists present the findings to the CTO and other stakeholders before getting the go-ahead to develop this into a service that will provide an outlier flag as a new field in one of the main tables of the company’s internal analysis tool.

In this example, we will simulate some data to show how the taxi company’s data scientists could proceed. In the repository for the book, which can be found at https://github.com/PacktPublishing/Machine-Learning-Engineering-with-Python-Second-Edition, if you navigate to the folder Chapter01, you will see a script called clustering_example.py. If you have activated the conda environment provided via the mlewp-chapter01.yml environment file, then you can run this script with:

python3 clustering_example.py

After a successful run you should see that three files are created: taxi-rides.csv, taxi-labels.json, and taxi-rides.png. The image in taxi-rides.png should look something like that shown in Figure 1.4.

We will walk through how this script is built up:

First, let’s define a function that will simulate some ride distances based on the random distribution given in numpy and return a numpy array containing the results. The reason for the repeated lines is so that we can create some base behavior and anomalies in the data, and you can clearly compare against the speeds we will generate for each set of taxis in the next step: import numpy as np from numpy.random import MT19937 from numpy.random import