
Machine Learning Engineering with Python E-Book

Andrew P. McMahon


Publisher: Packt Publishing
Category: Science and New Technologies
Language: English

Description

Machine learning engineering is a thriving discipline at the interface of software development and machine learning. This book will help developers working with machine learning and Python to put their knowledge to work and create high-quality machine learning products and services.

Machine Learning Engineering with Python takes a hands-on approach to help you get to grips with essential technical concepts, implementation patterns, and development methodologies to have you up and running in no time. You'll begin by understanding key steps of the machine learning development life cycle before moving on to practical illustrations and getting to grips with building and deploying robust machine learning solutions. As you advance, you'll explore how to create your own toolsets for training and deployment across all your projects in a consistent way. The book will also help you get hands-on with deployment architectures and discover methods for scaling up your solutions while building a solid understanding of how to use cloud-based tools effectively. Finally, you'll work through examples to help you solve typical business problems.

By the end of this book, you'll be able to build end-to-end machine learning services using a variety of techniques and design your own processes for consistently performant machine learning engineering.

Details

You can read this e-book in the Legimi apps or in any app that supports the following format:

EPUB

Page count: 318

Year of publication: 2021


Sample

Machine Learning Engineering with Python

Manage the production life cycle of machine learning models using MLOps with practical examples

Andrew P. McMahon

BIRMINGHAM—MUMBAI

Machine Learning Engineering with Python

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Ali Abidi

Senior Editor: David Sugarman

Content Development Editor: Nathanya Dias

Technical Editor: Sonam Pandey

Copy Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Sejal Dsilva

Production Designer: Jyoti Chauhan

First published: November 2021

Production reference: 1280921

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80107-925-9

www.packt.com

Contributors

About the author

Andrew Peter (Andy) McMahon is a machine learning engineer and data scientist with experience of working in, and leading, successful analytics and software teams. His expertise centers on building production-grade ML systems that can deliver value at scale. He is currently ML Engineering Lead at NatWest Group and was previously Analytics Team Lead at Aggreko.

He has an undergraduate degree in theoretical physics from the University of Glasgow, as well as master's and Ph.D. degrees in condensed matter physics from Imperial College London. In 2019, Andy was named Data Scientist of the Year at the International Data Science Awards. He currently co-hosts the AI Right podcast, discussing hot topics in AI with other members of the Scottish tech scene.

This book, and everything I've ever achieved, would not have been possible without a lot of people. I wish to thank my mum, for introducing me to science and science fiction, my dad, for teaching me not to have any regrets, and my wider family and friends for making my life full of laughter. Most of all, I want to thank my wife, Hayley, and my son, Teddy, for being my sunshine every single day and giving me a reason to keep pushing myself to be the best I can be.

About the reviewers

Daksh Trehan began his career as a data analyst. His love for data and statistics is unimaginable. Various statistical techniques introduced him to the world of ML and data science. While his focus is on being a data analyst, he loves to forecast given data using ML techniques. He understands the power of data in today's world and constantly tries to change the world using various ML techniques and his concrete data visualization skills. He loves to write articles on ML and AI, and these have bagged him more than 100,000 views to date. He has also contributed as an ML consultant to 365 Days as a TikTok creator, written by Dr. Markus Rach, which is available publicly on the Amazon e-book store.

Ved Prakash Upadhyay is an experienced machine learning professional. He did his master's in information science at the University of Illinois Urbana-Champaign. Currently, he is working at IQVIA as a senior machine learning engineer. His work focuses on building recommendation systems for various pharma clients of IQVIA. He has strong experience with productionalizing machine learning pipelines and is skilled with the different tools that are used in the industry. Furthermore, he has acquired an in-depth conceptual knowledge of machine learning algorithms. IQVIA is a leading global provider of advanced analytics, technology solutions, and clinical research services to the life sciences industry.

Michael Petrey is a data scientist with a background in education and consulting. He holds a master's in analytics from Georgia Tech and loves using data visualization and analysis to get people the best tools for their jobs. You might find Michael on a hike near Atlanta, eating ice cream in Boston, or at a café in Wellington.

Preface

Section 1: What Is ML Engineering?

Chapter 1: Introduction to ML Engineering

Technical requirements

Defining a taxonomy of data disciplines

Data scientist

ML engineer

Data engineer

Assembling your team

ML engineering in the real world

What does an ML solution look like?

Why Python?

High-level ML system design

Example 1: Batch anomaly detection service

Example 2: Forecasting API

Example 3: Streamed classification

Summary

Chapter 2: The Machine Learning Development Process

Technical requirements

Setting up our tools

Setting up an AWS account

Concept to solution in four steps

Discover

Play

Develop

Deploy

Summary

Section 2: ML Development and Deployment

Chapter 3: From Model to Model Factory

Technical requirements

Defining the model factory

Designing your training system

Training system design options

Train-run

Train-persist

Retraining required

Detecting drift

Engineering features for consumption

Engineering categorical features

Engineering numerical features

Learning about learning

Defining the target

Cutting your losses

Hierarchies of automation

Optimizing hyperparameters

AutoML

Auto-sklearn

Persisting your models

Building the model factory with pipelines

Scikit-learn pipelines

Spark ML pipelines

Summary

Chapter 4: Packaging Up

Technical requirements

Writing good Python

Recapping the basics

Tips and tricks

Adhering to standards

Writing good PySpark

Choosing a style

Object-oriented programming

Functional programming

Packaging your code

Why package?

Selecting use cases for packaging

Designing your package

Building your package

Testing, logging, and error handling

Testing

Logging

Error handling

Not reinventing the wheel

Summary

Chapter 5: Deployment Patterns and Tools

Technical requirements

Architecting systems

Exploring the unreasonable effectiveness of patterns

Swimming in data lakes

Microservices

Event-based designs

Batching

Containerizing

Hosting your own microservice on AWS

Pushing to ECR

Hosting on ECS

Creating a load balancer

Pipelining 2.0

Revisiting CI/CD

Summary

Chapter 6: Scaling Up

Technical requirements

Scaling with Spark

Spark tips and tricks

Spark on the cloud

Spinning up serverless infrastructure

Containerizing at scale with Kubernetes

Summary

Section 3: End-to-End Examples

Chapter 7: Building an Example ML Microservice

Technical requirements

Understanding the forecasting problem

Designing our forecasting service

Selecting the tools

Executing the build

Training pipeline and forecaster

Training and forecast handlers

Summary

Chapter 8: Building an Extract Transform Machine Learning Use Case

Technical requirements

Understanding the batch processing problem

Designing an ETML solution

Selecting the tools

Interfaces

Scaling of models

Scheduling of ETML pipelines

Executing the build

Not reinventing the wheel in practice

Using the Gitflow workflow

Injecting some engineering practices

Other Books You May Enjoy

Preface

Machine Learning (ML) is rightfully recognized as one of the most powerful tools available for organizations to extract value from their data. As the capabilities of ML algorithms have grown over the years, it has become increasingly obvious that implementing them in a scalable, fault-tolerant, and automated way is a discipline in its own right. This discipline, ML engineering, is the focus of this book.

The book covers a wide variety of topics in order to help you understand the tools, techniques, and processes you can apply to engineer your ML solutions, with an emphasis on introducing the key concepts so that you can build on them in your own work. Much of what we will cover will also help you maintain and monitor your solutions, the purview of the closely related discipline of Machine Learning Operations (MLOps).

All the code examples are given in Python, the most popular programming language for data applications. Python is a high-level and object-oriented language with a rich ecosystem of tools focused on data science and ML. Packages such as scikit-learn and pandas often form the backbone of ML modeling code in data science teams across the world. In this book, we will also use these tools but discuss how to wrap them up in production-grade pipelines and deploy them using appropriate cloud and open source tools. We will not spend a lot of time on how to build the best ML model, though some of the tools covered will certainly help with that. Instead, the aim is to understand what to do after you have an ML model.

Many of the examples in the book will leverage services and solutions from Amazon Web Services (AWS). I believe that the accompanying explanations and discussions will, however, mean that you can still apply everything you learn here to any cloud provider or even in an on-premises setting.

Machine Learning Engineering with Python will help you to navigate the challenges of taking ML to production and give you the confidence to start applying MLOps in your organizations.

Who this book is for

This book is for ML engineers, data scientists, and software developers who want to build robust software solutions with ML components. It is also relevant to anyone who manages or wants to understand the production life cycle of these systems. The book assumes intermediate-level knowledge of Python. Basic knowledge of AWS and Bash will also be beneficial.

What this book covers

Chapter 1, Introduction to ML Engineering, explains what we mean by ML engineering and how this relates to the disciplines of data science and data engineering. It covers what you need to do to build an effective ML engineering team, as well as what real software solutions containing ML can look like.

Chapter 2, The Machine Learning Development Process, explores a development process that will be applicable to almost any ML engineering project. It discusses how you can set your development tooling up for success for later chapters as well.

Chapter 3, From Model to Model Factory, teaches you how to build solutions that train multiple ML models during the product life cycle. It also covers drift detection and pipelining to help you start to build out your MLOps practices.

Chapter 4, Packaging Up, discusses best practices for coding in Python and how this relates to building your own packages and libraries for reuse in multiple projects.

Chapter 5, Deployment Patterns and Tools, teaches you some of the standard ways you can get your ML system into production. In particular, the chapter will focus on hosting solutions in the cloud.

Chapter 6, Scaling Up, teaches you how to take your solutions and scale them to massive datasets or large numbers of prediction requests using Apache Spark and serverless infrastructure.

Chapter 7, Building an Example ML Microservice, walks through how to use what you have learned elsewhere in the book to build a forecasting service that can be triggered via an API.

Chapter 8, Building an Extract Transform Machine Learning Use Case, walks through how to use what you have learned to build a pipeline that performs batch processing. We do this by adding a lot of our newly acquired ML engineering best practices to the simple package created in Chapter 4, Packaging Up.

To get the most out of this book

To get the most out of the examples in the book, you will need access to a computer or server where you have privileges to install and run Python and Apache Spark applications. For many of the examples, you will also require access to a terminal, such as Bash. The examples in the book were built on a Linux machine running Bash so you may need to translate some pieces for your operating system and terminal. For some examples using AWS, you will require an account where you can enable billing. Examples in the book used Apache Spark v3.0.2.

In Chapter 5, Deployment Patterns and Tools, we use the Managed Workflows for Apache Airflow (MWAA) service from AWS. There is no free tier option for MWAA, so as soon as you spin up the example, you will be charged for the environment and any instances. Make sure you are happy to do this before proceeding, and I recommend shutting down your MWAA environment when you have finished.

In Chapter 7, Building an Example ML Microservice, we build out a use case leveraging the AWS Forecast service, which is only available in a subset of AWS Regions. To check the availability in your Region, and what Regions you can switch to for that example, you can use https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/.

Technical requirements are given in most of the chapters, but to support this, there are Conda environment .yml files provided in the book repository: https://github.com/PacktPublishing/Machine-Learning-Engineering-with-Python.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Machine-Learning-Engineering-with-Python. If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801079259_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

html, body, #map {
  height: 100%;
  margin: 0;
  padding: 0
}

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows:

$ mkdir css
$ cd css

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select System info from the Administration panel."

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Machine Learning Engineering with Python, we'd love to hear your thoughts! Please go to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.

Section 1: What Is ML Engineering?

The objective of this section is to provide a discussion of what activities could be classed as ML engineering and how this constitutes an important element of using data to generate value in organizations. You will also be introduced to an example software development process that captures the key aspects required in any successful ML engineering project.

This section comprises the following chapters:

Chapter 1, Introduction to ML Engineering

Chapter 2, The Machine Learning Development Process

Chapter 1: Introduction to ML Engineering

Welcome to Machine Learning Engineering with Python, a book that aims to introduce you to the exciting world of making Machine Learning (ML) systems production-ready.

This book will take you through a series of chapters covering training systems, scaling up solutions, system design, model tracking, and a host of other topics, to prepare you for your own work in ML engineering or to work with others in this space. No book can be exhaustive on this topic, so this one will focus on concepts and examples that I think cover the foundational principles of this increasingly important discipline.

You will get a lot from this book even if you do not run the technical examples, or even if you try to apply the main points in other programming languages or with different tools. In covering the key principles, the aim is that you come away from this book feeling more confident in tackling your own ML engineering challenges, whatever your chosen toolset.

In this first chapter, you will learn about the different types of data role relevant to ML engineering and how to distinguish them; how to use this knowledge to build and work within appropriate teams; some of the key points to remember when building working ML products in the real world; how to start to isolate appropriate problems for engineered ML solutions; and how to create your own high-level ML system designs for a variety of typical business problems.

We will cover all of these aspects in the following sections:

Defining a taxonomy of data disciplines

Assembling your team

ML engineering in the real world

What does an ML solution look like?

High-level ML system design

Now that we have explained what we are going after in this first chapter, let's get started!

Technical requirements

Throughout the book, we will assume that Python 3 is installed and working. The following Python packages are used in this chapter:

Scikit-learn 0.23.2

NumPy

pandas

imblearn

Prophet 0.7.1
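If you want to confirm what is installed before running the examples, the following is a minimal sketch that reports package versions using only the standard library. The distribution names are my assumptions rather than something stated in this chapter: imblearn is looked up as imbalanced-learn and Prophet 0.7.1 as fbprophet, which is how I believe they were published on PyPI at the time. The check_environment helper is hypothetical and not part of the book repository.

```python
from importlib import metadata

# Assumed PyPI distribution names for the packages listed in this chapter.
# A value of None means the chapter does not pin a version.
REQUIRED = {
    "scikit-learn": "0.23.2",
    "numpy": None,
    "pandas": None,
    "imbalanced-learn": None,  # assumption: distribution behind "imblearn"
    "fbprophet": "0.7.1",      # assumption: name used by the Prophet 0.7.x series
}


def check_environment(required=REQUIRED):
    """Return {distribution: (installed_version_or_None, pinned_version_or_None)}."""
    report = {}
    for name, pinned in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # not installed in this environment
        report[name] = (installed, pinned)
    return report


if __name__ == "__main__":
    for name, (installed, pinned) in check_environment().items():
        status = installed or "not installed"
        wanted = f" (chapter used {pinned})" if pinned else ""
        print(f"{name}: {status}{wanted}")
```

The Conda environment files in the book repository remain the authoritative source for exact versions; this check is only a quick sanity test.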

Defining a taxonomy of data disciplines

The explosion of data and the potential applications of that data over the past few years have led to a proliferation of job roles and responsibilities. The debate that once raged over how a data scientist was different from a statistician has now become extremely complex. I would argue, however, that it does not have to be so complicated. The activities that have to be undertaken to get value from data are pretty consistent, no matter what business vertical you are in, so it should be reasonable to expect that the skills and roles you need to perform these steps will also be relatively consistent. In this chapter, we will explore some of the main data disciplines that I think you will always need in any data project. As you can guess, given the name of this book, I will be particularly keen to explore the notion of ML engineering and how this fits into the mix.

Let's now look at some of the roles involved in using data in the modern landscape.

Data scientist

Since the Harvard Business Review declared that being a data scientist was The Sexiest Job of the 21st Century (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century), this title has become one of the most sought after, but also hyped, in the mix. A data scientist can cover an entire spectrum of duties, skills, and responsibilities depending on the business vertical, the organization, or even just personal preference. No matter how this role is defined, however, there are some key areas of focus that should always be part of the data scientist's job profile:

Analysis: A data scientist should be able to wrangle, munge, manipulate, and consolidate datasets before performing calculations on that data that help us to understand it. Analysis is a broad term, but it's clear that the end result is knowledge of your dataset that you didn't have before you started, no matter how basic or complex.

Modeling: The thing that gets everyone excited (potentially including you, dear reader) is the idea of modeling data. A data scientist usually has to be able to apply statistical, mathematical, and machine learning models to data in order to explain it or perform some sort of prediction.

Working with the customer or user: The data science role usually has some more business-directed elements so that the results of the analysis and modeling work can support decision making in the organization. This could be done by presenting the results of analysis in PowerPoint slides or Jupyter notebooks, or even by sending an email with a summary of the key results. It involves communication and business acumen in a way that goes beyond classic tech roles.

ML engineer

A newer kid on the block, and indeed the subject of this book, is the ML engineer. This role has risen to fill the perceived gap between the analysis and modeling of data science and the world of software products and robust systems engineering.

You can articulate the need for this type of role quite nicely by considering a classic voice assistant. In this case, a data scientist would usually focus on translating the business requirements into a working speech-to-text model, potentially a very complex neural network, and showing that it can perform the desired voice transcription task in principle. ML engineering is then all about how you take that speech-to-text model and build it into a product, service, or tool that can be used in production. Here, it may mean building some software to train, retrain, deploy, and track the performance of the model as more transcription data is accumulated, or user preferences are understood. It may also involve understanding how to interface with other systems and how to provide results from the model in the appropriate formats, for example, interacting with an online store.

Data scientists and ML engineers have many overlapping skills and competencies but different areas of focus and strengths (more on this later). They will usually be part of the same project team, and a person may hold either title, but it will be clear which hat they are wearing from what they do in that project.

Similar to the data scientist, we can define the key areas of focus for the ML engineer:

Translation: Taking models and research code in a variety of formats and translating this into slicker, more robust pieces of code. This could be done using OO programming, functional programming, a mix, or something else, but basically helps to take the proof-of-concept work of the data scientist and turn it into something that is far closer to being trusted in a production environment.

Architecture: Deployments of any piece of software do not occur in a vacuum and will always involve lots of integrated parts. This is true of machine learning solutions as well. The ML engineer has to understand how the appropriate tools and processes link together so that the models built with the data scientist can do their job and do it at scale.

Productionization: The ML engineer is focused on delivering a solution and so should understand the customer's requirements inside out, as well as be able to understand what that means for the project development. The end goal of the ML engineer is not to provide a good model (though that is part of it), nor is it to provide something that basically works. Their job is to make sure that the hard work on the data science side of things generates the maximum potential value in a real-world setting.
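To illustrate the translation point, here is a minimal sketch of how a few lines of notebook-style proof-of-concept code might be wrapped in a small class with input checks. The ZScoreDetector class, its threshold parameter, and the z-score rule itself are hypothetical choices of mine for illustration; they are not taken from the book's examples.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Optional

# Notebook-style proof of concept (shown as comments): it works, but it
# depends on global state and performs no input validation.
#   mu = mean(values); sigma = stdev(values)
#   flags = [abs(v - mu) / sigma > 3 for v in values]


@dataclass
class ZScoreDetector:
    """Flags values whose z-score exceeds a threshold (illustrative only)."""

    threshold: float = 3.0
    mu: Optional[float] = None
    sigma: Optional[float] = None

    def fit(self, values):
        """Estimate the mean and standard deviation from reference values."""
        if len(values) < 2:
            raise ValueError("need at least two values to estimate spread")
        self.mu = mean(values)
        self.sigma = stdev(values)
        if self.sigma == 0:
            raise ValueError("values have zero variance")
        return self

    def predict(self, values):
        """Return one boolean per value: True when it looks anomalous."""
        if self.mu is None or self.sigma is None:
            raise RuntimeError("call fit() before predict()")
        return [abs(v - self.mu) / self.sigma > self.threshold for v in values]


detector = ZScoreDetector(threshold=2.0).fit([10, 11, 9, 10, 10, 11, 9])
flags = detector.predict([10, 50])  # 10 is typical, 50 is far from the mean
```

The logic is unchanged from the proof of concept; what the class adds is explicit state, validation, and a fit/predict interface that other code can rely on.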

Data engineer

The most important people in any data team (in my opinion) are the people who are responsible for getting the commodity that everything else in the preceding sections is based on from A to B with high fidelity, appropriate latency, and with as little effort on the part of the other team members as possible. You cannot create any type of software product, never mind a machine learning product, without data.

The key areas of focus for a data engineer are as follows:

Quality: Getting data from A to B is a pointless exercise if the data is garbled, fields are missing, or IDs are screwed up. The data engineer cares about avoiding this and uses a variety of techniques and tools, generally to ensure that the data that left the source system is what lands in your data storage layer.

Stability: Similar to the previous point on quality, if the data comes from A to B but it only does it every second Wednesday if it's not a rainy day, then what's the point? Data engineers spend a lot of time and effort and use their considerable skills to ensure that data pipelines are robust, reliable, and can be trusted to deliver when promised.

Access: Finally, the aim of getting the data from A to B is for it to be used by applications, analyses, and machine learning models, so the nature of the B is important. The data engineer will have a variety of technologies to hand for surfacing data and should work with the data consumers (our data scientists and machine learning engineers, among others) to define and create appropriate data models within these solutions.

Figure 1.1 – A diagram showing the relationships between data science, ML engineering, and data engineering
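The quality point, checking that the data that left the source system is what lands in the storage layer, can be sketched as a comparison of row counts and content fingerprints. This is only one possible approach under simple assumptions (records fit in memory and are JSON-serializable dictionaries); the fingerprint and landed_intact functions are hypothetical names, not tools discussed in the book.

```python
import hashlib
import json


def fingerprint(records):
    """Return (row_count, digest) for an iterable of dict records.

    The digest is order-independent: each row is hashed on its own and the
    per-row hashes are sorted before being combined.
    """
    row_hashes = sorted(
        hashlib.sha256(
            json.dumps(row, sort_keys=True, default=str).encode("utf-8")
        ).hexdigest()
        for row in records
    )
    combined = hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()
    return len(row_hashes), combined


def landed_intact(source_records, landed_records):
    """True when both sides contain the same rows, regardless of order."""
    return fingerprint(source_records) == fingerprint(landed_records)


source = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 3.0}]
landed_ok = [{"amount": 3.0, "id": 2}, {"id": 1, "amount": 10.5}]
landed_bad = [{"id": 1, "amount": 10.5}, {"id": 2}]  # a field went missing

ok = landed_intact(source, landed_ok)
bad = landed_intact(source, landed_bad)
```

In practice, a check like this would run inside the pipeline itself, and real data platforms usually offer more scalable ways to do it, but the principle of comparing what left with what arrived is the same.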

As mentioned previously, this book focuses on the work of the ML engineer and how you can learn some of the skills useful for that role, but it is always important to remember that you will not be working in a vacuum. Always keep in mind the profiles of the other roles (and many more not covered here that will exist in your project team) so that you work most effectively together. Data is a team sport after all!

Assembling your team

There are no set rules about how you should pull together a team for your machine learning project, but there are some good general principles to follow, and gotchas to avoid.

First, always bear in mind that unicorns do not exist. You can find some very talented people out there, but do not ever think one person can do everything you will need to the level you require. This is not just a bit unrealistic; it is bad practice and will negatively impact the quality of your products. Even when you are severely resource-constrained, the key is for your team members to have a laser-like focus to succeed.

Secondly, blended is best. We all know the benefits of diversity for organizations and teams in general and this should, of course, apply to your machine learning team as well. Within a project, you will need the mathematics, the code, the engineering, the project management, the communication, and a variety of other skills to succeed. So, given the previous point, make sure you cover this in at least some sense across your team.

Third, tie your team structure to your projects in a dynamic way. If you are working on a project that is mostly about getting the data in the right place and the actual machine learning models are really simple, focus your team profile on the engineering and data modeling aspects. If the project requires a detailed understanding of the model, and it is quite complex, then reposition your team to make sure this is covered. This is just sensible and frees up team members who would otherwise have been underutilized to work on other projects.

As an example, suppose that you have been tasked with building a system that classifies customer data as it comes into your shiny new data lake, and the decision has been taken that this should be done at the point of ingestion via a streaming application. The classification has already been built for another project. It is already clear that this solution will heavily involve the skills of the data engineer and the ML engineer, but not so much the data scientist since that portion of work has been completed in another project.

In the next section, we will look at some important points to consider when deploying your team on a real-world business problem.

ML engineering in the real world

The majority of us who work in machine learning, analytics, and related disciplines do so for for-profit companies. It is important therefore that we consider some of the important aspects of doing this type of work in the real world.

First of all, the ultimate goal of your work is to generate value. This can be calculated and defined in a variety of ways, but fundamentally your work has to improve something for the company or their customers in a way that justifies the investment put in. This is why most companies will not be happy for you to take a year to play with new tools and then generate nothing concrete to show for it (not that you would do this anyway, it is probably quite boring) or to spend your days reading the latest papers and only reading the latest papers. Yes, these things are part of any job in technology, and especially any job in the world of machine learning, but you have to be strategic about how you spend your time and always be aware of your value proposition.

Secondly, to be a successful ML engineer in the real world, you cannot just understand the technology; you must understand the business. You will have to understand how the company works day to day, you will have to understand how the different pieces of the company fit together, and you will have to understand the people of the company and their roles. Most importantly, you have to understand the customer, both of the business and of your work. If you do not know the motivations, pains, and needs of the people you are building for, then how can you be expected to build the right thing?

Finally, and this may be controversial, the most important skill for you being a successful ML engineer in the real world is one that this book will not teach you, and that is the ability to communicate effectively. You will have to work in a team, with a manager, with the wider community and business, and, of course, with your customers, as mentioned above. If you can do this and you know the technology and techniques (many of which are discussed in this book), then what can stop you?

But what kind of problems can you solve with machine learning when you work in the real world? Well, let's start with another potentially controversial statement: a lot of the time, machine learning is not the answer. This may seem strange given the title of this book, but it is just as important to know when not to apply machine learning as when to apply it. This will save you tons of expensive development time and resources.

Machine learning is ideal for cases when you want to do a semi-routine task faster, with more accuracy, or at a far larger scale than is possible with other solutions. Some typical examples are given in the following table, along with some discussion as to whether or not ML would be an appropriate tool for solving the problem:

Figure 1.2 – Potential use cases for ML

As this table of simple examples hopefully starts to make clear, the cases where machine learning is the answer are ones that can usually be very well framed as a mathematical or statistical problem. After all, this is what machine learning really is: a series of algorithms rooted in mathematics that can iterate some internal parameters based on data. Where the lines start to blur in the modern world is in areas such as deep learning and reinforcement learning, where advances mean that problems we previously thought would be very hard to phrase appropriately for standard ML algorithms can now be tackled.
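To make the idea of "iterating some internal parameters based on data" concrete, here is a minimal sketch in plain Python. It is not taken from the book's later chapters; the toy data, the function name fit_line, and the choice of gradient descent on a mean squared error are all assumptions made purely for illustration.

```python
# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def fit_line(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # the internal parameters, starting from an arbitrary guess
    n = len(xs)
    for _ in range(n_steps):
        # How wrong is the current guess on each data point?
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Nudge the parameters in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = fit_line(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

The loop never sees the rule that generated the data; it only adjusts w and b until the predictions match the observations. Most ML algorithms follow this same pattern at a far larger scale.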

The other tendency to watch out for in the real world (to go along with "let's use ML for everything") is the worry that ML is coming for people's jobs and should not be trusted. This is understandable: a report by PwC in 2018 suggested that 30% of UK jobs will be impacted by automation by the 2030s (Will Robots Really Steal Our Jobs?: https://www.pwc.co.uk/economic-services/assets/international-impact-of-automation-feb-2018.pdf). What you have to try to make clear when working with your colleagues and customers is that what you are building is there to supplement and augment their capabilities, not to replace them.

Let's conclude this section by revisiting an important point: the fact that you are working for a company means, of course, that the aim of the game is to create value appropriate to the investment. In other words, you need to show a good Return On Investment (ROI). This means a couple of things for you practically:

You have to understand how different designs require different levels of investment. If you can solve your problem by training a deep neural net on a million images with a GPU running 24/7 for a month, or you know you can solve the same problem with some basic clustering and a bit of statistics on some standard hardware in a few hours, which should you choose?

You have to be clear about the value you will generate. This means you need to work with experts and try to translate the results of your algorithm into actual dollar values. This is so much more difficult than it sounds, so you should take the time you need to get it right. And never, ever over-promise. You should always under-promise and over-deliver.

Adoption is not guaranteed. Even when building products for your colleagues within a company, it is important to understand that your solution will be tested every time someone uses it post-deployment. If you build shoddy solutions, then people will not use them, and the value proposition of what you have done will start to disappear.
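The investment trade-off described above can be made concrete with a little arithmetic. The following is a rough sketch only: the DesignOption class, the cost categories, and every number in it are hypothetical and chosen for illustration, and ROI is taken here as (value - cost) / cost. In practice, you would work with domain experts to estimate these figures.

```python
from dataclasses import dataclass


@dataclass
class DesignOption:
    """A candidate solution design with hypothetical cost and value estimates."""
    name: str
    compute_hours: float
    cost_per_compute_hour: float
    engineering_days: float
    cost_per_engineering_day: float
    expected_annual_value: float

    def total_cost(self) -> float:
        return (self.compute_hours * self.cost_per_compute_hour
                + self.engineering_days * self.cost_per_engineering_day)

    def roi(self) -> float:
        """Return on investment, defined as (value - cost) / cost."""
        cost = self.total_cost()
        return (self.expected_annual_value - cost) / cost


# All figures below are made up; both designs are assumed to deliver the same value.
deep_net = DesignOption(
    name="deep neural net",
    compute_hours=24 * 30,          # a GPU running 24/7 for a month
    cost_per_compute_hour=3.0,
    engineering_days=40,
    cost_per_engineering_day=500.0,
    expected_annual_value=50_000.0,
)
clustering = DesignOption(
    name="clustering and statistics",
    compute_hours=4,                # a few hours on standard hardware
    cost_per_compute_hour=0.5,
    engineering_days=10,
    cost_per_engineering_day=500.0,
    expected_annual_value=50_000.0,
)

for option in (deep_net, clustering):
    print(f"{option.name}: cost={option.total_cost():.0f}, ROI={option.roi():.2f}")
```

If both designs really do solve the same problem, the simpler one wins comfortably under these assumed numbers. The hard part in practice is usually the value estimate, not the cost.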

Now that you understand some of the important points when using ML to solve business problems, let's explore what these solutions can look like.

What does an ML solution look like?

When you think of ML engineering, you would be forgiven for defaulting to imagining working on voice assistants and visual recognition apps (I fell into this trap in previous pages, did you notice?). The power of ML, however, lies in the fact that wherever there is data and an appropriate problem, it can help and be integral to the solution.

Some examples might help make this clearer. When you type a text message and your phone suggests the next words, it can very often be using a natural language model under the hood. When you scroll any social media feed or watch a streaming service, recommendation algorithms are working double time. If you take a car journey and an app forecasts when you are likely to arrive at your destination, there is going to be some kind of regression at work. Your loan application often results in your characteristics and application details being passed through a classifier. These applications are not the ones shouted about on the news (perhaps with the exception of when they go horribly wrong), but they are all examples of brilliantly put-together ML engineering.

In this book, the examples we work through will be more like these: typical scenarios for machine learning encountered in products and businesses every day. These are solutions that, if you can build them confidently, will make you an asset to any organization.

We should start by considering the broad elements that should constitute any ML solution, as indicated in the following diagram:

Figure 1.3 – Schematic of the general components or layers of any ML solution and what they are responsible for

Your storage layer constitutes the endpoint of the data engineering process and the beginning of the ML one. It includes your data for training, your results from running your models, your artifacts, and important metadata. We can also consider this as including your stored code.

The compute layer is where the magic happens and where most of the focus of this book will be. It is where training, testing, prediction, and transformation all (mostly) happen. This book is all about making this layer as well-engineered as possible and interfacing with the other layers. You can blow this layer up to incorporate these pieces as in the following workflow:

Figure 1.4 – The key elements of the compute layer

Important note

The details are discussed later in the book, but this highlights the fact that at a fundamental level, your compute processes for any ML solution are really just about taking some data in and pushing some data out.

The surfacing layer is where you share your ML solution's results with other systems. This could be through anything from application database insertion to API endpoints, to message queues, to visualization tools. This is the layer through which your customer eventually gets to use the results, so you must engineer your system to provide clean and understandable outputs, something we will discuss later.
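The storage, compute, and surfacing layers described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the architecture built later in the book: the in-memory dictionary stands in for a real data store, the threshold "model" stands in for a real algorithm, and all names (storage, train, predict, surface, run_pipeline) are hypothetical.

```python
# Storage layer: an in-memory dict standing in for a real data store.
storage = {
    "training_data": [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)],  # (value, label)
    "new_data": [1.5, 8.5],
}


def train(rows):
    """Compute layer: learn a threshold midway between the two class means."""
    zeros = [x for x, label in rows if label == 0]
    ones = [x for x, label in rows if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def predict(threshold, values):
    """Compute layer: label each value by comparing it with the threshold."""
    return [int(v > threshold) for v in values]


def surface(values, labels):
    """Surfacing layer: shape results into clean records for other systems."""
    return [{"value": v, "label": lab} for v, lab in zip(values, labels)]


def run_pipeline(store):
    threshold = train(store["training_data"])
    store["model"] = {"threshold": threshold}   # persist the model artifact
    labels = predict(threshold, store["new_data"])
    store["predictions"] = labels               # persist the results
    return surface(store["new_data"], labels)


results = run_pipeline(storage)
print(results)
```

Note how the compute functions only take data in and push data out, while reading from and writing to storage happens at the edges. Keeping the layers separate like this makes it easier to swap the dictionary for a database, or the returned records for an API response, without touching the modeling code.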

And that is it in a nutshell. We will go into detail about all of these layers and points later, but for now, just remember these broad concepts and you will start to understand how all the detailed technical pieces fit together.

Why Python?

Before moving on to more detailed topics, it is important to discuss why Python has been selected as the programming language for this book. Everything that follows that pertains to higher-level topics such as architecture and system design can be applied to solutions using any or multiple languages, but Python has been singled out here for a few reasons.

Python is colloquially known as the lingua franca of data. It is an interpreted, dynamically typed, multi-paradigm programming language with clear and simple syntax. Its tooling ecosystem is also extensive, especially in the analytics and machine learning space. Packages such as scikit-learn, numpy, scipy


Contributors

About the author

About the reviewers

Table of Contents

Preface

Section 1: What Is ML Engineering?

Chapter 1: Introduction to ML Engineering

Technical requirements

Defining a taxonomy of data disciplines

Data scientist

ML engineer

Data engineer

Assembling your team

ML engineering in the real world

What does an ML solution look like?

Why Python?

High-level ML system design

Example 1: Batch anomaly detection service

Example 2: Forecasting API

Example 3: Streamed classification

Summary

Chapter 2: The Machine Learning Development Process

Technical requirements

Setting up our tools

Setting up an AWS account

Concept to solution in four steps

Discover

Play

Develop

Deploy

Summary

Section 2: ML Development and Deployment

Chapter 3: From Model to Model Factory

Technical requirements

Defining the model factory

Designing your training system

Training system design options

Train-run

Train-persist

Retraining required

Detecting drift

Engineering features for consumption

Engineering categorical features

Engineering numerical features

Learning about learning

Defining the target

Cutting your losses

Hierarchies of automation

Optimizing hyperparameters

AutoML

Auto-sklearn

Persisting your models

Building the model factory with pipelines

Scikit-learn pipelines

Spark ML pipelines

Summary

Chapter 4: Packaging Up

Technical requirements

Writing good Python

Recapping the basics

Tips and tricks

Adhering to standards

Writing good PySpark

Choosing a style

Object-oriented programming

Functional programming

Packaging your code

Why package?

Selecting use cases for packaging

Designing your package

Building your package

Testing, logging, and error handling

Testing

Logging

Error handling

Not reinventing the wheel