Machine Learning with LightGBM and Python is a comprehensive guide to learning the basics of machine learning and progressing to building scalable machine learning systems that are ready for release.
This book will get you acquainted with the high-performance gradient-boosting LightGBM framework and show you how it can be used to solve various machine-learning problems to produce highly accurate, robust, and predictive solutions. Starting with simple machine learning models in scikit-learn, you’ll explore the intricacies of gradient boosting machines and LightGBM. You’ll be guided through various case studies to better understand the data science processes and learn how to practically apply your skills to real-world problems. As you progress, you’ll elevate your software engineering skills by learning how to build and integrate scalable machine-learning pipelines to process data, train models, and deploy them to serve secure APIs using Python tools such as FastAPI.
By the end of this book, you’ll be well equipped to use various state-of-the-art tools that will help you build production-ready systems, including FLAML for AutoML, PostgresML for operating ML pipelines using Postgres, high-performance distributed training and serving via Dask, and creating and running models in the cloud with AWS SageMaker.
Page count: 313
Year of publication: 2023
Machine Learning with LightGBM and Python
A practitioner’s guide to developing production-ready machine learning systems
Andrich van Wyk
BIRMINGHAM—MUMBAI
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Niranjan Naikwadi
Publishing Product Manager: Tejashwini R
Senior Editor: Gowri Rekha
Content Development Editor: Manikandan Kurup
Technical Editor: Kavyashree K S
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Subalakshmi Govindhan
Production Designer: Shyam Sundar Korumilli
Marketing Coordinator: Vinishka Kalra
First published: September 2023
Production reference: 1220923
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN: 978-1-80056-474-9
www.packtpub.com
Countless nights and weekends have been dedicated to completing this book, and I would like to thank my wife, Irene, for her eternal support, without which, nobody would be reading any of this. Further, I’m grateful to my daughter, Emily, for inspiring me to reach a little further.
– Andrich van Wyk
Andrich van Wyk has 15 years of experience in machine learning R&D, building AI-driven solutions, and consulting in the AI domain. He also has broad experience as a software engineer and architect with over a decade of industry experience working on enterprise systems.
He graduated cum laude with an M.Sc. in Computer Science from the University of Pretoria, focusing on neural networks and evolutionary algorithms.
Andrich enjoys writing about machine learning engineering and the software industry at large. He currently resides in South Africa with his wife and daughter.
Valentine Shkulov is a renowned visiting lecturer at a top tech university, where he seamlessly melds academia with real-world expertise as a distinguished Data Scientist in Fintech and E-commerce. His ingenuity in crafting ML-driven solutions has transformed businesses, from tech giants to budding startups. Valentine excels at introducing AI innovations and refining current systems, ensuring they profoundly influence vital business metrics. His passion for navigating product challenges has established him as a pioneer in leveraging ML to elevate businesses.
Above all, a heartfelt thanks to my spouse, the unwavering pillar of support in my remarkable journey.
Kayewan M Karanjia has over 7 years of experience in machine learning, artificial intelligence (AI), and data technologies, and brings a wealth of expertise to his current role at DrDoctor. Here, as a machine learning engineer, he is dedicated to implementing advanced machine learning models that have a direct impact on enhancing healthcare services and process optimization for the NHS. In the past, he has also worked with multiple MNCs such as Reliance Industries Limited, and implemented solutions for the government of India.
In this part, we will initiate our exploration of machine learning by grounding you in its fundamental concepts, ranging from basic terminologies to intricate algorithms like random forests. We will delve deep into ensemble learning, highlighting the power of decision trees when combined, and then shift our focus to the gradient-boosting framework, LightGBM. Through hands-on examples in Python and comparative analyses against techniques like XGBoost and deep neural networks, you’ll gain both a foundational understanding and practical competence in the realm of machine learning, especially with LightGBM.
This part will include the following chapters:
Chapter 1, Introducing Machine Learning
Chapter 2, Ensemble Learning – Bagging and Boosting
Chapter 3, An Overview of LightGBM in Python
Chapter 4, Comparing LightGBM, XGBoost, and Deep Learning
Our journey starts with an introduction to machine learning and the fundamental concepts we’ll use throughout this book.
We’ll start by providing an overview of machine learning from a software engineering perspective. Then, we’ll introduce the core concepts that are used in the field of machine learning and data science: models, datasets, learning paradigms, and other details. This introduction will include a practical example that clearly illustrates the machine learning terms discussed.
We will also introduce decision trees, a crucially important machine learning algorithm that is our first step to understanding LightGBM.
After completing this chapter, you will have established a solid foundation in machine learning and the practical application of machine learning techniques.
The following main topics will be covered in this chapter:
What is machine learning?
Introducing models, datasets, and supervised learning
Decision tree learning
This chapter includes examples of simple machine learning algorithms and introduces working with scikit-learn. You must install a Python environment with scikit-learn, NumPy, pandas, and Jupyter Notebook. The code for this chapter is available at https://github.com/PacktPublishing/Practical-Machine-Learning-with-LightGBM-and-Python/tree/main/chapter-1.
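One way to set up such an environment is with pip (package names as listed above; using a virtual environment is optional but typical, and the activation command shown is for Linux/macOS):

```shell
python -m venv .venv
source .venv/bin/activate
pip install scikit-learn numpy pandas notebook
```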
Machine learning is a part of the broader artificial intelligence field that involves methods and techniques that allow computers to “learn” specific tasks without explicit programming.
Machine learning is just another way to write programs, albeit automatically, from data. Abstractly, a program is a set of instructions that transforms inputs into specific outputs. A programmer’s job is to understand all the relevant inputs to a computer program and develop a set of instructions to produce the correct outputs.
However, what if the inputs are beyond the programmer’s understanding?
For example, let’s consider creating a program to forecast the total sales of a large retail store. The inputs to the program would be various factors that could affect sales. We could imagine factors such as historical sales figures, upcoming public holidays, stock availability, any special deals the store might be running, and even factors such as the weather forecast or proximity to other stores.
In our store example, the traditional approach would be to break down the inputs into manageable, understandable (by a programmer) pieces, perhaps consult an expert in store sales forecasting, and then devise handcrafted rules and instructions to attempt to forecast future sales.
While this approach is certainly possible, it is also brittle (in the sense that the program might have to undergo extensive changes regarding the input factors) and wholly based on the programmer’s (or domain expert’s) understanding of the problem. With potentially thousands of factors and billions of examples, this problem becomes untenable.
Machine learning offers us an alternative to this approach. Instead of creating rules and instructions, we repeatedly show the computer examples of the tasks we need to accomplish and then get it to figure out how to solve them automatically.
However, where we previously had a set of instructions, we now have a model that is trained rather than programmed.
The key realization here, especially if you are coming from a software background, is that our machine learning program still functions like a regular program: it accepts input, has a way to process it, and produces output. Like all other software programs, machine learning software must be tested for correctness, integrated into other systems, deployed, monitored, and optimized. Collectively, this forms the field of machine learning engineering. We’ll cover all these aspects and more in later chapters.
Broadly speaking, machine learning has three main paradigms: supervised, unsupervised, and reinforcement learning.
With supervised learning, the model is trained on labeled data: each instance in the dataset has its associated correct output, or label, for the input example. The model is expected to learn to predict the label for unseen input examples.
With unsupervised learning, the examples in the dataset are unlabeled; in this case, the model is expected to discover patterns and relationships in the data. Examples of unsupervised approaches are clustering algorithms, anomaly detection, and dimensionality reduction algorithms.
Finally, reinforcement learning entails a model, usually called an agent, interacting with a particular environment and learning by receiving penalties or rewards for specific actions. The goal is for the agent to perform actions that maximize its reward. Reinforcement learning is widely used in robotics, control systems, or training computers to play games.
LightGBM and most of the other algorithms discussed in later chapters are supervised learning techniques, the paradigm this book focuses on.
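As a concrete sketch of supervised learning, the following trains a decision tree classifier with scikit-learn (using scikit-learn’s bundled Iris dataset for illustration, not a dataset from this book):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a labeled dataset: each input example has an associated correct label.
X, y = load_iris(return_X_y=True)

# Hold out data the model has not seen, to measure generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train (fit) the model on the labeled training examples.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict labels for unseen inputs and measure accuracy.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```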
The following section dives deeper into the machine learning terminology we’ll use throughout this book and the details of the machine learning process.
In the previous section, we introduced a model as a construct to replace a set of instructions that typically comprise a program to perform a specific task. This section covers models and other core machine learning concepts in more detail.
More formally, a model is a mathematical or algorithmic representation of a specific process that performs a particular task. A machine learning model learns a particular task by being trained on a dataset using a training algorithm.
Note
An alternative term for training is fit. Historically, fit stems from the statistical field. A model is said to “fit the data” when trained. We’ll use both terms interchangeably throughout this book.
Many distinct types of models exist, all of which use different mathematical, statistical, or algorithmic techniques to model the training data. Examples of machine learning algorithms include linear regression, logistic regression, decision trees, support vector machines, and neural networks.
A distinction is made between the model type and a trained instance of that model: the majority of machine learning models can be trained to perform various tasks. For example, decision trees (a model type) can be trained to forecast sales, recognize heart disease, and predict football match results. However, each of these tasks requires a different instance of a decision tree that has been trained on a distinct dataset.
What a specific model does depends on the model’s parameters. Parameters are sometimes called weights, although weights are technically a particular type of model parameter.
A training algorithm is an algorithm for finding the most appropriate model parameters for a specific task.
We determine the quality of fit, or how well the model performs, using an objective function. This is a mathematical function that measures the difference between the predicted output and the actual output for a given input. The objective function quantifies the performance of a model. We may seek to minimize or maximize the objective function depending on the problem we are solving. The objective is often measured as an error we aim to minimize during training.
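For instance, mean squared error is a common objective function for regression problems, quantifying the gap between predicted and actual outputs (a minimal NumPy illustration with made-up values):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Objective function: the average squared difference between
    actual outputs and predicted outputs."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

actual = [3.0, 5.0, 2.5]
good_predictions = [2.9, 5.1, 2.4]
poor_predictions = [1.0, 8.0, 0.0]

# A better-fitting model yields a lower error; training seeks to minimize it.
print(mean_squared_error(actual, good_predictions))  # small error
print(mean_squared_error(actual, poor_predictions))  # large error
```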
We can summarize the model training process as follows: a training algorithm uses data from a dataset to optimize a model’s parameters for a particular task, as measured through an objective function.
While a model is composed of parameters, the training algorithm has parameters of its own called hyperparameters. A hyperparameter is a controllable value that influences the training process or algorithm. For example, consider finding the minimum of a parabola function: we could start by guessing a value and then take small steps in the direction that minimizes the function output. The step size would have to be chosen well: if our steps are too small, it will take a prohibitively long time to find the minimum. If the step size is too large, we may overshoot and miss the minimum and then continue oscillating (jumping back and forth) around the minimum:
Figure 1.1 – Effect of using a step size that is too large (left) and too small (right)
In this example, the step size would be a hyperparameter of our minimization algorithm. The effect of the step size is illustrated in Figure 1.1.
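The minimization described above can be sketched as simple gradient descent on a parabola, with the step size as the hyperparameter (an illustrative sketch, not code from the book; the start point and step counts are arbitrary):

```python
def minimize_parabola(step_size, start=10.0, steps=50):
    """Minimize f(x) = x^2 (minimum at x = 0) by gradient descent.
    The gradient is f'(x) = 2x; each step moves against the gradient."""
    x = start
    for _ in range(steps):
        x = x - step_size * 2 * x
    return x

# A well-chosen step size converges close to the minimum at 0.
print(minimize_parabola(step_size=0.1))
# A tiny step size makes very slow progress in the same number of steps.
print(minimize_parabola(step_size=0.001))
# A step size that is too large overshoots the minimum and oscillates,
# with the oscillations growing rather than shrinking.
print(minimize_parabola(step_size=1.1))
```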
As explained previously, the machine learning model is trained using a dataset. Data is at the heart of the machine learning process, and data preparation is often the part of the process that takes up the most time.
Throughout this book, we’ll work with tabular datasets. Tabular datasets are very common in the real world and consist of rows and columns. Rows are often called samples, examples, or observations, and columns are usually called features, variables, or attributes.
Importantly, there is no restriction on the data type in a column. Features may be strings, numbers, Booleans, geospatial coordinates, or encoded formats such as audio, images, or video.
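A small pandas DataFrame illustrates this layout (hypothetical store data, loosely following the sales example from earlier):

```python
import pandas as pd

# Rows are samples (one per store per day); columns are features of mixed types.
data = pd.DataFrame({
    "store_id": ["A12", "B07", "A12"],          # string feature
    "daily_sales": [1520.50, 980.00, 1610.25],  # numeric feature
    "public_holiday": [False, False, True],     # Boolean feature
    "weather": ["sunny", "rain", "sunny"],      # categorical feature
})

print(data.dtypes)
print(f"{data.shape[0]} samples, {data.shape[1]} features")
```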
Datasets are also rarely perfectly defined. Data may be incomplete, noisy, incorrect, inconsistent, and contain various formats.
Therefore, data preparation and cleaning are essential parts of the machine learning process.
Data preparation concerns processing the data to make it suitable for machine learning and typically consists of the following steps:
Gathering and validation: Some datasets are initially too small or represent the problem poorly (the data is not representative of the actual data population it’s been sampled from). In these cases, the practitioner must collect more data, and validation must be done to ensure the data represents the problem.
Checking for systemic errors and bias: It is vital to check for and correct any systemic errors in the collection and validation process that may lead to bias in the dataset. In our sales example, a systemic collection error may be that data was only gathered from urban stores and excluded rural ones. A model trained on only urban store data will be biased in forecasting store sales, and we may expect poor performance when the model is used to predict sales for rural stores.
Cleaning the data: Any format or value range inconsistencies must be addressed. Any missing values also need to be handled in a way that does not introduce bias.
Feature engineering: Certain features may need to be transformed to ensure the machine learning model can learn from them, such as numerically encoding a sentence of words. Additionally, new features may need to be prepared from existing features to help the model detect patterns.
Normalizing and standardizing: The relative ranges of features must be normalized and standardized. Normalizing and standardizing ensure that no one feature has an outsized effect on the overall prediction.
Balancing the dataset: In cases where the dataset is imbalanced – that is, it contains many more examples of one class or prediction than another – the dataset needs to be balanced. Balancing is typically done by oversampling the minority examples to balance the dataset.
In Chapter 6, Solving Real-World Data Science Problems with LightGBM, we’ll go through the entire data preparation process to show how the preceding steps are applied practically.
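A few of these steps can be sketched with pandas and scikit-learn (the column names and values here are hypothetical; the full process is covered in Chapter 6):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "sales": [1500.0, None, 2100.0, 1800.0],      # contains a missing value
    "weather": ["sunny", "rain", "sunny", "cloudy"],
})

# Cleaning: impute the missing value (here, with the column median).
df["sales"] = df["sales"].fillna(df["sales"].median())

# Feature engineering: numerically encode the categorical feature.
df = pd.get_dummies(df, columns=["weather"])

# Standardizing: rescale the numeric feature to zero mean and unit variance.
df["sales"] = StandardScaler().fit_transform(df[["sales"]])

print(df.head())
```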
Note
A good adage to remember is “garbage in, garbage out”. A model learns from any data given to it, including any flaws or biases contained in the data. When we train the model on garbage data, it results in a garbage model.
One final concept to understand regarding datasets is the training, validation, and test datasets. We split our datasets into these three subsets after the data preparation step is done:
The training set is the most significant subset and typically consists of 60% to 80% of the data. This data is used to train the model.
The validation set is separate from the training data and is used throughout the training process to evaluate the model. Having independent validation data ensures that the model is evaluated on data it has not seen before, which measures its generalization ability. Hyperparameter tuning, a process covered in detail in Chapter 5, LightGBM Parameter Optimization with Optuna, also uses the validation set.
Finally, the test set is an optional hold-out set, similar to the validation set. It is used at the end of the process to evaluate the model’s performance on data that was not part of the training or tuning process.
Another use of the validation set is to monitor whether the model is overfitting the data. Let’s discuss overfitting in more detail.
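scikit-learn’s `train_test_split` can produce these three subsets by splitting twice; the 60/20/20 ratio below is one common choice, assumed for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# A toy dataset of 50 samples with 2 features each.
X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First, split off the test set (20% of the data).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remainder into training (60%) and validation (20%) sets:
# 25% of the remaining 80% equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```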
To understand overfitting, we must first define what we mean by model generalization. As stated previously, generalization is the model’s ability to accurately predict data it has not seen before. Generalization accuracy is a more significant estimate of model performance than training accuracy, as it indicates how our model will perform in production. Generalization comes in two forms, interpolation and extrapolation:
Interpolation refers to the model’s ability to predict a value between two known data points – stated another way, to generalize within the training data range. For example, let’s say we train our model with monthly data from January to July. When interpolating, we would ask the model to make a prediction for a particular day in April, a date within our training range.
Extrapolation, as you might infer, is the model’s ability to predict values outside of the range defined by our training data. A typical example of extrapolation is forecasting – that is, predicting the future. In our previous example, if we ask the model to make a prediction for December, we expect it to extrapolate from the training data.
Of the two types of generalization, extrapolation is much more challenging and may require a specific type of model to achieve. However, in both cases, a model can overfit the data, losing its ability to interpolate or extrapolate accurately.
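Following the monthly example above, a model trained on months 1 through 7 (January to July) can illustrate both cases. This sketch uses a noise-free linear trend and a linear model so that the behavior is easy to verify; the specific numbers are invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: months January (1) through July (7) following a linear trend.
months = np.arange(1, 8).reshape(-1, 1)
values = 2.0 * months.ravel() + 5.0

model = LinearRegression().fit(months, values)

# Interpolation: April (month 4) lies inside the training range.
print(model.predict([[4]]))
# Extrapolation: December (month 12) lies outside the training range.
print(model.predict([[12]]))
```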
Overfitting is a phenomenon where the model fits the training data too closely and loses its ability to generalize to unseen data. Instead of learning the underlying pattern in the data, the model has memorized the training data. More technically, the model fits the noise contained in the training data. The term noise stems from the concept of data containing signal and noise. Signal refers to the underlying pattern or information captured in the data we are trying to predict. In contrast, noise refers to random or irrelevant variations of data points that mask the signal.
For example, consider a dataset where we try to predict the rainfall for specific locations. The signal in the data would be the general trend of rainfall: rainfall increases in the winter or summer, or vice versa for other locations. The noise would be the slight variations in rainfall measurement for each month and location in our dataset.
The following graph illustrates the phenomenon of overfitting:
Figure 1.2 – Graph showing overfitting. The model has overfitted and predicted the training data perfectly but has lost the ability to generalize to the actual signal
The preceding figure shows the difference between signal and noise: each data point was sampled from the actual signal. The data follows the general pattern of the signal, with slight, random variations. We can see how the model has overfitted the data: the model has fit the training data perfectly but at the cost of generalization. We can also see that if we use the model to interpolate by predicting a value at x = 4, we get a result much higher than the actual signal (6.72 versus 6.2). Also shown is the model’s failure to extrapolate: the prediction at x = 12 is much lower than a forecast of the signal (7.98 versus 8.6).
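Overfitting can be reproduced in a few lines by letting an unconstrained decision tree memorize noisy samples of a smooth signal (an illustrative sketch; the signal and noise level here are invented and do not reproduce the figure’s exact data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# The signal is a smooth curve; the training data adds random noise to it.
X_train = np.linspace(0, 10, 30).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + rng.normal(0, 0.3, 30)
X_test = np.linspace(0.1, 9.9, 30).reshape(-1, 1)  # unseen points
y_test = np.sin(X_test).ravel()                    # the true signal

# An unconstrained tree can memorize every (noisy) training point.
overfit = DecisionTreeRegressor().fit(X_train, y_train)
train_error = mean_squared_error(y_train, overfit.predict(X_train))
test_error = mean_squared_error(y_test, overfit.predict(X_test))

# Training error is (near) zero, yet the error on unseen data is far larger:
# the model has fit the noise, not the signal.
print(f"train MSE: {train_error:.4f}, test MSE: {test_error:.4f}")
```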
In reality, all real-world datasets contain noise. As data scientists, we aim to prepare the data to remove as much noise as possible, making the signal easier to detect. Data cleaning, normalization, feature selection, feature engineering, and regularization are techniques for removing noise from the data.
Since all real-world data contains noise, overfitting is impossible to eliminate. The following conditions may lead to overfitting:
An overly complex model: A model that is too complex for the amount of data available uses its additional capacity to memorize the noise in the data, leading to overfitting.
Insufficient data: If we don’t have enough training data for the model we use, the effect is similar to that of an overly complex model, and the model overfits the data.
Too many features: A dataset with too many features likely contains irrelevant (noisy) features that reduce the model’s generalization.
Overtraining: Training the model for too long allows it to memorize the noise in the dataset.
As the validation set is held out from the training data and remains unseen by the model, we use it to monitor for overfitting. We can recognize the point of overfitting by comparing the training and validation errors over time: at the point of overfitting, the validation error increases while the training error continues to improve, indicating that the model is fitting noise in the training data and losing its ability to generalize.
Techniques that prevent overfitting usually aim to address the conditions that lead to overfitting we discussed previously. Here are some strategies to avoid overfitting:
Early stopping: We can stop training when we see the validation error beginning to increase.
Simplifying the model: A less complex model with fewer parameters is incapable of learning the noise in the training data and therefore generalizes better.
Getting more data: Collecting more data or augmenting existing data is an effective way to prevent overfitting, as it gives the model a better chance of learning the signal in the data instead of the noise in a smaller dataset.
Feature selection and dimensionality reduction: As some features might be irrelevant to the problem being solved, we can discard features we think are redundant or use techniques such as Principal Component Analysis to reduce the dimensionality (features).
Adding regularization