Description

Machine Learning with LightGBM and Python is a comprehensive guide to learning the basics of machine learning and progressing to building scalable machine learning systems that are ready for release.
This book will get you acquainted with the high-performance gradient-boosting LightGBM framework and show you how it can be used to solve various machine-learning problems to produce highly accurate, robust, and predictive solutions. Starting with simple machine learning models in scikit-learn, you’ll explore the intricacies of gradient boosting machines and LightGBM. You’ll be guided through various case studies to better understand the data science processes and learn how to practically apply your skills to real-world problems. As you progress, you’ll elevate your software engineering skills by learning how to build and integrate scalable machine-learning pipelines to process data, train models, and deploy them to serve secure APIs using Python tools such as FastAPI.
By the end of this book, you’ll be well equipped to use various state-of-the-art tools that will help you build production-ready systems, including FLAML for AutoML, PostgresML for operating ML pipelines using Postgres, high-performance distributed training and serving via Dask, and creating and running models in the cloud with AWS SageMaker.




Machine Learning with LightGBM and Python

A practitioner’s guide to developing production-ready machine learning systems

Andrich van Wyk

BIRMINGHAM—MUMBAI

Machine Learning with LightGBM and Python

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Niranjan Naikwadi

Publishing Product Manager: Tejashwini R

Senior Editor: Gowri Rekha

Content Development Editor: Manikandan Kurup

Technical Editor: Kavyashree K S

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Subalakshmi Govindhan

Production Designer: Shyam Sundar Korumilli

Marketing Coordinator: Vinishka Kalra

First published: September 2023

Production reference: 1220923

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN: 978-1-80056-474-9

www.packtpub.com

Countless nights and weekends have been dedicated to completing this book, and I would like to thank my wife, Irene, for her eternal support, without which nobody would be reading any of this. Further, I’m grateful to my daughter, Emily, for inspiring me to reach a little further.

– Andrich van Wyk

Contributors

About the author

Andrich van Wyk has 15 years of experience in machine learning R&D, building AI-driven solutions, and consulting in the AI domain. He is also a broadly experienced software engineer and architect, with over a decade in industry working on enterprise systems.

He graduated cum laude with an M.Sc. in Computer Science from the University of Pretoria, focusing on neural networks and evolutionary algorithms.

Andrich enjoys writing about machine learning engineering and the software industry at large. He currently resides in South Africa with his wife and daughter.

About the reviewers

Valentine Shkulov is a renowned visiting lecturer at a top tech university, where he seamlessly melds academia with real-world expertise as a distinguished Data Scientist in Fintech and E-commerce. His ingenuity in crafting ML-driven solutions has transformed businesses, from tech giants to budding startups. Valentine excels at introducing AI innovations and refining current systems, ensuring they profoundly influence vital business metrics. His passion for navigating product challenges has established him as a pioneer in leveraging ML to elevate businesses.

Above all, a heartfelt thanks to my spouse, the unwavering pillar of support in my remarkable journey.

Kayewan M Karanjia has over 7 years of experience in machine learning, artificial intelligence (AI), and data technologies, and brings a wealth of expertise to his current role at DrDoctor. Here, as a machine learning engineer, he is dedicated to implementing advanced machine learning models that have a direct impact on enhancing healthcare services and process optimization for the NHS. In the past, he has also worked with multiple MNCs such as Reliance Industries Limited, and implemented solutions for the government of India.

Table of Contents

Preface

Part 1: Gradient Boosting and LightGBM Fundamentals

1

Introducing Machine Learning

Technical requirements

What is machine learning?

Machine learning paradigms

Introducing models, datasets, and supervised learning

Models

Hyperparameters

Datasets

Overfitting and generalization

Supervised learning

Model performance metrics

A modeling example

Decision tree learning

Entropy and information gain

Building a decision tree using C4.5

Overfitting in decision trees

Building decision trees with scikit-learn

Decision tree hyperparameters

Summary

References

2

Ensemble Learning – Bagging and Boosting

Technical requirements

Ensemble learning

Bagging and random forests

Random forest

Gradient-boosted decision trees

Gradient descent

Gradient boosting

Gradient-boosted decision tree hyperparameters

Gradient boosting in scikit-learn

Advanced boosting algorithm – DART

Summary

References

3

An Overview of LightGBM in Python

Technical requirements

Introducing LightGBM

LightGBM optimizations

Hyperparameters

Limitations of LightGBM

Getting started with LightGBM in Python

LightGBM Python API

LightGBM scikit-learn API

Building LightGBM models

Cross-validation

Parameter optimization

Predicting student academic success

Summary

References

4

Comparing LightGBM, XGBoost, and Deep Learning

Technical requirements

An overview of XGBoost

Comparing XGBoost and LightGBM

Python XGBoost example

Deep learning and TabTransformers

What is deep learning?

Introducing TabTransformers

Comparing LightGBM, XGBoost, and TabTransformers

Predicting census income

Detecting credit card fraud

Summary

References

Part 2: Practical Machine Learning with LightGBM

5

LightGBM Parameter Optimization with Optuna

Technical requirements

Optuna and optimization algorithms

Introducing Optuna

Optimization algorithms

Pruning strategies

Optimizing LightGBM with Optuna

Advanced Optuna features

Summary

References

6

Solving Real-World Data Science Problems with LightGBM

Technical requirements

The data science life cycle

Defining the data science life cycle

Predicting wind turbine power generation with LightGBM

Problem definition

Data collection

Data preparation

EDA

Modeling

Model deployment

Communicating results

Classifying individual credit scores with LightGBM

Problem definition

Data collection

Data preparation

EDA

Modeling

Model deployment and results

Summary

References

7

AutoML with LightGBM and FLAML

Technical requirements

Automated machine learning

Automating feature engineering

Automating model selection and tuning

Risks of using AutoML systems

Introducing FLAML

Cost Frugal Optimization

BlendSearch

FLAML limitations

Case study – using FLAML with LightGBM

Feature engineering

FLAML AutoML

Zero-shot AutoML

Summary

References

Part 3: Production-ready Machine Learning with LightGBM

8

Machine Learning Pipelines and MLOps with LightGBM

Technical requirements

Introducing machine learning pipelines

Scikit-learn pipelines

Understanding MLOps

Deploying an ML pipeline for customer churn

Building an ML pipeline using scikit-learn

Building an ML API using FastAPI

Containerizing our API

Deploying LightGBM to Google Cloud

Summary

9

LightGBM MLOps with AWS SageMaker

Technical requirements

An introduction to AWS and SageMaker

AWS

SageMaker

SageMaker Clarify

Building a LightGBM ML pipeline with Amazon SageMaker

Setting up a SageMaker session

Preprocessing step

Model training and tuning

Evaluation, bias, and explainability

Deploying and monitoring the LightGBM model

Results

Summary

References

10

LightGBM Models with PostgresML

Technical requirements

Introducing PostgresML

Latency and round trips

Getting started with PostgresML

Training models

Deploying and prediction

PostgresML dashboard

Case study – customer churn with PostgresML

Data loading and preprocessing

Training and hyperparameter optimization

Predictions

Summary

References

11

Distributed and GPU-Based Learning with LightGBM

Technical requirements

Distributed learning with LightGBM and Dask

GPU training for LightGBM

Setting up LightGBM for the GPU

Running LightGBM on the GPU

Summary

References

Index

Other Books You May Enjoy

Part 1: Gradient Boosting and LightGBM Fundamentals

In this part, we will initiate our exploration of machine learning by grounding you in its fundamental concepts, ranging from basic terminologies to intricate algorithms like random forests. We will delve deep into ensemble learning, highlighting the power of decision trees when combined, and then shift our focus to the gradient-boosting framework, LightGBM. Through hands-on examples in Python and comparative analyses against techniques like XGBoost and deep neural networks, you’ll gain both a foundational understanding and practical competence in the realm of machine learning, especially with LightGBM.

This part will include the following chapters:

Chapter 1, Introducing Machine Learning

Chapter 2, Ensemble Learning – Bagging and Boosting

Chapter 3, An Overview of LightGBM in Python

Chapter 4, Comparing LightGBM, XGBoost, and Deep Learning

1

Introducing Machine Learning

Our journey starts with an introduction to machine learning and the fundamental concepts we’ll use throughout this book.

We’ll start by providing an overview of machine learning from a software engineering perspective. Then, we’ll introduce the core concepts that are used in the field of machine learning and data science: models, datasets, learning paradigms, and other details. This introduction will include a practical example that clearly illustrates the machine learning terms discussed.

We will also introduce decision trees, a crucially important machine learning algorithm that is our first step to understanding LightGBM.

After completing this chapter, you will have established a solid foundation in machine learning and the practical application of machine learning techniques.

The following main topics will be covered in this chapter:

What is machine learning?

Introducing models, datasets, and supervised learning

Decision tree learning

Technical requirements

This chapter includes examples of simple machine learning algorithms and introduces working with scikit-learn. You’ll need a Python environment with scikit-learn, NumPy, pandas, and Jupyter Notebook installed. The code for this chapter is available at https://github.com/PacktPublishing/Practical-Machine-Learning-with-LightGBM-and-Python/tree/main/chapter-1.
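As a quick sanity check (assuming the packages are already installed, for example via pip), the following snippet verifies that the environment for this chapter imports correctly:

```python
# Quick environment check: import the libraries used in this chapter
# and print their versions for reproducibility.
import sklearn
import numpy
import pandas

print("scikit-learn:", sklearn.__version__)
print("NumPy:", numpy.__version__)
print("pandas:", pandas.__version__)
```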

What is machine learning?

Machine learning is a part of the broader artificial intelligence field that involves methods and techniques that allow computers to “learn” specific tasks without explicit programming.

Machine learning is just another way to write programs, albeit automatically, from data. Abstractly, a program is a set of instructions that transforms inputs into specific outputs. A programmer’s job is to understand all the relevant inputs to a computer program and develop a set of instructions to produce the correct outputs.

However, what if the inputs are beyond the programmer’s understanding?

For example, let’s consider creating a program to forecast the total sales of a large retail store. The inputs to the program would be various factors that could affect sales. We could imagine factors such as historical sales figures, upcoming public holidays, stock availability, any special deals the store might be running, and even factors such as the weather forecast or proximity to other stores.

In our store example, the traditional approach would be to break down the inputs into manageable, understandable (by a programmer) pieces, perhaps consult an expert in store sales forecasting, and then devise handcrafted rules and instructions to attempt to forecast future sales.

While this approach is certainly possible, it is also brittle (in the sense that the program might have to undergo extensive changes regarding the input factors) and wholly based on the programmer’s (or domain expert’s) understanding of the problem. With potentially thousands of factors and billions of examples, this approach becomes untenable.

Machine learning offers us an alternative to this approach. Instead of creating rules and instructions, we repeatedly show the computer examples of the tasks we need to accomplish and then get it to figure out how to solve them automatically.

However, where we previously had a set of instructions, we now have a trained model instead of a programmed one.

The key realization here, especially if you are coming from a software background, is that our machine learning program still functions like a regular program: it accepts input, has a way to process it, and produces output. Like all other software programs, machine learning software must be tested for correctness, integrated into other systems, deployed, monitored, and optimized. Collectively, this forms the field of machine learning engineering. We’ll cover all these aspects and more in later chapters.

Machine learning paradigms

Broadly speaking, machine learning has three main paradigms: supervised, unsupervised, and reinforcement learning.

With supervised learning, the model is trained on labeled data: each instance in the dataset has its associated correct output, or label, for the input example. The model is expected to learn to predict the label for unseen input examples.

With unsupervised learning, the examples in the dataset are unlabeled; in this case, the model is expected to discover patterns and relationships in the data. Examples of unsupervised approaches are clustering algorithms, anomaly detection, and dimensionality reduction algorithms.

Finally, reinforcement learning entails a model, usually called an agent, interacting with a particular environment and learning by receiving penalties or rewards for specific actions. The goal is for the agent to perform actions that maximize its reward. Reinforcement learning is widely used in robotics, control systems, and for training computers to play games.
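As a minimal illustration of the first two paradigms (using scikit-learn and a tiny, made-up dataset), note that the supervised model is fit on inputs and labels, while the unsupervised one sees only the inputs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# A tiny, made-up dataset: six one-dimensional inputs.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, available only in the supervised case

# Supervised learning: the model is fit on inputs *and* their labels.
classifier = LogisticRegression().fit(X, y)
print(classifier.predict([[2.5], [10.5]]))  # predicted labels for unseen inputs

# Unsupervised learning: only the inputs are given; the algorithm
# discovers structure (here, two clusters) on its own.
clustering = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(clustering.labels_)
```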

LightGBM and most of the other algorithms discussed in later chapters are supervised learning techniques, which are the focus of this book.

The following section dives deeper into the machine learning terminology we’ll use throughout this book and the details of the machine learning process.

Introducing models, datasets, and supervised learning

In the previous section, we introduced a model as a construct to replace a set of instructions that typically comprise a program to perform a specific task. This section covers models and other core machine learning concepts in more detail.

Models

More formally, a model is a mathematical or algorithmic representation of a specific process that performs a particular task. A machine learning model learns a particular task by being trained on a dataset using a training algorithm.

Note

An alternative term for training is fit. Historically, fit stems from the statistical field. A model is said to “fit the data” when trained. We’ll use both terms interchangeably throughout this book.

Many distinct types of models exist, all of which use different mathematical, statistical, or algorithmic techniques to model the training data. Examples of machine learning algorithms include linear regression, logistic regression, decision trees, support vector machines, and neural networks.

A distinction is made between the model type and a trained instance of that model: the majority of machine learning models can be trained to perform various tasks. For example, decision trees (a model type) can be trained to forecast sales, recognize heart disease, and predict football match results. However, each of these tasks requires a different instance of a decision tree that has been trained on a distinct dataset.

What a specific model does depends on the model’s parameters. Parameters are sometimes also called weights, although weights are technically just one particular type of parameter.

A training algorithm is an algorithm for finding the most appropriate model parameters for a specific task.

We determine the quality of fit, or how well the model performs, using an objective function. This is a mathematical function that measures the difference between the predicted output and the actual output for a given input. The objective function quantifies the performance of a model. We may seek to minimize or maximize the objective function depending on the problem we are solving. The objective is often measured as an error we aim to minimize during training.

We can summarize the model training process as follows: a training algorithm uses data from a dataset to optimize a model’s parameters for a particular task, as measured through an objective function.
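To tie these terms together, here is a minimal sketch (using a small, made-up dataset and scikit-learn) of a training algorithm fitting a model’s parameters and an objective function measuring the quality of fit:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# A small, made-up regression dataset: inputs X and target outputs y.
X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = np.sin(X).ravel()

# The training algorithm (fit) searches for the tree parameters (splits)
# that best perform the task on this dataset.
model = DecisionTreeRegressor(max_depth=3)
model.fit(X, y)

# The objective function quantifies the quality of fit; here we use
# mean squared error, which we aim to minimize.
predictions = model.predict(X)
print("Training MSE:", mean_squared_error(y, predictions))
```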

Hyperparameters

While a model is composed of parameters, the training algorithm has parameters of its own called hyperparameters. A hyperparameter is a controllable value that influences the training process or algorithm. For example, consider finding the minimum of a parabola function: we could start by guessing a value and then take small steps in the direction that minimizes the function output. The step size would have to be chosen well: if our steps are too small, it will take a prohibitively long time to find the minimum. If the step size is too large, we may overshoot and miss the minimum and then continue oscillating (jumping back and forth) around the minimum:

Figure 1.1 – Effect of using a step size that is too large (left) and too small (right)

In this example, the step size would be a hyperparameter of our minimization algorithm. The effect of the step size is illustrated in Figure 1.1.
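As a rough sketch of this idea (plain Python, with illustrative values rather than anything taken from the figure), the following minimizes f(x) = x² with a fixed step size; step_size is the hyperparameter in question:

```python
def minimize_parabola(step_size, steps=25, start=5.0):
    """Take repeated steps towards the minimum of f(x) = x**2."""
    x = start
    for _ in range(steps):
        gradient = 2 * x          # derivative of x**2 at the current point
        x = x - step_size * gradient
    return x

# A well-chosen step size converges close to the true minimum at x = 0.
print(minimize_parabola(step_size=0.1))

# Too small: after the same number of steps, we are still far from 0.
print(minimize_parabola(step_size=0.001))

# Too large: each update overshoots, and x oscillates around (and away from) 0.
print(minimize_parabola(step_size=1.1))
```

Running the sketch shows the first call ending near zero, the second still far from the minimum, and the third jumping back and forth across it with growing magnitude.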

Datasets

As explained previously, the machine learning model is trained using a dataset. Data is at the heart of the machine learning process, and data preparation is often the part of the process that takes up the most time.

Throughout this book, we’ll work with tabular datasets. Tabular datasets are very common in the real world and consist of rows and columns. Rows are often called samples, examples, or observations, and columns are usually called features, variables, or attributes.

Importantly, there is no restriction on the data type in a column. Features may be strings, numbers, Booleans, geospatial coordinates, or encoded formats such as audio, images, or video.

Datasets are also rarely perfectly defined. Data may be incomplete, noisy, incorrect, or inconsistent, and may come in a variety of formats.

Therefore, data preparation and cleaning are essential parts of the machine learning process.

Data preparation concerns processing the data to make it suitable for machine learning and typically consists of the following steps (a brief code sketch follows the list):

Gathering and validation: Some datasets are initially too small or represent the problem poorly (the data is not representative of the actual data population it’s been sampled from). In these cases, the practitioner must collect more data, and validation must be done to ensure the data represents the problem.

Checking for systemic errors and bias: It is vital to check for and correct any systemic errors in the collection and validation process that may lead to bias in the dataset. In our sales example, a systemic collection error may be that data was only gathered from urban stores and excluded rural ones. A model trained on only urban store data will be biased in forecasting store sales, and we may expect poor performance when the model is used to predict sales for rural stores.

Cleaning the data: Any format or value range inconsistencies must be addressed. Any missing values also need to be handled in a way that does not introduce bias.

Feature engineering: Certain features may need to be transformed to ensure the machine learning model can learn from them, such as numerically encoding a sentence of words. Additionally, new features may need to be prepared from existing features to help the model detect patterns.

Normalizing and standardizing: The relative ranges of features must be normalized and standardized. Normalizing and standardizing ensure that no one feature has an outsized effect on the overall prediction.

Balancing the dataset: In cases where the dataset is imbalanced – that is, it contains many more examples of one class or prediction than another – the dataset needs to be balanced. Balancing is typically done by oversampling the minority examples.
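The following is a brief, illustrative sketch of a few of these steps using pandas and scikit-learn; the column names and values are entirely hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A small, hypothetical dataset with a missing value and a text feature.
data = pd.DataFrame({
    "store_size": [120.0, 85.0, None, 200.0],
    "region": ["urban", "rural", "urban", "urban"],
    "sales": [1000.0, 400.0, 800.0, 1500.0],
})

# Cleaning: fill the missing value (here, with the column median).
data["store_size"] = data["store_size"].fillna(data["store_size"].median())

# Feature engineering: encode the text feature numerically.
data = pd.get_dummies(data, columns=["region"])

# Standardizing: rescale features so that no single feature dominates
# purely because of its numeric range.
features = data.drop(columns=["sales"])
scaled = StandardScaler().fit_transform(features)
print(scaled.shape)  # (4, 3): store_size plus the two encoded region columns
```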

In Chapter 6, Solving Real-World Data Science Problems with LightGBM, we’ll go through the entire data preparation process to show how the preceding steps are applied practically.

Note

A good adage to remember is “garbage in, garbage out”. A model learns from any data given to it, including any flaws or biases contained in the data. When we train the model on garbage data, it results in a garbage model.

One final concept to understand regarding datasets is the training, validation, and test datasets. We split our datasets into these three subsets after the data preparation step is done (a splitting sketch follows the list):

The training set is the largest subset and typically consists of 60% to 80% of the data. This data is used to train the model.

The validation set is separate from the training data and is used throughout the training process to evaluate the model. Having independent validation data ensures that the model is evaluated on data it has not seen before, also known as its generalization ability. Hyperparameter tuning, a process covered in detail in Chapter 5, LightGBM Parameter Optimization with Optuna, also uses the validation set.

Finally, the test set is an optional hold-out set, similar to the validation set. It is used at the end of the process to evaluate the model’s performance on data that was not part of the training or tuning process.
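One common way to produce such a split, sketched here with scikit-learn’s train_test_split on placeholder data (a 60/20/20 split is assumed purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for a prepared dataset of 100 samples.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First, hold out 20% of the data as the test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ...then split the remainder into training and validation sets.
# 25% of the remaining 80% equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```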

Another use of the validation set is to monitor whether the model is overfitting the data. Let’s discuss overfitting in more detail.

Overfitting and generalization

To understand overfitting, we must first define what we mean by model generalization. As stated previously, generalization is the model’s ability to accurately predict data it has not seen before. Generalization accuracy matters more than training accuracy as an estimate of model performance because it indicates how our model will perform in production. Generalization comes in two forms, interpolation and extrapolation:

Interpolation refers to the model’s ability to predict a value between two known data points – stated another way, to generalize within the training data range. For example, let’s say we train our model with monthly data from January to July. When interpolating, we would ask the model to make a prediction on a particular day in April, a date within our training range.

Extrapolation, as you might infer, is the model’s ability to predict values outside of the range defined by our training data. A typical example of extrapolation is forecasting – that is, predicting the future. In our previous example, if we ask the model to make a prediction in December, we expect it to extrapolate from the training data.

Of the two types of generalization, extrapolation is much more challenging and may require a specific type of model to achieve. However, in both cases, a model can overfit the data, losing its ability to interpolate or extrapolate accurately.

Overfitting is a phenomenon where the model fits the training data too closely and loses its ability to generalize to unseen data. Instead of learning the underlying pattern in the data, the model has memorized the training data. More technically, the model fits the noise contained in the training data. The term noise stems from the concept of data containing signal and noise. Signal refers to the underlying pattern or information captured in the data we are trying to predict. In contrast, noise refers to random or irrelevant variations of data points that mask the signal.

For example, consider a dataset where we try to predict the rainfall for specific locations. The signal in the data would be the general trend of rainfall: rainfall increases in the winter or summer, or vice versa for other locations. The noise would be the slight variations in rainfall measurement for each month and location in our dataset.

The following graph illustrates the phenomenon of overfitting:

Figure 1.2 – Graph showing overfitting. The model has overfitted and predicted the training data perfectly but has lost the ability to generalize to the actual signal

The preceding figure shows the difference between signal and noise: each data point was sampled from the actual signal. The data follows the general pattern of the signal, with slight, random variations. We can see how the model has overfitted the data: the model has fit the training data perfectly but at the cost of generalization. We can also see that if we use the model to interpolate by predicting a value at x = 4, we get a result much higher than the actual signal (6.72 versus 6.2). Also shown is the model’s failure to extrapolate: the prediction at x = 12 is much lower than a forecast of the signal (7.98 versus 8.6).
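The following is a rough, illustrative sketch of the same phenomenon (it is not the code behind the figure, and the signal and noise values are made up): an overly flexible model is fit to noisy samples of an underlying pattern and then asked to interpolate and extrapolate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

def signal(x):
    """The underlying pattern we would like the model to learn."""
    return 0.8 * x + np.sin(x)

# Noisy training samples of the signal at x = 1..10.
x_train = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_train = signal(x_train).ravel() + rng.normal(0.0, 0.3, size=10)

# A high-degree polynomial has enough capacity to memorize the noise.
overfit_model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
overfit_model.fit(x_train, y_train)

# Interpolation (x = 4.5, inside the training range) and
# extrapolation (x = 12, outside it) can drift noticeably from the signal.
for x_new in (4.5, 12.0):
    prediction = overfit_model.predict([[x_new]])[0]
    print(f"x={x_new}: predicted {prediction:.2f}, signal {signal(x_new):.2f}")
```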

In reality, all real-world datasets contain noise. As data scientists, we aim to prepare the data to remove as much noise as possible, making the signal easier to detect. Data cleaning, normalization, feature selection, feature engineering, and regularization are techniques for removing noise from the data.

Since all real-world data contains noise, overfitting is impossible to eliminate. The following conditions may lead to overfitting:

An overly complex model: A model that is too complex for the amount of data we have utilizes its additional complexity to memorize the noise in the data, leading to overfitting

Insufficient data: If we don’t have enough training data for the model we use, it’s similar to an overly complex model, which overfits the data

Too many features: A dataset with too many features likely contains irrelevant (noisy) features that reduce the model’s generalization

Overtraining: Training the model for too long allows it to memorize the noise in the dataset

As the validation set is a part of the training data that remains unseen by the model, we use the validation set to monitor for overfitting. We can recognize the point of overfitting by looking at the training and generalization errors over time. At the point of overfitting, the validation error increases. In contrast, the training error continues to improve: the model is fitting noise in the training data and losing its ability to generalize.

Techniques that prevent overfitting usually aim to address the conditions that lead to overfitting we discussed previously. Here are some strategies to avoid overfitting:

Early stopping: We can stop training when we see the validation error beginning to increase.

Simplifying the model: A less complex model with fewer parameters would be incapable of learning the noise in the training data, thereby generalizing better.

Get more data: Either collecting more data or augmenting data is an effective method for preventing overfitting by giving the model a better chance to learn the signal in the data instead of the noise in a smaller dataset.

Feature selection and dimensionality reduction: As some features might be irrelevant to the problem being solved, we can discard features we think are redundant or use techniques such as Principal Component Analysis to reduce the dimensionality (features).

Adding regularization