Machine Learning for Imbalanced Data

Kumar Abhishek
Description

As machine learning practitioners, we often encounter imbalanced datasets in which one class has considerably fewer instances than the other. Many machine learning algorithms implicitly assume a balance between the majority and minority classes, and their performance suffers when that assumption fails. This comprehensive guide helps you address class imbalance and significantly improve model performance.
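
To see the problem concretely, here is a minimal, illustrative Python sketch (not taken from the book; the synthetic 99:1 dataset and the choice of logistic regression are assumptions made for demonstration). A plain classifier can report high accuracy while missing much of the minority class:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# A synthetic binary dataset with a 99:1 class ratio
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))       # typically very high
print("Minority recall:", recall_score(y_test, y_pred))  # typically much lower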

Machine Learning for Imbalanced Data begins by introducing you to the challenges posed by imbalanced datasets and the importance of addressing these issues. It then guides you through techniques that enhance the performance of classical machine learning models when using imbalanced data, including various sampling and cost-sensitive learning methods.
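
As a taste of those two families, the following sketch (illustrative only; it reuses the X_train and y_train variables from the snippet above) balances the training set with imbalanced-learn's SMOTE, then shows the cost-sensitive alternative of re-weighting errors via scikit-learn's class_weight parameter:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Sampling: synthesize minority-class examples until the classes are balanced
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_res))

# Cost-sensitive learning: leave the data as-is, but penalize
# minority-class errors more heavily during training
clf_weighted = LogisticRegression(class_weight="balanced").fit(X_train, y_train)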

As you progress, you’ll delve into similar and more advanced techniques for deep learning models, employing PyTorch as the primary framework. Throughout the book, hands-on examples will provide working and reproducible code that’ll demonstrate the practical implementation of each technique.
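
One recurring idea in the deep learning chapters is weighting the loss function by class frequency. A hedged sketch of the general pattern in PyTorch follows (the class counts are made up, and the random tensors stand in for a real model and batch):

import torch
import torch.nn as nn

counts = torch.tensor([9900.0, 100.0])           # per-class training counts (illustrative)
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency class weights

criterion = nn.CrossEntropyLoss(weight=weights)  # minority-class errors now cost more
logits = torch.randn(8, 2)                       # stand-in for a model's outputs
targets = torch.randint(0, 2, (8,))
loss = criterion(logits, targets)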

By the end of this book, you’ll be adept at identifying and addressing class imbalance and confidently applying various techniques, including sampling, cost-sensitive learning, and threshold adjustment, whether you are using traditional machine learning or deep learning models.
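
Threshold adjustment, for example, can be as simple as replacing the default 0.5 decision cutoff with a tuned one. A rough sketch (reusing clf and the held-out split from the first snippet; in practice you would tune on a validation set, and maximizing F1 is just one common criterion, not the book's prescribed method):

import numpy as np
from sklearn.metrics import precision_recall_curve

probs = clf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)  # guard against 0/0
best_threshold = thresholds[np.argmax(f1[:-1])]  # the final P/R pair has no threshold
y_pred_tuned = (probs >= best_threshold).astype(int)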

You can read this e-book in Legimi apps or in any app that supports the following format:

EPUB

Page count: 415

Publication year: 2023




Machine Learning for Imbalanced Data

Tackle imbalanced datasets using machine learning and deep learning techniques

Kumar Abhishek

Dr. Mounir Abdelaziz

BIRMINGHAM—MUMBAI

Machine Learning for Imbalanced Data

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Niranjan Naikwadi

Publishing Product Manager: Sanjana Gupta

Book Project Manager: Kirti Pisat

Senior Editor: Rohit Singh

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Proofreader: Safis Editing

Indexer: Pratik Shirodkar

Production Designer: Nilesh Mohite

DevRel Marketing Coordinator: Vinishka Kalra

First published: November 2023

Production reference: 2221123

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-80107-083-6

www.packtpub.com

Contributors

About the authors

Kumar Abhishek is a seasoned senior machine learning engineer at Expedia Group, US, specializing in risk analysis and fraud detection. With over a decade of machine learning and software engineering experience, Kumar has worked for companies such as Microsoft, Amazon, and a Bay Area start-up. Kumar holds a master’s degree in computer science from the University of Florida, Gainesville.

To my incredible wife who has been my rock and constant source of inspiration, our adorable son who fills our lives with joy, my wonderful parents for their unwavering support, and my close friends. Immense thanks to Christian, who has been a pivotal mentor and guide, for his meticulous reviews. My deepest gratitude to my co-author, Mounir, and contributor, Anshul; their dedication and solid contributions were essential in shaping this book. Lastly, I extend my sincere appreciation to Abhiram and the Packt team for their unwavering support.

Dr. Mounir Abdelaziz is a deep learning researcher specializing in computer vision applications. He holds a Ph.D. in computer science and technology from Central South University, China. During his Ph.D. journey, he developed innovative algorithms to address practical computer vision challenges. He has also authored numerous research articles in the field of few-shot learning for image classification.

I would like to thank my family, especially my parents, for their support and encouragement. I also want to thank all the fantastic people I collaborated with, including my co-author, Packt editors, and reviewers. Without their help, writing this book wouldn’t have been possible.

Other contributor

Anshul Yadav is a software developer and trainer with a keen interest in machine learning, web development, and theoretical computer science. He likes to solve technical problems: the slinkier, the better. He has a B.Tech. degree in computer science and engineering from IIT Kanpur. Anshul loves to share the joy of learning with his audience.

About the reviewers

Christian Monson has nine years of industry experience working as a machine learning scientist specializing in Natural Language Processing (NLP) and speech recognition. For five of those years, he worked at Amazon improving the Alexa personal assistant. During the 2000s, he was a graduate student at Carnegie Mellon University and a postdoc at Oregon Health and Science University working on NLP. Christian completed his bachelor’s degree in computer science, with minors in math and physics, at Brigham Young University in 2000. In his free time, Christian creates video games and plays with his kids. Currently, he is a full-time tutor and mentor in machine learning. You can find Christian at www.aitalks.art or watch his videos at youtube.com/@_aitalks.

Abhiram Jagarlapudi is a principal software engineer with 10 years of experience in cloud computing and Artificial Intelligence (AI). At Amazon Web Services and Oracle Cloud, Abhiram was part of launching several public cloud services, later specializing in cloud AI services. He was part of a small team that built the software delivery infrastructure of Oracle Cloud, which started in 2016 and has since grown into a multi-billion-dollar business. He also designed and developed AI services for the Oracle Cloud and is passionate about applying that experience to improve and accelerate the delivery of machine learning.

Table of Contents

Preface

1

Introduction to Data Imbalance in Machine Learning

Technical requirements

Introduction to imbalanced datasets

Machine learning 101

What happens during model training?

Types of datasets and splits

Cross-validation

Common evaluation metrics

Confusion matrix

ROC

Precision-Recall curve

Relation between the ROC curve and PR curve

Challenges and considerations when dealing with imbalanced data

When can we have an imbalance in datasets?

Why can imbalanced data be a challenge?

When to not worry about data imbalance

Introduction to the imbalanced-learn library

General rules to follow

Summary

Questions

References

2

Oversampling Methods

Technical requirements

What is oversampling?

Random oversampling

Problems with random oversampling

SMOTE

How SMOTE works

Problems with SMOTE

SMOTE variants

Borderline-SMOTE

ADASYN

How ADASYN works

Categorical features and SMOTE variants (SMOTE-NC and SMOTEN)

Model performance comparison of various oversampling methods

Guidance for using various oversampling techniques

When to avoid oversampling

Oversampling in multi-class classification

Summary

Exercises

References

3

Undersampling Methods

Technical requirements

Introducing undersampling

When to avoid undersampling the majority class

Fixed versus cleaning undersampling

Undersampling approaches

Removing examples uniformly

Random UnderSampling

ClusterCentroids

Strategies for removing noisy observations

ENN, RENN, and AllKNN

Tomek links

Neighborhood Cleaning Rule

Instance hardness threshold

Strategies for removing easy observations

Condensed Nearest Neighbors

One-sided selection

Combining undersampling and oversampling

Model performance comparison

Summary

Exercises

References

4

Ensemble Methods

Technical requirements

Bagging techniques for imbalanced data

UnderBagging

OverBagging

SMOTEBagging

Comparative performance of bagging methods

Boosting techniques for imbalanced data

AdaBoost

RUSBoost, SMOTEBoost, and RAMOBoost

Ensemble of ensembles

EasyEnsemble

Comparative performance of boosting methods

Model performance comparison

Summary

Questions

References

5

Cost-Sensitive Learning

Technical requirements

The concept of Cost-Sensitive Learning

Costs and cost functions

Types of cost-sensitive learning

Difference between CSL and resampling

Problems with rebalancing techniques

Understanding costs in practice

Cost-Sensitive Learning for logistic regression

Cost-Sensitive Learning for decision trees

Cost-Sensitive Learning using scikit-learn and XGBoost models

MetaCost – making any classification model cost-sensitive

Threshold adjustment

Methods for threshold tuning

Summary

Questions

References

6

Data Imbalance in Deep Learning

Technical requirements

A brief introduction to deep learning

Neural networks

Perceptron

Activation functions

Layers

Feedforward neural networks

Training neural networks

The effect of the learning rate on data imbalance

Image processing using Convolutional Neural Networks

Text analysis using Natural Language Processing

Data imbalance in deep learning

The impact of data imbalance on deep learning models

Overview of deep learning techniques to handle data imbalance

Multi-label classification

Summary

Questions

References

7

Data-Level Deep Learning Methods

Technical requirements

Preparing the data

Creating the training loop

Sampling techniques for deep learning models

Random oversampling

Dynamic sampling

Data augmentation techniques for vision

Data-level techniques for text classification

Dataset and baseline model

Document-level augmentation

Character and word-level augmentation

Discussion of other data-level deep learning methods and their key ideas

Two-phase learning

Expansive Over-Sampling

Using generative models for oversampling

DeepSMOTE

Neural style transfer

Summary

Questions

References

8

Algorithm-Level Deep Learning Techniques

Technical requirements

Motivation for algorithm-level techniques

Weighting techniques

Using PyTorch’s weight parameter

Handling textual data

Deferred re-weighting – a minor variant of the class weighting technique

Explicit loss function modification

Focal loss

Class-balanced loss

Class-dependent temperature loss

Class-wise difficulty-balanced loss

Discussing other algorithm-based techniques

Regularization techniques

Siamese networks

Deeper neural networks

Threshold adjustment

Summary

Questions

References

9

Hybrid Deep Learning Methods

Technical requirements

Using graph machine learning for imbalanced data

Understanding graphs

Graph machine learning

Dealing with imbalanced data

Case study – the performance of XGBoost, MLP, and a GCN on an imbalanced dataset

Hard example mining

Online Hard Example Mining

Minority class incremental rectification

Utilizing the hard sample mining technique in minority class incremental rectification

Summary

Questions

References

10

Model Calibration

Technical requirements

Introduction to model calibration

Why bother with model calibration

Models with and without well-calibrated probabilities

Calibration curves or reliability plots

Brier score

Expected Calibration Error

The influence of data balancing techniques on model calibration

Plotting calibration curves for a model trained on a real-world dataset

Model calibration techniques

The calibration of model scores to account for sampling

Platt’s scaling

Isotonic regression

Choosing between Platt’s scaling and isotonic regression

Temperature scaling

Label smoothing

The impact of calibration on a model’s performance

Summary

Questions

References

Appendix

Machine Learning Pipeline in Production

Machine learning training pipeline

Inferencing (online or batch)

Assessments

Chapter 1 – Introduction to Data Imbalance in Machine Learning

Chapter 2 – Oversampling Methods

Chapter 3 – Undersampling Methods

Chapter 4 – Ensemble Methods

Chapter 5 – Cost-Sensitive Learning

Chapter 6 – Data Imbalance in Deep Learning

Chapter 7 – Data-Level Deep Learning Methods

Chapter 8 – Algorithm-Level Deep Learning Techniques

Chapter 9 – Hybrid Deep Learning Methods

Chapter 10 – Model Calibration

Index

Other Books You May Enjoy