An essential guide for tackling outliers and anomalies in machine learning and data science.
In recent years, machine learning (ML) has transformed virtually every area of research and technology, becoming one of the key tools for data scientists. Robust machine learning is a new approach to handling outliers in datasets, an often-overlooked aspect of data science. Ignoring outliers can lead to bad business decisions, incorrect medical diagnoses, wrong conclusions, or misjudged feature importance, to name just a few consequences.
Fundamentals of Robust Machine Learning offers a thorough but accessible overview of this subject by focusing on how to properly handle outliers and anomalies in datasets. There are two main approaches described in the book: using outlier-tolerant ML tools, or removing outliers before using conventional tools. Balancing theoretical foundations with practical Python code, it provides all the necessary skills to enhance the accuracy, stability and reliability of ML models.
Fundamentals of Robust Machine Learning readers will also find:
Fundamentals of Robust Machine Learning is ideal for undergraduate or graduate students in data science, machine learning, and related fields, as well as for professionals in the field looking to enhance their understanding of building models in the presence of outliers.
Resve Saleh, University of British Columbia, Vancouver, Canada
Sohaib Majzoub, University of Sharjah, Sharjah, United Arab Emirates
A. K. Md. Ehsanes Saleh, Carleton University, Ottawa, Canada
Copyright © 2025 by John Wiley & Sons, Inc. All rights reserved, including rights for text and data mining and training of artificial intelligence technologies or similar technologies.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.
The manufacturer's authorized representative according to the EU General Product Safety Regulation is Wiley‐VCH GmbH, Boschstr. 12, 69469 Weinheim, Germany, e‐mail: [email protected]
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging‐in‐Publication Data Applied for:
Hardback ISBN: 9781394294374
Cover Design: Wiley
Cover Image: © Yuichiro Chino/Getty Images, © master_art/Shutterstock
We dedicate this book to
Shahidara Saleh
Lynn Hilchie Saleh
Bayan Majzoub
Outliers are part of almost every real‐world dataset. They can occur naturally as part of the characteristics of the data being collected. They can also be due to statistical noise in the environment that might be unavoidable. More commonly, they are associated with measurement error or instrumentation error. Another source is human error, such as typographical errors or misinterpreting the measurements of a device. Extreme outliers are often referred to as anomalies. Sometimes, the true data points are referred to as inliers to distinguish them from the outliers. While outliers may represent a small portion of the dataset, their impact can be quite significant.
The machine learning and data science techniques in use today largely ignore outliers and their potentially harmful effects. For many, outliers are somewhat of a nuisance during model building and prediction. They are hard to detect in both regression and classification problems. Therefore, it is easier to ignore them and hope for the best. Alternatively, various ad hoc techniques are used to remove them from the dataset even at the risk of inadvertently removing valuable inlier data in the process. But we have reached a point in data science where these approaches are no longer viable. In fact, new methods have emerged recently with great potential to properly address outliers and they should be investigated thoroughly.
The cost of ignoring this under‐reported and often overlooked aspect of data science can be significant. In particular, outliers and anomalies in datasets may lead to inaccurate models that result in making bad business decisions, producing questionable explanations of cause‐and‐effect, arriving at the wrong conclusions, or making incorrect medical diagnoses, just to name a few. A prediction is only as good as the model on which it is based, and if the model is faulty, so goes the prediction. Even one outlier can render a model unusable if it happens to be in the wrong location. Machine learning practitioners have not yet fully embraced a class of robust techniques that would provide more reliable models and more accurate predictions than is possible with present‐day methods. Robust methods are better‐suited to data science, especially when outliers are present. The overall goal of this book is to provide the rationale and techniques for robust machine learning and then build on that material toward robust data science.
This book is a comprehensive study of outliers in datasets and how to deal with them in machine learning. It evaluates the robustness of existing methods such as linear regression using least squares and Huber's method, and binary classification using the cross‐entropy loss for logistic regression and neural networks, as well as other popular methods including k‐nearest neighbors, support vector machines, and random forest. It provides a number of new approaches using the log‐cosh loss, which plays a central role in robust machine learning. Furthermore, techniques that surgically remove outliers from datasets for both regression and classification problems are presented. The book is about the pursuit of methods and procedures that recognize the adverse effects that outliers can have on the models built by machine learning tools. It is intended to move the field toward robust data science, where the proper tools and methodologies are used to handle outliers. It introduces a number of new ideas and approaches to the theory and practice of robust machine learning and encourages readers to pursue further investigation in this field.
This book offers an interdisciplinary perspective on robust machine learning. The prerequisites are some familiarity with probability and statistics, as well as the basics of machine learning and data science. All three areas are covered in equal measure. For those who are new to the field and are looking to understand key concepts, we do provide the necessary introductory and tutorial material in each subject area at the beginning of each chapter. Readers with an undergraduate‐level knowledge of the subject matter will benefit greatly from this book.
You may have heard the phrase "regression to the mean." In this book, we discuss "regression to the median." The methods currently in use target the mean of the data to estimate the model parameters. However, the median is a better target because it is more stable in the presence of outliers. Data science should be conducted using methods that are reliable and stable, which is exactly what a median‐based approach offers. There are good reasons why we frequently hear phrases like "the median house price" or "the median household income." The median holds the key to building outlier‐tolerant models. Furthermore, robust methods offer stability and accuracy with or without outliers in the dataset.
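To see the contrast concretely, here is a minimal sketch (the numbers are made up, not taken from the book): a single extreme value drags the mean far from the bulk of the data, while the median barely moves.

```python
# One extreme outlier shifts the mean substantially but leaves the
# median essentially unchanged.
import numpy as np

data = np.array([2.1, 2.3, 2.2, 2.4, 2.0])
with_outlier = np.append(data, 50.0)   # plant one extreme outlier

print(np.mean(data), np.median(data))                  # 2.2 and 2.2
print(np.mean(with_outlier), np.median(with_outlier))  # ~10.2 and 2.25
```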
We use the term “robust machine learning” as many of the techniques originate in the field of robust statistics. The term “robust” may seem somewhat unusual and confusing to some, but it is a well‐established term in the field of statistics. It was coined in the 1950s and has been used ever since. Note that the term robust machine learning has been used in other contexts in the literature, but here we specifically refer to “outlier‐tolerant” methods.
One may wonder why robust methods have not already been incorporated into machine learning tools. This is in part due to the long history of non‐robust estimation methods in statistics and their natural migration to the machine learning community over the past two decades. Attempts to use the L1 loss function (which is robust) were not successful in the past, whereas the L2 loss (which is not robust) was much easier to understand and implement. The L2 loss is also strongly tied to the Gaussian distribution, which made it even more compelling, especially in terms of the maximum likelihood procedure. The same can be said of the cross‐entropy loss used in binary classification. Most practitioners today still employ least squares and cross‐entropy methods, neither of which is robust in the presence of outliers. We will show that the log‐cosh loss is robust, that it can be derived using maximum likelihood principles, and that it inherits all the properties required of a loss function for use in machine learning. This removes all the past reasons for not using robust methods.
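As a quick illustration (a minimal sketch, not the book's derivation), the four losses can be written in a few lines of Python and evaluated at small and large residuals. The L2 loss grows quadratically, whereas the L1, Huber, and log‐cosh losses grow only linearly for large residuals, which is the essence of their robustness. The Huber threshold used here is an assumed illustrative value.

```python
# Compare loss values at a small, medium, and large residual.
import numpy as np

def l2(r):
    return 0.5 * r**2

def l1(r):
    return np.abs(r)

def huber(r, delta=1.345):   # delta is an assumed illustrative threshold
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * (np.abs(r) - 0.5 * delta))

def log_cosh(r):
    return np.log(np.cosh(r))

for r in [0.5, 2.0, 10.0]:
    print(r, l2(r), l1(r), float(huber(r)), float(log_cosh(r)))
```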
The approach taken in this book regarding outliers is to show how to robustify existing methods and apply them to data science problems. It revisits a number of key machine learning tasks such as clustering, linear regression, logistic regression, and neural networks, and describes how they can all be robustified. It also covers the use of penalty estimators in the context of robust methods. In particular, the ridge, LASSO, and aLASSO methods are described. They are evaluated in terms of their ability to mitigate the effects of outliers. In addition, some very interesting approaches are described for variable ordering using aLASSO.
Outlier detection for regression and classification problems is addressed in detail. Previous approaches have not been able to perform this function without removing valuable data along with the outliers, and they are essentially ad hoc in nature. In this book, practical solutions are provided using robust techniques. Of note is an iterative boxplot method for linear regression and a histogram‐based method for classification problems. Anomaly detection is another form of outlier detection where the outliers are at extreme locations and represent unusual and unexpected occurrences in the dataset. Identifying such anomalies is very important in the detection of suspicious activity such as bank fraud, spam in emails, and network intrusion. The techniques to be described in this book include k‐nearest neighbors (k‐NN), DBSCAN, and Isolation Forest, as they are popular techniques in this category. Also included is a new method based on robust statistics and k‐medians clustering called MADmax, which is shown to provide better results than current methods.
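As a small taste of the anomaly detectors compared later in the book, here is a minimal sketch using scikit-learn's IsolationForest on synthetic data. This is the standard library method, not the MADmax procedure, and the contamination level and dataset are our own assumptions for illustration.

```python
# Flag anomalies in a toy 2-D dataset with Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X[:5] += 8.0                       # plant five obvious anomalies

iso = IsolationForest(contamination=0.03, random_state=0).fit(X)
labels = iso.predict(X)            # -1 = anomaly, +1 = inlier
print("flagged as anomalies:", np.where(labels == -1)[0])
```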
We wanted to write a book suitable for senior‐level (or fourth year) undergraduate and first/second‐year graduate students, that is also useful as a stand‐alone guide for researchers or practitioners in the field. As a result, there are equal parts of the theory and practice. Detailed derivations, theoretical support for the methods, as well as a substantial amount of “know‐how” and experience are part of every chapter with code segments that can be executed by the reader to improve understanding. We found that when we view existing methods through the lens of outliers, it leads to a deeper understanding of how current methods work and why they may fail. In this sense, some new knowledge will be gained in every chapter.
The programming code provided in this book is based on Python, which is the workhorse language of the machine learning community. Many libraries and utilities are available in Python. We introduce code segments for all of the techniques for regression and classification, as well as the code for outlier removal, in the form of projects at the end of each chapter. The reader would be well served to follow along with the descriptions in the book while implementing the code in Python wherever possible. This is the best way to get the most out of this book.
The book is spread over 12 chapters. Chapter 1 begins with an introduction to machine learning and the importance of considering outliers in both regression and classification problems. Then, the k‐means clustering algorithm is described and transformed into a k‐medians algorithm as an example of the robustification of an existing method. Chapter 2 covers the L1 and L2 loss functions and describes a combination of the two called Huber's method. Chapter 3 provides a detailed analysis of the log‐cosh loss function. Chapter 4 discusses outlier detection, metrics, and standardization. Chapters 5 and 6 address robust penalty estimation using ridge, LASSO, and aLASSO. Chapter 7 introduces a flexible log‐cosh loss function for quantile regression. Chapters 8–10 address the binary and multi‐class classification problems and the robustness of various methods for these tasks. Chapter 11 addresses the important topic of anomaly detection. Finally, Chapter 12 concludes with applications of robust methods to some well‐known data science problems.
A suitable undergraduate course in robust machine learning would consist of Chapters 1–6 and a graduate course would consist of Chapters 7–12. For practitioners, we recommend Chapters 1, 4, and 8–12. For researchers, we recommend Chapters 1–3, 5–7, and 12. It is our hope that readers will find something of value in each chapter, and that it will lead to many fruitful areas of future research and development.
Professor R. Saleh wishes to thank his parents, Dr. Ehsanes Saleh and Mrs. Shahidara Saleh, for their love and support. He also acknowledges the encouragement and support from Lynn Saleh, Jody Fast, Isme Alam, and Raihan Saleh. Special thanks to Dr. Mina Norouzirad for extensive reviews of all chapters and to Paolo Pignatelli for his contributions during the writing of this book.
Professor S. Majzoub is immensely grateful to his parents, Dr. Samir Majzoub and Mrs. Zeinab Khalidi, for their unwavering support and guidance. Heartfelt appreciation is expressed to his wife, Bayan Almogharbel, and children, Baraa, Asma, and Ibrahim, whose love and encouragement are cherished. Thanks are given to Haitham Tayyar, Mulhim Aldoori, Mottaqiallah Taouil, and Said Hamdioui for their invaluable assistance and encouragement throughout this journey.
Professor A.K. Md. E. Saleh is grateful to NSERC for supporting his research for more than four decades. He is grateful to his loving wife, Shahidara Saleh, for her support over 70 years of marriage. He also appreciates his grandchildren Jasmine Alam, Sarah Alam, Migel Saleh, and Malique Saleh for their loving care.
April 2025
Resve Saleh
Vancouver, Canada
Sohaib Majzoub
Sharjah, UAE
A.K. Md. Ehsanes Saleh
Ottawa, Canada
This book is accompanied by a companion website:
www.wiley.com/go/HandlingOutliersandAnomaliesinDataScience1e
The website includes:
Student only: Python problems for students
Instructor only: Answers
The field of machine learning (ML) has been advancing very rapidly since 2010, when it gained momentum in both academia and industry. In the intervening years, many innovative ideas have been implemented and applied to some very challenging data science problems with great success. Underpinning the work in these areas are the mathematical foundations within the field of probability and statistics. The relationships between the three areas of data science, machine learning, and probability and statistics are conceptualized in Figure 1.1. The interdependence of the three disciplines is perhaps the most important point to note here. Data science relies heavily on probability and statistics and on the machine learning tools that build the models for a particular application domain. Applications drive the investigation of the type of mathematics and machine learning capabilities that need to be developed. Likewise, machine learning relies on the foundations of probability and statistics, and on the requirements of the applications, to develop the necessary tools. There is a symbiotic relationship between these three disciplines. This book places a strong emphasis on all three areas in equal measure to introduce the ideas and concepts to be described. As such, some familiarity with one of these three areas will be helpful in navigating through the material to be presented.
This book is about robust machine learning. It addresses an under‐reported and often overlooked aspect of data science which is the effect of outliers on the model building process. A prediction is only as good as the model on which it is based, and if the model is faulty, so is the prediction. In particular, outliers and anomalies in datasets can lead to inaccurate models, resulting in bad business decisions, questionable explanations of cause‐and‐effect, incorrect conclusions, or inaccurate medical diagnoses, just to name a few. Even one outlier can render a model unusable if it happens to be in the wrong place. Robust methods are better‐suited to data science, especially when outliers are present. ML practitioners have not yet fully embraced a class of robust techniques that would provide more reliable models and more accurate predictions than is possible with present‐day methods. The overall goal of this book is to provide the rationale and techniques for robust ML and then build on that material toward robust data science.
The topics to be covered in this book span a wide variety of subjects and there is substantial ground to cover to gain deep knowledge in this field. This opening chapter covers introductory material and a number of basic concepts in machine learning. Initially, descriptions of robust machine learning and robust data science are provided as a backdrop for the material to be presented in the rest of the book. Then, we focus on robustifying one ML algorithm. There are several options to choose from in this initial chapter to demonstrate the value of using robust methods. Among the options, the one selected is simple and yet provides a clear picture of the power of robust methods in general. Since clustering is one of the well‐known applications of machine learning, it will be used as the first vehicle to understand the way that robust methods can be applied to handle outliers in datasets. It will also serve as a template for the rest of the book.
Figure 1.1 Synergistic relationships between data science, machine learning, and probability and statistics.
Outliers are typically a small subset of data that deviates from the majority of the data within the same dataset. Sometimes the true data points are referred to as inliers to distinguish them from the outliers. The sources of outliers can vary significantly. Outliers can occur naturally as part of the characteristics of the data being collected. Outliers can also be due to significant noise in the environment that might be unavoidable. More commonly, they are associated with measurement error or instrumentation error. Another source is human error, such as typographical errors or perhaps misinterpreting the units of measure of a device. If there are large variances in the features of a dataset and only a small amount of data, some of the data may be declared as outliers. At other times, outliers are simply an inherent part of the data collection process. Unusual or extreme outliers are often referred to as anomalies.
There are basically two options for handling outliers. The first is to build machine learning tools that are tolerant to outliers. These are the robust machine learning tools. This involves understanding how the current ML tools work and then seeking an alternative approach that is more effective in the presence of outliers. Conceptually, the current approaches in ML seek the mean of the data to build a model. Robust methods seek the median instead, which makes the models much more stable and reliable. This is one of the main thrusts of the book. The problems surrounding the use of the mean will be presented, followed by the advantages of using the median. More generally, the first approach is to convert a non‐robust machine learning method into a robust one. Many of the existing techniques are not robust, and those should be identified and updated accordingly. However, if a method is inherently robust, it should be preferred over its non‐robust counterpart.
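As a preview of the clustering example developed later in this chapter, the following minimal sketch illustrates the robustification idea: k‐means updates each cluster center with the mean of its points, whereas k‐medians updates it with the coordinate‐wise median. The L1 (Manhattan) assignment step and other details here are simplifying assumptions for illustration, not necessarily the exact algorithm used in the book.

```python
# A bare-bones k-medians clustering loop (illustrative sketch).
import numpy as np

def k_medians(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each point to the nearest center (L1 distance)
        d = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        # update each center as the coordinate-wise median of its cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = np.median(X[labels == j], axis=0)
    return centers, labels
```

Replacing the median update with a mean update recovers ordinary k‐means; the median version is far less sensitive to a few extreme points landing in a cluster.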
A second option is to detect and remove outliers and then apply the current ML tools or extract outliers from a dataset for further inspection. In current practice, there is a large investment in non‐robust methodologies that are hard to change or replace, so it is easier to find ways to eliminate outliers and use the existing infrastructure. Since robust methods are not widely used in Python (or well‐understood), it seems easier to remove outliers rather than build a new infrastructure based on robust methods. However, as appealing as it may sound, it is not straightforward and requires an extra level of care to ensure that inliers are not inadvertently removed with outliers. There are other cases where the goal of the project is to find the outliers or anomalies in the dataset rather than to build a model. The detected outliers would be inspected to learn something about the anomalies or any unusual aspects of the data being collected. And finally, there are statistical tools and procedures that rely on the data being outlier‐free in order to produce meaningful results (e.g. correlation coefficients and analysis of variance). Hence, the secondary thrust of the book is to describe efficient and effective methods to detect and remove both outliers and anomalies.
With the background and preliminaries covered, we will now move to the main thrust of this book which is the study of standard machine learning techniques and their robust counterparts. Table 1.1 provides a roadmap of the topics to be described and their associated chapters.
Table 1.1 Topics, chapters, and descriptions.
Topic | Chapter | Description
Robust Clustering | 1 | k‐means clustering algorithm; robust k‐medians clustering algorithm
Robust Linear Regression | 2 | L1, L2, and Huber loss functions
| 3 | Log‐cosh loss function
| 4 | Outlier detection, metrics (MSE, MAE), robust standardization
Robust Models via Regularization | 5 | Penalty functions (ridge, LASSO, aLASSO)
| 6 | Model generalization, model complexity
Quantile Regression | 7 | Replacing the quantile check function with a flexible log‐cosh function
Robust Binary Classification | 8 | Logistic regression; cross‐entropy vs. log‐cosh loss functions, SVM, kNN, random forest
Neural Networks for Binary Classification and Classification Metrics | 9 | Activation functions (ReLU, sigmoid), training, backpropagation, gradient descent, cross‐entropy vs. log‐cosh loss functions; recall, precision, F1 score, AUROC
Multi‐class Classification and Optimization | 10 | Categorical loss, softmax activation; Modified National Institute of Standards and Technology (MNIST) dataset, Adam optimization
Anomaly Detection and Evaluation Metrics | 11 | kNN, Isolation Forest, density‐based spatial clustering of applications with noise (DBSCAN), MADmax, precision, recall, AUROC
Application to Data Science and Artificial Intelligence (AI) | 12 | Boston housing and Titanic datasets; climate change time series (ARIMA); explainable AI (XAI)
Machine learning involves the development of software that analyzes data and produces predictive models based on algorithms and mathematics developed by computer scientists, engineers, and statisticians. One can view it as the programming portion of the problem to build software tools and flows to process data in furtherance of some specific data science objective. Most of the existing techniques are essentially based on finding the mean of the data. Robust methods to be described in this book are based on finding the median of the data. Thus far, robust methods have not seen widespread use due to their somewhat mysterious nature. These techniques borrow heavily from robust statistics (see Maronna et al. 2019) which is a well‐known and long‐standing field in its own right. However, robust methods have only been of recent interest in machine learning. The objective of this chapter, and indeed the whole book, is to demystify robust methods so that more data scientists can gain access to such methods when the need arises.
Truth be told, most realistic datasets have outliers in one form or another. Some are harmless relative to the rest of the data while others may cause significant errors, especially during inference or prediction. The problem is that it is difficult to know a priori which case you are dealing with. You may not be able to tell if the results are reliable unless additional steps are taken to ensure reliability. If, for example, the outliers could be removed in some effective manner, then the traditional methods would work quite well. However, outlier detection and removal are prone to error, so this may not be a reliable approach either, and it is often considered to be tampering with the data. Unless you can successfully diagnose and detect outliers in large datasets, the models generated by standard ML methods on these modified datasets should be viewed as suspect. The more prudent approach is to employ machine learning tools that use robust procedures wherever possible to avoid the problems altogether.
Generally speaking, robust ML is the development of tools, procedures, and methodologies that are tolerant to outliers. In its most simplistic form, it implies the use of the median (or some neighboring quantile) rather than the mean thereby producing outlier‐tolerant models. It also includes the detection and removal of outliers using robust ML. Of course, robust methods introduce a set of different issues that must be addressed before they can be used effectively. As we will show throughout this book, the advantages of robust ML greatly outweigh the disadvantages. Our goal is to provide side‐by‐side comparisons of traditional methods with robust methods so that the reader can make their own judgments about the value of robust ML.
Before launching into the key ideas of this book, it is useful to provide some background material on ML and robust methods. ML grew out of the computer science field in the 1980s but went dormant for a time in the 1990s, during which it was viewed as a dead‐end field. It experienced a reawakening around 2000 and a significant resurgence around 2010. It continues to gain traction and momentum to this day with the advent of generative AI. Of course, in the 1980s, computers were relatively slow, datasets were not readily available, and algorithms for ML were in their infancy. Fast forward a few decades and we find the landscape has changed dramatically. Today, access to high‐speed computing is at everyone's fingertips, data is ubiquitous, and many advanced ML algorithms have been developed, implemented, and tested. They are now widely available through open‐access online mechanisms.
Some of the algorithms in ML are grounded in statistical theory, while others are based on nonparametric methods or a set of heuristics. We will study methods that fall into all these categories. There are three main varieties of ML tools: supervised learning, unsupervised learning, and reinforcement learning.1 We focus only on the supervised and unsupervised learning methods in this book. To understand these two methods, consider a dataset given by a matrix X, where n is the number of observations (rows in the dataset) and p is the number of variables (columns in the dataset). The "ground truth" or true responses associated with each observation is optionally provided in a vector y. To contrast the two categories, supervised ML requires both X and y, while unsupervised ML requires only X.
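A minimal sketch of this X and y convention, using a bundled scikit-learn dataset purely for illustration (the dataset choice is our own, not the book's):

```python
# X is an (n x p) matrix of observations; y is the vector of true responses.
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
n, p = X.shape
print(n, p, y.shape)   # supervised methods need both X and y;
                       # unsupervised methods (e.g. clustering) use X alone
```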
For example, linear regression, logistic regression, neural networks, k‐nearest neighbors, tree‐based methods, and so on fall into the supervised category. In this book, we focus on the robustness aspects of these methods. On the other hand, data clustering and anomaly detection are examples of unsupervised learning methods. We will discuss robust clustering in this chapter as a way of introducing robust concepts and their value in machine learning. Anomaly detection will be covered in Chapter 11.
For supervised learning, we mainly consider two types of problems: regression and classification. Figure 1.2 shows the two cases graphically. On the left side, we show a set of points that are fitted with a line, where the x‐axis is the predictor variable and the y‐axis is the count observed at each value of the predictor. Fitting a line to the data is called linear regression. The ML objective in this case is to use the data to find the slope and intercept of the line that "best fits" the given points shown in the plot. The best fit is based on some objective function used during optimization. We are, in effect, training the model to learn about the dataset. Once the estimates of these parameters are obtained, predictions can be made for values that are not in the original dataset. For example, we can now test our model by asking what the expected count would be at a predictor value that does not appear in the dataset, and read the answer off the fitted line. Building a prediction model is the essence of machine learning.
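A minimal sketch of this workflow, with made-up numbers rather than the data in Figure 1.2, using scikit-learn's LinearRegression:

```python
# Fit a line by least squares, then predict at a value not in the dataset.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # predictor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.9])             # observed responses

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # estimated slope and intercept
print(model.predict([[6.0]]))             # prediction at a new value
```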
Figure 1.2 Linear regression vs. linear classification.
The figure on the right shows two sets of points, represented using the symbols "o" and "x," respectively. Each symbol represents a distinct class. The ML objective here is to establish a line that best separates the "o" points from the "x" points. Finding the dividing line between the two classes is referred to as linear classification. The data is linearly separable in this case, and the separating line does a good job of defining a decision boundary between the two sets of points. We say that the model was trained using the given data, which actually means that we optimized an objective function to produce the slope and intercept parameters. We can now ask another question: if a new point is introduced on the "x" side of the boundary, what class does it belong to? The answer would come back as "it belongs to the 'x' class." A large portion of the supervised machine learning tools involve this type of classification.
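A minimal sketch with made-up, linearly separable points (not the data in Figure 1.2): learn a linear decision boundary and classify a new point on the "x" side.

```python
# Train a linear classifier and predict the class of a new point.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 1], [1, 2], [2, 1],     # "o" class (label 0)
              [4, 4], [4, 5], [5, 4]])    # "x" class (label 1)
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[4.5, 4.5]]))          # expected: label 1, the "x" class
```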
Now consider the effect of outliers on the two cases above. This is shown in Figure 1.3. On the left side, there is one new data point which is an outlier. The effect on traditional linear regression is to shift the line and make it more horizontal than in Figure 1.2. Now, if we ask the same question about the expected count at the new predictor value, the answer comes back noticeably different from before and is no longer accurate. Evidently, the one outlier has produced misleading results. If used to make a business decision, this outlier may prove to be very costly.
Figure 1.3 Impact of outliers in regression and classification.
In the second case, on the right side of Figure 1.3, we place an outlier by adding an "o" on the left side of the boundary, where members of the "x" class reside. The effect of this outlier is to shift the line toward it, relative to the corresponding one in Figure 1.2. The outlier should not affect the boundary, but it does. When we again ask what class the new point from the earlier example belongs to, the answer would come back as "it belongs to the 'o' class." Of course, this is incorrect. It belongs to the "x" class, as we saw earlier. The outlier has moved the decision boundary, thereby changing the predicted response. As before, any business decision made using these results may be costly.
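A minimal sketch of the regression effect in Figure 1.3 (left): a single extreme outlier drags the least‐squares line, while a robust estimator is far less affected. HuberRegressor is used here only as a readily available stand‐in for the robust methods developed later in the book, and the data are made up.

```python
# Compare ordinary least squares with a robust estimator when one
# extreme outlier is planted in otherwise clean data.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0        # true line: slope 2, intercept 1
y[-1] = -30.0                    # plant one extreme outlier

print("OLS slope:  ", LinearRegression().fit(X, y).coef_[0])  # pulled away from 2
print("Huber slope:", HuberRegressor().fit(X, y).coef_[0])    # stays near 2
```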
Ideally, we would like to obtain the results shown in Figure 1.2 given the data of Figure 1.3. Robust methods offer this possibility along with the methodologies to achieve this outcome. Robust ML involves methods that are suitable for datasets with or without outliers. They produce models as if the data were outlier‐free. This is the key benefit of using robust methods. You do not need to wonder what the effect of outliers may be in the dataset or where they are located. You simply use this class of methods on all problems without any concerns about outliers. In some applications, you may be interested mainly in detecting and removing the outliers. This issue is also addressed by robust methods, as detailed in this book.
Data science involves a set of application‐specific tasks that focus on creating a coherent dataset from a large number of observations, performing thorough exploratory data analysis (EDA), and carefully interpreting results obtained from machine learning tools and statistical analysis. It is a rather expansive field with many layers of depth and a wide variety of applications. We will briefly describe some of these layers and emphasize the key aspects of this discipline. If we step back and look at statistics and machine learning from a higher perspective, we can understand their roles in data science. Figure 1.4 shows the traditional view. We start with the ground truth in the form of a probability distribution (assumed to be the true distribution of a population). By sampling this distribution, we generate a dataset. Once we have a dataset, then traditional statistics/ML seeks to find a model that best approximates the ground truth (i.e. the original distribution). From this model, we can do whatever we want, such as making predictions. Unfortunately, this is an idealized view of the world of data science.
Figure 1.4 Traditional view of the model building process.
In the real world, there is noise that will add outlier observations to our dataset. Consider a slightly modified version of Figure 1.4, as shown in Figure 1.5.