Guides professionals and students through the rapidly growing field of machine learning with hands-on examples in the popular R programming language.

Machine learning--a branch of artificial intelligence (AI) that enables computers to improve their results and learn new approaches without explicit instructions--allows organizations to reveal patterns in their data and incorporate predictive analytics into their decision-making process. Practical Machine Learning in R provides a hands-on approach to solving business problems with intelligent, self-learning computer algorithms. Bestselling authors and data analytics experts Fred Nwanganga and Mike Chapple explain what machine learning is, demonstrate its organizational benefits, and provide hands-on examples created in the R programming language.

A perfect guide for self-taught professionals or students in an introductory machine learning course, this reader-friendly book illustrates the numerous real-world business uses of machine learning approaches. Clear and detailed chapters cover data wrangling, R programming with the popular RStudio tool, classification and regression techniques, performance evaluation, and more.

* Explores data management techniques, including data collection, exploration, and dimensionality reduction
* Covers unsupervised learning, where readers identify and summarize patterns using approaches such as apriori, eclat, and clustering
* Describes the principles behind the Nearest Neighbor, Decision Tree, and Naïve Bayes classification techniques
* Explains how to evaluate and choose the right model, as well as how to improve model performance using ensemble methods such as Random Forest and XGBoost

Practical Machine Learning in R is a must-have guide for business analysts, data scientists, and other professionals interested in leveraging the power of AI to solve business problems, as well as students and independent learners seeking to enter the field.
Page count: 654
Year of publication: 2020
Cover
Introduction
WHAT DOES THIS BOOK COVER?
READER SUPPORT FOR THIS BOOK
PART I: Getting Started
Chapter 1: What Is Machine Learning?
DISCOVERING KNOWLEDGE IN DATA
MACHINE LEARNING TECHNIQUES
MODEL SELECTION
MODEL EVALUATION
EXERCISES
Chapter 2: Introduction to R and RStudio
WELCOME TO R
R AND RSTUDIO COMPONENTS
WRITING AND RUNNING AN R SCRIPT
DATA TYPES IN R
EXERCISES
Chapter 3: Managing Data
THE TIDYVERSE
DATA COLLECTION
DATA EXPLORATION
DATA PREPARATION
EXERCISES
PART II: Regression
Chapter 4: Linear Regression
BICYCLE RENTALS AND REGRESSION
RELATIONSHIPS BETWEEN VARIABLES
SIMPLE LINEAR REGRESSION
MULTIPLE LINEAR REGRESSION
CASE STUDY: PREDICTING BLOOD PRESSURE
EXERCISES
Chapter 5: Logistic Regression
PROSPECTING FOR POTENTIAL DONORS
CLASSIFICATION
LOGISTIC REGRESSION
CASE STUDY: INCOME PREDICTION
EXERCISES
PART III: Classification
Chapter 6: k-Nearest Neighbors
DETECTING HEART DISEASE
k-NEAREST NEIGHBORS
CASE STUDY: REVISITING THE DONOR DATASET
EXERCISES
Chapter 7: Naïve Bayes
CLASSIFYING SPAM EMAIL
NAÏVE BAYES
CASE STUDY: REVISITING THE HEART DISEASE DETECTION PROBLEM
EXERCISES
Chapter 8: Decision Trees
PREDICTING BUILD PERMIT DECISIONS
DECISION TREES
CASE STUDY: REVISITING THE INCOME PREDICTION PROBLEM
EXERCISES
PART IV: Evaluating and Improving Performance
Chapter 9: Evaluating Performance
ESTIMATING FUTURE PERFORMANCE
BEYOND PREDICTIVE ACCURACY
VISUALIZING MODEL PERFORMANCE
EXERCISES
Chapter 10: Improving Performance
PARAMETER TUNING
ENSEMBLE METHODS
EXERCISES
PART V: Unsupervised Learning
Chapter 11: Discovering Patterns with Association Rules
MARKET BASKET ANALYSIS
ASSOCIATION RULES
DISCOVERING ASSOCIATION RULES
CASE STUDY: IDENTIFYING GROCERY PURCHASE PATTERNS
EXERCISES
NOTES
Chapter 12: Grouping Data with Clustering
CLUSTERING
k-MEANS CLUSTERING
SEGMENTING COLLEGES WITH k-MEANS CLUSTERING
CASE STUDY: SEGMENTING SHOPPING MALL CUSTOMERS
EXERCISES
NOTE
Index
End User License Agreement
Chapter 4
Table 4.1 Changes in Windspeed and Humidity Produce Significant Variations in...
Chapter 7
Table 7.1 Sparse Matrix from Two Sample Messages
Chapter 1
Figure 1.1 Algorithm for crossing the street
Figure 1.2 The relationship between artificial intelligence, machine learnin...
Figure 1.3 Generic supervised learning model
Figure 1.4 Making predictions with a supervised learning model
Figure 1.5 Using machine learning to classify car dealership customers
Figure 1.6 Dataset of past customer loan repayment behavior
Figure 1.7 Applying the machine learning model
Figure 1.8 Strategically placing items in a grocery store based on unsupervi...
Figure 1.9 Error types
Figure 1.10 Residual error
Figure 1.11 The bias/variance trade-off
Figure 1.12 Underfitting, overfitting, and optimal fit
Figure 1.13 Holdout method
Figure 1.14 Cross-validation method
Chapter 2
Figure 2.1 Growth of the number of CRAN packages over time
Figure 2.2 Comprehensive R Archive Network (CRAN) mirror site
Figure 2.3 RStudio Desktop offers an IDE for Windows, Mac, and Linux systems...
Figure 2.4 RStudio Server provides a web-based IDE for collaborative use.
Figure 2.5 RStudio Desktop without a script open
Figure 2.6 RStudio Desktop with the console pane highlighted
Figure 2.7 Console pane executing several simple R commands
Figure 2.8 Accessing the Mac terminal in RStudio
Figure 2.9 Chick weight script inside the RStudio IDE
Figure 2.10 Graph produced by the chick weight script
Figure 2.11 Chick weight script inside a text editor
Figure 2.12 RStudio environment pane populated with data
Figure 2.13 RStudio History pane showing previously executed commands
Figure 2.14 The Files tab in RStudio allows you to interact with your device...
Figure 2.15 The Packages tab in RStudio allows you to view and manage the pa...
Figure 2.16 The Help tab in RStudio displaying documentation for the insta...
Figure 2.17 Hadley Wickham on the distinction between packages and libraries...
Figure 2.18 RStudio displaying the programming vignette from the dplyr p...
Figure 2.19 The Run button in RStudio runs the current section of code.
Figure 2.20 The Source button in RStudio runs the entire script.
Chapter 3
Figure 3.1 Simple spreadsheet containing data in tabular form
Figure 3.2 CSV file containing the same data as the spreadsheet in Figure 3....
Figure 3.3 TSV file containing the same data as the spreadsheet in Figure 3....
Figure 3.4 Pipe-delimited file containing the same data as the spreadsheet i...
Figure 3.5 Sample dataset illustrating the instances and features (independe...
Figure 3.6 Box plot of CO2 emissions by vehicle class
Figure 3.7 Scatterplot of CO2 emissions versus city gas mileage
Figure 3.8 Histogram of CO2 emissions
Figure 3.9 Stacked bar chart of drive type composition by year
Figure 3.10 Illustration of the smoothing by clustering approach, on 14 inst...
Figure 3.11 Illustration of the smoothing by regression approach on 14 insta...
Chapter 4
Figure 4.1 Scatterplots illustrating the relationship between the dependent ...
Figure 4.2 Estimated regression line and actual values for a sample (n=20) o...
Figure 4.3 For our regression line, the differences between each actual valu...
Figure 4.4 (a) Residual histogram showing normality of residuals, (b) residu...
Figure 4.5 Residual versus fitted value plots illustrating heteroscedasticit...
Figure 4.6 Cook's Distance chart showing the influential points in the bikes...
Figure 4.7 Linear regression fit for each of the predictor variables (humidi...
Figure 4.8 The systolic blood pressure data for this population appears to b...
Figure 4.9 Distributions of dependent variables in the health dataset
Figure 4.10 Histogram of residuals produced using the ols_plot_resid_hist(...
Figure 4.11 Scatterplot of residuals produced using the ols_plot_resid_fit()
Figure 4.12 Cook's distance chart for the health dataset produced using the
Chapter 5
Figure 5.1 Fitted line for probability of respondedMailing using a straight...
Figure 5.2 Histogram showing the distribution of values for the mailOrderPur...
Figure 5.3 Histogram showing the distribution of values for the mailOrderPur...
Figure 5.4 Correlation matrix of the numeric features in the donors dataset...
Chapter 6
Figure 6.1 Scatterplot of age versus cholesterol levels for a sampling of 20...
Figure 6.2 Scatterplot of age versus cholesterol levels for a sampling of 20...
Figure 6.3 The impact of a large value for k (a) and a small value for k (b)...
Figure 6.4 The predictive accuracy of our model for values of k-nearest neig...
Chapter 8
Figure 8.1 Structure of a decision tree
Figure 8.2 Scatterplot of annual income versus loan amount for 30 commercial...
Figure 8.3 Bank customers partitioned on loan amount of less than or more th...
Figure 8.4 Bank customers partitioned on loan amount of less than or more th...
Figure 8.5 Decision tree of bank customers based on the loan amount and annu...
Figure 8.6 Candidate features for splitting the partition of customers who b...
Figure 8.7 Visualization of a decision tree model using the rpart.plot() fun...
Figure 8.8 Classification tree to predict customer income level
Chapter 9
Figure 9.1 Model build and evaluation process using all of the observed data...
Figure 9.2 Model build and evaluation process using subsets of the observed ...
Figure 9.3 Model build and evaluation process using the training and validat...
Figure 9.4 The k-fold cross-validation approach with k=5 (5-fold cross valid...
Figure 9.5 The leave-one-out cross-validation approach (LOOCV). A set of n e...
Figure 9.6 The random cross-validation approach. The training and validation...
Figure 9.7 The bootstrap sampling approach. The training set is created by r...
Figure 9.8 A sample confusion matrix showing actual versus predicted values...
Figure 9.9 Spam filter confusion matrix
Figure 9.10 (a) Precision as a measure of model performance based on (b) the...
Figure 9.11 (a) Recall as a measure of model performance based on (b) the sp...
Figure 9.12 (a) Sensitivity as a measure of model performance based on (b) t...
Figure 9.13 (a) Specificity as a measure of model performance based on (b) t...
Figure 9.14 The ROC curve for a sample classifier
Figure 9.15 The ROC curve for a sample classifier, a perfect classifier, and...
Figure 9.16 ROC curve for the spam filter example generated with R
Figure 9.17 ROC curve for two classifiers with similar AUC values
Figure 9.18 ROC curve for three different classifiers
Chapter 10
Figure 10.1 The grid search process showing eight models with different para...
Figure 10.2 Tunable parameters supported by the caret package for the rpart...
Figure 10.3 The bagging ensemble features independently trained homogenous m...
Figure 10.4 The boosting ensemble features a linear sequence of homogenous m...
Figure 10.5 The stacking ensemble features independently trained heterogeneo...
Chapter 11
Figure 11.1 Sample market basket dataset showing five different transactions...
Figure 11.2 An association rule describing that whenever both beer and milk ...
Figure 11.3 All possible itemsets (itemset lattice) derived from items A, B,...
Chapter 12
Figure 12.1 Simulated dataset showing previously unlabeled items (a). The sa...
Figure 12.2 Hierarchical versus partitional clustering
Figure 12.3 Overlapping versus exclusive clustering
Figure 12.4 Complete versus partial clustering
Figure 12.5 The initial centroids are randomly chosen (a), and every item is...
Figure 12.6 New cluster centers are chosen (a); then each item is re-assigne...
Figure 12.7 During the next iteration, new cluster centers are chosen again ...
Figure 12.8 The change in cluster center (a) did not result in change in clu...
Figure 12.9 Visualization of the three clusters created for Colleges in Mary...
Figure 12.10 The elbow method
Figure 12.11 Determining the appropriate number of clusters using the elbow ...
Figure 12.12 Determining the appropriate number of clusters using the averag...
Figure 12.13 Determining the appropriate number of clusters using the gap st...
Figure 12.14 Visualization of the colleges in Maryland segmented into four c...
Figure 12.15 All three statistical methods for determining the optimal numbe...
Figure 12.16 Shopping mall customers segmented into six clusters based on th...
FRED NWANGANGA
MIKE CHAPPLE
Machine learning is changing the world. Every organization, large and small, seeks to extract knowledge from the massive amounts of information that they store and process on a daily basis. The tantalizing desire to predict the future drives the work of business analysts and data scientists in fields ranging from marketing to healthcare. Our goal with this book is to make the tools of analytics approachable for a broad audience.
The R programming language is a special-purpose language designed to facilitate statistical analysis and machine learning. We chose it for this book not only because of its strong popularity in the field but also because of its intuitive nature, particularly for individuals approaching it as their first programming language.
There are many books on the market that cover practical applications of machine learning, designed for businesspeople and onlookers. Likewise, there are many deeply technical resources that dive into the mathematics and computer science of machine learning. In this book, we strive to bridge these two worlds. We attempt to bring the reader an intuitive introduction to machine learning with an eye on the practical applications of machine learning in today's world. At the same time, we don't shy away from code. As we do in our undergraduate and graduate courses, we seek to make the R programming language accessible to everyone. Our hope is that you will read this book with your laptop open next to you, following along with our examples and trying your hand at the exercises.
Best of luck as you begin your machine learning adventure!
This book provides an introduction to machine learning using the R programming language.
Chapter 1: What Is Machine Learning?
This chapter introduces the world of machine learning and describes how machine learning allows the discovery of knowledge in data. In this chapter, we explain the differences between unsupervised learning, supervised learning, and reinforcement learning. We describe the differences between classification and regression problems and explain how to measure the effectiveness of machine learning algorithms.
Chapter 2: Introduction to R and RStudio
In this chapter, we introduce the R programming language and the toolset that we will be using throughout the rest of the book. We approach R from the beginner's mind-set, explain the use of the RStudio integrated development environment, and walk readers through the creation and execution of their first R scripts. We also explain the use of packages to redistribute R code and the use of different data types in R.
Chapter 3: Managing Data
This chapter introduces readers to the concepts of data management and the use of R to collect and manage data. We introduce the tidyverse, a collection of R packages designed to facilitate the analytics process, and we describe different approaches to describing and visualizing data in R. We also cover how to clean, transform, and reduce data to prepare it for machine learning.
Chapter 4: Linear Regression
In this chapter, we dive into the world of supervised machine learning as we explore linear regression. We explain the underlying statistical principles behind regression and demonstrate how to fit simple and complex regression models in R. We also explain how to evaluate, interpret, and apply the results of regression models.
Chapter 5: Logistic Regression
While linear regression is suitable for problems that require the prediction of numeric values, it is not well-suited to categorical predictions. In this chapter, we describe logistic regression, a categorical prediction technique. We discuss the use of generalized linear models and describe how to build logistic regression models in R. We also explain how to evaluate, interpret, and improve upon the results of a logistic regression model.
Chapter 6: k-Nearest Neighbors
The k-nearest neighbors technique allows us to predict the classification of a data point based on the classifications of other, similar data points. In this chapter, we describe how the k-NN process works and demonstrate how to build a k-NN model in R. We also show how to apply that model, making predictions about the classifications of new data points.
Chapter 7: Naïve Bayes
The naïve Bayes approach to classification uses a table of probabilities to predict the likelihood that an instance belongs to a particular class. In this chapter, we discuss the concepts of joint and conditional probability and describe how the Bayes classification approach functions. We demonstrate building a naïve Bayes classifier in R and use it to make predictions about previously unseen data.
Chapter 8: Decision Trees
Decision trees are a popular modeling technique because they produce intuitive results. In this chapter, we describe the creation and interpretation of decision tree models. We also explain the process of growing a tree in R and using pruning to increase the generalizability of that model.
Chapter 9: Evaluating Performance
No modeling technique is perfect. Each has its own strengths and weaknesses and brings different predictive power to different types of problems. In this chapter, we discuss the process of evaluating model performance. We introduce resampling techniques and explain how they can be used to estimate the future performance of a model. We also demonstrate how to visualize and evaluate model performance in R.
Chapter 10: Improving Performance
Once we have tools to evaluate the performance of a model, we can then apply them to help improve model performance. In this chapter, we look at techniques for tuning machine learning models. We also demonstrate how we can enhance our predictive power by simultaneously harnessing the predictive capability of multiple models.
Chapter 11: Discovering Patterns with Association Rules
Association rules help us discover patterns that exist within a dataset. In this chapter, we introduce the association rules approach and demonstrate how to generate association rules from a dataset in R. We also explain ways to evaluate and quantify the strength of association rules.
Chapter 12: Grouping Data with Clustering
Clustering is an unsupervised learning technique that groups items based on their similarity to each other. In this chapter, we explain the way that the k-means clustering algorithm segments data and demonstrate the use of k-means clustering in R.
In order to make the most of this book, we encourage you to make use of the student and instructor materials made available on the companion site. We also encourage you to provide us with meaningful feedback on ways in which we could improve the book.
As you work through the examples in this book, you may choose either to type in all the code manually or to use the source code files that accompany the book. If you choose to follow along with the examples, you will also want to use the same datasets we use throughout the book. All the source code and datasets used in this book are available for download from www.wiley.com/go/pmlr.
If you believe you've found a mistake in this book, please bring it to our attention. At John Wiley & Sons, we understand how important it is to provide our customers with accurate content, but even with our best efforts an error may occur.
To submit your possible errata, please email it to our customer service team at [email protected] with the subject line “Possible Book Errata Submission.”
Chapter 1: What Is Machine Learning?
Chapter 2: Introduction to R and RStudio
Chapter 3: Managing Data
Welcome to the world of machine learning! You're about to embark upon an exciting adventure discovering how data scientists use algorithms to uncover knowledge hidden within the troves of data that businesses, organizations, and individuals generate every day.
If you're like us, you often find yourself in situations where you are facing a mountain of data that you're certain contains important insights, but you just don't know how to extract that needle of knowledge from the proverbial haystack. That's where machine learning can help. This book is dedicated to providing you with the knowledge and skills you need to harness the power of machine learning algorithms. You'll learn about the different types of problems that are well-suited for machine learning solutions and the different categories of machine learning techniques that are most appropriate for tackling different types of problems.
Most importantly, we're going to approach this complex, technical field with a practical mind-set. In this book, our purpose is not to dwell on the intricate mathematical details of these algorithms. Instead, we'll focus on how you can put those algorithms to work for you immediately. We'll also introduce you to the R programming language, which we believe is particularly well-suited to approaching machine learning problems from a practical standpoint. But don't worry about programming or R for now. We'll get to that in Chapter 2. For now, let's dive in and get a better understanding of how machine learning works.
By the end of this chapter, you will have learned the following:
How machine learning allows the discovery of knowledge in data
How unsupervised learning, supervised learning, and reinforcement learning techniques differ from each other
How classification and regression problems differ from each other
How to measure the effectiveness of machine learning algorithms
How cross-validation improves the accuracy of machine learning models
Our goal in the world of machine learning is to use algorithms to discover knowledge in our datasets that we can then apply to help us make informed decisions about the future. That's true regardless of the specific subject-matter expertise where we're working, as machine learning has applications across a wide variety of fields. For example, here are some cases where machine learning commonly adds value:
Segmenting customers and determining the marketing messages that will appeal to different customer groups
Discovering anomalies in system and application logs that may be indicative of a cybersecurity incident
Forecasting product sales based on market and environmental conditions
Recommending the next movie that a customer might want to watch based on their past activity and the preferences of similar customers
Setting prices for hotel rooms far in advance based on forecasted demand
Of course, those are just a few examples. Machine learning can bring value to almost every field where discovering previously unknown knowledge is useful—and we challenge you to think of a field where knowledge doesn't offer an advantage!
As we proceed throughout this book, you'll see us continually referring to machine learning techniques as algorithms. This is a term from the world of computer science that comes up again and again in the world of data science, so it's important that you understand it. While the term sounds technically complex, the concept of an algorithm is actually straightforward, and we'd venture to guess that you use some form of an algorithm almost every day.
An algorithm is, quite simply, a set of steps that you follow when carrying out a process. Most commonly, we use the term when we're referring to the steps that a computer follows when it is carrying out a computational task, but we can think of many things that we do each day as algorithms. For example, when we are walking the streets of a large city and we reach an intersection, we follow an algorithm for crossing the street. Figure 1.1 shows an example of how this process might work.
Of course, in the world of computer science, our algorithms are more complex and are implemented by writing software, but we can think of them in this same way. An algorithm is simply a series of precise observations, decisions, and instructions that tell the computer how to carry out an action. We design machine learning algorithms to discover knowledge in our data. As we progress through this book, you'll learn about many different types of machine learning algorithms and how they work to achieve this goal in very different ways.
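To make the observe/decide/act idea concrete, here is the street-crossing decision sketched as a tiny R function. This is a playful illustration only; the exact steps shown in Figure 1.1 may differ.

```r
# An everyday "algorithm" written as R code: observe the inputs,
# make a decision, and return an instruction.
cross_street <- function(signal, traffic_is_clear) {
  if (signal == "walk" && traffic_is_clear) {
    "cross the street"
  } else {
    "wait at the curb"
  }
}

cross_street("walk", TRUE)          # "cross the street"
cross_street("don't walk", TRUE)    # "wait at the curb"
```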
Figure 1.1 Algorithm for crossing the street
We hear the terms artificial intelligence, machine learning, and deep learning being used almost interchangeably to describe any sort of technique where computers are working with data. Now that you're entering the world of data science, it's important to have a more precise understanding of these terms.
Artificial intelligence (AI) includes any type of technique where we are attempting to get a computer system to imitate human behavior. As the name implies, we are trying to ask computer systems to artificially behave as if they were intelligent. Now, of course, it's not possible for a modern computer to function at the level of complex reasoning found in the human mind, but we can try to mimic some small portions of human behavior and judgment.

Machine learning (ML) is a subset of artificial intelligence techniques that attempt to apply statistics to data problems in an effort to discover new knowledge by generalizing from examples. Or, in other terms, machine learning techniques are artificial intelligence techniques designed to learn.

Deep learning is a further subdivision of machine learning that uses a set of complex techniques, known as neural networks, to discover knowledge in a particular way. It is a highly specialized subfield of machine learning that is most commonly used for image, video, and sound analysis.
Figure 1.2 shows the relationships between these fields. In this book, we focus on machine learning techniques. Specifically, we focus on the categories of machine learning that do not fit the definition of deep learning.
The machine learning techniques that we discuss in this book fit into two major categories. Supervised learning algorithms learn patterns based on labeled examples of past data. Unsupervised learning algorithms seek to uncover patterns without the assistance of labeled data. Let's take a look at each of these techniques in more detail.
Figure 1.2 The relationship between artificial intelligence, machine learning, and deep learning
Supervised learning techniques are perhaps the most commonly used category of machine learning algorithms. The purpose of these techniques is to use an existing dataset to generate a model that then helps us make predictions about future, unlabeled data. More formally, we provide a supervised machine learning algorithm with a training dataset as input. The algorithm then uses that training data to develop a model as its output, as shown in Figure 1.3.
You can think of the model produced by a supervised machine learning algorithm as sort of a crystal ball—once we have it, we can use it to make predictions about our data. Figure 1.4 shows how this model functions. Once we have it, we can take any new data element that we encounter and use the model to make a prediction about that new element based on the knowledge it obtained from the training dataset.
The reason that we use the term supervised to describe these techniques is that we are using a training dataset to supervise the creation of our model. That training dataset contains labels that help us with our prediction task.
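The training-data-in, model-out, predictions-out flow of Figures 1.3 and 1.4 can be sketched in a few lines of R. This is a minimal illustration using the built-in iris dataset and the rpart decision tree package (decision trees are covered in depth in Chapter 8); the row selections here are arbitrary.

```r
library(rpart)  # decision trees, one of many supervised learning algorithms

# 1. Input: a labeled training dataset (each row has features plus a label).
train_rows <- c(1:40, 51:90, 101:140)
training_data <- iris[train_rows, ]

# 2. Output: the algorithm generates a model from the training data.
model <- rpart(Species ~ ., data = training_data)

# 3. The model then makes predictions about new, unlabeled observations.
new_data <- iris[-train_rows, ]
predictions <- predict(model, new_data, type = "class")
head(predictions)
```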
Let's reinforce that with a more concrete example. Consider a loan officer working at the car dealership shown in Figure 1.5. The salespeople at the dealership work with individual customers to sell them cars. The customers often don't have the necessary cash on hand to purchase a car outright, so they seek financing options. Our job is to match customers with the right loan product from three choices.
Subprime loans have the most expensive interest rates and are offered to customers who are likely to miss payment deadlines or default on their loans.
Top-shelf loans have the lowest interest rate and are offered to customers who are unlikely to miss payments and have an extremely high likelihood of repayment.
Standard loans are offered to customers who fall in the middle of these two groups and have an interest rate that falls in between those two values.
Figure 1.3 Generic supervised learning model
Figure 1.4 Making predictions with a supervised learning model
We receive loan applications from salespeople and must make a decision on the spot. If we don't act quickly, the customer may leave the store, and the business will be lost to another dealership. If we offer a customer a higher risk loan than they would normally qualify for, we might lose their business to another dealership offering a lower interest rate. On the other hand, if we offer a customer a lower interest rate than they deserve, we might not profit on the transaction after they later default.
Our current method of doing business is to review the customer's credit report and make decisions about loan categories based on our years of experience in the role. We've “seen it all” and can rely upon our “gut instinct” to make these important business decisions. However, as budding data scientists, we now realize that there might be a better way to solve this problem using machine learning.
Our car dealership can use supervised machine learning to assist with this task. First, they need a training dataset containing information about their past customers and their loan repayment behavior. The more data they can include in the training dataset, the better. If they have several years of data, that would help develop a high-quality model.
The dataset might contain a variety of information about each customer, such as the customer's approximate age, credit score, home ownership status, and vehicle type. Each of these data points is known as a feature about the customer, and they will become the inputs to the machine learning model created by the algorithm. The dataset also needs to contain labels for each one of the customers in the training dataset. These labels are the values that we'd like to predict using our model. In this case, we have two labels: default and repaid. We label each customer in our training dataset with the appropriate label for their loan status. If they repaid their loan in full, they are given the “repaid” label, while those who failed to repay their loans are given the “default” label.
Figure 1.5 Using machine learning to classify car dealership customers
A small segment of the resulting dataset appears in Figure 1.6. Notice two things about this dataset. First, each row corresponds to a single customer, and those customers are all past customers who have completed their loan terms. We know the outcome of each of these loans, providing us with the labels we need to train a supervised learning model. Second, each feature included in the model is a characteristic that is available to the loan officer at the time they are making a loan decision. That's crucial to creating a model that is effective for our given problem. If the model included a feature that specified whether a customer lost his or her job during the loan term, that feature would likely improve the model's accuracy, but the loan officer could not actually use the model, because there is no way to know at application time whether a customer will lose their job during a loan term that hasn't yet begun.
Figure 1.6 Dataset of past customer loan repayment behavior
If we use a machine learning algorithm to generate a model based on this data, it might pick up on a few characteristics of the dataset that may also be apparent to you upon casual inspection. First, most people with a credit score under 600 who have financed a car through us in the past defaulted on that loan. If we use that characteristic alone to make decisions, we'd likely be in good shape. However, if we look at the data carefully, we might achieve an even better fit by saying that anyone who has a credit score under 600 and purchased a sedan is likely to default. That type of knowledge, when generated by an algorithm, is a machine learning model!
The loan officer could then deploy this machine learning model by simply following these rules to make a prediction each time someone applies for a loan. If the next customer through the door has a credit score of 780 and is purchasing a sports car, as shown in Figure 1.7, they should be given a top-shelf loan because it is quite unlikely that they will default. If the customer has a credit score of 410 and is purchasing a sedan, we'd definitely want to slot them into a subprime loan. Customers who fall somewhere in between these extremes would be suited for a standard loan.
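Decision rules like these can be expressed as a small R function. The sketch below is purely illustrative: the 600 and 700 credit-score cutoffs and the loan category names are assumptions based on the examples in the text, not an actual dealership model.

```r
# Hypothetical loan decision rules mirroring the patterns described above.
# The credit-score thresholds (600 and 700) are illustrative assumptions.
classify_loan <- function(credit_score, vehicle_type) {
  if (credit_score < 600 && vehicle_type == "sedan") {
    "subprime"     # pattern in the data: likely to default
  } else if (credit_score >= 700) {
    "top-shelf"    # quite unlikely to default
  } else {
    "standard"     # customers between the two extremes
  }
}

classify_loan(780, "sports car")  # "top-shelf"
classify_loan(410, "sedan")       # "subprime"
```

A real model produced by a machine learning algorithm would, of course, learn these rules from the training data rather than have them hard-coded.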
Now, this was a simplistic example. All of the customers in our example fit neatly into the categories we described. This won't happen in the real world, of course. Our machine learning algorithms will face imperfect data that doesn't have neat, clean divisions between groups. We'll have datasets with many more observations, and our algorithms will inevitably make mistakes. Perhaps the next customer with a high credit score who walks into the dealership to purchase a sports car later loses their job and defaults on the loan. Our algorithm would make an incorrect prediction. We talk more about the types of errors made by algorithms later in this chapter.
Figure 1.7 Applying the machine learning model
Unsupervised learning techniques work quite differently. While supervised techniques train on labeled data, unsupervised techniques develop models based on unlabeled training datasets. This changes the nature of the datasets that they are able to tackle and the models that they produce. Instead of providing a method for assigning labels to input based on historical data, unsupervised techniques allow us to discover hidden patterns in our data.
One way to think of the difference between supervised and unsupervised algorithms is that supervised algorithms help us assign known labels to new observations while unsupervised algorithms help us discover new labels, or groupings, of the observations in our dataset.
For example, let's return to our car dealership and imagine that we're now working with our dataset of customers and want to develop a marketing campaign for our service department. We suspect that the customers in our database are similar to each other in ways that aren't as obvious as the types of cars that they buy and we'd like to discover what some of those groupings might be and use them to develop different marketing messages.
Unsupervised learning algorithms are well-suited to this type of open-ended discovery task. The car dealership problem that we described is more generally known as the market segmentation problem, and there is a wealth of unsupervised learning techniques designed to help with this type of analysis. We talk about how organizations use unsupervised clustering algorithms to perform market segmentation in Chapter 12.
Let's think of another example. Imagine that we manage a grocery store and are trying to figure out the optimal placement of products on the shelves. We know that customers often run into our store seeking to pick up some common staples, such as milk, bread, meat, and produce. Our goal is to design the store so that impulse purchases are near each other in the store. As seen in Figure 1.8, we want to place the cookies right next to the milk so someone who came into the store to purchase milk will see them and think “Those cookies would be delicious with a glass of this milk!”
Figure 1.8 Strategically placing items in a grocery store based on unsupervised learning
The problem of determining which items customers frequently purchase together is also a well-known problem in machine learning known as the market basket problem. We talk about how data scientists use association rules approaches to tackle the market basket problem in Chapter 11.
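The core quantity behind the market basket problem can be illustrated in a few lines of base R. The baskets below are made-up transactions; the "support" measure shown here (the share of baskets containing a given set of items) is one of the building blocks of the association rules techniques covered in Chapter 11.

```r
# Toy transactions: each element is one customer's basket (made-up data)
baskets <- list(
  c("milk", "bread", "cookies"),
  c("milk", "cookies"),
  c("bread", "meat"),
  c("milk", "bread")
)

# Support: fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

support(c("milk", "cookies"), baskets)  # 0.5: milk and cookies co-occur in 2 of 4 baskets
```

Itemsets with high support, like milk and cookies here, are candidates for shelf placement near each other.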
NOTE You may also hear about a third type of machine learning algorithm known as reinforcement learning. These algorithms seek to learn based on trial and error, similar to the way that a young child learns the rules of a home by being rewarded and punished. Reinforcement learning is an interesting technique but is beyond the scope of this book.
In the previous section, we described ways to group algorithms based on the types of data that they use for training. Algorithms that use labeled training datasets are known as supervised algorithms because their training is “supervised” by the labels while those that use unlabeled training datasets are known as unsupervised algorithms because they are free to learn whatever patterns they happen to discover, without “supervision.” Think of this categorization scheme as describing how machine learning algorithms learn.
We can also categorize our algorithms based on what they learn. In this book, we discuss three major types of knowledge that we can learn from our data. Classification techniques train models that allow us to predict membership in a category. Regression techniques allow us to predict a numeric result. Similarity learning techniques help us discover the ways that observations in our dataset resemble and differ from each other.
Classification techniques use supervised machine learning to help us predict a categorical response. That means that the output of our model is a non-numeric label or, more formally, a categorical variable: a variable that takes on discrete, non-numeric values rather than numeric values. Here are some examples of categorical variables with some possible values they might take on:
Educational degree obtained (none, bachelor's, master's, doctorate)
Citizenship (United States, Ireland, Nigeria, China, Australia, South Korea)
Blood type (A+, A-, B+, B-, AB+, AB-, O+, O-)
Political party membership (Democrat, Republican, Independent)
Customer status (current customer, past customer, noncustomer)
For example, earlier in this chapter, we discussed a problem where managers at a car dealership needed the ability to predict loan repayment. This is an example of a classification problem because we are trying to assign each customer to one of two categories: repaid or default.
We encounter all types of classification problems in the real world. We might try to determine which of three promotional offers would be most appealing to a potential customer. This is a classification problem where the categories are the three different offers.
Similarly, we might want to look at people attempting to log on to our computer systems and predict whether they are a legitimate user or a hacker seeking to violate the system's security policies. This is also a classification problem where we are trying to assign each login attempt to the category of “legitimate user” or “hacker.”
Regression techniques use supervised machine learning to help us predict a continuous response. Simply put, this means that the output of our model is a numeric value. Instead of predicting membership in a discrete set of categories, we are predicting the value of a numeric variable.
For example, a financial advisor seeking new clients might want to screen possible clients based on their income. If the advisor has a list of potential customers that does not include income explicitly, they might use a dataset of past contacts with known incomes to train a regression model that predicts the income of future contacts. This model might look something like this:

income = ($1,000 × age) + ($3,000 × years of education beyond high school)
If the financial advisor encounters a new potential client, they can then use this formula to predict the person's income based on their age and years of education. For each year of age, they would expect the person to have $1,000 in additional annual income. Similarly, their income would increase $3,000 for each year of education beyond high school.
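A regression model like this is just a function of its inputs. Here is a minimal R sketch using the illustrative $1,000 and $3,000 coefficients from the text (with no intercept term, for simplicity); a real model would estimate these coefficients from training data, for example with R's `lm()` function.

```r
# Hypothetical income model using the illustrative coefficients from the
# text: $1,000 per year of age, $3,000 per year of education beyond
# high school. No intercept, for simplicity.
predict_income <- function(age, edu_beyond_hs) {
  1000 * age + 3000 * edu_beyond_hs
}

predict_income(40, 4)  # 40 * 1000 + 4 * 3000 = 52000
```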
Regression models are quite flexible. We can plug in any possible value of age or income and come up with a prediction for that person's income. Of course, if we didn't have good training data, our prediction might not be accurate. We also might find that the relationship between our variables isn't explained by a simple linear technique. For example, income likely increases with age, but only up until a certain point. More advanced regression techniques allow us to build more complex models that can take these factors into account. We discuss those in Chapter 4.
Similarity learning techniques use machine learning algorithms to help us identify common patterns in our data. We might not know exactly what we're trying to discover, so we allow the algorithm to explore the dataset looking for similarities that we might not have already predicted.
We've already mentioned two similarity learning techniques in this chapter. Association rules techniques, discussed more fully in Chapter 11, allow us to solve problems that are similar to the market basket problem—which items are commonly purchased together. Clustering techniques, discussed more fully in Chapter 12, allow us to group observations into clusters based on the similar characteristics they possess.
Association rules and clustering are both examples of unsupervised uses of similarity learning techniques. It's also possible to use similarity learning in a supervised manner. For example, nearest neighbor algorithms seek to assign labels to observations based on the labels of the most similar observations in the training dataset. We discuss those more in Chapter 6.
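To make the nearest neighbor idea concrete, here is a minimal 1-nearest-neighbor sketch in base R. It assumes numeric features and Euclidean distance; the full technique (including choosing the number of neighbors) is covered in Chapter 6, and this toy data is invented for illustration.

```r
# Minimal 1-nearest-neighbor sketch: predict the label of the single
# closest training observation, using Euclidean distance.
nn_predict <- function(train_x, train_y, new_x) {
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))  # distance to each row
  train_y[which.min(dists)]                           # label of the closest row
}

# Two made-up training observations with two numeric features each
train_x <- rbind(c(1, 1), c(9, 9))
train_y <- c("repaid", "default")

nn_predict(train_x, train_y, c(2, 2))  # "repaid" (closest to the first row)
```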
Before beginning our discussion of specific machine learning algorithms, it's also helpful to have an idea in mind of how we will evaluate the effectiveness of our algorithms. We're going to cover this topic in much more detail throughout the book, so this is just to give you a feel for the concept. As we work through each machine learning technique, we'll discuss evaluating its performance against a dataset. We'll also have a more complete discussion of model performance evaluation in Chapter 9.
Until then, the important thing to realize is that some algorithms will work better than others on different problems. The nature of the dataset and the nature of the algorithm will dictate the appropriate technique.
In the world of supervised learning, we can evaluate the effectiveness of an algorithm based on the number and/or magnitude of errors that it makes. For classification problems, we often look at the percentage of times that the algorithm makes an incorrect categorical prediction, or the misclassification rate. Similarly, we can look at the percentage of predictions that were correct, known as the algorithm's accuracy. For regression problems, we often look at the difference between the values predicted by the algorithm and the actual values.
NOTE It only makes sense to talk about this type of evaluation when we're referring to supervised learning techniques where there actually is a correct answer. In unsupervised learning, we are detecting patterns without any objective guide, so there is no set “right” or “wrong” answer to measure our performance against. Instead, the effectiveness of an unsupervised learning algorithm lies in the value of the insight that it provides us.
Many classification problems seek to predict a binary value identifying whether an observation is a member of a class. We refer to cases where the observation is a member of the class as positive cases and cases where the observation is not a member of the class as negative cases.
For example, imagine we are developing a model designed to predict whether someone has a lactose intolerance, making it difficult for them to digest dairy products. Our model might include demographic, genetic, and environmental factors that are known or suspected to contribute to lactose intolerance. The model then makes predictions about whether individuals are lactose intolerant or not based on those attributes. Individuals predicted to be lactose intolerant are predicted positives, while those who are predicted to not be lactose intolerant (or, stated more simply, those who are predicted to be lactose tolerant) are predicted negatives. These predicted values come from our machine learning model.
There is also, however, a real-world truth. Regardless of what the model predicts, every individual person is either lactose intolerant or they are not. This real-world data determines whether the person is an actual positive or an actual negative. When the predicted value for an observation differs from the actual value for that same observation, an error occurs. There are two different types of error that may occur in a classification problem.
False positive errors occur when the model labels an observation as predicted positive when it is, in reality, an actual negative. For example, if the model identifies someone as likely lactose intolerant while they are, in reality, lactose tolerant, this is a false positive error. False positive errors are also known as Type I errors.

False negative errors occur when the model labels an observation as predicted negative when it is, in reality, an actual positive. In our lactose intolerance model, if the model predicts someone as lactose tolerant when they are, in reality, lactose intolerant, this is a false negative error. False negative errors are also known as Type II errors.
Similarly, we may label correctly predicted observations as true positives or true negatives, depending on their label. Figure 1.9 shows the types of errors in chart form.
Figure 1.9 Error types
Of course, the absolute numbers of false positive and false negative errors depend on the number of predictions that we make. Instead of using these raw counts, we measure the rates at which those errors occur. For example, the false positive rate (FPR) is the percentage of negative instances that were incorrectly identified as positive. We can compute this rate by dividing the number of false positives (FP) by the sum of the number of false positives and the number of true negatives (TN), or, as a formula:

FPR = FP / (FP + TN)
Similarly, we can compute the false negative rate (FNR) as follows:

FNR = FN / (FN + TP)
There is no clear-cut rule about whether one type of error is better or worse than the other. This determination depends greatly on the type of problem being solved.
For example, imagine that we're using a machine learning algorithm to classify a large list of prospective customers as either people who will purchase our product (positive cases) or people who will not purchase our product (negative cases). We only spend the money to send the mailing to prospects labeled by the algorithm as positive.
In the case of a false positive, we send a brochure to a customer who does not buy our product, losing the money spent on printing and mailing the brochure. In the case of a false negative, we do not send a mailing to a customer who would have responded, losing the opportunity to sell our product to a customer. Which of these is worse? It depends on the cost of the mailing, the potential profit per customer, and other factors.
On the other hand, consider the use of a machine learning model to screen patients for the likelihood of cancer and then refer those patients with positive results for additional, more invasive testing. In the case of a false negative result, a patient who potentially has cancer is not sent for additional screening, possibly leaving an active disease untreated. This is clearly a very bad result.
False positive results are not without harm, however. If a patient is falsely flagged as potentially cancerous, they are subjected to unnecessary testing that is potentially costly and painful, consuming resources that could have been used on another patient. They are also subject to emotional harm while they are waiting for the new test results.
The evaluation of machine learning problems is a tricky proposition, and it cannot be done in isolation from the problem domain. Data scientists, subject-matter experts, and, in some cases, ethicists, should work together to evaluate models in light of the benefits and costs of each error type.
The errors that we might make in regression problems are quite different because the nature of our predictions is different. When we assign classification labels to instances, we can be either right or wrong with our prediction. When we label a noncancerous tumor as cancerous, that is clearly a mistake. However, in regression problems, we are predicting a numeric value.
Consider the income prediction problem that we discussed earlier in this chapter. If we have an individual with an actual income of $45,000 annually and our algorithm's prediction is on the nose at exactly $45,000, that's clearly a correct prediction. If the algorithm predicts an income of $0 or $10,000,000, almost everyone would consider those predictions objectively wrong. But what about predictions of $45,001, $45,500, $46,000, or $50,000? Are those all incorrect? Are some or all of them close enough?
It makes more sense for us to evaluate regression algorithms based on the magnitude of the error in their predictions. We determine this by measuring the distance between the predicted value and the actual value. For example, consider the dataset shown in Figure 1.10.
Figure 1.10 Residual error
In this dataset, we're trying to predict the number of bicycle rentals that occur each day based on the average temperature that day. Bicycle rentals appear on the y-axis while temperature appears on the x-axis. The black line is a regression line that says that we expect bicycle rentals to increase as temperature increases. That black line is our model, and the black dots are predictions at specific temperature values along that line.
The orange dots represent real data gathered during the bicycle rental company's operations. That's the "correct" data. The red lines between the predicted and actual values show the magnitude of the error, which we call the residual value. The longer the line, the larger the error for that observation.
We can't simply add the residuals together because some of them are negative values that would cancel out the positive values. Instead, we square each residual value and then add those squared residuals together to get a performance measure called the residual sum of squares.
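The residual sum of squares takes one line of R to compute. The actual and predicted rental counts below are made-up numbers, not the dataset from Figure 1.10.

```r
# Made-up actual and predicted rental counts for five days
actual    <- c(120, 150, 200, 180, 90)
predicted <- c(130, 140, 210, 170, 100)

residuals <- actual - predicted  # -10, 10, -10, 10, -10
rss <- sum(residuals^2)          # 5 * 100 = 500
```

Note how squaring makes every term positive, so the negative and positive residuals no longer cancel each other out.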
We revisit the concept of residual error, as well as this specific bicycle rental dataset, in Chapter 4.
When we build a machine learning model for anything other than the most simplistic problems, the model will include some type of prediction error. This error comes in three different forms.
Bias (in the world of machine learning) is the type of error that occurs due to our choice of a machine learning model. When the model type that we choose is unable to fit our dataset well, the resulting error is bias.

Variance is the type of error that occurs when the dataset that we use to train our machine learning model is not representative of the entire universe of possible data.

Irreducible error, or noise, occurs independently of the machine learning algorithm and training dataset that we use. It is error inherent in the problem that we are trying to solve.
When we are attempting to solve a specific machine learning problem, we cannot do much to address irreducible error, so we focus our efforts on the two remaining sources of error: bias and variance. Generally speaking, an algorithm that exhibits high variance will have low bias, while a low-variance algorithm will have higher bias, as shown in Figure 1.11. Bias and variance are intrinsic characteristics of our models and coexist. When we modify our models to improve one, it comes at the expense of the other. Our goal is to find an optimal balance between the two.
In cases where we have high bias and low variance, we describe the model as underfitting the data. Let's take a look at a few examples that might help illustrate this point. Figure 1.12 shows a few attempts to use a function of two variables to predict a third variable. The leftmost graph in Figure 1.12 shows a linear model that underfits the data. Our data points are distributed in a curved manner, but our choice of a straight line (a linear model) limits the ability of the model to fit our dataset. There is no way to draw a straight line that fits this dataset well. Because of this, the majority of the error in our approach is due to our choice of model, and our model exhibits high bias.
The middle graph in Figure 1.12 illustrates the problem of overfitting, which occurs when we have a model with low bias but high variance. In this case, our model fits the training dataset too well. It's the equivalent of studying for a specific test (the training dataset) rather than learning a generalized solution to the problem. It's highly likely that when this model is used on a different dataset, it will not work well. Instead of learning the underlying knowledge, we studied the answers to a past exam. When we faced a new exam, we didn't have the knowledge necessary to figure out the answers.
The balance that we seek is a model that optimizes both bias and variance, such as the one shown in the rightmost graph of Figure 1.12. This model matches the curved nature of the distribution but does not closely follow the specific data points in the training dataset. It aligns with the dataset much better than the underfit model but does not closely follow specific points in the training dataset as the overfit model does.
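We can demonstrate the overfitting side of this trade-off with a small simulation in R. The data below is invented (a sine curve plus noise), and the degree-15 polynomial is an arbitrary choice of an overly flexible model: it will always achieve a lower error on the training data than the straight line, which is exactly why training-set performance alone cannot reveal overfitting.

```r
set.seed(42)
x <- seq(0, 10, length.out = 30)
y <- sin(x) + rnorm(30, sd = 0.3)   # simulated curved data with noise

fit_linear <- lm(y ~ x)             # a straight line underfits the curve
fit_poly   <- lm(y ~ poly(x, 15))   # flexible enough to chase the noise

rss_linear <- sum(resid(fit_linear)^2)
rss_poly   <- sum(resid(fit_poly)^2)

rss_poly < rss_linear  # TRUE: lower *training* error, yet likely worse on new data
```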
Figure 1.11 The bias/variance trade-off
Figure 1.12 Underfitting, overfitting, and optimal fit
When we evaluate a machine learning model, we can protect against variance errors by using validation techniques that expose the model to data other than the data used to create the model. The point of this approach is to address the overfitting problem. Look back at the overfit model in Figure 1.12. If we used the training dataset to evaluate this model, we would find that it performed extremely well because the model is highly tuned to perform well on that specific dataset. However, if we used a new dataset to evaluate the model, we'd likely find that it performs quite poorly.
We can explore this issue by using a test dataset to assess the performance of our model. The test dataset is set aside at the beginning of the model development process specifically for the purpose of model assessment. It is not used in the training process, so it is not possible for the model to overfit the test dataset. If we develop a generalizable model that does not overfit the training dataset, it will also perform well on the test dataset. On the other hand, if our model overfits the training dataset, it will not perform well on the test dataset.
We also sometimes need a separate dataset to assist with the model development process. These datasets, known as validation datasets, are used to help develop the model in an iterative process, adjusting the parameters of the model during each iteration until we find an approach that performs well on the validation dataset. While it may be tempting to use the test dataset as the validation dataset, this approach reintroduces the potential of overfitting the test dataset, so we should use a third dataset for this purpose.
The most straightforward approach to test and validation datasets is the holdout method. In this approach, illustrated in Figure 1.13, we set aside portions of the original dataset for validation and testing purposes at the beginning of the model development process. We use the validation dataset to assist in model development and then use the test dataset to evaluate the performance of the final model.
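A holdout split is easy to construct in base R by shuffling row indices. The 60/20/20 proportions below are a common illustrative choice, not a fixed rule, and the dataset size is made up.

```r
set.seed(123)
n <- 1000          # pretend our dataset has 1,000 observations
idx <- sample(n)   # randomly shuffle the row indices 1..n

train_idx      <- idx[1:600]      # 60% for training
validation_idx <- idx[601:800]    # 20% for validation during development
test_idx       <- idx[801:1000]   # 20% held out for final evaluation
```

We would then fit models using only the training rows, tune them against the validation rows, and touch the test rows exactly once, at the very end.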
There are also a variety of more advanced methods for creating validation datasets that perform repeated sampling of the data during an iterative approach to model development. These approaches, known as cross-validation techniques, are particularly useful for smaller datasets where it is undesirable to reserve a portion of the dataset for validation purposes.
Figure 1.14 shows an example of cross-validation. In this approach, we still set aside a portion of the dataset for testing purposes, but we use a different portion of the training dataset for validation purposes during each iteration of model development.
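Assigning cross-validation folds can also be sketched in a few lines of base R. Here k = 5 and the 800-row training set are illustrative assumptions; in each iteration, one fold plays the role of the validation set while the other folds are used for training.

```r
set.seed(123)
n_train <- 800   # training observations (test set already held out separately)
k <- 5           # number of cross-validation folds

# Randomly assign each training observation a fold label from 1 to k
fold <- sample(rep(1:k, length.out = n_train))

table(fold)  # each fold contains 160 observations
# In iteration i, rows where fold == i serve as the validation set
# and the remaining rows are used for training.
```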
If this sounds complicated now, don't worry about it. We discuss the holdout method and cross-validation in greater detail when we get to Chapter 9. For now, you should just have a passing familiarity with these techniques.
Figure 1.13 Holdout method
Figure 1.14 Cross-validation method
Consider each of the following machine learning problems. Would the problem be best approached as a classification problem or a regression problem? Provide a rationale for your answer.
Predicting the number of fish caught on a commercial fishing voyage
Identifying likely adopters of a new technology
Using weather and population data to predict bicycle rental rates
Predicting the best marketing campaign to send a specific person
You developed a machine learning algorithm that assesses a patient's risk of heart attack (a positive event) based on a number of diagnostic criteria. How would you describe each of the following events?