AN INTRODUCTION TO MACHINE LEARNING THAT INCLUDES THE FUNDAMENTAL TECHNIQUES, METHODS, AND APPLICATIONS
PROSE Award Finalist 2019
Association of American Publishers Award for Professional and Scholarly Excellence
Machine Learning: a Concise Introduction offers a comprehensive introduction to the core concepts, approaches, and applications of machine learning. The author—an expert in the field—presents fundamental ideas, terminology, and techniques for solving applied problems in classification, regression, clustering, density estimation, and dimension reduction. The design principles behind the techniques are emphasized, including the bias–variance trade-off and its influence on the design of ensemble methods. Understanding these principles leads to more flexible and successful applications. Machine Learning: a Concise Introduction also includes methods for optimization, risk estimation, and model selection—essential elements of most applied projects.
A volume in the popular Wiley Series in Probability and Statistics, Machine Learning: a Concise Introduction offers the practical information needed for an understanding of the methods and application of machine learning.
STEVEN W. KNOX holds a Ph.D. in Mathematics from the University of Illinois and an M.S. in Statistics from Carnegie Mellon University. He has over twenty years’ experience in using Machine Learning, Statistics, and Mathematics to solve real-world problems. He currently serves as Technical Director of Mathematics Research and Senior Advocate for Data Science at the National Security Agency.
WILEY SERIES IN PROBABILITY AND STATISTICS
Established by Walter A. Shewhart and Samuel S. Wilks
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay
Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels
The Wiley Series in Probability and Statistics is well established and authoritative. It covers many topics of current research interest in both pure and applied statistics and probability theory. Written by leading statisticians and institutions, the titles span both state-of-the-art developments in the field and classical methods. Reflecting the wide range of current research in statistics, the series encompasses applied, methodological and theoretical statistics, ranging from applications and new techniques made possible by advances in computerized practice to rigorous treatment of theoretical approaches. This series provides essential and invaluable reading for all statisticians, whether in academia, industry, government, or research.
A complete list of titles in this series can be found at http://www.wiley.com/go/wsps
Steven W. Knox
This edition first published 2018
This work is a U.S. Government work and is in the public domain in the U.S.A.
Published 2018 by John Wiley & Sons, Inc
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Steven W. Knox to be identified as the author of this work has been asserted in accordance with law.
Registered Office(s)
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Knox, Steven W., author.
Title: Machine learning : a concise introduction / by Steven W. Knox.
Description: Hoboken, New Jersey : John Wiley & Sons, 2018. | Series: Wiley series in probability and statistics
Identifiers: LCCN 2017058505 (print) | LCCN 2018004509 (ebook) | ISBN 9781119439073 (pdf) | ISBN 9781119438984 (epub) | ISBN 9781119439196 (cloth)
Subjects: LCSH: Machine learning.
Classification: LCC Q325.5 (ebook) | LCC Q325.5 .K568 2018 (print) | DDC 006.3/1--dc23
LC record available at https://lccn.loc.gov/2017058505
Cover image: © Verticalarray/Shutterstock
Cover design by Wiley
Preface
Organization—How to Use This Book
Acknowledgments
About the Companion Website
Chapter 1: Introduction—Examples from Real Life
Chapter 2: The Problem of Learning
2.1 Domain
2.2 Range
2.3 Data
2.4 Loss
2.5 Risk
2.6 The Reality of the Unknown Function
2.7 Training and Selection of Models, and Purposes of Learning
2.8 Notation
Chapter 3: Regression
3.1 General Framework
3.2 Loss
3.3 Estimating the Model Parameters
3.4 Properties of Fitted Values
3.5 Estimating the Variance
3.6 A Normality Assumption
3.7 Computation
3.8 Categorical Features
3.9 Feature Transformations, Expansions, and Interactions
3.10 Variations in Linear Regression
3.11 Nonparametric Regression
Chapter 4: Survey of Classification Techniques
4.1 The Bayes Classifier
4.2 Introduction to Classifiers
4.3 A Running Example
4.4 Likelihood Methods
4.5 Prototype Methods
4.6 Logistic Regression
4.7 Neural Networks
4.8 Classification Trees
4.9 Support Vector Machines
4.10 Postscript: Example Problem Revisited
Chapter 5: Bias–Variance Trade-off
5.1 Squared-Error Loss
5.2 Arbitrary Loss
Chapter 6: Combining Classifiers
6.1 Ensembles
6.2 Ensemble Design
6.3 Bootstrap Aggregation (Bagging)
6.4 Bumping
6.5 Random Forests
6.6 Boosting
6.7 Arcing
6.8 Stacking and Mixture of Experts
Chapter 7: Risk Estimation and Model Selection
7.1 Risk Estimation via Training Data
7.2 Risk Estimation via Validation or Test Data
7.3 Cross-Validation
7.4 Improvements on Cross-Validation
7.5 Out-of-Bag Risk Estimation
7.6 Akaike’s Information Criterion
7.7 Schwarz’s Bayesian Information Criterion
7.8 Rissanen’s Minimum Description Length Criterion
7.9 R² and Adjusted R²
7.10 Stepwise Model Selection
7.11 Occam’s Razor
Chapter 8: Consistency
8.1 Convergence of Sequences of Random Variables
8.2 Consistency for Parameter Estimation
8.3 Consistency for Prediction
8.4 There Are Consistent and Universally Consistent Classifiers
8.5 Convergence to Asymptopia Is Not Uniform and May Be Slow
Chapter 9: Clustering
9.1 Gaussian Mixture Models
9.2 k-Means
9.3 Clustering by Mode-Hunting in a Density Estimate
9.4 Using Classifiers to Cluster
9.5 Dissimilarity
9.6 k-Medoids
9.7 Agglomerative Hierarchical Clustering
9.8 Divisive Hierarchical Clustering
9.9 How Many Clusters Are There? Interpretation of Clustering
9.10 An Impossibility Theorem
Chapter 10: Optimization
10.1 Quasi-Newton Methods
10.2 The Nelder–Mead Algorithm
10.3 Simulated Annealing
10.4 Genetic Algorithms
10.5 Particle Swarm Optimization
10.6 General Remarks on Optimization
10.7 The Expectation-Maximization Algorithm
Chapter 11: High-Dimensional Data
11.1 The Curse of Dimensionality
11.2 Two Running Examples
11.3 Reducing Dimension While Preserving Information
11.4 Model Regularization
Chapter 12: Communication with Clients
12.1 Binary Classification and Hypothesis Testing
12.2 Terminology for Binary Decisions
12.3 ROC Curves
12.4 One-Dimensional Measures of Performance
12.5 Confusion Matrices
12.6 Multiple Testing
12.7 Expert Systems
Chapter 13: Current Challenges in Machine Learning
13.1 Streaming Data
13.2 Distributed Data
13.3 Semi-supervised Learning
13.4 Active Learning
13.5 Feature Construction via Deep Neural Networks
13.6 Transfer Learning
13.7 Interpretability of Complex Models
Chapter 14: R Source Code
14.1 Author’s Biases
14.2 Libraries
14.3 The Running Example (Section 4.3)
14.4 The Bayes Classifier (Section 4.1)
14.5 Quadratic Discriminant Analysis (Section 4.4.1)
14.6 Linear Discriminant Analysis (Section 4.4.2)
14.7 Gaussian Mixture Models (Section 4.4.3)
14.8 Kernel Density Estimation (Section 4.4.4)
14.9 Histograms (Section 4.4.5)
14.10 The Naive Bayes Classifier (Section 4.4.6)
14.11 k-Nearest-Neighbor (Section 4.5.1)
14.12 Learning Vector Quantization (Section 4.5.4)
14.13 Logistic Regression (Section 4.6)
14.14 Neural Networks (Section 4.7)
14.15 Classification Trees (Section 4.8)
14.16 Support Vector Machines (Section 4.9)
14.17 Bootstrap Aggregation (Section 6.3)
14.18 Boosting (Section 6.6)
14.19 Arcing (Section 6.7)
14.20 Random Forests (Section 6.5)
Appendix A: List of Symbols
Appendix B: Solutions to Selected Exercises
Appendix C: Converting Between Normal Parameters and Level-Curve Ellipsoids
Appendix D: Training Data and Fitted Parameters
References
Index
EULA
The goal of statistical data analysis is to extract the maximum information from the data, and to present a product that is as accurate and as useful as possible.
—David Scott, Multivariate Density Estimation: Theory, Practice and Visualization, 1992
My purpose in writing this book is to introduce the mathematically sophisticated reader to a large number of topics and techniques in the field variously known as machine learning, statistical learning, or predictive modeling. I believe that a deeper understanding of the subject as a whole will be obtained from reflection on an intuitive understanding of many techniques rather than a very detailed understanding of only one or two, and the book is structured accordingly. I have omitted many details while focusing on what I think shows “what is really going on.” For details, the reader will be directed to the relevant literature, or to the exercises, which form an integral part of the text.
No work this small on a subject this large can be self-contained. Some undergraduate-level calculus, linear algebra, and probability are assumed without reference, as are a few basic ideas from statistics. All of the techniques discussed here can, I hope, be implemented using this book and a mid-level programming language (such as C),1 and explicit implementation of many techniques using R is presented in the last chapter.
The reader may detect a coverage bias in favor of classification over regression. This is deliberate. The existing literature on the theory and practice of linear regression and many of its variants is so strong that it does not need any contribution from me. Classification, I believe, is not yet so well documented. In keeping with what has been important in my experience, loss functions are completely general and predictive modeling is stressed more than explanatory modeling.
The intended audience for these notes has an extremely diverse background in probability, ranging from one introductory undergraduate course to extensive graduate work and published research.2 In seeking a probability notation which will create the least confusion for all concerned, I arrived at the non-standard use of P(x) for both the probability of an event x and a probability mass or density function, with respect to some measure which is never stated, evaluated at a point x. My hope, which I believe has been borne out in practice, is that anyone with sufficient knowledge to find this notation confusing will have sufficient knowledge to work through that confusion.
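For example (an illustrative gloss added here, not the author's wording): in this notation, if X is a discrete random variable then P(x) is read as the probability Pr(X = x); if X is continuous then P(x) is read as the density of X evaluated at x; and in either case Bayes' theorem is written in the single form P(y | x) = P(x | y) P(y) / P(x).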
1 There is one exception: the convex programming needed to implement a support vector machine is omitted.
2 This book is designed for two kinds of people: those who know what P(Y | X) means but do not know the Radon–Nikodym theorem, and those who know what P(Y | X) means only because they know the Radon–Nikodym theorem. I have tried to communicate with the first group, at the cost of possibly irritating the second.
The core material of this book is laid out in Chapters 1 through 7. These chapters are intended to be read in order, as each chapter builds on the previous ones. Chapters 1 and 2 introduce problems which machine learning methods can solve, and also introduce the fundamental ideas and terminology used to describe these problems and their solutions. Chapter 3 gives a brief introduction to regression.1
Chapter 4 presents many methods for classification, grouped according to how they approach the problem. Chapter 5 discusses bias–variance trade-off, a topic essential to understanding the design principles behind ensemble methods. Chapter 6 presents various ensemble methods, focusing on how each method can be understood as trading off bias for variance, or vice versa. Chapter 7 concludes the core material with methods for risk estimation and model selection. By the end of Chapter 7, the reader will have encountered many useful methods for approaching classification and regression problems, and (I hope) will appreciate how each method works and why each method approaches classification or regression the way it does.
After Chapter 7, the reader can select from among the remaining seven chapters in whatever order best suits the reader's goals or needs. For example, some readers might wish to follow a path toward machine learning practice or consultancy:
Chapter 9, Clustering, and then Chapter 11, High-Dimensional Data
Chapter 10, Optimization
Chapter 12, Communication with Clients
Chapter 14, R Source Code (optional—recommended if the reader plans to use R)
Other readers might wish to follow a path toward machine learning research:
Chapter 8, Consistency
Chapter 11, High-Dimensional Data (optionally preceded by Chapter 9, Clustering)
Chapter 10, Optimization (optional)
Chapter 13, Current Challenges in Machine Learning
Other readers might wish to develop insight empirically, through hands-on experience working with the methods covered in the core material of Chapters 1 through 7, before progressing on to later chapters. Such readers can proceed directly to Chapter 14, R Source Code, which illustrates how to apply and interpret many of the classification methods from Chapters 4 and 6. Finally, some readers might wish to learn specifically about deep neural networks. Feed-forward deep neural networks2 are a combination of several ideas, and these ideas are presented separately because each is individually useful: feed-forward neural networks in Section 4.7, stochastic gradient descent optimization in Sections 10.1 and 10.6, autoencoders in Section 11.3, and parameter regularization in Section 11.4.
Three of the later chapters—8 (Consistency), 10 (Optimization), and 12 (Communication with Clients)—may seem unusual in an introductory book on machine learning. Chapter 8 is about consistency, which addresses whether or not a given supervised learning method converges to an optimal solution as the number of data, n, increases without bound, either for a given problem or for any problem. This statistical topic is present because some readers of Chapters 4, 5, and 6 may, upon being presented with approximately 20 methods to solve classification problems, conclude that the richness and diversity of methods is a sign that no one actually knows yet how to deal with classification. Such readers might feel that there ought to be one method which solves all classification or regression problems optimally, at least as n → ∞. Such readers may be comforted, or saved reinventing the wheel, by learning of the large body of theory which addresses this issue.
Chapter 10 presents useful techniques in optimization, a topic which is required in order to apply most methods in machine learning3. Indeed, a common theme running through Chapters 3, 4, 6, 7, 9, and 11 (most of the practical content of the book) is the transformation of a classification, clustering, regression, density estimation, or dimension-reduction problem into an optimization problem: find the best estimate θ̂ of a vector of model parameters θ, find the best trained model in a class of trained models, find the best features to use or the best low-dimensional representation of features, etc. Thus, in order to solve real-world problems, the machine learning practitioner needs to be able to solve various kinds of optimization problems. Even when relying on existing software for optimization, a machine learning practitioner can be more effective if he or she understands a variety of optimization methods, and their strengths and weaknesses.
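As a minimal sketch of this idea (not code from the book; the simulated data and names below are invented for illustration), here is how a least-squares regression fit can be posed as a generic optimization problem and handed to base R's optim(), whose default method is the Nelder–Mead algorithm of Section 10.2:

set.seed(1)
n <- 100
x <- runif(n)                          # simulated feature
y <- 2 + 3 * x + rnorm(n, sd = 0.5)    # simulated response

# Empirical risk under squared-error loss, viewed as a function of theta = (intercept, slope)
risk <- function(theta) mean((y - (theta[1] + theta[2] * x))^2)

# A general-purpose optimizer returns the fitted parameter vector theta-hat
theta_hat <- optim(par = c(0, 0), fn = risk)$par
theta_hat                              # compare with coef(lm(y ~ x))

The same pattern of writing down a risk and then minimizing it numerically recurs throughout the chapters listed above.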
Chapter 12 is about communication with clients. In the author’s experience, machine learning practitioners do not generally work for themselves, in the sense that they are not the ultimate beneficiary of a machine learning solution they create for some real-world application. Instead, a machine learning practitioner’s work is typically performed for a client, or customer, or mission stakeholder, or scientific colleague. It is the client’s subjective judgment which matters in determining what is a good solution to a problem, and what kind of trade-offs are acceptable in error, computation, and model interpretability. A machine learning practitioner—no matter how smart, how well trained, or how gifted in computational resources—is doomed to failure if he or she cannot elicit from a client what is really needed, or cannot communicate to a client what the machine learning practitioner needs and can provide.
1 The reader who has a thorough background in regression may choose to skip Chapter 3, or skim it for notation.
2 Recurrent neural networks are not covered in this book.
3 Despite its importance, optimization is a topic which is often either taken for granted or ignored.
This book was written to train mathematicians, computer scientists, data scientists, and others at the National Security Agency (NSA), where it has been used, in various editions, since 2005. This book is respectfully dedicated to the silent professionals—past, current, and future—who use this material in the intelligence services of the United States and its allies.
This book has been improved through many discussions with teachers, students, and colleagues. Most of these people cannot be listed here by name, so I will simply say I am grateful to them all.
Steven W. Knox
This book is accompanied by a companion website:
(URL: www.wiley.com/go/Knox/MachineLearning)
The website includes:
Solutions to some of the exercises in the book.
To call in a statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.
—R. A. Fisher, Presidential Address, 1938
The following examples will be used to illustrate the ideas of the next chapter.
Problem 1 (“Shuttle”).
