This book is the first of its kind to discuss error estimation with a model-based approach. From the basics of classifiers and error estimators to distributional and Bayesian theory, it covers important topics and essential issues pertaining to the scientific validity of pattern classification.
Error Estimation for Pattern Recognition focuses on error estimation, which is a broad and poorly understood topic that reaches all research areas using pattern classification. It includes model-based approaches and discussions of newer error estimators such as bolstered and Bayesian estimators. This book was motivated by the application of pattern recognition to high-throughput data with limited replicates, which is a basic problem now appearing in many areas. The first two chapters cover basic issues in classification error estimation, such as definitions, test-set error estimation, and training-set error estimation. The remaining chapters in this book cover results on the performance and representation of training-set error estimators for various pattern classifiers.
Additional features of the book include:
• The latest results on the accuracy of error estimation
• Performance analysis of resubstitution, cross-validation, and bootstrap error estimators using analytical and simulation approaches
• Highly interactive computer-based exercises and end-of-chapter problems
This is the first book exclusively about error estimation for pattern recognition.
Ulisses M. Braga-Neto is an Associate Professor in the Department of Electrical and Computer Engineering at Texas A&M University, USA. He received his PhD in Electrical and Computer Engineering from The Johns Hopkins University. Dr. Braga-Neto received an NSF CAREER Award for his work on error estimation for pattern recognition with applications in genomic signal processing. He is an IEEE Senior Member.
Edward R. Dougherty is a Distinguished Professor, Robert F. Kennedy ’26 Chair, and Scientific Director at the Center for Bioinformatics and Genomic Systems Engineering at Texas A&M University, USA. He is a fellow of both the IEEE and SPIE, and he has received the SPIE President's Award. Dr. Dougherty has authored several books including Epistemology of the Cell: A Systems Perspective on Biological Knowledge and Random Processes for Image and Signal Processing (Wiley-IEEE Press).
IEEE Press
445 Hoes Lane
Piscataway, NJ 08854
IEEE Press Editorial Board
Tariq Samad, Editor in Chief
George W. Arnold
Mary Lanzerotti
Linda Shafer
Dmitry Goldgof
Pui-In Mak
MengChu Zhou
Ekram Hossain
Ray Perez
George Zobrist
Kenneth Moore, Director of IEEE Book and Information Services (BIS)
Technical Reviewer
Frank Alexander, Los Alamos National Laboratory
ULISSES M. BRAGA-NETO EDWARD R. DOUGHERTY
Copyright © 2015 by The Institute of Electrical and Electronics Engineers, Inc.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data is available.
ISBN: 978-1-118-99973-8
To our parents (in memoriam) Jacinto and Consuêlo and Russell and Ann
And to our wives and children Flávia, Maria Clara, and Ulisses and
PREFACE
NOTES
ACKNOWLEDGMENTS
LIST OF SYMBOLS
CHAPTER 1 CLASSIFICATION
1.1 CLASSIFIERS
1.2 POPULATION-BASED DISCRIMINANTS
1.3 CLASSIFICATION RULES
1.4 SAMPLE-BASED DISCRIMINANTS
1.5 HISTOGRAM RULE
1.6 OTHER CLASSIFICATION RULES
1.7 FEATURE SELECTION
EXERCISES
NOTES
CHAPTER 2 ERROR ESTIMATION
2.1 ERROR ESTIMATION RULES
2.2 PERFORMANCE METRICS
2.3 TEST-SET ERROR ESTIMATION
2.4 RESUBSTITUTION
2.5 CROSS-VALIDATION
2.6 BOOTSTRAP
2.7 CONVEX ERROR ESTIMATION
2.8 SMOOTHED ERROR ESTIMATION
2.9 BOLSTERED ERROR ESTIMATION
EXERCISES
NOTES
CHAPTER 3 PERFORMANCE ANALYSIS
3.1 EMPIRICAL DEVIATION DISTRIBUTION
3.2 REGRESSION
3.3 IMPACT ON FEATURE SELECTION
3.4 MULTIPLE-DATA-SET REPORTING BIAS
3.5 MULTIPLE-RULE BIAS
3.6 PERFORMANCE REPRODUCIBILITY
EXERCISES
NOTES
CHAPTER 4 ERROR ESTIMATION FOR DISCRETE CLASSIFICATION
4.1 ERROR ESTIMATORS
4.2 SMALL-SAMPLE PERFORMANCE
4.3 LARGE-SAMPLE PERFORMANCE
EXERCISES
CHAPTER 5 DISTRIBUTION THEORY
5.1 MIXTURE SAMPLING VERSUS SEPARATE SAMPLING
5.2 SAMPLE-BASED DISCRIMINANTS REVISITED
5.3 TRUE ERROR
5.4 ERROR ESTIMATORS
5.5 EXPECTED ERROR RATES
5.6 HIGHER-ORDER MOMENTS OF ERROR RATES
5.7 SAMPLING DISTRIBUTION OF ERROR RATES
EXERCISES
CHAPTER 6 GAUSSIAN DISTRIBUTION THEORY: UNIVARIATE CASE
6.1 HISTORICAL REMARKS
6.2 UNIVARIATE DISCRIMINANT
6.3 EXPECTED ERROR RATES
6.4 HIGHER-ORDER MOMENTS OF ERROR RATES
6.5 SAMPLING DISTRIBUTIONS OF ERROR RATES
EXERCISES
CHAPTER 7 GAUSSIAN DISTRIBUTION THEORY: MULTIVARIATE CASE
7.1 MULTIVARIATE DISCRIMINANTS
7.2 SMALL-SAMPLE METHODS
7.3 LARGE-SAMPLE METHODS
EXERCISES
NOTES
CHAPTER 8 BAYESIAN MMSE ERROR ESTIMATION
8.1 THE BAYESIAN MMSE ERROR ESTIMATOR
8.2 SAMPLE-CONDITIONED MSE
8.3 DISCRETE CLASSIFICATION
8.4 LINEAR CLASSIFICATION OF GAUSSIAN DISTRIBUTIONS
8.5 CONSISTENCY
8.6 CALIBRATION
8.7 CONCLUDING REMARKS
EXERCISES
NOTES
APPENDIX A BASIC PROBABILITY REVIEW
A.1 SAMPLE SPACES AND EVENTS
A.2 DEFINITION OF PROBABILITY
A.3 BOREL-CANTELLI LEMMAS
A.4 CONDITIONAL PROBABILITY
A.5 RANDOM VARIABLES
A.6 DISCRETE RANDOM VARIABLES
A.7 EXPECTATION
A.8 CONDITIONAL EXPECTATION
A.9 VARIANCE
A.10 VECTOR RANDOM VARIABLES
A.11 THE MULTIVARIATE GAUSSIAN
A.12 CONVERGENCE OF RANDOM SEQUENCES
A.13 LIMITING THEOREMS
APPENDIX B VAPNIK–CHERVONENKIS THEORY
B.1 SHATTER COEFFICIENTS
B.2 THE VC DIMENSION
B.3 VC THEORY OF CLASSIFICATION
B.4 VAPNIK–CHERVONENKIS THEOREM
APPENDIX C DOUBLE ASYMPTOTICS
BIBLIOGRAPHY
AUTHOR INDEX
SUBJECT INDEX
Error estimation determines the accuracy of the predictions that can be made by an inferred classifier, and therefore plays a fundamental role in all scientific applications involving classification. Yet the subject is neglected in most textbooks on pattern recognition and machine learning (with some admirable exceptions, such as the books by Devroye et al. (40) and McLachlan (110)). Most texts in the field describe a host of classification rules for training classifiers from data but treat the question of how to determine the accuracy of the inferred model only perfunctorily. Nevertheless, it is our belief, and the motivation behind this book, that the study of error estimation is at least as important as the study of classification rules themselves.
The error rate of interest could be the overall error rate on the population; it could also be the error rate on the individual classes. Were we to know the population distribution, we could in principle compute the error of a classifier exactly. Since we generally do not know the population distribution, we must estimate the error from sample data. If the sample size is large enough relative to the number of features, an accurate error estimate is obtained by dividing the sample into training and testing sets, designing a classifier on the training set, and testing it on the test set. However, this is not possible if the sample size is small, in which case training and testing have to be performed on the same data. The focus of this book is on the latter case, which is common in situations where data are expensive, time-consuming, or difficult to obtain. Moreover, even in “big data” applications one often faces a small-sample problem, since the number of measurements far exceeds the number of sample points.
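To make the large-sample route concrete, here is a minimal sketch of test-set (holdout) error estimation; it is ours, not the book's, and assumes scikit-learn and NumPy with synthetic Gaussian data, so the particulars are illustrative only.

```python
# A minimal sketch (ours, not the book's), assuming NumPy and scikit-learn:
# test-set (holdout) error estimation for an LDA classifier on synthetic
# two-class Gaussian data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 200, 2  # points per class, feature dimension

# Two Gaussian classes with shifted means.
X = np.vstack([rng.normal(0.0, 1.0, size=(n, d)),
               rng.normal(1.0, 1.0, size=(n, d))])
y = np.repeat([0, 1], n)

# Hold out half the sample; design the classifier on the training half only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

# The test-set error estimate is the error rate counted on the held-out half.
test_error = 1.0 - clf.score(X_te, y_te)
print(f"holdout error estimate: {test_error:.3f}")
```

With only a few dozen points in total, the held-out half would be both too small to test on reliably and too costly to withhold from training, which is precisely the small-sample situation the book addresses.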
A classifier defines a relation between the features and the label (target random variable) relative to the joint feature–label distribution. There are two possibilities regarding the definition of a classifier: either it is derived from the feature–label distribution and therefore is intrinsic to it, as in the case of an optimal classifier, or it is not derived directly from the feature–label distribution, in which case it is extrinsic to the distribution. In both cases, classification error rates are intrinsic, because they are derived from the distribution given the classifier. A classifier together with a stated error rate constitutes a pattern recognition model. If the classifier is intrinsic to the distribution, then there is no issue of error estimation, because the feature–label distribution is known and the error rate can be derived from it. On the other hand, if the classifier is extrinsic to the distribution, then it is generally the case that the distribution is not known, the classifier is constructed from data via a classification rule, and the error rate of interest is estimated from the data via an error estimation rule. These rules together constitute a pattern recognition rule. (Feature selection, if present, is considered part of the classification rule, and thus of the pattern recognition rule.)
From a scientific perspective, that is, when a classifier is applied to phenomena, the validity of a pattern recognition model is characterized, as with any scientific model, by the degree of concordance between predictions based on the model and corresponding observations.1 The only prediction based on a pattern recognition model is the percentage of errors the classifier will make when it is applied. Hence, validity rests on the concordance between the error rate stated in the model and the empirical error rate observed when the classifier is applied. When the distributions are not known and the model error must be estimated, model validity therefore depends entirely on the accuracy of the error estimation rule, so that the salient epistemological problem of classification is error estimation.2
Good classification rules produce classifiers with small average error rates. But is a classification rule good if its error rate cannot be estimated accurately? This is not just an abstract exercise in epistemology; it has immediate practical consequences for applied science. For instance, the problem is so serious in genomic classification that the Director of the Center for Drug Evaluation and Research at the FDA has estimated that as much as 75% of published biomarker associations are not replicable. Absent an accurate error estimation rule, we lack knowledge of the error rate, and the classifier is useless, no matter how sophisticated the classification rule that produced it. Therefore, one should speak of the goodness of a pattern recognition rule, as defined above, rather than that of a classification rule in isolation.
A common scenario in the literature these days proceeds in the following manner: propose an algorithm for classifier design; apply the algorithm to a few small-sample data sets, with no information pertaining to the distributions from which the data sets have been sampled; and apply an error estimation scheme based on resampling the data, typically cross-validation. With regard to the third step, we are given no characterization of the accuracy of the error estimator and no reason why it should provide a reasonably good estimate. Most strikingly, as we show in this book, we can expect it to be inaccurate in small-sample cases. Nevertheless, the claim is made that the proposed algorithm has been “validated.” Very little is said about the accuracy of the error estimation step, except perhaps that cross-validation is close to being unbiased if not too many points are held out. But this kind of comment is misleading, because a small bias is of little comfort if the variance is large, which it usually is for small samples and large feature sets. In addition, the classical cross-validation unbiasedness theorem holds only if sampling is random over the mixture of the populations; when this is not the case, for example, when the populations are sampled separately, bias is introduced, as shown in Chapter 5. These problems are especially detrimental in the current era of high-throughput measurement devices, where it is commonplace to be confronted with tens of thousands of features and very small sample sizes.
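As a rough illustration of the variance issue, the following sketch (ours, not the book's; it assumes scikit-learn and NumPy and a synthetic Gaussian model, so the numbers are only indicative) estimates the bias and the spread of the deviation of leave-one-out cross-validation at a small sample size, taking the error on a large independent sample as a stand-in for the true error.

```python
# A minimal sketch (ours, not the book's), assuming NumPy and scikit-learn:
# a small Monte Carlo experiment in which leave-one-out cross-validation is
# nearly unbiased yet highly variable for small samples. The "true" error of
# each designed classifier is approximated on a large independent sample drawn
# from the same synthetic Gaussian model.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
d, n, n_mc = 5, 20, 200          # feature dimension, sample size, MC repetitions
mu0, mu1 = np.zeros(d), 0.6 * np.ones(d)

def sample(m):
    X = np.vstack([rng.normal(mu0, 1.0, (m // 2, d)),
                   rng.normal(mu1, 1.0, (m // 2, d))])
    return X, np.repeat([0, 1], m // 2)

X_big, y_big = sample(20000)     # used only to approximate the true error
true_err, cv_err = [], []
for _ in range(n_mc):
    X, y = sample(n)
    clf = LinearDiscriminantAnalysis().fit(X, y)
    true_err.append(1.0 - clf.score(X_big, y_big))
    scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
    cv_err.append(1.0 - scores.mean())

dev = np.array(cv_err) - np.array(true_err)
print(f"bias of leave-one-out CV: {dev.mean():+.3f}")
print(f"std. dev. of the deviation: {dev.std():.3f}")  # typically large at n = 20
```

In runs of this kind the bias is usually small while the standard deviation of the deviation is of the same order as the error itself, which is the point made above: near-unbiasedness alone does not validate an estimator.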
The subject of error estimation has in fact a long history and has produced a large body of literature; four main review papers summarize the major advances in the field up to 2000 (Hand, 73; McLachlan, 109; Schiavo and Hand, 131; Toussaint, 148); recent advances in error estimation since 2000 include work on model selection (Bartlett et al., 10), bolstering (Braga-Neto and Dougherty, 15; Sima et al., 135), feature selection (Hanczar et al., 72; Sima et al., 134; Xiao et al., 161; Zhou and Mao, 166), confidence intervals (Kaariainen, 91; Kaariainen and Langford, 92; Xu et al., 162), model-based second-order properties (Zollanvari et al., 170, 171), and Bayesian error estimators (Dalton and Dougherty, 30,c). This book covers the classical studies as well as the recent developments. It discusses in detail nonparametric approaches, but gives special consideration, especially in the latter part of the book, to parametric, model-based approaches.
Pattern recognition plays a key role in many disciplines, including engineering, physics, statistics, computer science, social science, manufacturing, materials, medicine, biology, and more, so this book will be useful for researchers and practitioners in all these areas. It can serve as a text at the graduate level, as a supplement for general courses on pattern recognition and machine learning, or as a reference for researchers across all technical disciplines where classification plays a major role, which may in fact be all technical disciplines.
The book is organized into eight chapters. Chapters 1 and 2 provide the foundation for the rest of the book and must be read first. Chapters 3, 4, and 8 stand on their own and can be studied separately. Chapter 5 provides the foundation for Chapters 6 and 7, so these chapters should be read in this sequence. For example, chapter sequences 1-2-3-4, 1-2-5-6-7, and 1-2-8 are all possible ways of reading the book. Naturally, the book is best read from beginning to end. Short descriptions of each chapter are provided next.
Chapter 1. Classification. To make the book self-contained, the first chapter covers basic topics in classification required for the remainder of the text: classifiers, population-based and sample-based discriminants, and classification rules. It defines a few basic classification rules: LDA, QDA, discrete classification, nearest-neighbors, SVMs, neural networks, and classification trees, closing with a section on feature selection.
Chapter 2. Error Estimation. This chapter covers the basics of error estimation: definitions, performance metrics for estimation rules, test-set error estimation, and training-set error estimation. It also includes a discussion of pattern recognition models. The test-set error estimator is straightforward and well understood, with efficient distribution-free bounds on its performance; however, it assumes large sample sizes, which may be impractical. The thrust of the book is therefore on training-set error estimation, which is necessary for small-sample classifier design. We cover in this chapter the following error estimation rules: resubstitution, cross-validation, bootstrap, optimal convex estimation, smoothed error estimation, and bolstered error estimation.
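As a rough illustration of two of these rules, here is a minimal sketch (ours, not the book's; it assumes scikit-learn and NumPy and synthetic Gaussian data) computing resubstitution and a basic 0.632 bootstrap estimate on one small sample.

```python
# A minimal sketch (ours, not the book's), assuming NumPy and scikit-learn:
# resubstitution and a basic 0.632 bootstrap error estimate for an LDA
# classifier designed on one small synthetic sample. The 0.368/0.632 weights
# follow the standard 0.632 bootstrap rule.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
d, n = 5, 30
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, d)),
               rng.normal(0.6, 1.0, (n // 2, d))])
y = np.repeat([0, 1], n // 2)

# Resubstitution: count errors on the very data used for design (optimistic).
resub = 1.0 - LinearDiscriminantAnalysis().fit(X, y).score(X, y)

# Zero bootstrap: average error on the points left out of each bootstrap sample.
errs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)               # draw with replacement
    out = np.setdiff1d(np.arange(n), idx)          # points not drawn
    if out.size == 0 or np.unique(y[idx]).size < 2:
        continue                                    # skip degenerate resamples
    clf = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
    errs.append(1.0 - clf.score(X[out], y[out]))
zero = float(np.mean(errs))

b632 = 0.368 * resub + 0.632 * zero
print(f"resubstitution: {resub:.3f}  zero bootstrap: {zero:.3f}  0.632: {b632:.3f}")
```

The convex 0.368/0.632 combination attempts to balance the optimistic bias of resubstitution against the pessimistic bias of the zero bootstrap; the book treats such convex estimators, and their performance, in detail.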
Chapter 3. Performance Analysis. The main focus of the book is on performance characterization for error estimators, a topic whose coverage is inadequate in most pattern recognition texts. The fundamental entity in error analysis is the joint distribution between the true error and the error estimate. Of special importance is the regression between them. This chapter discusses the deviation distribution between the true and estimated error, with particular attention to the root-mean-square (RMS) error as the key metric of error estimation performance. The chapter covers various other issues: the impact of error estimation on feature selection, bias that can arise when considering multiple data sets or multiple rules, and measurement of performance reproducibility.
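For reference, the deviation-based quantities can be written as follows (generic notation, not necessarily the book's symbols), with the true error of the designed classifier denoted by epsilon and its estimate by epsilon-hat:

```latex
% Deviation-based performance metrics of an error estimator (generic notation).
\mathrm{Bias} = E[\hat{\varepsilon}_n - \varepsilon_n], \qquad
V_{\mathrm{dev}} = \mathrm{Var}(\hat{\varepsilon}_n - \varepsilon_n), \qquad
\mathrm{RMS} = \sqrt{E\!\left[(\hat{\varepsilon}_n - \varepsilon_n)^2\right]}
             = \sqrt{\mathrm{Bias}^2 + V_{\mathrm{dev}}}.
```

The decomposition makes the earlier point explicit: a small bias does not guarantee a small RMS if the deviation variance is large.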
Chapter 4. Error Estimation for Discrete Classification. For discrete classification, the moments of resubstitution and the basic resampling error estimators, cross-validation and bootstrap, can be represented in finite-sample closed forms, as presented in the chapter. Using these representations, formulations of various performance measures are obtained, including the bias, variance, deviation variance, and the RMS and correlation between the true and estimated errors. Next, complete enumeration schemes to represent the joint and conditional distributions between the true and estimated errors are discussed. The chapter concludes by surveying large-sample performance bounds for various error estimators.
Chapter 5. Distribution Theory. This chapter provides general distributional results for discriminant-based classifiers. Here we also introduce the issue of separate versus mixture sampling. For the true error, the distributional results take the form of conditional probability statements involving the discriminant. For resubstitution and the resampling-based estimators, the error estimates can be expressed in terms of error-counting expressions involving the discriminant. With these in hand, one can evaluate expected error rates and higher-order moments of the error rates, all of which take combinatorial forms. Second-order moments are included, thus allowing computation of the RMS. The chapter concludes with analytical expressions for the sampling distributions of resubstitution and leave-one-out cross-validation.
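As an anchor for the kinds of expressions involved, the true error and the resubstitution estimate of a designed classifier can be written as follows (generic notation, not necessarily the book's conventions), where c0 and c1 are the class prior probabilities, psi_n is the designed classifier, and S_n is the sample:

```latex
% True (conditional) error and resubstitution error count (generic notation).
\varepsilon_n = c_0\, P\big(\psi_n(\mathbf{X}) = 1 \mid Y = 0,\, S_n\big)
              + c_1\, P\big(\psi_n(\mathbf{X}) = 0 \mid Y = 1,\, S_n\big),
\qquad
\hat{\varepsilon}_{\mathrm{resub}} = \frac{1}{n} \sum_{i=1}^{n}
  I_{\{\psi_n(\mathbf{X}_i) \neq Y_i\}}.
```

For discriminant-based classifiers the events in these expressions reduce to sign conditions on the discriminant, which is what makes exact moment calculations tractable in the models studied here.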
Chapter 6. Gaussian Distribution Theory: Univariate Case. Historically, much effort went into characterizing the expected true, resubstitution, and leave-one-out error rates for linear discrimination in the Gaussian model. This chapter considers these results, along with more recent work on the bootstrap. It then provides exact expressions for the first- and higher-order moments of the various error rates, and thereby an exact analytical expression for the RMS. The chapter also provides exact expressions for the marginal sampling distributions of the resubstitution and leave-one-out error estimators, and closes with comments on the joint distribution of the true and estimated error rates.
Chapter 7. Gaussian Distribution Theory: Multivariate Case. This chapter begins with statistical representations for multivariate discrimination, including Bowker's classical representation for Anderson's LDA discriminant, Moran's representations for John's LDA discriminant, and McFarland and Richards' representation for the QDA discriminant. A discussion follows on computational techniques for obtaining the error rate moments based on these representations. Large-sample methods are then treated in the context of double-asymptotic approximations, in which the sample size and the feature dimension approach infinity at comparable rates. This leads to asymptotically exact, finite-sample approximations to the first and second moments of various error rates, which in turn yield double-asymptotic approximations for the RMS.
Chapter 8. Bayesian MMSE Error Estimation. In small-sample situations it is virtually impossible to make any useful statements concerning error-estimation accuracy unless distributional assumptions are made. In previous chapters, these distributional assumptions were made from a frequentist point of view, whereas this chapter considers a Bayesian approach to the problem. Partial assumptions lead to an uncertainty class of feature–label distributions and, assuming a prior distribution over the uncertainty class, one can obtain a minimum-mean-square-error (MMSE) estimate of the error rate. In contrast to the classical case, a sample-conditioned MSE of the error estimate can be defined and computed. In addition to the general MMSE theory, this chapter provides specific representations of the MMSE error estimator for discrete classification and for linear classification in the Gaussian model. It also discusses consistency of the estimate and calibration of non-Bayesian error estimators relative to a prior distribution governing model uncertainty.
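As a sketch of the central definition (generic notation, not necessarily the book's), suppose theta indexes the uncertainty class of feature–label distributions, S_n is the sample, and epsilon_n(theta) is the true error of the designed classifier under the distribution indexed by theta. The Bayesian MMSE error estimator is then the posterior expectation of the true error, and the sample-conditioned MSE is the corresponding posterior variance:

```latex
% Bayesian MMSE error estimator and its sample-conditioned MSE (generic notation).
\hat{\varepsilon}_n^{\,\mathrm{Bayes}}
  = E_\theta\!\left[\varepsilon_n(\theta) \mid S_n\right]
  = \int \varepsilon_n(\theta)\, \pi^{*}(\theta \mid S_n)\, d\theta,
\qquad
\mathrm{MSE}\!\left(\hat{\varepsilon}_n^{\,\mathrm{Bayes}} \mid S_n\right)
  = \mathrm{Var}_\theta\!\left(\varepsilon_n(\theta) \mid S_n\right),
```

where pi-star denotes the posterior over the uncertainty class given the sample.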
We close by noting that, for researchers in applied mathematics, statistics, engineering, and computer science, error estimation for pattern recognition offers a wide-open field with fundamental and practically important problems around every turn. Despite this, error estimation has been a neglected subject over the past few decades, even though it spans a huge application domain and its understanding is critical for scientific epistemology. This text, which we believe to be the first book devoted entirely to the topic of error estimation for pattern recognition, is an attempt to remedy that neglect.
ULISSES M. BRAGA-NETO EDWARD R. DOUGHERTY
College Station, Texas
April 2015
1
In the words of Richard Feynman, “It is whether or not the theory gives predictions that agree with experiment. It is not a question of whether a theory is philosophically delightful, or easy to understand, or perfectly reasonable from the point of view of common sense.” (Feynman, 56)
2
From a scientific perspective, the salient issue is error estimation. One can imagine Harald Cramér leisurely sailing on the Baltic off the coast of Stockholm, taking in the sights and sounds of the sea, when suddenly a gene-expression classifier to detect prostate cancer pops into his head. No classification rule has been applied, nor is that necessary. All that matters is that Cramér’s imagination has produced a classifier that operates on the population of interest with a sufficiently small error rate. Estimation of that rate requires an accurate error estimation rule.
Much of the material in this book has been developed over a dozen years of our joint work on error estimation for pattern recognition. In this regard, we would like to acknowledge the students whose excellent work has contributed to the content: Chao Sima, Jianping Hua, Thang Vu, Lori Dalton, Mohammadmahdi Yousefi, Mohammad Esfahani, Amin Zollanvari, and Blaise Hanczar. We also would like to thank Don Geman for many hours of interesting conversation about error estimation. We appreciate the financial support we have received from the Translational Genomics Research Institute, the National Cancer Institute, and the National Science Foundation. Last, and certainly not least, we would like to acknowledge the encouragement and support provided by our families in this time-consuming endeavor.
U.M.B.N. E.R.D.
