Error Estimation for Pattern Recognition

Ulisses M. Braga-Neto and Edward R. Dougherty
Description

This book is the first of its kind to discuss error estimation with a model-based approach. From the basics of classifiers and error estimators to distributional and Bayesian theory, it covers the essential issues pertaining to the scientific validity of pattern classification.

Error Estimation for Pattern Recognition focuses on error estimation, a broad and poorly understood topic that reaches every research area using pattern classification. It includes model-based approaches and discussions of newer error estimators such as bolstered and Bayesian estimators. The book was motivated by the application of pattern recognition to high-throughput data with limited replicates, a basic problem now appearing in many areas. The first two chapters cover basic issues in classification error estimation, such as definitions, test-set error estimation, and training-set error estimation. The remaining chapters cover results on the performance and representation of training-set error estimators for various pattern classifiers.

Additional features of the book include:

• The latest results on the accuracy of error estimation
• Performance analysis of re-substitution, cross-validation, and bootstrap error estimators using analytical and simulation approaches
• Highly interactive computer-based exercises and end-of-chapter problems

This is the first book exclusively about error estimation for pattern recognition.

Ulisses M. Braga-Neto is an Associate Professor in the Department of Electrical and Computer Engineering at Texas A&M University, USA. He received his PhD in Electrical and Computer Engineering from The Johns Hopkins University. Dr. Braga-Neto received an NSF CAREER Award for his work on error estimation for pattern recognition with applications in genomic signal processing. He is an IEEE Senior Member.

Edward R. Dougherty is a Distinguished Professor, Robert F. Kennedy ’26 Chair, and Scientific Director at the Center for Bioinformatics and Genomic Systems Engineering at Texas A&M University, USA. He is a fellow of both the IEEE and SPIE, and he has received the SPIE President's Award. Dr. Dougherty has authored several books, including Epistemology of the Cell: A Systems Perspective on Biological Knowledge and Random Processes for Image and Signal Processing (Wiley-IEEE Press).




IEEE Press

445 Hoes Lane

Piscataway, NJ 08854

IEEE Press Editorial Board

Tariq Samad, Editor in Chief

George W. Arnold

Mary Lanzerotti

Linda Shafer

Dmitry Goldgof

Pui-In Mak

MengChu Zhou

Ekram Hossain

Ray Perez

George Zobrist

Kenneth Moore, Director of IEEE Book and Information Services (BIS)

Technical Reviewer

Frank Alexander, Los Alamos National Laboratory

ERROR ESTIMATION FOR PATTERN RECOGNITION

ULISSES M. BRAGA-NETO
EDWARD R. DOUGHERTY

Copyright © 2015 by The Institute of Electrical and Electronics Engineers, Inc.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data is available.

ISBN: 978-1-118-99973-8

To our parents (in memoriam) Jacinto and Consuêlo and Russell and Ann

And to our wives and children Flávia, Maria Clara, and Ulisses and

CONTENTS

PREFACE

NOTES

ACKNOWLEDGMENTS

LIST OF SYMBOLS

CHAPTER 1 CLASSIFICATION

1.1 CLASSIFIERS

1.2 POPULATION-BASED DISCRIMINANTS

1.3 CLASSIFICATION RULES

1.4 SAMPLE-BASED DISCRIMINANTS

1.5 HISTOGRAM RULE

1.6 OTHER CLASSIFICATION RULES

1.7 FEATURE SELECTION

EXERCISES

NOTES

CHAPTER 2 ERROR ESTIMATION

2.1 ERROR ESTIMATION RULES

2.2 PERFORMANCE METRICS

2.3 TEST-SET ERROR ESTIMATION

2.4 RESUBSTITUTION

2.5 CROSS-VALIDATION

2.6 BOOTSTRAP

2.7 CONVEX ERROR ESTIMATION

2.8 SMOOTHED ERROR ESTIMATION

2.9 BOLSTERED ERROR ESTIMATION

EXERCISES

NOTES

CHAPTER 3 PERFORMANCE ANALYSIS

3.1 EMPIRICAL DEVIATION DISTRIBUTION

3.2 REGRESSION

3.3 IMPACT ON FEATURE SELECTION

3.4 MULTIPLE-DATA-SET REPORTING BIAS

3.5 MULTIPLE-RULE BIAS

3.6 PERFORMANCE REPRODUCIBILITY

EXERCISES

NOTES

CHAPTER 4 ERROR ESTIMATION FOR DISCRETE CLASSIFICATION

4.1 ERROR ESTIMATORS

4.2 SMALL-SAMPLE PERFORMANCE

4.3 LARGE-SAMPLE PERFORMANCE

EXERCISES

CHAPTER 5 DISTRIBUTION THEORY

5.1 MIXTURE SAMPLING VERSUS SEPARATE SAMPLING

5.2 SAMPLE-BASED DISCRIMINANTS REVISITED

5.3 TRUE ERROR

5.4 ERROR ESTIMATORS

5.5 EXPECTED ERROR RATES

5.6 HIGHER-ORDER MOMENTS OF ERROR RATES

5.7 SAMPLING DISTRIBUTION OF ERROR RATES

EXERCISES

CHAPTER 6 GAUSSIAN DISTRIBUTION THEORY: UNIVARIATE CASE

6.1 HISTORICAL REMARKS

6.2 UNIVARIATE DISCRIMINANT

6.3 EXPECTED ERROR RATES

6.4 HIGHER-ORDER MOMENTS OF ERROR RATES

6.5 SAMPLING DISTRIBUTIONS OF ERROR RATES

EXERCISES

CHAPTER 7 GAUSSIAN DISTRIBUTION THEORY: MULTIVARIATE CASE

7.1 MULTIVARIATE DISCRIMINANTS

7.2 SMALL-SAMPLE METHODS

7.3 LARGE-SAMPLE METHODS

EXERCISES

NOTES

CHAPTER 8 BAYESIAN MMSE ERROR ESTIMATION

8.1 THE BAYESIAN MMSE ERROR ESTIMATOR

8.2 SAMPLE-CONDITIONED MSE

8.3 DISCRETE CLASSIFICATION

8.4 LINEAR CLASSIFICATION OF GAUSSIAN DISTRIBUTIONS

8.5 CONSISTENCY

8.6 CALIBRATION

8.7 CONCLUDING REMARKS

EXERCISES

NOTES

APPENDIX A BASIC PROBABILITY REVIEW

A.1 SAMPLE SPACES AND EVENTS

A.2 DEFINITION OF PROBABILITY

A.3 BOREL-CANTELLI LEMMAS

A.4 CONDITIONAL PROBABILITY

A.5 RANDOM VARIABLES

A.6 DISCRETE RANDOM VARIABLES

A.7 EXPECTATION

A.8 CONDITIONAL EXPECTATION

A.9 VARIANCE

A.10 VECTOR RANDOM VARIABLES

A.11 THE MULTIVARIATE GAUSSIAN

A.12 CONVERGENCE OF RANDOM SEQUENCES

A.13 LIMITING THEOREMS

APPENDIX B VAPNIK–CHERVONENKIS THEORY

B.1 SHATTER COEFFICIENTS

B.2 THE VC DIMENSION

B.3 VC THEORY OF CLASSIFICATION

B.4 VAPNIK–CHERVONENKIS THEOREM

APPENDIX C DOUBLE ASYMPTOTICS

BIBLIOGRAPHY

AUTHOR INDEX

SUBJECT INDEX

EULA

List of Tables

Chapter 8

Table 8.1


PREFACE

Error estimation determines the accuracy of the predictions that can be made by an inferred classifier, and it therefore plays a fundamental role in all scientific applications involving classification. Yet the subject is neglected in most textbooks on pattern recognition and machine learning (with some admirable exceptions, such as the books by Devroye et al. (40) and McLachlan (110)). Most texts in the field describe a host of classification rules for training classifiers from data but treat the question of how to determine the accuracy of the inferred model only perfunctorily. Nevertheless, it is our belief, and the motivation behind this book, that the study of error estimation is at least as important as the study of classification rules themselves.

The error rate of interest could be the overall error rate on the population, but it could also be the error rates on the individual classes. Were we to know the population distribution, we could in principle compute the error of a classifier exactly. Since we generally do not know the population distribution, we must estimate the error from sample data. If the sample size is large enough relative to the number of features, an accurate error estimate is obtained by dividing the sample into training and test sets, designing a classifier on the training set, and testing it on the test set. This is not possible if the sample size is small, in which case training and testing must be performed on the same data. The focus of this book is on the latter case, which is common in situations where data are expensive, time-consuming, or difficult to obtain. Moreover, even in "big data" applications one is often dealing with small sample sizes, since the number of measurements far exceeds the number of sample points.
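To fix ideas with notation (the symbols here are ours; the book's own notation is given in the List of Symbols), suppose a classifier \(\psi_n\) is designed from a sample \(S_n\) of size \(n\). Its true error is its misclassification probability on the population, and the test-set (holdout) estimator counts errors on \(m\) points withheld from training:

\[
\varepsilon_n \;=\; P\big(\psi_n(X) \neq Y \mid S_n\big),
\qquad
\hat{\varepsilon}_{\mathrm{hold}} \;=\; \frac{1}{m}\sum_{i=1}^{m} I\big(\psi_n(X_i^{\mathrm{test}}) \neq Y_i^{\mathrm{test}}\big).
\]

When \(n\) is small, no points can be spared for testing, and the error must instead be estimated from \(S_n\) itself; that is the setting studied throughout this book.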

A classifier defines a relation between the features and the label (target random variable) relative to the joint feature–label distribution. There are two possibilities regarding the definition of a classifier: either it is derived from the feature–label distribution, and therefore is intrinsic to it, as in the case of an optimal classifier, or it is not derived directly from the feature–label distribution, in which case it is extrinsic to the distribution. In both cases, classification error rates are intrinsic, because they are derived from the distribution given the classifier. Together, a classifier and a stated error rate constitute a pattern recognition model. If the classifier is intrinsic to the distribution, then there is no issue of error estimation, because the feature–label distribution is known and the error rate can be derived. On the other hand, if the classifier is extrinsic to the distribution, then it is generally the case that the distribution is not known, the classifier is constructed from data via a classification rule, and the error rate of interest is estimated from the data via an error estimation rule. These rules together constitute a pattern recognition rule. (Feature selection, if present, is considered part of the classification rule, and thus of the pattern recognition rule.)

From a scientific perspective, that is, when a classifier is applied to phenomena, the validity of a pattern recognition model is characterized, as with any scientific model, by the degree of concordance between predictions based on the model and corresponding observations.1 The only prediction based on a pattern recognition model is the percentage of errors the classifier will make when it is applied. Hence, validity rests on concordance between the error rate in the model and the empirical error rate on the data when the classifier is applied. When the distributions are not known and the model error must be estimated, model validity therefore depends entirely on the accuracy of the error estimation rule, so that the salient epistemological problem of classification is error estimation.2

Good classification rules produce classifiers with small average error rates. But is a classification rule good if its error rate cannot be estimated accurately? This is not just an abstract exercise in epistemology; it has immediate practical consequences for applied science. For instance, the problem is so serious in genomic classification that the Director of the Center for Drug Evaluation and Research at the FDA has estimated that as much as 75% of published biomarker associations are not replicable. Absent an accurate error estimation rule, we lack knowledge of the error rate, and the classifier is useless, no matter how sophisticated the classification rule that produced it. One should therefore speak of the goodness of a pattern recognition rule, as defined above, rather than that of a classification rule in isolation.

A common scenario in the recent literature proceeds in the following manner: propose an algorithm for classifier design; apply the algorithm to a few small-sample data sets, with no information pertaining to the distributions from which the data sets have been sampled; and apply an error estimation scheme based on resampling the data, typically cross-validation. With regard to the third step, we are given no characterization of the accuracy of the error estimator or why it should provide a reasonably good estimate. Most strikingly, as we show in this book, we can expect it to be inaccurate in small-sample cases. Nevertheless, the claim is made that the proposed algorithm has been "validated." Very little is said about the accuracy of the error estimation step, except perhaps that cross-validation is close to being unbiased if not too many points are held out. This kind of comment is misleading, given that a small bias is of little consequence if the variance is large, as it usually is for small samples and large feature sets. In addition, the classical cross-validation unbiasedness theorem holds only if sampling is random over the mixture of the populations. When this is not the case, for example, when the populations are sampled separately, bias is introduced, as shown in Chapter 5. These problems are especially detrimental in the current era of high-throughput measurement devices, where it is now commonplace to be confronted with tens of thousands of features and very small sample sizes.
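The point can be made concrete with a small simulation. The following sketch is ours, not taken from the book; the Gaussian model, sample sizes, and all names in it are illustrative assumptions. It uses scikit-learn's LDA and leave-one-out utilities to show how widely the leave-one-out estimate scatters around the true error at a total sample size of 20:

```python
# A minimal simulation sketch (not from the book): deviation of the
# leave-one-out estimate from the true error for LDA on small Gaussian samples.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
d, delta = 2, 1.0                      # feature dimension and class separation

def sample(n_per_class):
    """Separate sampling: n_per_class points from each spherical Gaussian class."""
    x = np.vstack([rng.normal(0.0, 1.0, (n_per_class, d)),
                   rng.normal(delta, 1.0, (n_per_class, d))])
    y = np.repeat([0, 1], n_per_class)
    return x, y

x_big, y_big = sample(10000)           # large sample approximates the true error
deviations = []
for _ in range(200):                   # independent small training samples
    x, y = sample(10)                  # n = 20 in total
    clf = LinearDiscriminantAnalysis().fit(x, y)
    true_err = np.mean(clf.predict(x_big) != y_big)
    loo_acc = cross_val_score(LinearDiscriminantAnalysis(), x, y,
                              cv=LeaveOneOut()).mean()
    deviations.append((1.0 - loo_acc) - true_err)

dev = np.asarray(deviations)
print(f"bias ~ {dev.mean():+.4f}   deviation std ~ {dev.std():.4f}")
```

Typically one finds a small bias but a deviation standard deviation that is large relative to the true error itself, which is precisely the small-sample problem described above.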

The subject of error estimation has in fact a long history and has produced a large body of literature. Four main review papers summarize the major advances in the field up to 2000 (Hand, 73; McLachlan, 109; Schiavo and Hand, 131; Toussaint, 148). Advances since 2000 include work on model selection (Bartlett et al., 10), bolstering (Braga-Neto and Dougherty, 15; Sima et al., 135), feature selection (Hanczar et al., 72; Sima et al., 134; Xiao et al., 161; Zhou and Mao, 166), confidence intervals (Kaariainen, 91; Kaariainen and Langford, 92; Xu et al., 162), model-based second-order properties (Zollanvari et al., 170, 171), and Bayesian error estimators (Dalton and Dougherty, 30,c). This book covers the classical studies as well as the recent developments. It discusses nonparametric approaches in detail but gives special consideration, particularly in the latter part of the book, to parametric, model-based approaches.

Pattern recognition plays a key role in many disciplines, including engineering, physics, statistics, computer science, social science, manufacturing, materials, medicine, and biology, so this book will be useful for researchers and practitioners in all of these areas. It can serve as a text at the graduate level, as a supplement for general courses on pattern recognition and machine learning, or as a reference for researchers across all technical disciplines in which classification plays a major role, which may in fact be all technical disciplines.

The book is organized into eight chapters. Chapters 1 and 2 provide the foundation for the rest of the book and must be read first. Chapters 3, 4, and 8 stand on their own and can be studied separately. Chapter 5 provides the foundation for Chapters 6 and 7, so these chapters should be read in this sequence. For example, chapter sequences 1-2-3-4, 1-2-5-6-7, and 1-2-8 are all possible ways of reading the book. Naturally, the book is best read from beginning to end. Short descriptions of each chapter are provided next.

Chapter 1. Classification. To make the book self-contained, the first chapter covers the topics in classification required for the remainder of the text: classifiers, population-based and sample-based discriminants, and classification rules. It defines several specific classification rules (LDA, QDA, discrete classification, nearest-neighbors, SVMs, neural networks, and classification trees) and closes with a section on feature selection.
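As an illustration of a population-based discriminant treated there (written here in generic notation of our choosing), the optimal, or Bayes, classifier minimizes the misclassification probability:

\[
\psi^{*}(x) \;=\;
\begin{cases}
1, & P(Y=1 \mid X=x) > 1/2,\\
0, & \text{otherwise},
\end{cases}
\qquad
\varepsilon^{*} \;=\; E\big[\min\{\eta(X),\, 1-\eta(X)\}\big],
\]

where \(\eta(x) = P(Y=1 \mid X=x)\); the Bayes error \(\varepsilon^{*}\) is the smallest error achievable by any classifier for the given feature–label distribution.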

Chapter 2. Error Estimation. This chapter covers the basics of error estimation: definitions, performance metrics for estimation rules, test-set error estimation, and training-set error estimation. It also includes a discussion of pattern recognition models. The test-set error estimator is straightforward and well understood, and there are efficient distribution-free bounds on its performance. However, it assumes large sample sizes, which may be impractical. The thrust of the book is therefore on training-set error estimation, which is necessary for small-sample classifier design. This chapter covers the following error estimation rules: resubstitution, cross-validation, bootstrap, optimal convex estimation, smoothed error estimation, and bolstered error estimation.
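The distribution-free bound alluded to above takes, in one standard form (a direct consequence of Hoeffding's inequality for a test set of size \(m\) drawn independently of the training data), the shape

\[
P\big(\,\lvert \hat{\varepsilon}_{\mathrm{hold}} - \varepsilon_n \rvert > \tau \,\big) \;\le\; 2\,e^{-2m\tau^{2}},
\]

which is informative only when \(m\) is reasonably large; for the small samples of interest here no such comfort is available, and the training-set estimators listed above must be analyzed by other means.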

Chapter 3. Performance Analysis. The main focus of the book is on performance characterization for error estimators, a topic whose coverage is inadequate in most pattern recognition texts. The fundamental entity in error analysis is the joint distribution between the true error and the error estimate. Of special importance is the regression between them. This chapter discusses the deviation distribution between the true and estimated error, with particular attention to the root-mean-square (RMS) error as the key metric of error estimation performance. The chapter covers various other issues: the impact of error estimation on feature selection, bias that can arise when considering multiple data sets or multiple rules, and measurement of performance reproducibility.
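In symbols (notation ours), with true error \(\varepsilon_n\) and estimate \(\hat{\varepsilon}\), the RMS admits the usual decomposition into bias and deviation variance:

\[
\mathrm{RMS}(\hat{\varepsilon}) \;=\; \sqrt{E\big[(\hat{\varepsilon}-\varepsilon_n)^{2}\big]}
\;=\; \sqrt{\,\mathrm{Bias}(\hat{\varepsilon})^{2} + \mathrm{Var}(\hat{\varepsilon}-\varepsilon_n)\,},
\qquad
\mathrm{Bias}(\hat{\varepsilon}) = E[\hat{\varepsilon}-\varepsilon_n].
\]

A small bias is therefore of little comfort when the deviation variance is large, a point made repeatedly in the book.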

Chapter 4. Error Estimation for Discrete Classification. For discrete classification, the moments of resubstitution and the basic resampling error estimators, cross-validation and bootstrap, can be represented in finite-sample closed forms, as presented in the chapter. Using these representations, formulations of various performance measures are obtained, including the bias, variance, deviation variance, and the RMS and correlation between the true and estimated errors. Next, complete enumeration schemes to represent the joint and conditional distributions between the true and estimated errors are discussed. The chapter concludes by surveying large-sample performance bounds for various error estimators.

Chapter 5. Distribution Theory. This chapter provides general distributional results for discriminant-based classifiers. Here we also introduce the issue of separate versus mixture sampling. For the true error, distributional results are given in terms of conditional probability statements involving the discriminant. Resubstitution and the resampling-based estimators, in turn, can be expressed in terms of error-counting expressions involving the discriminant. With these in hand, one can then evaluate expected error rates and higher-order moments of the error rates, all of which take combinatorial forms. Second-order moments are included, thus allowing computation of the RMS. The chapter concludes with analytical expressions for the sampling distributions of resubstitution and leave-one-out cross-validation.
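Schematically (with a sign convention chosen here purely for illustration, \(W(x) > 0\) predicting class 0), the true error is a pair of conditional probabilities of the discriminant crossing its threshold, and resubstitution simply counts such crossings on the training data:

\[
\varepsilon_n \;=\; P(Y=0)\,P\big(W(X)\le 0 \mid Y=0,\, S_n\big) \;+\; P(Y=1)\,P\big(W(X)> 0 \mid Y=1,\, S_n\big),
\]
\[
\hat{\varepsilon}^{\,\mathrm{resub}} \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big[\, I\big(W(X_i)\le 0,\, Y_i=0\big) \;+\; I\big(W(X_i)> 0,\, Y_i=1\big) \,\Big].
\]

Expected error rates and higher-order moments then follow by taking expectations of such error-counting expressions.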

Chapter 6. Gaussian Distribution Theory: Univariate Case. Historically, much effort was put into characterizing expected error rates for linear discrimination in the Gaussian model for the true error, resubstitution, and leave-one-out. This chapter considers these, along with more recent work on the bootstrap. It then goes on to provide exact expressions for the first- and higher-order moments of various error rates, thereby providing an exact analytic expression for the RMS. The chapter provides exact expressions for the marginal sampling distributions of the resubstitution and leave-one-out error estimators. It closes with comments on the joint distribution of true and estimated error rates.

Chapter 7. Gaussian Distribution Theory: Multivariate Case. This chapter begins with statistical representations for multivariate discrimination, including Bowker's classical representation for Anderson's LDA discriminant, Moran's representations for John's LDA discriminant, and McFarland and Richards' representation for the QDA discriminant. A discussion follows on computational techniques for obtaining the error rate moments based on these representations. Large-sample methods are then treated in the context of double-asymptotic approximations, in which the sample size and feature dimension approach infinity at a comparable rate. This leads to asymptotically exact, finite-sample approximations to the first and second moments of various error rates, and in turn to double-asymptotic approximations for the RMS.
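For reference, Anderson's LDA discriminant, around which much of the chapter's distribution theory revolves, has the familiar form (stated in a common parameterization, not necessarily the book's exact notation)

\[
W(x) \;=\; \Big(x - \tfrac{1}{2}(\bar{x}_0 + \bar{x}_1)\Big)^{T} \hat{\Sigma}^{-1} (\bar{x}_0 - \bar{x}_1),
\]

where \(\bar{x}_0\) and \(\bar{x}_1\) are the class sample means and \(\hat{\Sigma}\) is the pooled sample covariance matrix; the cited representations characterize the distributions of such statistics under Gaussian populations.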

Chapter 8. Bayesian MMSE Error Estimation. In small-sample situations it is virtually impossible to make any useful statements concerning error-estimation accuracy unless distributional assumptions are made. In previous chapters, these distributional assumptions were made from a frequentist point of view, whereas this chapter considers a Bayesian approach to the problem. Partial assumptions lead to an uncertainty class of feature–label distributions and, assuming a prior distribution over the uncertainty class, one can obtain a minimum-mean-square-error (MMSE) estimate of the error rate. In contrast to the classical case, a sample-conditioned MSE of the error estimate can be defined and computed. In addition to the general MMSE theory, this chapter provides specific representations of the MMSE error estimator for discrete classification and for linear classification in the Gaussian model. It also discusses consistency of the estimate and calibration of non-Bayesian error estimators relative to a prior distribution governing model uncertainty.
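In brief (generic notation), if \(\theta\) indexes the uncertainty class of feature–label distributions with prior \(\pi\), the Bayesian MMSE error estimator is the posterior expectation of the true error, and its sample-conditioned MSE is the posterior variance:

\[
\hat{\varepsilon}(S_n) \;=\; E_{\pi}\big[\varepsilon_n(\theta) \mid S_n\big],
\qquad
\mathrm{MSE}\big(\hat{\varepsilon} \mid S_n\big) \;=\; \mathrm{Var}\big(\varepsilon_n(\theta) \mid S_n\big).
\]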

We close by noting that, for researchers in applied mathematics, statistics, engineering, and computer science, error estimation for pattern recognition offers a wide-open field with fundamental and practically important problems around every turn. Despite this, error estimation has been a neglected subject over the past few decades, even though it spans a huge application domain and its understanding is critical for scientific epistemology. This text, which we believe to be the first book devoted entirely to error estimation for pattern recognition, is an attempt to remedy that neglect.

ULISSES M. BRAGA-NETO
EDWARD R. DOUGHERTY

College Station, Texas
April 2015

NOTES

1. In the words of Richard Feynman, "It is whether or not the theory gives predictions that agree with experiment. It is not a question of whether a theory is philosophically delightful, or easy to understand, or perfectly reasonable from the point of view of common sense." (Feynman, 56)

2. From a scientific perspective, the salient issue is error estimation. One can imagine Harald Cramér leisurely sailing on the Baltic off the coast of Stockholm, taking in the sights and sounds of the sea, when suddenly a gene-expression classifier to detect prostate cancer pops into his head. No classification rule has been applied, nor is that necessary. All that matters is that Cramér's imagination has produced a classifier that operates on the population of interest with a sufficiently small error rate. Estimation of that rate requires an accurate error estimation rule.

ACKNOWLEDGMENTS

Much of the material in this book has been developed over a dozen years of our joint work on error estimation for pattern recognition. In this regard, we would like to acknowledge the students whose excellent work has contributed to the content: Chao Sima, Jianping Hua, Thang Vu, Lori Dalton, Mohammadmahdi Yousefi, Mohammad Esfahani, Amin Zollanvari, and Blaise Hanczar. We also would like to thank Don Geman for many hours of interesting conversation about error estimation. We appreciate the financial support we have received from the Translational Genomics Research Institute, the National Cancer Institute, and the National Science Foundation. Last, and certainly not least, we would like to acknowledge the encouragement and support provided by our families in this time-consuming endeavor.

U.M.B.N. E.R.D.