Ensemble Classification Methods with Applications in R

Description

An essential guide to two burgeoning topics in machine learning – classification trees and ensemble learning 

Ensemble Classification Methods with Applications in R introduces the concepts and principles of ensemble classifier methods and includes a review of the most commonly used techniques. This important resource shows how ensemble classification has arisen as an extension of individual classifiers. The text puts the emphasis on two areas of machine learning: classification trees and ensemble learning. The authors explore the basic characteristics of ensemble classification methods and explain the types of problems that can emerge in their application.

Written by a team of noted experts in the field, the text is divided into two main sections. The first section outlines the theoretical underpinnings of the topic and the second section presents examples of practical applications. The book contains a wealth of illustrative cases from business failure prediction, zoology, ecology, and other fields. This vital guide:

  • Offers an important text that has been tested both in the classroom and at tutorials at conferences
  • Contains authoritative information written by leading experts in the field
  • Presents a comprehensive text that can be applied to courses in machine learning, data mining and artificial intelligence 
  • Combines in one volume two of the most intriguing topics in machine learning: ensemble learning and classification trees

Written for researchers from many fields, such as biostatistics, economics, environmental science, and zoology, as well as for students of data mining and machine learning, Ensemble Classification Methods with Applications in R puts the focus on two topics in machine learning: classification trees and ensemble learning.

 




Table of Contents

Cover

List of Contributors

List of Tables

List of Figures

Preface

Chapter 1: Introduction

1.1 Introduction

1.2 Definition

1.3 Taxonomy of Supervised Classification Methods

1.4 Estimation of the Accuracy of a Classification System

1.5 Classification Trees

Chapter 2: Limitation of the Individual Classifiers

2.1 Introduction

2.2 Error Decomposition: Bias and Variance

2.3 Study of Classifier Instability

2.4 Advantages of Ensemble Classifiers

2.5 Bayesian Perspective of Ensemble Classifiers

Chapter 3: Ensemble Classifiers Methods

3.1 Introduction

3.2 Taxonomy of Ensemble Methods

3.3 Bagging

3.4 Boosting

3.5 Random Forests

Chapter 4: Classification with Individual and Ensemble Trees in R

4.1 Introduction

4.2 adabag: An R Package for Classification with Boosting and Bagging

4.3 The “German Credit” Example

Chapter 5: Bankruptcy Prediction Through Ensemble Trees

5.1 Introduction

5.2 Problem Description

5.3 Applications

5.4 Conclusions

Chapter 6: Experiments with Adabag in Biology Classification Tasks

6.1 Classification of Color Texture Feature Patterns Extracted From Cells in Histological Images of Fish Ovary

6.2 Direct Kernel Perceptron: Ultra‐Fast Kernel ELM‐Based Classification with Non‐Iterative Closed‐Form Weight Calculation

6.3 Do We Need Hundreds of Classifiers to Solve Real‐World Classification Problems?

6.4 On the Use of Nominal and Ordinal Classifiers for the Discrimination of Stages of Development in Fish Oocytes

Chapter 7: Generalization Bounds for Ranking Algorithms

7.1 Introduction

7.2 Assumptions, Main Theorem, and Application

7.3 Experiments

7.4 Conclusions

Chapter 8: Classification and Regression Trees for Analyzing Irrigation Decisions

8.1 Introduction

8.2 Theory

8.3 Case Study and Methods

8.4 Results and Discussion

8.5 Conclusions

Chapter 9: Boosted Rule Learner and its Properties

9.1 Introduction

9.2 Separate‐and‐Conquer

9.3 Boosting in Rule Induction

9.4 Experiments

9.5 Conclusions

Chapter 10: Credit Scoring with Individual and Ensemble Trees

10.1 Introduction

10.2 Measures of Accuracy

10.3 Data Description

10.4 Classification of Borrowers Applying Ensemble Trees

10.5 Conclusions

Chapter 11: An Overview of Multiple Classifier Systems Based on Generalized Additive Models

11.1 Introduction

11.2 Multiple Classifier Systems Based on GAMs

11.3 Experiments and Applications

11.4 Software Implementation in R: the GAMens Package

11.5 Conclusions

References

Index

End User License Agreement


Ensemble Classification Methods with Applications in R

Edited by

Esteban Alfaro, Matías Gámez and Noelia García

University of Castilla-La Mancha, Spain

Copyright

This edition first published 2019

© 2019 John Wiley & Sons Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Esteban Alfaro, Matías Gámez and Noelia García to be identified as the authors of editorial material in this work has been asserted in accordance with law.

Registered Offices

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

9600 Garsington Road, Oxford, OX4 2DQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data

Names: Alfaro, Esteban, 1977- editor. | Gámez, Matías, 1966- editor. | García, Noelia, 1973- editor.

Title: Ensemble classification methods with applications in R / edited by Esteban Alfaro, Matías Gámez, Noelia García.

Description: Hoboken, NJ : John Wiley & Sons, 2019. | Includes bibliographical references and index.

Identifiers: LCCN 2018022257 (print) | LCCN 2018033307 (ebook) | ISBN 9781119421573 (Adobe PDF) | ISBN 9781119421559 (ePub) | ISBN 9781119421092 (hardcover)

Subjects: LCSH: Machine learning-Statistical methods. | R (Computer program language)

Classification: LCC Q325.5 (ebook) | LCC Q325.5 .E568 2018 (print) | DDC 006.3/1-dc23

LC record available at https://lccn.loc.gov/2018022257

Cover Design: Wiley

Cover Image: Courtesy of Esteban Alfaro via wordle.net

List of Contributors

Esteban Alfaro, Economics and Business Faculty, Institute for Regional Development, University of Castilla‐La Mancha.

Sanyogita Andriyas, Department of Civil & Environmental Engineering, Utah State University, Logan, Utah, USA.

Eva Cernadas, Centro Singular de Investigación en Tecnoloxías da Información (CiTIUS), University of Santiago de Compostela, Santiago de Compostela, Spain.

Mariola Chrzanowska, Faculty of Applied Informatics and Mathematics, Department of Econometrics and Statistics, Warsaw University of Life Sciences (SGGW), Warsaw, Poland.

Davy Cielen, Maastricht School of Management, Maastricht, the Netherlands.

Kristof Coussement, IESEG Center for Marketing Analytics (ICMA), IESEG School of Management, Université Catholique de Lille, Lille, France.

Koen W. De Bock, Audencia Business School, Nantes, France.

Manuel Fernández‐Delgado, Centro Singular de Investigación en Tecnoloxías da Información (CiTIUS), University of Santiago de Compostela, Santiago de Compostela, Spain.

Matías Gámez, Institute for Regional Development, University of Castilla‐La Mancha.

Noelia García, Economics and Business Faculty, University of Castilla‐La Mancha.

Mariusz Kubus, Department of Mathematics and Computer Science Applications, Opole University of Technology, Poland.

Mac McKee, Utah Water Research Laboratory and Department of Civil & Environmental Engineering, Utah State University, Logan, Utah, USA.

María Pérez‐Ortiz, Department of Quantitative Methods, University of Loyola Andalucía, Córdoba, Spain.

Wojciech Rejchel, Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Torun, Poland; Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland.

Dorota Witkowska, Department of Finance and Strategic Management, University of Lodz, Lodz, Poland.

List of Tables

Table 1.1 Comparison of real error rate estimation methods.

Table 2.1 Error decomposition of a classifier (Tibshirani, 1996a).

Table 3.1 Example of the weight updating process in AdaBoost.

Table 3.2 Example of the weight updating process in AdaBoost.M1.

Table 5.1 Description of variables (some of these ratios are explained in White et al. (2003)).

Table 5.2 Results obtained from descriptive analysis and ANOVA (SW test = Shapiro‐Wilk test; KS test = Kolmogorov‐Smirnov test).

Table 5.3 Correlation matrix.

Table 5.4 Unstandardized coefficients of the canonical discriminant function.

Table 5.5 Confusion matrix and errors with LDA.

Table 5.6 Confusion matrix and errors with an artificial neural network.

Table 5.7 Sensitivity analysis.

Table 5.8 Confusion matrix and errors with AdaBoost.

Table 5.9 Comparison of results with other methods.

Table 5.10 Confusion matrix and errors with the pruned tree.

Table 5.11 Confusion matrix and errors with the AdaBoost.M1 model.

Table 6.1 Collection of 121 data sets from the UCI database and our real‐world problems. ac‐inflam, acute inflammation; bc, breast cancer; congress‐voting, congressional voting; ctg, cardiotocography; conn‐bench‐sonar, connectionist benchmark sonar mines rocks; conn‐bench‐vowel, connectionist benchmark vowel deterding; pb, Pittsburg bridges; st, statlog; vc, vertebral column.

Table 6.2 Friedman ranking, average accuracy and Cohen κ (both in %) for the 30 best classifiers, ordered by increasing Friedman ranking. BG, bagging; MAB, MultiBoostAB; RC, RandomCommittee.

Table 6.3 Classification results: accuracy, Cohen κ (both in %), mean absolute error (MAE), Kendall τ, and Spearman ρ for species MC and TL with three stages.

Table 6.4 Classification results for the species RH with six stages using the LOIO and MIX methodologies.

Table 6.5 Confusion matrices and sensitivities/positive predictivities for each stage (in %) achieved by SVMOD and GSVM for species RH and LOIO experiments.

Table 7.1 Errors of estimators.

Table 8.1 Predictor variables, the represented factors as seen by the farmer, and the target variable used for trees analysis.

Table 8.2 The cost‐complexity parameter (CP), relative error, cross‐validation error (xerror), and cross‐validation standard deviation (xstd) for trees with nsplit from 0 to 8.

Table 8.3 Accuracy estimates on test data for CART models. Resub, resubstitution accuracy estimate; Xval, 10‐fold cross‐validation accuracy estimate. (a) 1‐day, (b) 4‐day, (c) all days models.

Table 8.4 Important variables for irrigating different crops according to CART.

Table 9.1 Classification error (in %) estimated for test samples.

Table 9.2 The comparison of run time (in seconds) for the largest data sets from Table 9.1.

Table 9.3 Standard deviations and test for variances for 30 estimations of the classification error.

Table 9.4 Numbers of irrelevant variables introduced to the classifiers.

Table 9.5 Classification error (in %) estimated in test samples for the Pima data set with irrelevant variables added from various distributions.

Table 10.1 Borrowers according to age (…).

Table 10.2 Borrowers according to place credit granted (…).

Table 10.3 Borrowers according to share of the loan already repaid (…).

Table 10.4 Borrowers according to the value of the credit (…).

Table 10.5 Borrowers according to period of loan repayment (…).

Table 10.6 Structure of samples used for further experiments. Note that … denotes the borrower who paid back the loan in time, and … otherwise.

Table 10.7 Results of classification applying boosting and bagging models: the testing set.

Table 10.8 Comparison of accuracy measures for the training and testing sets.

Table 10.9 Comparison of accuracy measures for the training samples.

Table 10.10 Comparison of accuracy measures for the testing samples.

Table 10.11 Comparison of synthetic measures.

Table 11.1 Average rank difference (CC‐BA) between GAM ensemble and benchmark algorithms based upon De Bock et al. (2010) (*…, **…).

Table 11.2 Summary of the average performance measures over the …‐fold cross‐validation based on Coussement and De Bock (2013). Note: In any row, performance measures that share a common subscript are not significantly different at ….

Table 11.3 Average algorithm rankings and post‐hoc test results (Holm's procedure) based on De Bock and Van den Poel (2011) (*…, **…).

Table 11.4 The 10 most important features with feature importance scores based on AUC and TDL based on De Bock and Van den Poel (2011).

List of Figures

Figure 1.1 Binary classification tree.

Figure 1.2 Evolution of error rates depending on the size of the tree.

Figure 1.3 Evolution of cross‐validation error based on the tree size and the cost‐complexity measure.

Figure 2.1 Probability of error depending on the number of classifiers in the ensemble.

Figure 2.2 Probability of error of the ensemble depending on the individual classifier accuracy.

Figure 4.1 Cross‐validation error versus tree complexity for the iris example.

Figure 4.2 Individual tree for the iris example.

Figure 4.3 Variable relative importance in bagging for the iris example.

Figure 4.4 Variable relative importance in boosting for the iris example.

Figure 4.5 Margins for bagging in the iris example.

Figure 4.6 Margins for boosting in the iris example.

Figure 4.7 Error evolution in bagging for the iris example.

Figure 4.8 Error evolution in boosting for the iris example.

Figure 4.9 Overfitted classification tree.

Figure 4.10 Cross‐validation error versus tree complexity.

Figure 4.11 Pruned tree.

Figure 4.12 Variable relative importance in bagging.

Figure 4.13 Error evolution in bagging.

Figure 4.14 Variable relative importance in boosting.

Figure 4.15 Error evolution in boosting.

Figure 4.16 Variable relative importance in random forest.

Figure 4.17 OOB error evolution in random forest.

Figure 5.1 Variable relative importance in AdaBoost.

Figure 5.2 Margin cumulative distribution in AdaBoost.

Figure 5.3 Structure of the pruned tree.

Figure 5.4 Variable relative importance in AdaBoost.M1 (three classes).

Figure 5.5 Evolution of the test error in AdaBoost.M1 (three classes).

Figure 6.1 Histological images of fish species Merluccius merluccius, with cell outlines manually annotated by experts. The continuous (respectively dashed) lines are cells with (resp. without) nucleus. The images contain cells in the different states of development (hydrated, cortical alveoli, vitellogenic, and atretic).

Figure 6.2 Average accuracy (in %, left panel) and Friedman rank (right panel) over all the feature vectors for the detection of the nucleus (upper panels) and stage classification (lower panels) of fish ovary cells.

Figure 6.3 Maximum accuracies (in %), in decreasing order, achieved by the different classifiers for the detection of the nucleus (left panel) and stage classification (right panel) of fish ovary cells.

Figure 6.4 Average accuracy (left panel, in %) and Friedman ranking (right panel, decreasing with performance) of each classifier.

Figure 6.5 Accuracies achieved by AdaBoost.M1, SVM, and random forest for each data set.

Figure 6.6 Times achieved by the faster classifiers (DKP, SVM, and ELM) for each data set, ordered by increasing size of data set.

Figure 6.7 Friedman rank (upper panel, increasing order) and average accuracies (lower panel, decreasing order) for the 25 best classifiers.

Figure 6.8 Friedman rank interval for the classifiers of each family (upper panel) and minimum rank (by ascending order) for each family (lower panel).

Figure 6.9 Examples of histological images of fish species Reinhardtius hippoglossoides, including oocytes with the six different development stages (PG, CA, VIT1, VIT2, VIT3, and VIT4).

Figure 8.1 A tree structure.

Figure 8.2 Pairs plot of some weather variables used in the tree analysis, with the intention of finding groups of similar features.

Figure 8.3 CART structures for alfalfa decisions.

Figure 8.4 CART structures for barley irrigation decisions.

Figure 8.5 CART structures for corn irrigation decisions.

Figure 10.1 Ranking of predictor importance for the boosting model evaluated for sample S1A.

Figure 10.2 Ranking of predictor importance for the bagging model evaluated for sample S1A.

Figure 11.1 Bootstrap confidence intervals and average trends for a selection of predictive features (from De Bock and Van den Poel (2011)).

Preface

This book introduces the reader to ensemble classifier methods by describing the most commonly used techniques. The goal is neither a complete analysis of all the techniques and their applications, nor an exhaustive tour through all the subjects and aspects that come up in this continuously expanding field. Rather, the aim is to show in an intuitive way how ensemble classification has arisen as an extension of the individual classifiers, and to describe their basic characteristics and the kinds of problems that can emerge from their use. The book is therefore intended for everyone interested in entering these fields, especially students, teachers, researchers, and practitioners dealing with statistical classification.

To achieve these goals, the work has been structured in two sections containing a total of 11 chapters. The first section, which is more theoretical, contains the four initial chapters, including the introduction. The second section, from the fifth chapter to the end, has a much more practical nature, illustrating, with examples from business failure prediction, zoology, and ecology, among other fields, how the previously studied techniques are applied.

After a brief introduction establishing the fundamental concepts of statistical classification through decision trees, the second chapter decomposes the generalization error into three terms (the Bayes risk, the bias, and the variance). Moreover, the instability of classifiers is studied, examining the changes a classifier suffers when it faces small changes in the training set. The three reasons proposed by Dietterich to explain the superiority of ensemble classifiers over single ones are given (statistical, computational, and representational). Finally, the Bayesian perspective is mentioned.

In the third chapter, several taxonomies of ensemble methods are enumerated, focusing on the one that distinguishes between generative and non‐generative methods. After that, the bagging method is studied. It uses several bootstrap samples of the original set to train a set of base classifiers that are afterwards combined by majority vote. In addition, the boosting method is analysed, highlighting its most commonly used algorithm, AdaBoost. This algorithm repeatedly applies the classification system to the training set, focusing in each iteration on the most difficult examples, and later combines the built classifiers through a weighted majority vote. To end this chapter, the random forest method is briefly described. It generates a set of trees, introducing a degree of randomness into their building process to ensure some diversity in the final ensemble.

The last chapter of the first part shows, with simple applications, how individual and ensemble classification trees are applied in practice using the rpart, adabag, and randomForest R packages. Moreover, the improvement achieved over individual classification techniques is highlighted.

The second part begins with a chapter dealing with business failure prediction. Specifically, it compares the prediction accuracy of ensemble trees and neural networks for a set of European firms, considering the usual predictor variables, such as financial ratios, as well as qualitative variables, such as firm size, activity, and legal structure. It shows that the ensemble trees decrease the generalization error by about 30% with respect to the error produced by a neural network.

The sixth chapter describes the experience of M. Fernández‐Delgado, E. Cernadas, and M. Pérez‐Ortiz using ensemble methods to classify texture feature patterns in histological images of fish gonad cells. The results were also good, compared with ordinal classifiers, for stages of fish oocytes, whose development follows a natural time ordering.

In the seventh chapter W. Rejchel considers the ranking problem, which is popular in the machine learning community. The goal is to predict or guess the ordering between objects on the basis of their observed features. This work focuses on ranking estimators obtained by minimization of an empirical risk with a convex loss function, for instance boosting algorithms. Generalization bounds for the excess risk of ranking estimators that decrease faster than a threshold are constructed. In addition, the quality of the procedures is investigated on simulated data sets.

In the eighth chapter S. Andriyas and M. McKee implement ensemble classification trees to analyze farmers' irrigation decisions and consequently forecast future decisions. Readily available data on biophysical conditions in fields and the irrigation delivery system during the growing season can be utilized to anticipate irrigation water orders in the absence of any predictive socio‐economic information that could be used to provide clues for future irrigation decisions. This can subsequently be useful in making short‐term demand forecasts.

The ninth chapter, by M. Kubus, focuses on two properties of a boosted set of rules: stability and robustness to irrelevant variables, which can deteriorate the predictive ability of the model. He also compares the generalization errors of SLIPPER and AdaBoost.M1 in computational experiments using benchmark data and artificially generated irrelevant variables from various distributions.

The tenth chapter shows how M. Chrzanowska, E. Alfaro, and D. Witkowska apply individual and ensemble trees to credit scoring, a crucial problem for a bank since it has a critical influence on its financial outcome. Therefore, to assess credit risk (or a client's creditworthiness), various statistical tools may be used, including classification methods.

The aim of the last chapter, by K. W. De Bock, K. Coussement, and D. Cielen, is to provide an introduction to GAMs and GAM‐based ensembles, an overview of experiments conducted to evaluate and benchmark their performance, and insights into these novel algorithms using real‐life data sets from various application domains.

Thanks are due to all our collaborators and colleagues, especially the Economics and Business Faculty of Albacete, the Public Economy, Statistics and Economic Policy Department, the Quantitative Methods and Socio‐economic Development Group (MECYDES) at the Institute for Regional Development (IDR), and the University of Castilla‐La Mancha (UCLM). At Wiley, we would like to thank Alison Oliver and Jemima Kingsly for their help, and two anonymous reviewers for their comments.

Finally, we thank our families for their understanding and help in every moment: Nieves, Emilio, Francisco, María, Esteban, and Pilar; Matías, Clara, David, and Enrique.

1 Introduction

Esteban Alfaro, Matías Gámez, and Noelia García

1.1 Introduction

Classification as a statistical task is present in a wide range of real‐life contexts, as diverse as the mechanical sorting of letters based on automatic reading of postal codes, decisions regarding credit applications from individuals, or the preliminary diagnosis of a patient's condition to enable immediate treatment while waiting for the final results of tests.

In its most general form, the term classification can cover any context in which a decision is taken or a prediction is made on the basis of the information available at that time; a classification procedure is then a formal method for repeating, in new situations, the arguments that led to that decision.

This work focuses on a more specific interpretation. The problem is to build a procedure that will be applied to a set of cases in which each new case has to be assigned to one of a set of predefined classes or subpopulations on the basis of observed characteristics or attributes.

The construction of a classification system from a set of data for which the actual classes are known has been given different names, such as pattern recognition, discriminant analysis, or supervised learning. The latter name distinguishes it from unsupervised learning, or clustering, in which classes are not defined a priori but are inferred from the data. This work focuses on the first type of classification task.

1.2 Definition

The most traditional statistical technique applied to supervised classification is linear discriminant analysis, but in recent decades a wider set of new methods has been developed, partly due to improvements in computing capabilities. Generally, the performance of a classification procedure is analysed through its accuracy, that is, the percentage of correctly classified cases. The existence of a correct classification implies the existence of an expert or supervisor capable of providing it, so why would we want to replace this exact system by an approximation? Among the reasons for this replacement we could mention:

Speed. Automatic procedures are usually quick and they can help to save time. For instance, automatic readers of postal codes are able to read most letters, leaving only some very complex cases to human experts.

Objectivity. Important decisions have to be taken based on objective criteria, under the same conditions for all cases. Objectivity is sometimes difficult to ensure in the case of human decision makers, whose decisions can be affected by external factors, leading to biased decisions.

Explanatory capabilities. Some classification methods allow us not only to classify observations but also to explain the reasons for the decision in terms of a set of statistical features.

Economy. Having an expert who makes decisions can be much more expensive than developing an effective classification system from accumulated experience, so that it can be applied by anyone, not necessarily an expert on the subject, following the guidelines given by the classifier.

1.3 Taxonomy of Supervised Classification Methods

There is no single taxonomy of classification methods; rather, a variety can be found depending on the criterion of division, for example between parametric and non‐parametric methods, or between methods that attempt to estimate probability densities, posterior probabilities, or just decision boundaries. If we consider the first criterion, classification methods can be divided into:

Parametric methods. These methods are based on the assumption that the shape of the underlying density functions is known, generally the normal distribution. The problem then reduces to parameter estimation, which is performed either by maximizing the likelihood or through Bayesian methods. Such methods include Fisher linear discriminant analysis, multiple discriminant analysis, quadratic discriminant analysis, and the expectation–maximization algorithm, among others.

Non‐parametric methods. These methods do not require any hypothesis about the underlying density functions, so they are appropriate when the data probability distribution is unknown. They include Parzen window estimation, K‐nearest neighbours, classification trees, and artificial neural networks.

On the other hand, Lippmann (1991) recognizes five basic types of classifiers:

Probabilistic methods. These are functional and parametric techniques, and are therefore indicated when the functional form fits the actual distribution of the data well and there is a sufficient number of examples to estimate the parameters. Examples include Gaussian or linear discriminant classifiers based on mixtures of normal distributions.

Global methods. These methods build the discriminant function from internal nodes using sigmoid or polynomial functions that have high non‐zero responses over a large part of the input space. They include the multilayer perceptron, Boltzmann machines, and high‐order polynomial networks.

Local methods. Unlike the previous methods, these techniques build the discriminant function using nodes that have non‐zero responses only on localized regions of the input space. Examples of such methods are radial basis function networks and the kernel discriminant. The advantage of these methods is that they do not require assumptions about the underlying distributions.

Nearest neighbour methods. These are based on the distance between a new element and the set of stored elements. Among the best‐known techniques are learning vector quantization (LVQ) and K‐nearest neighbours. These are non‐parametric methods, but they require a lot of computing time.

Rule‐based methods. These methods divide the input space into labelled regions through rules or logical thresholds. These techniques include classification trees.

The first three types of methods provide continuous outputs that can estimate either likelihoods or Bayes posterior probabilities, while the last two provide binary outputs. Because of this difference, the former follow a strategy of minimizing a cost function such as the sum of squared errors or the entropy, while the latter aim to minimize the number of misclassified items.

In this work we will focus on the last type of classifier, which will be used as the base classifier in the ensembles. Therefore, the accuracy (error) of the classification system will be measured by the percentage of successes (failures) over the classified elements.

1.4 Estimation of the Accuracy of a Classification System

In the development of a classification system three stages can be distinguished: selection, training, and validation. In the first stage, both the technique and the set of potential features must be selected. Once the first stage has been completed, the learning process starts with a set of training examples. To check the performance of the trained classifier, that is to say, its ability to classify new observations correctly, its accuracy has to be estimated.

Once the classifier has been validated, the system will be ready to be used. Otherwise, it will be necessary to return to the selection or training stages, for example modifying the number and type of attributes, the number of rules and/or conjunctions, etc., or even looking for another, more appropriate classification method.

To measure the goodness of fit of a classifier, the error rate can be used. The true error rate of a classifier is defined as its error percentage when the system is tested on the distribution of cases in the whole population. It can be empirically approximated using a test set consisting of a large number of new cases collected independently of the examples used to train the classifier. The error rate is defined as the ratio between the number of mistakes and the number of classified cases:

error rate = number of errors / number of classified cases (1.1)

For the sake of simplicity, all errors are assumed to have the same importance, although this might not be true in a real case.
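
To make equation (1.1) concrete, here is a minimal R sketch (a hypothetical toy example, not taken from the book) computing the error rate from vectors of predicted and actual class labels:

    # Hypothetical predicted and actual classes for ten test cases
    predicted <- c("A", "A", "B", "B", "A", "B", "A", "A", "B", "B")
    actual    <- c("A", "B", "B", "B", "A", "A", "A", "A", "B", "B")
    # Equation (1.1): number of errors divided by number of classified cases
    error_rate <- mean(predicted != actual)
    error_rate  # 0.2, i.e. 2 mistakes out of 10 cases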

The true error rate could be computed exactly if the number of examples tended to infinity. In a real case, however, the number of available examples is always finite and often relatively small. Therefore, the true error rate has to be estimated from the error rate calculated on a small sample or using statistical sampling techniques (random resampling, bootstrap, etc.). The estimate will usually be biased, and this bias has to be analysed in order to detect non‐random errors. Its variance is important too, since the greatest possible stability is sought.

1.4.1 The Apparent Error Rate

The apparent error rate of a classifier is the error rate calculated on the examples of the training set. If the training set were unlimited, the apparent error rate would coincide with the true error rate but, as already noted, this does not happen in the real world and, in general, samples of limited size have to be used to build and evaluate a classifier.

Overall, the apparent error rate is biased downwards, so it underestimates the true error rate (Efron, 1986). This usually happens when the classifier has been overfitted to the particular characteristics of the sample instead of discovering the underlying structure of the population. This problem results in classifiers with a very low apparent error rate, even zero on the training set, but with poor generalization ability, that is, bad performance when facing new observations.
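
This effect can be reproduced with a short R sketch, assuming the rpart package and the built‐in iris data (both used later in this book); the split and the cp and minsplit values are illustrative choices, not settings from the text:

    library(rpart)
    set.seed(1)
    idx   <- sample(nrow(iris), 100)          # illustrative training/test split
    train <- iris[idx, ]
    test  <- iris[-idx, ]
    # Deliberately overgrown tree that adapts to every training example
    fit <- rpart(Species ~ ., data = train,
                 control = rpart.control(cp = 0, minsplit = 2))
    apparent <- mean(predict(fit, train, type = "class") != train$Species)
    test_err <- mean(predict(fit, test,  type = "class") != test$Species)
    # apparent is (close to) zero, while test_err is typically clearly higher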

1.4.2 Estimation of the True Error Rate

Since the apparent error rate is usually misleading, other estimators of the true error rate are necessary. Depending on the number of examples in the sample, there are two alternative approaches: the use of a test set, or resampling techniques. The first is the simplest way to estimate the true error, but it can only be used when the data set is big enough. It randomly divides the data set into two subsets, one used to train or build the classifier and the other to test its accuracy.

For small or moderately sized samples, however, resampling techniques are recommended (random subsampling, cross‐validation, leaving‐one‐out, or bootstrapping). These methods repeat the random division into training and test sets many times. All of them are used to estimate the accuracy of the classifier, but the final classifier is built using the whole available data set.

1.4.3 Error Rate Estimation Methods

1.4.3.1 Estimation from a Test Set

This method, referred to as “hold‐out”, involves dividing the original data set into two groups: the training set and the test set. The classifier is built using only the training set, leaving out the examples of the test set. Once the classifier has been built, the test set is used to evaluate its performance on new observations. The error rate of the classifier on the examples of the test set is called the test error rate.

The assignment of each observation to one of the two sets must be done randomly. Typically, the larger share of the data is devoted to training and the rest to testing although, in large sets, once the number of test examples exceeds 1000, an even higher percentage can be used as training examples.

For large samples, the use of a test set achieves good results. For moderate‐size sets, however, this method may leave one or both sets with an insufficient number of observations. To solve this problem and avoid the pessimistic bias of the test‐set estimate, resampling techniques can be applied, repeating the partition of the data set into training and test sets in an iterative way.
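
A minimal hold‐out sketch in R, again assuming rpart and the iris data; the two‐thirds training share is an illustrative choice:

    library(rpart)
    set.seed(123)
    n    <- nrow(iris)
    idx  <- sample(n, size = round(2/3 * n))   # random assignment to the training set
    fit  <- rpart(Species ~ ., data = iris[idx, ])
    pred <- predict(fit, newdata = iris[-idx, ], type = "class")
    mean(pred != iris$Species[-idx])           # test error rate (hold-out estimate)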

1.4.3.2 Random Subsampling

This method involves repeating the aforementioned partition several times, in such a way that a new classifier is built from the training set generated in each partition. The estimated error rate is computed as the average of the errors of the classifiers calculated on the independently and randomly generated test sets. Thanks to this averaging process, random subsampling avoids having to rely on a single partition that may not be representative.
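
A sketch of random subsampling under the same assumptions (rpart and iris); the number of partitions, B = 25, and the two‐thirds split are illustrative:

    library(rpart)
    set.seed(123)
    n <- nrow(iris)
    B <- 25                                    # number of random partitions
    errors <- replicate(B, {
      idx <- sample(n, size = round(2/3 * n))  # fresh random partition each time
      fit <- rpart(Species ~ ., data = iris[idx, ])
      mean(predict(fit, iris[-idx, ], type = "class") != iris$Species[-idx])
    })
    mean(errors)                               # average over the B test sets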

1.4.3.3 Cross‐Validation

This method can actually be considered a special case of random subsampling, wherein the original data set is randomly divided into k mutually exclusive subsets of approximately the same size. The method is usually known as k‐fold cross‐validation, where k refers to the number of folds or data subsets. Each subset is used in turn as the test set for assessing the performance of the classifier built taking the remaining subsets together as the training set. Therefore, at the end there are k classifiers with their respective test error rates. The error rate estimated by cross‐validation is the average of the k error rates, weighted by the sizes of the subsets if they are of different sizes.

This estimator is optimistically biased with respect to the true error rate. According to the empirical results in Kohavi (1995), if k is lower than five the estimate is too biased, when k is close to 10 the bias is acceptable and, finally, if k is greater than 20 the estimate is almost unbiased. The author also found that stratified cross‐validation generally has lower bias. Stratification consists of building partitions in such a way that the proportion of each class in the total set is maintained, especially in terms of the hierarchy of classes, since it could happen that the majority class in the original set is relegated to second place in one of the subsets, which would damage the estimation of the true error rate.

The major advantage of cross‐validation over other random subsampling methods is that every available example is used both in the training process and in the accuracy evaluation. The most usual value for k is 10. This size ensures that the reduction of the training set is not too severe in comparison with the training/test proportion usually set in random subsampling. The main drawback is the computational cost of the k repetitions of the training process. When the data set is not big enough, a particular version of cross‐validation is recommended: leaving‐one‐out.
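
A sketch of k‐fold cross‐validation under the same assumptions (rpart and iris), with k = 10 and simple non‐stratified folds:

    library(rpart)
    set.seed(123)
    n <- nrow(iris)
    k <- 10
    folds <- sample(rep(1:k, length.out = n))  # random fold label for each case
    cv_errors <- sapply(1:k, function(f) {
      fit  <- rpart(Species ~ ., data = iris[folds != f, ])   # train on k-1 folds
      pred <- predict(fit, iris[folds == f, ], type = "class")
      mean(pred != iris$Species[folds == f])                  # test on the held-out fold
    })
    mean(cv_errors)                            # k-fold cross-validation estimate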

1.4.3.4 Leaving‐One‐Out