Discovering Knowledge in Data - Daniel T. Larose - E-Book

Description

The field of data mining lies at the confluence of predictive analytics, statistical analysis, and business intelligence. Due to the ever-increasing complexity and size of data sets and the wide range of applications in computer science, business, and health care, the process of discovering knowledge in data is more relevant than ever before. This book provides the tools needed to thrive in today's big data world. The author demonstrates how to leverage a company's existing databases to increase profits and market share, and carefully explains the most current data science methods and techniques. The reader will "learn data mining by doing data mining". By adding chapters on data modelling preparation, imputation of missing data, and multivariate statistical analysis, Discovering Knowledge in Data, Second Edition remains the eminent reference on data mining.

* The second edition of a highly praised, successful reference on data mining, with thorough coverage of big data applications, predictive analytics, and statistical analysis.
* Includes new chapters on Multivariate Statistics, Preparing to Model the Data, and Imputation of Missing Data, and an Appendix on Data Summarization and Visualization
* Offers extensive coverage of the R statistical programming language
* Contains 280 end-of-chapter exercises
* Includes a companion website for university instructors who adopt the book

Page count: 492

Year of publication: 2014

WILEY SERIES ON METHODS AND APPLICATIONS IN DATA MINING

Series Editor: Daniel T. Larose

Discovering Knowledge in Data: An Introduction to Data Mining, Second Edition • Daniel T. Larose and Chantal D. Larose

Data Mining for Genomics and Proteomics: Analysis of Gene and Protein Expression Data • Darius M. Dziuda

Knowledge Discovery with Support Vector Machines • Lutz Hamel

Data-Mining on the Web: Uncovering Patterns in Web Content, Structure, and Usage • Zdravko Markov and Daniel Larose

Data Mining Methods and Models • Daniel Larose

Practical Text Mining with Perl • Roger Bilisoly

SECOND EDITION

DISCOVERING KNOWLEDGE IN DATA

An Introduction to Data Mining

DANIEL T. LAROSE

CHANTAL D. LAROSE

Copyright © 2014 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our website at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Larose, Daniel T.
Discovering knowledge in data : an introduction to data mining / Daniel T. Larose and Chantal D. Larose. – Second edition.
pages cm
Includes index.
ISBN 978-0-470-90874-7 (hardback)
1. Data mining. I. Larose, Chantal D. II. Title.
QA76.9.D343L38 2014
006.3'12–dc23
2013046021

CONTENTS

Preface

What is Data Mining?

Why is This Book Needed?

What's New for the Second Edition?

Danger! Data Mining is Easy to Do Badly

“White Box” Approach: Understanding the Underlying Algorithmic and Model Structures

Data Mining as a Process

Graphical Approach, Emphasizing Exploratory Data Analysis

How The Book is Structured

Acknowledgments

Chapter 1: An Introduction to Data Mining

1.1 What is Data Mining?

1.2 Wanted: Data Miners

1.3 The Need for Human Direction of Data Mining

1.4 The Cross-Industry Standard Practice for Data Mining

1.5 Fallacies of Data Mining

1.6 What Tasks Can Data Mining Accomplish?

References

Exercises

Note

Chapter 2: Data Preprocessing

2.1 Why do We Need to Preprocess the Data?

2.2 Data Cleaning

2.3 Handling Missing Data

2.4 Identifying Misclassifications

2.5 Graphical Methods for Identifying Outliers

2.6 Measures of Center and Spread

2.7 Data Transformation

2.8 Min-Max Normalization

2.9 Z-Score Standardization

2.10 Decimal Scaling

2.11 Transformations to Achieve Normality

2.12 Numerical Methods for Identifying Outliers

2.13 Flag Variables

2.14 Transforming Categorical Variables into Numerical Variables

2.15 Binning Numerical Variables

2.16 Reclassifying Categorical Variables

2.17 Adding an Index Field

2.18 Removing Variables that are Not Useful

2.19 Variables that Should Probably Not Be Removed

2.20 Removal of Duplicate Records

2.21 A Word About ID Fields

References

Exercises

Hands-On Analysis

Notes

Chapter 3: Exploratory Data Analysis

3.1 Hypothesis Testing Versus Exploratory Data Analysis

3.2 Getting to Know the Data Set

3.3 Exploring Categorical Variables

3.4 Exploring Numeric Variables

3.5 Exploring Multivariate Relationships

3.6 Selecting Interesting Subsets of the Data for Further Investigation

3.7 Using EDA to Uncover Anomalous Fields

3.8 Binning Based on Predictive Value

3.9 Deriving New Variables: Flag Variables

3.10 Deriving New Variables: Numerical Variables

3.11 Using EDA to Investigate Correlated Predictor Variables

3.12 Summary

Reference

Exercises

Hands-On Analysis

Note

Chapter 4: Univariate Statistical Analysis

4.1 Data Mining Tasks in Discovering Knowledge in Data

4.2 Statistical Approaches to Estimation and Prediction

4.3 Statistical Inference

4.4 How Confident are We in Our Estimates?

4.5 Confidence Interval Estimation of the Mean

4.6 How to Reduce the Margin of Error

4.7 Confidence Interval Estimation of the Proportion

4.8 Hypothesis Testing for the Mean

4.9 Assessing the Strength of Evidence Against the Null Hypothesis

4.10 Using Confidence Intervals to Perform Hypothesis Tests

4.11 Hypothesis Testing for the Proportion

Reference

Exercises

Chapter 5: Multivariate Statistics

5.1 Two-Sample t-Test for Difference in Means

5.2 Two-Sample Z-Test for Difference in Proportions

5.3 Test for Homogeneity of Proportions

5.4 Chi-Square Test for Goodness of Fit of Multinomial Data

5.5 Analysis of Variance

5.6 Regression Analysis

5.7 Hypothesis Testing in Regression

5.8 Measuring the Quality of a Regression Model

5.9 Dangers of Extrapolation

5.10 Confidence Intervals for the Mean Value of y Given x

5.11 Prediction Intervals for a Randomly Chosen Value of y Given x

5.12 Multiple Regression

5.13 Verifying Model Assumptions

Reference

Exercises

Hands-On Analysis

Note

Chapter 6: Preparing to Model the Data

6.1 Supervised Versus Unsupervised Methods

6.2 Statistical Methodology and Data Mining Methodology

6.3 Cross-Validation

6.4 Overfitting

6.5 Bias–Variance Trade-Off

6.6 Balancing the Training Data Set

6.7 Establishing Baseline Performance

Reference

Exercises

Chapter 7: k-Nearest Neighbor Algorithm

7.1 Classification Task

7.2 k-Nearest Neighbor Algorithm

7.3 Distance Function

7.4 Combination Function

7.5 Quantifying Attribute Relevance: Stretching the Axes

7.6 Database Considerations

7.7 k-Nearest Neighbor Algorithm for Estimation and Prediction

7.8 Choosing k

7.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler

Exercises

Hands-On Analysis

Chapter 8: Decision Trees

8.1 What is a Decision Tree?

8.2 Requirements for Using Decision Trees

8.3 Classification and Regression Trees

8.4 C4.5 Algorithm

8.5 Decision Rules

8.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data

References

Exercises

Hands-On Analysis

Chapter 9: Neural Networks

9.1 Input and Output Encoding

9.2 Neural Networks for Estimation and Prediction

9.3 Simple Example of a Neural Network

9.4 Sigmoid Activation Function

9.5 Back-Propagation

9.6 Termination Criteria

9.7 Learning Rate

9.8 Momentum Term

9.9 Sensitivity Analysis

9.10 Application of Neural Network Modeling

References

Exercises

Hands-On Analysis

Chapter 10: Hierarchical and k-Means Clustering

10.1 The Clustering Task

10.2 Hierarchical Clustering Methods

10.3 Single-Linkage Clustering

10.4 Complete-Linkage Clustering

10.5 k-Means Clustering

10.6 Example of k-Means Clustering at Work

10.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds

10.8 Application of k-Means Clustering Using SAS Enterprise Miner

10.9 Using Cluster Membership to Predict Churn

References

Exercises

Hands-On Analysis

Note

Chapter 11: Kohonen Networks

11.1 Self-Organizing Maps

11.2 Kohonen Networks

11.3 Example of a Kohonen Network Study

11.4 Cluster Validity

11.5 Application of Clustering Using Kohonen Networks

11.6 Interpreting the Clusters

11.7 Using Cluster Membership as Input to Downstream Data Mining Models

References

Exercises

Hands-On Analysis

Chapter 12: Association Rules

12.1 Affinity Analysis and Market Basket Analysis

12.2 Support, Confidence, Frequent Itemsets, and the a Priori Property

12.3 How Does the a Priori Algorithm Work?

12.4 Extension from Flag Data to General Categorical Data

12.5 Information-Theoretic Approach: Generalized Rule Induction Method

12.6 Association Rules are Easy to Do Badly

12.7 How Can We Measure the Usefulness of Association Rules?

12.8 Do Association Rules Represent Supervised or Unsupervised Learning?

12.9 Local Patterns Versus Global Models

References

Exercises

Hands-On Analysis

Chapter 13: Imputation of Missing Data

13.1 Need for Imputation of Missing Data

13.2 Imputation of Missing Data: Continuous Variables

13.3 Standard Error of the Imputation

13.4 Imputation of Missing Data: Categorical Variables

13.5 Handling Patterns in Missingness

Reference

Exercises

Hands-On Analysis

Notes

Chapter 14: Model Evaluation Techniques

14.1 Model Evaluation Techniques for the Description Task

14.2 Model Evaluation Techniques for the Estimation and Prediction Tasks

14.3 Model Evaluation Techniques for the Classification Task

14.4 Error Rate, False Positives, and False Negatives

14.5 Sensitivity and Specificity

14.6 Misclassification Cost Adjustment to Reflect Real-World Concerns

14.7 Decision Cost/Benefit Analysis

14.8 Lift Charts and Gains Charts

14.9 Interweaving Model Evaluation with Model Building

14.10 Confluence of Results: Applying a Suite of Models

Reference

Exercises

Hands-On Analysis

Notes

Appendix: Data Summarization and Visualization

Part 1 Summarization 1: Building Blocks of Data Analysis

Part 2 Visualization: Graphs and Tables for Summarizing and Organizing Data

Part 3 Summarization 2: Measures of Center, Variability, and Position

Part 4 Summarization and Visualization of Bivariate Relationships

Index

End User License Agreement
