The field of data mining lies at the confluence of predictive analytics, statistical analysis, and business intelligence. Due to the ever-increasing complexity and size of data sets and the wide range of applications in computer science, business, and health care, the process of discovering knowledge in data is more relevant than ever before. This book provides the tools needed to thrive in today's big data world. The author demonstrates how to leverage a company's existing databases to increase profits and market share, and carefully explains the most current data science methods and techniques. The reader will "learn data mining by doing data mining". By adding chapters on data modelling preparation, imputation of missing data, and multivariate statistical analysis, Discovering Knowledge in Data, Second Edition remains the eminent reference on data mining.
* The second edition of a highly praised, successful reference on data mining, with thorough coverage of big data applications, predictive analytics, and statistical analysis.
* Includes new chapters on Multivariate Statistics, Preparing to Model the Data, and Imputation of Missing Data, and an Appendix on Data Summarization and Visualization.
* Offers extensive coverage of the R statistical programming language.
* Contains 280 end-of-chapter exercises.
* Includes a companion website for university instructors who adopt the book.
Page count: 492
Year of publication: 2014
Series Editor: Daniel T. Larose
Discovering Knowledge in Data: An Introduction to Data Mining, Second Edition • Daniel T. Larose and Chantal D. Larose
Data Mining for Genomics and Proteomics: Analysis of Gene and Protein Expression Data • Darius M. Dziuda
Knowledge Discovery with Support Vector Machines • Lutz Hamel
Data-Mining on the Web: Uncovering Patterns in Web Content, Structure, and Usage • Zdravko Markov and Daniel Larose
Data Mining Methods and Models • Daniel Larose
Practical Text Mining with Perl • Roger Bilisoly
SECOND EDITION
DANIEL T. LAROSE
CHANTAL D. LAROSE
Copyright © 2014 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our website at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Larose, Daniel T.
Discovering knowledge in data : an introduction to data mining / Daniel T. Larose and Chantal D. Larose. – Second edition.
pages cm
Includes index.
ISBN 978-0-470-90874-7 (hardback)
1. Data mining. I. Larose, Chantal D. II. Title.
QA76.9.D343L38 2014
006.3'12–dc23
2013046021
Preface
What is Data Mining?
Why is This Book Needed?
What's New for the Second Edition?
Danger! Data Mining is Easy to Do Badly
“White Box” Approach: Understanding the Underlying Algorithmic and Model Structures
Data Mining as a Process
Graphical Approach, Emphasizing Exploratory Data Analysis
How The Book is Structured
Acknowledgments
Chapter 1: An Introduction to Data Mining
1.1 What is Data Mining?
1.2 Wanted: Data Miners
1.3 The Need for Human Direction of Data Mining
1.4 The Cross-Industry Standard Practice for Data Mining
1.5 Fallacies of Data Mining
1.6 What Tasks Can Data Mining Accomplish?
References
Exercises
Note
Chapter 2: Data Preprocessing
2.1 Why do We Need to Preprocess the Data?
2.2 Data Cleaning
2.3 Handling Missing Data
2.4 Identifying Misclassifications
2.5 Graphical Methods for Identifying Outliers
2.6 Measures of Center and Spread
2.7 Data Transformation
2.8 Min-Max Normalization
2.9 Z-Score Standardization
2.10 Decimal Scaling
2.11 Transformations to Achieve Normality
2.12 Numerical Methods for Identifying Outliers
2.13 Flag Variables
2.14 Transforming Categorical Variables into Numerical Variables
2.15 Binning Numerical Variables
2.16 Reclassifying Categorical Variables
2.17 Adding an Index Field
2.18 Removing Variables that are Not Useful
2.19 Variables that Should Probably Not Be Removed
2.20 Removal of Duplicate Records
2.21 A Word About ID Fields
References
Exercises
Hands-On Analysis
Notes
Chapter 3: Exploratory Data Analysis
3.1 Hypothesis Testing Versus Exploratory Data Analysis
3.2 Getting to Know the Data Set
3.3 Exploring Categorical Variables
3.4 Exploring Numeric Variables
3.5 Exploring Multivariate Relationships
3.6 Selecting Interesting Subsets of the Data for Further Investigation
3.7 Using EDA to Uncover Anomalous Fields
3.8 Binning Based on Predictive Value
3.9 Deriving New Variables: Flag Variables
3.10 Deriving New Variables: Numerical Variables
3.11 Using EDA to Investigate Correlated Predictor Variables
3.12 Summary
Reference
Exercises
Hands-On Analysis
Note
Chapter 4: Univariate Statistical Analysis
4.1 Data Mining Tasks in Discovering Knowledge in Data
4.2 Statistical Approaches to Estimation and Prediction
4.3 Statistical Inference
4.4 How Confident are We in Our Estimates?
4.5 Confidence Interval Estimation of the Mean
4.6 How to Reduce the Margin of Error
4.7 Confidence Interval Estimation of the Proportion
4.8 Hypothesis Testing for the Mean
4.9 Assessing the Strength of Evidence Against the Null Hypothesis
4.10 Using Confidence Intervals to Perform Hypothesis Tests
4.11 Hypothesis Testing for the Proportion
Reference
Exercises
Chapter 5: Multivariate Statistics
5.1 Two-Sample t-Test for Difference in Means
5.2 Two-Sample Z-Test for Difference in Proportions
5.3 Test for Homogeneity of Proportions
5.4 Chi-Square Test for Goodness of Fit of Multinomial Data
5.5 Analysis of Variance
5.6 Regression Analysis
5.7 Hypothesis Testing in Regression
5.8 Measuring the Quality of a Regression Model
5.9 Dangers of Extrapolation
5.10 Confidence Intervals for the Mean Value of y Given x
5.11 Prediction Intervals for a Randomly Chosen Value of y Given x
5.12 Multiple Regression
5.13 Verifying Model Assumptions
Reference
Exercises
Hands-On Analysis
Note
Chapter 6: Preparing to Model the Data
6.1 Supervised Versus Unsupervised Methods
6.2 Statistical Methodology and Data Mining Methodology
6.3 Cross-Validation
6.4 Overfitting
6.5 Bias–Variance Trade-Off
6.6 Balancing the Training Data Set
6.7 Establishing Baseline Performance
Reference
Exercises
Chapter 7: k-Nearest Neighbor Algorithm
7.1 Classification Task
7.2 k-Nearest Neighbor Algorithm
7.3 Distance Function
7.4 Combination Function
7.5 Quantifying Attribute Relevance: Stretching the Axes
7.6 Database Considerations
7.7 k-Nearest Neighbor Algorithm for Estimation and Prediction
7.8 Choosing k
7.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
Exercises
Hands-On Analysis
Chapter 8: Decision Trees
8.1 What is a Decision Tree?
8.2 Requirements for Using Decision Trees
8.3 Classification and Regression Trees
8.4 C4.5 Algorithm
8.5 Decision Rules
8.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data
References
Exercises
Hands-On Analysis
Chapter 9: Neural Networks
9.1 Input and Output Encoding
9.2 Neural Networks for Estimation and Prediction
9.3 Simple Example of a Neural Network
9.4 Sigmoid Activation Function
9.5 Back-Propagation
9.6 Termination Criteria
9.7 Learning Rate
9.8 Momentum Term
9.9 Sensitivity Analysis
9.10 Application of Neural Network Modeling
References
Exercises
Hands-On Analysis
Chapter 10: Hierarchical and k-Means Clustering
10.1 The Clustering Task
10.2 Hierarchical Clustering Methods
10.3 Single-Linkage Clustering
10.4 Complete-Linkage Clustering
10.5 k-Means Clustering
10.6 Example of k-Means Clustering at Work
10.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
10.8 Application of k-Means Clustering Using SAS Enterprise Miner
10.9 Using Cluster Membership to Predict Churn
References
Exercises
Hands-On Analysis
Note
Chapter 11: Kohonen Networks
11.1 Self-Organizing Maps
11.2 Kohonen Networks
11.3 Example of a Kohonen Network Study
11.4 Cluster Validity
11.5 Application of Clustering Using Kohonen Networks
11.6 Interpreting the Clusters
11.7 Using Cluster Membership as Input to Downstream Data Mining Models
References
Exercises
Hands-On Analysis
Chapter 12: Association Rules
12.1 Affinity Analysis and Market Basket Analysis
12.2 Support, Confidence, Frequent Itemsets, and the A Priori Property
12.3 How Does the A Priori Algorithm Work?
12.4 Extension from Flag Data to General Categorical Data
12.5 Information-Theoretic Approach: Generalized Rule Induction Method
12.6 Association Rules are Easy to do Badly
12.7 How can we Measure the Usefulness of Association Rules?
12.8 Do Association Rules Represent Supervised or Unsupervised Learning?
12.9 Local Patterns Versus Global Models
References
Exercises
Hands-On Analysis
Chapter 13: Imputation of Missing Data
13.1 Need for Imputation of Missing Data
13.2 Imputation of Missing Data: Continuous Variables
13.3 Standard Error of the Imputation
13.4 Imputation of Missing Data: Categorical Variables
13.5 Handling Patterns in Missingness
Reference
Exercises
Hands-On Analysis
Notes
Chapter 14: Model Evaluation Techniques
14.1 Model Evaluation Techniques for the Description Task
14.2 Model Evaluation Techniques for the Estimation and Prediction Tasks
14.3 Model Evaluation Techniques for the Classification Task
14.4 Error Rate, False Positives, and False Negatives
14.5 Sensitivity and Specificity
14.6 Misclassification Cost Adjustment to Reflect Real-World Concerns
14.7 Decision Cost/Benefit Analysis
14.8 Lift Charts and Gains Charts
14.9 Interweaving Model Evaluation with Model Building
14.10 Confluence of Results: Applying a Suite of Models
Reference
Exercises
Hands-On Analysis
Notes
Appendix: Data Summarization and Visualization
Part 1 Summarization 1: Building Blocks of Data Analysis
Part 2 Visualization: Graphs and Tables for Summarizing and Organizing Data
Part 3 Summarization 2: Measures of Center, Variability, and Position
Part 4 Summarization and Visualization of Bivariate Relationships
Index
End User License Agreement
Chapter 1
Table 1.1
Table 1.2
Chapter 2
Table 2.1
Table 2.2
Table 2.3
Chapter 3
Table 3.1
Table 3.2
Table 3.3
Table 3.4
Table 3.5
Table 3.6
Table 3.7
Table 3.8
Table 3.9
Chapter 4
Table 4.1
Table 4.2
Table 4.3
Table 4.4
Table 4.5
Table 4.6
Table 4.7
Table 4.8
Chapter 5
Table 5.1
Table 5.2
Table 5.3
Table 5.4
Table 5.5
Table 5.6
Table 5.7
Table 5.8
Table 5.9
Table 5.10
Table 5.11
Table 5.12
Chapter 6
Table 6.1
Chapter 7
Table 7.1
Table 7.2
Table 7.3
Table 7.4
Table 7.5
Chapter 8
Table 8.1
Table 8.2
Table 8.3
Table 8.4
Table 8.5
Table 8.6
Table 8.7
Table 8.8
Table 8.9
Table 8.10
Table 8.11
Chapter 9
Table 9.1
Chapter 10
Table 10.1
Table 10.2
Table 10.3
Table 10.4
Table 10.5
Chapter 11
Table 11.1
Chapter 12
Table 12.1
Table 12.2
Table 12.3
Table 12.4
Table 12.5
Table 12.6
Table 12.7
Table 12.8
Chapter 14
Table 14.1
Table 14.2
Table 14.3
Table 14.4
Table 14.5
Appendix
Table A.1
Table A.2
Table A.3
Table A.4
Table A.5
Continue reading in the full edition!