Learn methods of data analysis and their application to real-world data sets.
This updated second edition serves as an introduction to data mining methods and models, including association rules, clustering, neural networks, logistic regression, and multivariate analysis. The authors apply a unified "white box" approach to data mining methods and models. This approach is designed to walk readers through the operations and nuances of the various methods, using small data sets, so readers can gain an insight into the inner workings of the method under review. Chapters provide readers with hands-on analysis problems, representing an opportunity for readers to apply their newly acquired data mining expertise to solving real problems using large, real-world data sets.
Data Mining and Predictive Analytics:
* Offers comprehensive coverage of association rules, clustering, neural networks, logistic regression, multivariate analysis, and the R statistical programming language
* Features over 750 chapter exercises, allowing readers to assess their understanding of the new material
* Provides a detailed case study that brings together the lessons learned in the book
* Includes access to the companion website, www.dataminingconsultant.com, with exclusive password-protected instructor content
Data Mining and Predictive Analytics will appeal to computer science and statistics students, as well as students in MBA programs and chief executives.
Page count: 1131
Publication year: 2015
Cover
Series
Title Page
Copyright
Dedication
Preface
What is Data Mining? What is Predictive Analytics?
Why is this Book Needed?
Who Will Benefit from this Book?
Danger! Data Mining is Easy to do Badly
“White-Box” Approach
Algorithm Walk-Throughs
Exciting New Topics
The R Zone
Appendix: Data Summarization and Visualization
The Case Study: Bringing it all Together
How the Book is Structured
The Software
Weka: The Open-Source Alternative
The Companion Web Site: www.dataminingconsultant.com
Data Mining and Predictive Analytics as a Textbook
Acknowledgments
Daniel's Acknowledgments
Chantal's Acknowledgments
Part I: Data Preparation
Chapter 1: An Introduction to Data Mining and Predictive Analytics
1.1 What is Data Mining? What Is Predictive Analytics?
1.2 Wanted: Data Miners
1.3 The Need For Human Direction of Data Mining
1.4 The Cross-Industry Standard Process for Data Mining: CRISP-DM
1.5 Fallacies of Data Mining
1.6 What Tasks Can Data Mining Accomplish?
The R Zone
R References
Exercises
Chapter 2: Data Preprocessing
2.1 Why do We Need to Preprocess the Data?
2.2 Data Cleaning
2.3 Handling Missing Data
2.4 Identifying Misclassifications
2.5 Graphical Methods for Identifying Outliers
2.6 Measures of Center and Spread
2.7 Data Transformation
2.8 Min–Max Normalization
2.9 Z-Score Standardization
2.10 Decimal Scaling
2.11 Transformations to Achieve Normality
2.12 Numerical Methods for Identifying Outliers
2.13 Flag Variables
2.14 Transforming Categorical Variables into Numerical Variables
2.15 Binning Numerical Variables
2.16 Reclassifying Categorical Variables
2.17 Adding an Index Field
2.18 Removing Variables that are not Useful
2.19 Variables that Should Probably not be Removed
2.20 Removal of Duplicate Records
2.21 A Word About ID Fields
The R Zone
R Reference
Exercises
Chapter 3: Exploratory Data Analysis
3.1 Hypothesis Testing Versus Exploratory Data Analysis
3.2 Getting to Know The Data Set
3.3 Exploring Categorical Variables
3.4 Exploring Numeric Variables
3.5 Exploring Multivariate Relationships
3.6 Selecting Interesting Subsets of the Data for Further Investigation
3.7 Using EDA to Uncover Anomalous Fields
3.8 Binning Based on Predictive Value
3.9 Deriving New Variables: Flag Variables
3.10 Deriving New Variables: Numerical Variables
3.11 Using EDA to Investigate Correlated Predictor Variables
3.12 Summary of Our EDA
The R Zone
R References
Exercises
Chapter 4: Dimension-Reduction Methods
4.1 Need for Dimension-Reduction in Data Mining
4.2 Principal Components Analysis
4.3 Applying PCA to the Houses Data Set
4.4 How Many Components Should We Extract?
4.5 Profiling the Principal Components
4.6 Communalities
4.7 Validation of the Principal Components
4.8 Factor Analysis
4.9 Applying Factor Analysis to the Adult Data Set
4.10 Factor Rotation
4.11 User-Defined Composites
4.12 An Example of a User-Defined Composite
The R Zone
R References
Exercises
Part II: Statistical Analysis
Chapter 5: Univariate Statistical Analysis
5.1 Data Mining Tasks in Discovering Knowledge in Data
5.2 Statistical Approaches to Estimation and Prediction
5.3 Statistical Inference
5.4 How Confident are We in Our Estimates?
5.5 Confidence Interval Estimation of the Mean
5.6 How to Reduce the Margin of Error
5.7 Confidence Interval Estimation of the Proportion
5.8 Hypothesis Testing for the Mean
5.9 Assessing The Strength of Evidence Against The Null Hypothesis
5.10 Using Confidence Intervals to Perform Hypothesis Tests
5.11 Hypothesis Testing for The Proportion
Reference
The R Zone
R Reference
Exercises
Chapter 6: Multivariate Statistics
6.1 Two-Sample t-Test for Difference in Means
6.2 Two-Sample Z-Test for Difference in Proportions
6.3 Test for the Homogeneity of Proportions
6.4 Chi-Square Test for Goodness of Fit of Multinomial Data
6.5 Analysis of Variance
Reference
The R Zone
R Reference
Exercises
Chapter 7: Preparing to Model the Data
7.1 Supervised Versus Unsupervised Methods
7.2 Statistical Methodology and Data Mining Methodology
7.3 Cross-Validation
7.4 Overfitting
7.5 Bias–Variance Trade-Off
7.6 Balancing The Training Data Set
7.7 Establishing Baseline Performance
The R Zone
R Reference
Exercises
Chapter 8: Simple Linear Regression
8.1 An Example of Simple Linear Regression
8.2 Dangers of Extrapolation
8.3 How Useful is the Regression? The Coefficient of Determination, r²
8.4 Standard Error of the Estimate, s
8.5 Correlation Coefficient
8.6 ANOVA Table for Simple Linear Regression
8.7 Outliers, High Leverage Points, and Influential Observations
8.8 Population Regression Equation
8.9 Verifying The Regression Assumptions
8.10 Inference in Regression
8.11 t-Test for the Relationship Between x and y
8.12 Confidence Interval for the Slope of the Regression Line
8.13 Confidence Interval for the Correlation Coefficient ρ
8.14 Confidence Interval for the Mean Value of y Given x
8.15 Prediction Interval for a Randomly Chosen Value of y Given x
8.16 Transformations to Achieve Linearity
8.17 Box–Cox Transformations
The R Zone
R References
Exercises
Chapter 9: Multiple Regression and Model Building
9.1 An Example of Multiple Regression
9.2 The Population Multiple Regression Equation
9.3 Inference in Multiple Regression
9.4 Regression With Categorical Predictors, Using Indicator Variables
9.5 Adjusting R²: Penalizing Models For Including Predictors That Are Not Useful
9.6 Sequential Sums of Squares
9.7 Multicollinearity
9.8 Variable Selection Methods
9.9 Gas Mileage Data Set
9.10 An Application of Variable Selection Methods
9.11 Using the Principal Components as Predictors in Multiple Regression
The R Zone
R References
Exercises
Part III: Classification
Chapter 10: k-Nearest Neighbor Algorithm
10.1 Classification Task
10.2 k-Nearest Neighbor Algorithm
10.3 Distance Function
10.4 Combination Function
10.5 Quantifying Attribute Relevance: Stretching the Axes
10.6 Database Considerations
10.7 k-Nearest Neighbor Algorithm for Estimation and Prediction
10.8 Choosing k
10.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
The R Zone
R References
Exercises
Chapter 11: Decision Trees
11.1 What is a Decision Tree?
11.2 Requirements for Using Decision Trees
11.3 Classification and Regression Trees
11.4 C4.5 Algorithm
11.5 Decision Rules
11.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data
The R Zone
R References
Exercises
Chapter 12: Neural Networks
12.1 Input and Output Encoding
12.2 Neural Networks for Estimation and Prediction
12.3 Simple Example of a Neural Network
12.4 Sigmoid Activation Function
12.5 Back-Propagation
12.6 Gradient-Descent Method
12.7 Back-Propagation Rules
12.8 Example of Back-Propagation
12.9 Termination Criteria
12.10 Learning Rate
12.11 Momentum Term
12.12 Sensitivity Analysis
12.13 Application of Neural Network Modeling
The R Zone
R References
Exercises
Chapter 13: Logistic Regression
13.1 Simple Example of Logistic Regression
13.2 Maximum Likelihood Estimation
13.3 Interpreting Logistic Regression Output
13.4 Inference: Are the Predictors Significant?
13.5 Odds Ratio and Relative Risk
13.6 Interpreting Logistic Regression for a Dichotomous Predictor
13.7 Interpreting Logistic Regression for a Polychotomous Predictor
13.8 Interpreting Logistic Regression for a Continuous Predictor
13.9 Assumption of Linearity
13.10 Zero-Cell Problem
13.11 Multiple Logistic Regression
13.12 Introducing Higher Order Terms to Handle Nonlinearity
13.13 Validating the Logistic Regression Model
13.14 WEKA: Hands-On Analysis Using Logistic Regression
The R Zone
R References
Exercises
Chapter 14: Naïve Bayes and Bayesian Networks
14.1 Bayesian Approach
14.2 Maximum A Posteriori (MAP) Classification
14.3 Posterior Odds Ratio
14.4 Balancing The Data
14.5 Naïve Bayes Classification
14.6 Interpreting The Log Posterior Odds Ratio
14.7 Zero-Cell Problem
14.8 Numeric Predictors for Naïve Bayes Classification
14.9 WEKA: Hands-on Analysis Using Naïve Bayes
14.10 Bayesian Belief Networks
14.11 Clothing Purchase Example
14.12 Using The Bayesian Network to Find Probabilities
The R Zone
R References
Exercises
Chapter 15: Model Evaluation Techniques
15.1 Model Evaluation Techniques for the Description Task
15.2 Model Evaluation Techniques for the Estimation and Prediction Tasks
15.3 Model Evaluation Measures for the Classification Task
15.4 Accuracy and Overall Error Rate
15.5 Sensitivity and Specificity
15.6 False-Positive Rate and False-Negative Rate
15.7 Proportions of True Positives, True Negatives, False Positives, and False Negatives
15.8 Misclassification Cost Adjustment to Reflect Real-World Concerns
15.9 Decision Cost/Benefit Analysis
15.10 Lift Charts and Gains Charts
15.11 Interweaving Model Evaluation with Model Building
15.12 Confluence of Results: Applying a Suite of Models
The R Zone
R References
Exercises
Hands-On Analysis
Chapter 16: Cost-Benefit Analysis Using Data-Driven Costs
16.1 Decision Invariance Under Row Adjustment
16.2 Positive Classification Criterion
16.3 Demonstration Of The Positive Classification Criterion
16.4 Constructing The Cost Matrix
16.5 Decision Invariance Under Scaling
16.6 Direct Costs and Opportunity Costs
16.7 Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
16.8 Rebalancing as a Surrogate for Misclassification Costs
The R Zone
R References
Exercises
Chapter 17: Cost-Benefit Analysis for Trinary and k-Nary Classification Models
17.1 Classification Evaluation Measures for a Generic Trinary Target
17.2 Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem
17.3 Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem
17.4 Comparing Cart Models With and Without Data-Driven Misclassification Costs
17.5 Classification Evaluation Measures for a Generic k-Nary Target
17.6 Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification
The R Zone
R References
Exercises
Chapter 18: Graphical Evaluation of Classification Models
18.1 Review of Lift Charts and Gains Charts
18.2 Lift Charts and Gains Charts Using Misclassification Costs
18.3 Response Charts
18.4 Profits Charts
18.5 Return on Investment (ROI) Charts
The R Zone
R References
Exercises
Hands-On Exercises
Part IV: Clustering
Chapter 19: Hierarchical and k-Means Clustering
19.1 The Clustering Task
19.2 Hierarchical Clustering Methods
19.3 Single-Linkage Clustering
19.4 Complete-Linkage Clustering
19.5 k-Means Clustering
19.6 Example of k-Means Clustering at Work
19.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
19.8 Application of k-Means Clustering Using SAS Enterprise Miner
19.9 Using Cluster Membership to Predict Churn
The R Zone
R References
Exercises
Hands-On Analysis
Chapter 20: Kohonen Networks
20.1 Self-Organizing Maps
20.2 Kohonen Networks
20.3 Example of a Kohonen Network Study
20.4 Cluster Validity
20.5 Application of Clustering Using Kohonen Networks
20.6 Interpreting The Clusters
20.7 Using Cluster Membership as Input to Downstream Data Mining Models
The R Zone
R References
Exercises
Chapter 21: BIRCH Clustering
21.1 Rationale for BIRCH Clustering
21.2 Cluster Features
21.3 Cluster Feature Tree
21.4 Phase 1: Building The CF Tree
21.5 Phase 2: Clustering The Sub-Clusters
21.6 Example of BIRCH Clustering, Phase 1: Building The CF Tree
21.7 Example of BIRCH Clustering, Phase 2: Clustering The Sub-Clusters
21.8 Evaluating The Candidate Cluster Solutions
21.9 Case Study: Applying BIRCH Clustering to The Bank Loans Data Set
The R Zone
R References
Exercises
Chapter 22: Measuring Cluster Goodness
22.1 Rationale for Measuring Cluster Goodness
22.2 The Silhouette Method
22.3 Silhouette Example
22.4 Silhouette Analysis of the IRIS Data Set
22.5 The Pseudo-F Statistic
22.6 Example of the Pseudo-F Statistic
22.7 Pseudo-F Statistic Applied to the IRIS Data Set
22.8 Cluster Validation
22.9 Cluster Validation Applied to the Loans Data Set
The R Zone
R References
Exercises
Part V: Association Rules
Chapter 23: Association Rules
23.1 Affinity Analysis and Market Basket Analysis
23.2 Support, Confidence, Frequent Itemsets, and the A Priori Property
23.3 How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
23.4 How Does The A Priori Algorithm Work (Part 2)? Generating Association Rules
23.5 Extension From Flag Data to General Categorical Data
23.6 Information-Theoretic Approach: Generalized Rule Induction Method
23.7 Association Rules are Easy to do Badly
23.8 How Can We Measure the Usefulness of Association Rules?
23.9 Do Association Rules Represent Supervised or Unsupervised Learning?
23.10 Local Patterns Versus Global Models
The R Zone
R References
Exercises
Part VI: Enhancing Model Performance
Chapter 24: Segmentation Models
24.1 The Segmentation Modeling Process
24.2 Segmentation Modeling Using EDA to Identify the Segments
24.3 Segmentation Modeling using Clustering to Identify the Segments
The R Zone
R References
Exercises
Chapter 25: Ensemble Methods: Bagging and Boosting
25.1 Rationale for Using an Ensemble of Classification Models
25.2 Bias, Variance, and Noise
25.3 When to Apply, and not to apply, Bagging
25.4 Bagging
25.5 Boosting
25.6 Application of Bagging and Boosting Using IBM/SPSS Modeler
References
The R Zone
R Reference
Exercises
Chapter 26: Model Voting and Propensity Averaging
26.1 Simple Model Voting
26.2 Alternative Voting Methods
26.3 Model Voting Process
26.4 An Application of Model Voting
26.5 What is Propensity Averaging?
26.6 Propensity Averaging Process
26.7 An Application of Propensity Averaging
The R Zone
R References
Exercises
Hands-On Analysis
Part VII: Further Topics
Chapter 27: Genetic Algorithms
27.1 Introduction To Genetic Algorithms
27.2 Basic Framework of a Genetic Algorithm
27.3 Simple Example of a Genetic Algorithm at Work
27.4 Modifications and Enhancements: Selection
27.5 Modifications and Enhancements: Crossover
27.6 Genetic Algorithms for Real-Valued Variables
27.7 Using Genetic Algorithms to Train a Neural Network
27.8 WEKA: Hands-On Analysis Using Genetic Algorithms
The R Zone
R References
Chapter 28: Imputation of Missing Data
28.1 Need for Imputation of Missing Data
28.2 Imputation of Missing Data: Continuous Variables
28.3 Standard Error of the Imputation
28.4 Imputation of Missing Data: Categorical Variables
28.5 Handling Patterns in Missingness
Reference
The R Zone
R References
Part VIII: Case Study: Predicting Response to Direct-Mail Marketing
Chapter 29: Case Study, Part 1: Business Understanding, Data Preparation, and EDA
29.1 Cross-Industry Standard Process for Data Mining
29.2 Business Understanding Phase
29.3 Data Understanding Phase, Part 1: Getting a Feel for the Data Set
29.4 Data Preparation Phase
29.5 Data Understanding Phase, Part 2: Exploratory Data Analysis
Chapter 30: Case Study, Part 2: Clustering and Principal Components Analysis
30.1 Partitioning the Data
30.2 Developing the Principal Components
30.3 Validating the Principal Components
30.4 Profiling the Principal Components
30.5 Choosing the Optimal Number of Clusters Using BIRCH Clustering
30.6 Choosing the Optimal Number of Clusters Using k-Means Clustering
30.7 Application of k-Means Clustering
30.8 Validating the Clusters
30.9 Profiling the Clusters
Chapter 31: Case Study, Part 3: Modeling And Evaluation For Performance And Interpretability
31.1 Do You Prefer The Best Model Performance, Or A Combination Of Performance And Interpretability?
31.2 Modeling And Evaluation Overview
31.3 Cost-Benefit Analysis Using Data-Driven Costs
31.4 Variables to be Input To The Models
31.5 Establishing The Baseline Model Performance
31.6 Models That Use Misclassification Costs
31.7 Models That Need Rebalancing as a Surrogate for Misclassification Costs
31.8 Combining Models Using Voting and Propensity Averaging
31.9 Interpreting The Most Profitable Model
Chapter 32: Case Study, Part 4: Modeling and Evaluation for High Performance Only
32.1 Variables to be Input to the Models
32.2 Models that use Misclassification Costs
32.3 Models that Need Rebalancing as a Surrogate for Misclassification Costs
32.4 Combining Models using Voting and Propensity Averaging
32.5 Lessons Learned
32.6 Conclusions
Appendix A: Data Summarization and Visualization
Part 1: Summarization 1: Building Blocks Of Data Analysis
Part 2: Visualization: Graphs and Tables For Summarizing And Organizing Data
Part 3: Summarization 2: Measures Of Center, Variability, and Position
Part 4: Summarization And Visualization Of Bivariate Relationships
Index
End User License Agreement
Second Edition
DANIEL T. LAROSE
CHANTAL D. LAROSE
Copyright © 2015 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Larose, Daniel T.
Data mining and predictive analytics / Daniel T. Larose, Chantal D. Larose.
pages cm. – (Wiley series on methods and applications in data mining)
Includes bibliographical references and index.
ISBN 978-1-118-11619-7 (cloth)
1. Data mining. 2. Prediction theory. I. Larose, Chantal D. II. Title.
QA76.9.D343L3776 2015
006.3′12–dc23
2014043340
To those who have gone before us,
And to those who come after us,
In the Family Tree of Life…
Data mining is the process of discovering useful patterns and trends in large data sets.
Predictive analytics is the process of extracting information from large data sets in order to make predictions and estimates about future outcomes.
Data Mining and Predictive Analytics, by Daniel Larose and Chantal Larose, will enable you to become an expert in these cutting-edge, profitable fields.
According to the research firm MarketsandMarkets, the global big data market is expected to grow by 26% per year from 2013 to 2018, from $14.87 billion in 2013 to $46.34 billion in 2018.1 Corporations and institutions worldwide are learning to apply data mining and predictive analytics, in order to increase profits. Companies that do not apply these methods will be left behind in the global competition of the twenty-first-century economy.
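Readers who like to verify such figures can check the arithmetic themselves. The snippet below is a back-of-the-envelope calculation of the implied compound annual growth rate; it is our own illustration, not taken from the book or the cited report.

```r
# Back-of-the-envelope check of the implied compound annual growth rate (CAGR);
# our own arithmetic, not a calculation from the eWeek article.
v2013 <- 14.87   # global big data market in 2013, $ billions
v2018 <- 46.34   # projected market in 2018, $ billions
cagr  <- (v2018 / v2013)^(1 / 5) - 1
round(100 * cagr, 1)   # about 25.5 -- consistent with the quoted 26% per year
```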
Humans are inundated with data in most fields. Unfortunately, most of this valuable data, which cost firms millions to collect and collate, are languishing in warehouses and repositories. The problem is that there are not enough trained human analysts available who are skilled at translating all of this data into knowledge, and thence up the taxonomy tree into wisdom. This is why this book is needed.
The McKinsey Global Institute reports2:
There will be a shortage of talent necessary for organizations to take advantage of big data. A significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data… We project that demand for deep analytical positions in a big data world could exceed the supply being produced on current trends by 140,000 to 190,000 positions. … In addition, we project a need for 1.5 million additional managers and analysts in the United States who can ask the right questions and consume the results of the analysis of big data effectively.
This book is an attempt to help alleviate this critical shortage of data analysts.
Data mining is becoming more widespread every day, because it empowers companies to uncover profitable patterns and trends from their existing databases. Companies and institutions have spent millions of dollars to collect gigabytes and terabytes of data, but are not taking advantage of the valuable and actionable information hidden deep within their data repositories. However, as the practice of data mining becomes more widespread, companies that do not apply these techniques are in danger of falling behind, and losing market share, because their competitors are applying data mining, and thereby gaining the competitive edge.
In Data Mining and Predictive Analytics, the step-by-step hands-on solutions of real-world business problems using widely available data mining techniques applied to real-world data sets will appeal to managers, CIOs, CEOs, CFOs, data analysts, database analysts, and others who need to keep abreast of the latest methods for enhancing return on investment.
Using Data Mining and Predictive Analytics, you will learn what types of analysis will uncover the most profitable nuggets of knowledge from the data, while avoiding the potential pitfalls that may cost your company millions of dollars. You will learn data mining and predictive analytics by doing data mining and predictive analytics.
The growth of new off-the-shelf software platforms for performing data mining has kindled a new kind of danger. The ease with which these applications can manipulate data, combined with the power of the formidable data mining algorithms embedded in the black-box software, makes their misuse proportionally more hazardous.
In short, data mining is easy to do badly. A little knowledge is especially dangerous when it comes to applying powerful models based on huge data sets. For example, analyses carried out on unpreprocessed data can lead to erroneous conclusions, or inappropriate analysis may be applied to data sets that call for a completely different approach, or models may be derived that are built on wholly unwarranted specious assumptions. If deployed, these errors in analysis can lead to very expensive failures. Data Mining and Predictive Analytics will help make you a savvy analyst, who will avoid these costly pitfalls.
The best way to avoid costly errors stemming from a blind black-box approach to data mining and predictive analytics is to instead apply a “white-box” methodology, which emphasizes an understanding of the algorithmic and statistical model structures underlying the software.
Data Mining and Predictive Analytics applies this white-box approach by
* clearly explaining why a particular method or algorithm is needed;
* getting the reader acquainted with how a method or algorithm works, using a toy example (tiny data set), so that the reader may follow the logic step by step, and thus gain a white-box insight into the inner workings of the method or algorithm;
* providing an application of the method to a large, real-world data set;
* using exercises to test the reader's level of understanding of the concepts and algorithms;
* providing an opportunity for the reader to experience doing some real data mining on large data sets.
Data Mining and Predictive Analytics walks the reader through the operations and nuances of the various algorithms, using small data sets, so that the reader gets a true appreciation of what is really going on inside the algorithm. For example, in Chapter 21, we follow step by step as the balanced iterative reducing and clustering using hierarchies (BIRCH) algorithm works through a tiny data set, showing precisely how BIRCH chooses the optimal clustering solution for this data, from start to finish. As far as we know, such a demonstration is unique to this book for the BIRCH algorithm. Also, in Chapter 27, we proceed step by step to find the optimal solution using the selection, crossover, and mutation operators, using a tiny data set, so that the reader may better understand the underlying processes.
Data Mining and Predictive Analytics provides examples of the application of data analytic methods on actual large data sets. For example, in Chapter 9, we analytically unlock the relationship between nutrition rating and cereal content using a real-world data set. In Chapter 4, we apply principal components analysis to real-world census data about California. All data sets are available from the book series web site: www.dataminingconsultant.com.
Data Mining and Predictive Analytics includes over 750 chapter exercises, which allow readers to assess their depth of understanding of the material, as well as have a little fun playing with numbers and data. These include Clarifying the Concept exercises, which help to clarify some of the more challenging concepts in data mining, and Working with the Data exercises, which challenge the reader to apply the particular data mining algorithm to a small data set, and, step by step, to arrive at a computationally sound solution. For example, in Chapter 14, readers are asked to find the maximum a posteriori classification for the data set and network provided in the chapter.
Most chapters provide the reader with Hands-On Analysis problems, representing an opportunity for the reader to apply his or her newly acquired data mining expertise to solving real problems using large data sets. Many people learn by doing. Data Mining and Predictive Analytics provides a framework where the reader can learn data mining by doing data mining. For example, in Chapter 13, readers are challenged to approach a real-world credit approval classification data set and construct their best possible logistic regression model, using the methods learned in the chapter, and to provide strong interpretive support for the model, including explanations of any derived variables and indicator variables used.
Data Mining and Predictive Analytics contains many exciting new topics, including the following:
Cost-benefit analysis using data-driven misclassification costs.
Cost-benefit analysis for trinary and k-nary classification models.
Graphical evaluation of classification models.
BIRCH clustering.
Segmentation models.
Ensemble methods: Bagging and boosting.
Model voting and propensity averaging.
Imputation of missing data.
R is a powerful, open-source language for exploring and analyzing data sets (www.r-project.org). Analysts using R can take advantage of many freely available packages, routines, and graphical user interfaces to tackle most data analysis problems. In most chapters of this book, the reader will find The R Zone, which provides the actual R code needed to obtain the results shown in the chapter, along with screenshots of some of the output.
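To give a flavor of what such code looks like, here is a minimal sketch in the spirit of The R Zone. It is our illustration, not an excerpt from the book; the file name cereals.csv and the variable Rating are hypothetical placeholders.

```r
# A small taste of the kind of code found in The R Zone:
# read in a data set and explore one variable.
# "cereals.csv" and the variable "Rating" are hypothetical placeholders.
cereals <- read.csv("cereals.csv", header = TRUE)

summary(cereals$Rating)   # five-number summary plus the mean
hist(cereals$Rating,      # quick look at the variable's distribution
     main = "Distribution of Nutrition Rating",
     xlab = "Rating")
```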
Some readers may be a bit rusty on some statistical and graphical concepts, usually encountered in an introductory statistics course. Data Mining and Predictive Analytics contains an appendix that provides a review of the most common concepts and terminology helpful for readers to hit the ground running in their understanding of the material in this book.
Data Mining and Predictive Analytics culminates in a detailed Case Study. Here the reader has the opportunity to see how everything he or she has learned is brought all together to create actionable and profitable solutions. This detailed Case Study ranges over four chapters, and is as follows:
Chapter 29: Case Study, Part 1: Business Understanding, Data Preparation, and EDA
Chapter 30: Case Study, Part 2: Clustering and Principal Components Analysis
Chapter 31: Case Study, Part 3: Modeling and Evaluation for Performance and Interpretability
Chapter 32: Case Study, Part 4: Modeling and Evaluation for High Performance Only
The Case Study includes dozens of pages of graphical, exploratory data analysis (EDA), predictive modeling, customer profiling, and offers different solutions, depending on the requisites of the client. The models are evaluated using a custom-built data-driven cost-benefit table, reflecting the true costs of classification errors, rather than the usual methods such as overall error rate. Thus, the analyst can compare models using the estimated profit per customer contacted, and can predict how much money the models will earn, based on the number of customers contacted.
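For readers who want to see the mechanics, the following is a minimal sketch, with made-up numbers, of how a data-driven cost-benefit table can be combined with a confusion matrix to estimate the profit per customer contacted. It is our illustration only; the case study's actual costs and counts differ.

```r
# Hypothetical sketch of evaluating a model with a data-driven cost-benefit table.
# All dollar figures and counts below are made up for illustration.
confusion <- matrix(c(800, 200,    # actual non-responders: TN, FP
                      100, 150),   # actual responders:     FN, TP
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual    = c("no", "yes"),
                                    predicted = c("no", "yes")))

cost_benefit <- matrix(c(0, -10,   # contacting a non-responder costs $10
                         0,  40),  # contacting a responder nets $40
                       nrow = 2, byrow = TRUE)

total_profit <- sum(confusion * cost_benefit)   # elementwise product, then sum
contacted    <- sum(confusion[, "yes"])         # customers the model says to contact
total_profit / contacted                        # estimated profit per customer contacted
```

With these hypothetical figures, the model earns about $11.43 per customer contacted; comparing this quantity across candidate models is exactly the kind of comparison the Case Study makes.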
Data Mining and Predictive Analytics is structured in a way that the reader will hopefully find logical and straightforward. There are 32 chapters, divided into eight major parts.
Part 1, Data Preparation, consists of chapters on data preparation, EDA, and dimension reduction.
Part 2, Statistical Analysis, provides classical statistical approaches to data analysis, including chapters on univariate and multivariate statistical analysis, simple and multiple linear regression, preparing to model the data, and model building.
Part 3, Classification, contains nine chapters, making it the largest section of the book. Chapters include k-nearest neighbor, decision trees, neural networks, logistic regression, naïve Bayes, Bayesian networks, model evaluation techniques, cost-benefit analysis using data-driven misclassification costs, trinary and k-nary classification models, and graphical evaluation of classification models.
Part 4, Clustering, contains chapters on hierarchical clustering, k-means clustering, Kohonen networks clustering, BIRCH clustering, and measuring cluster goodness.
Part 5, Association Rules, consists of a single chapter covering a priori association rules and generalized rule induction.
Part 6, Enhancing Model Performance, provides chapters on segmentation models, ensemble methods: bagging and boosting, model voting, and propensity averaging.
Part 7, Further Methods in Predictive Modeling, contains a chapter on imputation of missing data, along with a chapter on genetic algorithms.
Part 8, Case Study: Predicting Response to Direct-Mail Marketing, consists of four chapters presenting a start-to-finish detailed Case Study of how to generate the greatest profit from a direct-mail marketing campaign.
The software used in this book includes the following:
* IBM SPSS Modeler data mining software suite
* R open-source statistical software
* SAS Enterprise Miner
* SPSS statistical software
* Minitab statistical software
* WEKA open-source data mining software.
IBM SPSS Modeler (www-01.ibm.com/software/analytics/spss/products/modeler/) is one of the most widely used data mining software suites, and is distributed by SPSS, whose base software is also used in this book. SAS Enterprise Miner is probably more powerful than Modeler, but the learning curve is also steeper. SPSS is available for download on a trial basis as well (Google “spss” download). Minitab is an easy-to-use statistical software package that is available for download on a trial basis from their web site at www.minitab.com.
The Weka (Waikato Environment for Knowledge Analysis) machine learning workbench is open-source software issued under the GNU General Public License, which includes a collection of tools for completing many data mining tasks. Data Mining and Predictive Modeling presents several hands-on, step-by-step tutorial examples using Weka 3.6, along with input files available from the book's companion web site www.dataminingconsultant.com. The reader is shown how to carry out the following types of analysis, using WEKA: Logistic Regression (Chapter 13), Naïve Bayes classification (Chapter 14), Bayesian Networks classification (Chapter 14), and Genetic Algorithms (Chapter 27). For more information regarding Weka, see www.cs.waikato.ac.nz/ml/weka/. The author is deeply grateful to James Steck for providing these WEKA examples and exercises. James Steck ([email protected]) was one of the first students to complete the master of science in data mining from Central Connecticut State University in 2005 (GPA 4.0), and received the first data mining Graduate Academic Award. James lives with his wife and son in Issaquah, WA.
The reader will find supporting materials, both for this book and for the other data mining books written by Daniel Larose and Chantal Larose for Wiley InterScience, at the companion web site, www.dataminingconsultant.com. There one may download the many data sets used in the book, so that the reader may develop a hands-on feel for the analytic methods and models encountered throughout the book. Errata are also available, as is a comprehensive set of data mining resources, including links to data sets, data mining groups, and research papers.
However, the real power of the companion web site is available to faculty adopters of the textbook, who will have access to the following resources:
Solutions to all the exercises, including the hands-on analyses.
PowerPoint® presentations of each chapter, ready for deployment in the classroom.
Sample data mining course projects, written by the author for use in his own courses, and ready to be adapted for your course.
Real-world data sets, to be used with the course projects.
Multiple-choice chapter quizzes.
Chapter-by-chapter web resources.
Adopters may e-mail Daniel Larose at [email protected] to request access information for the adopters' resources.
Data Mining and Predictive Analytics naturally fits the role of textbook for a one-semester course or a two-semester sequence of courses in introductory and intermediate data mining. Instructors may appreciate
* the presentation of data mining as a process;
* the "white-box" approach, emphasizing an understanding of the underlying algorithmic structures;
* the algorithm walk-throughs with toy data sets;
* the application of the algorithms to large real-world data sets;
* the over 300 figures and over 275 tables;
* the over 750 chapter exercises and hands-on analyses;
* the many exciting new topics, such as cost-benefit analysis using data-driven misclassification costs;
* the detailed Case Study, bringing together many of the lessons learned from the earlier 28 chapters;
* the Appendix: Data Summarization and Visualization, containing a review of statistical and graphical concepts readers may be a bit rusty on;
* the companion web site, providing the array of resources for adopters detailed above.
Data Mining and Predictive Analytics is appropriate for advanced undergraduate- or graduate-level courses. An introductory statistics course would be nice, but is not required. No computer programming or database expertise is required.
1 Big Data Market to Reach $46.34 Billion by 2018, by Darryl K. Taft, eWeek, www.eweek.com/database/big-data-market-to-reach-46.34-billion-by-2018.html, posted September 1, 2013, last accessed March 23, 2014.
2 Big data: The next frontier for innovation, competition, and productivity, by James Manyika et al., McKinsey Global Institute, www.mckinsey.com, May 2011, last accessed March 16, 2014.
I would first like to thank my mentor Dr. Dipak K. Dey, distinguished professor of statistics, and associate dean of the College of Liberal Arts and Sciences at the University of Connecticut, as well as Dr. John Judge, professor of statistics in the Department of Mathematics at Westfield State College. My debt to the two of you is boundless, and now extends beyond one lifetime. Also, I wish to thank my colleagues in the data mining programs at Central Connecticut State University, Dr. Chun Jin, Dr. Daniel S. Miller, Dr. Roger Bilisoly, Dr. Darius Dziuda, and Dr. Krishna Saha. Thanks to my daughter Chantal, and to my twin children, Tristan Spring and Ravel Renaissance, for providing perspective on what life is about.
Daniel T. Larose, PhD
Professor of Statistics and Data Mining
Director, Data Mining @CCSU
www.math.ccsu.edu/larose
I would first like to thank my PhD advisors, Dr. Dipak Dey, distinguished professor and associate dean, and Dr. Ofer Harel, associate professor, both of the Department of Statistics at the University of Connecticut. Their insight and understanding have framed and sculpted our exciting research program, including my PhD dissertation, “Model-Based Clustering of Incomplete Data.” Thanks also to my father, Daniel, for kindling my enduring love of data analysis, and to my mother, Debra, for her care and patience through many statistics-filled conversations. Finally, thanks to my siblings, Ravel and Tristan, for perspective, music, and friendship.
Chantal D. Larose, MS
Department of Statistics
University of Connecticut
Recently, the computer manufacturer Dell was interested in improving the productivity of its sales workforce. It therefore turned to data mining and predictive analytics to analyze its database of potential customers, in order to identify the most likely respondents. Researching the social network activity of potential leads, using LinkedIn and other sites, provided richer information about the potential customers, allowing Dell to develop more personalized sales pitches for its clients. This is an example of mining customer data to help identify the most suitable marketing approach for a particular customer, based on that customer's individual profile. What is the bottom line? The number of prospects that needed to be contacted was cut by 50%, leaving only the most promising prospects, leading to a near doubling of the productivity and efficiency of the sales workforce, with a similar increase in revenue for Dell.1
The Commonwealth of Massachusetts is wielding predictive analytics as a tool to cut down on the number of cases of Medicaid fraud in the state. When a Medicaid claim is made, the state now immediately passes it in real time to a predictive analytics model, in order to detect any anomalies. During its first 6 months of operation, the new system has “been able to recover $2 million in improper payments, and has avoided paying hundreds of thousands of dollars in fraudulent claims,” according to Joan Senatore, Director of the Massachusetts Medicaid Fraud Unit.2
The McKinsey Global Institute (MGI) reports3 that most American companies with more than 1000 employees had an average of at least 200 TB of stored data. MGI projects that the amount of data generated worldwide will increase by 40% annually, creating profitable opportunities for companies to leverage their data to reduce costs and increase their bottom line. For example, retailers harnessing this "big data" to best advantage could expect to realize an increase in their operating margin of more than 60%, according to the MGI report. And health-care providers and health maintenance organizations (HMOs) that properly leverage their data storehouses could achieve $300 billion in cost savings annually, through improved efficiency and quality.
Forbes magazine reports4 that the use of data mining and predictive analytics has helped to identify the patients at greatest risk of developing congestive heart failure. IBM collected 3 years of data pertaining to 350,000 patients, including measurements on over 200 factors, such as blood pressure, weight, and drugs prescribed. Using predictive analytics, IBM was able to identify the 8500 patients most at risk of dying of congestive heart failure within 1 year.
The MIT Technology Review reports5