Data Mining and Predictive Analytics

Daniel T. Larose

Description

Learn methods of data analysis and their application to real-world data sets. This updated second edition serves as an introduction to data mining methods and models, including association rules, clustering, neural networks, logistic regression, and multivariate analysis. The authors apply a unified "white box" approach to data mining methods and models. This approach is designed to walk readers through the operations and nuances of the various methods, using small data sets, so readers can gain an insight into the inner workings of the method under review. Chapters provide readers with hands-on analysis problems, representing an opportunity for readers to apply their newly acquired data mining expertise to solving real problems using large, real-world data sets. Data Mining and Predictive Analytics:

* Offers comprehensive coverage of association rules, clustering, neural networks, logistic regression, multivariate analysis, and the R statistical programming language

* Features over 750 chapter exercises, allowing readers to assess their understanding of the new material

* Provides a detailed case study that brings together the lessons learned in the book

* Includes access to the companion website, www.dataminingconsultant.com, with exclusive password-protected instructor content

Data Mining and Predictive Analytics will appeal to computer science and statistics students, as well as students in MBA programs, and chief executives.


Page count: 1131

Publication year: 2015




Table of Contents

Cover

Series

Title Page

Copyright

Dedication

Preface

What is Data Mining? What is Predictive Analytics?

Why is this Book Needed?

Who Will Benefit from this Book?

Danger! Data Mining is Easy to do Badly

“White-Box” Approach

Algorithm Walk-Throughs

Exciting New Topics

The R Zone

Appendix: Data Summarization and Visualization

The Case Study: Bringing it all Together

How the Book is Structured

The Software

Weka: The Open-Source Alternative

The Companion Web Site: www.dataminingconsultant.com

Data Mining and Predictive Analytics as a Textbook

Acknowledgments

Daniel's Acknowledgments

Chantal's Acknowledgments

Part I: Data Preparation

Chapter 1: An Introduction to Data Mining and Predictive Analytics

1.1 What is Data Mining? What Is Predictive Analytics?

1.2 Wanted: Data Miners

1.3 The Need For Human Direction of Data Mining

1.4 The Cross-Industry Standard Process for Data Mining: CRISP-DM

1.5 Fallacies of Data Mining

1.6 What Tasks Can Data Mining Accomplish?

The R Zone

R References

Exercises

Chapter 2: Data Preprocessing

2.1 Why do We Need to Preprocess the Data?

2.2 Data Cleaning

2.3 Handling Missing Data

2.4 Identifying Misclassifications

2.5 Graphical Methods for Identifying Outliers

2.6 Measures of Center and Spread

2.7 Data Transformation

2.8 Min–Max Normalization

2.9 Z-Score Standardization

2.10 Decimal Scaling

2.11 Transformations to Achieve Normality

2.12 Numerical Methods for Identifying Outliers

2.13 Flag Variables

2.14 Transforming Categorical Variables into Numerical Variables

2.15 Binning Numerical Variables

2.16 Reclassifying Categorical Variables

2.17 Adding an Index Field

2.18 Removing Variables that are not Useful

2.19 Variables that Should Probably not be Removed

2.20 Removal of Duplicate Records

2.21 A Word About ID Fields

The R Zone

R Reference

Exercises

Chapter 3: Exploratory Data Analysis

3.1 Hypothesis Testing Versus Exploratory Data Analysis

3.2 Getting to Know The Data Set

3.3 Exploring Categorical Variables

3.4 Exploring Numeric Variables

3.5 Exploring Multivariate Relationships

3.6 Selecting Interesting Subsets of the Data for Further Investigation

3.7 Using EDA to Uncover Anomalous Fields

3.8 Binning Based on Predictive Value

3.9 Deriving New Variables: Flag Variables

3.10 Deriving New Variables: Numerical Variables

3.11 Using EDA to Investigate Correlated Predictor Variables

3.12 Summary of Our EDA

The R Zone

R References

Exercises

Chapter 4: Dimension-Reduction Methods

4.1 Need for Dimension-Reduction in Data Mining

4.2 Principal Components Analysis

4.3 Applying PCA to the Houses Data Set

4.4 How Many Components Should We Extract?

4.5 Profiling the Principal Components

4.6 Communalities

4.7 Validation of the Principal Components

4.8 Factor Analysis

4.9 Applying Factor Analysis to the Adult Data Set

4.10 Factor Rotation

4.11 User-Defined Composites

4.12 An Example of a User-Defined Composite

The R Zone

R References

Exercises

Part II: Statistical Analysis

Chapter 5: Univariate Statistical Analysis

5.1 Data Mining Tasks in Discovering Knowledge in Data

5.2 Statistical Approaches to Estimation and Prediction

5.3 Statistical Inference

5.4 How Confident are We in Our Estimates?

5.5 Confidence Interval Estimation of the Mean

5.6 How to Reduce the Margin of Error

5.7 Confidence Interval Estimation of the Proportion

5.8 Hypothesis Testing for the Mean

5.9 Assessing The Strength of Evidence Against The Null Hypothesis

5.10 Using Confidence Intervals to Perform Hypothesis Tests

5.11 Hypothesis Testing for The Proportion

Reference

The R Zone

R Reference

Exercises

Chapter 6: Multivariate Statistics

6.1 Two-Sample t-Test for Difference in Means

6.2 Two-Sample Z-Test for Difference in Proportions

6.3 Test for the Homogeneity of Proportions

6.4 Chi-Square Test for Goodness of Fit of Multinomial Data

6.5 Analysis of Variance

Reference

The R Zone

R Reference

Exercises

Chapter 7: Preparing to Model the Data

7.1 Supervised Versus Unsupervised Methods

7.2 Statistical Methodology and Data Mining Methodology

7.3 Cross-Validation

7.4 Overfitting

7.5 Bias–Variance Trade-Off

7.6 Balancing The Training Data Set

7.7 Establishing Baseline Performance

The R Zone

R Reference

Exercises

Chapter 8: Simple Linear Regression

8.1 An Example of Simple Linear Regression

8.2 Dangers of Extrapolation

8.3 How Useful is the Regression? The Coefficient of Determination, r²

8.4 Standard Error of the Estimate, s

8.5 Correlation Coefficient

8.6 ANOVA Table for Simple Linear Regression

8.7 Outliers, High Leverage Points, and Influential Observations

8.8 Population Regression Equation

8.9 Verifying The Regression Assumptions

8.10 Inference in Regression

8.11 t-Test for the Relationship Between x and y

8.12 Confidence Interval for the Slope of the Regression Line

8.13 Confidence Interval for the Correlation Coefficient ρ

8.14 Confidence Interval for the Mean Value of y Given x

8.15 Prediction Interval for a Randomly Chosen Value of y Given x

8.16 Transformations to Achieve Linearity

8.17 Box–Cox Transformations

The R Zone

R References

Exercises

Chapter 9: Multiple Regression and Model Building

9.1 An Example of Multiple Regression

9.2 The Population Multiple Regression Equation

9.3 Inference in Multiple Regression

9.4 Regression With Categorical Predictors, Using Indicator Variables

9.5 Adjusting R²: Penalizing Models For Including Predictors That Are Not Useful

9.6 Sequential Sums of Squares

9.7 Multicollinearity

9.8 Variable Selection Methods

9.9 Gas Mileage Data Set

9.10 An Application of Variable Selection Methods

9.11 Using the Principal Components as Predictors in Multiple Regression

The R Zone

R References

Exercises

Part III: Classification

Chapter 10: k-Nearest Neighbor Algorithm

10.1 Classification Task

10.2 k-Nearest Neighbor Algorithm

10.3 Distance Function

10.4 Combination Function

10.5 Quantifying Attribute Relevance: Stretching the Axes

10.6 Database Considerations

10.7 k-Nearest Neighbor Algorithm for Estimation and Prediction

10.8 Choosing k

10.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler

The R Zone

R References

Exercises

Chapter 11: Decision Trees

11.1 What is a Decision Tree?

11.2 Requirements for Using Decision Trees

11.3 Classification and Regression Trees

11.4 C4.5 Algorithm

11.5 Decision Rules

11.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data

The R Zone

R References

Exercises

Chapter 12: Neural Networks

12.1 Input and Output Encoding

12.2 Neural Networks for Estimation and Prediction

12.3 Simple Example of a Neural Network

12.4 Sigmoid Activation Function

12.5 Back-Propagation

12.6 Gradient-Descent Method

12.7 Back-Propagation Rules

12.8 Example of Back-Propagation

12.9 Termination Criteria

12.10 Learning Rate

12.11 Momentum Term

12.12 Sensitivity Analysis

12.13 Application of Neural Network Modeling

The R Zone

R References

Exercises

Chapter 13: Logistic Regression

13.1 Simple Example of Logistic Regression

13.2 Maximum Likelihood Estimation

13.3 Interpreting Logistic Regression Output

13.4 Inference: Are the Predictors Significant?

13.5 Odds Ratio and Relative Risk

13.6 Interpreting Logistic Regression for a Dichotomous Predictor

13.7 Interpreting Logistic Regression for a Polychotomous Predictor

13.8 Interpreting Logistic Regression for a Continuous Predictor

13.9 Assumption of Linearity

13.10 Zero-Cell Problem

13.11 Multiple Logistic Regression

13.12 Introducing Higher Order Terms to Handle Nonlinearity

13.13 Validating the Logistic Regression Model

13.14 WEKA: Hands-On Analysis Using Logistic Regression

The R Zone

R References

Exercises

Chapter 14: Naïve Bayes and Bayesian Networks

14.1 Bayesian Approach

14.2 Maximum A Posteriori (MAP) Classification

14.3 Posterior Odds Ratio

14.4 Balancing The Data

14.5 Naïve Bayes Classification

14.6 Interpreting The Log Posterior Odds Ratio

14.7 Zero-Cell Problem

14.8 Numeric Predictors for Naïve Bayes Classification

14.9 WEKA: Hands-on Analysis Using Naïve Bayes

14.10 Bayesian Belief Networks

14.11 Clothing Purchase Example

14.12 Using The Bayesian Network to Find Probabilities

The R Zone

R References

Exercises

Chapter 15: Model Evaluation Techniques

15.1 Model Evaluation Techniques for the Description Task

15.2 Model Evaluation Techniques for the Estimation and Prediction Tasks

15.3 Model Evaluation Measures for the Classification Task

15.4 Accuracy and Overall Error Rate

15.5 Sensitivity and Specificity

15.6 False-Positive Rate and False-Negative Rate

15.7 Proportions of True Positives, True Negatives, False Positives, and False Negatives

15.8 Misclassification Cost Adjustment to Reflect Real-World Concerns

15.9 Decision Cost/Benefit Analysis

15.10 Lift Charts and Gains Charts

15.11 Interweaving Model Evaluation with Model Building

15.12 Confluence of Results: Applying a Suite of Models

The R Zone

R References

Exercises

Hands-On Analysis

Chapter 16: Cost-Benefit Analysis Using Data-Driven Costs

16.1 Decision Invariance Under Row Adjustment

16.2 Positive Classification Criterion

16.3 Demonstration Of The Positive Classification Criterion

16.4 Constructing The Cost Matrix

16.5 Decision Invariance Under Scaling

16.6 Direct Costs and Opportunity Costs

16.7 Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs

16.8 Rebalancing as a Surrogate for Misclassification Costs

The R Zone

R References

Exercises

Chapter 17: Cost-Benefit Analysis for Trinary and k-Nary Classification Models

17.1 Classification Evaluation Measures for a Generic Trinary Target

17.2 Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem

17.3 Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem

17.4 Comparing Cart Models With and Without Data-Driven Misclassification Costs

17.5 Classification Evaluation Measures for a Generic k-Nary Target

17.6 Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification

The R Zone

R References

Exercises

Chapter 18: Graphical Evaluation of Classification Models

18.1 Review of Lift Charts and Gains Charts

18.2 Lift Charts and Gains Charts Using Misclassification Costs

18.3 Response Charts

18.4 Profits Charts

18.5 Return on Investment (ROI) Charts

The R Zone

R References

Exercises

Hands-On Exercises

Part IV: Clustering

Chapter 19: Hierarchical and k-Means Clustering

19.1 The Clustering Task

19.2 Hierarchical Clustering Methods

19.3 Single-Linkage Clustering

19.4 Complete-Linkage Clustering

19.5 k-Means Clustering

19.6 Example of k-Means Clustering at Work

19.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds

19.8 Application of k-Means Clustering Using SAS Enterprise Miner

19.9 Using Cluster Membership to Predict Churn

The R Zone

R References

Exercises

Hands-On Analysis

Chapter 20: Kohonen Networks

20.1 Self-Organizing Maps

20.2 Kohonen Networks

20.3 Example of a Kohonen Network Study

20.4 Cluster Validity

20.5 Application of Clustering Using Kohonen Networks

20.6 Interpreting The Clusters

20.7 Using Cluster Membership as Input to Downstream Data Mining Models

The R Zone

R References

Exercises

Chapter 21: BIRCH Clustering

21.1 Rationale for BIRCH Clustering

21.2 Cluster Features

21.3 Cluster Feature Tree

21.4 Phase 1: Building The CF Tree

21.5 Phase 2: Clustering The Sub-Clusters

21.6 Example of BIRCH Clustering, Phase 1: Building The CF Tree

21.7 Example of BIRCH Clustering, Phase 2: Clustering The Sub-Clusters

21.8 Evaluating The Candidate Cluster Solutions

21.9 Case Study: Applying BIRCH Clustering to The Bank Loans Data Set

The R Zone

R References

Exercises

Chapter 22: Measuring Cluster Goodness

22.1 Rationale for Measuring Cluster Goodness

22.2 The Silhouette Method

22.3 Silhouette Example

22.4 Silhouette Analysis of the IRIS Data Set

22.5 The Pseudo-F Statistic

22.6 Example of the Pseudo-F Statistic

22.7 Pseudo-F Statistic Applied to the IRIS Data Set

22.8 Cluster Validation

22.9 Cluster Validation Applied to the Loans Data Set

The R Zone

R References

Exercises

Part V: Association Rules

Chapter 23: Association Rules

23.1 Affinity Analysis and Market Basket Analysis

23.2 Support, Confidence, Frequent Itemsets, and the A Priori Property

23.3 How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets

23.4 How Does The A Priori Algorithm Work (Part 2)? Generating Association Rules

23.5 Extension From Flag Data to General Categorical Data

23.6 Information-Theoretic Approach: Generalized Rule Induction Method

23.7 Association Rules are Easy to do Badly

23.8 How Can We Measure the Usefulness of Association Rules?

23.9 Do Association Rules Represent Supervised or Unsupervised Learning?

23.10 Local Patterns Versus Global Models

The R Zone

R References

Exercises

Part VI: Enhancing Model Performance

Chapter 24: Segmentation Models

24.1 The Segmentation Modeling Process

24.2 Segmentation Modeling Using EDA to Identify the Segments

24.3 Segmentation Modeling using Clustering to Identify the Segments

The R Zone

R References

Exercises

Chapter 25: Ensemble Methods: Bagging and Boosting

25.1 Rationale for Using an Ensemble of Classification Models

25.2 Bias, Variance, and Noise

25.3 When to Apply, and not to apply, Bagging

25.4 Bagging

25.5 Boosting

25.6 Application of Bagging and Boosting Using IBM/SPSS Modeler

References

The R Zone

R Reference

Exercises

Chapter 26: Model Voting and Propensity Averaging

26.1 Simple Model Voting

26.2 Alternative Voting Methods

26.3 Model Voting Process

26.4 An Application of Model Voting

26.5 What is Propensity Averaging?

26.6 Propensity Averaging Process

26.7 An Application of Propensity Averaging

The R Zone

R References

Exercises

Hands-On Analysis

Part VII: Further Topics

Chapter 27: Genetic Algorithms

27.1 Introduction To Genetic Algorithms

27.2 Basic Framework of a Genetic Algorithm

27.3 Simple Example of a Genetic Algorithm at Work

27.4 Modifications and Enhancements: Selection

27.5 Modifications and Enhancements: Crossover

27.6 Genetic Algorithms for Real-Valued Variables

27.7 Using Genetic Algorithms to Train a Neural Network

27.8 WEKA: Hands-On Analysis Using Genetic Algorithms

The R Zone

R References

Chapter 28: Imputation of Missing Data

28.1 Need for Imputation of Missing Data

28.2 Imputation of Missing Data: Continuous Variables

28.3 Standard Error of the Imputation

28.4 Imputation of Missing Data: Categorical Variables

28.5 Handling Patterns in Missingness

Reference

The R Zone

R References

Part VIII: Case Study: Predicting Response to Direct-Mail Marketing

Chapter 29: Case Study, Part 1: Business Understanding, Data Preparation, and EDA

29.1 Cross-Industry Standard Process for Data Mining

29.2 Business Understanding Phase

29.3 Data Understanding Phase, Part 1: Getting a Feel for the Data Set

29.4 Data Preparation Phase

29.5 Data Understanding Phase, Part 2: Exploratory Data Analysis

Chapter 30: Case Study, Part 2: Clustering and Principal Components Analysis

30.1 Partitioning the Data

30.2 Developing the Principal Components

30.3 Validating the Principal Components

30.4 Profiling the Principal Components

30.5 Choosing the Optimal Number of Clusters Using BIRCH Clustering

30.6 Choosing the Optimal Number of Clusters Using k-Means Clustering

30.7 Application of k-Means Clustering

30.8 Validating the Clusters

30.9 Profiling the Clusters

Chapter 31: Case Study, Part 3: Modeling and Evaluation for Performance and Interpretability

31.1 Do You Prefer the Best Model Performance, or a Combination of Performance and Interpretability?

31.2 Modeling and Evaluation Overview

31.3 Cost-Benefit Analysis Using Data-Driven Costs

31.4 Variables to be Input to the Models

31.5 Establishing the Baseline Model Performance

31.6 Models That Use Misclassification Costs

31.7 Models That Need Rebalancing as a Surrogate for Misclassification Costs

31.8 Combining Models Using Voting and Propensity Averaging

31.9 Interpreting the Most Profitable Model

Chapter 32: Case Study, Part 4: Modeling and Evaluation for High Performance Only

32.1 Variables to be Input to the Models

32.2 Models that use Misclassification Costs

32.3 Models that Need Rebalancing as a Surrogate for Misclassification Costs

32.4 Combining Models using Voting and Propensity Averaging

32.5 Lessons Learned

32.6 Conclusions

Appendix A: Data Summarization and Visualization

Part 1: Summarization 1: Building Blocks of Data Analysis

Part 2: Visualization: Graphs and Tables for Summarizing and Organizing Data

Part 3: Summarization 2: Measures of Center, Variability, and Position

Part 4: Summarization and Visualization of Bivariate Relationships

Index

End User License Agreement



List of Illustrations

Figure 1.1

Figure 1.2

Figure 1.3

Figure 2.1

Figure 2.2

Figure 2.3

Figure 2.4

Figure 2.5

Figure 2.6

Figure 2.7

Figure 2.8

Figure 2.9

Figure 2.10

Figure 2.11

Figure 2.12

Figure 2.13

Figure 2.14

Figure 2.15

Figure 2.16

Figure 2.17

Figure 2.18

Figure 2.19

Figure 2.20

Figure 2.21

Figure 2.22

Figure 3.1

Figure 3.2

Figure 3.3

Figure 3.4

Figure 3.5

Figure 3.6

Figure 3.7

Figure 3.8

Figure 3.9

Figure 3.10

Figure 3.11

Figure 3.12

Figure 3.13

Figure 3.14

Figure 3.15

Figure 3.16

Figure 3.17b

Figure 3.18b

Figure 3.19a

Figure 3.20

Figure 3.21

Figure 3.22

Figure 3.23

Figure 3.24

Figure 3.25

Figure 3.26

Figure 3.27

Figure 3.28

Figure 3.29

Figure 4.1

Figure 4.2

Figure 4.3

Figure 4.4

Figure 4.6

Figure 4.7

Figure 5.1

Figure 5.2

Figure 5.3

Figure 5.4

Figure 6.1

Figure 6.2

Figure 7.1

Figure 7.2

Figure 7.3

Figure 7.4

Figure 8.1

Figure 8.2

Figure 8.3

Figure 8.4

Figure 8.5

Figure 8.6

Figure 8.7

Figure 8.8

Figure 8.9

Figure 8.10

Figure 8.11

Figure 8.12

Figure 8.13

Figure 8.14

Figure 8.15

Figure 8.16

Figure 8.17

Figure 8.18

Figure 8.19

Figure 8.20

Figure 8.21

Figure 8.22

Figure 8.23

Figure 9.1

Figure 9.2

Figure 9.3

Figure 9.4

Figure 9.5

Figure 9.6

Figure 9.7

Figure 9.8

Figure 9.9

Figure 9.10

Figure 9.11

Figure 9.12

Figure 9.13

Figure 9.14

Figure 9.15

Figure 9.16

Figure 9.17

Figure 9.18

Figure 10.1

Figure 10.2

Figure 10.3

Figure 10.4

Figure 10.5

Figure 11.1

Figure 11.2

Figure 11.3

Figure 11.4

Figure 11.5

Figure 11.6

Figure 11.7

Figure 11.8

Figure 11.9

Figure 12.1

Figure 12.2

Figure 12.3

Figure 12.4

Figure 12.5

Figure 12.6

Figure 12.7

Figure 12.8

Figure 12.9

Figure 12.10

Figure 13.1

Figure 13.2

Figure 13.3

Figure 13.4

Figure 13.5

Figure 13.6

Figure 13.7

Figure 14.1

Figure 14.2

Figure 14.3

Figure 14.4

Figure 14.5

Figure 14.6

Figure 15.1

Figure 15.2

Figure 15.3

Figure 15.4

Figure 16.1a–c

Figure 18.1

Figure 18.2

Figure 18.3

Figure 18.4

Figure 18.5

Figure 18.6

Figure 18.7

Figure 18.8

Figure 19.1

Figure 19.2

Figure 19.3

Figure 19.4

Figure 19.5

Figure 19.6

Figure 19.7

Figure 19.8

Figure 19.9

Figure 19.10

Figure 19.11

Figure 20.1

Figure 20.2

Figure 20.3

Figure 20.4

Figure 20.5

Figure 20.6

Figure 20.7

Figure 20.8

Figure 20.9

Figure 20.10

Figure 21.1

Figure 21.2

Figure 21.3

Figure 21.4

Figure 21.5

Figure 21.6

Figure 21.7

Figure 21.8

Figure 21.9

Figure 21.10

Figure 21.11

Figure 21.12

Figure 21.13

Figure 21.14

Figure 22.1

Figure 22.2

Figure 22.3

Figure 22.4

Figure 22.5

Figure 22.6

Figure 22.7

Figure 22.8

Figure 22.9

Figure 22.10

Figure 22.11

Figure 22.12

Figure 23.1

Figure 23.2

Figure 23.3

Figure 23.4

Figure 23.5

Figure 24.1

Figure 24.2

Figure 24.3

Figure 24.4

Figure 24.5

Figure 24.6

Figure 24.7

Figure 24.8

Figure 24.9

Figure 25.1

Figure 25.2

Figure 25.3

Figure 25.4

Figure 25.5

Figure 25.6

Figure 25.7

Figure 25.8

Figure 25.9

Figure 25.10

Figure 26.1

Figure 26.2

Figure 26.3

Figure 27.1

Figure 27.2

Figure 27.3

Figure 27.4

Figure 27.5

Figure 27.6

Figure 27.7

Figure 27.8

Figure 27.9

Figure 27.10

Figure 27.11

Figure 28.1

Figure 28.2

Figure 29.1

Figure 29.2

Figure 29.3

Figure 29.4

Figure 29.5

Figure 29.6

Figure 29.7

Figure 29.8

Figure 29.9

Figure 29.10

Figure 29.11

Figure 29.12

Figure 29.13

Figure 29.14

Figure 29.15

Figure 29.16

Figure 29.17

Figure 29.18

Figure 29.19

Figure 29.20

Figure 29.21

Figure 29.22

Figure 29.23

Figure 29.24

Figure 29.25

Figure 29.26

Figure 29.27

Figure 29.28

Figure 29.29

Figure 30.1

Figure 30.2

Figure 30.3

Figure 30.4

Figure 30.5

Figure 30.6

Figure 30.7

Figure 30.8

Figure 30.11

Figure 30.12

Figure 30.13

Figure 30.14

Figure 31.1

Figure 31.2

Figure 31.3

Figure 31.4

Figure 31.5

Figure 32.1

Figure 32.2

Figure A.1

Figure A.2

Figure A.3

Figure A.4

Figure A.5

Figure A.6

Figure A.7

Figure A.8

Figure A.9

List of Tables

Table 1.1

Table 1.2

Table 2.1

Table 2.2

Table 2.3

Table 3.1

Table 3.2

Table 3.3

Table 3.4

Table 3.5

Table 3.6

Table 3.7

Table 3.8

Table 3.9

Table 3.10

Table 3.11

Table 3.12

Table 3.13

Table 3.14

Table 4.1

Table 4.2

Table 4.3

Table 4.4a

Table 4.4b

Table 4.5

Table 4.6

Table 4.7

Table 4.8

Table 4.9

Table 4.10

Table 4.11

Table 4.12

Table 4.13

Table 5.1

Table 5.2

Table 5.3

Table 5.4

Table 5.5

Table 5.6

Table 5.7

Table 6.1

Table 6.2

Table 6.3

Table 6.4

Table 6.5

Table 6.6

Table 6.7

Table 6.8

Table 6.9

Table 6.10

Table 6.12

Table 6.13

Table 7.1

Table 8.1

Table 8.2

Table 8.3

Table 8.4

Table 8.5

Table 8.6

Table 8.7

Table 8.8

Table 8.9

Table 8.10

Table 8.11

Table 8.12

Table 8.13

Table 8.14

Table 8.15

Table 8.16

Table 8.17

Table 8.18

Table 9.1

Table 9.2

Table 9.3

Table 9.4

Table 9.5

Table 9.6

Table 9.7

Table 9.8

Table 9.9

Table 9.10

Table 9.11

Table 9.12

Table 9.13

Table 9.14

Table 9.15

Table 9.16

Table 9.17

Table 9.18

Table 9.19

Table 9.20

Table 9.21

Table 9.22

Table 9.23

Table 9.24

Table 9.25

Table 9.26

Table 9.27

Table 10.1

Table 10.2

Table 10.3

Table 10.4

Table 10.5

Table 11.1

Table 11.2

Table 11.3

Table 11.4

Table 11.5

Table 11.6

Table 11.7

Table 11.8

Table 11.9

Table 11.10

Table 11.11

Table 12.1

Table 13.1

Table 13.2

Table 13.3

Table 13.4

Table 13.5

Table 13.6

Table 13.7

Table 13.8

Table 13.9

Table 13.10

Table 13.11

Table 13.12

Table 13.13

Table 13.14

Table 13.15

Table 13.16

Table 13.17

Table 13.18

Table 13.19

Table 13.20

Table 13.21

Table 13.22

Table 13.23

Table 13.24

Table 13.25

Table 13.26

Table 13.27

Table 13.28

Table 13.29

Table 13.30

Table 13.31

Table 13.32

Table 13.33

Table 14.1

Table 14.2

Table 14.3

Table 14.4

Table 14.5

Table 14.6

Table 14.7

Table 14.8

Table 14.9

Table 14.10

Table 15.1

Table 15.2

Table 15.3

Table 15.4

Table 15.5

Table 15.6

Table 16.1

Table 16.2

Table 16.3

Table 16.4

Table 16.5

Table 16.6

Table 16.7

Table 16.8

Table 16.9

Table 16.10

Table 16.11

Table 16.12

Table 16.13

Table 16.14

Table 16.15

Table 17.1

Table 17.2

Table 17.3

Table 17.4

Table 17.5

Table 17.6

Table 17.7

Table 17.8

Table 17.9

Table 17.10

Table 17.11

Table 17.12

Table 17.13

Table 17.14

Table 17.15

Table 18.1

Table 19.1

Table 19.2

Table 19.3

Table 19.4

Table 19.5

Table 20.1

Table 21.1

Table 21.2

Table 21.3

Table 21.4

Table 21.5

Table 22.1

Table 22.2

Table 22.3

Table 22.4

Table 22.5

Table 23.1

Table 23.2

Table 23.3

Table 23.4

Table 23.5

Table 23.6

Table 23.7

Table 25.1

Table 25.2

Table 25.3

Table 25.4

Table 25.5

Table 25.6

Table 26.1

Table 26.2

Table 26.3

Table 26.4

Table 26.10

Table 26.11

Table 26.12

Table 27.1

Table 27.2

Table 27.3

Table 27.4

Table 27.5

Table 27.6

Table 27.7

Table 27.8

Table 29.1

Table 29.2

Table 29.3

Table 30.1

Table 30.2

Table 31.1

Table 31.2

Table 31.3

Table 31.4

Table 31.5

Table 31.6

Table 31.7

Table 31.8

Table 31.9

Table 31.10

Table 32.1

Table 32.2

Table 32.3

Table 32.4

Table 32.5

Table A.1

Table A.2

Table A.3

Table A.4

Table A.5

Data Mining and Predictive Analytics

Second Edition

DANIEL T. LAROSE

CHANTAL D. LAROSE

Copyright © 2015 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Larose, Daniel T.

Data mining and predictive analytics / Daniel T. Larose, Chantal D. Larose.

pages cm. – (Wiley series on methods and applications in data mining)

Includes bibliographical references and index.

ISBN 978-1-118-11619-7 (cloth)

1. Data mining. 2. Prediction theory. I. Larose, Chantal D. II. Title.

QA76.9.D343 L376 2015

006.3′12–dc23

2014043340

To those who have gone before us,

And to those who come after us,

In the Family Tree of Life…

Preface

What is Data Mining? What is Predictive Analytics?

Data mining is the process of discovering useful patterns and trends in large data sets.

Predictive analytics is the process of extracting information from large data sets in order to make predictions and estimates about future outcomes.

Data Mining and Predictive Analytics, by Daniel Larose and Chantal Larose, will enable you to become an expert in these cutting-edge, profitable fields.

Why is this Book Needed?

According to the research firm MarketsandMarkets, the global big data market is expected to grow by 26% per year from 2013 to 2018, from $14.87 billion in 2013 to $46.34 billion in 2018.1 Corporations and institutions worldwide are learning to apply data mining and predictive analytics, in order to increase profits. Companies that do not apply these methods will be left behind in the global competition of the twenty-first-century economy.
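As a quick arithmetic check of those figures (our own calculation, not from the report), the implied compound annual growth rate over 2013-2018 is about 25.5%, consistent with the quoted 26% per year:

start <- 14.87                 # global big data market, $ billion, 2013
end   <- 46.34                 # projected, $ billion, 2018
(end / start)^(1 / 5) - 1      # about 0.255, i.e., roughly 26% per year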

Humans are inundated with data in most fields. Unfortunately, most of this valuable data, which cost firms millions to collect and collate, are languishing in warehouses and repositories. The problem is that there are not enough trained human analysts available who are skilled at translating all of this data into knowledge, and thence up the taxonomy tree into wisdom. This is why this book is needed.

The McKinsey Global Institute reports2:

There will be a shortage of talent necessary for organizations to take advantage of big data. A significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data… We project that demand for deep analytical positions in a big data world could exceed the supply being produced on current trends by 140,000 to 190,000 positions. … In addition, we project a need for 1.5 million additional managers and analysts in the United States who can ask the right questions and consume the results of the analysis of big data effectively.

This book is an attempt to help alleviate this critical shortage of data analysts.

Data mining is becoming more widespread every day, because it empowers companies to uncover profitable patterns and trends from their existing databases. Companies and institutions have spent millions of dollars to collect gigabytes and terabytes of data, but are not taking advantage of the valuable and actionable information hidden deep within their data repositories. However, as the practice of data mining becomes more widespread, companies that do not apply these techniques are in danger of falling behind, and losing market share, because their competitors are applying data mining, and thereby gaining the competitive edge.

Who Will Benefit from this Book?

In Data Mining and Predictive Analytics, the step-by-step hands-on solutions of real-world business problems using widely available data mining techniques applied to real-world data sets will appeal to managers, CIOs, CEOs, CFOs, data analysts, database analysts, and others who need to keep abreast of the latest methods for enhancing return on investment.

Using Data Mining and Predictive Analytics, you will learn what types of analysis will uncover the most profitable nuggets of knowledge from the data, while avoiding the potential pitfalls that may cost your company millions of dollars. You will learn data mining and predictive analytics by doing data mining and predictive analytics.

Danger! Data Mining is Easy to do Badly

The growth of new off-the-shelf software platforms for performing data mining has kindled a new kind of danger. The ease with which these applications can manipulate data, combined with the power of the formidable data mining algorithms embedded in the black-box software, makes their misuse proportionally more hazardous.

In short, data mining is easy to do badly. A little knowledge is especially dangerous when it comes to applying powerful models based on huge data sets. For example, analyses carried out on unpreprocessed data can lead to erroneous conclusions, or inappropriate analysis may be applied to data sets that call for a completely different approach, or models may be derived that are built on wholly unwarranted specious assumptions. If deployed, these errors in analysis can lead to very expensive failures. Data Mining and Predictive Analytics will help make you a savvy analyst, who will avoid these costly pitfalls.
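To make the point about unpreprocessed data concrete, here is a small sketch of our own (not an example from the book): in any distance-based method, such as k-nearest neighbors or k-means, a variable measured on a large scale swamps the others unless the data are normalized first.

a <- c(age = 25, income = 50000)   # two hypothetical customers
b <- c(age = 65, income = 51000)
sqrt(sum((a - b)^2))               # about 1000.8: income dominates; the 40-year age gap barely registers
# After min-max normalization (assuming observed ranges 25-65 for age, 50000-55000 for income):
a_mm <- c(0, 0); b_mm <- c(1, 0.2)
sqrt(sum((a_mm - b_mm)^2))         # about 1.02: age now contributes on an equal footing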

“White-Box” Approach

Understanding the Underlying Algorithmic and Model Structures

The best way to avoid costly errors stemming from a blind black-box approach to data mining and predictive analytics is to instead apply a “white-box” methodology, which emphasizes an understanding of the algorithmic and statistical model structures underlying the software.

Data Mining and Predictive Analytics applies this white-box approach by:

clearly explaining why a particular method or algorithm is needed;

getting the reader acquainted with how a method or algorithm works, using a toy example (tiny data set), so that the reader may follow the logic step by step, and thus gain a white-box insight into the inner workings of the method or algorithm;

providing an application of the method to a large, real-world data set;

using exercises to test the reader's level of understanding of the concepts and algorithms;

providing an opportunity for the reader to experience doing some real data mining on large data sets.

Algorithm Walk-Throughs

Data Mining and Predictive Analytics walks the reader through the operations and nuances of the various algorithms, using small data sets, so that the reader gets a true appreciation of what is really going on inside the algorithm. For example, in Chapter 21, we follow step by step as the balanced iterative reducing and clustering using hierarchies (BIRCH) algorithm works through a tiny data set, showing precisely how BIRCH chooses the optimal clustering solution for this data, from start to finish. As far as we know, such a demonstration is unique to this book for the BIRCH algorithm. Also, in Chapter 27, we proceed step by step to find the optimal solution using the genetic algorithm operators of selection, crossover, and mutation, using a tiny data set, so that the reader may better understand the underlying processes.

Applications of the Algorithms and Models to Large Data Sets

Data Mining and Predictive Analytics provides examples of the application of data analytic methods on actual large data sets. For example, in Chapter 9, we analytically unlock the relationship between nutrition rating and cereal content using a real-world data set. In Chapter 4, we apply principal components analysis to real-world census data about California. All data sets are available from the book series web site: www.dataminingconsultant.com.
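As a flavor of what such an application looks like in R (a sketch under our own assumptions: houses_df stands for a data frame of numeric predictors, in the spirit of the houses data set), principal components analysis can be run with the built-in prcomp function:

pca <- prcomp(houses_df, center = TRUE, scale. = TRUE)  # standardize, then extract components
summary(pca)       # proportion of variance explained: how many components should we extract?
pca$rotation       # component weights, used for profiling the principal components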

Chapter Exercises: Checking to Make Sure You Understand It

Data Mining and Predictive Analytics includes over 750 chapter exercises, which allow readers to assess their depth of understanding of the material, as well as have a little fun playing with numbers and data. These include Clarifying the Concept exercises, which help to clarify some of the more challenging concepts in data mining, and Working with the Data exercises, which challenge the reader to apply the particular data mining algorithm to a small data set, and, step by step, to arrive at a computationally sound solution. For example, in Chapter 14, readers are asked to find the maximum a posteriori classification for the data set and network provided in the chapter.
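As a miniature worked illustration of the MAP idea (our own made-up numbers, not the chapter's network): the MAP classification chooses the class that maximizes prior times likelihood.

prior      <- c(churn = 0.15, stay = 0.85)   # hypothetical class priors
likelihood <- c(churn = 0.60, stay = 0.20)   # hypothetical P(evidence | class)
posterior  <- prior * likelihood             # proportional to the posterior probabilities
names(which.max(posterior))                  # "stay": 0.17 beats 0.09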

Hands-On Analysis: Learn Data Mining by Doing Data Mining

Most chapters provide the reader with Hands-On Analysis problems, representing an opportunity for the reader to apply his or her newly acquired data mining expertise to solving real problems using large data sets. Many people learn by doing. Data Mining and Predictive Analytics provides a framework where the reader can learn data mining by doing data mining. For example, in Chapter 13, readers are challenged to approach a real-world credit approval classification data set and construct their best possible logistic regression model, using the methods learned in the chapter, and to provide strong interpretive support for the model, including explanations of any derived variables and indicator variables.
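For a feel of what that hands-on work involves, here is a minimal R sketch (ours; credit_df, approved, and the predictor names are hypothetical stand-ins, not the book's data set):

model <- glm(approved ~ income + debt_ratio + years_employed,
             data = credit_df, family = binomial)  # logistic regression for a binary (0/1) target
summary(model)     # are the predictors significant? (Wald tests, Section 13.4)
exp(coef(model))   # odds ratios, the natural scale for interpretation (Section 13.5)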

Exciting New Topics

Data Mining and Predictive Analytics contains many exciting new topics, including the following:

Cost-benefit analysis using data-driven misclassification costs.

Cost-benefit analysis for trinary and k-nary classification models.

Graphical evaluation of classification models.

BIRCH clustering.

Segmentation models.

Ensemble methods: Bagging and boosting.

Model voting and propensity averaging.

Imputation of missing data.

The R Zone

R is a powerful, open-source language for exploring and analyzing data sets (www.r-project.org). Analysts using R can take advantage of many freely available packages, routines, and graphical user interfaces to tackle most data analysis problems. In most chapters of this book, the reader will find The R Zone, which provides the actual R code needed to obtain the results shown in the chapter, along with screenshots of some of the output.
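For instance, a minimal R Zone-style snippet (our own illustration, not copied from a chapter) applying the min-max normalization of Chapter 2 to a small numeric vector:

x <- c(10, 25, 40, 55, 70)
x_mm <- (x - min(x)) / (max(x) - min(x))   # min-max normalization to [0, 1] (Section 2.8)
x_mm                                       # 0.00 0.25 0.50 0.75 1.00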

Appendix: Data Summarization and Visualization

Some readers may be a bit rusty on some statistical and graphical concepts, usually encountered in an introductory statistics course. Data Mining and Predictive Analytics contains an appendix that provides a review of the most common concepts and terminology helpful for readers to hit the ground running in their understanding of the material in this book.

The Case Study: Bringing it all Together

Data Mining and Predictive Analytics culminates in a detailed Case Study. Here the reader has the opportunity to see how everything he or she has learned is brought all together to create actionable and profitable solutions. This detailed Case Study ranges over four chapters, and is as follows:

Chapter 29: Case Study, Part 1: Business Understanding, Data Preparation, and EDA

Chapter 30: Case Study, Part 2: Clustering and Principal Components Analysis

Chapter 31: Case Study, Part 3: Modeling and Evaluation for Performance and Interpretability

Chapter 32: Case Study, Part 4: Modeling and Evaluation for High Performance Only

The Case Study includes dozens of pages of graphical exploratory data analysis (EDA), predictive modeling, and customer profiling, and offers different solutions, depending on the requisites of the client. The models are evaluated using a custom-built data-driven cost-benefit table, reflecting the true costs of classification errors, rather than the usual methods, such as overall error rate. Thus, the analyst can compare models using the estimated profit per customer contacted, and can predict how much money the models will earn, based on the number of customers contacted.
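A minimal sketch (with made-up costs and counts of our own, not the Case Study's) of how a data-driven cost-benefit table converts a confusion matrix into estimated profit per customer contacted:

benefit <- c(TP = 26.40, FP = -2.00, FN = 0, TN = 0)    # net dollars per outcome (hypothetical)
counts  <- c(TP = 400, FP = 2600, FN = 600, TN = 6400)  # validation-set confusion matrix (hypothetical)
total_profit <- sum(benefit * counts)                   # 400 * 26.40 - 2600 * 2.00 = 5360
total_profit / sum(counts[c("TP", "FP")])               # about $1.79 per customer contacted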

How the Book is Structured

Data Mining and Predictive Analytics is structured in a way that the reader will hopefully find logical and straightforward. There are 32 chapters, divided into eight major parts.

Part 1, Data Preparation, consists of chapters on data preparation, EDA, and dimension reduction.

Part 2, Statistical Analysis, provides classical statistical approaches to data analysis, including chapters on univariate and multivariate statistical analysis, simple and multiple linear regression, preparing to model the data, and model building.

Part 3, Classification, contains nine chapters, making it the largest section of the book. Chapters include k-nearest neighbor, decision trees, neural networks, logistic regression, naïve Bayes, Bayesian networks, model evaluation techniques, cost-benefit analysis using data-driven misclassification costs, trinary and k-nary classification models, and graphical evaluation of classification models.

Part 4, Clustering, contains chapters on hierarchical clustering, k-means clustering, Kohonen networks clustering, BIRCH clustering, and measuring cluster goodness.

Part 5, Association Rules, consists of a single chapter covering a priori association rules and generalized rule induction.

Part 6, Enhancing Model Performance, provides chapters on segmentation models, ensemble methods: bagging and boosting, model voting, and propensity averaging.

Part 7, Further Topics, contains a chapter on imputation of missing data, along with a chapter on genetic algorithms.

Part 8, Case Study: Predicting Response to Direct-Mail Marketing, consists of four chapters presenting a start-to-finish detailed Case Study of how to generate the greatest profit from a direct-mail marketing campaign.

The Software

The software used in this book includes the following:

IBM SPSS Modeler (data mining software suite)

R (open source statistical software)

SAS Enterprise Miner

SPSS (statistical software)

Minitab (statistical software)

WEKA (open source data mining software)

IBM SPSS Modeler (www-01.ibm.com/software/analytics/spss/products/modeler/) is one of the most widely used data mining software suites, and is distributed by SPSS, whose base statistical software is also used in this book. SAS Enterprise Miner is probably more powerful than Modeler, but the learning curve is also steeper. SPSS is likewise available for download on a trial basis (search the web for "spss download"). Minitab is an easy-to-use statistical software package that is available for download on a trial basis from its web site at www.minitab.com.

Weka: The Open-Source Alternative

The Weka (Waikato Environment for Knowledge Analysis) machine learning workbench is open-source software issued under the GNU General Public License, which includes a collection of tools for completing many data mining tasks. Data Mining and Predictive Analytics presents several hands-on, step-by-step tutorial examples using Weka 3.6, along with input files available from the book's companion web site www.dataminingconsultant.com. The reader is shown how to carry out the following types of analysis using Weka: logistic regression (Chapter 13), naïve Bayes classification (Chapter 14), Bayesian networks classification (Chapter 14), and genetic algorithms (Chapter 27). For more information regarding Weka, see www.cs.waikato.ac.nz/ml/weka/. The author is deeply grateful to James Steck for providing these Weka examples and exercises. James Steck ([email protected]) was one of the first students to complete the master of science in data mining from Central Connecticut State University in 2005 (GPA 4.0), and received the first data mining Graduate Academic Award. James lives with his wife and son in Issaquah, WA.

The Companion Web Site: www.dataminingconsultant.com

The reader will find supporting materials, both for this book and for the other data mining books written by Daniel Larose and Chantal Larose for Wiley InterScience, at the companion web site, www.dataminingconsultant.com. There one may download the many data sets used in the book, so that the reader may develop a hands-on feel for the analytic methods and models encountered throughout the book. Errata are also available, as is a comprehensive set of data mining resources, including links to data sets, data mining groups, and research papers.

However, the real power of the companion web site is available to faculty adopters of the textbook, who will have access to the following resources:

Solutions to all the exercises, including the hands-on analyses.

PowerPoint® presentations of each chapter, ready for deployment in the classroom.

Sample data mining course projects, written by the author for use in his own courses, and ready to be adapted for your course.

Real-world data sets, to be used with the course projects.

Multiple-choice chapter quizzes.

Chapter-by-chapter web resources.

Adopters may e-mail Daniel Larose at [email protected] to request access information for the adopters' resources.

Data Mining and Predictive Analytics as a Textbook

Data Mining and Predictive Analytics naturally fits the role of textbook for a one-semester course or a two-semester sequence of courses in introductory and intermediate data mining. Instructors may appreciate:

the presentation of data mining as a process;

the "white-box" approach, emphasizing an understanding of the underlying algorithmic structures;

the algorithm walk-throughs with toy data sets;

the application of the algorithms to large real-world data sets;

the over 300 figures and over 275 tables;

the over 750 chapter exercises and hands-on analysis problems;

the many exciting new topics, such as cost-benefit analysis using data-driven misclassification costs;

the detailed Case Study, bringing together many of the lessons learned from the earlier 28 chapters;

the Appendix: Data Summarization and Visualization, containing a review of statistical and graphical concepts readers may be a bit rusty on;

the companion web site, providing the array of resources for adopters detailed above.

Data Mining and Predictive Analytics is appropriate for advanced undergraduate- or graduate-level courses. An introductory statistics course would be nice, but is not required. No computer programming or database expertise is required.

1 Big Data Market to Reach $46.34 Billion by 2018, by Darryl K. Taft, eWeek, www.eweek.com/database/big-data-market-to-reach-46.34-billion-by-2018.html, posted September 1, 2013, last accessed March 23, 2014.

2 Big data: The next frontier for innovation, competition, and productivity, by James Manyika et al., McKinsey Global Institute, www.mckinsey.com, May 2011, last accessed March 16, 2014.

Acknowledgments

Daniel's Acknowledgments

I would first like to thank my mentor Dr. Dipak K. Dey, distinguished professor of statistics, and associate dean of the College of Liberal Arts and Sciences at the University of Connecticut, as well as Dr. John Judge, professor of statistics in the Department of Mathematics at Westfield State College. My debt to the two of you is boundless, and now extends beyond one lifetime. Also, I wish to thank my colleagues in the data mining programs at Central Connecticut State University, Dr. Chun Jin, Dr. Daniel S. Miller, Dr. Roger Bilisoly, Dr. Darius Dziuda, and Dr. Krishna Saha. Thanks to my daughter Chantal, and to my twin children, Tristan Spring and Ravel Renaissance, for providing perspective on what life is about.

Daniel T. Larose, PhD

Professor of Statistics and Data Mining
Director, Data Mining @CCSU
www.math.ccsu.edu/larose

Chantal's Acknowledgments

I would first like to thank my PhD advisors, Dr. Dipak Dey, distinguished professor and associate dean, and Dr. Ofer Harel, associate professor, both of the Department of Statistics at the University of Connecticut. Their insight and understanding have framed and sculpted our exciting research program, including my PhD dissertation, “Model-Based Clustering of Incomplete Data.” Thanks also to my father, Daniel, for kindling my enduring love of data analysis, and to my mother, Debra, for her care and patience through many statistics-filled conversations. Finally, thanks to my siblings, Ravel and Tristan, for perspective, music, and friendship.

Chantal D. Larose, MS

Department of Statistics, University of Connecticut

Part I

Data Preparation

Chapter 1: An Introduction to Data Mining and Predictive Analytics

1.1 What is Data Mining? What Is Predictive Analytics?

Recently, the computer manufacturer Dell was interested in improving the productivity of its sales workforce. It therefore turned to data mining and predictive analytics to analyze its database of potential customers, in order to identify the most likely respondents. Researching the social network activity of potential leads, using LinkedIn and other sites, provided richer information about the potential customers, thereby allowing Dell to develop more personalized sales pitches to its clients. This is an example of mining customer data to help identify the type of marketing approach for a particular customer, based on the customer's individual profile. What is the bottom line? The number of prospects that needed to be contacted was cut by 50%, leaving only the most promising prospects, and leading to a near doubling of the productivity and efficiency of the sales workforce, with a similar increase in revenue for Dell.1

The Commonwealth of Massachusetts is wielding predictive analytics as a tool to cut down on the number of cases of Medicaid fraud in the state. When a Medicaid claim is made, the state now immediately passes it in real time to a predictive analytics model, in order to detect any anomalies. During its first 6 months of operation, the new system has “been able to recover $2 million in improper payments, and has avoided paying hundreds of thousands of dollars in fraudulent claims,” according to Joan Senatore, Director of the Massachusetts Medicaid Fraud Unit.2

The McKinsey Global Institute (MGI) reports3 that most American companies with more than 1000 employees had an average of at least 200 TB of stored data. MGI projects that the amount of data generated worldwide will increase by 40% annually, creating profitable opportunities for companies to leverage their data to reduce costs and increase their bottom line. For example, retailers harnessing this "big data" to best advantage could expect to realize an increase in their operating margin of more than 60%, according to the MGI report. And health-care providers and health maintenance organizations (HMOs) that properly leverage their data storehouses could achieve $300 billion in cost savings annually, through improved efficiency and quality.

Forbes magazine reports4 that the use of data mining and predictive analytics has helped to identify patients at the greatest risk of developing congestive heart failure. IBM collected 3 years of data pertaining to 350,000 patients, including measurements on over 200 factors, such as blood pressure, weight, and drugs prescribed. Using predictive analytics, IBM was able to identify the 8500 patients most at risk of dying of congestive heart failure within 1 year.

The MIT Technology Review reports5