Data Mining for Business Analytics - Galit Shmueli - E-Book

Data Mining for Business Analytics E-Book

Galit Shmueli

Description

An applied approach to data mining and predictive analytics with clear exposition, hands-on exercises, and real-life case studies. Readers will work with all of the standard data mining methods using the Microsoft® Office Excel® add-in XLMiner® to develop predictive models and learn how to obtain business value from Big Data. Featuring updated topical coverage on text mining, social network analysis, collaborative filtering, ensemble methods, uplift modeling and more, the Third Edition also includes:

* Real-world examples to build a theoretical and practical understanding of key data mining methods
* End-of-chapter exercises that help readers better understand the presented material
* Data-rich case studies to illustrate various applications of data mining techniques
* Completely new chapters on social network analysis and text mining
* A companion site with additional data sets, instructor's material that includes solutions to exercises and case studies, and Microsoft PowerPoint® slides: https://www.dataminingbook.com
* Free 140-day license to use XLMiner for Education software

Data Mining for Business Analytics: Concepts, Techniques, and Applications in XLMiner®, Third Edition is an ideal textbook for upper-undergraduate and graduate-level courses as well as professional programs on data mining, predictive modeling, and Big Data analytics. The new edition is also a unique reference for analysts, researchers, and practitioners working with predictive analytics in the fields of business, finance, marketing, computer science, and information technology.

Praise for the Second Edition

"...full of vivid and thought-provoking anecdotes... needs to be read by anyone with a serious interest in research and marketing." - Research Magazine

"Shmueli et al. have done a wonderful job in presenting the field of data mining - a welcome addition to the literature." - ComputingReviews.com

"Excellent choice for business analysts...The book is a perfect fit for its intended audience." - Keith McCormick, Consultant and Author of SPSS Statistics For Dummies, Third Edition and SPSS Statistics for Data Analysis and Visualization

Galit Shmueli, PhD, is Distinguished Professor at National Tsing Hua University's Institute of Service Science. She has designed and instructed data mining courses since 2004 at University of Maryland, Statistics.com, The Indian School of Business, and National Tsing Hua University, Taiwan. Professor Shmueli is known for her research and teaching in business analytics, with a focus on statistical and data mining methods in information systems and healthcare. She has authored over 70 journal articles, books, textbooks and book chapters.

Peter C. Bruce is President and Founder of the Institute for Statistics Education at www.statistics.com. He has written multiple journal articles and is the developer of Resampling Stats software. He is the author of Introductory Statistics and Analytics: A Resampling Perspective, also published by Wiley.

Nitin R. Patel, PhD, is Chairman and cofounder of Cytel, Inc., based in Cambridge, Massachusetts. A Fellow of the American Statistical Association, Dr. Patel has also served as a Visiting Professor at the Massachusetts Institute of Technology and at Harvard University. He is a Fellow of the Computer Society of India and was a professor at the Indian Institute of Management, Ahmedabad for 15 years.


Page count: 805

Year of publication: 2016




CONTENTS

Cover

Title Page

Copyright

Dedication

Foreword

Preface to the Third Edition

Preface to the First Edition

Acknowledgments

Part I: Preliminaries

Chapter 1: Introduction

1.1 What is Business Analytics?

1.2 What is Data Mining?

1.3 Data Mining and Related Terms

1.4 Big Data

1.5 Data Science

1.6 Why are There so Many Different Methods?

1.7 Terminology and Notation

1.8 Road Maps to This Book

Chapter 2: Overview of the Data Mining Process

2.1 Introduction

2.2 Core Ideas in Data Mining

2.3 The Steps in Data Mining

2.4 Preliminary Steps

2.5 Predictive Power and Overfitting

2.6 Building a Predictive Model with XLMiner

2.7 Using Excel for Data Mining

2.8 Automating Data Mining Solutions

Problems

Part II: Data Exploration and Dimension Reduction

Chapter 3: Data Visualization

3.1 Uses of Data Visualization

3.2 Data Examples

3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots

3.4 Multidimensional Visualization

3.5 Specialized Visualizations

3.6 Summary: Major Visualizations and Operations, by Data Mining Goal

Problems

Chapter 4: Dimension Reduction

4.1 Introduction

4.2 Curse of Dimensionality

4.3 Practical Considerations

4.4 Data Summaries

4.5 Correlation Analysis

4.6 Reducing the Number of Categories in Categorical Variables

4.7 Converting a Categorical Variable to a Numerical Variable

4.8 Principal Components Analysis

4.9 Dimension Reduction Using Regression Models

4.10 Dimension Reduction Using Classification and Regression Trees

Problems

Part III: Performance Evaluation

Chapter 5: Evaluating Predictive Performance

5.1 Introduction

5.2 Evaluating Predictive Performance

5.3 Judging Classifier Performance

5.4 Judging Ranking Performance

5.5 Oversampling

Problems

Part IV: Prediction and Classification Methods

Chapter 6: Multiple Linear Regression

6.1 Introduction

6.2 Explanatory vs. Predictive Modeling

6.3 Estimating the Regression Equation and Prediction

6.4 Variable Selection in Linear Regression

Problems

Chapter 7: k-Nearest-Neighbors (k-NN)

7.1 The k-NN Classifier (Categorical Outcome)

7.2 k-NN for a Numerical Response

7.3 Advantages and Shortcomings of k-NN Algorithms

Problems

Chapter 8: The Naive Bayes Classifier

8.1 Introduction

8.2 Applying the Full (Exact) Bayesian Classifier

8.3 Advantages and Shortcomings of the Naive Bayes Classifier

Problems

Chapter 9: Classification and Regression Trees

9.1 Introduction

9.2 Classification Trees

9.3 Evaluating the Performance of a Classification Tree

9.4 Avoiding Overfitting

9.5 Classification Rules from Trees

9.6 Classification Trees for More Than Two Classes

9.7 Regression Trees

9.8 Advantages, Weaknesses, and Extensions

9.9 Improving Prediction: Multiple Trees

Problems

Chapter 10: Logistic Regression

10.1 Introduction

10.2 The Logistic Regression Model

10.3 Evaluating Classification Performance

10.4 Example of Complete Analysis: Predicting Delayed Flights

10.5 Appendix: Logistic Regression for Profiling

Problems

Chapter 11: Neural Nets

11.1 Introduction

11.2 Concept and Structure of a Neural Network

11.3 Fitting a Network to Data

11.4 Required User Input

11.5 Exploring the Relationship Between Predictors and Response

11.6 Advantages and Weaknesses of Neural Networks

Problems

Chapter 12: Discriminant Analysis

12.1 Introduction

12.2 Distance of an Observation from a Class

12.3 Fisher's Linear Classification Functions

12.4 Classification Performance of Discriminant Analysis

12.5 Prior Probabilities

12.6 Unequal Misclassification Costs

12.7 Classifying More Than Two Classes

12.8 Advantages and Weaknesses

Problems

Chapter 13: Combining Methods: Ensembles and Uplift Modeling

13.1 Ensembles

13.2 Uplift (Persuasion) Modeling

13.3 Summary

Problems

Part V: Mining Relationships among Records

Chapter 14: Association Rules and Collaborative Filtering

14.1 Association Rules

14.2 Collaborative Filtering

14.3 Summary

Problems

Chapter 15: Cluster Analysis

15.1 Introduction

15.2 Measuring Distance Between Two Observations

15.3 Measuring Distance Between Two Clusters

15.4 Hierarchical (Agglomerative) Clustering

15.5 Non-hierarchical Clustering: The k-Means Algorithm

Problems

Part VI: Forecasting Time Series

Chapter 16: Handling Time Series

16.1 Introduction

16.2 Descriptive vs. Predictive Modeling

16.3 Popular Forecasting Methods in Business

16.4 Time Series Components

16.5 Data Partitioning and Performance Evaluation

Problems

Chapter 17: Regression-Based Forecasting

17.1 A Model with Trend

17.2 A Model with Seasonality

17.3 A Model with Trend and Seasonality

17.4 Autocorrelation and ARIMA Models

Problems

Chapter 18: Smoothing Methods

18.1 Introduction

18.2 Moving Average

18.3 Simple Exponential Smoothing

18.4 Advanced Exponential Smoothing

Problems

Part VII: Data Analytics

Chapter 19: Social Network Analytics

19.1 Introduction

19.2 Directed vs. Undirected Networks

19.3 Visualizing and Analyzing Networks

19.4 Social Data Metrics and Taxonomy

19.5 Using Network Metrics in Prediction and Classification

19.6 Advantages and Disadvantages

Problems

Chapter 20: Text Mining

20.1 Introduction

20.2 The Spreadsheet Representation of Text: “Bag-of-Words”

20.3 Bag-of-Words vs. Meaning Extraction at Document Level

20.4 Preprocessing the Text

20.5 Implementing Data Mining Methods

20.6 Example: Online Discussions on Autos and Electronics

20.7 Summary

Problems

Part VIII: Cases

Chapter 21: Cases

21.1 Charles Book Club

21.2 German Credit

21.3 Tayko Software Cataloger

21.4 Political Persuasion

21.5 Taxi Cancellations

21.6 Segmenting Consumers of Bath Soap

21.7 Direct-Mail Fundraising

21.8 Catalog Cross-Selling

21.9 Predicting Bankruptcy

21.10 Time Series Case: Forecasting Public Transportation Demand

References

Data Files Used in the Book

Index

End User License Agreement





Third Edition

Data Mining For Business Analytics

Concepts, Techniques, and Applications with XLMiner®

Galit Shmueli

Peter C. Bruce

Nitin R. Patel

Copyright © 2016 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data applied for.

Names: Shmueli, Galit, 1971- author. | Patel, Nitin R. (Nitin Ratilal), author. | Bruce, Peter C., 1953- author.

Title: Data mining for business analytics: concepts, techniques, and applications in Microsoft Office Excel with XLMiner / Galit Shmueli, Nitin R. Patel, Peter C. Bruce.

Other titles: Data mining for business intelligence

Description: Third edition. | Hoboken, New Jersey: John Wiley & Sons, 2016. | Includes index. | Originally published as: Data mining for business intelligence, 2007.

Identifiers: LCCN 2015040496 (print) | LCCN 2015047680 (ebook) | ISBN 9781118729274 (cloth) | ISBN 9781118729137 (Adobe PDF) | ISBN 9781118729243 (ePub)

Subjects: LCSH: Business–Data processing. | Data mining. | Microsoft Excel (Computer file)

Classification: LCC HF5548.2.S44843 2016 (print) | LCC HF5548.2 (ebook) | DDC 005.54–dc23

LC record available at http://lccn.loc.gov/2015040496

ISBN: 9781118729274

 

XLMiner for Education

www.solver.com/xlminer

Your new textbook, Data Mining for Business Analytics: Concepts, Techniques, and Applications with XLMiner®, Third Edition, uses this software throughout. Here's how to get it for your course.

For Instructors: Setting Up the Course Code

To set up a course code for your course, please email Frontline Systems at [email protected], or call 775-831-0300, press 0, and ask for the Academic Coordinator. Course codes MUST be renewed each time you teach your course.

The course code is free, and it can usually be issued within 24 to 48 hours (often the same day). It will enable your students to download and install XLMiner for Education with a semester-long (140 day) license, and will enable Frontline Systems to assist students with installation, and provide technical support to you during the course.

Please give the course code, plus the instructions, to your students. If you're evaluating the book for adoption, you can use the course code yourself to download and install the software as described below.

For Students: Installing XLMiner for Education

To download and install XLMiner for Education from Frontline Systems, to work with Microsoft Excel for Windows, please visit:

www.solver.com/xlminer-wiley

Fill out the registration form on this page, supplying your name, school, email address (key information will be sent to this address), Course Code (obtain this from your instructor), and Textbook Code (enter SDMBI3).

On the download page, change 32-bit to 64-bit ONLY if you've confirmed that you have 64-bit Excel (see below).

Click the Download Now button, and save the downloaded file (SolverSetup.exe or SolverSetup64.exe).

Close any Excel windows you have open.

Run SolverSetup/SolverSetup64 to install the software.

When prompted, enter the installation password and the license activation code contained in the email sent to the address you entered on the form above.

If you have problems downloading or installing, please email [email protected] or call 775-831-0300 and press 4 (tech support). Say that you have XLMiner for Education, and have your course code and textbook code available.

If you have problems setting up or solving your model, or interpreting the results, please ask your instructor for assistance. Frontline Systems cannot help you with homework problems.

If you purchase this textbook but you aren't enrolled in a course, call 775-831-0300 and press 0 for assistance with the software.

If you have a Mac, you'll need to install “dual-boot” or VM software, Microsoft Windows, and Office or Excel for Windows first. Excel for Mac will NOT work. For more information see www.solver.com/using-frontline-solvers-macintosh. Ask your instructor if you can use your browser and xlminer.com.

For Excel 2007, always download SolverSetup. In Excel 2010, choose File ≫ Help and look in the lower right. In Excel 2016 and Excel 2013, choose File ≫ Account ≫ About Excel and look at the top of the dialog. Download SolverSetup64 ONLY if you see “64-bit” displayed.

Dedication

To our families
Boaz and Noa
Liz, Lisa, and Allison
Tehmi, Arjun, and in memory of Aneesh

Foreword

Data is the new gold and mining this gold to create business value in today's context of a highly networked and digital society requires a skillset that we haven't traditionally delivered in business or statistics or even engineering programs on their own. For those businesses and organizations that feel overwhelmed by today's big data, the phrase you ain't seen nothing yet comes to mind. Yesterday's three major sources of big data—the 20+ years of investment in enterprise systems (ERP, CRM, SCM, . . .), the three billion plus people on the online social grid, and the close to 5 billion people carrying increasingly sophisticated mobile devices—are going to be dwarfed by tomorrow's smarter physical ecosystems fueled by the Internet of Things (IoT) movement.

The idea that we can use sensors to connect physical objects such as homes, automobiles, roads, even garbage bins, and street lights, to digitally optimized systems of governance goes hand in glove with bigger data and the need for deeper analytical capabilities. We are not far away from a smart refrigerator sensing that you are short on, say, eggs, populating your grocery store's mobile app's shopping list, and arranging a Task Rabbit to do a grocery run for you. Or the refrigerator negotiating a deal with an Uber driver to deliver an evening meal to you. Nor are we far away from sensors embedded in roads and vehicles that can compute traffic congestion, track roadway wear and tear, record vehicle use, and factor these into dynamic usage-based pricing, insurance rates, and even taxation. This brave new world is going to be fueled by analytics and the ability to harness data for competitive advantage.

Business Analytics is an emerging discipline that is going to help us ride this new wave. This new Business Analytics discipline requires individuals who are grounded in the fundamentals of business such that they know the right questions to ask, who have the ability to harness, store, and optimally process vast datasets from a variety of structured and unstructured sources, and who can then use an array of techniques from machine learning and statistics to uncover new insights for decision making. Such individuals are a rare commodity today, but their creation has been the focus of this book for close to a decade now. This book's forte is that it relies on explaining the core set of concepts required for today's business analytics professionals using real-world data-rich cases in a hands-on manner, without sacrificing academic rigor. I say this with the confidence of someone who was probably the first adopter of the zeroth edition of this book (Spring 2006 at the Indian School of Business).

What particularly pleases me about the third edition is the addition of chapters on social network analytics and text mining, as well as a more detailed focus on ensemble methods that are going to become increasingly useful in more complex application domains. Also completing the picture are treatments of collaborative filtering and recommendation engines. All in all, this edition represents a comprehensive treatment of business analytics that goes beyond business intelligence, the notion of viewing the existing data that you have in a richer manner to get insights. Instead, it provides a modern-day foundation for Business Analytics, the notion of linking the X's to the Y's of interest in a predictive sense.

I look forward to using it in multiple fora, from executive education to MBA classrooms to specialized Business Analytics programs. I trust you will too!

Ravi Bapna

Carlson School of Management, University of Minnesota, 2016

Preface to the Third Edition

Since the book's first appearance in early 2007, it has been used by numerous practitioners and in many courses, ranging from dedicated data mining classes to more general business analytics courses (including our own experience teaching this material both online and in person for more than 10 years). Following feedback from instructors teaching MBA, undergraduate, and executive courses, and from students, we revised some of the existing materials as well as added new material.

The first noticeable change is the title: we now use business analytics in place of business intelligence. This update reflects the change in terminology since the second edition: BI today refers mainly to reporting and data visualization (“what is happening now”), while BA has taken over the “advanced analytics,” which include predictive analytics and data mining. In this new edition we therefore also updated these terms in the book, using them as is currently common.

We added the new Part VII on Data Analytics, covering two new topics: Social Network Analysis and Text Mining. The Data Analytics chapters expand data mining into the realm of new data structures: networks and text. As in the rest of the book, Excel-based tools are used to present these topics.

The new chapter on Social Network Analysis (Chapter 19) introduces metrics and graphs that help you understand connections among people and entities, and use that information to understand social networks, and to contribute additional depth to predictive models.

The new Text Mining chapter (Chapter 20) discusses how to process text, and convert it into a useful form for predictive modeling, where the modeling then follows the same paradigm as presented earlier in the book.

Another new chapter in the third edition is “Combining Methods: Ensembles and Uplift Modeling” (Chapter 13). This chapter, which is the last in Part IV on “Prediction and Classification Methods,” introduces two important approaches. The first, ensembles, is the combination of multiple models to improve predictive power. Ensembles have routinely proved their usefulness in practical applications and in data mining contests. The second, uplift modeling, introduces an improved approach for measuring the impact of an intervention, and also introduces the application of analytics by governments. Like the other chapters, the new chapters include real-world examples and end-of-chapter problems.

In addition to the new chapters, we extended coverage of methods in some of the chapters. The chapter on association rules is now expanded to recommendation algorithms, with an additional section on the popular collaborative filtering approach.

The Cases chapter includes two new cases: one on political persuasion and uplift modeling, and another on taxi cancellations. Both use real data.

An important update in the third edition is an extensive revision of each of the examples, boxes with special topics, and exercises that rely on XLMiner software. Since the second edition, XLMiner has undergone several upgrades that improved speed, functionality, presentation, and algorithmic implementation. The XLMiner screenshots in the third edition are based on using the latest XLMiner version (currently, V2015-R2).

Since the second edition's appearance, the landscape of the courses using the textbook has greatly expanded: whereas initially the book was used mainly in semester-long elective MBA-level courses, it is now used in a variety of courses in Business Analytics degree and certificate programs, ranging from undergraduate programs, to postgraduate and executive education programs. Courses in such programs also vary in their duration and coverage. In many schools, our book is used across multiple courses. The book is designed to continue supporting the general “Predictive Analytics” or “Data Mining” course as well as supporting a set of courses in dedicated business analytics programs.

A general “Business Analytics,” “Predictive Analytics,” or “Data Mining” course, common in MBA and undergraduate programs as a one-semester elective, would cover Parts I to III, and choose a subset of methods from Parts IV and V. Instructors can choose to use Cases as team assignments, class discussions, or projects. For a two-semester course, Part VI might be considered, and we recommend introducing the new Part VII (Data Analytics).

For a set of courses in a dedicated business analytics program, here are a few courses that have been using our book:

Predictive Analytics: Supervised Learning

In a dedicated Business Analytics program, the topic of Predictive Analytics is typically instructed across a set of courses. The first course would cover Parts I to IV and instructors typically choose a subset of methods from Part IV according to the course length. We recommend including the new Chapter 3 in such a course, as well as the new “Part VII: Data Analytics.”

Predictive Analytics: Unsupervised Learning

This course introduces data exploration and visualization, dimension reduction, mining relationships, and clustering (Parts III and V). If this course follows the Predictive Analytics: Supervised Learning course, then it is useful to examine examples and approaches that integrate unsupervised and supervised learning, such as the new part on “Data Analytics.”

Forecasting Analytics

A dedicated course on time series forecasting would rely on Part VI.

Advanced Analytics

A course that integrates the learnings from Predictive Analytics (supervised and unsupervised learning). Such a course can focus on Part VII: Data Analytics, where social network analytics and text mining are introduced. Some instructors choose to use the Cases chapter in such a course.

In all courses, we strongly recommend including a project component, where data are either collected by students according to their interest or provided by the instructor (e.g., from the many data mining competition datasets available). From our experience and other instructors' experience, such projects enhance the learning and provide students with an excellent opportunity to understand the strengths of data mining and the challenges that arise in the process.

Important Note: A cloud-based version of XLMiner is now available on the web. It significantly expands the range of potential users, freeing them from the constraints of using Excel or Windows. The cloud version functions nearly identically to the Excel-based version illustrated in these pages, so it can be used effectively with this book.

Preface to the First Edition

This book arose out of a data mining course at MIT's Sloan School of Management and was refined during its use in data mining courses at the University of Maryland's R. H. Smith School of Business and at statistics.com. Preparation for the course revealed that there are a number of excellent books on the business context of data mining, but their coverage of the statistical and machine-learning algorithms that underlie data mining is not sufficiently detailed to provide a practical guide if the instructor's goal is to equip students with the skills and tools to implement those algorithms. On the other hand, there are also a number of more technical books about data mining algorithms, but these are aimed at the statistical researcher or more advanced graduate student, and do not provide the case-oriented business focus that is successful in teaching business students.

Hence, this book is intended for the business student (and practitioner) of data mining techniques, and its goal is threefold:

To provide both a theoretical and a practical understanding of the key methods of classification, prediction, reduction, and exploration that are at the heart of data mining.

To provide a business decision-making context for these methods.

Using real business cases, to illustrate the application and interpretation of these methods.

The presentation of the cases in the book is structured so that the reader can follow along and implement the algorithms on his or her own with a very low learning hurdle.

Just as a natural science course without a lab component would seem incomplete, a data mining course without practical work with actual data is missing a key ingredient. The MIT data mining course that gave rise to this book followed an introductory quantitative course that relied on Excel—this made its practical work universally accessible. Using Excel for data mining seemed a natural progression. An important feature of this book is the use of Excel, an environment familiar to business analysts. All required data mining algorithms (plus illustrative datasets) are provided in an Excel add-in, XLMiner. Data for both the cases and exercises are available at www.dataminingbook.com.

Although the genesis for this book lay in the need for a case-oriented guide to teaching data mining, analysts and consultants who are considering the application of data mining techniques in contexts where they are not currently in use will also find this a useful, practical guide.

Acknowledgments

The authors thank the many people who assisted us in improving the first edition and improving it further in the second edition and now in the third edition. Anthony Babinec, who has been using earlier editions of this book for years in his data mining courses at Statistics.com, provided us with detailed and expert corrections. Similarly, Dan Toy and John Elder IV greeted our project with enthusiasm and provided detailed and useful comments on earlier drafts.

Boaz Shmueli and Raquelle Azran gave detailed editorial comments and suggestions on the first two editions; Bruce McCullough and Adam Hughes did the same for the first edition. Noa Shmueli provided careful proofs of the third edition. Ravi Bapna, who used an early draft in a data mining course at the Indian School of Business, has provided invaluable comments and helpful suggestions since the book's start. Useful comments and feedback have also come from the many instructors, too numerous to mention, who have used the book in their classes. Susan Palocsay, Scott Nestler, Margret Bjarnadottir, and Mia Stephens provided suggestions and feedback on the second edition and drafts of the third edition.

From the Smith School of Business at the University of Maryland, colleagues Shrivardhan Lele, Wolfgang Jank, and Paul Zantek provided practical advice and comments. We thank Robert Windle, and MBA students Timothy Roach, Pablo Macouzet, and Nathan Birckhead for invaluable datasets. We also thank MBA students Rob Whitener and Daniel Curtis for the heatmap and map charts. And we thank the many MBA students from the University of Maryland and the Indian School of Business for fruitful discussions and interesting data mining projects that have helped shape and improve the book.

This book would not have seen the light of day without the nurturing support of the faculty at the Sloan School of Management at MIT. Our special thanks to Dimitris Bertsimas, James Orlin, Robert Freund, Roy Welsch, Gordon Kaufmann, and Gabriel Bitran. As teaching assistants for the data mining course at Sloan, Adam Mersereau gave detailed comments on the notes and cases that were the genesis of this book, Romy Shioda helped with the preparation of several cases and exercises used here, and Mahesh Kumar helped with the material on clustering. We are grateful to the MBA students at Sloan for stimulating discussions in the class that led to refinement of the notes as well as XLMiner.

Chris Albright, Wayne Winston, and Uday Karmarkar gave us helpful advice on the use of XLMiner. Anand Bodapati provided both data and advice. Jake Hofman from Microsoft Research and Sharad Borle assisted with data access. Suresh Ankolekar and Mayank Shah helped develop several cases and provided valuable pedagogical comments. Vinni Bhandari helped write the Charles Book Club case.

We would like to thank Marvin Zelen, L. J. Wei, and Cyrus Mehta at Harvard, as well as Anil Gore at Pune University, for thought-provoking discussions on the relationship between statistics and data mining. Our thanks to Richard Larson of the Engineering Systems Division, MIT, for sparking many stimulating ideas on the role of data mining in modeling complex systems. They helped us develop a balanced philosophical perspective on the emerging field of data mining.

Our thanks to Ajay Sathe, who energetically shepherded XLMiner's development, and to his colleagues on the initial XLMiner team: Suresh Ankolekar, Poonam Baviskar, Kuber Deokar, Rupali Desai, Yogesh Gajjar, Ajit Ghanekar, Ayan Khare, Bharat Lande, Dipankar Mukhopadhyay, S. V. Sabnis, Usha Sathe, Anurag Srivastava, V. Subramaniam, Ramesh Raman, and Sanhita Yeolkar. We also thank Dan Fylstra, the founder of Frontline Systems, who has overseen the incorporation of XLMiner into the Solver suite. Frontline's programming team members Eissa Nematollahi and Oleg Shirokikh have continued the extension of XLMiner's capabilities.

Steve Quigley at Wiley showed confidence in this book from the beginning and helped us navigate through the publishing process with great speed. Curt Hinrichs' vision, tips, and encouragement helped bring this book to the starting gate. Jon Gurstelle, Allison McGinniss, Sari Friedman, and Katrina Maceda at Wiley, and Shikha Pahuja from Thomson Digital, were all helpful and responsive as we finalized this third edition. We are also grateful to Ashwini Kumthekar, Achala Sabane, Michael Shapard, Amy Hendrickson, and Heidi Sestrich who assisted with typesetting, figures, and indexing, and to Valerie Troiano who has shepherded many instructors through the use of XLMiner and early drafts of this text.

We also thank Catherine Plaisant at the University of Maryland's Human–Computer Interaction Lab, who helped out in a major way by contributing exercises and illustrations to the data visualization chapter, Marietta Tretter at Texas A&M for her helpful comments and thoughts on the time series chapters, and Stephen Few and Ben Shneiderman for feedback and suggestions on the data visualization chapter and overall design tips.

Gregory Piatetsky-Shapiro, founder of KDNuggets.com, has been generous with his time and counsel over the many years of this project. Ken Strasma, founder of the microtargeting firm HaystaqDNA and director of targeting for the 2004 Kerry campaign and the 2008 Obama campaign, provided the scenario and data for the section on uplift modeling. We also thank Jen Golbeck, director of the Social Intelligence Lab at the University of Maryland and author of Analyzing the Social Web, whose book inspired our presentation in the chapter on Social Network Analytics.

We also thank the many students who have used and commented on this text at Statistics.com and, for troubleshooting and follow-up, the Statistics.com team, led by Kuber Deokar, Instructional Operations Supervisor, Dhanashree Vishwasrao, Assistant Teacher, and, with special gratitude for her hours of double-checking XLMiner, Shweta Jadhav.

Part I Preliminaries

Chapter 1 Introduction

1.1 What is Business Analytics?

Business analytics (BA) is the practice and art of bringing quantitative data to bear on decision making. The term means different things to different organizations.

Consider the role of analytics in helping newspapers survive the transition to a digital world. One tabloid newspaper with a working-class readership in Britain had launched a web version of the paper, and did tests on its home page to determine which images produced more hits: cats, dogs, or monkeys. This simple application, for this company, was considered analytics. By contrast, the Washington Post has a highly influential audience that is of interest to big defense contractors: it is perhaps the only newspaper where you routinely see advertisements for aircraft carriers. In the digital environment, the Post can track readers by time of day, location, and user subscription information. In this fashion, the display of the aircraft carrier advertisement in the online paper may be focused on a very small group of individuals—say, the members of the House and Senate Armed Services Committees who will be voting on the Pentagon's budget.

Business analytics, or more generically, analytics, includes a range of data analysis methods. Many powerful applications involve little more than counting, rule checking, and basic arithmetic. For some organizations, this is what is meant by analytics.

The next level of business analytics, now termed business intelligence (BI), refers to data visualization and reporting for understanding “what happened and what is happening.” This is done by use of charts, tables, and dashboards to display, examine, and explore data. BI, which earlier consisted mainly of generating static reports, has evolved into more user-friendly and effective tools and practices, such as creating interactive dashboards that allow the user not only to access real-time data but also to directly interact with it. Effective dashboards are those that tie directly into company data, and give managers a tool to quickly see what might not readily be apparent in a large complex database. One such tool for industrial operations managers displays customer orders in a single two-dimensional display, using color and bubble size as added variables, showing customer name, type of product, size of order, and length of time to produce.

Business analytics now typically includes BI as well as sophisticated data analysis methods, such as statistical models and data mining algorithms used for exploring data, quantifying and explaining relationships between measurements, and predicting new records. Methods like regression models are used to describe and quantify “on average” relationships (e.g., between advertising and sales), to predict new records (e.g., whether a new patient will react positively to a medication), and to forecast future values (e.g., next week's web traffic).

Readers familiar with earlier editions of this book might have noticed that the book title changed from Data Mining for Business Intelligence to Data Mining for Business Analytics in this edition. The change reflects the more recent term BA, which overtook the earlier term BI to denote advanced analytics. Today, BI is used to refer to data visualization and reporting.

Who Uses Predictive Analytics?

The widespread adoption of predictive analytics, coupled with the accelerating availability of data, has increased organizations' capabilities throughout the economy. A few examples:

Credit scoring: One long-established use of predictive modeling techniques for business prediction is credit scoring. A credit score is not some arbitrary judgment of credit-worthiness; it is based mainly on a predictive model that uses prior data to predict repayment behavior.

Future purchases: A more recent (and controversial) example is Target's use of predictive modeling to classify sales prospects as “pregnant” or “not-pregnant.” Those classified as pregnant could then be sent sales promotions at an early stage of pregnancy, giving Target a head start on a significant purchase stream.

Tax evasion: The US Internal Revenue Service found it was 25 times more likely to find tax evasion when enforcement activity was based on predictive models, allowing agents to focus on the most likely tax cheats (Siegel, 2013).

The business analytics toolkit also includes statistical experiments, the most common of which is known to marketers as A-B testing. These are often used for pricing decisions:

Orbitz, the travel site, found that it could price hotel options higher for Mac users than Windows users.

Staples online store found it could charge more for staplers if a customer lived far from a Staples store.

Beware the organizational setting where analytics is a solution in search of a problem: A manager, knowing that business analytics and data mining are hot areas, decides that her organization must deploy them too, to capture that hidden value that must be lurking somewhere. Successful use of analytics and data mining requires both an understanding of the business context where value is to be captured, and an understanding of exactly what the data mining methods do.

1.2 What is Data Mining?

In this book, data mining refers to business analytics methods that go beyond counts, descriptive techniques, reporting, and methods based on business rules. While we do introduce data visualization, which is commonly the first step into more advanced analytics, the book focuses mostly on the more advanced data analytics tools. Specifically, it includes statistical and machine-learning methods that inform decision making, often in automated fashion. Prediction is typically an important component, often at the individual level. Rather than “what is the relationship between advertising and sales,” we might be interested in “what specific advertisement, or recommended product, should be shown to a given online shopper at this moment?” Or we might be interested in clustering customers into different “personas” that receive different marketing treatment, then assigning each new prospect to one of these personas.

The era of big data has accelerated the use of data mining. Data mining methods, with their power and automaticity, have the ability to cope with huge amounts of data and extract value.

1.3 Data Mining and Related Terms

The field of analytics is growing rapidly, both in terms of the breadth of applications, and in terms of the number of organizations using advanced analytics. As a result there is considerable overlap and inconsistency of definitions.

The term data mining itself means different things to different people. To the general public, it may have a general, somewhat hazy and pejorative meaning of digging through vast stores of (often personal) data in search of something interesting. One major consulting firm has a “data mining department,” but its responsibilities are in the area of studying and graphing past data in search of general trends. And, to confuse matters, their more advanced predictive models are the responsibility of an “advanced analytics department.” Other terms that organizations use are predictive analytics, predictive modeling, and machine learning.

Data mining stands at the confluence of the fields of statistics and machine learning (also known as artificial intelligence). A variety of techniques for exploring data and building models have been around for a long time in the world of statistics: linear regression, logistic regression, discriminant analysis, and principal components analysis, for example. But the core tenets of classical statistics—computing is difficult and data are scarce—do not apply in data mining applications where both data and computing power are plentiful.

This gives rise to Daryl Pregibon's description of data mining as “statistics at scale and speed” (Pregibon, 1999). Another major difference between the fields of statistics and machine learning is the focus in statistics on inference from a sample to the population regarding an “average effect”—for example, “a $1 price increase will reduce average demand by 2 boxes.” In contrast, the focus in machine learning is on predicting individual records—“the predicted demand for person i given a $1 price increase is 1 box, while for person j it is 3 boxes.” The emphasis that classical statistics places on inference (determining whether a pattern or interesting result might have happened by chance in our sample) is absent from data mining.

In comparison to statistics, data mining deals with large datasets in an open-ended fashion, making it impossible to put the strict limits around the question being addressed that inference would require. As a result the general approach to data mining is vulnerable to the danger of overfitting, where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data but random peculiarities as well. In engineering terms, the model is fitting the noise, not just the signal.
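The following is a minimal sketch of overfitting in Python, using made-up numbers rather than data from this book. The training values follow the trend y = 2x with alternating noise of +1 and -1, and the holdout records are placed exactly on the trend to keep the arithmetic simple. The helper names (fit_line, fit_interpolating_polynomial, mse) are illustrative assumptions, and the interpolating polynomial merely stands in for any overly flexible model.

```python
# Hypothetical data: the underlying trend is y = 2x; the training values
# carry alternating noise of +1 / -1.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 1.0, 5.0, 5.0]
# Holdout records sit exactly on the trend to keep the arithmetic simple.
holdout_x = [0.5, 1.5, 2.5]
holdout_y = [1.0, 3.0, 5.0]


def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Polynomial passing through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            weight = 1.0
            for m, xm in enumerate(xs):
                if m != j:
                    weight *= (x - xm) / (xj - xm)
            total += yj * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model's predictions on the given records."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


line = fit_line(train_x, train_y)
wiggly = fit_interpolating_polynomial(train_x, train_y)

for name, model in [("line", line), ("interpolating polynomial", wiggly)]:
    print(name,
          "| training MSE:", round(mse(model, train_x, train_y), 3),
          "| holdout MSE:", round(mse(model, holdout_x, holdout_y), 3))
```

In this toy setting the polynomial reproduces the training data exactly (training MSE of 0) but has a holdout MSE of about 0.67, while the straight line has a training MSE of about 0.8 and a holdout MSE of about 0.11. The flexible model has fit the noise, not just the signal.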

In this book, we use the term machine learning to refer to algorithms that learn directly from data, especially local patterns, often in layered or iterative fashion. In contrast, we use statistical models to refer to methods that apply global structure to the data. A simple example is a linear regression model (statistical) versus a k-nearest-neighbors algorithm (machine learning). A given record would be treated by linear regression in accord with an overall linear equation that applies to all the records. In k-nearest-neighbors, that record would be classified in accord with the values of a small number of nearby records.
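The contrast between global and local structure can be sketched in a few lines of Python. The six records below are hypothetical, and the function names are assumptions chosen for illustration. To keep the two methods comparable, the sketch predicts a numerical value: linear regression applies one fitted equation to every record, while k-nearest-neighbors averages the outcomes of the k closest records.

```python
# Hypothetical records: (x, y) pairs with a jump between x = 3 and x = 4.
records = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0),
           (4.0, 10.0), (5.0, 11.0), (6.0, 12.0)]


def linear_regression_predict(records, x_new):
    """Global structure: one least-squares equation applies to all records."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * x_new


def knn_predict(records, x_new, k=3):
    """Local structure: average the y values of the k nearest records."""
    nearest = sorted(records, key=lambda r: abs(r[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


x_new = 2.2
global_prediction = linear_regression_predict(records, x_new)
local_prediction = knn_predict(records, x_new, k=3)
print("linear regression:", round(global_prediction, 3))
print("3-nearest-neighbors:", local_prediction)
```

For the new record at x = 2.2, the regression equation predicts about 3.19, pulled upward by the distant high-valued records, whereas the three nearest records give 2.0. For a categorical outcome, a majority vote among the neighbors would replace the average.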

Last, many practitioners, particularly those from the IT and computer science communities, use the term machine learning to refer to all the methods discussed in this book.

1.4 Big Data

Data mining and big data go hand in hand. Big data is a relative term: data today are big by reference to the past, and to the methods and devices available to deal with them. The challenge big data presents is often characterized by the four V's: volume, velocity, variety, and veracity. Volume refers to the amount of data. Velocity refers to the flow rate, that is, the speed at which data are being generated and changed. Variety refers to the different types of data being generated (currency, dates, numbers, text, etc.). Veracity refers to the fact that data are being generated by organic distributed processes (e.g., millions of people signing up for services or free downloads) and are not subject to the controls or quality checks that apply to data collected for a study.

Most large organizations face both the challenge and the opportunity of big data because most routine data processes now generate data that can be stored and, possibly, analyzed. The scale can be visualized by comparing the data in a traditional statistical analysis (e.g., 15 variables and 5000 records) to the Walmart database. If you consider the traditional statistical study to be the size of a period at the end of a sentence, then the Walmart database is the size of a football field. And that probably does not include other data associated with Walmart—social media data, for example, which comes in the form of unstructured text. If the analytical challenge is substantial, so can be the reward:

OKCupid, the online dating site, uses statistical models with their data to predict what forms of message content are most likely to produce a response.

Telenor, a Norwegian mobile phone service company, was able to reduce subscriber turnover 37% by using models to predict which customers were most likely to leave, and then lavishing attention on them.

Allstate, the insurance company, tripled the accuracy of predicting injury liability in auto claims by incorporating more information about vehicle type.

The examples above are from Eric Siegel's book Predictive Analytics (2013, Wiley).

Some extremely valuable tasks were not even feasible before the era of big data. Consider web searches, the technology on which Google was built. In early days, a search for “Ricky Ricardo Little Red Riding Hood” would have yielded various links to the I Love Lucy TV show, other links to Ricardo's career as a band leader, and links to the children's story of Little Red Riding Hood. Only once the Google database had accumulated sufficient data (including records of what users clicked on) would the search yield, in the top position, links to the specific I Love Lucy episode in which Ricky enacts, in a comic mixture of Spanish and English, Little Red Riding Hood for his infant son.

1.5 Data Science

The ubiquity, size, value, and importance of big data have given rise to a new profession: the data scientist. Data science is a mix of skills in the areas of statistics, machine learning, math, programming, business, and IT. The term itself is thus broader than the other concepts we discussed above, and it is a rare individual who combines deep skills in all the constituent areas. In their book Analyzing the Analyzers (Harris et al., 2013), the authors describe the skillsets of most data scientists as resembling a “T”: deep in one area (the vertical bar of the T), and shallower in other areas (the top of the T).

At a large data science conference session (Strata-Hadoop World, October 2014) most attendees felt that programming was an essential skill, though there was a sizable minority who felt otherwise. And, although big data is the motivating power behind the growth of data science, most data scientists do not actually spend most of their time working with terabyte-size or larger data.

Data of the terabyte or larger size would be involved at the deployment stage of a model. There are manifold challenges at that stage, most of them IT and programming issues related to data handling and tying together different components of a system. Much work must precede that phase. It is that earlier piloting and prototyping phase on which this book focuses: developing the statistical and machine learning models that will eventually be plugged into a deployed system. What methods do you use with what sorts of data and problems? How do the methods work? What are their requirements, their strengths, their weaknesses? How do you assess their performance?

1.6 Why are There so Many Different Methods?

As can be seen in this book or any other resource on data mining, there are many different methods for prediction and classification. You might ask yourself why they coexist, and whether some are better than others. The answer is that each method has advantages and disadvantages. The usefulness of a method can depend on factors such as the size of the dataset, the types of patterns that exist in the data, whether the data meet some underlying assumptions of the method, how noisy the data are, and the particular goal of the analysis. A small illustration is shown in Figure 1.1, where the goal is to find a combination of household income level and household lot size that separates owners (solid circles) from nonowners (hollow circles) of riding mowers. The first method (left panel) looks only for horizontal and vertical lines to separate owners from nonowners; the second method (right panel) looks for a single diagonal line.

Figure 1.1 Two methods for separating owners from nonowners

Different methods can lead to different results, and their performance can vary. It is therefore customary in data mining to apply several different methods and select the one that appears most useful for the goal at hand.
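As a rough illustration of how two such methods can perform differently on the same records, the Python sketch below scores one axis-parallel rule (a vertical and a horizontal line combined) and one diagonal rule on eight households. The households, thresholds, and function names are all made up for this example; they are not the data behind Figure 1.1, and the rules are hand-picked rather than fitted by an algorithm.

```python
# Hypothetical households: (income in $000s, lot size in 000s sq ft, owner?).
# These are made-up values, not the data behind Figure 1.1.
households = [
    (60, 22, True), (80, 20, True), (100, 18, True), (90, 22, True),
    (50, 18, False), (70, 16, False), (90, 14, False), (60, 16, False),
]


def axis_parallel_rule(income, lot_size):
    """One vertical line (income) combined with one horizontal line (lot size)."""
    return income > 75 and lot_size > 17


def diagonal_rule(income, lot_size):
    """A single diagonal line through the income / lot-size plane."""
    return income + 10 * lot_size > 255


def accuracy(rule, data):
    """Share of households whose ownership status the rule gets right."""
    correct = sum(rule(income, lot) == owner for income, lot, owner in data)
    return correct / len(data)


axis_accuracy = accuracy(axis_parallel_rule, households)
diagonal_accuracy = accuracy(diagonal_rule, households)
print("axis-parallel rule accuracy:", axis_accuracy)
print("diagonal rule accuracy:", diagonal_accuracy)
```

With these made-up values the axis-parallel rule classifies 7 of the 8 households correctly (0.875) and the diagonal rule classifies all 8 correctly. A different dataset could just as easily favor the axis-parallel rule, which is why several methods are usually tried and compared.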

1.7 Terminology and Notation

Because of the hybrid parentage of data mining, its practitioners often use multiple terms to refer to the same thing. For example, in the machine learning (artificial intelligence) field, the variable being predicted is the output variable or target variable. To a statistician, it is the dependent variable or the response. Here is a summary of terms used:

Algorithm

A specific procedure used to implement a particular data mining technique: classification tree, discriminant analysis, and the like.

Attribute

See Predictor.

Case

See Observation.

Confidence

A performance measure in association rules of the type “IF A and B are purchased, THEN C is also purchased.” Confidence is the conditional probability that C will be purchased IF A and B are purchased.

Confidence also has a broader meaning in statistics (confidence interval), concerning the degree of error in an estimate that results from selecting one sample as opposed to another.

Dependent Variable

See Response.

Estimation

See Prediction.

Feature

See Predictor.

Holdout Data

(or holdout set) A sample of data not used in fitting a model, but instead used to assess the performance of that model. This book uses the terms validation set and test set instead of holdout set.

Input Variable

See Predictor.

Model

An algorithm as applied to a dataset, complete with its settings (many of the algorithms have parameters that the user can adjust).

Observation

The unit of analysis on which the measurements are taken (a customer, a transaction, etc.); also called instance, sample, example, case, record, pattern, or row. In spreadsheets, each row typically represents a record; each column, a variable. Note that the use of the term “sample” here is different from its usual meaning in statistics, where it refers to a collection of observations.

Outcome Variable

See Response.

Output Variable

See Response.

P(A | B)

The conditional probability of event A occurring given that event B has occurred. Read as “the probability that A will occur given that B has occurred.”

Profile

A set of measurements on an observation (e.g., the height, weight, and age of a person).

Prediction

The prediction of the numerical value of a continuous output variable; also called estimation.

Predictor

A variable, usually denoted by X, used as an input into a predictive model. Also called a feature, input variable, independent variable, or from a database perspective, a field.

Record

See Observation.

Response

A variable, usually denoted by Y, which is the variable being predicted in supervised learning; also called dependent variable, output variable, target variable, or outcome variable.

Sample

In the statistical community, “sample” means a collection of observations. In the machine learning community, “sample” means a single observation.

Score

A predicted value or class. Scoring new data means using a model developed with training data to predict output values in new data.

Success Class

The class of interest in a binary outcome (e.g., purchasers in the outcome purchase/no purchase).

Supervised Learning

The process of providing an algorithm (logistic regression, regression tree, etc.) with records in which an output variable of interest is known and the algorithm “learns” how to predict this value with new records where the output is unknown.

Target

See Response.

Test Data

(or test set) The portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on new data.

Training Data

(or training set) The portion of the data used to fit a model.

Unsupervised Learning

An analysis in which one attempts to learn patterns in the data other than predicting an output value of interest.

Validation Data

(or validation set) The portion of the data used to assess how well the model fits, to adjust models, and to select the best model from among those that have been tried.

Variable

Any measurement on the records, including both the input (X) variables and the output (Y) variable.
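Two of the terms above lend themselves to a short sketch in Python: confidence as a conditional probability, and the partitioning of records into training, validation, and test sets. The baskets, the 50/30/20 split, the seed, and the function names are all assumptions made for illustration; they are not taken from XLMiner or from the datasets in this book.

```python
import random

# Hypothetical market baskets; each set holds the items in one transaction.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]


def confidence(baskets, antecedent, consequent):
    """P(consequent | antecedent): among baskets that contain the
    antecedent items, the share that also contain the consequent items."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    with_both = [b for b in with_antecedent if consequent <= b]
    return len(with_both) / len(with_antecedent)


def partition(records, train_share=0.5, validation_share=0.3, seed=1):
    """Randomly split records into training, validation, and test sets.
    Whatever is left after the first two sets becomes the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_share)
    n_valid = round(len(shuffled) * validation_share)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return training, validation, test


# Confidence of the rule "IF A and B are purchased, THEN C is also purchased".
rule_confidence = confidence(baskets, {"A", "B"}, {"C"})
print("confidence:", rule_confidence)

training, validation, test = partition(range(20))
print("set sizes:", len(training), len(validation), len(test))
```

Four of the six hypothetical baskets contain both A and B, and three of those also contain C, so the rule's confidence is 0.75. The 20 record identifiers are divided into sets of 10, 6, and 4 records, with each record landing in exactly one set.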

1.8 Road Maps to This Book

The book covers many of the widely used predictive and classification methods as well as other data mining tools. Figure 1.2 outlines data mining from a process perspective and where the topics in this book fit in. Chapter numbers are indicated beside the topic. Table 1.1 provides a different perspective: it organizes data mining procedures according to the type and structure of the data.

Figure 1.2 Data mining from a process perspective. Numbers in parentheses indicate chapter numbers

Table 1.1 Organization of data mining methods in this book, according to the nature of the data

Column groups: Supervised (Continuous Response; Categorical Response) and Unsupervised (No Response)

Continuous Predictors

Supervised, Continuous Response: Linear regression (6), Neural nets (11), k-Nearest-neighbors (7), Ensembles (13)

Supervised, Categorical Response: Logistic regression (10), Neural nets (11), Discriminant analysis (12), k-Nearest-neighbors (7), Ensembles (13)

Unsupervised, No Response: Principal components (4), Cluster analysis (15), Collaborative filtering (14)

Categorical Predictors

Linear regression (6)

Neural nets (11)

Association rules (14)

Neural nets (11)