Profit Driven Business Analytics - Wouter Verbeke - E-Book

Description

Maximize profit and optimize decisions with advanced business analytics

Profit-Driven Business Analytics provides actionable guidance on optimizing the use of data to add value and drive better business. Combining theoretical and technical insights into daily operations and long-term strategy, this book acts as a development manual for practitioners seeking to conceive, develop, and manage advanced analytical models. Detailed discussion delves into the wide range of analytical approaches and modeling techniques that can help maximize business payoff, and the author team draws upon their recent research to share deep insight about optimal strategy. Real-life case studies and examples illustrate these techniques at work, and provide clear guidance for implementation in your own organization. From step-by-step instruction on data handling, to analytical fine-tuning, to evaluating results, this guide provides invaluable guidance for practitioners seeking to reap the advantages of true business analytics.

Despite widespread discussion surrounding the value of data in decision making, few businesses have adopted advanced analytic techniques in any meaningful way. This book shows you how to delve deeper into the data and discover what it can do for your business.

* Reinforce basic analytics to maximize profits
* Adopt the tools and techniques of successful integration
* Implement more advanced analytics with a value-centric approach
* Fine-tune analytical information to optimize business decisions

Both stored and streamed data have been increasing at an exponential rate, and failing to use them to the fullest advantage equates to leaving money on the table. From bolstering current efforts to implementing a full-scale analytics initiative, the vast majority of businesses will see greater profit by applying advanced methods. Profit-Driven Business Analytics provides a practical guidebook and reference for adopting real business analytics techniques.


Page count: 623

Year of publication: 2017




Table of Contents

Cover

Wiley & SAS Business Series

Title Page

Foreword

Acknowledgments

CHAPTER 1: A Value-Centric Perspective Towards Analytics

INTRODUCTION

PROFIT-DRIVEN BUSINESS ANALYTICS

ANALYTICS PROCESS MODEL

ANALYTICAL MODEL EVALUATION

ANALYTICS TEAM

CONCLUSION

REVIEW QUESTIONS

REFERENCES

CHAPTER 2: Analytical Techniques

INTRODUCTION

DATA PREPROCESSING

TYPES OF ANALYTICS

PREDICTIVE ANALYTICS

ENSEMBLE METHODS

EVALUATING PREDICTIVE MODELS

DESCRIPTIVE ANALYTICS

SURVIVAL ANALYSIS

SOCIAL NETWORK ANALYTICS

CONCLUSION

REVIEW QUESTIONS

NOTES

REFERENCES

CHAPTER 3: Business Applications

INTRODUCTION

MARKETING ANALYTICS

FRAUD ANALYTICS

CREDIT RISK ANALYTICS

HR ANALYTICS

CONCLUSION

REVIEW QUESTIONS

NOTE

REFERENCES

CHAPTER 4: Uplift Modeling

INTRODUCTION

EXPERIMENTAL DESIGN, DATA COLLECTION, AND DATA PREPROCESSING

UPLIFT MODELING METHODS

EVALUATION OF UPLIFT MODELS

PRACTICAL GUIDELINES

CONCLUSION

REVIEW QUESTIONS

NOTE

REFERENCES

CHAPTER 5: Profit-Driven Analytical Techniques

INTRODUCTION

PROFIT-DRIVEN PREDICTIVE ANALYTICS

COST-SENSITIVE CLASSIFICATION

COST-SENSITIVE REGRESSION

COST-SENSITIVE LEARNING FOR REGRESSION

PROFIT-DRIVEN DESCRIPTIVE ANALYTICS

CONCLUSION

REVIEW QUESTIONS

NOTES

REFERENCES

CHAPTER 6: Profit-Driven Model Evaluation and Implementation

INTRODUCTION

PROFIT-DRIVEN EVALUATION OF CLASSIFICATION MODELS

PROFIT-DRIVEN EVALUATION OF REGRESSION MODELS

CONCLUSION

REVIEW QUESTIONS

NOTES

REFERENCES

CHAPTER 7: Economic Impact

INTRODUCTION

ECONOMIC VALUE OF BIG DATA AND ANALYTICS

KEY ECONOMIC CONSIDERATIONS

IMPROVING THE ROI OF BIG DATA AND ANALYTICS

CONCLUSION

REVIEW QUESTIONS

NOTES

REFERENCES

About the Authors

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1 Categories of Analytics from a Task-Oriented Perspective

Table 1.2 Example Datasets and Predictive Analytical Models

Table 1.3 Example Datasets and Descriptive Analytical Models

Table 1.4 Structured Dataset

Table 1.5 Examples of Business Decisions Matching Analytics

Table 1.6 Outline of the Book

Table 1.7 Key Characteristics of Successful Business Analytics Models

Chapter 2

Table 2.1 Missing Values in a Dataset

Table 2.2 Dataset for Linear Regression

Table 2.3 Example Classification Dataset

Table 2.4 Reference Values for Variable Significance

Table 2.5 Example Dataset for Performance Calculation

Table 2.6 The Confusion Matrix

Table 2.7 Receiver Operating Characteristic (ROC) Analysis

Table 2.8 Example Transaction Dataset

Table 2.9 The Lift Measure

Table 2.10 Example Transaction Dataset (left) and Sequential Dataset (right) for Sequence Rule Mining

Table 2.11 Matrix Representation of a Social Network

Table 2.12 Network Centrality Measures

Table 2.13 Centrality measures for the Kite network

Chapter 3

Table 3.1 K-Means Clustering Sample Output

Table 3.2 Event Log of Customer Activities

Table 3.3 Example User-Item Matrix

Table 3.4 Example Call Detail Record Dataset for Fraud Detection

Chapter 4

Table 4.1 Overview of Model and Campaign Effect Measurement

Table 4.2 Example Model and Campaign Effect Measurement

Table 4.3 Dataset Including Treatment Dummy Variable t, Predictor Variables x_i, and Target Variable y

Table 4.4 Relabeled Dataset of Table 4.3 Following Lai's Method

Table 4.5 Relabeled Dataset of Table 4.3 Following the Generalized Lai Method

Chapter 5

Table 5.1 Confusion Matrix

Table 5.2 Cost Matrix for a Binary Classification Problem

Table 5.3 Example Cost Matrix for the German Credit Data

Table 5.4 Cost Matrix for an Ordinal Classification Problem

Table 5.5 Simplified Cost Matrix

Table 5.6 Structured Overview of the Cost-Sensitive Classification Approaches Discussed in This Chapter

Table 5.7 Overview of Sampling Based Ensemble Approaches for Cost-Sensitive Learning

Table 5.8 Overview of Weighting Approaches for Cost-Sensitive Learning

Table 5.9 Overview of Cost-Sensitive Decision-Tree Approaches

Table 5.10 Overview of Cost-Sensitive Boosting Based Approaches

Table 5.11 Example of the MetaCost Approach

Table 5.12 Elaborated Example of Algorithm 5.1

Table 5.13 Average CLV and Fraction of Observations per Segment

Table 5.14 Average CLV, RFM Variables and Cost of Service for the Observations in the Five K-means Clusters

Chapter 6

Table 6.1 Cost Matrix for a Binary Classification Problem

Table 6.2 Cost Matrix for the German Credit Data Example Dataset

Table 6.3 Cutoff Points, Accuracies, and Costs. The cutoff with the best accuracy is 0.90, and the one with the lowest average misclassification cost is 0.55

Table 6.4 Accuracy and AUC for the Five Candidate Models

Table 6.5 AUC and AUC of the Convex Hull for the Five Candidate Models

Table 6.6 H-Measure for the Five Candidate Models. Model 3 is the selected one, not Model 5

Table 6.7 Confusion Matrix for a Binary Classification Problem

Table 6.8 EMP and Selected Fraction per Model

Table 6.9 Benefits, Costs, and Profits for the Test Set. Model 3 gives the better performance, as expected

Table 6.10 Observation-Dependent Cost Matrix for Credit Scoring

Table 6.11 Overview of Standard Regression Evaluation Measures

Table 6.12 Data for REC Curve

Table 6.13 Data for REC Surface

Chapter 7

Table 7.1 Example Costs for Calculating Total Cost of Ownership (TCO)

Table 7.2 Data Quality Dimensions (Wang et al. 1996)

List of Illustrations

Chapter 1

Figure 1.1 The analytics process model.

Figure 1.2 Profile of a data scientist.

Chapter 2

Figure 2.1 Aggregating normalized data tables into a non-normalized data table.

Figure 2.2 Example dataset showing an ellipse rotated in 45 degrees.

Figure 2.3 PCA of the simulated data.

Figure 2.4 OLS regression.

Figure 2.5 Bounding function for logistic regression.

Figure 2.6 Linear decision boundary of logistic regression.

Figure 2.7 Calculating the p-value with a Student's t-distribution.

Figure 2.8 Variable subsets for four variables x_1, x_2, x_3, and x_4.

Figure 2.9 Example decision tree.

Figure 2.10 Example datasets for calculating impurity.

Figure 2.11 Entropy versus Gini.

Figure 2.12 Calculating the entropy for age split.

Figure 2.13 Using a validation set to stop growing a decision tree.

Figure 2.14 Decision boundary of a decision tree.

Figure 2.15 Example regression tree for predicting the fraud percentage.

Figure 2.16 Neural network representation of logistic regression.

Figure 2.17 A Multilayer Perceptron (MLP) neural network.

Figure 2.18 Local versus global minima.

Figure 2.19 Using a validation set for stopping neural network training.

Figure 2.20 Training and test set split-up for performance estimation.

Figure 2.21 Cross-validation for performance measurement.

Figure 2.22 Bootstrapping.

Figure 2.23 Receiver operating characteristic curve.

Figure 2.24 The lift curve.

Figure 2.25 The cumulative accuracy profile (CAP).

Figure 2.26 Calculating the accuracy ratio.

Figure 2.27 Scatter plot.

Figure 2.28 Hierarchical versus nonhierarchical clustering techniques.

Figure 2.29 Divisive versus agglomerative hierarchical clustering.

Figure 2.30 Euclidean versus Manhattan distance.

Figure 2.31 Calculating distances between clusters.

Figure 2.32 Example for clustering birds. The numbers indicate the clustering steps.

Figure 2.33 Dendrogram for birds example. The red line indicates the optimal clustering.

Figure 2.34 Scree plot for clustering.

Figure 2.35 Rectangular versus hexagonal SOM grid.

Figure 2.36 Clustering countries using SOMs.

Figure 2.37 Component plane for literacy.

Figure 2.38 Component plane for political rights.

Figure 2.39 Example of right censoring for churn prediction.

Figure 2.40 Example of a discrete event time distribution.

Figure 2.41 Cumulative distribution and survival function for the event time distribution in Figure 2.40.

Figure 2.42 Sample hazard shapes.

Figure 2.43 Kaplan Meier example.

Figure 2.44 Exponential event time distribution, with cumulative distribution and hazard function.

Figure 2.45 Weibull distributions.

Figure 2.46 The proportional hazards model.

Figure 2.47 Sociogram representation of a social network.

Figure 2.48 The Kite network.

Figure 2.49 Example social network for relational neighbor classifier.

Figure 2.50 Example social network for probabilistic relational neighbor classifier.

Figure 2.51 Relational logistic regression.

Figure 2.52 Example of featurization with features describing target behavior of neighbors.

Figure 2.53 Example of featurization with features describing local node behavior of neighbors.

Chapter 3

Figure 3.1 Constructing an RFM score (independent sorting).

Figure 3.2 Constructing an RFM score (dependent sorting).

Figure 3.3 Cluster profiling using histograms.

Figure 3.4 Using decision trees for clustering interpretation.

Figure 3.5 Example Markov chain (Pfeifer and Carraway 2000).

Figure 3.6 Customer journey in a mortgage sales process.

Figure 3.7 Example social network for fraud detection.

Figure 3.8 Multilevel credit risk model architecture.

Figure 3.9 Example employee network.

Chapter 4

Figure 4.1 Four types of customers identified as a function of purchasing behavior when treated or not treated.

Figure 4.2 Experimental design to collect the required data for uplift modeling, allowing the selection of a model base for the campaign.

Figure 4.3 Categorization of customers based on whether a customer was treated and whether the customer responded.

Figure 4.4 (a) High uplift but low number of observations in the left child node versus (b) lower uplift but applicable to a higher number of observations.

Figure 4.5 Illustration of Gain_U calculation.

Figure 4.6 Response rate by decile graph for both treatment and control groups (upper panel) and uplift by deciles graph (lower panel).

Figure 4.7 Response rate curve for the perfect uplift model, plotting the response rates for the treatment and control groups ranked according to estimated uplift.

Figure 4.8 Uplift by decile curve of an accurate uplift model.

Figure 4.9 Cumulative incremental gains charts or Qini curves for two uplift models and the baseline model.

Chapter 5

Figure 5.1 Oversampling the fraudsters.

Figure 5.2 Undersampling the nonfraudsters.

Figure 5.3 Synthetic minority oversampling technique (SMOTE).

Figure 5.4 Quadratic cost versus true cost as a function of the prediction error.

Figure 5.5 Linlin cost function C_linlin as a function of the prediction error e.

Figure 5.6 Average misprediction cost as a function of the adjustment δ.

Figure 5.7 Customer lifetime value distribution for a dataset containing 1,000 customers.

Figure 5.8 Three-cut strategy for CLV segmentation.

Figure 5.9 Three-group customer segmentation.

Figure 5.10 Plot of the first two principal components following a K-means clustering of the CLV example dataset.

Figure 5.11 Density maps of CLV example dataset SOMs of different sizes.

Figure 5.12 Distance map of the CLV example dataset SOM.

Figure 5.13 Codebook vector graph of the CLV example dataset SOM.

Figure 5.14 Heatmaps for the variables in the CLV example dataset.

Figure 5.15 Dendrogram plot of the hierarchical clustering procedure using single linkage.

Figure 5.16 Codebook vector graph with clustering limits superimposed.

Chapter 6

Figure 6.1 Illustration of the two-cutoff point strategy.

Figure 6.2 Receiver operating characteristic curve.

Figure 6.3 Convex hull for a nonconcave ROC curve.

Figure 6.4 Beta distribution for different values of the parameters α and β.

Figure 6.5 ROC curves for five credit risk models.

Figure 6.6 Customer churn management process. Adapted from Verbraken et al. (2013).

Figure 6.7 LGD histogram and percentage of observations per score group.

Figure 6.8 Regression error characteristic (REC) curve.

Figure 6.9 Regression error characteristic surface following the data in Table 6.13.

Chapter 7

Figure 7.1 ROI of big data and analytics.


Wiley & SAS Business Series

The Wiley & SAS Business Series presents books that help senior-level managers with their critical management decisions.

Titles in the Wiley & SAS Business Series include:

Analytics: The Agile Way

by Phil Simon

Analytics in a Big Data World: The Essential Guide to Data Science and its Applications

by Bart Baesens

A Practical Guide to Analytics for Governments: Using Big Data for Good

by Marie Lowman

Bank Fraud: Using Technology to Combat Losses

by Revathi Subramanian

Big Data Analytics: Turning Big Data into Big Money

by Frank Ohlhorst

Big Data, Big Innovation: Enabling Competitive Differentiation through Business Analytics

by Evan Stubbs

Business Analytics for Customer Intelligence

by Gert Laursen

Business Intelligence Applied: Implementing an Effective Information and Communications Technology Infrastructure

by Michael Gendron

Business Intelligence and the Cloud: Strategic Implementation Guide

by Michael S. Gendron

Business Transformation: A Roadmap for Maximizing Organizational Insights

by Aiman Zeid

Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media

by Frank Leistner

Data-Driven Healthcare: How Analytics and BI are Transforming the Industry

by Laura Madsen

Delivering Business Analytics: Practical Guidelines for Best Practice

by Evan Stubbs

Demand-Driven Forecasting: A Structured Approach to Forecasting, Second Edition

by Charles Chase

Demand-Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain

by Robert A. Davis

Developing Human Capital: Using Analytics to Plan and Optimize Your Learning and Development Investments

by Gene Pease, Barbara Beresford, and Lew Walker

The Executive's Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business

by David Thomas and Mike Barlow

Economic and Business Forecasting: Analyzing and Interpreting Econometric Results

by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah Watt, and Sam Bullard

Economic Modeling in the Post Great Recession Era: Incomplete Data, Imperfect Markets

by John Silvia, Azhar Iqbal, and Sarah Watt House

Enhance Oil & Gas Exploration with Data-Driven Geophysical and Petrophysical Models

by Keith Holdaway and Duncan Irving

Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications

by Robert Rowan

Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection

by Bart Baesens, Veronique Van Vlasselaer, and Wouter Verbeke

Harness Oil and Gas Big Data with Analytics: Optimize Exploration and Production with Data-Driven Models

by Keith Holdaway

Health Analytics: Gaining the Insights to Transform Health Care

by Jason Burke

Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World

by Carlos Andre Reis Pinheiro and Fiona McNeill

Human Capital Analytics: How to Harness the Potential of Your Organization's Greatest Asset

by Gene Pease, Boyce Byerly, and Jac Fitz-enz

Implement, Improve and Expand Your Statewide Longitudinal Data System: Creating a Culture of Data in Education

by Jamie McQuiggan and Armistead Sapp

Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards, Second Edition

by Naeem Siddiqi

Killer Analytics: Top 20 Metrics Missing from your Balance Sheet

by Mark Brown

Machine Learning for Marketers: Hold the Math

by Jim Sterne

On-Camera Coach: Tools and Techniques for Business Professionals in a Video-Driven World

by Karin Reed

Predictive Analytics for Human Resources

by Jac Fitz-enz and John Mattox II

Predictive Business Analytics: Forward-Looking Capabilities to Improve Business Performance

by Lawrence Maisel and Gary Cokins

Profit Driven Business Analytics: A Practitioner's Guide to Transforming Big Data into Added Value

by Wouter Verbeke, Cristian Bravo, and Bart Baesens

Retail Analytics: The Secret Weapon

by Emmett Cox

Social Network Analysis in Telecommunications

by Carlos Andre Reis Pinheiro

Statistical Thinking: Improving Business Performance, Second Edition

by Roger W. Hoerl and Ronald D. Snee

Strategies in Biomedical Data Science: Driving Force for Innovation

by Jay Etchings

Style & Statistic: The Art of Retail Analytics

by Brittany Bullard

Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics

by Bill Franks

Too Big to Ignore: The Business Case for Big Data

by Phil Simon

The Analytic Hospitality Executive

by Kelly A. McGuire

The Value of Business Analytics: Identifying the Path to Profitability

by Evan Stubbs

The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions

by Phil Simon

Using Big Data Analytics: Turning Big Data into Big Money

by Jared Dean

Win with Advanced Business Analytics: Creating Business Value from Your Data

by Jean Paul Isson and Jesse Harriott

For more information on any of the above titles, please visit www.wiley.com.

Profit-Driven Business Analytics

A Practitioner’s Guide to Transforming Big Data into Added Value

Wouter Verbeke

Bart Baesens

Cristián Bravo

Copyright © 2018 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Cataloging-in-Publication Data is Available:

ISBN 9781119286554 (Hardcover)

ISBN 9781119286998 (ePDF)

ISBN 9781119286981 (ePub)

Cover Design: Wiley

Cover Image: © Ricardo Reitmeyer/iStockphoto

To Luit, Titus, and Fien.

To my wonderful wife, Katrien, and kids Ann-Sophie, Victor, and Hannelore.

To my parents and parents-in-law.

To Cindy, for her unwavering support.

Foreword

Sandra Wilikens

Secretary General, responsible for CSR and member of the Executive Committee, BNP Paribas Fortis

In today's corporate world, strategic priorities tend to center on customer and shareholder value. One of the consequences is that analytics often focuses too much on complex technologies and statistics rather than long-term value creation. With their book Profit-Driven Business Analytics, Verbeke, Bravo, and Baesens pertinently bring forward a much-needed shift of focus that consists of turning analytics into a mature, value-adding technology. It further builds on the extensive research and industry experience of the author team, making it a must-read for anyone using analytics to create value and gain sustainable strategic leverage. This is even more true as we enter a new era of sustainable value creation in which the pursuit of long-term value has to be driven by sustainably strong organizations. The role of corporate employers is evolving as civic involvement and social contribution grow to be key strategic pillars.

Acknowledgments

It is a great pleasure to acknowledge the contributions and assistance of various colleagues, friends, and fellow analytics lovers to the writing of this book. This book is the result of many years of research and teaching in business analytics. We first would like to thank our publisher, Wiley, for accepting our book proposal.

We are grateful to the active and lively business analytics community for providing various user fora, blogs, online lectures, and tutorials, which proved very helpful.

We would also like to acknowledge the direct and indirect contributions of the many colleagues, fellow professors, students, researchers, and friends with whom we collaborated during the past years. Specifically, we would like to thank Floris Devriendt and George Petrides for contributing to the chapters on uplift modeling and profit-driven analytical techniques.

Last but not least, we are grateful to our partners, parents, and families for their love, support, and encouragement.

We have tried to make this book as complete, accurate, and enjoyable as possible. Of course, what really matters is what you, the reader, think of it. Please let us know your views by getting in touch. The authors welcome all feedback and comments—so do not hesitate to let us know your thoughts!

Wouter Verbeke
Bart Baesens
Cristián Bravo
May 2017

CHAPTER 1: A Value-Centric Perspective Towards Analytics

INTRODUCTION

In this first chapter, we set the scene for what is ahead by broadly introducing profit-driven business analytics. The value-centric perspective toward analytics proposed in this book will be positioned and contrasted with a traditional statistical perspective. The implications of adopting a value-centric perspective toward the use of analytics in business are significant: a mind shift is needed both from managers and data scientists in developing, implementing, and operating analytical models. This, however, calls for deep insight into the underlying principles of advanced analytical approaches. Providing such insight is our general objective in writing this book and, more specifically:

We aim to provide the reader with a structured overview of state-of-the-art analytics for business applications.

We want to assist the reader in gaining a deeper practical understanding of the inner workings and underlying principles of these approaches from a practitioner's perspective.

We wish to advance managerial thinking on the use of advanced analytics by offering insight into how these approaches may either generate significant added value or lower operational costs by increasing the efficiency of business processes.

We seek to promote and facilitate the use of analytical approaches that are customized to needs and requirements in a business context.

As such, we envision that our book will help organizations step up to the next level in the adoption of analytics for decision making by embracing the advanced methods introduced in the subsequent chapters of this book. Doing so requires an investment in terms of acquiring and developing knowledge and skills but, as is demonstrated throughout the book, also generates increased profits. An interesting feature of the approaches discussed in this book is that they have often been developed at the intersection of academia and business, by academics and practitioners joining forces to tune a multitude of approaches to the particular needs and problem characteristics encountered and shared across diverse business settings.

Most of these approaches emerged only after the millennium, which should not be surprising. Since the millennium, we have witnessed a continuous and accelerating development and an expanding adoption of information, network, and database technologies. Key technological evolutions include the massive growth and success of the World Wide Web and Internet services, the introduction of smartphones, the standardization of enterprise resource planning systems, and many other applications of information technology. This dramatic change of scene has spurred the development of analytics for business applications as a rapidly growing and thriving branch of science and industry.

To achieve the stated objectives, we have chosen to adopt a pragmatic approach in explaining techniques and concepts. We do not focus on providing extensive mathematical proofs or detailed algorithms. Instead, we pinpoint the crucial insights and underlying reasoning, as well as the advantages and disadvantages, related to the practical use of the discussed approaches in a business setting. For this, we ground our discourse in solid academic research expertise as well as in many years of practical experience in elaborating industrial analytics projects in close collaboration with data science professionals. Throughout the book, a plethora of illustrative examples and case studies are discussed. Example datasets, code, and implementations are provided on the book's companion website, www.profit-analytics.com, to further support the adoption of the discussed approaches.

In this chapter, we first introduce business analytics. Next, the profit-driven perspective toward business analytics that will be elaborated in this book is presented. We then introduce the subsequent chapters of this book and how the approaches introduced in these chapters allow us to adopt a value-centric approach for maximizing profitability and, as such, to increase the return on investment of big data and analytics. Next, the analytics process model is discussed, detailing the subsequent steps in elaborating an analytics project within an organization. Finally, the chapter concludes by characterizing the ideal profile of a business data scientist.

Business Analytics

"Data is the new oil" is a popular quote that pinpoints the increasing value of data and, to our liking, accurately characterizes data as raw material. Data are to be seen as an input or basic resource that needs further processing before actually being of use. In a subsequent section of this chapter, we introduce the analytics process model, which describes the iterative chain of processing steps involved in turning data into information or decisions and which is actually quite similar to an oil refinery process. Note the subtle but significant difference between the words "data" and "information" in the sentence above. Whereas data can fundamentally be defined as a sequence of zeroes and ones, information is essentially the same but additionally implies a certain utility or value to the end user or recipient. So, whether data are information depends on whether the data have utility to the recipient. Typically, for raw data to be information, the data first need to be processed, aggregated, summarized, and compared. In summary, data typically need to be analyzed, and insight, understanding, or knowledge should be added for data to become useful.

Applying basic operations on a dataset may already provide useful insight and support the end user or recipient in decision making. These basic operations mainly involve selection and aggregation. Both selection and aggregation may be performed in many ways, leading to a multitude of indicators or statistics that can be distilled from raw data. The following example elaborates a number of sales indicators in a retail setting.

Providing insight by customized reporting is exactly what the field of business intelligence (BI) is about. Typically, visualizations are also adopted to represent indicators and their evolution in time, in easy-to-interpret ways. Visualizations provide support by facilitating the user's ability to acquire understanding and insight in the blink of an eye. Personalized dashboards, for instance, are widely adopted in the industry and are very popular with managers to monitor and keep track of business performance. A formal definition of business intelligence is provided by Gartner (http://www.gartner.com/it-glossary):

Example

For managerial purposes, a retailer requires the development of real-time sales reports. Such a report may include a wide variety of indicators that summarize raw sales data. Raw sales data, in fact, concern transactional data that can be extracted from the online transaction processing (OLTP) system that is operated by the retailer. Some example indicators and the required selection and aggregation operations for calculating these statistics are:

Total amount of revenues generated over the last 24 hours: Select all transactions over the last 24 hours and sum the paid amounts, with paid meaning the price net of promotional offers.

Average paid amount in online store over the last seven days: Select all online transactions over the last seven days and calculate the average paid amount.

Fraction of returning customers within one month: Select all transactions over the last month and select customer IDs that appear more than once; count the number of these IDs and divide by the number of distinct customer IDs in that month.

Remark that calculating these indicators involves basic selection operations on characteristics or dimensions of transactions stored in the database, as well as basic aggregation operations such as sum, count, and average, among others.
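The three indicators above can be sketched in a few lines of Python. This is a minimal illustration only: the transaction records, the field layout, and the helper name `select` are hypothetical and not taken from the book or its companion website, and a real report would query the OLTP system or a data warehouse rather than an in-memory list.

```python
from datetime import datetime, timedelta

# Hypothetical transactional records: (customer_id, channel, timestamp, paid_amount).
# paid_amount is the price net of promotional offers.
NOW = datetime(2024, 3, 31, 12, 0)
transactions = [
    ("C1", "online", NOW - timedelta(hours=2), 40.0),
    ("C2", "store", NOW - timedelta(hours=20), 25.0),
    ("C1", "online", NOW - timedelta(days=3), 60.0),
    ("C3", "online", NOW - timedelta(days=6), 20.0),
    ("C2", "store", NOW - timedelta(days=15), 35.0),
    ("C4", "store", NOW - timedelta(days=40), 80.0),
]


def select(rows, since, channel=None):
    """Selection: keep transactions at or after `since`, optionally for one channel."""
    return [r for r in rows if r[2] >= since and (channel is None or r[1] == channel)]


# Indicator 1: total revenue over the last 24 hours (selection + sum).
revenue_24h = sum(r[3] for r in select(transactions, NOW - timedelta(hours=24)))

# Indicator 2: average paid amount online over the last seven days (selection + average).
online_7d = select(transactions, NOW - timedelta(days=7), channel="online")
avg_online_7d = sum(r[3] for r in online_7d) / len(online_7d)

# Indicator 3: fraction of returning customers within one month (selection + count).
purchases_per_customer = {}
for customer_id, _, _, _ in select(transactions, NOW - timedelta(days=30)):
    purchases_per_customer[customer_id] = purchases_per_customer.get(customer_id, 0) + 1
returning = [c for c, n in purchases_per_customer.items() if n > 1]
fraction_returning = len(returning) / len(purchases_per_customer)

print(revenue_24h, avg_online_7d, round(fraction_returning, 3))
```

Each indicator is one selection followed by one aggregation, which is why such reports map naturally onto database queries.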

Business intelligence is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.

Note that this definition explicitly mentions the required infrastructure and best practices as an essential component of BI, which is typically also provided as part of the package or solution offered by BI vendors and consultants. More advanced analysis of data may further support users and optimize decision making. This is exactly where analytics comes into play. Analytics is a catch-all term covering a wide variety of what are essentially data-processing techniques. In its broadest sense, analytics strongly overlaps with data science, statistics, and related fields such as artificial intelligence (AI) and machine learning. Analytics, to us, is a toolbox containing a variety of instruments and methodologies allowing users to analyze data for a diverse range of well-specified purposes. Table 1.1 identifies a number of categories of analytical tools that cover diverse intended uses or, in other words, allow users to complete a diverse range of tasks.

Table 1.1 Categories of Analytics from a Task-Oriented Perspective

| Predictive Analytics | Descriptive Analytics |
|---|---|
| Classification | Clustering |
| Regression | Association analysis |
| Survival analysis | Sequence analysis |
| Forecasting | |

A first main group of tasks identified in Table 1.1 concerns prediction. Based on observed variables, the aim is to accurately estimate or predict an unobserved value. The applicable subtype of predictive analytics depends on the type of target variable, which we intend to model as a function of a set of predictor variables. When the target variable is categorical in nature, meaning the variable can only take a limited number of possible values (e.g., churner or not, fraudster or not, defaulter or not), then we have a classification problem. When the task concerns the estimation of a continuous target variable (e.g., sales amount, customer lifetime value, credit loss), which can take any value over a certain range of possible values, we are dealing with regression. Survival analysis and forecasting explicitly account for the time dimension by either predicting the timing of events (e.g., churn, fraud, default) or the evolution of a target variable in time (e.g., churn rates, fraud rates, default rates). Table 1.2 provides simplified example datasets and analytical models for each type of predictive analytics for illustrative purposes.

Table 1.2 Example Datasets and Predictive Analytical Models

Example dataset

Predictive analytical model

Classification

| ID | Recency | Frequency | Monetary | Churn |
|---|---|---|---|---|
| C1 | 26 | 4.2 | 126 | Yes |
| C2 | 37 | 2.1 | 59 | No |
| C3 | 2 | 8.5 | 256 | No |
| C4 | 18 | 6.2 | 89 | No |
| C5 | 46 | 1.1 | 37 | Yes |

Decision tree classification model:

Regression

| ID | Recency | Frequency | Monetary | CLV |
|---|---|---|---|---|
| C1 | 26 | 4.2 | 126 | 3,817 |
| C2 | 37 | 2.1 | 59 | 4,31 |
| C3 | 2 | 8.5 | 256 | 2,187 |
| C4 | 18 | 6.2 | 89 | 543 |
| C5 | 46 | 1.1 | 37 | 1,548 |

Linear regression model:

Survival analysis

| ID | Recency | Churn or Censored | Time of Churn or Censoring |
|---|---|---|---|
| C1 | 26 | Churn | 181 |
| C2 | 37 | Censored | 253 |
| C3 | 2 | Censored | 37 |
| C4 | 18 | Censored | 172 |
| C5 | 46 | Churn | 98 |

General parametric survival analysis model:

Forecasting

| Timestamp | Demand |
|---|---|
| January | 513 |
| February | 652 |
| March | 435 |
| April | 578 |
| May | 601 |

Weighted moving average forecasting model:
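As a rough illustration of a weighted moving average, the sketch below forecasts the next month from the demand series in Table 1.2. The window length of three months and the weights (0.2, 0.3, 0.5) are assumptions chosen for this example; the source does not specify them.

```python
# Monthly demand from Table 1.2 (January through May).
demand = [513, 652, 435, 578, 601]


def weighted_moving_average(series, weights):
    """Forecast the next value as a weighted average of the most recent observations.

    `weights` are ordered from oldest to newest and are normalized by their sum.
    """
    if len(series) < len(weights):
        raise ValueError("series is shorter than the weight window")
    window = series[-len(weights):]
    return sum(w * x for w, x in zip(weights, window)) / sum(weights)


# Assumed weights: the most recent month counts most.
june_forecast = weighted_moving_average(demand, weights=(0.2, 0.3, 0.5))
print(round(june_forecast, 1))  # 0.2*435 + 0.3*578 + 0.5*601 = 560.9
```

Giving more weight to recent months makes the forecast react faster to changes in demand, at the price of more sensitivity to noise.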

The second main group of analytics comprises descriptive analytics that, rather than predicting a target variable, aim at identifying specific types of patterns. Clustering or segmentation aims at grouping entities (e.g., customers, transactions, employees, etc.) that are similar in nature. The objective of association analysis is to find groups of events that frequently co-occur and therefore appear to be associated. The basic observations that are being analyzed in this problem setting consist of variable groups of events; for instance, transactions involving various products that are being bought by a customer at a certain moment in time. The aim of sequence analysis is similar to association analysis but concerns the detection of events that frequently occur sequentially, rather than simultaneously as in association analysis. As such, sequence analysis explicitly accounts for the time dimension. Table 1.3 provides simplified examples of datasets and analytical models for each type of descriptive analytics.

Table 1.3 Example Datasets and Descriptive Analytical Models

Data

Descriptive analytical model

Clustering

| ID | Recency | Frequency |
|---|---|---|
| C1 | 26 | 4.2 |
| C2 | 37 | 2.1 |
| C3 | 2 | 8.5 |
| C4 | 18 | 6.2 |
| C5 | 46 | 1.1 |

K-means clustering with K = 3:

Association analysis

| ID | Items |
|---|---|
| T1 | beer, pizza, diapers, baby food |
| T2 | coke, beer, diapers |
| T3 | crisps, diapers, baby food |
| T4 | chocolates, diapers, pizza, apples |
| T5 | tomatoes, water, oranges, beer |

Association rules:

If baby food And diapers Then beer

If coke And pizza Then crisps

Sequence analysis

| ID | Sequential items |
|---|---|
| C1 | <{3},{9}> |
| C2 | <{1 2},{3},{4 6 7}> |
| C3 | <{3 5 7}> |
| C4 | <{3},{4 7},{9}> |
| C5 | <{9}> |

Sequence rules: …
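To make the idea of association rules more concrete, the following sketch computes support and confidence for a rule on the five transactions T1 to T5 of Table 1.3. Support and confidence are the standard measures for association rules; the function name `rule_metrics` is hypothetical, and the rules shown in the table are illustrative rather than mined from these five transactions.

```python
# Transactions T1..T5 from the association analysis example in Table 1.3.
transactions = [
    {"beer", "pizza", "diapers", "baby food"},
    {"coke", "beer", "diapers"},
    {"crisps", "diapers", "baby food"},
    {"chocolates", "diapers", "pizza", "apples"},
    {"tomatoes", "water", "oranges", "beer"},
]


def rule_metrics(antecedent, consequent, baskets):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = share of baskets containing antecedent and consequent together
    confidence = share of baskets with the antecedent that also hold the consequent
                 (None when the antecedent never occurs)
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / len(baskets)
    confidence = n_both / n_antecedent if n_antecedent else None
    return support, confidence


# Rule: If baby food And diapers Then beer
support, confidence = rule_metrics({"baby food", "diapers"}, {"beer"}, transactions)
print(support, confidence)  # 0.2 0.5

# A rule whose antecedent does not occur in this tiny sample has undefined confidence.
print(rule_metrics({"coke", "pizza"}, {"crisps"}, transactions))  # (0.0, None)
```

In practice, rules are kept only when both measures exceed thresholds chosen by the analyst.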

Note that Tables 1.1 through 1.3 identify and illustrate categories of approaches that are able to complete a specific task from a technical rather than an applied perspective. These different types of analytics can be applied in quite diverse business and nonbusiness settings and consequently lead to many specialized applications. For instance, predictive analytics and, more specifically, classification techniques may be applied for detecting fraudulent credit-card transactions, for predicting customer churn, for assessing loan applications, and so forth. From an application perspective, this leads to various groups of analytics such as, respectively, fraud analytics, customer or marketing analytics, and credit risk analytics. A wide range of business applications of analytics across industries and business departments is discussed in detail in Chapter 3.

Note that the types of analytics listed in Table 1.1 apply to structured data. An example of a structured dataset is shown in Table 1.4. The rows in such a dataset are typically called observations, instances, records, or lines, and represent or collect information on basic entities such as customers, transactions, accounts, or citizens. The columns are typically referred to as (explanatory or predictor) variables, characteristics, attributes, predictors, inputs, dimensions, effects, or features. The columns contain information on a particular entity as represented by a row in the table. In Table 1.4, the second column represents the age of a customer, the third column the income, and so on. In this book we consistently use the terms observation and variable (and sometimes more specifically, explanatory, predictor, or target variable).

Table 1.4 Structured Dataset

| Customer | Age | Income | Gender | Duration | Churn |
|---|---|---|---|---|---|
| John | 30 | 1,800 | Male | 620 | Yes |
| Sarah | 25 | 1,400 | Female | 12 | No |
| Sophie | 52 | 2,600 | Female | 830 | No |
| David | 42 | 2,200 | Male | 90 | Yes |

Because of the structure that is present in the dataset in Table 1.4 and the well-defined meaning of rows and columns, it is much easier to analyze such a structured dataset compared to analyzing unstructured data such as text, video, or networks, to name a few. Specialized techniques exist that facilitate analysis of unstructured data—for instance, text analytics with applications such as sentiment analysis, video analytics that can be applied for face recognition and incident detection, and network analytics with applications such as community mining and relational learning (see Chapter 2). Given the rough estimate that over 90% of all data are unstructured, clearly there is a large potential for these types of analytics to be applied in business.

However, analyzing unstructured data is inherently complex, and the often-significant development costs appear to pay off only in settings where these techniques add substantially to the easier-to-apply structured analytics. As a result, we currently see relatively few applications in business being developed and implemented. In this book, we therefore focus on analytics for analyzing structured data, and more specifically the subset listed in Table 1.1. For unstructured analytics, one may refer to the specialized literature (Elder IV and Thomas 2012; Chakraborty, Murali, and Satish 2013; Coussement 2014; Verbeke, Martens and Baesens 2014; Baesens, Van Vlasselaer, and Verbeke 2015).

PROFIT-DRIVEN BUSINESS ANALYTICS

The premise of this book is that analytics is to be adopted in business for better decision making—“better” meaning optimal in terms of maximizing the net profits, returns, payoff, or value resulting from the decisions that are made based on insights obtained from data by applying analytics. The incurred returns may stem from a gain in efficiency, lower costs or losses, and additional sales, among others. The decision level at which analytics is typically adopted is the operational level, where many customized decisions are to be made that are similar and granular in nature. High-level, ad hoc decision making at strategic and tactical levels in organizations also may benefit from analytics, but expectedly to a much lesser extent.

The decisions involved in developing a business strategy are highly complex in nature and do not match the elementary tasks listed in Table 1.1. A higher-level AI would be required for such a purpose, which is not yet at our disposal. At the operational level, however, there are many simple decisions to be made, which exactly match the tasks listed in Table 1.1. This is not surprising, since these approaches have often been developed with a specific application in mind. In Table 1.5, we provide a selection of example applications, most of which will be elaborated on in detail in Chapter 3.

Table 1.5 Examples of Business Decisions Matching Analytics

Decision Making with Predictive Analytics

Classification

Credit officers have to screen loan applications and decide on whether to accept or reject an application based on the involved risk. Based on historical data on the performance of past loan applications, a classification model may learn to distinguish good from bad loan applications using a number of well-chosen characteristics of the application as well as of the applicant. Analytics and, more specifically, classification techniques allow us to optimize the loan-granting process by more accurately assessing risk and reducing bad loan losses (Van Gestel and Baesens 2009; Verbraken et al. 2014). Similar applications of decision making based on classification techniques, which are discussed in more detail in Chapter 3 of this book, include customer churn prediction, response modeling, and fraud detection.

Regression

Regression models allow us to estimate a continuous target value and in practice are being adopted, for instance, to estimate customer lifetime value. Having an indication on the future worth in terms of revenues or profits a customer will generate is important to allow customization of marketing efforts, for pricing, etc. As is discussed in detail in Chapter 3, analyzing historical customer data allows estimating the future net value of current customers using a regression model. Similar applications involve loss given default modeling as is discussed in Chapter 3, as well as the estimation of software development costs (Dejaeger et al. 2012).

Survival analysis

Survival analysis is being adopted in predictive maintenance applications for estimating when a machine component will fail. Such knowledge allows us to optimize decisions related to machine maintenance, for instance, to optimally plan when to replace a vital component. This decision requires striking a balance between the cost of machine failure during operations and the cost of the component, which is preferred to be operated as long as possible before replacing it (Widodo and Yang 2011). Alternative business applications of survival analysis involve the prediction of time to churn and time to default where, compared to classification, the focus is on predicting when the event will occur rather than whether the event will occur.

Forecasting

A typical application of forecasting involves demand forecasting, which allows us to optimize production planning and supply chain management decisions. For instance, a power supplier needs to be able to balance electricity production and demand by the consumers and for this purpose adopts forecasting or time-series modeling techniques. These approaches allow an accurate prediction of the short-term evolution of demand based on historical demand patterns (Hyndman et al. 2008).

Decision Making with Descriptive Analytics

Clustering

Clustering is applied in credit-card fraud detection to block suspicious transactions in real time or to select suspicious transactions for investigation in near-real time. Clustering facilitates automated decision making by comparing a new transaction to clusters or groups of historical nonfraudulent transactions and by labeling it as suspicious when it differs too much from these groups (Baesens et al. 2015). Clustering can also be used for identifying groups of similar customers, which facilitates the customization of marketing campaigns.

Association analysis and sequence analysis

Association analysis is often applied for detecting patterns within transactional data in terms of products that are often purchased together. Sequence analysis, on the other hand, allows the detection of which products are often bought subsequently. Knowledge of such associations allows smarter decisions to be made about which products to advertise, to bundle, to place together in a store, etc. (Agrawal and Srikant 1994).

Analytics facilitates optimization of the fine granular decision-making activities listed in Table 1.5, leading to lower costs or losses and higher revenues and profits. The level of optimization depends on the accuracy and validity of the predictions, estimates, or patterns derived from the data. Additionally, as we stress in this book, the quality of data-driven decision making depends on the extent to which the actual use of the predictions, estimates, or patterns is accounted for in developing and applying analytical approaches. We argue that the actual goal, which in a business setting is to generate profits, should be central when applying analytics in order to further increase the return on analytics. For this, we need to adopt what we call profit-driven analytics. These are adapted techniques specifically configured for use in a business context.

Example

The following example highlights the tangible difference between a statistical approach to analytics and a profit-driven approach. Table 1.5 already indicated the use of analytics and, more specifically, classification techniques for predicting which customers are about to churn. Having such knowledge allows us to decide which customers are to be targeted in a retention campaign, thereby increasing the efficiency and returns of that campaign when compared to randomly or intuitively selecting customers. By offering a financial incentive to customers that are likely to churn—for instance, a temporary reduction of the monthly fee—they may be retained. Actively retaining customers has been shown by various studies to be much cheaper than acquiring new customers to replace those who defect (Athanassopoulos 2000; Bhattacharya 1998).

It needs to be noted, however, that not every customer generates the same amount of revenues and therefore represents the same value to a company. Hence, it is much more important to detect churn for the most valuable customers. In a basic customer churn prediction setup, which adopts what we call a statistical perspective, no differentiation is made between high-value and low-value customers when learning a classification model to detect future churn. However, when analyzing data and learning a classification model, it should be taken into account that missing a high-value churner is much costlier than missing a low-value churner. The aim of this would be to steer or tune the resulting predictive model so it accounts for value, and consequently for its actual end-use in a business context.
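The difference between the two perspectives can be illustrated with a small sketch. The customers, their values, and the two models "A" and "B" below are entirely hypothetical; the point is only that counting missed churners and summing the value of missed churners can rank the same two models differently.

```python
# Hypothetical churners, their customer value, and whether each model flagged them.
churners = [
    {"id": "C1", "value": 1200.0, "flagged_a": False, "flagged_b": True},
    {"id": "C2", "value": 150.0, "flagged_a": True, "flagged_b": False},
    {"id": "C3", "value": 900.0, "flagged_a": True, "flagged_b": True},
]


def missed_count(rows, flag):
    """Statistical view: every missed churner counts the same."""
    return sum(1 for r in rows if not r[flag])


def missed_value(rows, flag):
    """Value-centric view: a missed churner costs the value of that customer."""
    return sum(r["value"] for r in rows if not r[flag])


for name, flag in (("A", "flagged_a"), ("B", "flagged_b")):
    print(name, missed_count(churners, flag), missed_value(churners, flag))
# A 1 1200.0
# B 1 150.0
```

Both models miss exactly one churner, so a count-based measure cannot separate them, whereas the value-based view clearly favors model B in this constructed example.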

An additional difference between the statistical and business perspectives toward adopting classification and regression modeling concerns the difference between, respectively, explaining and predicting (Breiman 2001; Shmueli and Koppius 2011). The aim of estimating a model may be either of these two goals:

1. To establish the relation or detect dependencies between characteristics or independent variables and one or more observed dependent target variables or outcome values.

2. To estimate or predict the unobserved or future value of the target variable as a function of the independent variables.

For instance, in a medical setting, the purpose of analyzing data may be to establish the impact of smoking behavior on the life expectancy of an individual. A regression model may be estimated that explains the observed age at death of a number of subjects in terms of characteristics such as gender and number of years that the subject smoked. Such a model will establish or quantify the impact or relation between each characteristic and the observed outcome, and allows for testing the statistical significance of the impact and measuring the uncertainty of the result (Cao 2016; Peto, Whitlock, and Jha 2010).

A clear distinction exists with estimating a regression model for, as an example, software effort prediction, as introduced in Table 1.5. In such applications where the aim is mainly to predict, essentially we are not interested in what drivers explain how much effort it will take to develop new software, although this may be a useful side result. Instead we mainly wish to predict as accurately as possible the effort that will be required for completing a project. Since the model's main use will be to produce an estimate allowing cost projection and planning, it is the exactness or accuracy of the prediction and the size of the errors that matters, rather than the exact relation between the effort and characteristics of the project.

Typically, in a business setting, the aim is to predict in order to facilitate improved or automated decision making. Explaining, as indicated for the case of software effort prediction, may have use as well since useful insights may be derived. For instance, from the predictive model, it may be found what the exact impact is of including more or less senior and junior programmers in a project team on the required effort to complete the project, allowing the team composition to be optimized as a function of project characteristics.

In this book, several versatile and powerful profit-driven approaches are discussed. These approaches facilitate the adoption of a value-centric business perspective toward analytics in order to boost the returns. Table 1.6 provides an overview of the structure of the book. First, we lay the foundation by providing a general introduction to analytics in Chapter 2, and by discussing the most important and popular business applications in detail in Chapter 3.

Table 1.6 Outline of the Book

Book Structure

Chapter 1: A Value-Centric Perspective Towards Analytics

Chapter 2: Analytical Techniques

Chapter 3: Business Applications

Chapter 4: Uplift Modeling

Chapter 5: Profit-Driven Analytical Techniques

Chapter 6: Profit-Driven Model Evaluation and Implementation

Chapter 7: Economic Impact

Chapter 4 discusses approaches toward uplift modeling, which in essence is about distilling or estimating the net effect of a decision and then contrasting the expected result for alternative scenarios. This allows, for instance, the optimization of marketing efforts by customizing the contact channel and the format of the incentive for the response to the campaign to be maximal in terms of returns being generated. Standard analytical approaches may be adopted to develop uplift models. However, specialized approaches tuned toward the particular problem characteristics of uplift modeling have also been developed, and they are discussed in Chapter 4.
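The notion of a net effect can be sketched with a simple comparison between a treated group and a control group. This is a basic two-group difference in response rates, not one of the specialized uplift techniques of Chapter 4, and all campaign names and counts are hypothetical.

```python
def response_rate(responders, group_size):
    """Share of a group that responded."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return responders / group_size


def uplift(treated, control):
    """Net effect: response rate with the treatment minus response rate without it.

    Both arguments are (responders, group_size) pairs.
    """
    return response_rate(*treated) - response_rate(*control)


# Hypothetical campaign results: (responders, customers contacted).
campaigns = {
    "email_discount": (120, 1000),
    "call_discount": (150, 1000),
}
control_group = (80, 1000)  # customers who were not contacted

uplifts = {name: uplift(result, control_group) for name, result in campaigns.items()}
best_campaign = max(uplifts, key=uplifts.get)
print(uplifts, best_campaign)
```

Part of the observed response would have occurred anyway, which is why the control group's response rate is subtracted before alternative scenarios are contrasted.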

As such, Chapter 4 forms a bridge to Chapter 5 of the book, which concentrates on various advanced analytical approaches that can be adopted for developing profit-driven models by allowing us to account for profit when learning or applying a predictive or descriptive model. Profit-driven predictive analytics for classification and regression are discussed in the first part of Chapter 5, whereas the second part focuses on descriptive analytics and introduces profit-oriented segmentation and association analysis.

Chapter 6 subsequently focuses on approaches that are tuned toward a business-oriented evaluation of predictive models, for example, in terms of profits. Note that traditional statistical measures, when applied to customer churn prediction models, for instance, do not differentiate among incorrectly predicted or classified customers, whereas it definitely makes sense from a business point of view to account for the value of the customers when evaluating a model. For instance, failing to detect a high-value customer who is about to churn represents a higher loss or cost than failing to detect a low-value customer who is about to churn. Both, however, are accounted for equally by nonbusiness and, more specifically, non-profit-oriented evaluation measures. Both Chapters 4 and 6 allow using standard analytical approaches as discussed in Chapter 2, with the aim to maximize profitability by adopting, respectively, a profit-centric setup or profit-driven evaluation. The particular business application of the model will appear to be an important factor to account for in maximizing profitability.

Finally, Chapter 7 concludes the book by adopting a broader perspective toward the use of analytics in an organization by looking into the economic impact, as well as by zooming into some practical concerns related to the development, implementation, and operation of analytics within an organization.

ANALYTICS PROCESS MODEL

Figure 1.1 provides a high-level overview of the analytics process model (Hand, Mannila, and Smyth 2001; Tan, Steinbach, and Kumar 2005; Han and Kamber 2011; Baesens 2014). This model defines the subsequent steps in the development, implementation, and operation of analytics within an organization.

Figure 1.1 The analytics process model.

(Baesens 2014)

As a first step, a thorough definition of the business problem to be addressed is needed. The objective of applying analytics needs to be unambiguously defined. Some examples are: customer segmentation of a mortgage portfolio, retention modeling for a postpaid Telco subscription, or fraud detection for credit cards. Defining the perimeter of the analytical modeling exercise requires a close collaboration between the data scientists and business experts. Both parties need to agree on a set of key concepts; these may include how we define a customer, transaction, churn, or fraud. Whereas this may seem self-evident, making sure that all involved stakeholders share a common understanding of the goal and of these key concepts appears to be a crucial success factor.

Next, all source data that could be of potential interest need to be identified. This is a very important step as data are the key ingredient to any analytical exercise and the selection of data will have a deterministic impact on the analytical models that will be built in a subsequent step. The golden rule here is: the more data, the better! The analytical model itself will later decide which data are relevant and which are not for the task at hand. All data will then be gathered and consolidated in a staging area which could be, for example, a data warehouse, data mart, or even a simple spreadsheet file. Some basic exploratory data analysis can then be considered using for instance OLAP facilities for multidimensional analysis (e.g., roll-up, drill down, slicing and dicing). This will be followed by a data-cleaning step to get rid of all inconsistencies such as missing values, outliers and duplicate data. Additional transformations may also be considered such as binning, alphanumeric to numeric coding, geographical aggregation, to name a few, as well as deriving additional characteristics that are typically called features from the raw data. A simple example concerns the derivation of the age from the birth date; yet more complex examples are provided in Chapter 3.
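A few of the preprocessing steps mentioned above (removing duplicates, handling missing values, deriving age from the birth date, and binning) can be sketched as follows. The records, the reference date, and the bin boundaries are hypothetical, and dropping records with a missing birth date is only one of several possible treatments.

```python
from datetime import date

# Hypothetical raw records with a duplicate and a missing value.
raw_records = [
    {"customer": "John", "birth_date": date(1990, 5, 17)},
    {"customer": "Sarah", "birth_date": date(1995, 11, 2)},
    {"customer": "Sarah", "birth_date": date(1995, 11, 2)},  # duplicate
    {"customer": "David", "birth_date": None},               # missing value
]


def age_on(birth_date, reference):
    """Derive age in full years at the reference date."""
    years = reference.year - birth_date.year
    if (reference.month, reference.day) < (birth_date.month, birth_date.day):
        years -= 1  # birthday has not yet occurred in the reference year
    return years


def age_bin(age):
    """Binning: map a numeric age to a coarse category (assumed boundaries)."""
    if age < 30:
        return "<30"
    if age < 50:
        return "30-49"
    return "50+"


reference_date = date(2020, 6, 1)
seen, clean_records = set(), []
for record in raw_records:
    key = (record["customer"], record["birth_date"])
    if record["birth_date"] is None or key in seen:
        continue  # skip missing values and exact duplicates
    seen.add(key)
    age = age_on(record["birth_date"], reference_date)
    clean_records.append({"customer": record["customer"], "age": age, "age_bin": age_bin(age)})

print(clean_records)
```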

In the analytics step, an analytical model will be estimated on the preprocessed and transformed data. Depending on the business objective and the exact task at hand, a particular analytical technique will be selected and implemented by the data scientist. In Table 1.1, an overview was provided of various tasks and types of analytics. Alternatively, one may consider the various types of analytics listed in Table 1.1 to be the basic building blocks or solution components that a data scientist employs to solve the problem at hand. In other words, the business problem needs to be reformulated in terms of the available tools enumerated in Table 1.1.

Finally, once the results are obtained, they will be interpreted and evaluated by the business experts. Results may be clusters, rules, patterns, or relations, among others, all of which will be called analytical models resulting from applying analytics. Trivial patterns (e.g., an association rule is found stating that spaghetti and spaghetti sauce are often purchased together) that may be detected by the analytical model are interesting as they help to validate the model. But of course, the key issue is to find the unknown yet interesting and actionable patterns (sometimes also referred to as knowledge diamonds) that can provide new insights into your data that can then be translated into new profit opportunities. Before putting the resulting model or patterns into operation, an important evaluation step is to consider the actual returns or profits that will be generated, and to compare these to a relevant base scenario such as a do-nothing decision or a change-nothing decision. The next section provides an overview of various evaluation criteria that can be used to validate analytical models.
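The comparison with a base scenario can be sketched as a simple calculation. All figures below (group size, churn rates, customer value, and incentive cost) are hypothetical assumptions, and the profit formula is deliberately simplified to retained value minus campaign cost.

```python
def expected_profit(n_customers, churn_rate, customer_value, cost_per_customer=0.0):
    """Value of the customers expected to stay, minus the cost of acting on all of them."""
    retained = n_customers * (1.0 - churn_rate)
    return retained * customer_value - n_customers * cost_per_customer


# Base scenario: do nothing, so 30% of the targeted customers are assumed to churn.
baseline = expected_profit(1000, churn_rate=0.30, customer_value=200.0)

# Model-driven retention campaign: assumed to lower churn to 20% at a cost of 10 per customer.
campaign = expected_profit(1000, churn_rate=0.20, customer_value=200.0, cost_per_customer=10.0)

incremental_profit = campaign - baseline
print(round(baseline), round(campaign), round(incremental_profit))  # 140000 150000 10000
```

Only the incremental profit relative to the base scenario, not the absolute profit of the campaign, indicates whether putting the model into operation is worthwhile.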

Once the analytical model has been appropriately validated and approved, it can be put into production as an analytics application (e.g., decision support system, scoring engine). Important considerations here are how to represent the model output in a user-friendly way, how to integrate it with other applications (e.g., marketing campaign management tools, risk engines), and how to make sure the analytical model can be appropriately monitored and backtested on an ongoing basis.

It is important to note that the process model outlined in Figure 1.1 is iterative in nature in the sense that one may have to return to previous steps during the exercise. For instance, during the analytics step, a need for additional data may be identified that will necessitate additional data selection, cleaning, and transformation. The most time-consuming step typically is the data selection and preprocessing step, which usually takes around 80% of the total efforts needed to build an analytical model.

ANALYTICAL MODEL EVALUATION

Before adopting an analytical model and making operational decisions based on the obtained clusters, rules, patterns, relations, or predictions, the model needs to be thoroughly evaluated. Depending on the exact type of output, the setting or business environment, and the particular usage characteristics, different aspects may need to be assessed during evaluation in order to ensure the model is acceptable for implementation.

A number of key characteristics of successful analytical models are defined and explained in Table 1.7. These broadly defined evaluation criteria may or may not apply, depending on the exact application setting, and will have to be further specified in practice.

Table 1.7 Key Characteristics of Successful Business Analytics Models

Accuracy

Refers to the predictive power or the correctness of the analytical model. Several statistical evaluation criteria exist and may be applied to assess this aspect, such as the hit rate, lift curve, or AUC. A number of profit-driven evaluation measures will be discussed in detail in Chapter 6. Accuracy may also refer to statistical significance, meaning that the patterns that have been found in the data have to be real, robust, and not the consequence of coincidence. In other words, we need to make sure that the model generalizes well (to other entities, to the future, etc.) and is not overfitted to the historical dataset that was used for deriving or estimating the model.

Interpretability

When a deeper understanding of the retrieved patterns is required, for instance, to validate the model before it is adopted for use, a model needs to be interpretable. This aspect involves a certain degree of subjectivism, since interpretability may depend on the user's knowledge or skills. The interpretability of a model depends on its format, which, in turn, is determined by the adopted analytical technique. Models that allow the user to understand the underlying reasons as to why the model arrives at a certain result are called white-box models, whereas complex incomprehensible mathematical models are often referred to as black-box models. White-box approaches include, for instance, decision trees and linear regression models, examples of which have been provided in Table 1.2. A typical example of a black-box approach concerns neural networks, which are discussed in Chapter 2.