Maximize profit and optimize decisions with advanced business analytics
Profit-Driven Business Analytics provides actionable guidance on optimizing the use of data to add value and drive better business. Combining theoretical and technical insights into daily operations and long-term strategy, this book acts as a development manual for practitioners seeking to conceive, develop, and manage advanced analytical models. Detailed discussion delves into the wide range of analytical approaches and modeling techniques that can help maximize business payoff, and the author team draws upon their recent research to share deep insight about optimal strategy. Real-life case studies and examples illustrate these techniques at work, and provide clear guidance for implementation in your own organization. From step-by-step instruction on data handling, to analytical fine-tuning, to evaluating results, this guide provides invaluable guidance for practitioners seeking to reap the advantages of true business analytics.
Despite widespread discussion surrounding the value of data in decision making, few businesses have adopted advanced analytic techniques in any meaningful way. This book shows you how to delve deeper into the data and discover what it can do for your business.
* Reinforce basic analytics to maximize profits
* Adopt the tools and techniques of successful integration
* Implement more advanced analytics with a value-centric approach
* Fine-tune analytical information to optimize business decisions
Both stored and streamed data have been increasing at an exponential rate, and failing to use them to the fullest advantage equates to leaving money on the table. From bolstering current efforts to implementing a full-scale analytics initiative, the vast majority of businesses will see greater profit by applying advanced methods. Profit-Driven Business Analytics provides a practical guidebook and reference for adopting real business analytics techniques.
Page count: 623
Publication year: 2017
Cover
Wiley & SAS Business Series
Title Page
Foreword
Acknowledgments
CHAPTER 1: A Value-Centric Perspective Towards Analytics
INTRODUCTION
PROFIT-DRIVEN BUSINESS ANALYTICS
ANALYTICS PROCESS MODEL
ANALYTICAL MODEL EVALUATION
ANALYTICS TEAM
CONCLUSION
REVIEW QUESTIONS
REFERENCES
CHAPTER 2: Analytical Techniques
INTRODUCTION
DATA PREPROCESSING
TYPES OF ANALYTICS
PREDICTIVE ANALYTICS
ENSEMBLE METHODS
EVALUATING PREDICTIVE MODELS
DESCRIPTIVE ANALYTICS
SURVIVAL ANALYSIS
SOCIAL NETWORK ANALYTICS
CONCLUSION
REVIEW QUESTIONS
NOTES
REFERENCES
CHAPTER 3: Business Applications
INTRODUCTION
MARKETING ANALYTICS
FRAUD ANALYTICS
CREDIT RISK ANALYTICS
HR ANALYTICS
CONCLUSION
REVIEW QUESTIONS
NOTE
REFERENCES
CHAPTER 4: Uplift Modeling
INTRODUCTION
EXPERIMENTAL DESIGN, DATA COLLECTION, AND DATA PREPROCESSING
UPLIFT MODELING METHODS
EVALUATION OF UPLIFT MODELS
PRACTICAL GUIDELINES
CONCLUSION
REVIEW QUESTIONS
NOTE
REFERENCES
CHAPTER 5: Profit-Driven Analytical Techniques
INTRODUCTION
PROFIT-DRIVEN PREDICTIVE ANALYTICS
COST-SENSITIVE CLASSIFICATION
COST-SENSITIVE REGRESSION
COST-SENSITIVE LEARNING FOR REGRESSION
PROFIT-DRIVEN DESCRIPTIVE ANALYTICS
CONCLUSION
REVIEW QUESTIONS
NOTES
REFERENCES
CHAPTER 6: Profit-Driven Model Evaluation and Implementation
INTRODUCTION
PROFIT-DRIVEN EVALUATION OF CLASSIFICATION MODELS
PROFIT-DRIVEN EVALUATION OF REGRESSION MODELS
CONCLUSION
REVIEW QUESTIONS
NOTES
REFERENCES
CHAPTER 7: Economic Impact
INTRODUCTION
ECONOMIC VALUE OF BIG DATA AND ANALYTICS
KEY ECONOMIC CONSIDERATIONS
IMPROVING THE ROI OF BIG DATA AND ANALYTICS
CONCLUSION
REVIEW QUESTIONS
NOTES
REFERENCES
About the Authors
Index
End User License Agreement
Chapter 1
Table 1.1 Categories of Analytics from a Task-Oriented Perspective
Table 1.2 Example Datasets and Predictive Analytical Models
Table 1.3 Example Datasets and Descriptive Analytical Models
Table 1.4 Structured Dataset
Table 1.5 Examples of Business Decisions Matching Analytics
Table 1.6 Outline of the Book
Table 1.7 Key Characteristics of Successful Business Analytics Models
Chapter 2
Table 2.1 Missing Values in a Dataset
Table 2.2 Dataset for Linear Regression
Table 2.3 Example Classification Dataset
Table 2.4 Reference Values for Variable Significance
Table 2.5 Example Dataset for Performance Calculation
Table 2.6 The Confusion Matrix
Table 2.7 Receiver Operating Characteristic (ROC) Analysis
Table 2.8 Example Transaction Dataset
Table 2.9 The Lift Measure
Table 2.10 Example Transaction Dataset (left) and Sequential Dataset (right) for Sequence Rule Mining
Table 2.11 Matrix Representation of a Social Network
Table 2.12 Network Centrality Measures
Table 2.13 Centrality measures for the Kite network
Chapter 3
Table 3.1 K-Means Clustering Sample Output
Table 3.2 Event Log of Customer Activities
Table 3.3 Example User-Item Matrix
Table 3.4 Example Call Detail Record Dataset for Fraud Detection
Chapter 4
Table 4.1 Overview of Model and Campaign Effect Measurement
Table 4.2 Example Model and Campaign Effect Measurement
Table 4.3 Dataset Including Treatment Dummy Variable t, Predictor Variables x_i and Target Variable y
Table 4.4 Relabeled Dataset of Table 4.3 Following Lai's Method
Table 4.5 Relabeled Dataset of Table 4.3 Following the Generalized Lai Method
Chapter 5
Table 5.1 Confusion Matrix
Table 5.2 Cost Matrix for a Binary Classification Problem
Table 5.3 Example Cost Matrix for the German Credit Data
Table 5.4 Cost Matrix for an Ordinal Classification Problem
Table 5.5 Simplified Cost Matrix
Table 5.6 Structured Overview of the Cost-Sensitive Classification Approaches Discussed in This Chapter
Table 5.7 Overview of Sampling Based Ensemble Approaches for Cost-Sensitive Learning
Table 5.8 Overview of Weighting Approaches for Cost-Sensitive Learning
Table 5.9 Overview of Cost-Sensitive Decision-Tree Approaches
Table 5.10 Overview of Cost-Sensitive Boosting Based Approaches
Table 5.11 Example of the MetaCost Approach
Table 5.12 Elaborated Example of Algorithm 5.1
Table 5.13 Average CLV and Fraction of Observations per Segment
Table 5.14 Average CLV, RFM Variables and Cost of Service for the Observations in the Five K-means Clusters
Chapter 6
Table 6.1 Cost Matrix for a Binary Classification Problem
Table 6.2 Cost Matrix for the German Credit Data Example Dataset
Table 6.3 Cutoff Points, Accuracies, and Costs. The cutoff with the best accuracy is 0.90, and the one with the lowest average misclassification cost is 0.55
Table 6.4 Accuracy and AUC for the Five Candidate Models
Table 6.5 AUC and AUC of the Convex Hull for the Five Candidate Models
Table 6.6 H-Measure for the Five Candidate Models. Model 3 is the selected one, not Model 5
Table 6.7 Confusion Matrix for a Binary Classification Problem
Table 6.8 EMP and Selected Fraction per Model
Table 6.9 Benefits, Costs, and Profits for the Test Set. Model 3 gives the better performance, as expected
Table 6.10 Observation-Dependent Cost Matrix for Credit Scoring
Table 6.11 Overview of Standard Regression Evaluation Measures
Table 6.12 Data for REC Curve
Table 6.13 Data for REC Surface
Chapter 7
Table 7.1 Example Costs for Calculating Total Cost of Ownership (TCO)
Table 7.2 Data Quality Dimensions (Wang et al. 1996)
Chapter 1
Figure 1.1 The analytics process model.
Figure 1.2 Profile of a data scientist.
Chapter 2
Figure 2.1 Aggregating normalized data tables into a non-normalized data table.
Figure 2.2 Example dataset showing an ellipse rotated in 45 degrees.
Figure 2.3 PCA of the simulated data.
Figure 2.4 OLS regression.
Figure 2.5 Bounding function for logistic regression.
Figure 2.6 Linear decision boundary of logistic regression.
Figure 2.7 Calculating the p-value with a Student's t-distribution.
Figure 2.8 Variable subsets for four variables x_1, x_2, x_3, and x_4.
Figure 2.9 Example decision tree.
Figure 2.10 Example datasets for calculating impurity.
Figure 2.11 Entropy versus Gini.
Figure 2.12 Calculating the entropy for age split.
Figure 2.13 Using a validation set to stop growing a decision tree.
Figure 2.14 Decision boundary of a decision tree.
Figure 2.15 Example regression tree for predicting the fraud percentage.
Figure 2.16 Neural network representation of logistic regression.
Figure 2.17 A Multilayer Perceptron (MLP) neural network.
Figure 2.18 Local versus global minima.
Figure 2.19 Using a validation set for stopping neural network training.
Figure 2.20 Training and test set split-up for performance estimation.
Figure 2.21 Cross-validation for performance measurement.
Figure 2.22 Bootstrapping.
Figure 2.23 Receiver operating characteristic curve.
Figure 2.24 The lift curve.
Figure 2.25 The cumulative accuracy profile (CAP).
Figure 2.26 Calculating the accuracy ratio.
Figure 2.27 Scatter plot.
Figure 2.28 Hierarchical versus nonhierarchical clustering techniques.
Figure 2.29 Divisive versus agglomerative hierarchical clustering.
Figure 2.30 Euclidean versus Manhattan distance.
Figure 2.31 Calculating distances between clusters.
Figure 2.32 Example for clustering birds. The numbers indicate the clustering steps.
Figure 2.33 Dendrogram for birds example. The red line indicates the optimal clustering.
Figure 2.34 Scree plot for clustering.
Figure 2.35 Rectangular versus hexagonal SOM grid.
Figure 2.36 Clustering countries using SOMs.
Figure 2.37 Component plane for literacy.
Figure 2.38 Component plane for political rights.
Figure 2.39 Example of right censoring for churn prediction.
Figure 2.40 Example of a discrete event time distribution.
Figure 2.41 Cumulative distribution and survival function for the event time distribution in Figure 2.40.
Figure 2.42 Sample hazard shapes.
Figure 2.43 Kaplan Meier example.
Figure 2.44 Exponential event time distribution, with cumulative distribution and hazard function.
Figure 2.45 Weibull distributions.
Figure 2.46 The proportional hazards model.
Figure 2.47 Sociogram representation of a social network.
Figure 2.48 The Kite network.
Figure 2.49 Example social network for relational neighbor classifier.
Figure 2.50 Example social network for probabilistic relational neighbor classifier.
Figure 2.51 Relational logistic regression.
Figure 2.52 Example of featurization with features describing target behavior of neighbors.
Figure 2.53 Example of featurization with features describing local node behavior of neighbors.
Chapter 3
Figure 3.1 Constructing an RFM score (independent sorting).
Figure 3.2 Constructing an RFM score (dependent sorting).
Figure 3.3 Cluster profiling using histograms.
Figure 3.4 Using decision trees for clustering interpretation.
Figure 3.5 Example Markov chain (Pfeifer and Carraway 2000).
Figure 3.6 Customer journey in a mortgage sales process.
Figure 3.7 Example social network for fraud detection.
Figure 3.8 Multilevel credit risk model architecture.
Figure 3.9 Example employee network.
Chapter 4
Figure 4.1 Four types of customers identified as a function of purchasing behavior when treated or not treated.
Figure 4.2 Experimental design to collect the required data for uplift modeling, allowing the selection of a model base for the campaign.
Figure 4.3 Categorization of customers based on whether a customer was treated and whether the customer responded.
Figure 4.4 (a) High uplift but low number of observations in the left child node versus (b) lower uplift but applicable to a higher number of observations.
Figure 4.5 Illustration of Gain_U calculation.
Figure 4.6 Response rate by decile graph for both treatment and control groups (upper panel) and uplift by deciles graph (lower panel).
Figure 4.7 Response rate curve for the perfect uplift model, plotting the response rates for the treatment and control groups ranked according to estimated uplift.
Figure 4.8 Uplift by decile curve of an accurate uplift model.
Figure 4.9 Cumulative incremental gains charts or Qini curves for two uplift models and the baseline model.
Chapter 5
Figure 5.1 Oversampling the fraudsters.
Figure 5.2 Undersampling the nonfraudsters.
Figure 5.3 Synthetic minority oversampling technique (SMOTE).
Figure 5.4 Quadratic cost versus true cost as a function of the prediction error.
Figure 5.5 Linlin cost function C_linlin in function of the prediction error e.
Figure 5.6 Average misprediction cost in function of the adjustment δ.
Figure 5.7 Customer lifetime value distribution for a dataset containing 1,000 customers.
Figure 5.8 Three-cut strategy for CLV segmentation.
Figure 5.9 Three-group customer segmentation.
Figure 5.10 Plot of the first two principal components following a K-means clustering of the CLV example dataset.
Figure 5.11 Density maps of CLV example dataset SOMs of different sizes.
Figure 5.12 Distance map of the CLV example dataset SOM.
Figure 5.13 Codebook vector graph of the CLV example dataset SOM.
Figure 5.14 Heatmaps for the variables in the CLV example dataset.
Figure 5.15 Dendrogram plot of the hierarchical clustering procedure using single linkage.
Figure 5.16 Codebook vector graph with clustering limits superimposed.
Chapter 6
Figure 6.1 Illustration of the two-cutoff point strategy.
Figure 6.2 Receiver operating characteristic curve.
Figure 6.3 Convex hull for a nonconcave ROC curve.
Figure 6.4 Beta distribution for different values of the parameters α and β.
Figure 6.5 ROC curves for five credit risk models.
Figure 6.6 Customer churn management process. Adapted from Verbraken et al. (2013).
Figure 6.7 LGD histogram and percentage of observations per score group.
Figure 6.8 Regression error characteristic (REC) curve.
Figure 6.9 Regression error characteristic surface following the data in Table 6.13.
Chapter 7
Figure 7.1 ROI of big data and analytics.
The Wiley & SAS Business Series presents books that help senior-level managers with their critical management decisions.
Titles in the Wiley & SAS Business Series include:
Analytics: The Agile Way
by Phil Simon
Analytics in a Big Data World: The Essential Guide to Data Science and its Applications
by Bart Baesens
A Practical Guide to Analytics for Governments: Using Big Data for Good
by Marie Lowman
Bank Fraud: Using Technology to Combat Losses
by Revathi Subramanian
Big Data Analytics: Turning Big Data into Big Money
by Frank Ohlhorst
Big Data, Big Innovation: Enabling Competitive Differentiation through Business Analytics
by Evan Stubbs
Business Analytics for Customer Intelligence
by Gert Laursen
Business Intelligence Applied: Implementing an Effective Information and Communications Technology Infrastructure
by Michael Gendron
Business Intelligence and the Cloud: Strategic Implementation Guide
by Michael S. Gendron
Business Transformation: A Roadmap for Maximizing Organizational Insights
by Aiman Zeid
Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media
by Frank Leistner
Data-Driven Healthcare: How Analytics and BI are Transforming the Industry
by Laura Madsen
Delivering Business Analytics: Practical Guidelines for Best Practice
by Evan Stubbs
Demand-Driven Forecasting: A Structured Approach to Forecasting, Second Edition
by Charles Chase
Demand-Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain
by Robert A. Davis
Developing Human Capital: Using Analytics to Plan and Optimize Your Learning and Development Investments
by Gene Pease, Barbara Beresford, and Lew Walker
The Executive's Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business
by David Thomas and Mike Barlow
Economic and Business Forecasting: Analyzing and Interpreting Econometric Results
by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah Watt, and Sam Bullard
Economic Modeling in the Post Great Recession Era: Incomplete Data, Imperfect Markets
by John Silvia, Azhar Iqbal, and Sarah Watt House
Enhance Oil & Gas Exploration with Data-Driven Geophysical and Petrophysical Models
by Keith Holdaway and Duncan Irving
Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications
by Robert Rowan
Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection
by Bart Baesens, Veronique Van Vlasselaer, and Wouter Verbeke
Harness Oil and Gas Big Data with Analytics: Optimize Exploration and Production with Data-Driven Models
by Keith Holdaway
Health Analytics: Gaining the Insights to Transform Health Care
by Jason Burke
Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World
by Carlos Andre Reis Pinheiro and Fiona McNeill
Human Capital Analytics: How to Harness the Potential of Your Organization's Greatest Asset
by Gene Pease, Boyce Byerly, and Jac Fitz-enz
Implement, Improve and Expand Your Statewide Longitudinal Data System: Creating a Culture of Data in Education
by Jamie McQuiggan and Armistead Sapp
Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards, Second Edition
by Naeem Siddiqi
Killer Analytics: Top 20 Metrics Missing from your Balance Sheet
by Mark Brown
Machine Learning for Marketers: Hold the Math
by Jim Sterne
On-Camera Coach: Tools and Techniques for Business Professionals in a Video-Driven World
by Karin Reed
Predictive Analytics for Human Resources
by Jac Fitz-enz and John Mattox II
Predictive Business Analytics: Forward-Looking Capabilities to Improve Business Performance
by Lawrence Maisel and Gary Cokins
Profit Driven Business Analytics: A Practitioner's Guide to Transforming Big Data into Added Value
by Wouter Verbeke, Cristian Bravo, and Bart Baesens
Retail Analytics: The Secret Weapon
by Emmett Cox
Social Network Analysis in Telecommunications
by Carlos Andre Reis Pinheiro
Statistical Thinking: Improving Business Performance, Second Edition
by Roger W. Hoerl and Ronald D. Snee
Strategies in Biomedical Data Science: Driving Force for Innovation
by Jay Etchings
Style & Statistic: The Art of Retail Analytics
by Brittany Bullard
Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics
by Bill Franks
Too Big to Ignore: The Business Case for Big Data
by Phil Simon
The Analytic Hospitality Executive
by Kelly A. McGuire
The Value of Business Analytics: Identifying the Path to Profitability
by Evan Stubbs
The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions
by Phil Simon
Using Big Data Analytics: Turning Big Data into Big Money
by Jared Dean
Win with Advanced Business Analytics: Creating Business Value from Your Data
by Jean Paul Isson and Jesse Harriott
For more information on any of the above titles, please visit www.wiley.com.
Wouter Verbeke
Bart Baesens
Cristián Bravo
Copyright © 2018 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Cataloging-in-Publication Data is Available:
ISBN 9781119286554 (Hardcover)
ISBN 9781119286998 (ePDF)
ISBN 9781119286981 (ePub)
Cover Design: Wiley
Cover Image: © Ricardo Reitmeyer/iStockphoto
To Luit, Titus, and Fien.
To my wonderful wife, Katrien, and kids Ann-Sophie, Victor, and Hannelore.
To my parents and parents-in-law.
To Cindy, for her unwavering support.
Sandra Wilikens
Secretary General, responsible for CSR and member of the Executive Committee, BNP Paribas Fortis
In today's corporate world, strategic priorities tend to center on customer and shareholder value. One of the consequences is that analytics often focuses too much on complex technologies and statistics rather than long-term value creation. With their book Profit-Driven Business Analytics, Verbeke, Bravo, and Baesens pertinently bring forward a much-needed shift of focus that consists of turning analytics into a mature, value-adding technology. It further builds on the extensive research and industry experience of the author team, making it a must-read for anyone using analytics to create value and gain sustainable strategic leverage. This is even more true as we enter a new era of sustainable value creation in which the pursuit of long-term value has to be driven by sustainably strong organizations. The role of corporate employers is evolving as civic involvement and social contribution grow to be key strategic pillars.
It is a great pleasure to acknowledge the contributions and assistance of various colleagues, friends, and fellow analytics lovers to the writing of this book. This book is the result of many years of research and teaching in business analytics. We first would like to thank our publisher, Wiley, for accepting our book proposal.
We are grateful to the active and lively business analytics community for providing various user fora, blogs, online lectures, and tutorials, which proved very helpful.
We would also like to acknowledge the direct and indirect contributions of the many colleagues, fellow professors, students, researchers, and friends with whom we collaborated during the past years. Specifically, we would like to thank Floris Devriendt and George Petrides for contributing to the chapters on uplift modeling and profit-driven analytical techniques.
Last but not least, we are grateful to our partners, parents, and families for their love, support, and encouragement.
We have tried to make this book as complete, accurate, and enjoyable as possible. Of course, what really matters is what you, the reader, think of it. Please let us know your views by getting in touch. The authors welcome all feedback and comments—so do not hesitate to let us know your thoughts!
Wouter Verbeke
Bart Baesens
Cristián Bravo
May 2017
In this first chapter, we set the scene for what is ahead by broadly introducing profit-driven business analytics. The value-centric perspective toward analytics proposed in this book will be positioned and contrasted with a traditional statistical perspective. The implications of adopting a value-centric perspective toward the use of analytics in business are significant: a mind shift is needed both from managers and data scientists in developing, implementing, and operating analytical models. This, however, calls for deep insight into the underlying principles of advanced analytical approaches. Providing such insight is our general objective in writing this book and, more specifically:
We aim to provide the reader with a structured overview of state-of-the-art analytics for business applications.
We want to assist the reader in gaining a deeper practical understanding of the inner workings and underlying principles of these approaches from a practitioner's perspective.
We wish to advance managerial thinking on the use of advanced analytics by offering insight into how these approaches may either generate significant added value or lower operational costs by increasing the efficiency of business processes.
We seek to promote and facilitate the use of analytical approaches that are customized to needs and requirements in a business context.
As such, we envision that our book will facilitate organizations stepping up to a next level in the adoption of analytics for decision making by embracing the advanced methods introduced in the subsequent chapters of this book. Doing so requires an investment in terms of acquiring and developing knowledge and skills but, as is demonstrated throughout the book, also generates increased profits. An interesting feature of the approaches discussed in this book is that they have often been developed at the intersection of academia and business, by academics and practitioners joining forces for tuning a multitude of approaches to the particular needs and problem characteristics encountered and shared across diverse business settings.
Most of these approaches emerged only after the millennium, which should not be surprising. Since the millennium, we have witnessed a continuous and accelerating development and an expanding adoption of information, network, and database technologies. Key technological evolutions include the massive growth and success of the World Wide Web and Internet services, the introduction of smart phones, the standardization of enterprise resource planning systems, and many other applications of information technology. This dramatic change of scene has fostered the development of analytics for business applications as a rapidly growing and thriving branch of science and industry.
To achieve the stated objectives, we have chosen to adopt a pragmatic approach in explaining techniques and concepts. We do not focus on providing extensive mathematical proof or detailed algorithms. Instead, we pinpoint the crucial insights and underlying reasoning, as well as the advantages and disadvantages, related to the practical use of the discussed approaches in a business setting. For this, we ground our discourse on solid academic research expertise as well as on many years of practical experience in elaborating industrial analytics projects in close collaboration with data science professionals. Throughout the book, a plethora of illustrative examples and case studies are discussed. Example datasets, code, and implementations are provided on the book's companion website, www.profit-analytics.com, to further support the adoption of the discussed approaches.
In this chapter, we first introduce business analytics. Next, the profit-driven perspective toward business analytics that will be elaborated in this book is presented. We then introduce the subsequent chapters of this book and how the approaches introduced in these chapters allow us to adopt a value-centric approach for maximizing profitability and, as such, to increase the return on investment of big data and analytics. Next, the analytics process model is discussed, detailing the subsequent steps in elaborating an analytics project within an organization. Finally, the chapter concludes by characterizing the ideal profile of a business data scientist.
"Data is the new oil" is a popular quote pinpointing the increasing value of data and, to our liking, accurately characterizes data as raw material. Data are to be seen as an input or basic resource needing further processing before actually being of use. In a subsequent section in this chapter, we introduce the analytics process model that describes the iterative chain of processing steps involved in turning data into information or decisions, which is actually quite similar to an oil refinery process. Note the subtle but significant difference between the words "data" and "information" in the sentence above. Whereas data fundamentally can be defined to be a sequence of zeroes and ones, information essentially is the same but implies in addition a certain utility or value to the end user or recipient. So, whether data are information depends on whether the data have utility to the recipient. Typically, for raw data to be information, the data first need to be processed, aggregated, summarized, and compared. In summary, data typically need to be analyzed, and insight, understanding, or knowledge should be added for data to become useful.
Applying basic operations on a dataset may already provide useful insight and support the end user or recipient in decision making. These basic operations mainly involve selection and aggregation. Both selection and aggregation may be performed in many ways, leading to a plentitude of indicators or statistics that can be distilled from raw data. The following illustration elaborates a number of sales indicators in a retail setting.
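The selection and aggregation operations described above can be sketched in a few lines of code. The following is a minimal illustration rather than anything taken from the book: the `Transaction` record, its field names, the `"online"`/`"store"` channel encoding, and the sample data are all hypothetical, and "one month" is approximated as 30 days. It computes three retail indicators: revenue over the last 24 hours, the average paid amount online over the last seven days, and the number of returning customers within a month. Dividing that number by the count of distinct customers to obtain a fraction is an added assumption.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Transaction:
    """Hypothetical transaction record, as it might be extracted from an OLTP system."""
    customer_id: str
    timestamp: datetime
    channel: str         # "online" or "store" (assumed encoding)
    paid_amount: float   # price net of promotional offers


def select_since(transactions, now, window):
    """Selection: keep transactions inside the time window ending at `now`."""
    start = now - window
    return [t for t in transactions if start <= t.timestamp <= now]


def revenue_last_24h(transactions, now):
    """Aggregation (sum) of paid amounts over the last 24 hours."""
    recent = select_since(transactions, now, timedelta(hours=24))
    return sum(t.paid_amount for t in recent)


def avg_online_paid_last_7d(transactions, now):
    """Aggregation (average) of online paid amounts over the last seven days."""
    recent = select_since(transactions, now, timedelta(days=7))
    online = [t.paid_amount for t in recent if t.channel == "online"]
    return sum(online) / len(online) if online else None


def returning_customers_last_30d(transactions, now):
    """Aggregation (count) of customer IDs seen more than once in 30 days.

    Also returns that count divided by the number of distinct customers
    in the window (an assumption about how the fraction is defined).
    """
    recent = select_since(transactions, now, timedelta(days=30))
    visits = Counter(t.customer_id for t in recent)
    returning = sum(1 for n in visits.values() if n > 1)
    fraction = returning / len(visits) if visits else None
    return returning, fraction


# Made-up sample data for illustration only.
NOW = datetime(2017, 5, 31, 12, 0)
SALES = [
    Transaction("c1", NOW - timedelta(hours=2), "online", 40.0),
    Transaction("c2", NOW - timedelta(hours=20), "store", 60.0),
    Transaction("c1", NOW - timedelta(days=3), "online", 20.0),
    Transaction("c3", NOW - timedelta(days=10), "store", 15.0),
    Transaction("c2", NOW - timedelta(days=40), "online", 99.0),
]

if __name__ == "__main__":
    print(revenue_last_24h(SALES, NOW))
    print(avg_online_paid_last_7d(SALES, NOW))
    print(returning_customers_last_30d(SALES, NOW))
```

Each indicator reuses the same selection step and differs only in the filter and the aggregation applied, which is why a large number of indicators can be derived from the same raw transactions.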
Providing insight by customized reporting is exactly what the field of business intelligence (BI) is about. Typically, visualizations are also adopted to represent indicators and their evolution in time, in easy-to-interpret ways. Visualizations provide support by facilitating the user's ability to acquire understanding and insight in the blink of an eye. Personalized dashboards, for instance, are widely adopted in the industry and are very popular with managers to monitor and keep track of business performance. A formal definition of business intelligence is provided by Gartner (http://www.gartner.com/it-glossary):
For managerial purposes, a retailer requires the development of real-time sales reports. Such a report may include a wide variety of indicators that summarize raw sales data. Raw sales data, in fact, concern transactional data that can be extracted from the online transaction processing (OLTP) system that is operated by the retailer. Some example indicators and the required selection and aggregation operations for calculating these statistics are:
Total amount of revenues generated over the last 24 hours: Select all transactions over the last 24 hours and sum the paid amounts, with paid meaning the price net of promotional offers.
Average paid amount in online store over the last seven days: Select all online transactions over the last seven days and calculate the average paid amount.
Fraction of returning customers within one month: Select all transactions over the last month and select customer IDs that appear more than once; count these IDs and divide by the number of distinct customer IDs in the selection.
Note that calculating these indicators involves basic selection operations on characteristics or dimensions of transactions stored in the database, as well as basic aggregation operations such as sum, count, and average, among others.
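The selection and aggregation operations behind these three indicators can be sketched in a few lines of Python. This is a minimal illustration only: the transaction records, the field layout, and the reference time `NOW` are hypothetical, and the returning-customer fraction is assumed to be taken relative to the number of distinct customers seen in the last 30 days.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical reference time and transactions:
# (customer_id, channel, paid_amount, timestamp)
NOW = datetime(2024, 1, 31, 12, 0)
transactions = [
    ("C1", "online", 20.0, NOW - timedelta(hours=2)),
    ("C2", "store", 35.0, NOW - timedelta(hours=20)),
    ("C1", "online", 10.0, NOW - timedelta(days=3)),
    ("C3", "online", 60.0, NOW - timedelta(days=6)),
    ("C2", "store", 15.0, NOW - timedelta(days=12)),
    ("C4", "online", 40.0, NOW - timedelta(days=25)),
]


def revenue_last_24h(rows, now):
    """Select transactions of the last 24 hours and sum the paid amounts."""
    cutoff = now - timedelta(hours=24)
    return sum(paid for _, _, paid, ts in rows if ts >= cutoff)


def avg_online_paid_last_7d(rows, now):
    """Select online transactions of the last seven days and average them."""
    cutoff = now - timedelta(days=7)
    amounts = [paid for _, channel, paid, ts in rows
               if channel == "online" and ts >= cutoff]
    return sum(amounts) / len(amounts) if amounts else 0.0


def returning_fraction_last_month(rows, now):
    """Share of customers with more than one transaction in the last 30 days."""
    cutoff = now - timedelta(days=30)
    counts = Counter(cid for cid, _, _, ts in rows if ts >= cutoff)
    if not counts:
        return 0.0
    returning = sum(1 for n in counts.values() if n > 1)
    return returning / len(counts)


print(revenue_last_24h(transactions, NOW))               # 55.0
print(avg_online_paid_last_7d(transactions, NOW))        # 30.0
print(returning_fraction_last_month(transactions, NOW))  # 0.5
```

In an operational setting these selections and aggregations would typically run as queries against the transaction database rather than over an in-memory list.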
Business intelligence is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.
Note that this definition explicitly mentions the required infrastructure and best practices as an essential component of BI, which is typically also provided as part of the package or solution offered by BI vendors and consultants. More advanced analysis of data may further support users and optimize decision making. This is exactly where analytics comes into play. Analytics is a catch-all term covering a wide variety of what are essentially data-processing techniques. In its broadest sense, analytics strongly overlaps with data science, statistics, and related fields such as artificial intelligence (AI) and machine learning. Analytics, to us, is a toolbox containing a variety of instruments and methodologies allowing users to analyze data for a diverse range of well-specified purposes. Table 1.1 identifies a number of categories of analytical tools that cover diverse intended uses or, in other words, allow users to complete a diverse range of tasks.
Table 1.1 Categories of Analytics from a Task-Oriented Perspective
Predictive Analytics | Descriptive Analytics
Classification | Clustering
Regression | Association analysis
Survival analysis | Sequence analysis
Forecasting |
A first main group of tasks identified in Table 1.1 concerns prediction. Based on observed variables, the aim is to accurately estimate or predict an unobserved value. The applicable subtype of predictive analytics depends on the type of target variable, which we intend to model as a function of a set of predictor variables. When the target variable is categorical in nature, meaning the variable can only take a limited number of possible values (e.g., churner or not, fraudster or not, defaulter or not), then we have a classification problem. When the task concerns the estimation of a continuous target variable (e.g., sales amount, customer lifetime value, credit loss), which can take any value over a certain range of possible values, we are dealing with regression. Survival analysis and forecasting explicitly account for the time dimension by either predicting the timing of events (e.g., churn, fraud, default) or the evolution of a target variable in time (e.g., churn rates, fraud rates, default rates). Table 1.2 provides simplified example datasets and analytical models for each type of predictive analytics for illustrative purposes.
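As an illustration of the forecasting subtype, the sketch below applies a weighted moving average to the monthly demand values listed in Table 1.2. The weights (0.2, 0.3, 0.5) and the three-month window are assumptions chosen for the example; they are not taken from the table.

```python
def weighted_moving_average(history, weights):
    """Forecast the next value as a weighted average of the latest observations.

    `weights` are ordered from oldest to newest and are expected to sum to 1.
    """
    if len(history) < len(weights):
        raise ValueError("not enough observations for the chosen window")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    recent = history[-len(weights):]
    return sum(w * x for w, x in zip(weights, recent))


# Demand for January through May, as listed in Table 1.2.
demand = [513, 652, 435, 578, 601]

# Assumed weights: the most recent month counts most.
weights = [0.2, 0.3, 0.5]

forecast_june = weighted_moving_average(demand, weights)
print(round(forecast_june, 1))  # 560.9
```

Giving the largest weight to the most recent month makes the forecast react faster to recent changes in demand, at the price of being more sensitive to noise.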
Table 1.2 Example Datasets and Predictive Analytical Models
Classification
Example dataset:
ID | Recency | Frequency | Monetary | Churn
C1 | 26 | 4.2 | 126 | Yes
C2 | 37 | 2.1 | 59 | No
C3 | 2 | 8.5 | 256 | No
C4 | 18 | 6.2 | 89 | No
C5 | 46 | 1.1 | 37 | Yes
… | … | … | … | …
Predictive analytical model: decision tree classification model
Regression
Example dataset:
ID | Recency | Frequency | Monetary | CLV
C1 | 26 | 4.2 | 126 | 3,817
C2 | 37 | 2.1 | 59 | 4,31
C3 | 2 | 8.5 | 256 | 2,187
C4 | 18 | 6.2 | 89 | 543
C5 | 46 | 1.1 | 37 | 1,548
… | … | … | … | …
Predictive analytical model: linear regression model
Survival analysis
Example dataset:
ID | Recency | Churn or Censored | Time of churn or Censoring
C1 | 26 | Churn | 181
C2 | 37 | Censored | 253
C3 | 2 | Censored | 37
C4 | 18 | Censored | 172
C5 | 46 | Churn | 98
… | … | … | …
Predictive analytical model: general parametric survival analysis model
Forecasting
Example dataset:
Timestamp | Demand
January | 513
February | 652
March | 435
April | 578
May | 601
… | …
Predictive analytical model: weighted moving average forecasting model
The second main group of analytics comprises descriptive analytics that, rather than predicting a target variable, aim at identifying specific types of patterns. Clustering or segmentation aims at grouping entities (e.g., customers, transactions, employees, etc.) that are similar in nature. The objective of association analysis is to find groups of events that frequently co-occur and therefore appear to be associated. The basic observations that are being analyzed in this problem setting consist of variable groups of events; for instance, transactions involving various products that are being bought by a customer at a certain moment in time. The aim of sequence analysis is similar to association analysis but concerns the detection of events that frequently occur sequentially, rather than simultaneously as in association analysis. As such, sequence analysis explicitly accounts for the time dimension. Table 1.3 provides simplified examples of datasets and analytical models for each type of descriptive analytics.
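To make the idea of frequently co-occurring events concrete, the following sketch evaluates the rule "If baby food And diapers Then beer" on the five example transactions shown in Table 1.3. Support and confidence are the usual measures for such rules; they are introduced here as an assumption, since the text above does not name them.

```python
# The five example transactions from Table 1.3.
baskets = [
    {"beer", "pizza", "diapers", "baby food"},       # T1
    {"coke", "beer", "diapers"},                     # T2
    {"crisps", "diapers", "baby food"},              # T3
    {"chocolates", "diapers", "pizza", "apples"},    # T4
    {"tomatoes", "water", "oranges", "beer"},        # T5
]


def support(itemset, baskets):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Among transactions containing the antecedent, the share that also
    contains the consequent."""
    antecedent = set(antecedent)
    antecedent_support = support(antecedent, baskets)
    if antecedent_support == 0:
        return 0.0
    return support(antecedent | set(consequent), baskets) / antecedent_support


rule_if = {"baby food", "diapers"}
rule_then = {"beer"}

print(support(rule_if | rule_then, baskets))    # 0.2 (only T1 has all three)
print(confidence(rule_if, rule_then, baskets))  # 0.5 (T1 out of T1 and T3)
```

With only five transactions these numbers are purely illustrative; in practice rules are mined from many transactions and filtered by minimum support and confidence thresholds.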
Table 1.3 Example Datasets and Descriptive Analytical Models
Clustering
Data:
ID | Recency | Frequency
C1 | 26 | 4.2
C2 | 37 | 2.1
C3 | 2 | 8.5
C4 | 18 | 6.2
C5 | 46 | 1.1
… | … | …
Descriptive analytical model: K-means clustering with K = 3
Association analysis
Data:
ID | Items
T1 | beer, pizza, diapers, baby food
T2 | coke, beer, diapers
T3 | crisps, diapers, baby food
T4 | chocolates, diapers, pizza, apples
T5 | tomatoes, water, oranges, beer
… | …
Descriptive analytical model: association rules
If baby food And diapers Then beer
If coke And pizza Then crisps
…
Sequence analysis
Data:
ID | Sequential items
C1 | <{3},{9}>
C2 | <{1 2},{3},{4 6 7}>
C3 | <{3 5 7}>
C4 | <{3},{4 7},{9}>
C5 | <{9}>
… | …
Descriptive analytical model: sequence rules: …
Note that Tables 1.1 through 1.3 identify and illustrate categories of approaches that are able to complete a specific task from a technical rather than an applied perspective. These different types of analytics can be applied in quite diverse business and nonbusiness settings and consequently lead to many specialized applications. For instance, predictive analytics and, more specifically, classification techniques may be applied for detecting fraudulent credit-card transactions, for predicting customer churn, for assessing loan applications, and so forth. From an application perspective, this leads to various groups of analytics such as, respectively, fraud analytics, customer or marketing analytics, and credit risk analytics. A wide range of business applications of analytics across industries and business departments is discussed in detail in Chapter 3.
With respect to Table 1.1, it needs to be noted that these different types of analytics apply to structured data. An example of a structured dataset is shown in Table 1.4. The rows in such a dataset are typically called observations, instances, records, or lines, and represent or collect information on basic entities such as customers, transactions, accounts, or citizens. The columns are typically referred to as (explanatory or predictor) variables, characteristics, attributes, predictors, inputs, dimensions, effects, or features. The columns contain information on a particular entity as represented by a row in the table. In Table 1.4, the second column represents the age of a customer, the third column the income, and so on. In this book we consistently use the terms observation and variable (and sometimes more specifically, explanatory, predictor, or target variable).
Table 1.4 Structured Dataset
Customer | Age | Income | Gender | Duration | Churn
John | 30 | 1,800 | Male | 620 | Yes
Sarah | 25 | 1,400 | Female | 12 | No
Sophie | 52 | 2,600 | Female | 830 | No
David | 42 | 2,200 | Male | 90 | Yes
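The row and column view of Table 1.4 maps directly onto a simple data structure. In the sketch below, each observation is one record and each variable is one field; the class name `Customer` and the helper `column` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """One observation (row) of Table 1.4; each field is a variable (column)."""
    name: str
    age: int
    income: int
    gender: str
    duration: int
    churn: bool  # target variable


dataset = [
    Customer("John", 30, 1800, "Male", 620, True),
    Customer("Sarah", 25, 1400, "Female", 12, False),
    Customer("Sophie", 52, 2600, "Female", 830, False),
    Customer("David", 42, 2200, "Male", 90, True),
]


def column(rows, variable):
    """Return the values of one variable across all observations."""
    return [getattr(row, variable) for row in rows]


mean_income = sum(column(dataset, "income")) / len(dataset)
churn_rate = sum(column(dataset, "churn")) / len(dataset)

print(column(dataset, "age"))  # [30, 25, 52, 42]
print(mean_income)             # 2000.0
print(churn_rate)              # 0.5
```

Because every observation has the same well-defined variables, selections and aggregations such as these can be written generically, which is what makes structured data comparatively easy to analyze.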
Because of the structure that is present in the dataset in Table 1.4 and the well-defined meaning of rows and columns, it is much easier to analyze such a structured dataset compared to analyzing unstructured data such as text, video, or networks, to name a few. Specialized techniques exist that facilitate analysis of unstructured data—for instance, text analytics with applications such as sentiment analysis, video analytics that can be applied for face recognition and incident detection, and network analytics with applications such as community mining and relational learning (see Chapter 2). Given the rough estimate that over 90% of all data are unstructured, clearly there is a large potential for these types of analytics to be applied in business.
However, analyzing unstructured data is inherently complex, and the development costs are often significant; these costs only appear to pay off in settings where the techniques add significantly to what the easier-to-apply structured analytics already deliver. As a result, we currently see relatively few such applications being developed and implemented in business. In this book, we therefore focus on analytics for analyzing structured data, and more specifically the subset listed in Table 1.1. For unstructured analytics, one may refer to the specialized literature (Elder IV and Thomas 2012; Chakraborty, Murali, and Satish 2013; Coussement 2014; Verbeke, Martens and Baesens 2014; Baesens, Van Vlasselaer, and Verbeke 2015).
The premise of this book is that analytics is to be adopted in business for better decision making—“better” meaning optimal in terms of maximizing the net profits, returns, payoff, or value resulting from the decisions that are made based on insights obtained from data by applying analytics. The incurred returns may stem from a gain in efficiency, lower costs or losses, and additional sales, among others. The decision level at which analytics is typically adopted is the operational level, where many customized decisions are to be made that are similar and granular in nature. High-level, ad hoc decision making at strategic and tactical levels in organizations also may benefit from analytics, but expectedly to a much lesser extent.
The decisions involved in developing a business strategy are highly complex in nature and do not match the elementary tasks listed in Table 1.1. A higher-level AI would be required for such purpose, which is not yet at our disposal. At the operational level, however, there are many simple decisions to be made, which exactly match the tasks listed in Table 1.1. This is not surprising, since these approaches have often been developed with a specific application in mind. In Table 1.5, we provide a selection of example applications, most of which will be elaborated on in detail in Chapter 3.
Table 1.5 Examples of Business Decisions Matching Analytics
Decision Making with Predictive Analytics
Classification
Credit officers have to screen loan applications and decide on whether to accept or reject an application based on the involved risk. Based on historical data on the performance of past loan applications, a classification model may learn to distinguish good from bad loan applications using a number of well-chosen characteristics of the application as well as of the applicant. Analytics and, more specifically, classification techniques allow us to optimize the loan-granting process by more accurately assessing risk and reducing bad loan losses (Van Gestel and Baesens 2009; Verbraken et al. 2014). Similar applications of decision making based on classification techniques, which are discussed in more detail in Chapter 3 of this book, include customer churn prediction, response modeling, and fraud detection.
Regression
Regression models allow us to estimate a continuous target value and in practice are being adopted, for instance, to estimate customer lifetime value. Having an indication of the future worth, in terms of revenues or profits, that a customer will generate is important to allow customization of marketing efforts, for pricing, etc. As is discussed in detail in Chapter 3, analyzing historical customer data allows estimating the future net value of current customers using a regression model. Similar applications involve loss given default modeling, as is discussed in Chapter 3, as well as the estimation of software development costs (Dejaeger et al. 2012).
Survival analysis
Survival analysis is being adopted in predictive maintenance applications for estimating when a machine component will fail. Such knowledge allows us to optimize decisions related to machine maintenance, for instance, to optimally plan when to replace a vital component. This decision requires striking a balance between the cost of machine failure during operations and the cost of the component, which is preferred to be operated as long as possible before replacing it (Widodo and Yang 2011). Alternative business applications of survival analysis involve the prediction of time to churn and time to default where, compared to classification, the focus is on predicting when the event will occur rather than whether the event will occur.
Forecasting
A typical application of forecasting involves demand forecasting, which allows us to optimize production planning and supply chain management decisions. For instance, a power supplier needs to be able to balance electricity production and demand by the consumers and for this purpose adopts forecasting or time-series modeling techniques. These approaches allow an accurate prediction of the short-term evolution of demand based on historical demand patterns (Hyndman et al. 2008).
Decision Making with Descriptive Analytics
Clustering
Clustering is applied in credit-card fraud detection to block suspicious transactions in real time or to select suspicious transactions for investigation in near-real time. Clustering facilitates automated decision making by comparing a new transaction to clusters or groups of historical nonfraudulent transactions and by labeling it as suspicious when it differs too much from these groups (Baesens et al. 2015). Clustering can also be used for identifying groups of similar customers, which facilitates the customization of marketing campaigns.
Association analysis Sequence analysis
Association analysis is often applied for detecting patterns within transactional data in terms of products that are often purchased together. Sequence analysis, on the other hand, allows the detection of which products are often bought subsequently. Knowledge of such associations allows smarter decisions to be made about which products to advertise, to bundle, to place together in a store, etc. (Agrawal and Srikant 1994).
Analytics facilitates optimization of the fine-grained decision-making activities listed in Table 1.5, leading to lower costs or losses and higher revenues and profits. The level of optimization depends on the accuracy and validity of the predictions, estimates, or patterns derived from the data. Additionally, as we stress in this book, the quality of data-driven decision making depends on the extent to which the actual use of the predictions, estimates, or patterns is accounted for in developing and applying analytical approaches. We argue that the actual goal, which in a business setting is to generate profits, should be central when applying analytics in order to further increase the return on analytics. For this, we need to adopt what we call profit-driven analytics. These are adapted techniques specifically configured for use in a business context.
The following example highlights the tangible difference between a statistical approach to analytics and a profit-driven approach. Table 1.5 already indicated the use of analytics and, more specifically, classification techniques for predicting which customers are about to churn. Having such knowledge allows us to decide which customers are to be targeted in a retention campaign, thereby increasing the efficiency and returns of that campaign when compared to randomly or intuitively selecting customers. By offering a financial incentive to customers that are likely to churn—for instance, a temporary reduction of the monthly fee—they may be retained. Actively retaining customers has been shown by various studies to be much cheaper than acquiring new customers to replace those who defect (Athanassopoulos 2000; Bhattacharya 1998).
It needs to be noted, however, that not every customer generates the same amount of revenues and therefore represents the same value to a company. Hence, it is much more important to detect churn for the most valuable customers. In a basic customer churn prediction setup, which adopts what we call a statistical perspective, no differentiation is made between high-value and low-value customers when learning a classification model to detect future churn. However, when analyzing data and learning a classification model, it should be taken into account that missing a high-value churner is much costlier than missing a low-value churner. The aim of this would be to steer or tune the resulting predictive model so it accounts for value, and consequently for its actual end-use in a business context.
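The difference between the two perspectives can be illustrated with a small numerical sketch. The customers, their values, and the two sets of model predictions below are hypothetical; the point is only that two models with the same number of misclassifications can differ widely in the customer value they fail to protect.

```python
# Hypothetical customers: (customer_id, customer_value, actually_churned)
customers = [
    ("A", 1000.0, True),   # high-value churner
    ("B", 50.0, True),     # low-value churner
    ("C", 300.0, False),
    ("D", 80.0, False),
]

# Customers flagged as churners by two hypothetical models.
model_1_flags = {"A"}  # misses the low-value churner B
model_2_flags = {"B"}  # misses the high-value churner A


def misclassified(customers, flagged):
    """Statistical view: count errors, every customer weighted equally."""
    return sum(1 for cid, _, churned in customers
               if (cid in flagged) != churned)


def value_of_missed_churners(customers, flagged):
    """Value-centric view: sum the value of churners the model failed to flag."""
    return sum(value for cid, value, churned in customers
               if churned and cid not in flagged)


for name, flags in [("model 1", model_1_flags), ("model 2", model_2_flags)]:
    print(name,
          misclassified(customers, flags),
          value_of_missed_churners(customers, flags))
# model 1 1 50.0
# model 2 1 1000.0
```

Both models make exactly one error, so a purely statistical measure cannot tell them apart, whereas the value at risk differs by a factor of twenty.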
An additional difference between the statistical and business perspectives toward adopting classification and regression modeling concerns the difference between, respectively, explaining and predicting (Breiman 2001; Shmueli and Koppius 2011). The aim of estimating a model may be either of these two goals:
To establish the relation or detect dependencies between characteristics or independent variables and an observed dependent (target) variable or outcome value.
To estimate or predict the unobserved or future value of the target variable as a function of the independent variables.
For instance, in a medical setting, the purpose of analyzing data may be to establish the impact of smoking behavior on the life expectancy of an individual. A regression model may be estimated that explains the observed age at death of a number of subjects in terms of characteristics such as gender and number of years that the subject smoked. Such a model will establish or quantify the impact or relation between each characteristic and the observed outcome, and allows for testing the statistical significance of the impact and measuring the uncertainty of the result (Cao 2016; Peto, Whitlock, and Jha 2010).
A clear distinction exists with estimating a regression model for, as an example, software effort prediction, as introduced in Table 1.5. In such applications where the aim is mainly to predict, essentially we are not interested in what drivers explain how much effort it will take to develop new software, although this may be a useful side result. Instead we mainly wish to predict as accurately as possible the effort that will be required for completing a project. Since the model's main use will be to produce an estimate allowing cost projection and planning, it is the exactness or accuracy of the prediction and the size of the errors that matters, rather than the exact relation between the effort and characteristics of the project.
Typically, in a business setting, the aim is to predict in order to facilitate improved or automated decision making. Explaining, as indicated for the case of software effort prediction, may have use as well since useful insights may be derived. For instance, from the predictive model, it may be found what the exact impact is of including more or fewer senior and junior programmers in a project team on the required effort to complete the project, allowing the team composition to be optimized as a function of project characteristics.
In this book, several versatile and powerful profit-driven approaches are discussed. These approaches facilitate the adoption of a value-centric business perspective toward analytics in order to boost the returns. Table 1.6 provides an overview of the structure of the book. First, we lay the foundation by providing a general introduction to analytics in Chapter 2, and by discussing the most important and popular business applications in detail in Chapter 3.
Table 1.6 Outline of the Book
Book Structure
Chapter 1: A Value-Centric Perspective Towards Analytics
Chapter 2: Analytical Techniques
Chapter 3: Business Applications
Chapter 4: Uplift Modeling
Chapter 5: Profit-Driven Analytical Techniques
Chapter 6: Profit-Driven Model Evaluation and Implementation
Chapter 7: Economic Impact
Chapter 4 discusses approaches toward uplift modeling, which in essence is about distilling or estimating the net effect of a decision and then contrasting the expected result for alternative scenarios. This allows, for instance, the optimization of marketing efforts by customizing the contact channel and the format of the incentive so that the response to the campaign is maximal in terms of the returns generated. Standard analytical approaches may be adopted to develop uplift models. However, specialized approaches tuned toward the particular problem characteristics of uplift modeling have also been developed, and they are discussed in Chapter 4.
As such, Chapter 4 forms a bridge to Chapter 5 of the book, which concentrates on various advanced analytical approaches that can be adopted for developing profit-driven models by allowing us to account for profit when learning or applying a predictive or descriptive model. Profit-driven predictive analytics for classification and regression are discussed in the first part of Chapter 5, whereas the second part focuses on descriptive analytics and introduces profit-oriented segmentation and association analysis.
Chapter 6 subsequently focuses on approaches that are tuned toward a business-oriented evaluation of predictive models, for example, in terms of profits. Note that traditional statistical measures, when applied to customer churn prediction models, for instance, do not differentiate among incorrectly predicted or classified customers, whereas it definitely makes sense from a business point of view to account for the value of the customers when evaluating a model. For instance, failing to detect a high-value customer who is about to churn represents a higher loss or cost than failing to detect a low-value customer who is about to churn. Both, however, are accounted for equally by nonbusiness and, more specifically, non-profit-oriented evaluation measures. Both Chapters 4 and 6 allow using standard analytical approaches as discussed in Chapter 2, with the aim to maximize profitability by adopting, respectively, a profit-centric setup or profit-driven evaluation. The particular business application of the model will appear to be an important factor to account for in maximizing profitability.
Finally, Chapter 7 concludes the book by adopting a broader perspective toward the use of analytics in an organization by looking into the economic impact, as well as by zooming into some practical concerns related to the development, implementation, and operation of analytics within an organization.
Figure 1.1 provides a high-level overview of the analytics process model (Hand, Mannila, and Smyth 2001; Tan, Steinbach, and Kumar 2005; Han and Kamber 2011; Baesens 2014). This model defines the subsequent steps in the development, implementation, and operation of analytics within an organization.
Figure 1.1 The analytics process model (Baesens 2014).
As a first step, a thorough definition of the business problem to be addressed is needed. The objective of applying analytics needs to be unambiguously defined. Some examples are: customer segmentation of a mortgage portfolio, retention modeling for a postpaid Telco subscription, or fraud detection for credit cards. Defining the perimeter of the analytical modeling exercise requires a close collaboration between the data scientists and business experts. Both parties need to agree on a set of key concepts; these may include how we define a customer, transaction, churn, or fraud. Whereas this may seem self-evident, it appears to be a crucial success factor to make sure a common understanding of the goal and some key concepts is agreed on by all involved stakeholders.
Next, all source data that could be of potential interest need to be identified. This is a very important step as data are the key ingredient to any analytical exercise and the selection of data will have a decisive impact on the analytical models that will be built in a subsequent step. The golden rule here is: the more data, the better! The analytical model itself will later decide which data are relevant and which are not for the task at hand. All data will then be gathered and consolidated in a staging area which could be, for example, a data warehouse, data mart, or even a simple spreadsheet file. Some basic exploratory data analysis can then be considered using, for instance, OLAP facilities for multidimensional analysis (e.g., roll-up, drill down, slicing and dicing). This will be followed by a data-cleaning step to get rid of all inconsistencies such as missing values, outliers, and duplicate data. Additional transformations may also be considered such as binning, alphanumeric to numeric coding, and geographical aggregation, to name a few, as well as deriving additional characteristics, typically called features, from the raw data. A simple example concerns the derivation of the age from the birth date; yet more complex examples are provided in Chapter 3.
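The cleaning and transformation steps described above can be sketched as follows. The records, the reference date, the age bins, and the choice to replace a missing income by the median are all assumptions made for this illustration; other treatments of missing values are equally possible.

```python
from datetime import date
from statistics import median

# Hypothetical raw records with a duplicate and a missing income value.
raw = [
    {"id": "C1", "birth_date": date(1990, 5, 17), "income": 1800},
    {"id": "C2", "birth_date": date(1975, 11, 2), "income": None},
    {"id": "C1", "birth_date": date(1990, 5, 17), "income": 1800},  # duplicate
    {"id": "C3", "birth_date": date(2001, 1, 30), "income": 2600},
]
REFERENCE_DATE = date(2024, 6, 1)  # assumed date of analysis


def drop_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = (row["id"], row["birth_date"], row["income"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_income(rows):
    """Replace missing incomes by the median of the known incomes."""
    fill = median(r["income"] for r in rows if r["income"] is not None)
    return [dict(r, income=fill if r["income"] is None else r["income"])
            for r in rows]


def age_on(birth, reference):
    """Derive the age feature (in whole years) from the birth date."""
    had_birthday = (reference.month, reference.day) >= (birth.month, birth.day)
    return reference.year - birth.year - (0 if had_birthday else 1)


def age_bin(age):
    """Bin the age into three assumed categories."""
    if age < 30:
        return "<30"
    if age < 50:
        return "30-49"
    return "50+"


cleaned = impute_income(drop_duplicates(raw))
for row in cleaned:
    row["age"] = age_on(row["birth_date"], REFERENCE_DATE)
    row["age_bin"] = age_bin(row["age"])

print([(r["id"], r["income"], r["age"], r["age_bin"]) for r in cleaned])
# [('C1', 1800, 34, '30-49'), ('C2', 2200.0, 48, '30-49'), ('C3', 2600, 23, '<30')]
```

Each step returns a new list, so the raw data stay untouched and individual steps can be rerun when the process iterates.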
In the analytics step, an analytical model will be estimated on the preprocessed and transformed data. Depending on the business objective and the exact task at hand, a particular analytical technique will be selected and implemented by the data scientist. In Table 1.1, an overview was provided of various tasks and types of analytics. Alternatively, one may consider the various types of analytics listed in Table 1.1 to be the basic building blocks or solution components that a data scientist employs to solve the problem at hand. In other words, the business problem needs to be reformulated in terms of the available tools enumerated in Table 1.1.
Finally, once the results are obtained, they will be interpreted and evaluated by the business experts. Results may be clusters, rules, patterns, or relations, among others, all of which will be called analytical models resulting from applying analytics. Trivial patterns (e.g., an association rule is found stating that spaghetti and spaghetti sauce are often purchased together) that may be detected by the analytical model are interesting as they help to validate the model. But of course, the key issue is to find the unknown yet interesting and actionable patterns (sometimes also referred to as knowledge diamonds) that can provide new insights into your data that can then be translated into new profit opportunities. Before putting the resulting model or patterns into operation, an important evaluation step is to consider the actual returns or profits that will be generated, and to compare these to a relevant base scenario such as a do-nothing decision or a change-nothing decision. The next section provides an overview of various evaluation criteria that can be used to validate analytical models.
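The comparison against a base scenario can be sketched for a retention campaign. All figures below (campaign size, churn rates, retention success rate, customer value, incentive cost) are hypothetical, and the profit formula is a deliberately simplified assumption: every targeted customer receives the incentive, and only retained churners generate a benefit.

```python
def campaign_profit(n_targeted, churn_rate, retention_rate,
                    customer_value, incentive_cost):
    """Expected campaign profit relative to doing nothing (profit 0)."""
    retained = n_targeted * churn_rate * retention_rate
    return retained * customer_value - n_targeted * incentive_cost


# Do-nothing decision: no campaign, no incremental profit.
do_nothing = 0.0

# Change-nothing decision (assumed): keep targeting customers at random,
# so the churn rate among targeted customers equals the overall 5%.
random_targeting = campaign_profit(1000, 0.05, 0.3, 500.0, 10.0)

# Model-based targeting (assumed): 20% of the targeted customers are churners.
model_targeting = campaign_profit(1000, 0.20, 0.3, 500.0, 10.0)

print(round(random_targeting, 2))  # -2500.0
print(round(model_targeting, 2))   # 20000.0
print(model_targeting > max(do_nothing, random_targeting))  # True
```

Under these assumed figures, the model is only worth putting into operation because its expected profit exceeds both base scenarios; with a lower churn rate among the targeted customers, doing nothing could be the better decision.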
Once the analytical model has been appropriately validated and approved, it can be put into production as an analytics application (e.g., decision support system, scoring engine). Important considerations here are how to represent the model output in a user-friendly way, how to integrate it with other applications (e.g., marketing campaign management tools, risk engines), and how to make sure the analytical model can be appropriately monitored and backtested on an ongoing basis.
It is important to note that the process model outlined in Figure 1.1 is iterative in nature in the sense that one may have to return to previous steps during the exercise. For instance, during the analytics step, a need for additional data may be identified that will necessitate additional data selection, cleaning, and transformation. The most time-consuming step typically is the data selection and preprocessing step, which usually takes around 80% of the total efforts needed to build an analytical model.
Before adopting an analytical model and making operational decisions based on the obtained clusters, rules, patterns, relations, or predictions, the model needs to be thoroughly evaluated. Depending on the exact type of output, the setting or business environment, and the particular usage characteristics, different aspects may need to be assessed during evaluation in order to ensure the model is acceptable for implementation.
A number of key characteristics of successful analytical models are defined and explained in Table 1.7. These broadly defined evaluation criteria may or may not apply, depending on the exact application setting, and will have to be further specified in practice.
Table 1.7 Key Characteristics of Successful Business Analytics Models
Accuracy
Refers to the predictive power or the correctness of the analytical model. Several statistical evaluation criteria exist and may be applied to assess this aspect, such as the hit rate, lift curve, or AUC. A number of profit-driven evaluation measures will be discussed in detail in Chapter 6. Accuracy may also refer to statistical significance, meaning that the patterns that have been found in the data have to be real, robust, and not the consequence of coincidence. In other words, we need to make sure that the model generalizes well (to other entities, to the future, etc.) and is not overfitted to the historical dataset that was used for deriving or estimating the model.
Interpretability
When a deeper understanding of the retrieved patterns is required, for instance, to validate the model before it is adopted for use, a model needs to be interpretable. This aspect involves a certain degree of subjectivism, since interpretability may depend on the user's knowledge or skills. The interpretability of a model depends on its format, which, in turn, is determined by the adopted analytical technique. Models that allow the user to understand the underlying reasons as to why the model arrives at a certain result are called white-box models, whereas complex incomprehensible mathematical models are often referred to as black-box models. White-box approaches include, for instance, decision trees and linear regression models, examples of which have been provided in Table 1.2. A typical example of a black-box approach concerns neural networks, which are discussed in Chapter 2.
