CLEAR AND CONCISE TECHNIQUES FOR USING ANALYTICS TO DELIVER BUSINESS IMPACT AT ANY ORGANIZATION
Organizations have more data at their fingertips than ever, and their ability to put that data to productive use should be a key source of sustainable competitive advantage. Yet, business leaders looking to tap into a steady and manageable stream of “actionable insights” often, instead, get blasted with a deluge of dashboards, chart-filled slide decks, and opaque machine learning jargon that leaves them asking, “So what?”
Analytics the Right Way is a guide for these leaders. It provides a clear and practical approach to putting analytics to productive use with a three-part framework that brings together the realities of the modern business environment with the deep truths underpinning statistics, computer science, machine learning, and artificial intelligence. The result: a pragmatic and actionable guide for delivering clarity, order, and business impact to an organization’s use of data and analytics.
The book uses a combination of real-world examples from the authors’ direct experiences—working inside organizations, as external consultants, and as educators—mixed with vivid hypotheticals and illustrations—little green aliens, petty criminals with an affinity for ice cream, skydiving without parachutes, and more—to empower the reader to put foundational analytical and statistical concepts to effective use in a business context.
CHAPTER 3
Figure 3.1 Revenue results after a price drop
Figure 3.2 A counterfactual: what revenue would have been if there had been no price drop...
Figure 3.3 The corrected counterfactual: what revenue actually would have been if there h...
Figure 3.4 The distribution over expected counterfactuals vs. the actual results (no caus...
Figure 3.5 The distribution over expected counterfactuals vs. the actual results (had a c...
CHAPTER 4
Figure 4.1 There are three fundamentally different ways businesses use data
Figure 4.2 The interconnected nature of performance measurement and hypothesis validation...
Figure 4.3 The interconnected nature of operational enablement with performance measureme...
CHAPTER 5
Figure 5.1 A marketing campaign result without context
Figure 5.2 The natural question asked in the absence of context: “Is that result g...
Figure 5.3 “Good” is determined based on expectations
Figure 5.4 Expectations must get set during planning
Figure 5.5 The performance measurement time machine
Figure 5.6 The range of possible qualified leads resulting from a campaign
Figure 5.7 An absurdly wide range of possible results
Figure 5.8 Work up from the lower bound
Figure 5.9 Work down from the upper bound
Figure 5.10 The bracket is now a “target range”
CHAPTER 8
Figure 8.1 Social media engagements and orders over time
Figure 8.2 A scatterplot of social media engagements and orders
Figure 8.3 Scatterplot of the day-to-day change in social media engagements and the day-t...
Figure 8.4 Employee satisfaction based on how many vacation days have been taken
Figure 8.5 The actual data used to generate the line
Figure 8.6 Vacation days and employee satisfaction, zoomed in
CHAPTER 9
Figure 9.1 Graph of items related to purchase conversion
Figure 9.2 Adding arrows to our graph
Figure 9.3 The potentially confounded relationship
CHAPTER 11
Figure 11.1 Performance measurement triggering hypothesis validation
Figure 11.2 Hypothesis validation triggering performance measurement
Figure 11.3 Hypothesis validation triggering operational enablement
Figure 11.4 Performance measurement in support of operational enablement
CHAPTER 6
Table 6.1 An Example of a Hypothesis Library
CHAPTER 7
Table 7.1 The Causal Evidence Ladder and When It’s Needed
TIM WILSON
JOE SUTHERLAND
Copyright © 2025 John Wiley & Sons, Inc. All rights reserved, including rights for text and data mining and training of artificial intelligence technologies or similar technologies.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
ISBN: 978-1-394-26449-0 (paperback)
ISBN: 978-1-394-26451-3 (ePDF)
ISBN: 978-1-394-26450-6 (epub)
LCCN: 2024946991
Cover Image: © Paul M. Lyren
Cover Design: Wiley
To Julie and Sarah
Together, we’d like to thank our publishing and editorial team at Wiley—Jim Minatel, Cathleen Small, Pete Gaughan, and Sara Deichman—without whom this book wouldn’t have been possible. Paul Lyren provided both comical wit and a talented hand in illustrating our book, taking our goofy ideas of what passed for “comedy” and making them actually funny; the herculean task of illustrating no fewer than 11 chapters’ worth of content about analytics—yuck!—is a feat, and we deeply appreciate him doing so. Mike Gustafson and our many colleagues at Search Discovery deserve credit for bringing the two of us together and for supporting our harebrained ideas, for whatever reason—it was Mike’s light touch that led us to build a friendship and a point of view on how the analytics business is and should be.
Tim: I would like to thank everyone from whom I learned something that worked its way into the ideas and examples in this book. But then the editors said that would be several orders of magnitude beyond the appropriate word count, so this list is tragically abbreviated. Matt Gershoff is quoted in the book, and I’ve learned that “minimization of regret” means jumping at every opportunity to have a drink, a meal, or a walk with him—be the topic reinforcement learning, decision-making at the margin, the cost of reducing uncertainty, or the relative merits of European versus American restaurant payment systems, it’s always a delight. Another Matt—Matt Coen—is responsible for giving me the language of “the two magic questions.” I was loosely applying the idea, but it was during our relatively brief work together that he articulated them (it was purely my cheekiness that then branded them as “magic”). John Lovett—a coworker twice over—provided an on-the-job master class in how to listen to business partners and clients effectively, and he has provided encouragement and advice ever since this book was the faintest of glimmers in my eye back in 2017. Val Kroll and I have been collaborating for more than a decade, although we didn’t realize we were codeveloping a shared point of view until we became coworkers several years ago, and she’s now my cofounder (and podcast cohost), so that collaboration continues. Alison Ferguson, Jim Sterne, Eric Peterson, and Matty Wishnow all took leaps of faith (twice in Matty’s case!) that moved my career and my professional growth forward—giving me the opportunity to develop and practice many of the ideas included in this book. The on-stage practice of those ideas—at industry conferences and in meetings with clients—went much better than they could have thanks to concepts and techniques I learned from Lea Pica, and her friendship and encouragement to tackle the authorship of this book at all was a valuable boost.
The Analytics Power Hour is a podcast that has given me more opportunities to have deep discussions with smart people—the cohosts and the guests—for a decade (and counting) as of the publication of this book, and I don’t think the book would have been possible without those discussions. I’m indebted to Jim “Analysts Don’t Do Anything” Cain for being the spark that got that rolling. Michael Helbing—the smoothest voice in explicit analytics podcasts—has been my cohost, my mentor, my manager, and my sounding board and has calmly talked me down from countless temper tantrums in all of those roles, and, most importantly, he is my dear friend. Moe Kiss joined the show near the start, and her brilliant perspective, friendship, and willingness to hash out just about any topic on or off the mic have made me a better analyst and a better consultant. Julie Hoyer was a coworker long before she became a cohost; her ability to nail the weaknesses in any idea, to build on any idea that has some underlying merit, and to do so instantaneously terrifies me just a little bit, and she contributed directly to expanding and solidifying several of the techniques described in the book.
Joe: I’d like to thank my family and friends for providing ample leeway and grace during the work on this manuscript. The late nights, rain checks, and vacation days spent working on this book can add up, and I appreciate their kindness and support during the process. My wife, Sarah, deserves special appreciation for giving me the space to peck away at the manuscript while we gleefully anticipated the arrival of our son.
Some of the themes in this book, which fed into the mind meld Tim and I achieved in writing it and which wouldn’t have been developed if not for our partnership, were the culmination of practical ideas from my two-decade journey through software engineering, scientific inference, and executive service. It was a public library in my youth (and its staff and patrons) that I must thank for setting me on a path to fully learn and appreciate the power of engineering, machine learning, and artificial intelligence in an applied business context (it all started with a book on How to Learn BASIC when I was a kid). My Washington University in St. Louis, Columbia, Princeton, Johns Hopkins, and Emory colleagues helped me develop entirely new perspectives through which to view data, analytics, and business. The ideas of counterfactual reasoning and evidentiary weight owe much to my work with, and learning from, Don Green, Andy Gelman, Greg Wawro, Bob Shapiro, and Suresh Naidu. I credit Jon Rogowski, Andrew Reeves, and Dan Butler for launching me on my academic odyssey, which has proven enjoyable and rewarding.
It is impossible to recount everyone who, in the course of my career, inspired and guided the ideas reflected in this work; we stand on the shoulders of giants. Evan Schnidman and Bill Macmillan encouraged me to use data and analysis in what was an entirely new way at the time: to forecast Federal Reserve policymaking from unstructured texts. My colleagues and clients at Peachtree AI, Search Discovery, Cisco, and Amazon lent me the experiences needed to operationally enable these ideas in the business context.
Finally, thank you to Emory University, the Department of Quantitative Theory and Methods, the Center for AI Learning, the Weidenbaum Center, and the Woodruff Library for enabling me to work on this book.
Tim started his career in architecture, pivoted quickly to technical writing, and then found himself working in marketing communications just as the internet was starting to become a thing that businesses took seriously in the early 2000s as a means of finding, acquiring, and keeping customers. In retrospect, he realizes he was in the right place at the right time to have this weird and wonderful thing called “web analytics” land in his lap—a primitive but useful set of technologies for collecting and analyzing the behavior of visitors to the website of the high-tech B2B company where he was working at the time. He went on to head up the business intelligence department at that same company before stepping into the agency and consulting world: creating and growing the analytics practices at three different agencies that worked with a range of large consumer brands; consulting with the analytics teams at various Fortune 500 companies on their strategies, processes, and tactics for effectively putting analytics to actionable use; and then cofounding a consultancy, facts & feelings (factsandfeelings.io), dedicated to helping organizations productively employ research, analytics, and experimentation to drive growth. Tim is a long-time creator of pragmatic content for analysts and marketers, including, in 2008, cofounding the still-running Data and Analytics Wednesday monthly meetup in Columbus, Ohio, and cohosting the biweekly Analytics Power Hour podcast (analyticshour.io) since 2015. He has been a regular conference speaker across multiple continents on a wide range of topics: data visualization, data storytelling, the R programming language, marketing and digital analytics, and, of course, many of the concepts and techniques addressed in this book. While Tim is physically based in Columbus, his heart and soul maintain joint custody with Austin, Texas. He holds a bachelor of science from the Massachusetts Institute of Technology, an MBA from the University of Texas at Austin, and a Certificate of Amazement from Joe that a hillbilly raised in Sour Lake, Texas, can, indeed, learn the fundamentals of causal inference.
With a career spanning the White House, the Dow Jones 30, and America’s top universities, Dr. Joe Sutherland has served as an executive, public service leader, and educator. Sutherland is the founding director of the Emory Center for AI Learning, where he serves as lead principal investigator of the Emory branch of the US AI Safety Institute Consortium, associate faculty of the Empathetic AI for Health Institute at Emory Healthcare, and fellow of the Weidenbaum Center on the Economy, Public Policy, and Government at Washington University in St. Louis. He is a professor in Emory’s Department of Quantitative Theory & Methods. Sutherland’s professional experience spans public service in the White House, technology entrepreneurship, executive roles including as chief executive officer of an AI company and at Amazon and Cisco, and academic positions at Columbia, Johns Hopkins, and Princeton. Sutherland founded two startups that were later acquired: Peachtree AI, a professional services firm specializing in artificial intelligence integrations, and Prattle, a fintech company that uses natural language processing to forecast both the Federal Reserve’s monetary policy decisions and the performance of publicly traded companies. From 2011 to 2013, he served in the White House Office of Scheduling and Advance for President Barack Obama, traveling with the president in support of various policy initiatives.
His research exploring the use of machine learning and AI across a wide variety of topics is published in top peer-reviewed journals, and his work has been featured on FOX 5’s Good Day Atlanta and in the Atlanta Journal-Constitution, Forbes, Georgia Trend, Government Technology, MIT Sloan Management Review, and many other venues. In 2017, the National Science Foundation recognized his work in state politics and policy with an Honorable Mention, considered a national honor. Sutherland earned his PhD, MPhil, and master’s degrees in political science from Columbia University and his bachelor’s degree in political science from Washington University in St. Louis. He lives in Historic Brookhaven, Atlanta, Georgia, with his family, where he enjoys playing golf and tennis.
Scientists say that gathering data used to be like a walk in the desert. To get any water, you would have to hang out a giant tarp and wait for a small drop of rain to fall on it, only to hope it would travel down into a thermos to quench your thirst. Today, gathering data is easy. You walk out the door and are deluged by a torrential stream of bits.
Gathering reams of data, storing them, and accessing them have become easier and easier, but organizations are regularly frustrated that they’re not seeing a steady and voluminous flow of actionable information as a result. At the root of these frustrations is a series of ideas about “data”—how they work and what they can do—that are, frankly, incorrect.
Getting data does not mean you get insight.
To begin to understand why, consider the journey we’ve been on for the past few decades. Many of us were stranded in that metaphorical desert. A drop of water in the desert is precious, and so it is used meticulously. With such limited water, you’re careful to use it in exactly the way you need to achieve your goals. When you’re flooded with cheap and easy water in the suburbs, you water your lawn until it’s practically submerged. If you had lived through the desert and found your way to suburban utopia, you would take pride in your reckless abandon.
Generic data are cheap and omnipresent today, and those of us who lived through the data revolution can tend to, undeservedly, feel like royalty.1 But even for those who didn’t go from the desert to the suburbs, the trappings of this royal feeling have captured the majority of professionals. And it is, in part, this feeling that our data today are an unexpected bounty of pure “gold” in and of themselves that leads to misconceptions that hurt our ability to use data productively.
Metaphorical comparisons like “data is the new oil” are trite, and perhaps not fully informative. Oil is a high-priced commodity, and finding it on your property can feel like you have found gold in your backyard. But oil requires refinement before it can generate profits. Without refinement, oil is simply a sticky dark substance found on your property. Data work the same way: you have to refine them before they produce profitable value. The difference is that, without refinement, data aren’t worth the bits and bytes they’re stored on.
Another difference is that oil’s outputs are predictable, and therefore any oil from anywhere can produce a profitable output. Although machine learning is often treated as though it can magically convert commodified data into useful insights and outputs, the missing ingredient is usually an appropriate theory for how the data were generated and how they should be used. Whereas with oil the outputs are fairly constrained and replicable without additional knowledge (notwithstanding the plastics industry), with data the outputs are nearly infinite. When something is infinite in nature, you have to know where to look. That is what necessitates this missing ingredient of theory, and you can inject it through hypothesis validation, a topic into which we will drill deeply starting in Chapter 6.
In this chapter, we argue that the mental models and analytical techniques required to process data into good decisions have been ignored in favor of a much larger, more lucrative enterprise: the collection of data. We have become hooked on the promise that the data bring. We mine for data in our backyards just as we once rooted around for oil, but we’ve forgotten that oil, too, before refinement, is a boring, not-very-useful goo.
Data are not valuable in and of themselves. You have to interact with data in a particular way to get any insight from them. You have to know what you’d like to learn, and then you have to know the analysis technique to apply in order to learn it.
In an age of artificial intelligence (AI) and machine learning, it’s commonplace to assume that simply offering our data for consumption by the analytics techniques and AI tools we now have available will “generate insights.” This idea has even made its way into our popular culture: anyone who’s seen the android AI character named “Data” on Star Trek knows that the crew rely on him to autonomously perform insightful analysis at the drop of a hat. But in truth, even these tools are useless without a good idea to test.
Our society—including “the industry,” broadly—has misidentified “the data” as an end in itself. We are not sure exactly why, and this book certainly does not suffice as a comprehensive research report on the topic, but we have identified four misconceptions underpinning the data and analytics zeitgeist that have put us in this situation. In fact, this idea that data are the be-all and end-all is driven by forces that are sneakily powerful (hard to combat) and well-intended (the people behind these forces are misguided rather than malevolent) but, arguably, quite destructive.
Consider the following four statements about data:
With enough data, uncertainty can be eliminated.
Data must be comprehensive to be useful.
Data are inherently objective and unbiased.
Democratizing access to data makes an organization data-driven.
How many of these four statements do you agree with—completely or in part? In the remainder of this chapter, we will explore why each of them is a dangerous misconception.
Most of the world’s processes are inherently random. When a process is random, the outcome cannot be determined with exact certainty, and there’s absolutely nothing anyone or anything can do about it. In life, we want to be certain we’re going to win—or at the very least, be certain that we won’t lose. Randomness is the root cause of much of life’s most intense discomfort because it makes almost everything, to some degree, uncertain. After all, we would never rationally choose to put ourselves in a losing situation.
We deal with uncertainty by gathering information. For example, one of the authors was uncertain if his wife would like the purse he bought for her as a holiday gift. If she didn’t, it would cause him discomfort because she would feel obligated to like it and not return it, but she would never use it. He knew he would have to gaze upon his failure daily—a lonely and unused accessory permanently hanging from a hook in the closet. So, he asked his mother-in-law for an indication as to whether his wife would like it or not. When she replied, “She’s been telling me for a year she wants that one,” it reduced his uncertainty about his wife’s reception of the gift. The author gathered information to reduce his uncertainty.
To executives, having lots of data available is even better than having your mother-in-law on speed-dial.2 With data, they can reduce their uncertainty without even having to pick up the phone! In fact, search advertising, which is today one of the most lucrative advertising strategies, works exactly like a robot mother-in-law. You don’t need to wonder if a user wants a Dior handbag if that user types, “buy black Dior handbag” into the search bar.
As the volume of data available to businesses has exponentially increased over the past few decades, technology vendors in the business of helping other companies collect or manage data are incentivized to promote the idea that, given sufficient data, advanced analytics and predictive models can deliver “the truth” without ever having to phone anyone. While business leaders may have some intuition that the data will never turn them into truly omniscient beings, they also generally accept as a fact that “more data is better.”
The reality is that, while data play a key role in reducing uncertainty:
having a large amount of data doesn’t mean you have any information that reduces uncertainty regarding the decision at hand;
even with the best information, the natural law of uncertainty still governs;
data can cost more than the benefit you get from collecting it; and
it is impossible to actually collect and use “all” of the data.
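To make the first two of these points concrete, here is a minimal simulation in Python. The scenario is entirely invented for illustration (a hypothetical “true” conversion rate of 3 percent): piling up more observations shrinks the error in an estimated average, but the outcome of the next individual event stays just as uncertain as ever.

```python
import random

random.seed(42)

TRUE_RATE = 0.03  # a hypothetical "true" conversion rate, unknowable in practice

for n in (1_000, 100_000, 1_000_000):
    conversions = sum(random.random() < TRUE_RATE for _ in range(n))
    estimate = conversions / n
    # More observations tighten the estimate of the average...
    print(f"n = {n:>9,}  estimate = {estimate:.5f}  error = {abs(estimate - TRUE_RATE):.5f}")

# ...but whether the *next* visitor converts remains a 3-in-100 proposition,
# no matter how much history has been collected.
```

The estimation error shrinks as the sample grows, yet nothing in the growing pile of data changes the randomness of any single future outcome. That is the natural law of uncertainty at work.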
With even a relatively small volume of raw data, the outputs are infinite—data can be combined, added, multiplied, and extrapolated in countless ways.3 That means, in effect, that without an idea of where you are trying to go with the data, no amount of data, however tremendous, can help you get there.
Consider the Library of Babel, which Jorge Luis Borges introduced in 1941. This fictional library contains a seemingly infinite number of books. Each book was written by a computer that was trained to scramble all of the letters of the alphabet into every one of its possible permutations. So, for the first book, the computer would take all of the letters of the alphabet, shuffle their order—including the As, the Bs, the spaces, the punctuation, etc.—and it would write 300 pages of that, bind the pages together, and then put the book on the shelf of the library. The computer continues to do this until every possible string of letters, spaces, and punctuation has been represented in the library.
In this library exists every possible idea any person ever has or will have. That means that the library contains the patent for Windows 95. The library contains the greatest literary work ever to be written, from a future century, that has not yet come to pass. The library contains, if it is possible, the technology required to travel through outer space at a speed faster than light. The library would even contain a history of your life’s past, present, and future—enabling you to live just as Biff Tannen did in Back to the Future, winning horse bets and conducting stock trades to amass the largest fortune ever known. By scrambling these letters, we have generated every solution to any problem that humanity has ever faced or will ever face. There is the ultimate level of certainty in this library of all possible iterations of knowledge.
The problem is, how do you find anything in this library? There are simply too many books available to know where to look. Most of the books in the library are complete gibberish, and many of the books, while coherent sets of words, are fiction rather than fact (the tale of a small online bookseller called Amazon that went bankrupt and ceased to exist after a mere 2 years in business). And the process that you would use to search through the books, or order them, would require you to have read all of them in the first place. Ironically, having access to the ultimate level of information means that now you have absolutely no certainty about where to look for it.
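A back-of-the-envelope calculation shows why browsing is hopeless. The sizing below is our own assumption (roughly 30 distinct symbols, and 300 pages of about 3,000 characters each), not a specification from Borges:

```python
import math

SYMBOLS = 30            # letters, the space, and some punctuation (an assumption)
CHARS_PER_PAGE = 3_000  # also an assumption
PAGES = 300             # per the description above

chars_per_book = CHARS_PER_PAGE * PAGES             # 900,000 characters per book
log10_books = chars_per_book * math.log10(SYMBOLS)  # log10 of the number of books

print(f"Possible books ~= 10^{log10_books:,.0f}")   # about 10^1,329,409
```

For comparison, the observable universe contains roughly 10^80 atoms. No strategy that starts with “read the books” can make a dent in a collection like this; you need an idea of what you are looking for before the library is worth anything.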
Perhaps one solution is to have an artificial intelligence tool based on a large language model (LLM) framework review all of the materials and find the books that seem most promising. However, in order to know what is most promising, the tool has to know what ideas it should be looking for. The problem is that LLMs are only familiar with the ideas that we have available today and how influential they have been up until now. They do not know what the influential ideas of the future will be. Well, dang!
Let’s say we go one step further. We give the LLM tool the ability to implement the hypothesis validation techniques we discuss in this book (see Chapter 6), and we give it the authority to start testing the ideas it finds to see if they have any value or merit. The amount of time it would take to test every possible idea it finds, in the order in which it finds them, is still in excess of the amount of time we would have available in a million lifetimes.
Therein lies the problem with data. Because data can be refined into an infinite number of applications, we have to have an idea of where to look for those applications before we can successfully see value from them. We have to bring ideas about where we should be looking in order to advance our ability to apply data. It is not the fact that we have data available in and of itself that creates insight; rather, it is the application of the data to the right idea at the right time that generates our ability to create history through the actions we take. We are, in this sense, charting our own path forward by deciding what to test and act upon, even if the future history of the world is entirely predetermined, as some people believe.
Let’s make this point more specific. Imagine that we come across a gentleman looking for his keys in the street. We say, “Can I help you? Where do you think you’ve lost them?”
He replies, “I think I lost them over there near the grass, not here.”
We ask, “Why are you looking over here if you lost them over there?”
And he replies, “Well, this is where the streetlight is, so I can see what I’m looking for.”
Because the streetlight is setting the agenda for where he should look for the thing he wants to find, he will, by definition, never find it. Uh…sound familiar?
We see this a lot in business. New questions that are valuable to the actions the business needs to take often can’t be answered with the data at hand. And if we try to answer them with the data that we “found in our backyard,” then we will always get the wrong answer. Just because data are available doesn’t mean they are the right data for the job.
For instance, consider one of our clients: an insurance provider that covers frontline professionals for medical problems and catastrophic occurrences endured in the line of duty, like exposure to COVID-19. The insurance provider wanted to release a new insurance product that would change the price of the insurance premium based on the lifestyle characteristics of the professional seeking insurance. They gave us a data set, including factors such as whether the professional smoked, whether the professional had a pre-existing heart condition, etc., and asked us to predict whether these factors changed the risk the company would assume by writing a policy for the professional. So, in theory, a professional who does not smoke (smoking being a factor known to be correlated with cancer diagnoses) could be given a cheaper premium than one who does, because insuring them creates less risk for the company.
And so the first question we asked was, “What do you mean by risk? Is risk the probability that you have to pay out a claim? Is it the amount of the claims paid? Is it the probability that the person is catastrophically harmed? Is it the probability that the person will be readmitted to the hospital several times after a single admission?” There are several outcome factors used to quantify risk in the insurance industry.
The client responded that, in an ideal world, the risk we compute would be related to the total amount of claims that would need to be paid out over the lifetime of the insured professional. But then the client pointed out that they did not have this data available. They only had, on short notice, data related to the satisfaction of the insured professionals with the services that the company provided. Couldn’t we just use that?4
Well, no. You can’t look for an answer in the place where you already have the data, just because that’s where the data already are. And any analysis using that as the outcome variable would not ultimately produce something that would help the client actually reduce their uncertainty with respect to the insurance policies.
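To see how badly a convenient proxy can mislead, here is a small simulation (the numbers are entirely fabricated for illustration, not the client’s data; it uses statistics.correlation, which requires Python 3.10 or later). Smoking drives the outcome the client actually cares about, lifetime claims cost, while satisfaction scores are just noise:

```python
import random
from statistics import correlation  # Python 3.10+

random.seed(1)

# Fabricated policyholders: smoking raises true lifetime claims cost,
# while satisfaction with the insurer is unrelated to risk.
n = 10_000
smoker = [1 if random.random() < 0.2 else 0 for _ in range(n)]
lifetime_claims = [random.gauss(30_000 if s else 10_000, 5_000) for s in smoker]
satisfaction = [random.gauss(7, 2) for _ in smoker]

print(correlation(smoker, lifetime_claims))  # strong: the outcome you care about
print(correlation(smoker, satisfaction))     # ~0: the outcome you happen to have
```

An analysis built on the satisfaction data would dutifully produce coefficients, charts, and “insights,” and none of it would say anything about the risk the client actually needed to price.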