A thought-provoking and startlingly insightful reworking of the science of prediction.

In Prediction Revisited: The Importance of Observation, a team of renowned experts in the field of data-driven investing delivers a ground-breaking reassessment of the delicate science of prediction for anyone who relies on data to contemplate the future. The book reveals why standard approaches to prediction based on classical statistics fail to address the complexities of social dynamics, and it provides an alternative method based on the intuitive notion of relevance. The authors describe, both conceptually and with mathematical precision, how relevance plays a central role in forming predictions from observed experience. Moreover, they propose a new and more nuanced measure of a prediction's reliability.

Prediction Revisited also offers:
* Clarifications of commonly accepted but less commonly understood notions of statistics
* Insight into the efficacy of traditional prediction models in a variety of fields
* Colorful biographical sketches of some of the key prediction scientists throughout history
* Mutually supporting conceptual and mathematical descriptions of the key insights and methods discussed within

With its strikingly fresh perspective grounded in scientific rigor, Prediction Revisited is sure to earn its place as an indispensable resource for data scientists, researchers, investors, and anyone else who aspires to predict the future from the data-driven lessons of the past.
Page count: 362
Publication year: 2022
Cover
Title Page
Copyright
Timeline of Innovations
Essential Concepts
Preface
1 Introduction
Relevance
Roadmap
Note
2 Observing Information
Observing Information Conceptually
Observing Information Mathematically
Observing Information Applied
Appendix 2.1: On the Inflection Point of the Normal Distribution
References
Notes
3 Co-occurrence
Co-occurrence Conceptually
Co-occurrence Mathematically
Co-occurrence Applied
References
Note
4 Relevance
Relevance Conceptually
Relevance Mathematically
Relevance Applied
Appendix 4.1: Predicting Binary Outcomes
References
Notes
5 Fit
Fit Conceptually
Fit Mathematically
Fit Applied
Notes
6 Reliability
Reliability Conceptually
Reliability Mathematically
Reliability Applied
References
Notes
7 Toward Complexity
Toward Complexity Conceptually
Toward Complexity Mathematically
Complexity Applied
References
8 Foundations of Relevance
Observations and Relevance: A Brief Review of the Main Insights
Abraham de Moivre (1667–1754)
Pierre-Simon Laplace (1749–1827)
Carl Friedrich Gauss (1777–1855)
Francis Galton (1822–1911)
Karl Pearson (1857–1936)
Ronald Fisher (1890–1962)
Prasanta Chandra Mahalanobis (1893–1972)
Claude Shannon (1916–2001)
References
Notes
Concluding Thoughts
Perspective
Insights
Prescriptions
Index
End User License Agreement
Chapter 2
Exhibit 2.5 Dataset
Exhibit 2.6 Arithmetic Averages
Exhibit 2.9 Pairwise Spreads and Variance Calculation—Industrial Production...
Exhibit 2.10 Conventional Variance Calculation—Industrial Production...
Exhibit 2.11 Arithmetic Averages, Variances, and Standard Deviations
Exhibit 2.12 Flipping 10 Coins
Chapter 3
Exhibit 3.4 Correlation Matrix
Exhibit 3.6 Pairwise z-scores for Individual Attributes
Exhibit 3.7 Pairwise Co-occurrence—Industrial Production and Nonfarm Payrol...
Exhibit 3.8 Information Distance of Pairwise Co-occurrence—Industrial Produ...
Exhibit 3.9 Pairwise Correlation Calculation—Industrial Production and Nonf...
Exhibit 3.11 Conventional Correlation Calculation—Industrial Production and...
Exhibit 3.12 Correlation Matrix
Exhibit 3.13 Covariance Matrix
Chapter 4
Exhibit 4.9 Observations and Their Relevance to
Exhibit 4.11 Relevance-Weighted Outcomes and Full-Sample Regression Predict...
Exhibit 4.13 Conventional Full-Sample Regression Prediction
Exhibit 4.15 Partial Sample Prediction Calculation
Exhibit 4.16 Asymmetry Calculation for Prediction
Exhibit 4.17 Aggregate Asymmetry Calculation
Chapter 5
Exhibit 5.3 Pairwise Relevance and Outcomes
Exhibit 5.4 Prediction Fit
Exhibit 5.5 Prediction Variance and Bounds
Exhibit 5.6 Precision
Exhibit 5.7 Calculating Prediction Fit for Partial Sample Prediction
Exhibit 5.8 Prediction Variance, Bounds, and Precision for Partial Sample P...
Chapter 6
Exhibit 6.3 Calculating Reliability
Exhibit 6.4 Pairwise Relevance and Pairwise Outcomes
Exhibit 6.5 Pairwise Calculation of Full-Sample Reliability
Exhibit 6.6 Traditional Approach to Calculating R-squared
Exhibit 6.7 Calculating Reliability for Partial Sample Regression
Exhibit 6.9 Removing the Impact of Biased Terms
Exhibit 6.10 Addressing the Bias of R-Squared
Chapter 8
Exhibit 8.1 Binomial Distribution
Chapter 2
Exhibit 2.1 Spreads
Exhibit 2.2 Triangle of Pairs
Exhibit 2.3 Subsequent GDP Growth
Exhibit 2.4 Trailing Percentage Changes in Industrial Production
Exhibit 2.7 Industrial Production
Exhibit 2.8 Histogram of Industrial Production
Exhibit 2.13 Distribution of Combinations
Exhibit 2.14 Counting Distinct Combinations for 30 Trials
Chapter 3
Exhibit 3.1 Two Positively Related Attributes for a Single Observation
Exhibit 3.2 Possible Co-occurrence Patterns for Attributes with a Given Aver...
Exhibit 3.3 Co-occurrence for Single Observations
Exhibit 3.5 Pairs of Pairs Approach for Estimating Co-occurrence
Exhibit 3.10 Pairwise Co-occurrence—Industrial Production and Nonfarm Payrol...
Chapter 4
Exhibit 4.1 Scatter Plot of Two Hypothetical Attributes
Exhibit 4.2 Similarity, Informativeness, and Relevance of Hypothetical Obser...
Exhibit 4.3 Simulated Symmetric Relationship
Exhibit 4.4 Predictions versus Actual Outcomes
Exhibit 4.5 Simulated Asymmetric Relationship
Exhibit 4.6 Predictions versus Actual Outcomes
Exhibit 4.7 Informativeness and Similarity with Two Attributes
Exhibit 4.8 Equally Informative Observations with Five Attributes
Exhibit 4.10 Relevance of Observations to
Exhibit 4.12 Outcome Deviations from Average
Exhibit 4.14 Relevance and Outcome Deviations—25% Most Relevant Observations...
Exhibit 4.18 Asymmetry
Exhibit 4.19 Logistic Function
Exhibit 4.20 Logit Function (inverse of the logistic function)
Chapter 5
Exhibit 5.1 Relevance and Outcomes for a Very Good Partial Fit
Exhibit 5.2 Pairwise Relevance and Fit
Chapter 6
Exhibit 6.1 Components of Fit for Simulated Random Noise
Exhibit 6.2 Components of Fit for Simulated Random Noise
Exhibit 6.8 Reliability of Partial Sample Regression
Chapter 7
Exhibit 7.1 Decision Tree
Chapter 8
Exhibit 8.2 Galton's Quincunx
MEGAN CZASONIS
MARK KRITZMAN
DAVID TURKINGTON
Copyright © 2022 by Megan Czasonis, Mark Kritzman, and David Turkington. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data is Available:
ISBN 9781119895589 (hardback)
ISBN 9781119895602 (ePDF)
ISBN 9781119895596 (epub)
Cover Design: Wiley
Cover Image: © akinbostanci/Getty Images
Relevance is the centerpiece of our approach to prediction. The key concepts that give rise to relevance were introduced over the past three centuries, as illustrated in this timeline. In Chapter 8, we offer more detail about the people who made these groundbreaking discoveries.
This book introduces a new approach to prediction, which requires a new vocabulary—not new words, but new interpretations of words that are commonly understood to have other meanings. Therefore, to facilitate a quicker understanding of what awaits you, we define some essential concepts as they are used throughout this book. And rather than follow the convention of presenting them alphabetically, we present them in a sequence that matches the progression of ideas as they unfold in the following pages.
Observation:
One element among many that are described by a common set of attributes, distributed across time or space, and which collectively provide guidance about an outcome that has yet to be revealed. Classical statistics often refers to an observation as a multivariate data point.
Attribute:
A recorded value that is used individually or alongside other attributes to describe an observation. In classical statistics, attributes are called independent variables.
Outcome:
A measurement of interest that is usually observed alongside other attributes, and which one wishes to predict. In classical statistics, outcomes are called dependent variables.
Arithmetic average:
A weighted summation of the values of attributes or outcomes that efficiently aggregates the information contained in a sample of observations. Depending on the context and the weights that are used, the result may be interpreted as a typical value or as a prediction of an unknown outcome.
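As a quick illustration of the two readings (typical value versus weighted prediction), here is a minimal Python sketch; the numbers and weights are purely hypothetical.

```python
import numpy as np

# Hypothetical outcomes and weights, chosen only to illustrate the two readings
outcomes = np.array([2.0, 3.0, 5.0, 10.0])
weights = np.array([0.1, 0.2, 0.3, 0.4])

print(outcomes.mean())                        # equal weights: a typical value
print(np.average(outcomes, weights=weights))  # unequal weights: a weighted prediction of an outcome
```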
Spread:
The pairwise distance between observations of an attribute, measured in units of surprise. We compute this distance as the average of half the squared difference in values across every pair of observations. In classical statistics, the same quantity is usually computed as the average of squared deviations of observations from their mean and is referred to as variance. However, the equivalent evaluation of pairwise spreads reveals why we must divide by N – 1 rather than N to obtain an unbiased estimate of a sample's variance; it is because the zero distance of an observation with itself (the diagonal in a matrix of pairs) conveys no information.
Information theory:
A unified mathematical theory of communication, created by Claude Shannon, which expresses messages as sequences of 0s and 1s and, based on the inverse relationship of information and probability, prescribes the optimal redundancy of symbols to manage the speed and accuracy of transmission.
Circumstance:
A set of attribute values that collectively describes an observation.
Informativeness:
A measure of the information conveyed by the circumstances of an observation, based on the inverse relationship of information and probability. For an observation of a single attribute, it is equal to the observed distance from the average, squared. For an observation of two or more uncorrelated attributes, it is equal to the sum of each individual attribute's informativeness. For an observation of two or more correlated attributes—the most general case—it is given by the Mahalanobis distance of the observation from the average of the observations. Informativeness is a component of relevance. It does not depend on the units of measurement.
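A minimal Python sketch of this measure, assuming the Mahalanobis form described above; the function name and the simulated data are illustrative, not drawn from the book's exhibits.

```python
import numpy as np

def informativeness(X, x):
    # Mahalanobis distance of circumstance x from the average of the
    # observations in X (rows = observations, columns = attributes)
    mu = X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    d = x - mu
    return float(d @ inv_cov @ d)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # simulated sample: 100 observations, 3 attributes

# With a single attribute, this measure reduces to the squared z-score of the observation.
print(informativeness(X, X[0]))
```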
Co-occurrence:
The degree of alignment between two attributes for a single observation. It ranges between –1 and +1 and does not depend on the units of measurement.
Correlation:
The average co-occurrence of a pair of attributes across all observations, weighted by the informativeness of each observation. In classical statistics, it is known as the Pearson correlation coefficient.
Covariance matrix:
A symmetric square matrix of numbers that concisely summarizes the spreads of a set of attributes along with the signs and strengths of their correlation. Each element pertains to a pair of attributes and is equal to their correlation times their respective standard deviations (the square root of variance or spread).
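For concreteness, a tiny sketch of that construction with made-up numbers: each element is the pair's correlation times the product of their standard deviations.

```python
import numpy as np

corr = np.array([[1.0, 0.3],     # illustrative correlation matrix
                 [0.3, 1.0]])
sd = np.array([2.0, 0.5])        # illustrative standard deviations

cov = corr * np.outer(sd, sd)    # cov[i, j] = corr[i, j] * sd[i] * sd[j]
print(cov)
```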
Mahalanobis distance:
A standardized measure of distance or surprise for a single observation across many attributes, which incorporates all the information from the covariance matrix. The Mahalanobis distance of a set of attribute values (a circumstance) from the average of the attribute values measures the informativeness of that observation. Half of the negative of the Mahalanobis distance of one circumstance from another measures the similarity between them.
Similarity:
A measure of the closeness between one circumstance and another, based on their attributes. It is equal to the opposite (negative) of half the Mahalanobis distance between the two circumstances. Similarity is a component of relevance.
Relevance:
A measure of the importance of an observation to forming a prediction. Its components are the informativeness of past circumstances, the informativeness of current circumstances, and the similarity of past circumstances to current circumstances.
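The sketch below combines the three components in Python. The equal halves placed on the two informativeness terms are an assumed weighting (one convenient choice, under which relevance reduces to a cross product of deviations scaled by the inverse covariance matrix); the function names and simulated data are ours, not the book's.

```python
import numpy as np

def relevance(X, x_t):
    # Relevance of each past observation (row of X) to the current circumstance x_t.
    # Assumption: relevance = similarity + 0.5 * (informativeness of each circumstance).
    mu = X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    def info(x):                 # informativeness: Mahalanobis distance from the average
        d = x - mu
        return d @ inv_cov @ d
    def sim(a, b):               # similarity: minus half the Mahalanobis distance between circumstances
        d = a - b
        return -0.5 * (d @ inv_cov @ d)
    return np.array([sim(x_i, x_t) + 0.5 * (info(x_i) + info(x_t)) for x_i in X])

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))     # simulated observations
print(relevance(X, X[-1])[:5])   # relevance of the first five observations to the last circumstance
```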
Partial sample regression:
A two-step prediction process in which one first identifies a subset of observations that are relevant to the prediction task and, second, forms the prediction as a relevance-weighted average of the historical outcomes in the subset. When the subset from the first step equals the full-sample, this procedure converges to classical linear regression.
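A hedged sketch of the two steps, reusing the relevance function from the previous sketch; the 25 percent cutoff and the simple normalization by the sum of relevance weights are illustrative choices, not the book's exact scaling.

```python
import numpy as np

def partial_sample_prediction(X, y, x_t, keep=0.25):
    r = relevance(X, x_t)                      # relevance sketch defined above
    subset = r >= np.quantile(r, 1 - keep)     # step 1: keep the most relevant observations
    w = r[subset]
    y_bar = y.mean()
    # step 2: relevance-weighted average of outcome deviations, added back to the average
    return y_bar + np.sum(w * (y[subset] - y_bar)) / np.sum(w)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)
print(partial_sample_prediction(X, y, rng.normal(size=3)))
```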
Asymmetry:
A measure of the extent to which predictions differ when they are formed from a partial sample regression that includes the most relevant observations compared to one that includes the least relevant observations. It is computed as the average dissimilarity of the predictions from these two methods. Equivalently, it may be computed by comparing the respective fits of the most and least relevant subsets of observations to the cross-fit between them. The presence of asymmetry causes partial sample regression predictions to differ from those of classical linear regression. The minimum amount of asymmetry is zero, in which case the predictions from full-sample and partial-sample regression match.
Fit:
The average alignment between relevance and outcomes across all observation pairs for a single prediction. It is normalized by the spreads of relevance and outcomes, and while the alignment for one pair of observations may be positive or negative, their average always falls between zero and one. A large value indicates that observations that are similarly relevant have similar outcomes, in which case one should have more confidence in the prediction. A small value indicates that relevance does not line up with the outcomes, in which case one should view the prediction more cautiously.
Bias:
The artificial inflation of fit resulting from the inclusion of the alignment of each observation with itself. This bias is addressed by partitioning fit into two components—outlier influence, which is the fit of observations with themselves, and agreement, which is the fit of observations with their peers—and using agreement to give an unbiased measure of fit.
Outlier influence:
The fit of observations with themselves. It is always greater than zero, owing to the inherent bias of comparing observations with themselves, and it is larger to the extent that unusual circumstances coincide with unusual outcomes.
Agreement:
The fit of observations with their peers. It may be positive, negative, or zero, and is not systematically biased.
Precision:
The inverse of the extent to which the randomness of historical observations (often referred to as noise) introduces uncertainty to a prediction.
Focus:
The choice to form a prediction from a subset of relevant observations even though the smaller subset may be more sensitive to noise than the full sample of observations, because the consistency of the relevant subset improves confidence in the prediction more than noise undermines confidence.
Reliability:
The average fit across a set of prediction tasks, weighted by the informativeness of each prediction circumstance. For a full sample of observations, it may be computed as the average alignment of pairwise relevance and outcomes and is equivalent to the classical R-squared statistic.
Complexity:
The presence of nonlinearities or other conditional features that undermine the efficacy of linear prediction models. The conventional approach for addressing complexity is to apply machine learning algorithms, but one must counter the tendency of these algorithms to overfit the data. In addition, it can be difficult to interpret the inner workings of machine learning models. A simpler and more transparent approach to complexity is to filter observations by relevance. The two approaches can also be combined.
The path that led us to write this book began in 1999. We wanted to build an investment portfolio that would perform well across a wide range of market environments. We quickly came to the view that we needed more reliable estimates of volatilities and correlations—the inputs that determine portfolio risk—than the estimates given by the conventional method of extrapolating historical values. Our thought back then was to measure these statistics from a subset of the most unusual periods in history. We reasoned that unusual observations were likely to be associated with material events and would therefore be more informative than common observations, which probably reflected useless noise. We had not yet heard of the Mahalanobis distance, nor were we aware of Claude Shannon's information theory. Nonetheless, as we worked on our task, we derived the same formula Mahalanobis originated to analyze human skulls in India more than 60 years earlier.
As we extended our research to a broader set of problems, we developed a deep appreciation of the versatility of the Mahalanobis distance. In a single number, his distance measure tells us how dissimilar two items are from each other, accounting not only for the size and alignment of their many features, but also the typical variation and covariation of those features across a broader sample. We applied the method first to compare periods in time, each characterized by its economic circumstances or the returns of financial assets, and this led to other uses. We were impressed by the method's potential to tackle familiar problems in new ways, often leading to new paths of understanding. This eventually led to our own discovery that the prediction from a linear regression equation can be equivalently expressed as a weighted average of the values of past outcomes, in which the weights are the sum of two Mahalanobis distances: one that measures unusualness and the other similarity. Although we understood intuitively why unusual observations are more informative than common ones, it was not until we connected our research to information theory that we fully appreciated the nuances of the inverse relationship of information and probability.
Our focus on observations led us to the insight that we can just as well analyze data samples as collections of pairs rather than distributions of observations around their average. This insight enabled us to view variance, correlation, and R-squared through a new lens, which shed light on statistical notions that are commonly accepted but not so well understood. It clarified, for example, why we must divide by N – 1 instead of N to compute a sample variance. It gave us more insight into the bias of R-squared and suggested a new way to address this bias. And it showed why we square distances in so many statistical calculations. (It is not merely because unsquared deviations from the mean sum to zero.)
But our purpose goes beyond illuminating vague notions of statistics, although we hope that we do this to some extent. Our larger mission is to enable researchers to deploy data more effectively in their prediction models. It is this quest that led us down a different path from the one selected by the founders of classical statistics. Their purpose was to understand the movement of heavenly bodies or games of chance, which obey relatively simple laws of nature. Today's most pressing challenges deal with esoteric social phenomena, which obey a different and more complex set of rules.
The emergent approach for dealing with this complexity is the field of machine learning, but more powerful algorithms introduce complexities of their own. By reorienting data-driven prediction to focus on observation, we offer a more transparent and intuitive approach to complexity. We propose a simple framework for identifying asymmetries in data and weighting the data accordingly. In some cases, traditional linear regression analysis gives sufficient guidance about the future. In other cases, only sophisticated machine learning algorithms offer any hope of dealing with a system's complexity. However, in many instances the methods described in this book offer the ideal blend of transparency and sophistication for deploying data to guide us into the future.
We should acknowledge upfront that our approach to statistics and prediction is unconventional. Though we are versed, to some degree, in classical statistics and have a deep appreciation for the insights gifted to us by a long line of scholars, we have found it instructive and pragmatic to reconsider the principles of statistics from a fresh perspective—one that is motivated by the challenge we face as financial researchers and by our quest for intuition. But mostly we are motivated by a stubborn refusal to stop asking the question: Why?
Practitioners have difficult problems to solve and often too little time. Those on the front lines may struggle to absorb everything that technical training has to offer. And there are bound to be many useful ideas, often published in academic articles and books, that are widely available yet seldom used, perhaps because they are new, complex, or just hard to find.
Most of the ideas we present in this book are new to us, meaning that we have never encountered them in school courses or publications. Nor are we aware of their application in practice, even though investors clearly thrive on the quality of their predictions. But we are not so much concerned with precedence as we are with gaining and sharing a better understanding of the process of data-driven prediction. We would, therefore, be pleased to learn of others who have already come to the insights we present in this book, especially if they have advanced them further than we have.
We rely on experience to shape our view of the unknown, with the notable exception of religion. But for most practical purposes we lean on experience to guide us through an uncertain world. We process experiences both naturally and statistically; however, the way we naturally process experiences often diverges from the methods that classical statistics prescribes. Our purpose in writing this book is to reorient common statistical thinking to accord with our natural instincts.
Let us first consider how we naturally process experience. We record experiences as narratives, and we store these narratives in our memory or in written form. Then when we are called upon to decide under uncertainty, we recall past experiences that resemble present circumstances, and we predict that what will happen now will be like what happened following similar past experiences. Moreover, we instinctively focus more on past experiences that were exceptional rather than ordinary because they reside more prominently in our memory.
Now, consider how classical statistics advises us to process experience. It tells us to record experiences not as narratives, but as data. It suggests that we form decisions from as many observations as we can assemble or from a subset of recent observations, rather than focus on observations that are like current circumstances. And it advises us to view unusual observations with skepticism. To summarize:
Natural Process
Records experiences as narratives.
Focuses on experiences that are like current circumstances.
Focuses on experiences that are unusual.
Classical Statistics
Records experiences as data.
Includes observations irrespective of their similarity to current circumstances.
Treats unusual observations with skepticism.
The advantage of the natural process is that it is intuitive and sensible. The advantage of classical statistics is that by recording experiences as data we can analyze experiences more rigorously and efficiently than would be allowed by narratives. Our purpose is to reconcile classical statistics with our natural process in a way that secures the advantages of both approaches.
We accomplish this reconciliation by shifting the focus of prediction away from the selection of variables to the selection of observations. As part of this shift in focus from variables to observations, we discard the term variable. Instead, we use the word attribute to refer to an independent variable (something we use to predict) and the word outcome to refer to a dependent variable (something we want to predict). Our purpose is to induce you to think foremost of experiences, which we refer to as observations, and less so of the attributes and outcomes we use to measure those experiences. This shift in focus from variables to observations does not mean we undervalue the importance of choosing the right variables. We accept its importance. We contend, however, that the choice of variables has commanded disproportionately more attention than the choice of observations. We hope to show that by choosing observations as carefully as we choose variables, we can use data to greater effect.
The underlying premise of this book is that some observations are relevant, and some are not—a distinction that we argue receives far less attention than it deserves. Moreover, of those that are relevant, some observations are more relevant than others. By separating relevant observations from those that are not, and by measuring the comparative relevance of observations, we can use data more effectively to guide our decisions. As suggested by our discussion thus far, relevance has two components: similarity and unusualness. We formally refer to the latter as informativeness. This component of relevance is less intuitive than similarity but is perhaps more foundational to our notion of relevance; therefore, we tackle it first.
Informativeness is related to information theory, the creation of Claude Shannon, arguably the greatest genius of the twentieth century.1 As we discuss in Chapter 2, information theory posits that information is inversely related to probability. In other words, observations that are unusual contain more information than those that are common. We could stop here and rest on Shannon's formidable reputation to validate our inclusion of informativeness as one of the two components of relevance. But it never hurts to appeal to intuition. Therefore, let us consider the following example.
Suppose we would like to measure the relationship between the performance of the stock market and a collection of economic attributes (think variables) such as inflation, interest rates, energy prices, and economic growth. Our initial thought might be to examine how stock returns covary with changes in these attributes. If these economic attributes behaved in an ordinary way, it would be difficult to tell which of the attributes were driving stock returns or even if the performance of the stock market was instead responding to hidden forces. However, if one of the attributes behaved in an unusual way, and the stock market return we observed was also notable, we might suspect that these two occurrences are linked by more than mere coincidence. It could be evidence of a fundamental relationship. We provide a more formal explanation of informativeness in Chapter 2, but for now let us move on to similarity.
Imagine you are a health care professional charged with treating a patient who has contracted a life-threatening disease. It is your job to decide which treatment to apply among a variety of available treatments. You might consider examining the outcomes of alternative treatments from as large a sample of patients with the same disease as you can find, reasoning that a large sample should produce more reliable guidance than a small sample. Alternatively, you might focus on a subset of the large sample comprising only patients of a similar age, with similar health conditions, and with similar behavior regarding exercise and smoking. The first approach of using as large a sample as possible to evaluate treatments would undoubtedly yield the more robust treatment; that is, the treatment that would help, at least to some extent, the largest number of patients irrespective of each person's specific features. But the second approach of focusing on a targeted subset of similar patients is more likely to identify the most effective treatment for the specific patient under your care.
We contrived these examples to lend intuition to the notions of informativeness and similarity. In most cases, though, informativeness and similarity depend on nuances that we would fail to detect by casual inspection. Moreover, it is important that we combine an observation's informativeness and similarity in proper proportion to determine its relevance. This would be difficult, if not impossible, to do informally.
Fortunately, we have discovered how to measure informativeness, similarity, and therefore relevance, in a mathematically precise way. The recipe for doing so is one of the key insights of this book. However, before we reveal it, we need to establish a new conceptual and mathematical foundation for observing data. By viewing common statistical measures through a new lens, we hope to bring clarity to certain statistical concepts that, although they are commonly accepted, are not always commonly understood. But our purpose is not to present these new statistical concepts merely to enlighten you; rather, we hope to equip you with tools that will enable you to make better predictions.
Here is what awaits you. In Chapter 2, we lay out the foundations of our approach to observing information from data. In Chapter 3, we characterize patterns between multiple attributes. In Chapter 4, we introduce relevance and show how to use it to form predictions. In Chapter 5, we discuss how to measure confidence in predictions by considering the tradeoff between relevance and noise. In Chapter 6, we apply this new perspective to evaluate the efficacy of prediction models. In Chapter 7, we compare our relevance-based approach to prediction to machine learning algorithms. And lastly, in Chapter 8, we provide biographical sketches of some of the key scientists throughout history who established the theoretical foundation that underpins our notion of relevance.
In each chapter, we first present the material conceptually, leaning heavily on intuition. And we highlight the key takeaways from our conceptual exposition. Then, we present the material again, but this time mathematically. We conclude each chapter with an empirical application of the concepts, which builds upon itself as we progress through the chapters.
If you are strongly disinclined toward mathematics, you can pass by the math and concentrate only on the prose, which is sufficient to convey the key concepts of this book. In fact, you can think of this book as two books: one written in the language of poets and one written in the language of mathematics, although you may conclude we are not very good at poetry.
We expect some readers will view our key insight about relevance skeptically, because it calls into question notions about statistical analysis that are deeply entrenched in beliefs from earlier training. To get the most out of this book, we ask you to suspend these beliefs and give us a chance to convince you of the validity of our counterclassical interpretation of data by appealing to intuition, mathematics, and empirical illustration. We thank you in advance for your forbearance.
1. Some might prefer to assign this accolade to Albert Einstein, but why quibble? Both were pretty smart.
Our journey into data-driven prediction begins with some basic ideas. In this chapter, we set forth principles which may at first seem obvious, but which, upon deeper inspection, have profound implications. These ideas lay the foundation for everything that follows.
Whenever we approach a new dataset the first order of business is to get our bearings. We have before us a series of observations, each of which is described by a set of attributes. The observations could be of people, described by attributes like age, health, education, salary, and place of residence. They could be times at-bat for a major league baseball player, with attributes of runs-batted-in, home runs, walks, strikeouts, weather conditions, and where the game took place. Or the observations could be periods of economic performance measured by attributes such as growth in output, inflation, interest rates, unemployment, stock market returns, and perhaps the political parties in power at the time. What matters is that we have a set of observations characterized by a consistent collection of attributes. A conventional statistics approach would have us focus on these attributes and refer to them as variables, but as we stated earlier, we ask that you indulge us as we focus mainly on how we observe these attributes.
We begin by summarizing the observations as averages. Throughout this book we will compute many averages. We use the average as a device to let the data speak. Sometimes this process will act democratically, assigning equal weight to each observation. Other times, we will overweight more relevant observations at the expense of others. In either case, our goal is to separate information from noise. The humble arithmetic average does this job well.
After attending the West of England Fat Stock and Poultry Exhibition in 1906, the British polymath Francis Galton was struck by the surprising power of a simple average. As he documented in an article soon after the exhibition, 787 people guessed the weight of an ox that had been slaughtered for market, with the hope of winning a prize. Galton found that the average of their guesses came remarkably close to the true weight, delivering more accuracy than any individual guess, including those of the proclaimed experts in attendance. James Surowiecki pays homage to this effect in his 2004 book The Wisdom of Crowds. Among the myriad examples he includes is a more practical equivalent of Galton's experiment, whereby people guess the number of jelly beans in a jar. The punchline of these experiments is that when conditions are right, an average guess is eerily precise.
Suppose we choose a single attribute for whatever data we have in mind and compute its average across observations. The result provides a measure of central tendency, as it surely lies within the range of values we observe. There are, of course, other ways to gauge the characteristic value of an attribute. The median splits observed values in half and points to the dividing line, while the mode represents the most common occurrence. The median and mode are sometimes considered better measures of central tendency because they are less sensitive to outliers than the average, and therefore generally more stable. But as we have already mentioned and will soon show, unusual observations are the most informative of all.
Assume we compute the average for every attribute in our dataset. Now, in addition to having counted the number of observations, we know what types of values to expect for each attribute. Observing a value of 0.00001, 1, 1,000,000, or –1,000,000 should not necessarily surprise us so long as that value is near the average, because this means it is within the range of recorded experience.
Consider a set of observations for an attribute, which we plot along a line as shown in Exhibit 2.1. Are these observations tightly clustered or broadly dispersed? A natural way to address this question—though interestingly not the typical way—is to consider the distance between two observations of a pair. In other words, if we present you with two random observations from the set, how far apart should you expect them to be? The answer to this question involves taking another average. This time, however, we do not average over every observation, but over every pair of observations.
In an analysis of basketball greats, this calculation might involve comparing points per game for Michael Jordan to LeBron James, Wilt Chamberlain to Jordan, and Chamberlain to James. For three players there are three distinct pairs, for 10 players there would be 45 pairs, and for N players there would be N(N – 1)/2 pairs. Imagine lining up each player's name in a row and a column to form a grid. Of all the N² matchups, we remove the N diagonal entries that compare a player to himself and divide the remainder by 2 to remove redundancies. We want to average across all these pairs.
Let us assert, boldly and without any justification for now, that the quantity we are most interested in measuring is half the squared distance between any two observations. This choice leads us to an important equivalence. It turns out that the average across pairs yields precisely the same result as the textbook formula for variance. The equivalence holds even though the conventional formula for variance measures deviations from average, rather than deviations across every pair. It helps explain a feature of the well-known sample variance formula that may at first seem puzzling: the requirement to divide by N – 1 instead of N to obtain an unbiased result. This pairwise perspective makes clear why we must use one fewer than the number of observations in the classical formula. It arises directly from the fact that we exclude the trivial comparisons of values to themselves, which would otherwise impart an overconfidence bias (see the math section for more detail).
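A quick numerical check of this equivalence, using simulated data of our own choosing:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
x = rng.normal(size=30)                        # any sample of a single attribute will do

# Average of half the squared difference over all N * (N - 1) / 2 distinct pairs
pair_spread = np.mean([0.5 * (a - b) ** 2 for a, b in combinations(x, 2)])

# Textbook sample variance, dividing by N - 1
sample_var = x.var(ddof=1)

print(pair_spread, sample_var)                 # the two quantities match
assert np.isclose(pair_spread, sample_var)
```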
Exhibit 2.1 Spreads
Yet a puzzle remains. Why should we focus on half the squared distance between values? It would seem much simpler to record the distance and leave it at that. This choice is worth considering carefully. There is something special about half the squared distance, and we will encounter this theme repeatedly. Why should we square the distance? And why should we divide it in half?
The first part of the answer rests on insights from Claude Shannon, the father of information theory. Shannon was a creative genius who laid the foundation for our modern information age. In 1937, while a graduate student at MIT, he introduced new rigor to circuit design, proving that mere electrical switches can implement logical reasoning to solve problems. After earning his PhD at MIT, Shannon went to work at Bell Laboratories, a storied innovation hub where dozens of brilliant minds crossed paths in the twentieth century. Nine alumni have received Nobel prizes, mostly in physics, with credit to work they did at Bell Labs. Shannon himself is notably absent from this list; it is not for lack of profundity or practical impact of his work, but because the Nobel Prize has no clear category for his contributions to humanity (he received many other awards). He introduced the field of cryptography after working on practical problems of message encryption during World War II. He defined the universal language of computing in terms of the binary digits of 0 and 1, recognizing that all information boils down to this form. And, in a breakthrough 1948 paper, he unveiled the contribution for which he is best known, a unified mathematical theory of communication that spawned several fields of study and decades of further innovation.
But by his own admission, Shannon did not set out with grand plans to change the world. He was at heart an eccentric tinkerer with a healthy sense of humor. It is safe to assume these personality traits fueled his creative spirit. He was known for riding a unicycle down the hallways of Bell Labs while juggling. And it is entertaining to note that a man who made so many practical contributions was also at times captivated by so-called useless machines—contraptions whose only purpose when switched on is to turn themselves back off with a mechanical arm. Shannon built many such machines and displayed one in his office. Together with his colleague Edward Thorp he devised a computer that they concealed and wore to casinos to gain a statistical advantage at the roulette table. As William Poundstone recounts in Fortune's Formula, the pair of inventors figured they could exploit tiny probability advantages that arose from the tilt of roulette wheels. Though imperceptible to the naked eye, off-kilter wheels land on some numbers more than they should, and the wearable device allowed inputs by foot pedal to inform which numbers to bet on. Alas, he did not become rich from gambling. Later in life, Shannon spent a great deal of time contemplating artificial intelligence and experimenting with it. To this end he built a robotic mouse to navigate a maze, and much more.
Shannon's theory of information relates to our present discussion. He formalized the essential notion that information must be a sort of inverse of probability. To emphasize this point, let us start by acknowledging that it is not newsworthy when a likely outcome occurs. Rare events, on the other hand, are notable. Shannon showed, with mathematical precision, that rare events convey more information than common ones.
We can illustrate this fact using everyday examples and basic intuition. Suppose a friend tells you that she went to the grocery store, bought some apples, and came home: it was uneventful. We can easily understand and visualize this story by drawing on common experience. Now suppose another friend tells you that she went to the store and something truly crazy happened. From our perspective as a listener this could be anything; we need more information to understand. Moreover, the story might take a while, because when things are unusual there is more to explain.
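In Shannon's formalism this intuition becomes quantitative: an event with probability p carries -log2(p) bits of information, so rarer events carry more. A small illustrative snippet (the function name is ours):

```python
import math

def surprise_bits(p):
    # Shannon's information content of an event with probability p, in bits
    return -math.log2(p)

print(surprise_bits(0.5))    # a fair coin flip: 1 bit
print(surprise_bits(0.01))   # a 1-in-100 event: about 6.6 bits
```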
