Fact Forward - Dan Gaylin - E-Book

Description

Solutions to increase trust and empower better decision making in a data-rich world

Fact Forward: The Perils of Bad Information and the Promise of a Data-Savvy Society explores how a growing deluge of data has led to a data-rich world with abundant new opportunities and a precipitous decline in trust due to the problems we face in producing, communicating, and consuming data. This book takes readers on a journey through the data ecosystem, showing how data producers, data consumers, and data disseminators all have a role to play in creating a more data-savvy society.

Written by Dan Gaylin, president and CEO of NORC at the University of Chicago, a leading research organization in the field of social science and data science, this book demonstrates the urgent need for:

  • greater transparency on the part of data producers
  • increased data literacy on the part of data communicators and data consumers
  • a societal commitment to data education and infrastructure

Fact Forward: The Perils of Bad Information and the Promise of a Data-Savvy Society earns a well-deserved spot on the bookshelves of leaders across industries and all individuals who want to build a better society and world by improving the way we present, analyze, and make use of data.


Page count: 434

Publication year: 2025




Table of Contents

Cover

Table of Contents

Praise for Fact Forward

Title Page

Copyright

About the Cover

Preface

1 The Importance of Being Data Savvy

Why This Matters

The Role of Bad Data in the 2008 Global Financial Crisis

The Financial Disaster Reveals the Four Types of Data Failure

What Does It Mean to Be Data Savvy and Why Is It Important?

Understanding Roles in Data Proliferation

The Spread of Faulty Data Is Destructive to Society

The Solution to Data Failures

Notes

2 Understanding the Data Ecosystem

An Evolving Data Ecosystem Shapes Our Understanding of Statistics

Data Consumers Are Generally Thoughtful About the Data Ecosystem

Building a Fact‐Forward Society in a World of Data Democratization

To Understand Any Data, You Must Be Aware of Its Data Ecosystem Context

Notes

3 The Breadth of the Data Universe

An Illustrative Example: Unemployment Statistics

Ubiquitous Data

Analyzing Data Enables Smart Decisions

Common Principles Amid the Diversity of Data Types and Analysis

Notes

4 The Challenge of Data Integrity

Data Must Be Fit for Purpose

Matching Research Questions and Methods

To Trust Research, You Need an Adequate Sample Size

Only a Representative, Unbiased Sample Can Support a Fair Statistical Analysis

The Data‐Savvy Consumer

Notes

5 Data and Algorithmic Transparency

Why Transparency Matters

Transparency in Research Studies and the Reproducibility Crisis

Transparency Is a Continuum

Transparency in Public Data

Public Opinion Researchers Must Clarify Their Data's Limitations

Transparency in Financial Data of Publicly Traded Companies

The Challenge of Algorithmic Transparency

Transparency Is Central to the Data Ecosystem

Notes

6 The Paradox of Data Neutrality

Maintaining Neutrality Is Difficult, but Essential for a Fact‐Forward World

Types of Data Advocacy

Data Advocacy in the Sciences

The Worrisome Challenge of Data Advocacy in Government

How to Be Data Neutral in a World Full of Data Advocacy

Notes

7 Fostering Data Literacy

Data Literacy Creates Clarity Across the Data Ecosystem

Data Journalism Illuminates New Facts About the World

Data‐Literate Journalists Must Inject Context into Their Work

Data Literacy Has Become a Corporate Imperative

Data Literacy in Government

Most Consumers Lack Data Literacy, and They Know It

Education at All Levels Will Create the Foundation for Data Literacy

Notes

8 Standards and Privacy

Expanding the Usefulness of Data Depends on Standards and Privacy Protections

Examples of How Growth Emerged from Data Standards

Standardized Formats and Definitions Are Central to Government Data Infrastructure

Without Privacy Protections, Combining Data Is Problematic

The Promise of Federal Information Sharing Depends on Updated Privacy Regimes

Standards and Privacy Are Building Blocks for Public Data Infrastructure

Notes

9 Public Data Infrastructure

Public Data Infrastructure Is Worth Billions, Perhaps Trillions

What Are the Elements of Public Data Infrastructure?

Around the World, Governments Are Creating Public Data Infrastructure

American Public Data Infrastructure Successes

Steps Toward America's Public Data Infrastructure

A Powerful Vision Emerges

The Far‐Reaching Benefits of a Growing Public Data Infrastructure

Notes

10 The Promise and Challenge of Artificial Intelligence

How to Think About AI

AI Has Great Potential to Improve the Analysis of Data

Ensuring Accuracy Remains a Major AI Challenge

Distrust and Verify

Throughout the AI Ecosystem, a “Human in the Loop” Is Essential

The Story Continues to Be Written – Rapidly

Notes

11 The Data‐Savvy Future

Individuals and Institutions Are Becoming More Data Savvy

Training and Infrastructure Are Evolving to Support a Data‐Savvy World

Technology – Especially AI – Will Accelerate These Trends

Where Will Data‐Savvy Leadership Come From?

Notes

Acknowledgments

About the Author

Index

End User License Agreement

List of Illustrations

Chapter 1

Figure 1‐1: The Google Flu model's initial success in predicting actual flu ...

Figure 1‐2: Candidates' estimated chances of winning in the 2016 US presiden...

Chapter 2

Figure 2‐1: Data democratization expands roles in the ecosystem.

Figure 2‐2: Most Americans are overwhelmed at times by the amount of informa...

Figure 2‐3: People trust social media more if it comes from a trusted sharer...

Figure 2‐4: Older Americans acquire and share news and information less on s...

Figure 2‐5: Factors that influence whether people share information on socia...

Chapter 3

Figure 3‐1: There is a significant correlation between moderate physical act...

Figure 3‐2: The correlation of chocolate consumption and Nobel laureates per...

Figure 3‐3: The number of US adults who never attend religious services is r...

Figure 3‐4: Life expectancy recently declined for people without a college d...

Chapter 4

Figure 4‐1: Conspiratorial thinkers are characterized by certain core belief...

Chapter 5

Figure 5‐1: US adults are concerned about some data practices at online comp...

Figure 5‐2: Many factors influence people's trust in research. Survey questi...

Figure 5‐3: Estimates from the Department of Justice National Crime Victimiz...

Chapter 6

Figure 6‐1: Despite yearly variation, there is a clear upward trend in globa...

Figure 6‐2: Americans' confidence in the scientific community declined in 20...

Chapter 7

Figure 7‐1: The LA Times water usage data journalism project shows water use...

Figure 7‐2: The LA Times graphically shows historical and current water leve...

Figure 7‐3: A color‐coded chart reveals historical drought patterns....

Figure 7‐4: FiveThirtyEight uses aggregated poll averages to show shifting v...

Figure 7‐5: Financial Times data journalism shows patterns in which mosques ...

Figure 7‐6: Our World In Data reveals decades of data on how nations' health...

Figure 7‐7: Journalists believe their data skills are limited relative to th...

Figure 7‐8: Many factors influence people's trust in research. Survey questi...

Figure 7‐9: About half of US adults feel they are capable of understanding s...

Figure 7‐10: An overwhelming majority of teens feel data skills are importan...

Figure 7‐11: A majority of US adults say it is essential for students to lea...

Figure 7‐12: While one in three teens would take a course in data literacy, ...

Chapter 9

Figure 9‐1: Trends in gun ownership by education level, 1972 to 2022. Survey...

Figure 9‐2: Trends in acceptance of sexual relations between same‐sex adults...

Figure 9‐3: Trends in who favors the death penalty for murder convictions by...

Figure 9‐4: Trends in who says preschool kids suffer if the mother works, by...

Figure 9‐5: Details on sector contributions to a data-savvy society. An ove...

Figure 9‐6: Canadian vehicle registrations, 2021.

Chapter 10

Figure 10‐1: Americans are concerned about the risks associated with artific...

Figure 10‐2: The Gartner Hype Cycle.



Praise for Fact Forward

“Dan Gaylin's Fact Forward resonates deeply with USAFacts' mission of empowering Americans with the facts. It shows how we can build a society where facts aren't just available but truly empower citizens to make informed decisions. This is exactly the kind of resource we need to help citizens and leaders alike ground their choices in evidence rather than rhetoric.”

—Poppy MacDonald, President, USAFacts

“When trust in institutions and in information itself is in decline, ‘good’ data is essential for social progress and individual well‐being. Dan Gaylin's Fact Forward offers an accessible roadmap for building the ‘data‐savvy’ skills needed for smart decisions by everyday people and the institutions and organizations that affect their lives.”

—Julia Stasch, Immediate Past President, John D. and Catherine T. MacArthur Foundation

“Having led research institutions and educational organizations, I've seen how data can illuminate or obscure critical social realities. Dan Gaylin's Fact Forward shows us how to understand and use data responsibly – not just to advance knowledge but to create better outcomes across society. This is precisely the kind of guidance we need to transform data literacy from an academic skill into an essential life skill for everyone.”

—Raynard Kington, MD, PhD, Head of School, Phillips Academy; President Emeritus, Grinnell College

 

DAN GAYLIN

FACT FORWARD

The Perils of Bad Information and the Promise of a Data‐Savvy Society

 

 

 

 

Copyright © 2025 by National Opinion Research Center, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging‐in‐Publication Data is available:

ISBN: 9781394219896 (cloth)

ISBN: 9781394219902 (ePub)

ISBN: 9781394219919 (ePDF)

Cover Design: Wiley

Cover Image: Courtesy of NORC

Author Photo: © John Zich

About the Cover

The Fact Forward cover image is a data visualization that NORC created to illustrate the displacement of people due to Superstorm Sandy. It is one of several NORC explorations of how different methods of visualizing social media data could help track population movements during and after natural disasters. Each arc represents an individual Twitter account. The beginning of each arc represents the account holder's primary location two days before Superstorm Sandy, as measured by the zip codes where most of their tweets originated. The end of each arc represents the account holder's primary location two days after the storm.1

To see the original artwork, visit fact‐forward.norc.org/coverart.

Note

1. NORC at the University of Chicago. (2013). NORC insight illuminated: 2013‐2014 annual report. NORC at the University of Chicago. https://www.norc.org/content/dam/norc-org/pdfs/NORC%202013-14%20AnnualReport_optimized.pdf

Preface

Fact Forward is my first book. But really, it is not just my book. Rather, it is a broad rendering of the ethos of researchers everywhere who care about using data effectively and responsibly. To be fact forward, a person needs to develop a basic understanding of how to use data that I refer to as data savvy. The central theme of Fact Forward is that having data‐savvy skills is no longer solely for researchers. With data now woven into the fabric of our daily activities, it is incumbent on all of us to become responsible and effective users of data. In the pages that follow I offer a road map for doing that.

This book is grounded in the fundamental values of the research institutes I have been a part of for most of my professional career and the lessons I have learned from the dedicated researchers and colleagues who mentored me. In particular, the chapters ahead describe how we think at NORC at the University of Chicago, the 84‐year‐old organization that I have had the honor of leading for more than a decade. NORC is an objective, nonpartisan research organization known for delivering scientifically rigorous, trustworthy data and analysis to decision‐makers across society: individuals, families, communities, journalists, business leaders, people in government, policymakers, and legislators.

The decisions all of us make – at least the important ones – have the goal of producing the best outcomes possible for the people, principles, and ideas that we care about. That means we all have a common interest to ensure that we base our key decisions on the most reliable and trustworthy data available.

But how can you evaluate the data you encounter? If, for the moment, we call reliable and trustworthy data “good” data, how do we tell the difference between good data and bad data? What skills do all of us need to understand and identify these differences? What, specifically, are the obligations and responsibilities all of us have, as we produce, analyze, use, or share information? And at a deeper level, how do we determine what the best available data are to inform the particular decision we are trying to make? These are the central questions Fact Forward explores. And more than just exploring them, it offers insights along the way about how to do this, and the data‐savvy knowledge and skills we need to develop as individuals and as a society to be more effective in this arena.

In an increasingly data‐driven world, it is ever more imperative for all of us to become data savvy. Not so long ago, we thought of data analysts as a very specialized group of people with very particular roles. Today, one way or another, each of us is a data analyst – and a lot more. We all generate data, assess and interpret data, and share and discuss data. We do these things knowingly and (unfortunately) unknowingly as well. With the vast proliferation of digital technology, a simple mobile device becomes a mechanism for doing all these things. And in our day‐to‐day interactions with media, businesses, government, and one another, each of us is constantly bombarded with rapidly changing and newly emerging types of data.

What's more, with the spread of data analytics and data science, many (if not most) organizational actors are creating and pushing data to their audiences and stakeholders as a core part of what they do. Sometimes these data are reliable and trustworthy. Sometimes they are not. And sometimes they are designed to actively manipulate or mislead their audiences. It's fair to say that the concept of caveat emptor (Latin for let the buyer beware), originally applied to the marketplace of goods and services, now is equally relevant to the world of data, but with an added level of responsibility: We need to be aware of the risks both as consumers of data and as producers of data.

This is the key rationale for why we must all, as individuals and as a society, develop data‐savvy skills. A world filled with data is now an essential aspect of our existence. Piles of data are all around us. They’re growing and evolving. They’re there for the taking and the giving. They’re there for all of us to use, and for all of us to misuse. We must all become fact forward – for our own sake and for the greater good.

1 The Importance of Being Data Savvy

One of the central causes of the Global Financial Crisis of 2008 and the Great Recession that followed was bad data. The available information on key financial instruments at the heart of the crisis was faulty. Even so, investors – from individual homeowners to our most storied financial institutions – bet billions of dollars on that information. The root cause of one of the worst economic meltdowns in history was a combination of poor‐quality information, lack of transparency on its origins and limitations, and wishful thinking (and in some cases outright fraud) on the part of the people and organizations generating and analyzing and sharing the data.

The result was bankruptcies, crashing markets, millions of jobs lost, hundreds of billions of dollars in government bailouts, a disaster that took years to recover from, and lasting damage to the public's trust in financial and governmental institutions.1

This book is the story of the central role of data in the way citizens, consumers, companies, institutions, and governments perceive and act in the world – and how we can all improve our skills and interactions within that data ecosystem. Limitations in our ability to use data effectively, together with inaccurate or poor‐quality data, lead to widespread misunderstanding, uncertainty, and deception: problems in which we all play a part, and which undermine the common workings of society.

Why This Matters

We'll get back to the details of the data failures that created the Global Financial Crisis shortly. But first, let me explain why I care about this and why you should, too.

I have the privilege of serving as president and chief executive officer of NORC, one of the largest independent research organizations in the world. NORC is an objective, nonpartisan, global research institute that conducts hundreds of millions of dollars in research every year for governments, nonprofits, and businesses in the United States and many other nations. My background includes 35 years of conducting research with a wide range of data at some of the world's leading research institutes and in private consulting, as well as service as a senior health policy advisor at the US Department of Health and Human Services.

I am committed not just to promoting and supporting honest, unbiased, and transparent research but also to helping everyone understand how data are generated2 and how to use and interpret data to inform their most important decisions.

Today, many forces combine to create a vast sea of information of varying quality, leading to uncertainty across all aspects of society. These forces include the creation of flawed or biased data, a lack of transparency about data sources, and the distortion of data to manipulate and mislead people. This book provides a framework for understanding all forms of data and their limitations, and what I hope will become common expectations about appropriate use of data. The idea is to live in a fact‐forward world in which we consistently advance facts as the basis for making critical decisions. While this may sound elusive, I believe that the promise of a fact‐forward world is before us. To get there, all of us as individuals and as a society must prioritize the development of better data skills, in how we create, access, use, and share data. The broad development of these skills across these multiple dimensions is what I refer to throughout the book as becoming “data savvy.”

While it has been more than 15 years since the Global Financial Crisis, the data challenges it reveals are just as relevant today as they were in the 2000s. Moreover, we are now sufficiently far removed from these events to be able to look back at them and assess what went wrong and why. Four data problems led directly to this crisis: failures of data integrity, failures of data transparency, failures of data neutrality, and failures of data literacy. These problems remain highly relevant today, which means that we continue to be very much at risk for additional global disasters based on information failure.

The Role of Bad Data in the 2008 Global Financial Crisis

It was easy to get a mortgage in the 2000s.3 Consumers with limited incomes, poor credit, or inadequate down payments could still qualify for low‐documentation or no‐documentation mortgages. Mortgage brokers who made money on loan volume assured borrowers that they had the economic means to take on excessive mortgage debt. And mortgage bankers, incentivized to originate loans, were willing to lend money to underqualified borrowers. Many of these loans were adjustable‐rate mortgages with “teaser” interest rates that stayed low for the first two years but increased rapidly thereafter.

According to a paper by the economist Thomas Herndon, 70% of the eventual losses in the mortgage markets were caused by defaults on these low‐documentation and no‐documentation loans.4 But on their own, these loan defaults would never have brought down global financial markets.

At the center of the crisis was a stack of financial instruments known as CDOs and CDSs. Collateralized debt obligations (CDOs) were bonds based on hundreds or thousands of mortgages, while credit default swaps (CDSs) were insurance on the value of those bonds.

The task of accurately measuring the risk in these instruments fell to the independent bond‐rating agencies: Standard & Poor's (S&P), Moody's, and Fitch. An agency like S&P might rate a CDO bond backed by the highest‐quality homeowners and mortgages AA, indicating an investment grade bond with a very low risk of default, while a bond backed by lower‐quality mortgages might be graded BBB – still investment grade, but with a higher risk of default. The ratings agencies also assigned the highest possible ratings to most of the CDSs, indicating perhaps a 1‐in‐1,000 risk that their buyers would ever need to pay off the insurance.

The allure of low‐risk, high‐reward investments is enormous. The investment‐grade ratings on CDOs and CDSs encouraged financial institutions throughout Wall Street to buy billions of dollars of them.

As long as home values continued to increase, homeowners were able either to refinance with a new mortgage or to sell their highly mortgaged houses at a profit before their teaser rates expired. Financial firms profited from the bonds and derivatives based on those homes. This in turn further fueled home values and attracted still more questionable borrowers into low‐documentation loans to cash in on appreciating prices.

That, of course, is what a bubble looks like. And in 2008 – slowly, and then catastrophically – everything collapsed.

Home buyers began to default on their loans – especially as those two‐year low‐interest lockup periods began to expire, and their payments ballooned. The CDOs based on those mortgages became worthless. This triggered billions of dollars in insurance payments for the owners of the CDSs. The largest blue‐chip investment firms on Wall Street – including AIG, Lehman Brothers, Bear Stearns, and Merrill Lynch – found themselves with massive, completely unanticipated losses. The resulting implosion in financial markets froze monetary liquidity and led to the Great Recession. Despite a $700 billion government bailout for Wall Street, the recession put almost 9 million Americans out of work.

The global financial crisis was caused by a triple whammy: risky loans, which were bundled into CDOs and CDSs, which were in turn rated as low risk by the ratings agencies. Despite their excellent ratings, these investments were all built on adjustable‐rate mortgages doomed to eventually tumble, creating a highly correlated set of risks that blindsided all the major financial institutions at once. Each step was riddled with limited or bad data. This was the central cause of the crisis.
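The danger of correlated risk can be made concrete with a minimal simulation. The parameters below are invented purely for illustration (they are not the book's analysis): two loan pools share the same 5% average default rate, but in one pool each loan defaults independently, while in the other defaults are driven by a single shared housing‐market shock.

```python
# Hypothetical sketch of correlated vs. independent default risk.
# All parameters are made up for illustration; none come from the book.
import random

random.seed(0)
N_LOANS = 500    # mortgages in the pool
N_TRIALS = 1000  # simulated market scenarios
P_DEFAULT = 0.05 # average default rate in both settings

def tail_risk(correlated):
    """Share of scenarios in which losses exceed 20% of the pool."""
    bad = 0
    for _ in range(N_TRIALS):
        if correlated:
            # One shared shock: a housing bust (10% of scenarios) makes
            # half the loans default; otherwise essentially none do.
            # Average default rate is still 0.10 * 0.50 = 0.05.
            p = 0.50 if random.random() < 0.10 else 0.0
        else:
            p = P_DEFAULT  # each loan defaults on its own
        defaults = sum(random.random() < p for _ in range(N_LOANS))
        if defaults > 0.20 * N_LOANS:
            bad += 1
    return bad / N_TRIALS

print(f"independent defaults: {tail_risk(False):.2f}")  # catastrophic loss ~never
print(f"correlated defaults:  {tail_risk(True):.2f}")   # catastrophic loss ~1 in 10
```

Both pools look identical in terms of average defaults, yet only the correlated pool can lose enough loans at once to wipe out supposedly safe senior claims – the pattern that blindsided holders of highly rated CDOs.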

The Financial Disaster Reveals the Four Types of Data Failure

Now let's ask a crucially important question: Why were the ratings agencies creating the faulty ratings that led to the global financial crisis, even though these agencies' key purpose is to accurately assess risk?

The answer to that question illuminates the four main types of data failure that threaten every part of our global society that depends on data and, as I'll show, that includes virtually everything that government, business, and consumers do. Consider the four failures that led to the overoptimistic bond ratings that brought on the crisis:

A failure of data integrity.5

Data integrity means that data are based on solid information interpreted in statistically valid ways. But the ratings agencies were not actually assessing risks of default; they were instead looking at broad, general characteristics of loan pools, such as the median credit scores of borrowers. Unfortunately, such measures could conceal vast numbers of risky mortgages. As one former Goldman Sachs bond trader explained to the author Michael Lewis, “The ratings agencies didn't really have their own CDO model.”6

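How a summary measure like the median can conceal risk is easy to see with a small sketch. The credit scores and the subprime cutoff below are hypothetical, chosen only to illustrate the point: two pools can share the same median score while one hides a large tail of risky borrowers.

```python
# Hypothetical loan pools (scores invented for illustration, not from the book).
pool_a = [700, 705, 710, 715, 720, 725, 730]  # uniformly solid borrowers
pool_b = [520, 540, 560, 715, 760, 770, 780]  # same median, risky tail

def median(scores):
    s = sorted(scores)
    return s[len(s) // 2]  # odd-length pools, for simplicity

def share_subprime(scores, cutoff=620):
    """Fraction of borrowers below a (hypothetical) subprime cutoff."""
    return sum(1 for s in scores if s < cutoff) / len(scores)

print(median(pool_a), f"{share_subprime(pool_a):.2f}")  # 715 0.00
print(median(pool_b), f"{share_subprime(pool_b):.2f}")  # 715 0.43
```

Both pools report a median credit score of 715, yet more than 40% of the second pool sits below the cutoff – exactly the kind of risk that a rating based on pool‐level summaries could not see.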
A failure of data transparency.

Anyone using data to make decisions must be able to understand where the data came from and how they were analyzed. But for the ratings agencies to maintain their proprietary advantage and keep the creators of securities from gaming their ratings, they needed to keep their methods secret. As a result, there was no way for financial institutions to question or verify a given security's AA rating. The mortgage brokers originated lots of loans with very little transparency as well.

A failure of data neutrality.

Data neutrality demands analysis based only on the actual data, not the prejudices of those analyzing it. But regrettably, people collecting and analyzing data may, consciously or unconsciously, seek data and analyses that confirm their beliefs or prior knowledge. Then everyone working with data is subject to confirmation bias: that is, finding what they hope to be true. The ratings agencies were predisposed to see the CDOs and CDSs as good financial instruments. The conventional wisdom was that, on average, housing prices would continue to rise. Given competition among the agencies and the huge market power of the large Wall Street investment banks, the ratings agencies were essentially expected by the banks to produce desirable ratings. Given the complexity of the CDOs and CDSs, it was difficult for a ratings agency to model the risks effectively. And the Wall Street banks invested significant effort to shape the data to meet those expectations, skewing the resulting ratings to the bond traders' advantage and thus hiding risk from investors.

A failure of data literacy.

None of these data failures would matter if the ultimate consumers of the data were aware of and accounted for the flaws. But failures of interpretation pervaded the financial crisis. The homeowners ignored the risks that their adjustable rates might rise and that they couldn't refinance if home prices fell. The large financial institutions took the investment‐grade ratings on CDOs and CDSs as gospel, failing to spot the huge, nationwide risk from a bursting home price bubble. These investors failed to notice that they were cross‐insuring each other's investments, which added systemic risk to these instruments. Data literacy demands a skeptical attitude toward data and the skills and willingness to assess the uncertainty of the data on which you are basing your most important decisions. Borrowers and investors failed both of these tests.

What Does It Mean to Be Data Savvy and Why Is It Important?

Whether people realize it or not, data underlie every decision they make: in companies, in government, and as consumers.

Your doctor uses data to determine which treatments to recommend. Your boss uses data to determine where you stand among other employees and whether you deserve a raise. Your town uses data to determine how much to tax your house and how much to invest in schools for your kids. Data are at the center of government decisions about how to set interest rates, how to invest tax revenues, how to investigate crime, how to price unemployment insurance, and where to build new roads and highways. Political campaigns intensively assess data to determine what positions to take, what speeches to give, what actions to publicize, and what messages to send. Companies use data to determine what products to build, what features to add, where to invest resources, how much to pay staff, what products are popular, how consumer tastes are shifting, what marketing campaigns are working, and how aggressively and where to compete. Data are quite simply the backdrop and driver for every decision, everywhere.

Ideally, we would all adopt a fact‐forward attitude about data. That is, we'd attempt to make sure the data on which we based decisions are of the highest possible quality and relevance. But to become fact forward, every smart decision‐maker – that is, all of us – must be data savvy.

But what does it mean to be data savvy?

A data‐savvy decision‐maker asks questions about the context of data before acting on them. With what level of integrity were the data created? How transparently were they assembled, analyzed, and shared? Were the creators, consumers, and disseminators of the data acting in a neutral and bias‐free way? Have we applied data literacy in our interpretation of the data?

To be data savvy is to understand that all data are created in context, and to interpret, consume, and share data with that context in mind. This applies whether you are a data creator, a data disseminator, or a data consumer. In each of these roles, we each have essential responsibilities to ensure that data are used to inform and not to mislead.

Because data underlie so many important personal, corporate, and governmental decisions, it's essential for all decision‐makers to be as data savvy as possible. Unless we can recognize data challenges – and unless we as a society can create an environment that maximizes data quality and data literacy – we are going to make questionable decisions based on poorly informed ideas about the world. That's harmful and potentially catastrophic.

Why does that matter? Let's look at a few additional examples of how data failures have led people astray.

Google Flu Trends Made Remarkable Predictions, Until It Didn't

Are more data necessarily better for making predictions? Consider the case of Google Flu Trends.

In 2008, Google researchers attempted to take advantage of data derived from human behavior. Prior to these efforts, the best way to measure the spread of flu variants was based on data from the US Centers for Disease Control and Prevention (CDC), compiled regionally from reports by doctors who tested patients presenting with flu symptoms. But the Google researchers recognized that the first thing people do when they think they may have the flu is not to go to the doctor, but to do a web search on flu symptoms. Track those searches, they reasoned, and you'll be able to model flu outbreaks well ahead of the CDC reporting.

Sure enough, the Google Flu Trends tracker was able to identify flu outbreaks several days before the CDC reporting. The tracker was also able to make predictions that were eerily close to where and when influenza ended up spreading.7

But as any sports gambling operator will tell you, making a few accurate predictions doesn't mean you've beaten the system.

The Google Flu Trends algorithm needed adjustments in 2009, as it significantly underestimated influenza infections, possibly because the model was poorly matched to the virulence of the newly emerged H1N1 (swine flu) strain. But more problematic was the model's massive over‐prediction – by more than 100% – of the peak of the flu season in 2013 (see Figure 1‐1).

Another challenge was created by improvements in Google's search tool, which now suggested related terms for browser users to search. While this improved Google's generic search function, it distorted the Google Flu Trends model, which was based on manual, unprompted searches. And, at least originally, according to the researchers, “Google's efforts … were remarkably opaque in terms of methods and data—making it dangerous to rely on Google Flu Trends for any decision making.”8

Google Flu Trends is no longer available, perhaps due to the compounding of these errors. The model it created had data integrity problems due to changes in search features. While it was running, it lacked the transparency that would allow researchers to analyze flaws in its methods.9 The excitement around the successful early predictions may have undermined its researchers' neutrality. And perhaps more data literacy on the part of consumers and journalists would have put a check on the unbridled enthusiasm surrounding the Google flu model described in news stories in CNN, the New York Times,10 the Wall Street Journal, and other sources.

Figure 1‐1: The Google Flu model's initial success in predicting actual flu cases was overshadowed by its substantial overprediction in 2013.

Source: Adapted from original graphic in “When Google got flu wrong” by Declan Butler, February 13, 2013, in the journal Nature. Original data sources: Google Flu Trends; CDC; Flu Near You. Adapted by permission.

Donald Trump's Victory in the 2016 US Presidential Election Flummoxed Pollsters and Pundits

Leading up to the 2016 presidential election, the polls were clear. Hillary Rodham Clinton was likely to coast to an easy victory. Donald Trump, who'd never held a political office before, had very little chance to win.

As with Google Flu Trends, the challenge here was certainly not a lack of data. Polling organizations conducted many hundreds of polls in the months leading up to the election, both in the nation at large and in battleground states like Pennsylvania, Michigan, and Wisconsin. On the eve of the election, a RealClearPolitics average of 10 solid national polls projected a popular vote lead of about 3% for Clinton over Trump, a prediction that turned out to be within a single percentage point of the final result.11

Figure 1‐2: Candidates' estimated chances of winning in the 2016 US presidential election changed rapidly at several key points in time, including very close to election day.

Source: FiveThirtyEight. Used by permission.

Based on these polls, election watchers built models that combined national and state polls, weighted them based on poll quality and recency, and projected a percentage likelihood of who would win the election. The New York Times gave Clinton an 85% chance of winning.12 The model at FiveThirtyEight, which had given Clinton an 87% chance of winning on October 19, shifted toward Trump as the election drew near and new polls came in. On election eve, it was predicting Clinton's chances of victory at 71% (see Figure 1‐2).13

Clinton's widely expected victory, of course, never happened. Were the data at fault?

Several factors contributed to the failed predictions. There were polling errors in key battleground states including Pennsylvania, Wisconsin, and Michigan, all of which the polls predicted would narrowly vote for Clinton. But it's important to understand the meaning of the term “error” here. Pre‐election polls, because they reach only a small subset of voters, will inevitably generate results that differ from what actually happens at the ballot box. The difference between the poll results and the election results is called “polling error.”
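To get a feel for the size of ordinary, non‐systematic polling error, the standard margin‐of‐error formula is useful. The book doesn't present this calculation; the sketch below is the conventional approximation for a simple random sample, with invented numbers for illustration:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a proportion from a simple random sample.

    p: observed proportion (e.g., 0.48 for 48% support)
    n: number of respondents
    z: z-score for the confidence level (1.96 corresponds to roughly 95%)
    """
    return z * math.sqrt(p * (1 - p) / n)

# A hypothetical poll of 1,000 respondents showing 48% support
moe = margin_of_error(0.48, 1000)
print(f"±{moe * 100:.1f} percentage points")  # → ±3.1 percentage points
```

A single well‐run poll of 1,000 voters thus carries roughly a three‐point uncertainty band all on its own. Averaging many polls shrinks this random component, which is why the systematic error that struck in 2016 was so damaging: errors skewed in the same direction don't average away.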

What happened in 2016 that upended the prediction was systematic polling error – that is, a set of errors all skewed in the same direction. One cause was that low‐education voters were underrepresented in many of the state‐level polling samples, although they made up a significant share of Trump voters, especially in the three key post‐industrial Midwest states. The pollsters did not weight their results to correct for this deficiency. This oversight was largely the result of sticking with the status quo approach, even in the face of an election that was proving to be quite different.

In recent elections prior to 2016, many pollsters didn't include education in their weighting, and they still generated accurate results. In those elections, voters with the lowest levels of formal education and voters with the highest levels of formal education both tended to vote for the Democratic candidate, especially in the Rust Belt states. When those pre‐2016 polls over‐represented high‐education voters and under‐represented low‐education voters, it didn't distort the vote‐choice results, because both groups were supporting the Democratic candidate at similar levels. Perhaps leaning too heavily on an approach that had always worked in the past, many pollsters didn't realize the importance of incorporating education into their survey weighting to correct for changing voting patterns. And because of the way US presidential elections work, even though the broad national polls were accurate, wrong predictions in three close states produced an incorrect prediction of the winner of the race.
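The effect of leaving education out of the weighting can be sketched with a toy post‐stratification example. All of the numbers below are invented for illustration; they do not come from any actual 2016 poll:

```python
# Hypothetical share of each education group in the electorate
# (the kind of benchmark pollsters take from census data)
population_share = {"no_college": 0.60, "college": 0.40}

# Shares the poll actually reached: college graduates over-represented
sample_share = {"no_college": 0.45, "college": 0.55}

# Candidate A's support within each group of respondents
support = {"no_college": 0.40, "college": 0.60}

# Unweighted estimate: average support over the sample as collected
unweighted = sum(sample_share[g] * support[g] for g in support)

# Weighted estimate: re-balance each group to its population share
weighted = sum(population_share[g] * support[g] for g in support)

print(f"unweighted: {unweighted:.0%}, weighted: {weighted:.0%}")
# → unweighted: 51%, weighted: 48%
```

In this toy setup, skipping the education weight overstates Candidate A's support by three points – exactly the kind of same‐direction skew that, repeated across state polls, produces systematic polling error.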

Polls can capture only a snapshot at the time the data are being collected. Few polls were fielded in the final days of the campaign, so they could not capture the effect of events that happened too close to Election Day. Later estimates from a committee of pollsters suggest that 13% of voters in key swing states didn't make up their minds until the final week before the election, and these voters overwhelmingly supported Trump at the ballot box.14 One late‐breaking event was the release of a statement by FBI Director James Comey on the eve of the election regarding the investigation into Clinton's handling of emails. By the time Comey released the statement, all pre‐election polling had already been completed.

I believe a failure of data neutrality may have had something to do with the predictions as well. Most experienced politics‐watchers were used to predicting elections with politicians who behaved in predictable ways, like John McCain, Mitt Romney, and Barack Obama. The back and forth between conventional Democrats and Republicans was their favored turf; predicting the impact of a completely unconventional politician like Donald Trump was outside of their experience. This may have led to mainstream media coverage that predicted a continuation of politics as usual, which is likely what would have happened had Hillary Clinton won the election.

A final challenge here has to do with data literacy, that is, with how people interpreted the predictions of election‐tracking pundits. In the mind of the average reader, an 85% chance of a candidate winning immediately registers as, “He or she will win.” We make decisions like this in our lives all the time. If the weather forecaster predicts a 15% chance of rain, you probably leave the raincoat at home. On the other hand, if you were about to step into an intersection to cross the street and knew you had a 15% chance of getting hit by a car, you'd probably stay on the curb. Homeowners buy home insurance, not because they think there's a good chance their house will burn down, but because in the unlikely event that it does, they don't want to be completely ruined. Everyone knows a 15% chance is not zero, but they behave in many cases as if it is.
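One way to internalize what a 71% (or 85%) forecast actually claims is to simulate it. This quick sketch is mine, not anything from FiveThirtyEight's model; it simply counts how often the underdog wins when the favorite's chance of winning really is 71%:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Simulate 10,000 elections in which the favorite has a 71% chance of winning
trials = 10_000
upsets = sum(random.random() >= 0.71 for _ in range(trials))

print(f"Underdog wins in {upsets / trials:.1%} of simulated elections")
```

The underdog prevails in nearly three out of ten simulated elections. That is not a model failure; it is what the forecast meant all along.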

The Artificial Intelligence Tool ChatGPT Swallows Huge Amounts of Data – And Hallucinates

In the 2020s we started to become aware, as a society, of one of the most far‐reaching, and potentially disruptive, technologies created by masses of data: artificial intelligence (AI). ChatGPT, the generative AI chatbot tool created by OpenAI, is trained on, basically, the entire readable portion of the World Wide Web, trillions of words of content. If quantity of data predicted quality of output, you would expect ChatGPT to be nearly perfect, since it “knows” nearly everything readable online.

ChatGPT is indeed a powerful tool. It's excellent at summarizing the contents of a document. It's good at producing first drafts of text based on online research. Because it communicates in plain English, it's easy to use. And it's particularly good at generative tasks, as its name suggests, using the vast amount of information it has assimilated to produce new content. So, for example, ChatGPT can virtually instantaneously write a convincing speech that a politician might deliver extolling the virtues of green technology (or, alternatively, fossil fuels). It can even write that speech in the style of an ancient Roman orator or as a country music ballad! And unlike ordinary web searches on sites like Google, you can have a conversation with it, adjusting your prompts to explain the shortcomings of what it came up with and how to generate better output on the second or third try. In this sense, it feels more like working with a research assistant than interacting with a computer.

This very quality, though, contributes to one of the main problems with generative AI tools: The text they produce is so convincing that it seems accurate, even when it's not.

ChatGPT routinely gets facts wrong, a phenomenon often described as “hallucinations.” For example, when Jon Christian, editor of Futurism, asked ChatGPT to write a short biography for him, this is what it generated15:

Jon Christian is a journalist who covers technology and science. He has written for a number of publications, including The Outline, Motherboard, and Wired, among others. His work often focuses on topics such as artificial intelligence, robotics, and the impact of technology on society.

Christian has been recognized for his writing, receiving awards such as the National Magazine Award and the Society of Professional Journalists’ Sigma Delta Chi award. He has also been a finalist for the Livingston Award for Young Journalists.

In addition to his work as a journalist, Christian has also taught courses on science writing at Columbia University’s Graduate School of Journalism.

Most of that, though, is wrong. Christian has not won any of those awards, and he doesn't teach at Columbia.

This is not an isolated case. In Georgia, a radio host named Mark Walters sued OpenAI for spreading false information about him, because its output implied that, as chief financial officer (CFO) and treasurer of an organization called the Second Amendment Foundation (SAF), he “misappropriated funds for personal expenses without authorization or reimbursement, manipulated financial records and bank statements to conceal his activities, and failed to provide accurate and timely financial reports and disclosures to the SAF's leadership.”16 In fact, Walters was never accused of any of these misdeeds and never even held the CFO or treasurer positions as ChatGPT claims.

If you spend some time interacting with ChatGPT or similar AI chatbots on a subject on which you are knowledgeable (which I encourage you to do), you will readily get to a point where the AI is providing incorrect answers, even though it appears certain that what it is telling you is accurate.

Why is ChatGPT getting so much wrong?

ChatGPT is based on a deep learning neural network model that ingests large amounts of data and then uses patterns from that data to generate plausible text. It's very good at generating coherent collections of sentences, because it has been trained on trillions of words of such text. But it has no actual intelligence. Sometimes patterns lead it to generate things that aren't true. Apparently, people like Jon Christian tend to teach at journalism schools. And apparently, if a model mindlessly consumes the text of enough lawsuits, it will spit out accusations of embezzlement against people who were never accused of any wrongdoing.

The challenge of the hallucinations goes further. Unless a person using these systems notices the problem, they may depend on the spurious data. Such data can then get published online and further exacerbate the spread of false information.

Generative AI systems like ChatGPT, at least in the state in which they exist as I write this, are practically the poster child for data failures. They are built on masses of data without any discrimination about the relative accuracy of the different data sources.17 Because neural networks, unlike traditional computer code, operate on a gestalt of all of the information they ingest, there's no transparency into how the AI reaches any given conclusion – it's not easy to interrogate the model about the sources of its conclusions, or the uncertainty that surrounds them (although developers are trying to improve this). As Alex Reisner wrote in the Atlantic, “Few people outside of companies such as Meta and OpenAI know the full extent of the texts these programs have been trained on.”18

AI also inherently suffers from a neutrality problem, because it automatically inherits all the prejudices of the data it's trained on. And the very plausibility of ChatGPT's highly articulate and readable output undermines data literacy: Looking at a piece of AI‐generated text, there's no way to know whether it's accurate or completely invented. And of course, there are always errors due to the limits of the information that the model has been trained on – we have no way of knowing what might be left out.

Because AI will have such an important role in data analysis and our understanding of truth and facts in the future, I've dedicated a whole chapter, Chapter 10, to it.

Understanding Roles in Data Proliferation

I've stated that data underlie most decisions people make. But where do the data come from, how do they spread, and how do they get to the people who need to make the decisions?

Any understanding of data and its value and failures must acknowledge the ecosystem through which masses of data flow. Just as products in the real‐world economy are built through supply chains – raw materials to component parts to finished products to retail outlets – the data ecosystem has its own supply chain. There are three main roles in the ecosystem:

Data producer.

Many organizations gather raw data. For example, Google collected data for Google Flu Trends, and polling organizations collected data from samples of voters about the 2016 election. In fact, we are all producing data every day with every purchase we make, the clicks we register with our online behavior, and information gathered and uploaded from our mobile devices. But raw data alone aren't particularly useful. Someone has to analyze the data to turn them into a form that's helpful in making decisions. The general term for individuals who use technical skills and business knowledge to analyze and derive insights from data is “data scientist.” This includes, for example, data analysts in companies who gather up data from sales and marketing and use the data to suggest ways to improve products or their positioning. In government, a slew of statisticians, data scientists, and subject‐matter experts analyze raw data about, say, unemployment or prices, and generate analysis of labor trends and inflation.

Data disseminator.

For data to have influence, someone has to make others aware of the analysis. News media fulfill this function – for example, when the Wall Street Journal publishes data from a poll, or a popular publication makes readers aware of the results of a medical study. Decades ago, data disseminators were generally in some position of authority in media or data analysis firms. But these days, everyone on social media has the potential to spread information to others, outside of normal media channels. If you see an article about voter attitudes about gun regulation or nuclear energy and then share it with others on Facebook or Instagram, you're a data disseminator. Given the potential that numbers, charts, and videos have to “go viral,” data sharing is now an essential part of the data ecosystem.

Data consumer.

This description fits anyone who uses data – in other words, all of us. If you're checking reviews of car models at Consumer Reports, you're a data consumer. You're also a data consumer if you review the latest trends and articles on virus variants to decide whether to get an updated vaccination. Or you might review inflation data to help decide which politicians to vote for. There are, of course, data consumers throughout all decision‐making organizations. A CEO who decides whether to green‐light a new product and how to price it is a data consumer, as is a Federal Reserve Bank official reviewing economic data to decide whether to raise or lower interest rates.

The supply chains for real‐world products are relatively neat, moving in one direction from producer to consumer. But that's not how the data ecosystem works. In the world of data any of us can, at any moment, play any role: We all produce data, we all analyze data, and we all tend to disseminate data as well. The average person may not be doing this with the formality and sophistication of a data professional, and that is a big part of the problem.

Looking back at the four data challenges I described, you can now see that all these roles may run afoul of any of the data challenges. For example, a data producer who uses samples that are too small can run afoul of data integrity problems, and if those samples are biased in directions that support their preconceived ideas, they can suffer data neutrality issues. A news organization that only publishes news supporting what its readers or advertisers want to hear and cherry‐picks studies to support its viewpoint is a data disseminator with a neutrality problem. As data spread throughout the ecosystem, people often share them without access to the underlying methodology, creating data transparency issues. And of course, all of us who read and share data, perhaps without fully understanding where they come from, are at risk of errors based on limits of our data literacy.

I'll describe more about data problems throughout the ecosystem in Chapter 2.

The Spread of Faulty Data Is Destructive to Society