Provides an important framework for data analysts to assess the quality of data and its potential to yield meaningful insights through analysis
Analytics and statistical analysis have become pervasive topics, mainly due to the growing availability of data and analytic tools. Technology, however, fails to deliver insights with added value if the quality of the information it generates is not assured. Information Quality (InfoQ) is a tool developed by the authors to assess the potential of a dataset to achieve a goal of interest, using data analysis. Whether the information quality of a dataset is sufficient is of practical importance at many stages of the data analytics journey, from the pre-data collection stage to the post-data collection and post-analysis stages. It is also critical to various stakeholders: data collection agencies, analysts, data scientists, and management.
This book will be beneficial for researchers in academia and in industry; for analysts, consultants, and agencies that collect and analyse data; and for undergraduate and postgraduate courses involving data analysis.
Page count: 637
Publication year: 2016
Cover
Title Page
Foreword
About the authors
Preface
References
Quotes about the book
About the companion website
Part I: THE INFORMATION QUALITY FRAMEWORK
1 Introduction to information quality
1.1 Introduction
1.2 Components of InfoQ
1.3 Definition of information quality
1.4 Examples from online auction studies
1.5 InfoQ and study quality
1.6 Summary
References
2 Quality of goal, data quality, and analysis quality
2.1 Introduction
2.2 Data quality
2.3 Analysis quality
2.4 Quality of utility
2.5 Summary
References
3 Dimensions of information quality and InfoQ assessment
3.1 Introduction
3.2 The eight dimensions of InfoQ
3.3 Assessing InfoQ
3.4 Example: InfoQ assessment of online auction experimental data
3.5 Summary
References
4 InfoQ at the study design stage
4.1 Introduction
4.2 Primary versus secondary data and experiments versus observational data
4.3 Statistical design of experiments
4.4 Clinical trials and experiments with human subjects
4.5 Design of observational studies: Survey sampling
4.6 Computer experiments (simulations)
4.7 Multiobjective studies
4.8 Summary
References
5 InfoQ at the postdata collection stage
5.1 Introduction
5.2 Postdata collection data
5.3 Data cleaning and preprocessing
5.4 Reweighting and bias adjustment
5.5 Meta‐analysis
5.6 Retrospective experimental design analysis
5.7 Models that account for data “loss”: Censoring and truncation
5.8 Summary
References
Part II: APPLICATIONS OF InfoQ
6 Education
6.1 Introduction
6.2 Test scores in schools
6.3 Value‐added models for educational assessment
6.4 Assessing understanding of concepts
6.5 Summary
Appendix: MERLO implementation for an introduction to statistics course
References
7 Customer surveys
7.1 Introduction
7.2 Design of customer surveys
7.3 InfoQ components
7.4 Models for customer survey data analysis
7.5 InfoQ evaluation
7.6 Summary
Appendix: A posteriori InfoQ improvement for survey nonresponse selection bias
References
8 Healthcare
8.1 Introduction
8.2 Institute of Medicine reports
8.3 Sant’Anna di Pisa report on the Tuscany healthcare system
8.4 The haemodialysis case study
8.5 The Geriatric Medical Center case study
8.6 Report of cancer incidence cluster
8.7 Summary
References
9 Risk management
9.1 Introduction
9.2 Financial engineering, risk management, and Taleb’s quadrant
9.3 Risk management of OSS
9.4 Risk management of a telecommunication system supplier
9.5 Risk management in enterprise system implementation
9.6 Summary
References
10 Official statistics
10.1 Introduction
10.2 Information quality and official statistics
10.3 Quality standards for official statistics
10.4 Standards for customer surveys
10.5 Integrating official statistics with administrative data for enhanced InfoQ
10.6 Summary
References
Part III: IMPLEMENTING InfoQ
11 InfoQ and reproducible research
11.1 Introduction
11.2 Definitions of reproducibility, repeatability, and replicability
11.3 Reproducibility and repeatability in GR&R
11.4 Reproducibility and repeatability in animal behavior studies
11.5 Replicability in genome‐wide association studies
11.6 Reproducibility, repeatability, and replicability: the InfoQ lens
11.7 Summary
Appendix: Gauge repeatability and reproducibility study design and analysis
References
12 InfoQ in review processes of scientific publications
12.1 Introduction
12.2 Current guidelines in applied journals
12.3 InfoQ guidelines for reviewers
12.4 Summary
References
13 Integrating InfoQ into data science analytics programs, research methods courses, and more
13.1 Introduction
13.2 Experience from InfoQ integrations in existing courses
13.3 InfoQ as an integrating theme in analytics programs
13.4 Designing a new analytics course (or redesigning an existing course)
13.5 A one‐day InfoQ workshop
13.6 Summary
Acknowledgements
References
14 InfoQ support with R
14.1 Introduction
14.2 Examples of information quality with R
14.3 Components and dimensions of InfoQ and R
14.4 Summary
References
15 InfoQ support with Minitab
15.1 Introduction
15.2 Components and dimensions of InfoQ and Minitab
15.3 Examples of InfoQ with Minitab
15.4 Summary
References
16 InfoQ support with JMP
16.1 Introduction
16.2 Example 1: Controlling a film deposition process
16.3 Example 2: Predicting water quality in the Savannah River Basin
16.4 A JMP application to score the InfoQ dimensions
16.5 JMP capabilities and InfoQ
16.6 Summary
References
Index
End User License Agreement
Chapter 04
Table 4.1 Statistical strategies for increasing InfoQ given a priori causes at the design stage.
Chapter 05
Table 5.1 Statistical strategies for increasing InfoQ given a posteriori causes at the postdata collection stage.
Chapter 06
Table 6.1 InfoQ assessment for MAP report.
Table 6.2 InfoQ assessment for student’s lifelong earning study.
Table 6.3 InfoQ assessment for VAM (based on ASA statement).
Table 6.4 MERLO recognition scores for ten concepts taught in an Italian middle school.
Table 6.5 Grouping of MERLO recognition scores using the Tukey method and 95% confidence.
Table 6.6 InfoQ assessment for MERLO.
Table 6.7 Scoring of InfoQ dimensions of examples from education.
Chapter 07
Table 7.1 Main deliverables in an Internet‐based ACSS project.
Table 7.2 Service level agreements for Internet‐based customer satisfaction surveys.
Table 7.3 A typical ACSS activity plan.
Table 7.4 InfoQ score of various models used in the analysis of customer surveys.
Table A Postdata collection correction for nonresponse bias in a customer satisfaction survey using adjusted residuals.
Chapter 08
Table 8.1 InfoQ components for IOM‐related studies.
Table 8.2 InfoQ dimensions and ratings for Stelfox et al. (2006) data and for the IOM reports.
Table 8.3 InfoQ components for Sant’Anna di Pisa study.
Table 8.4 InfoQ dimensions and ratings on 5‐point scale for Sant’Anna di Pisa study.
Table 8.5 InfoQ components for the haemodialysis decision support system.
Table 8.6 Marginal posterior distributions for the j‐th patient’s risk profile (True = risk has materialized).
Table 8.7 Posterior distributions of outcome measures for two patients.
Table 8.8 InfoQ dimensions and ratings on 5‐point scale for haemodialysis study.
Table 8.9 InfoQ components for the two NataGer projects data.
Table 8.10 InfoQ dimensions and ratings on 5‐point scale for the two NataGer projects.
Table 8.11 InfoQ components of cancer incidence report.
Table 8.12 InfoQ dimensions and ratings of cancer incidence study by Rottenberg et al. (2013).
Table 8.13 Scoring of InfoQ dimensions for each of the four healthcare case studies.
Chapter 09
Table 9.1 Log of technicians’ on‐site interventions (techdb).
Table 9.2 Balance sheet indicators for a given customer of the VNO (balance).
Table 9.3 Classification of 264 CEAO chains by aspect and division (output from MINITAB version 12.1).
Table 9.4 Scoring of InfoQ dimensions of the five risk management case studies.
Chapter 10
Table 10.1 Relationship between NCSES standards and InfoQ dimensions. Shaded cells indicate an existing relationship.
Table 10.2 Relationship between ISO 10004 guidelines and InfoQ dimensions. Shaded cells indicate an existing relationship.
Table 10.3 Scores for InfoQ dimensions for Stella education case study.
Table 10.4 Scores for InfoQ dimensions for the NHTSA safety case study.
Chapter 11
Table 11.1 Terminology in GR&R studies.
Table 11.2 Terminology in animal experiments.
Table 11.3 Terminology in genome‐wide association studies.
Table A ANOVA table of GR&R experiments.
Chapter 12
Table 12.1 List of journals published by the American Statistical Association (ASA). Referee guidelines web pages were not found for any of these journals.
Table 12.2 Partial list of journals published by the American Society for Quality (ASQ). Referee guidelines web pages were not found for any of these journals; the same lack of guidelines applies to all other ASQ journals (http://asq.org/pub/).
Table 12.3 List of journals published by the Institute of Mathematical Statistics (IMS) and URLs for referee guidelines (accessed July 7, 2014).
Table 12.4 List of journals published by the Royal Statistical Society (RSS) and URLs for referee guidelines (accessed July 7, 2014).
Table 12.5 List of journals in machine learning and URLs for referee guidelines (accessed July 7, 2014).
Table 12.6 Reviewing guidelines for major data mining conferences (accessed July 7, 2014).
Table 12.7 List of top scientific journals and URLs for referee guidelines (accessed July 7, 2014).
Table 12.8 Questionnaire for reviewers of applied research submission.
Chapter 15
Table 15.1 InfoQ assessment for Example 1.
Table 15.2 Results of the factorial experimental design of the steering wheels.
Table 15.3 InfoQ assessment for Example 2.
Chapter 16
Table 16.1 Synopsis of Example 1.
Table 16.2 InfoQ assessment for Example 1.
Table 16.3 Synopsis of Example 2.
Table 16.4 Ys for the PLS model.
Table 16.5 InfoQ assessment for Example 2.
Chapter 01
Figure 1.1 The four InfoQ components.
Figure 1.2 Price curves for the last day of four seven‐day auctions (x‐axis denotes day of auction). Current auction price (line with circles), functional price curve (smooth line) and forecasted price curve (broken line).
Chapter 03
Figure 3.1 Timeline of study, from data collection to study deployment.
Chapter 04
Figure 4.1 JMP screenshot of a 2⁷⁻³ fractional factorial experiment with the piston simulator described in Kenett and Zacks (2014).
Figure 4.2 JMP screenshot of a definitive screening design experiment with the piston simulator described in Kenett and Zacks (2014).
Figure 4.3 JMP screenshot of fraction of design space plots and design diagnostics of fractional (left) and definitive screening designs (right).
Chapter 05
Figure 5.1 Illustration of right, left, and interval censoring. Each line denotes the lifetime of the observation.
Chapter 06
Figure 6.1 The Missouri Assessment Program test report for fictional student Sara Armstrong.
Figure 6.2 SAT Critical Reading skills.
Figure 6.3 Earning per teacher value‐added score.
Figure 6.4 Test scores by school by high value‐added teacher score.
Figure 6.5 Template for constructing an item family in MERLO.
Figure 6.6 Example of MERLO item (mathematics/functions).
Figure 6.7 Box plots of MERLO recognition scores in ten mathematical topics taught in an Italian middle school. Asterisks represent outliers beyond three standard deviations of the mean.
Figure 6.8 Confidence intervals for difference in MERLO recognition scores between topics.
Chapter 07
Figure 7.1 SERVQUAL gap model.
Figure 7.2 Bayesian network of responses to satisfaction questions from various topics, overall satisfaction, repurchasing intentions, recommendation level, and country of respondent.
Chapter 08
Figure 8.1 Bayesian network of patient haemodialysis treatment.
Figure 8.2 Visual board display designed to help reduce patients’ falls.
Figure 8.3 Prioritization tool for potential causes for bedsores occurrence.
Chapter 09
Figure 9.1 Bayesian network linking risk drivers with the activeness of risk indicators.
Figure 9.2 Social network based on email communication between OSS contributors and committers.
Figure 9.3 Simplex representation of association rules of event categories in telecom case study.
Figure 9.4 A sample CEAO chain.
Figure 9.5 Correspondence analysis of CEAO chains in five divisions by aspect. K&S = knowledge and skills; Mgmt = management; P = process; S = structure; S&G = strategy and goals; SD = social dynamics.
Chapter 10
Figure 10.1 BN for the Stella dataset.
Figure 10.2 BN is conditioned on a value of lastsal which is similar to the salary value of the Graduates dataset.
Figure 10.3 BN is conditioned on a low value of begsal and emp and for a high value of yPhD.
Figure 10.4 BN for the Graduates dataset.
Figure 10.5 BN is conditioned on a high value of msalary.
Figure 10.6 BN is conditioned on a high value of mdipl and nemp and for a low value of ystjob.
Figure 10.7 BN for the Vehicle Safety dataset.
Figure 10.8 BN for the Crash Test dataset.
Figure 10.9 BN for the Crash Test dataset is conditioned on a high value of Wt and Year.
Figure 10.10 BN for the Crash Test dataset is conditioned on a low value of Wt and Year.
Chapter 13
Figure 13.1 Google Trends data on “data science course.”
Figure 13.2 InfoQ evaluation form for an empirical study on air quality. The complete information and additional studies for evaluation are available at goo.gl/erNPF.
Chapter 14
Figure 14.1 An example of RStudio window.
Figure 14.2 An example of R Commander window.
Figure 14.3 Wordclouds for the two datasets.
Figure 14.4 Comparison (Expo 2015 = dark, Expo 2020 = light) and commonality clouds.
Figure 14.5 ExpoBarometro results.
Figure 14.6 SensoMineR menu in Excel.
Figure 14.7 Assessment of the performance of the panel with the panelperf() and coltable() functions.
Figure 14.8 Representation of the perfumes and the sensory attributes on the first two dimensions resulting from PCA() on adjusted means of ANOVA models.
Figure 14.9 Representation of the perfumes on the first two dimensions resulting from PCA() in which each product is associated with a confidence ellipse.
Figure 14.10 Representation of the perfumes and the sensory attributes on the first two dimensions resulting from MFA() of both experts and consumers data.
Figure 14.11 Visualization of the hedonic scores given by the panelists.
Figure 14.12 Nights spent in tourist accommodation establishments by NUTS level 2 region, 2013 (million nights spent by residents and nonresidents).
Figure 14.13 Bayesian network.
Figure 14.14 Distribution of the overall satisfaction for each level of each variable.
Chapter 15
Figure 15.1 Minitab user interface, with session and worksheet windows.
Figure 15.2 Some menu options for basic statistical analysis and quality tools.
Figure 15.3 A screenshot of Minitab help.
Figure 15.4 A histogram (left) and its corresponding stem‐and‐leaf graph (right), of heartbeats per minute of students in a class.
Figure 15.5 An example of a DDE connection between Excel and Minitab.
Figure 15.6 An example of the number of defects during a month, shown in a time series plot.
Figure 15.7 An example of a Pareto chart with all the data together (top) and stratifying by month (bottom).
Figure 15.8 A screenshot showing different types of control charts in Minitab.
Figure 15.9 A screenshot with different modeling possibilities in Minitab.
Figure 15.10 Output from the power and sample size procedure for the comparison of means test.
Figure 15.11 The menu option for a Gage R&R study to validate the measurement system in Minitab.
Figure 15.12 Representation of results (using histograms) in the case study of the bakery.
Figure 15.13 Schematic representation of the data collection procedure for the glass bottles case study.
Figure 15.14 Representation of results (using a multivari chart) in the case study of the glass bottles.
Figure 15.15 Matrix plot of all variables in the power plant case study.
Figure 15.16 Scatterplot of yield versus power (with outlier) in the power plant case study.
Figure 15.17 Scatterplot of yield versus power (without outlier) in the power plant case study.
Figure 15.18 Dotplot of factor form in the power plant case study.
Figure 15.19 Dotplot of the logarithm of factor form in the power plant case study.
Figure 15.20 Interaction plot for pressure and temperature in the steering wheels case study.
Figure 15.21 Normal probability plot of the effects in the steering wheels case study.
Figure 15.22 Interaction plot for ratio and weather in the steering wheels case study.
Chapter 16
Figure 16.1 Statistical discovery.
Figure 16.2 The LPCVD data (partial view).
Figure 16.3 Pattern of missing thickness data.
Figure 16.4 Map of all the thickness values.
Figure 16.5 XBar‐R chart of film thickness.
Figure 16.6 Three‐way chart of film thickness.
Figure 16.7 The water quality data.
Figure 16.8 Field stations in the Savannah River Basin.
Figure 16.9 Bivariate correlation of Ys.
Figure 16.10 The PLS personality of fit model.
Figure 16.11 Fitting and comparing multiple PLS models.
Figure 16.12 The dual role of terms in a PLS model.
Figure 16.13 Interactively profiling four Ys in the space of 12 Xs.
Figure 16.14 Prediction accuracy of the final PLS model for test data.
Figure 16.15 InfoQ assessment of Example 2 with uncertainty.
Ron S. Kenett
KPA, Israel and University of Turin, Italy
Galit Shmueli
National Tsing Hua University, Taiwan
This edition first published 2017
© 2017 John Wiley & Sons, Ltd
Registered office: John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging‐in‐Publication Data
Names: Kenett, Ron. | Shmueli, Galit, 1971–
Title: Information quality : the potential of data and analytics to generate knowledge / Ron S. Kenett, Dr. Galit Shmueli.
Description: Chichester, West Sussex : John Wiley & Sons, Inc., 2017. | Includes bibliographical references and index.
Identifiers: LCCN 2016022699 | ISBN 9781118874448 (cloth) | ISBN 9781118890653 (epub)
Subjects: LCSH: Data mining. | Mathematical statistics.
Classification: LCC QA276 .K4427 2017 | DDC 006.3/12–dc23
LC record available at https://lccn.loc.gov/2016022699
A catalogue record for this book is available from the British Library.
To Sima; our children Dolav, Ariel, Dror, and Yoed; and their families and especially their children, Yonatan, Alma, Tomer, Yadin, Aviv, Gili, Matan, and Eden; they are my source of pride and motivation.
And to the memory of my dear friend, Roberto Corradetti, who dedicated his career to applied statistics.
RSK
To my family, mentors, colleagues, and students who’ve sparked and nurtured the creation of new knowledge and innovative thinking
GS
I am often invited to assess research proposals. Included amongst the questions I have to ask myself in such assessments are: Are the goals stated sufficiently clearly? Does the study have a good chance of achieving the stated goals? Will the researchers be able to obtain sufficient quality data for the project? Are the analysis methods adequate to answer the questions? And so on. These questions are fundamental, not merely for research proposals, but for any empirical study – for any study aimed at extracting useful information from evidence or data. And yet they are rarely overtly stated. They tend to lurk in the background, with the capability of springing into the foreground to bite those who failed to think them through.
These questions are precisely the sorts of questions addressed by the InfoQ – Information Quality – framework. Answering such questions allows funding bodies, corporations, national statistical institutes, and other organisations to rank proposals, balance costs against success probability, and also to identify the weaknesses and hence improve proposals and their chance of yielding useful and valuable information. In a context of increasing constraints on financial resources, it is critical that money is well spent, so that maximising the chance that studies will obtain useful information is becoming more and more important. The InfoQ framework provides a structure for maximising these chances.
A glance at the statistics shelves of any technical library will reveal that most books focus narrowly on the details of data analytic methods. The same is true of almost all statistics teaching. This is all very well – it is certainly vital that such material be covered. After all, without an understanding of the basic tools, no analysis, no knowledge extraction would be possible. But such a narrow focus typically fails to place such work in the broader context, without which its chances of success are damaged. This volume will help to rectify that oversight. It will provide readers with insight into and understanding of other key parts of empirical analysis, parts which are vital if studies are to yield valid, accurate, and useful conclusions.
But the book goes beyond merely providing a framework. It also delves into the details of these overlooked aspects of data analysis. It discusses the fact that the same data may be high quality for one purpose and low for another, and that the adequacy of an analysis depends on the data and the goal, as well as depending on other less obvious aspects, such as the accessibility, completeness, and confidentiality of the data. And it illustrates the ideas with a series of illuminating applications.
With computers increasingly taking on the mechanical burden of data analytics, the opportunities are becoming greater for us to shift our attention to the higher order aspects of analysis: to precise formulation of the questions, to consideration of data quality to answer those questions, to choice of the best method for the aims, taking account of the entire context of the analysis. In doing so, we improve the quality of the conclusions we reach. And this, in turn, leads to improved decisions ‐ for researchers, policy makers, managers, and others. This book will provide an important tool in this process.
David J. Hand
Imperial College London
Ron S. Kenett is chairman of the KPA Group; research professor, University of Turin, Italy; visiting professor at the Hebrew University Institute for Drug Research, Jerusalem, Israel and at the Faculty of Economics, Ljubljana University, Slovenia. He is past president of the Israel Statistical Association (ISA) and of the European Network for Business and Industrial Statistics (ENBIS). Ron authored and coauthored over 200 papers and 12 books on topics ranging from industrial statistics, customer surveys, multivariate quality control, risk management, biostatistics and statistical methods in healthcare to performance appraisal systems and integrated management models. The KPA Group he formed in 1990 is a leading Israeli firm focused on generating insights through analytics with international customers such as hp, 3M, Teva, Perrigo, Roche, Intel, Amdocs, Stratasys, Israel Aircraft Industries, the Israel Electricity Corporation, ICL, start‐ups, banks, and healthcare providers. He was awarded the 2013 Greenfield Medal by the Royal Statistical Society in recognition for excellence in contributions to the applications of statistics. Among his many activities he is member of the National Public Advisory Council for Statistics Israel; member of the Executive Academic Council, Wingate Academic College; and board member of several pharmaceutical and Internet product companies.
Galit Shmueli is distinguished professor at National Tsing Hua University’s Institute of Service Science. She is known for her research and teaching in business analytics, with a focus on statistical and data mining methods in information systems and healthcare. She has authored and coauthored over 70 journal articles, book chapters, books, and textbooks, including Data Mining for Business Analytics, Modeling Online Auctions and Getting Started with Business Analytics. Her research is published in top journals in statistics, management, marketing, information systems, and more. Professor Shmueli has designed and instructed business analytics courses and programs since 2004 at the University of Maryland, the Indian School of Business, Statistics.com, and National Tsing Hua University, Taiwan. She has also taught engineering statistics courses at the Israel Institute of Technology and at Carnegie Mellon University.
This book is about a strategic and tactical approach to data analysis where providing added value by turning numbers into insights is the main goal of an empirical study. In our long‐time experience as applied statisticians and data mining researchers (“data scientists”), we focused on developing methods for data analysis and applying them to real problems. Our experience has been, however, that data analysis is part of a bigger process that begins with problem elicitation that consists of defining unstructured problems and ends with decisions on action items and interventions that reflect on the true impact of a study.
In 2006, the first author published a paper on the statistical education bias where, typically, in courses on statistics and data analytics, only statistical methods are taught, without reference to the statistical analysis process (Kenett and Thyregod, 2006).
In 2010, the second author published a paper showing the differences between statistical modeling aimed at prediction goals versus modeling designed to explain causal effects (Shmueli, 2010), the implication being that the goal of a study should affect the way a study is performed, from data collection to data pre‐processing, exploration, modeling, validation, and deployment. A related paper (Shmueli and Koppius, 2011) focused on the role of predictive analytics in theory building and scientific development in the explanatory‐dominated social sciences and management research fields.
In 2014, we published “On Information Quality” (Kenett and Shmueli, 2014), a paper designed to lay out the foundation for a holistic approach to data analysis (using statistical modeling, data mining approaches, or any other data analysis methods) by structuring the main ingredients of what turns numbers into information. We called the approach information quality (InfoQ) and identified four InfoQ components and eight InfoQ dimensions.
Our main thesis is that data analysis, and especially the fields of statistics and data science, need to adapt to modern challenges and technologies by developing structured methods that provide a broad life cycle view, that is, from numbers to insights. This life cycle view needs to be focused on generating InfoQ as a key objective (for more on this see Kenett, 2015).
This book, Information Quality: The Potential of Data and Analytics to Generate Knowledge, offers an extensive treatment of InfoQ and the InfoQ framework. It is aimed at motivating researchers to further develop InfoQ elements and at students in programs that teach them how to make sure their analytic or statistical work is generating information of high quality.
Addressing this mixed community has been a challenge. On the one hand, we wanted to provide academic considerations, and on the other hand, we wanted to present examples and cases that motivate students and practitioners and give them guidance in their own specific projects.
We try to achieve this mix of objectives by combining Part I, which is mostly methodological, with Part II which is based on examples and case studies.
In Part III, we treat additional topics relevant to InfoQ, such as reproducible research, the review of scientific and applied research publications, the incorporation of InfoQ in academic and professional development programs, and how three leading software platforms (R, MINITAB, and JMP) support InfoQ implementations.
Researchers interested in applied statistics methods and strategies will most likely start in Part I and then move to Part II to see illustrations of the InfoQ framework applied in different domains. Practitioners and students learning how to turn numbers into information can start in a relevant chapter of Part II and move back to Part I.
A teacher or designer of a course on data analytics, applied statistics, or data science can build on examples in Part II and consolidate the approach by covering Chapter 13 and the chapters in Part I. Chapter 13 on “Integrating InfoQ into data science analytics programs, research methods courses and more” was specially prepared for this audience. We also developed five case studies that can be used by teachers as a rating‐based InfoQ assessment exercise (available at http://infoq.galitshmueli.com/class‐assignment).
In developing InfoQ, we received generous inputs from many people. In particular, we would like to acknowledge insightful comments by Sir David Cox, Shelley Zacks, Benny Kedem, Shirley Coleman, David Banks, Bill Woodall, Ron Snee, Peter Bruce, Shawndra Hill, Christine Anderson‐Cook, Ray Chambers, Fritz Scheuren, Ernest Foreman, Philip Stark, and David Steinberg. The motivation to apply InfoQ to the review of papers (Chapter 12) came from a comment by Ross Sparks, who wrote to us: “I really like your framework for evaluating information quality and I have started to use it to assess papers that I am asked to review. Particularly applied papers.” In preparing the material, we benefited from comprehensive editorial inputs by Raquelle Azran and Noa Shmueli, who generously provided us their invaluable expertise; we would like to thank them and recognize their help in improving the text language and style.
The last three chapters were contributed by colleagues. They create a bridge between theory and practice showing how InfoQ is supported by R, MINITAB, and JMP. We thank the authors of these chapters, Silvia Salini, Federica Cugnata, Elena Siletti, Ian Cox, Pere Grima, Lluis Marco‐Almagro, and Xavier Tort‐Martorell, for their effort, which helped make this work both theoretical and practical.
We are especially thankful to Professor David J. Hand for preparing the foreword of the book. David has been a source of inspiration to us for many years and his contribution highlights the key parts of our work.
In the course of writing this book and developing the InfoQ framework, the first author benefited from numerous discussions with colleagues at the University of Turin, in particular with a great visionary of the role of applied statistics in modern business and industry, the late Professor Roberto Corradetti. Roberto has been a close friend and has greatly influenced this work by continuously emphasizing the need for statistical work to be appreciated by its customers in business and industry. In addition, the financial support of the Diego de Castro Foundation that he managed has provided the time to work in a stimulating academic environment at both the Faculty of Economics and the “Giuseppe Peano” Department of Mathematics of UNITO, the University of Turin. The contributions of Roberto Corradetti cannot be overstated and are humbly acknowledged. Roberto passed away in June 2015 and left behind a great void. The second author thanks participants of the 2015 Statistical Challenges in eCommerce Research Symposium, where she presented the keynote address on InfoQ, for their feedback and enthusiasm regarding the importance of the InfoQ framework to current social science and management research.
Finally, we acknowledge with pleasure the professional help of the Wiley personnel, including Heather Kay, Alison Oliver, and Adalfin Jayasingh, and thank them for their encouragement, comments, and input that were instrumental in improving the form and content of the book.
Ron S. Kenett and Galit Shmueli
Kenett, R.S. (2015) Statistics: a life cycle view (with discussion). Quality Engineering, 27(1), pp. 111–129.
Kenett, R.S. and Shmueli, G. (2014) On information quality (with discussion). Journal of the Royal Statistical Society, Series A, 177(1), pp. 3–38.
Kenett, R.S. and Thyregod, P. (2006) Aspects of statistical consulting not taught by academia. Statistica Neerlandica, 60(3), pp. 396–412.
Shmueli, G. (2010) To explain or to predict? Statistical Science, 25, pp. 289–310.
Shmueli, G. and Koppius, O.R. (2011) Predictive analytics in information systems research. MIS Quarterly, 35(3), pp. 553–572.
What experts say about Information Quality: The Potential of Data and Analytics to Generate Knowledge:
A glance at the statistics shelves of any technical library will reveal that most books focus narrowly on the details of data analytic methods. The same is true of almost all statistics teaching. This volume will help to rectify that oversight. It will provide readers with insight into and understanding of other key parts of empirical analysis, parts which are vital if studies are to yield valid, accurate, and useful conclusions.
David Hand
Imperial College, London, UK
There is an important distinction between data and information. Data become information only when they serve to inform, but what is the potential of data to inform? With the work Kenett and Shmueli have done, we now have a general framework to answer that question. This framework is relevant to the whole analysis process, showing the potential to achieve higher‐quality information at each step.
John Sall
SAS Institute, Cary, NC, USA
The authors have a rare quality: being able to present deep thoughts and sound approaches in a way practitioners can feel comfortable and understand when reading their work and, at the same time, researchers are compelled to think about how they do their work.
Fabrizio Ruggeri
Consiglio Nazionale delle Ricerche, Istituto di Matematica Applicata e Tecnologie Informatiche, Milan, Italy
No amount of technique can make irrelevant data fit for purpose, eliminate unknown biases, or compensate for data paucity. Useful, reliable inferences require balancing real‐world and theoretical considerations and recognizing that goals, data, analysis, and costs are necessarily connected. Too often, books on statistics and data analysis put formulae in the limelight at the expense of more important questions about the relevance and limitations of data and the purpose of the analysis. This book elevates these crucial issues to their proper place and provides a systematic structure (and examples) to help practitioners see the larger context of statistical questions and, thus, to do more valuable work.
Philip Stark
University of California, Berkeley, USA
…the “Q” issue is front and centre for anyone (or any agency) hoping to benefit from the data tsunami that is said to be driving things now … And so the book will be very timely.
Ray Chambers
University of Wollongong, Australia
Kenett and Shmueli shed light on the biggest contributor to erroneous conclusions in research ‐ poor information quality coming out of a study. This issue ‐ made worse by the advent of Big Data ‐ has received too little attention in the literature and the classroom. Information quality issues can completely undermine the utility and credibility of a study, yet researchers typically deal with them in an ad‐hoc, offhand fashion, often when it is too late. Information Quality offers a sensible framework for ensuring that the data going into a study can effectively answer the questions being asked.
Peter Bruce
The Institute for Statistics Education
Policy makers rely on high quality and relevant data to make decisions, and it is important that, as more and different types of data become available, we are mindful of all aspects of the quality of the information provided. This includes not only statistical quality but also other dimensions as outlined in this book, including, very importantly, whether the data and analyses answer the relevant questions.
John Pullinger
National Statistician, UK Statistics Authority, London, UK
This impressive book fills a gap in the teaching of statistical methodology. It deals with a neglected topic in statistical textbooks: the quality of the information provided by the producers of statistical projects and used by the customers of statistical data from surveys, administrative data, etc. The emphasis in the book on defining, discussing, and analyzing the goal of the project at a preliminary stage and, no less important, at the analysis stage and in the use of the results obtained is of major importance.
Moshe Sikron
Former Government Statistician of Israel, Jerusalem, Israel
Ron Kenett and Galit Shmueli belong to a class of practitioners who go beyond methodological prowess into questioning what purpose should be served by a data based analysis, and what could be done to gauge the fitness of the analysis to meet its purpose. This kind of insight is all the more urgent given the present climate of controversy surrounding science’s own quality control mechanism. In fact, science used in support of economic or policy decisions – be it natural or social science – has an evident sore point precisely in the sort of statistical and mathematical modelling where the approach they advocate – Information Quality or InfoQ – is most needed. A full chapter is specifically devoted to the contribution InfoQ can make to clarify aspects of reproducibility, repeatability, and replicability of scientific research and publications. InfoQ is an empirical and flexible construct with practically infinite applications in data analysis. In a context of policy, one can deploy InfoQ to compare different evidential bases for or against a policy, or different options in an impact assessment case. InfoQ is a holistic construct encompassing the data, the method, and the goal of the analysis. It goes beyond the dimensions of data quality met in official statistics and resembles more holistic concepts of performance such as analysis pedigrees (NUSAP) and sensitivity auditing. Thus InfoQ includes consideration of the analysis’s generalizability and operationalization. The latter includes both action operationalization (to what extent concrete actions can be derived from the information provided by a study) and construct operationalization (to what extent a construct under analysis is effectively captured by the selected variables for a given goal). A desirable feature of InfoQ is that it demands multidisciplinary skills, which may force statisticians to move out of their comfort zone into the real world. The book illustrates the eight dimensions of InfoQ with a wealth of examples. A recommended read for applied statisticians and econometricians who care about the implications of their work.
Andrea Saltelli
European Centre for Governance in Complexity
Kenett and Shmueli have made a significant contribution to the profession by drawing attention to what is frequently the most important but overlooked aspect of analytics: information quality. For example, statistics textbooks too often assume that data consist of random samples and are measured without error, and data science competitions implicitly assume that massive data sets contain high‐quality data and are exactly the data needed for the problem at hand. In reality, of course, random samples are the exception rather than the rule, and many data sets, even very large ones, are not worth the effort required to analyze them. Analytics is akin to mining, not to alchemy; the methods can only extract what is there to begin with. Kenett and Shmueli made clear the point that obtaining good data typically requires significant effort. Fortunately, they present metrics to help analysts understand the limitations of the information in hand, and how to improve it going forward. Kudos to the authors for this important contribution.
Roger Hoerl
Union College, Schenectady, NY, USA
Don’t forget to visit the companion website for this book:
www.wiley.com/go/information_quality
Here you will find valuable material designed to enhance your learning, including:
The JMP add‐in presented in Chapter 16
Five case studies that can be used as exercises of InfoQ assessment
A set of presentations on InfoQ
Scan this QR code to visit the companion website.
Suppose you are conducting a study on online auctions and consider purchasing a dataset from eBay, the online auction platform, for the purpose of your study. The data vendor offers you four options that are within your budget:
Data on all the online auctions that took place in January 2012
Data on all the online auctions, for cameras only, that took place in 2012
Data on all the online auctions, for cameras only, that will take place in the next year
Data on a random sample of online auctions that took place in 2012
Which option would you choose? Perhaps none of these options are of value? Of course, the answer depends on the goal of the study. But it also depends on other considerations such as the analysis methods and tools that you will be using, the quality of the data, and the utility that you are trying to derive from the analysis. In the words of David Hand (2008):
Statisticians working in a research environment… may well have to explain that the data are inadequate to answer a particular question.
While those experienced with data analysis will find this dilemma familiar, the statistics and related literature do not provide guidance on how to approach this question in a methodical fashion and how to evaluate the value of a dataset in such a scenario.
Statistics, data mining, econometrics, and related areas are disciplines that are focused on extracting knowledge from data. They provide a toolkit for testing hypotheses of interest, predicting new observations, quantifying population effects, and summarizing data efficiently. In these empirical fields, measurable data is used to derive knowledge. Yet, a clean, exact, and complete dataset, which is analyzed professionally, might contain no useful information for the problem under investigation. In contrast, a very “dirty” dataset, with missing values and incomplete coverage, can contain useful information for some goals. In some cases, available data can even be misleading (Patzer, 1995, p. 14):
Data may be of little or no value, or even negative value, if they misinform.
The focus of this book is on assessing the potential of a particular dataset for achieving a given analysis goal by employing data analysis methods and considering a given utility. We call this concept information quality (InfoQ). We propose a formal definition of InfoQ and provide guidelines for its assessment. Our objective is to offer a general framework that applies to empirical research. This element has not received much attention in the body of knowledge of the statistics profession, so addressing it can be considered a contribution to both the theory and the practice of applied statistics (Kenett, 2015).
A framework for assessing InfoQ is needed both when designing a study to produce findings of high InfoQ and at the postdesign stage, after the data has been collected. Questions regarding the value of data to be collected, or that have already been collected, have important implications both in academic research and in practice. With this motivation in mind, we construct the concept of InfoQ and then operationalize it so that it can be implemented in practice.
In this book, we address and tackle a high‐level issue at the core of any data analysis. Rather than concentrate on a specific set of methods or applications, we consider a general concept that underlies any empirical analysis. The InfoQ framework therefore contributes to the literature on statistical strategy, also known as metastatistics (see Hand, 1994).
Our definition of InfoQ involves four major components that are present in every data analysis: an analysis goal, a dataset, an analysis method, and a utility (Kenett and Shmueli, 2014). The discussion and assessment of InfoQ require examining and considering the complete set of its components as well as the relationships between the components. In such an evaluation we also consider eight dimensions that deconstruct the InfoQ concept. These dimensions are presented in Chapter 3. We start our introduction of InfoQ by defining each of its components.
Before describing each of the four InfoQ components, we introduce the following notation and definitions to help avoid confusion:
g denotes a specific analysis goal.
X denotes the available dataset.
f is an empirical analysis method.
U is a utility measure.
We use subscript indices to indicate alternatives. For example, to convey K different analysis goals, we use g1, g2,…, gK; J different methods of analysis are denoted f1, f2,…, fJ.
Following Hand’s (2008) definition of statistics as “the technology of extracting meaning from data,” we can think of the InfoQ framework as one for evaluating the application of a technology (data analysis) to a resource (data) for a given purpose.
Data analysis is used for a variety of purposes in research and in industry. The term “goal” refers to two levels of goals: the high‐level goal of the study (the “domain goal”) and the empirical goal (the “analysis goal”). One starts from the domain goal and then converts it into an analysis goal. A classic example is translating a hypothesis driven by a theory into a set of statistical hypotheses.
There are various classifications of study goals; some classifications span both the domain and analysis goals, while other classification systems focus on describing different analysis goals.
One classification approach divides the domain and analysis goals into three general classes: causal explanation, empirical prediction, and description (see Shmueli, 2010; Shmueli and Koppius, 2011). Causal explanation is concerned with establishing and quantifying the causal relationship between inputs and outcomes of interest. Lab experiments in the life sciences are often intended to establish causal relationships. Academic research in the social sciences is typically focused on causal explanation. In the social science context, the causality structure is based on a theoretical model that establishes the causal effect of some constructs (abstract concepts) on other constructs. The data collection stage is therefore preceded by a construct operationalization stage, where the researcher establishes which measurable variables can represent the constructs of interest. An example is investigating the causal effect of parents’ intelligence on their children’s intelligence; the construct “intelligence” can be measured in various ways, such as via IQ tests. The goal of empirical prediction differs from causal explanation. Examples include forecasting future values of a time series and predicting the output value for new observations given a set of input variables, as in recommendation systems on various websites, which aim to predict the services or products a user is most likely to be interested in. Predictions of the economy are another type of predictive goal, with forecasts of particular economic measures or indices being of interest. Finally, descriptive goals include quantifying and testing for population effects by using data summaries, graphical visualizations, statistical models, and statistical tests.
A different, but related goal classification approach (Deming, 1953) introduces the distinction between enumerative studies, aimed at answering the question “how many?,” and analytic studies, aimed at answering the question “why?”
A third classification (Tukey, 1977) classifies studies into exploratory and confirmatory data analysis.
Our use of the term “goal” includes all these different types of goals and goal classifications. For examples of such goals in the context of customer satisfaction surveys, see Chapter 7 and Kenett and Salini (2012).
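To make the distinction between explanatory and predictive goals concrete, here is a minimal sketch in R (one of the software platforms covered in Part III of this book). The data are simulated, and every name in the sketch is our own illustration rather than an example from the book: the same dataset is first used for an explanatory goal (estimating and testing the effect of an input) and then for a predictive goal (scoring holdout accuracy).

# Simulated data; hypothetical illustration, not from any study in this book
set.seed(1)
n <- 200
x <- rnorm(n)                       # observed input
y <- 2 + 0.5 * x + rnorm(n)         # outcome with a built-in effect of x
dat <- data.frame(x, y)

# Explanatory goal: estimate the effect of x on y and test its significance
fit_explain <- lm(y ~ x, data = dat)
summary(fit_explain)$coefficients   # coefficient estimates and p-values

# Predictive goal: judge accuracy of predictions on a holdout set
train <- dat[1:150, ]
test  <- dat[151:200, ]
fit_pred <- lm(y ~ x, data = train)
sqrt(mean((test$y - predict(fit_pred, newdata = test))^2))  # holdout RMSE

The point of the sketch is that the model class (here, a linear regression) can be identical under both goals; what changes is how the analysis is set up and how its output is judged.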
Data is a broadly defined term that includes any type of data intended to be used in the empirical analysis. Data can arise from different collection instruments: surveys, laboratory tests, field experiments, computer experiments, simulations, web searches, mobile recordings, observational studies, and more. Data can be primary, collected specifically for the purpose of the study, or secondary, collected for a different reason. Data can be univariate or multivariate, discrete, continuous, or mixed. Data can contain semantic unstructured information in the form of text, images, audio, and video. Data can have various structures, including cross‐sectional data, time series, panel data, networked data, geographic data, and more. Data can include information from a single source or from multiple sources. Data can be of any size (from a single observation in case studies to “big data” with zettabytes) and any dimension.
We use the general term data analysis to encompass any empirical analysis applied to data. This includes statistical models and methods (parametric, semiparametric, nonparametric, Bayesian and classical, etc.), data mining algorithms, econometric models, graphical methods, and operations research methods (such as simplex optimization). Methods can be as simple as summary statistics or complex multilayer models, computationally simple or computationally intensive.
The extent to which the analysis goal is achieved is typically measured by some performance measure. We call this measure “utility.” As with the study goal, utility has two facets: the utility from the domain point of view and the operationalized, measurable utility metric used in the analysis. As with the goal, the linkage between the domain utility and the analysis utility measure should be properly established so that the analysis utility can be used to draw inferences about the domain utility.
In predictive studies, popular utility measures are predictive accuracy, lift, and expected cost per prediction. In descriptive studies, utility is often assessed based on goodness‐of‐fit measures. In causal explanatory modeling, statistical significance, statistical power, and strength‐of‐fit measures (e.g., R2) are common.
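As a hedged illustration of how such utility measures are computed, the following base R sketch evaluates three of them on hypothetical holdout scores and outcomes (all numbers are simulated; none of this comes from a study in the book):

# Hypothetical holdout outcomes and model scores (illustration only)
set.seed(2)
actual <- rbinom(200, 1, 0.3)       # observed binary outcomes
score  <- runif(200)                # model-assigned scores in [0, 1]

# Predictive utility: classification accuracy at a 0.5 cutoff
accuracy <- mean((score > 0.5) == actual)

# Predictive utility: lift in the top decile (top 10% of scores)
top <- order(score, decreasing = TRUE)[1:20]
lift <- mean(actual[top]) / mean(actual)

# Strength-of-fit utility: R^2 of a simple linear model
r2 <- summary(lm(actual ~ score))$r.squared

c(accuracy = accuracy, top_decile_lift = lift, R2 = r2)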
Following Hand’s (2008) definition of statistics as “the technology of extracting meaning from data,” we consider the utility of applying a technology f to a resource X for a given purpose g. In particular, we focus on the question: What is the potential of a particular dataset to achieve a particular goal using a given data analysis method and utility? To formalize this question, we define the concept of InfoQ as

InfoQ(f, X, g) = U(f(X | g)),

the utility U of applying an analysis f to a dataset X conditional on the goal g (Kenett and Shmueli, 2014).
The quality of information, InfoQ, is determined by the quality of its components g (“quality of goal definition”), X (“data quality”), f (“analysis quality”), and U (“quality of utility measure”) as well as by the relationships between them. (See Figure 1.1 for a visual representation of InfoQ components.)
Figure 1.1 The four InfoQ components.
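Read as a composition, the definition says: fix a goal g, apply an analysis f to the data X, and score the output with a utility U. The following R skeleton is our own illustration of that composition (the function and object names are hypothetical, not an API from the book or from any package):

# InfoQ as the composition U(f(X | g)) -- illustrative skeleton only
info_q <- function(f, X, g, U) {
  U(f(X, g))                        # utility of the analysis output
}

# Toy instantiation: the goal g carries a prediction formula, f fits a
# linear model, and U scores the fitted model by its R^2
f <- function(X, g) lm(g$formula, data = X)
U <- function(model) summary(model)$r.squared

set.seed(3)
X <- data.frame(x = rnorm(50))
X$y <- 1 + 2 * X$x + rnorm(50)
g <- list(formula = y ~ x)

info_q(f, X, g, U)                  # InfoQ value for this choice of g, X, f, U

Swapping in a different goal, dataset, analysis method, or utility changes the resulting InfoQ; exposing that sensitivity is exactly what the framework is for.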
Let us recall the four options of eBay datasets we described at the beginning of the chapter. In order to evaluate the InfoQ of each of these datasets, we would have to specify the study goal, the intended data analysis, and the utility measure.
To better illustrate the role that the different components play, let us examine four studies in the field of online auctions, each using data to address a particular goal.
Econometricians are interested in determining factors that affect the final price of an online auction. Although game theory provides an underlying theoretical causal model of price in offline auctions, the online environment differs in substantial ways. Online auction platforms such as eBay.com have lowered the entry barrier for sellers and buyers to participate in auctions. Auction rules and settings can differ from classic on‐ground auctions, and so can dynamics between bidders.
Let us examine the study “Public versus Secret Reserve Prices in eBay Auctions: Results from a Pokémon Field Experiment” (Katkar and Reiley, 2006) which investigated the effect of two types of reserve prices on the final auction price. A reserve price is a value that is set by the seller at the start of the auction. If the final price does not exceed the reserve price, the auction does not transact. On eBay, sellers can choose to place a public reserve price that is visible to bidders or an invisible secret reserve price, where bidders see only that there is a reserve price but do not know its value.
The researchers’ goal is stated as follows:
We ask, empirically, whether the seller is made better or worse off by setting a secret reserve above a low minimum bid, versus the option of making the reserve public by using it as the minimum bid level.
This question is then converted into the statistical goal (g) of testing a hypothesis “that secret reserve prices actually do produce higher expected revenues.”
The researchers proceed by setting up auctions for Pokémon cards1 on eBay.com and auctioning off 50 matched pairs of Pokémon cards, half with secret reserves and half with equivalently high public minimum bids. The resulting dataset included information about bids, bidders, and the final price in each of the 100 auctions, as well as whether the auction had a secret or public reserve price. The dataset also included information about the sellers’ choices, such as the start and close time of each auction, the shipping costs, etc. This dataset constitutes X.
The researchers decided to “measure the effects of a secret reserve price (relative to an equivalent public reserve) on three different dependent variables: the probability of the auction resulting in a sale, the number of bids received, and the price received for the card in the auction.” This was done via linear regression models (f). For example, the sale/no sale outcome was regressed on the type of reserve (public/private) and other control variables, and the statistical significance of the reserve variable was examined.
The authors conclude “The average drop in the probability of sale when using a secret reserve is statistically significant.” Using another linear regression model with price as the dependent variable, statistical significance (the p‐value) of the regression coefficient was used to test the presence of an effect for a private or public reserve price, and the regression coefficient value was used to quantify the magnitude of the effect, concluding that “a secret‐reserve auction will generate a price $0.63 lower, on average, than will a public‐reserve auction.” Hence, the utility (U) in this study relies mostly on statistical significance and p‐values as well as the practical interpretation of the magnitude of a regression coefficient.
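To make the structure of this analysis concrete, here is a hedged R sketch, not the authors’ code: the auction records are simulated, the coefficient on the reserve‐type variable is set to echo the reported $0.63 effect, and a single made‐up control variable stands in for the study’s full set of controls.

# Simulated auction records shaped like the study's regression (all values
# are made up; the real data and controls are in Katkar and Reiley, 2006)
set.seed(4)
n <- 100
secret <- rep(c(0, 1), each = n / 2)    # 1 = secret reserve, 0 = public
ship   <- runif(n, 1, 4)                # shipping cost, a stand-in control
price  <- 10 - 0.63 * secret + 0.5 * ship + rnorm(n)

fit <- lm(price ~ secret + ship)
summary(fit)$coefficients["secret", ]   # effect estimate and its p-value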
What is the quality of the information contained in this study’s dataset for testing the effect of private versus public reserve price on the final price, using regression models and statistical significance? The authors compare the advantages of their experimental design for answering their question of interest with designs of previous studies using observational data:
With enough [observational] data and enough identifying econometric assumptions, one could conceivably tease out an empirical measurement of the reserve price effect from eBay field data… Such structural models make strong identifying assumptions in order to recover economic unobservables (such as bidders’ private information about the item’s value)… In contrast, our research project is much less ambitious, for we focus only on the effect of secret reserve prices relative to public reserve prices (starting bids). Our experiment allows us to carry out this measurement in a manner that is as simple, direct, and assumption‐free as possible.
In other words, with a simple two‐level experiment, the authors aim to answer a specific research question (g1) in a robust manner, rather than build an extensive theoretical economic model (g2) that is based on many assumptions.
Interestingly, when comparing their conclusions against prior literature on the effect of reserve prices in a study that used observational data, the authors mention that they find an opposite effect:
Our results are somewhat inconsistent with those of Bajari and Hortaçsu…. Perhaps Bajari and Hortaçsu have made an inaccurate modeling assumption, or perhaps there is some important difference between bidding for coin sets and bidding for Pokémon cards.
This discrepancy even leads the researchers to propose a new dataset that can help tackle the original goal with less confounding:
A new experiment, auctioning one hundred items each in the $100 range, for example, could shed some important light on this question.
This means that the InfoQ of the Pokémon card auction dataset is considered lower than that of a dataset of auctions for more expensive items.
1 The Pokémon trading card game was one of the largest collectible toy crazes of 1999 and 2000. Introduced in early 1999, Pokémon game cards appeal both to game players and to collectors. Source: Katkar and Reiley (2006). © National Bureau of Economic Research.
