Information Quality

Ron S. Kenett
Description

Provides an important framework for data analysts in assessing the quality of data and its potential to provide meaningful insights through analysis

Analytics and statistical analysis have become pervasive topics, mainly due to the growing availability of data and analytic tools. Technology, however, fails to deliver insights with added value if the quality of the information it generates is not assured. Information Quality (InfoQ) is a tool developed by the authors to assess the potential of a dataset to achieve a goal of interest, using data analysis.  Whether the information quality of a dataset is sufficient is of practical importance at many stages of the data analytics journey, from the pre-data collection stage to the post-data collection and post-analysis stages. It is also critical to various stakeholders: data collection agencies, analysts, data scientists, and management.

 This book:

  • Explains how to integrate the notions of goal, data, analysis and utility that are the main building blocks of data analysis within any domain.
  • Presents a framework for integrating domain knowledge with data analysis.
  • Provides a combination of both methodological and practical aspects of data analysis.
  • Discusses issues surrounding the implementation and integration of InfoQ in both academic programmes and business / industrial projects.
  • Showcases numerous case studies in a variety of application areas such as education, healthcare, official statistics, risk management and marketing surveys.
  • Presents a review of software tools from the InfoQ perspective along with example datasets on an accompanying website.

This book will be beneficial for researchers in academia and in industry, analysts, consultants, and agencies that collect and analyse data, as well as for undergraduate and postgraduate courses involving data analysis.


Table of Contents

Cover

Title Page

Foreword

About the authors

Preface

References

Quotes about the book

About the companion website

Part I: THE INFORMATION QUALITY FRAMEWORK

1 Introduction to information quality

1.1 Introduction

1.2 Components of InfoQ

1.3 Definition of information quality

1.4 Examples from online auction studies

1.5 InfoQ and study quality

1.6 Summary

References

2 Quality of goal, data quality, and analysis quality

2.1 Introduction

2.2 Data quality

2.3 Analysis quality

2.4 Quality of utility

2.5 Summary

References

3 Dimensions of information quality and InfoQ assessment

3.1 Introduction

3.2 The eight dimensions of InfoQ

3.3 Assessing InfoQ

3.4 Example: InfoQ assessment of online auction experimental data

3.5 Summary

References

4 InfoQ at the study design stage

4.1 Introduction

4.2 Primary versus secondary data and experiments versus observational data

4.3 Statistical design of experiments

4.4 Clinical trials and experiments with human subjects

4.5 Design of observational studies: Survey sampling

4.6 Computer experiments (simulations)

4.7 Multiobjective studies

4.8 Summary

References

5 InfoQ at the postdata collection stage

5.1 Introduction

5.2 Postdata collection data

5.3 Data cleaning and preprocessing

5.4 Reweighting and bias adjustment

5.5 Meta‐analysis

5.6 Retrospective experimental design analysis

5.7 Models that account for data “loss”: Censoring and truncation

5.8 Summary

References

Part II: APPLICATIONS OF InfoQ

6 Education

6.1 Introduction

6.2 Test scores in schools

6.3 Value‐added models for educational assessment

6.4 Assessing understanding of concepts

6.5 Summary

Appendix: MERLO implementation for an introduction to statistics course

References

7 Customer surveys

7.1 Introduction

7.2 Design of customer surveys

7.3 InfoQ components

7.4 Models for customer survey data analysis

7.5 InfoQ evaluation

7.6 Summary

Appendix: A posteriori InfoQ improvement for survey nonresponse selection bias

References

8 Healthcare

8.1 Introduction

8.2 Institute of Medicine reports

8.3 Sant’Anna di Pisa report on the Tuscany healthcare system

8.4 The haemodialysis case study

8.5 The Geriatric Medical Center case study

8.6 Report of cancer incidence cluster

8.7 Summary

References

9 Risk management

9.1 Introduction

9.2 Financial engineering, risk management, and Taleb’s quadrant

9.3 Risk management of OSS

9.4 Risk management of a telecommunication system supplier

9.5 Risk management in enterprise system implementation

9.6 Summary

References

10 Official statistics

10.1 Introduction

10.2 Information quality and official statistics

10.3 Quality standards for official statistics

10.4 Standards for customer surveys

10.5 Integrating official statistics with administrative data for enhanced InfoQ

10.6 Summary

References

Part III: IMPLEMENTING InfoQ

11 InfoQ and reproducible research

11.1 Introduction

11.2 Definitions of reproducibility, repeatability, and replicability

11.3 Reproducibility and repeatability in GR&R

11.4 Reproducibility and repeatability in animal behavior studies

11.5 Replicability in genome‐wide association studies

11.6 Reproducibility, repeatability, and replicability: the InfoQ lens

11.7 Summary

Appendix: Gauge repeatability and reproducibility study design and analysis

References

12 InfoQ in review processes of scientific publications

12.1 Introduction

12.2 Current guidelines in applied journals

12.3 InfoQ guidelines for reviewers

12.4 Summary

References

13 Integrating InfoQ into data science analytics programs, research methods courses, and more

13.1 Introduction

13.2 Experience from InfoQ integrations in existing courses

13.3 InfoQ as an integrating theme in analytics programs

13.4 Designing a new analytics course (or redesigning an existing course)

13.5 A one‐day InfoQ workshop

13.6 Summary

Acknowledgements

References

14 InfoQ support with R

14.1 Introduction

14.2 Examples of information quality with R

14.3 Components and dimensions of InfoQ and R

14.4 Summary

References

15 InfoQ support with Minitab

15.1 Introduction

15.2 Components and dimensions of InfoQ and Minitab

15.3 Examples of InfoQ with Minitab

15.4 Summary

References

16 InfoQ support with JMP

16.1 Introduction

16.2 Example 1: Controlling a film deposition process

16.3 Example 2: Predicting water quality in the Savannah River Basin

16.4 A JMP application to score the InfoQ dimensions

16.5 JMP capabilities and InfoQ

16.6 Summary

References

Index

End User License Agreement

List of Tables

Chapter 04

Table 4.1 Statistical strategies for increasing InfoQ given a priori causes at the design stage.

Chapter 05

Table 5.1 Statistical strategies for increasing InfoQ given a posteriori causes at the postdata collection stage and approaches for increasing InfoQ.

Chapter 06

Table 6.1 InfoQ assessment for MAP report.

Table 6.2 InfoQ assessment for student’s lifelong earning study.

Table 6.3 InfoQ assessment for VAM (based on ASA statement).

Table 6.4 MERLO recognition scores for ten concepts taught in an Italian middle school.

Table 6.5 Grouping of MERLO recognition scores using the Tukey method and 95% confidence.

Table 6.6 InfoQ assessment for MERLO.

Table 6.7 Scoring of InfoQ dimensions of examples from education.

Chapter 07

Table 7.1 Main deliverables in an Internet‐based ACSS project.

Table 7.2 Service level agreements for Internet‐based customer satisfaction surveys.

Table 7.3 A typical ACSS activity plan.

Table 7.4 InfoQ score of various models used in the analysis of customer surveys.

Table A Postdata collection correction for nonresponse bias in a customer satisfaction survey using adjusted residuals.

Chapter 08

Table 8.1 InfoQ components for IOM‐related studies.

Table 8.2 InfoQ dimensions and ratings for Stelfox et al. (2006) data and for the IOM reports.

Table 8.3 InfoQ components for Sant’Anna di Pisa study.

Table 8.4 InfoQ dimensions and ratings on 5‐point scale for Sant’Anna di Pisa study.

Table 8.5 InfoQ components for the haemodialysis decision support system.

Table 8.6 Marginal posterior distributions for the j-th patient’s risk profile (True = risk has materialized).

Table 8.7 Posterior distributions of outcome measures for two patients.

Table 8.8 InfoQ dimensions and ratings on 5‐point scale for haemodialysis study.

Table 8.9 InfoQ components for the two NataGer projects data.

Table 8.10 InfoQ dimensions and ratings on 5‐point scale for the two NataGer projects.

Table 8.11 InfoQ components of cancer incidence report.

Table 8.12 InfoQ dimensions and ratings of cancer incidence study by Rottenberg et al. (2013).

Table 8.13 Scoring of InfoQ dimensions for each of the four healthcare case studies.

Chapter 09

Table 9.1 Log of technicians’ on‐site interventions (techdb).

Table 9.2 Balance sheet indicators for a given customer of the VNO (balance).

Table 9.3 Classification of 264 CEAO chains by aspect and division (output from MINITAB version 12.1).

Table 9.4 Scoring of InfoQ dimensions of the five risk management case studies.

Chapter 10

Table 10.1 Relationship between NCSES standards and InfoQ dimensions. Shaded cells indicate an existing relationship.

Table 10.2 Relationship between ISO 10004 guidelines and InfoQ dimensions. Shaded cells indicate an existing relationship.

Table 10.3 Scores for InfoQ dimensions for Stella education case study.

Table 10.4 Scores for InfoQ dimensions for the NHTSA safety case study.

Chapter 11

Table 11.1 Terminology in GR&R studies.

Table 11.2 Terminology in animal experiments.

Table 11.3 Terminology in genome‐wide association studies.

Table A ANOVA table of GR&R experiments.

Chapter 12

Table 12.1 List of journals published by the American Statistical Association (ASA). Referee guidelines web pages were not found for any of these journals.

Table 12.2 Partial list of journals published by the American Society for Quality (ASQ). Referee guidelines web pages were not found for any of these journals. The same lack of guidelines applies to all other ASQ journals (http://asq.org/pub/).

Table 12.3 List of journals published by the Institute of Mathematical Statistics (IMS) and URLs for referee guidelines (accessed July 7, 2014).

Table 12.4 List of journals published by the Royal Statistical Society (RSS) and URLs for referee guidelines (accessed July 7, 2014).

Table 12.5 List of journals in machine learning and URLs for referee guidelines (accessed July 7, 2014).

Table 12.6 Reviewing guidelines for major data mining conference (accessed July 7, 2014).

Table 12.7 List of top scientific journals and URLs for referee guidelines (accessed July 7, 2014).

Table 12.8 Questionnaire for reviewers of applied research submission.

Chapter 15

Table 15.1 InfoQ assessment for Example 1.

Table 15.2 Results of the factorial experimental design of the steering wheels.

Table 15.3 InfoQ assessment for Example 2.

Chapter 16

Table 16.1 Synopsis of Example 1.

Table 16.2 InfoQ assessment for Example 1.

Table 16.3 Synopsis of Example 2.

Table 16.4 Ys for the PLS model.

Table 16.5 InfoQ assessment for Example 2.

List of Illustrations

Chapter 01

Figure 1.1 The four InfoQ components.

Figure 1.2 Price curves for the last day of four seven‐day auctions (x‐axis denotes day of auction). Current auction price (line with circles), functional price curve (smooth line) and forecasted price curve (broken line).

Chapter 03

Figure 3.1 Timeline of study, from data collection to study deployment.

Chapter 04

Figure 4.1 JMP screenshot of a 2⁷⁻³ fractional factorial experiment with the piston simulator described in Kenett and Zacks (2014).

Figure 4.2 JMP screenshot of a definitive screening design experiment with the piston simulator described in Kenett and Zacks (2014).

Figure 4.3 JMP screenshot of fraction of design space plots and design diagnostics of fractional (left) and definitive screening designs (right).

Chapter 05

Figure 5.1 Illustration of right, left, and interval censoring. Each line denotes the lifetime of the observation.

Chapter 06

Figure 6.1 The Missouri Assessment Program test report for fictional student Sara Armstrong.

Figure 6.2 SAT Critical Reading skills.

Figure 6.3 Earning per teacher value‐added score.

Figure 6.4 Test scores by school by high value‐added teacher score.

Figure 6.5 Template for constructing an item family in MERLO.

Figure 6.6 Example of MERLO item (mathematics/functions).

Figure 6.7 Box plots of MERLO recognition scores in ten mathematical topics taught in an Italian middle school. Asterisks represent outliers beyond three standard deviations of the mean.

Figure 6.8 Confidence intervals for difference in MERLO recognition scores between topics.

Chapter 07

Figure 7.1 SERVQUAL gap model.

Figure 7.2 Bayesian network of responses to satisfaction questions from various topics, overall satisfaction, repurchasing intentions, recommendation level, and country of respondent.

Chapter 08

Figure 8.1 Bayesian network of patient haemodialysis treatment.

Figure 8.2 Visual board display designed to help reduce patients’ falls.

Figure 8.3 Prioritization tool for potential causes for bedsores occurrence.

Chapter 09

Figure 9.1 Bayesian network linking risk drivers with the activeness of risk indicators.

Figure 9.2 Social network based on email communication between OSS contributors and committers.

Figure 9.3 Simplex representation of association rules of event categories in telecom case study.

Figure 9.4 A sample CEAO chain.

Figure 9.5 Correspondence analysis of CEAO chains in five divisions by aspect. K&S = knowledge and skills; Mgmt = management; P = process; S = structure; S&G = strategy and goals; SD = social dynamics.

Chapter 10

Figure 10.1 BN for the Stella dataset.

Figure 10.2 BN is conditioned on a value of lastsal which is similar to the salary value of the Graduates dataset.

Figure 10.3 BN is conditioned on a low value of begsal and emp and for a high value of yPhD.

Figure 10.4 BN for the Graduates dataset.

Figure 10.5 BN is conditioned on a high value of msalary.

Figure 10.6 BN is conditioned on a high value of mdipl and nemp and for a low value of ystjob.

Figure 10.7 BN for the Vehicle Safety dataset.

Figure 10.8 BN for the Crash Test dataset.

Figure 10.9 BN for the Crash Test dataset is conditioned on a high value of Wt and Year.

Figure 10.10 BN for the Crash Test dataset is conditioned on a low value of Wt and Year.

Chapter 13

Figure 13.1 Google Trends data on “data science course.”

Figure 13.2 InfoQ evaluation form for an empirical study on air quality. The complete information and additional studies for evaluation are available at goo.gl/erNPF.

Chapter 14

Figure 14.1 An example of RStudio window.

Figure 14.2 An example of R Commander window.

Figure 14.3 Wordclouds for the two datasets.

Figure 14.4 Comparison (Expo 2015 = dark, Expo 2020 = light) and commonality clouds.

Figure 14.5 ExpoBarometro results.

Figure 14.6 SensoMineR menu in Excel.

Figure 14.7 Assessment of the performance of the panel with the panelperf() and coltable() functions.

Figure 14.8 Representation of the perfumes and the sensory attributes on the first two dimensions resulting from PCA() on adjusted means of ANOVA models.

Figure 14.9 Representation of the perfumes on the first two dimensions resulting from PCA() in which each product is associated with a confidence ellipse.

Figure 14.10 Representation of the perfumes and the sensory attributes on the first two dimensions resulting from MFA() of both experts and consumers data.

Figure 14.11 Visualization of the hedonic scores given by the panelists.

Figure 14.12 Nights spent in tourist accommodation establishments by NUTS level 2 region, 2013 (million nights spent by residents and nonresidents).

Figure 14.13 Bayesian network.

Figure 14.14 Distribution of the overall satisfaction for each level of each variable.

Chapter 15

Figure 15.1 Minitab user interface, with session and worksheet windows.

Figure 15.2 Some menu options for basic statistical analysis and quality tools.

Figure 15.3 A screenshot of Minitab help.

Figure 15.4 A histogram (left) and its corresponding stem‐and‐leaf graph (right), of heartbeats per minute of students in a class.

Figure 15.5 An example of a DDE connection between Excel and Minitab.

Figure 15.6 An example of the number of defects during a month, shown in a time series plot.

Figure 15.7 An example of a Pareto chart with all the data together (top) and stratifying by month (bottom).

Figure 15.8 A screenshot showing different types of control charts in Minitab.

Figure 15.9 A screenshot with different modeling possibilities in Minitab.

Figure 15.10 Output from the power and sample size procedure for the comparison of means test.

Figure 15.11 The menu option for a Gage R&R study to validate the measurement system in Minitab.

Figure 15.12 Representation of results (using histograms) in the case study of the bakery.

Figure 15.13 Schematic representation of the data collection procedure for the glass bottles case study.

Figure 15.14 Representation of results (using a multivari chart) in the case study of the glass bottles.

Figure 15.15 Matrix plot of all variables in the power plant case study.

Figure 15.16 Scatterplot of yield versus power (with outlier) in the power plant case study.

Figure 15.17 Scatterplot of yield versus power (without outlier) in the power plant case study.

Figure 15.18 Dotplot of factor form in the power plant case study.

Figure 15.19 Dotplot of the logarithm of factor form in the power plant case study.

Figure 15.20 Interaction plot for pressure and temperature in the steering wheels case study.

Figure 15.21 Normal probability plot of the effects in the steering wheels case study.

Figure 15.22 Interaction plot for ratio and weather in the steering wheels case study.

Chapter 16

Figure 16.1 Statistical discovery.

Figure 16.2 The LPCVD data (partial view).

Figure 16.3 Pattern of missing thickness data.

Figure 16.4 Map of all the thickness values.

Figure 16.5 XBar‐R chart of film thickness.

Figure 16.6 Three‐way chart of film thickness.

Figure 16.7 The water quality data.

Figure 16.8 Field stations in the Savannah River Basin.

Figure 16.9 Bivariate correlation of Ys.

Figure 16.10 The PLS personality of fit model.

Figure 16.11 Fitting and comparing multiple PLS models.

Figure 16.12 The dual role of terms in a PLS model.

Figure 16.13 Interactively profiling four Ys in the space of 12 Xs.

Figure 16.14 Prediction accuracy of the final PLS model for test data.

Figure 16.15 InfoQ assessment of Example 2 with uncertainty.


Information Quality

The Potential of Data and Analytics to Generate Knowledge

 

 

Ron S. Kenett

KPA, Israel and University of Turin, Italy

Galit Shmueli

National Tsing Hua University, Taiwan

 

 

 

 

 

 

This edition first published 2017
© 2017 John Wiley & Sons, Ltd

Registered office: John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging‐in‐Publication Data

Names: Kenett, Ron. | Shmueli, Galit, 1971–
Title: Information quality : the potential of data and analytics to generate knowledge / Ron S. Kenett, Dr. Galit Shmueli.
Description: Chichester, West Sussex : John Wiley & Sons, Inc., 2017. | Includes bibliographical references and index.
Identifiers: LCCN 2016022699 | ISBN 9781118874448 (cloth) | ISBN 9781118890653 (epub)
Subjects: LCSH: Data mining. | Mathematical statistics.
Classification: LCC QA276 .K4427 2017 | DDC 006.3/12–dc23
LC record available at https://lccn.loc.gov/2016022699

A catalogue record for this book is available from the British Library.

 

 

 

To Sima; our children Dolav, Ariel, Dror, and Yoed; and their families and especially their children, Yonatan, Alma, Tomer, Yadin, Aviv, Gili, Matan, and Eden, who are my source of pride and motivation.

And to the memory of my dear friend, Roberto Corradetti, who dedicated his career to applied statistics.

RSK

To my family, mentors, colleagues, and students who’ve sparked and nurtured the creation of new knowledge and innovative thinking

GS

Foreword

I am often invited to assess research proposals. Included amongst the questions I have to ask myself in such assessments are: Are the goals stated sufficiently clearly? Does the study have a good chance of achieving the stated goals? Will the researchers be able to obtain sufficient quality data for the project? Are the analysis methods adequate to answer the questions? And so on. These questions are fundamental, not merely for research proposals, but for any empirical study – for any study aimed at extracting useful information from evidence or data. And yet they are rarely overtly stated. They tend to lurk in the background, with the capability of springing into the foreground to bite those who failed to think them through.

These questions are precisely the sorts of questions addressed by the InfoQ – Information Quality – framework. Answering such questions allows funding bodies, corporations, national statistical institutes, and other organisations to rank proposals, balance costs against success probability, and also to identify the weaknesses and hence improve proposals and their chance of yielding useful and valuable information. In a context of increasing constraints on financial resources, it is critical that money is well spent, so that maximising the chance that studies will obtain useful information is becoming more and more important. The InfoQ framework provides a structure for maximising these chances.

A glance at the statistics shelves of any technical library will reveal that most books focus narrowly on the details of data analytic methods. The same is true of almost all statistics teaching. This is all very well – it is certainly vital that such material be covered. After all, without an understanding of the basic tools, no analysis, no knowledge extraction would be possible. But such a narrow focus typically fails to place such work in the broader context, without which its chances of success are damaged. This volume will help to rectify that oversight. It will provide readers with insight into and understanding of other key parts of empirical analysis, parts which are vital if studies are to yield valid, accurate, and useful conclusions.

But the book goes beyond merely providing a framework. It also delves into the details of these overlooked aspects of data analysis. It discusses the fact that the same data may be high quality for one purpose and low for another, and that the adequacy of an analysis depends on the data and the goal, as well as depending on other less obvious aspects, such as the accessibility, completeness, and confidentiality of the data. And it illustrates the ideas with a series of illuminating applications.

With computers increasingly taking on the mechanical burden of data analytics the opportunities are becoming greater for us to shift our attention to the higher order aspects of analysis: to precise formulation of the questions, to consideration of data quality to answer those questions, to choice of the best method for the aims, taking account of the entire context of the analysis. In doing so we improve the quality of the conclusions we reach. And this, in turn, leads to improved decisions ‐ for researchers, policy makers, managers, and others. This book will provide an important tool in this process.

David J. Hand

Imperial College London

About the authors

Ron S. Kenett is chairman of the KPA Group; research professor, University of Turin, Italy; visiting professor at the Hebrew University Institute for Drug Research, Jerusalem, Israel and at the Faculty of Economics, Ljubljana University, Slovenia. He is past president of the Israel Statistical Association (ISA) and of the European Network for Business and Industrial Statistics (ENBIS). Ron authored and coauthored over 200 papers and 12 books on topics ranging from industrial statistics, customer surveys, multivariate quality control, risk management, biostatistics and statistical methods in healthcare to performance appraisal systems and integrated management models. The KPA Group he formed in 1990 is a leading Israeli firm focused on generating insights through analytics with international customers such as hp, 3M, Teva, Perrigo, Roche, Intel, Amdocs, Stratasys, Israel Aircraft Industries, the Israel Electricity Corporation, ICL, start‐ups, banks, and healthcare providers. He was awarded the 2013 Greenfield Medal by the Royal Statistical Society in recognition for excellence in contributions to the applications of statistics. Among his many activities he is member of the National Public Advisory Council for Statistics Israel; member of the Executive Academic Council, Wingate Academic College; and board member of several pharmaceutical and Internet product companies.

Galit Shmueli is distinguished professor at National Tsing Hua University’s Institute of Service Science. She is known for her research and teaching in business analytics, with a focus on statistical and data mining methods in information systems and healthcare. She has authored and coauthored over 70 journal articles, book chapters, books, and textbooks, including Data Mining for Business Analytics, Modeling Online Auctions and Getting Started with Business Analytics. Her research is published in top journals in statistics, management, marketing, information systems, and more. Professor Shmueli has designed and instructed business analytics courses and programs since 2004 at the University of Maryland, the Indian School of Business, Statistics.com, and National Tsing Hua University, Taiwan. She has also taught engineering statistics courses at the Israel Institute of Technology and at Carnegie Mellon University.

Preface

This book is about a strategic and tactical approach to data analysis in which providing added value by turning numbers into insights is the main goal of an empirical study. In our long‐time experience as applied statisticians and data mining researchers (“data scientists”), we have focused on developing methods for data analysis and applying them to real problems. Our experience has been, however, that data analysis is part of a bigger process: one that begins with problem elicitation, in which unstructured problems are defined, and ends with decisions on action items and interventions that reflect the true impact of a study.

In 2006, the first author published a paper on the statistical education bias where, typically, in courses on statistics and data analytics, only statistical methods are taught, without reference to the statistical analysis process (Kenett and Thyregod, 2006).

In 2010, the second author published a paper showing the differences between statistical modeling aimed at prediction goals versus modeling designed to explain causal effects (Shmueli, 2010), the implication being that the goal of a study should affect the way a study is performed, from data collection to data pre‐processing, exploration, modeling, validation, and deployment. A related paper (Shmueli and Koppius, 2011) focused on the role of predictive analytics in theory building and scientific development in the explanatory‐dominated social sciences and management research fields.

In 2014, we published “On Information Quality” (Kenett and Shmueli, 2014), a paper designed to lay out the foundation for a holistic approach to data analysis (using statistical modeling, data mining approaches, or any other data analysis methods) by structuring the main ingredients of what turns numbers into information. We called the approach information quality (InfoQ) and identified four InfoQ components and eight InfoQ dimensions.

Our main thesis is that data analysis, and especially the fields of statistics and data science, need to adapt to modern challenges and technologies by developing structured methods that provide a broad life cycle view, that is, from numbers to insights. This life cycle view needs to be focused on generating InfoQ as a key objective (for more on this see Kenett, 2015).

This book, Information Quality: The Potential of Data and Analytics to Generate Knowledge, offers an extensive treatment of InfoQ and the InfoQ framework. It is aimed at motivating researchers to further develop InfoQ elements and at students in programs that teach them how to make sure their analytic or statistical work is generating information of high quality.

Addressing this mixed community has been a challenge. On the one hand, we wanted to provide academic considerations, and on the other hand, we wanted to present examples and cases that motivate students and practitioners and give them guidance in their own specific projects.

We try to achieve this mix of objectives by combining Part I, which is mostly methodological, with Part II which is based on examples and case studies.

In Part III, we treat additional topics relevant to InfoQ such as reproducible research, the review of scientific and applied research publications, the incorporation of InfoQ in academic and professional development programs, and how three leading software platforms, R, MINITAB, and JMP support InfoQ implementations.

Researchers interested in applied statistics methods and strategies will most likely start in Part I and then move to Part II to see illustrations of the InfoQ framework applied in different domains. Practitioners and students learning how to turn numbers into information can start in a relevant chapter of Part II and move back to Part I.

A teacher or designer of a course on data analytics, applied statistics, or data science can build on examples in Part II and consolidate the approach by covering Chapter 13 and the chapters in Part I. Chapter 13 on “Integrating InfoQ into data science analytics programs, research methods courses and more” was specially prepared for this audience. We also developed five case studies that can be used by teachers as a rating‐based InfoQ assessment exercise (available at http://infoq.galitshmueli.com/class‐assignment).

In developing InfoQ, we received generous inputs from many people. In particular, we would like to acknowledge insightful comments by Sir David Cox, Shelley Zacks, Benny Kedem, Shirley Coleman, David Banks, Bill Woodall, Ron Snee, Peter Bruce, Shawndra Hill, Christine Anderson Cook, Ray Chambers, Fritz Sheuren, Ernest Foreman, Philip Stark, and David Steinberg. The motivation to apply InfoQ to the review of papers (Chapter 12) came from a comment by Ross Sparks who wrote to us: “I really like your framework for evaluating information quality and I have started to use it to assess papers that I am asked to review. Particularly applied papers.” In preparing the material, we benefited from comprehensive editorial inputs by Raquelle Azran and Noa Shmueli who generously provided us their invaluable expertise—we would like to thank them and recognize their help in improving the text language and style.

The last three chapters were contributed by colleagues. They create a bridge between theory and practice showing how InfoQ is supported by R, MINITAB, and JMP. We thank the authors of these chapters, Silvia Salini, Federica Cugnata, Elena Siletti, Ian Cox, Pere Grima, Lluis Marco‐Almagro, and Xavier Tort‐Martorell, for their effort, which helped make this work both theoretical and practical.

We are especially thankful to Professor David J. Hand for preparing the foreword of the book. David has been a source of inspiration to us for many years and his contribution highlights the key parts of our work.

In the course of writing this book and developing the InfoQ framework, the first author benefited from numerous discussions with colleagues at the University of Turin, in particular with a great visionary of the role of applied statistics in modern business and industry, the late Professor Roberto Corradetti. Roberto has been a close friend and has greatly influenced this work by continuously emphasizing the need for statistical work to be appreciated by its customers in business and industry. In addition, the financial support of the Diego de Castro Foundation that he managed has provided the time to work in a stimulating academic environment at both the Faculty of Economics and the “Giuseppe Peano” Department of Mathematics of UNITO, the University of Turin. The contributions of Roberto Corradetti cannot be underestimated and are humbly acknowledged. Roberto passed away in June 2015 and left behind a great void. The second author thanks participants of the 2015 Statistical Challenges in eCommerce Research Symposium, where she presented the keynote address on InfoQ, for their feedback and enthusiasm regarding the importance of the InfoQ framework to current social science and management research.

Finally, we acknowledge with pleasure the professional help of the Wiley personnel, including Heather Kay, Alison Oliver, and Adalfin Jayasingh, and thank them for their encouragement, comments, and input, which were instrumental in improving the form and content of the book.

Ron S. Kenett and Galit Shmueli

References

Kenett, R.S. (2015) Statistics: a life cycle view (with discussion). Quality Engineering, 27(1), pp. 111–129.

Kenett, R.S. and Shmueli, G. (2014) On information quality (with discussion). Journal of the Royal Statistical Society, Series A, 177(1), pp. 3–38.

Kenett, R.S. and Thyregod, P. (2006) Aspects of statistical consulting not taught by academia. Statistica Neerlandica, 60(3), pp. 396–412.

Shmueli, G. (2010) To explain or to predict? Statistical Science, 25, pp. 289–310.

Shmueli, G. and Koppius, O.R. (2011) Predictive analytics in information systems research. MIS Quarterly, 35(3), pp. 553–572.

Quotes about the book

What experts say about Information Quality: The Potential of Data and Analytics to Generate Knowledge:

A glance at the statistics shelves of any technical library will reveal that most books focus narrowly on the details of data analytic methods. The same is true of almost all statistics teaching. This volume will help to rectify that oversight. It will provide readers with insight into and understanding of other key parts of empirical analysis, parts which are vital if studies are to yield valid, accurate, and useful conclusions.

David Hand

Imperial College, London, UK

There is an important distinction between data and information. Data become information only when they serve to inform, but what is the potential of data to inform? With the work Kenett and Shmueli have done, we now have a general framework to answer that question. This framework is relevant to the whole analysis process, showing the potential to achieve higher‐quality information at each step.

John Sall

SAS Institute, Cary, NC, USA

The authors have a rare quality: being able to present deep thoughts and sound approaches in a way practitioners can feel comfortable and understand when reading their work and, at the same time, researchers are compelled to think about how they do their work.

Fabrizio Ruggeri

Consiglio Nazionale delle Ricerche, Istituto di Matematica Applicata e Tecnologie Informatiche, Milan, Italy

No amount of technique can make irrelevant data fit for purpose, eliminate unknown biases, or compensate for data paucity. Useful, reliable inferences require balancing real‐world and theoretical considerations and recognizing that goals, data, analysis, and costs are necessarily connected. Too often, books on statistics and data analysis put formulae in the limelight at the expense of more important questions about the relevance and limitations of data and the purpose of the analysis. This book elevates these crucial issues to their proper place and provides a systematic structure (and examples) to help practitioners see the larger context of statistical questions and, thus, to do more valuable work.

Philip Stark

University of California, Berkeley, USA

…the “Q” issue is front and centre for anyone (or any agency) hoping to benefit from the data tsunami that is said to be driving things now … And so the book will be very timely.

Ray Chambers

University of Wollongong, Australia

Kenett and Shmueli shed light on the biggest contributor to erroneous conclusions in research ‐ poor information quality coming out of a study. This issue ‐ made worse by the advent of Big Data ‐ has received too little attention in the literature and the classroom. Information quality issues can completely undermine the utility and credibility of a study, yet researchers typically deal with it in an ad‐hoc, offhand fashion, often when it is too late. Information Quality offers a sensible framework for ensuring that the data going into a study can effectively answer the questions being asked.

Peter Bruce

The Institute for Statistics Education

Policy makers rely on high quality and relevant data to make decisions and it is important that, as more and different types of data become available, we are mindful of all aspects of the quality of the information provided. This includes not only statistical quality, but other dimensions as outlined in this book including, very importantly, whether the data and analyses answer the relevant questions

John Pullinger

National Statistician, UK Statistics Authority, London, UK

This impressive book fills a gap in the teaching of statistical methodology. It deals with a neglected topic in statistical textbooks: the quality of the information provided by the producers of statistical projects and used by the customers of statistical data from surveys, administrative data etc. The emphasis in the book on: defining, discussing, analyzing the goal of the project at a preliminary stage and not less important at the analysis stage and use of the results obtained is of a major importance.

Moshe Sikron

Former Government Statistician of Israel, Jerusalem, Israel

Ron Kenett and Galit Shmueli belong to a class of practitioners who go beyond methodological prowess into questioning what purpose should be served by a data based analysis, and what could be done to gauge the fitness of the analysis to meet its purpose. This kind of insight is all the more urgent given the present climate of controversy surrounding science’s own quality control mechanism. In fact science used in support to economic or policy decision – be it natural or social science ‐ has an evident sore point precisely in the sort of statistical and mathematical modelling where the approach they advocate – Information Quality or InfoQ – is more needed. A full chapter is specifically devoted to the contribution InfoQ can make to clarify aspect of reproducibility, repeatability, and replicability of scientific research and publications. InfoQ is an empirical and flexible construct with practically infinite application in data analysis. In a context of policy, one can deploy InfoQ to compare different evidential bases pro or against a policy, or different options in an impact assessment case. InfoQ is a holistic construct encompassing the data, the method and the goal of the analysis. It goes beyond the dimensions of data quality met in official statistics and resemble more holistic concepts of performance such as analysis pedigrees (NUSAP) and sensitivity auditing. Thus InfoQ includes consideration of analysis’ Generalizability and Action Operationalization. The latter include both action operationalization (to what extent concrete actions can be derived from the information provided by a study) and construct operationalization (to what extent a construct under analysis is effectively captured by the selected variables for a given goal). A desirable feature of InfoQ is that it demands multidisciplinary skills, which may force statisticians to move out of their comfort zone into the real world. The book illustrates the eight dimensions of InfoQ with a wealth of examples. A recommended read for applied statisticians and econometricians who care about the implications of their work.

Andrea Saltelli

European Centre for Governance in Complexity

Kenett and Shmueli have made a significant contribution to the profession by drawing attention to what is frequently the most important but overlooked aspect of analytics; information quality. For example, statistics textbooks too often assume that data consist of random samples and are measured without error, and data science competitions implicitly assume that massive data sets contain high‐quality data and are exactly the data needed for the problem at hand. In reality, of course, random samples are the exception rather than the rule, and many data sets, even very large ones, are not worth the effort required to analyze them. Analytics is akin to mining, not to alchemy; the methods can only extract what is there to begin with. Kenett and Shmueli made clear the point that obtaining good data typically requires significant effort. Fortunately, they present metrics to help analysts understand the limitations of the information in hand, and how to improve it going forward. Kudos to the authors for this important contribution.

Roger Hoerl

Union College, Schenectady, NY USA

About the companion website

Don’t forget to visit the companion website for this book:

www.wiley.com/go/information_quality

Here you will find valuable material designed to enhance your learning, including:

The JMP add‐in presented in Chapter 16

Five case studies that can be used as exercises of InfoQ assessment

A set of presentations on InfoQ

Scan this QR code to visit the companion website.

Part I: THE INFORMATION QUALITY FRAMEWORK

1 Introduction to information quality

1.1 Introduction

Suppose you are conducting a study on online auctions and consider purchasing a dataset from eBay, the online auction platform, for the purpose of your study. The data vendor offers you four options that are within your budget:

Data on all the online auctions that took place in January 2012

Data on all the online auctions, for cameras only, that took place in 2012

Data on all the online auctions, for cameras only, that will take place in the next year

Data on a random sample of online auctions that took place in 2012

Which option would you choose? Perhaps none of these options are of value? Of course, the answer depends on the goal of the study. But it also depends on other considerations such as the analysis methods and tools that you will be using, the quality of the data, and the utility that you are trying to derive from the analysis. In the words of David Hand (2008):

Statisticians working in a research environment… may well have to explain that the data are inadequate to answer a particular question.

While those experienced with data analysis will find this dilemma familiar, the statistics and related literature do not provide guidance on how to approach this question in a methodical fashion and how to evaluate the value of a dataset in such a scenario.

Statistics, data mining, econometrics, and related areas are disciplines that are focused on extracting knowledge from data. They provide a toolkit for testing hypotheses of interest, predicting new observations, quantifying population effects, and summarizing data efficiently. In these empirical fields, measurable data is used to derive knowledge. Yet a clean, exact, and complete dataset, even when analyzed professionally, might contain no useful information for the problem under investigation. In contrast, a very “dirty” dataset, with missing values and incomplete coverage, can contain useful information for some goals. In some cases, available data can even be misleading (Patzer, 1995, p. 14):

Data may be of little or no value, or even negative value, if they misinform.

The focus of this book is on assessing the potential of a particular dataset for achieving a given analysis goal by employing data analysis methods and considering a given utility. We call this concept information quality (InfoQ). We propose a formal definition of InfoQ and provide guidelines for its assessment. Our objective is to offer a general framework that applies to empirical research. Such an element has not received much attention in the body of knowledge of the statistics profession and can be considered a contribution to both the theory and the practice of applied statistics (Kenett, 2015).

A framework for assessing InfoQ is needed both when designing a study to produce findings of high InfoQ and at the postdesign stage, after the data has been collected. Questions regarding the value of data to be collected, or of data that has already been collected, have important implications both in academic research and in practice. With this motivation in mind, we construct the concept of InfoQ and then operationalize it so that it can be implemented in practice.

In this book, we address and tackle a high‐level issue at the core of any data analysis. Rather than concentrate on a specific set of methods or applications, we consider a general concept that underlies any empirical analysis. The InfoQ framework therefore contributes to the literature on statistical strategy, also known as metastatistics (see Hand, 1994).

1.2 Components of InfoQ

Our definition of InfoQ involves four major components that are present in every data analysis: an analysis goal, a dataset, an analysis method, and a utility (Kenett and Shmueli, 2014). The discussion and assessment of InfoQ require examining and considering the complete set of its components as well as the relationships between the components. In such an evaluation we also consider eight dimensions that deconstruct the InfoQ concept. These dimensions are presented in Chapter 3. We start our introduction of InfoQ by defining each of its components.

Before describing each of the four InfoQ components, we introduce the following notation and definitions to help avoid confusion:

g denotes a specific analysis goal.

X denotes the available dataset.

f is an empirical analysis method.

U is a utility measure.

We use subscript indices to indicate alternatives. For example, to convey K different analysis goals, we use g1, g2,…, gK; J different methods of analysis are denoted f1, f2,…, fJ.

Following Hand’s (2008) definition of statistics as “the technology of extracting meaning from data,” we can think of the InfoQ framework as one for evaluating the application of a technology (data analysis) to a resource (data) for a given purpose.

1.2.1 Goal (g)

Data analysis is used for a variety of purposes in research and in industry. The term “goal” can refer to two goals: the high‐level goal of the study (the “domain goal”) and the empirical goal (the “analysis goal”). One starts from the domain goal and then converts it into an analysis goal. A classic example is translating a hypothesis driven by a theory into a set of statistical hypotheses.

There are various classifications of study goals; some classifications span both the domain and analysis goals, while other classification systems focus on describing different analysis goals.

One classification approach divides the domain and analysis goals into three general classes: causal explanation, empirical prediction, and description (see Shmueli, 2010; Shmueli and Koppius, 2011). Causal explanation is concerned with establishing and quantifying the causal relationship between inputs and outcomes of interest. Lab experiments in the life sciences are often intended to establish causal relationships. Academic research in the social sciences is typically focused on causal explanation. In the social science context, the causality structure is based on a theoretical model that establishes the causal effect of some constructs (abstract concepts) on other constructs. The data collection stage is therefore preceded by a construct operationalization stage, where the researcher establishes which measurable variables can represent the constructs of interest. An example is investigating the causal effect of parents’ intelligence on their children’s intelligence. The construct “intelligence” can be measured in various ways, such as via IQ tests. The goal of empirical prediction differs from causal explanation. Examples include forecasting future values of a time series and predicting the output value for new observations given a set of input variables. Examples include recommendation systems on various websites, which are aimed at predicting services or products that the user is most likely to be interested in. Predictions of the economy are another type of predictive goal, with forecasts of particular economic measures or indices being of interest. Finally, descriptive goals include quantifying and testing for population effects by using data summaries, graphical visualizations, statistical models, and statistical tests.

A different, but related goal classification approach (Deming, 1953) introduces the distinction between enumerative studies, aimed at answering the question “how many?,” and analytic studies, aimed at answering the question “why?”

A third classification (Tukey, 1977) distinguishes between exploratory and confirmatory data analysis.

Our use of the term “goal” includes all these different types of goals and goal classifications. For examples of such goals in the context of customer satisfaction surveys, see Chapter 7 and Kenett and Salini (2012).

1.2.2 Data (X)

Data is a broadly defined term that includes any type of data intended to be used in the empirical analysis. Data can arise from different collection instruments: surveys, laboratory tests, field experiments, computer experiments, simulations, web searches, mobile recordings, observational studies, and more. Data can be primary, collected specifically for the purpose of the study, or secondary, collected for a different reason. Data can be univariate or multivariate, discrete, continuous, or mixed. Data can contain semantic unstructured information in the form of text, images, audio, and video. Data can have various structures, including cross‐sectional data, time series, panel data, networked data, geographic data, and more. Data can include information from a single source or from multiple sources. Data can be of any size (from a single observation in case studies to “big data” with zettabytes) and any dimension.

1.2.3 Analysis (f)

We use the general term data analysis to encompass any empirical analysis applied to data. This includes statistical models and methods (parametric, semiparametric, nonparametric, Bayesian and classical, etc.), data mining algorithms, econometric models, graphical methods, and operations research methods (such as simplex optimization). Methods can be as simple as summary statistics or complex multilayer models, computationally simple or computationally intensive.

1.2.4 Utility (U)

The extent to which the analysis goal is achieved is typically measured by some performance measure. We call this measure “utility.” As with the study goal, utility refers to two dimensions: the utility from the domain point of view and its operationalization as a measurable utility measure. As with the goal, the linkage between the domain utility and the analysis utility measure should be properly established so that the analysis utility can be used to draw inferences about the domain utility.

In predictive studies, popular utility measures are predictive accuracy, lift, and expected cost per prediction. In descriptive studies, utility is often assessed based on goodness‐of‐fit measures. In causal explanatory modeling, statistical significance, statistical power, and strength‐of‐fit measures (e.g., R2) are common.
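To make these measures concrete, the following short R snippet (a toy illustration with simulated values, not data from any study in this book) computes predictive accuracy and top-decile lift for a hypothetical classifier, and R-squared for a hypothetical explanatory regression:

set.seed(1)
actual <- rbinom(200, 1, 0.3)                    # observed binary outcomes
score  <- 0.5 * runif(200) + 0.4 * actual        # toy predicted scores
pred   <- as.numeric(score > 0.5)                # classify with a 0.5 cutoff

accuracy <- mean(pred == actual)                 # predictive accuracy

top20  <- order(score, decreasing = TRUE)[1:20]  # top 10% of the 200 scored cases
lift10 <- mean(actual[top20]) / mean(actual)     # lift in the top decile

x <- rnorm(200)
y <- 2 * x + rnorm(200)
r_squared <- summary(lm(y ~ x))$r.squared        # strength-of-fit measure (R-squared)

c(accuracy = accuracy, lift10 = lift10, r_squared = r_squared)

Which of these numbers serves as the utility U depends, again, on whether the goal g is predictive, descriptive, or explanatory.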

1.3 Definition of information quality

Following Hand’s (2008) definition of statistics as “the technology of extracting meaning from data,” we consider the utility of applying a technology f to a resource X for a given purpose g. In particular, we focus on the question: What is the potential of a particular dataset to achieve a particular goal using a given data analysis method and utility? To formalize this question, we define the concept of InfoQ as
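InfoQ(f, X, g) = U(f(X | g)),

namely, the utility U resulting from applying the analysis method f to the dataset X, conditioned on the goal g (the formulation used in Kenett and Shmueli, 2014).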

The quality of information, InfoQ, is determined by the quality of its components g (“quality of goal definition”), X (“data quality”), f (“analysis quality”), and U (“quality of utility measure”) as well as by the relationships between them. (See Figure 1.1 for a visual representation of InfoQ components.)

Figure 1.1 The four InfoQ components.

1.4 Examples from online auction studies

Let us recall the four options of eBay datasets we described at the beginning of the chapter. In order to evaluate the InfoQ of each of these datasets, we would have to specify the study goal, the intended data analysis, and the utility measure.

To better illustrate the role that the different components play, let us examine four studies in the field of online auctions, each using data to address a particular goal.

Case study 1 Determining factors affecting the final price of an auction

Econometricians are interested in determining factors that affect the final price of an online auction. Although game theory provides an underlying theoretical causal model of price in offline auctions, the online environment differs in substantial ways. Online auction platforms such as eBay.com have lowered the entry barrier for sellers and buyers to participate in auctions. Auction rules and settings can differ from classic on‐ground auctions, and so can dynamics between bidders.

Let us examine the study “Public versus Secret Reserve Prices in eBay Auctions: Results from a Pokémon Field Experiment” (Katkar and Reiley, 2006) which investigated the effect of two types of reserve prices on the final auction price. A reserve price is a value that is set by the seller at the start of the auction. If the final price does not exceed the reserve price, the auction does not transact. On eBay, sellers can choose to place a public reserve price that is visible to bidders or an invisible secret reserve price, where bidders see only that there is a reserve price but do not know its value.

STUDY GOAL (g)

The researchers’ goal is stated as follows:

We ask, empirically, whether the seller is made better or worse off by setting a secret reserve above a low minimum bid, versus the option of making the reserve public by using it as the minimum bid level.

This question is then converted into the statistical goal (g) of testing a hypothesis “that secret reserve prices actually do produce higher expected revenues.”

DATA (X)

The researchers proceed by setting up auctions for Pokémon cards1 on eBay.com and auctioning off 50 matched pairs of Pokémon cards, half with secret reserves and half with equivalently high public minimum bids. The resulting dataset included information about bids, bidders, and the final price in each of the 100 auctions, as well as whether the auction had a secret or public reserve price. The dataset also included information about the sellers’ choices, such as the start and close time of each auction, the shipping costs, etc. This dataset constitutes X.

DATA ANALYSIS (f)

The researchers decided to “measure the effects of a secret reserve price (relative to an equivalent public reserve) on three different dependent variables: the probability of the auction resulting in a sale, the number of bids received, and the price received for the card in the auction.” This was done via linear regression models (f). For example, the sale/no sale outcome was regressed on the type of reserve (public/private) and other control variables, and the statistical significance of the reserve variable was examined.
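A minimal sketch in R of this type of regression analysis, using simulated data and hypothetical variable names (reserve_type, shipping_cost, sold, price) rather than the authors’ actual dataset or model specification:

set.seed(1)
n <- 100
# Simulate 100 auctions: half with a secret reserve, half with a public reserve
auctions <- data.frame(
  reserve_type  = rep(c("secret", "public"), each = n / 2),
  shipping_cost = round(runif(n, 1, 4), 2)
)
auctions$sold  <- rbinom(n, 1, ifelse(auctions$reserve_type == "secret", 0.55, 0.70))
auctions$price <- ifelse(auctions$sold == 1,
                         10 + 0.5 * auctions$shipping_cost -
                           0.6 * (auctions$reserve_type == "secret") + rnorm(n),
                         NA)

# Sale/no sale outcome regressed on reserve type plus a control variable
fit_sale  <- lm(sold ~ reserve_type + shipping_cost, data = auctions)
# Final price (sold auctions only) regressed on reserve type plus a control
fit_price <- lm(price ~ reserve_type + shipping_cost, data = auctions)

summary(fit_sale)$coefficients   # sign and p-value of the reserve_type effect
summary(fit_price)$coefficients  # magnitude of the estimated price difference

The statistical significance and magnitude of the reserve_type coefficients then play the role of the utility measure discussed next.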

UTILITY (U)

The authors conclude “The average drop in the probability of sale when using a secret reserve is statistically significant.” Using another linear regression model with price as the dependent variable, statistical significance (the p‐value) of the regression coefficient was used to test the presence of an effect for a private or public reserve price, and the regression coefficient value was used to quantify the magnitude of the effect, concluding that “a secret‐reserve auction will generate a price $0.63 lower, on average, than will a public‐reserve auction.” Hence, the utility (U) in this study relies mostly on statistical significance and p‐values as well as the practical interpretation of the magnitude of a regression coefficient.

INFOQ COMPONENTS EVALUATION

What is the quality of the information contained in this study’s dataset for testing the effect of private versus public reserve price on the final price, using regression models and statistical significance? The authors compare the advantages of their experimental design for answering their question of interest with designs of previous studies using observational data:

With enough [observational] data and enough identifying econometric assumptions, one could conceivably tease out an empirical measurement of the reserve price effect from eBay field data… Such structural models make strong identifying assumptions in order to recover economic unobservables (such as bidders’ private information about the item’s value)… In contrast, our research project is much less ambitious, for we focus only on the effect of secret reserve prices relative to public reserve prices (starting bids). Our experiment allows us to carry out this measurement in a manner that is as simple, direct, and assumption‐free as possible.

In other words, with a simple two‐level experiment, the authors aim to answer a specific research question (g1) in a robust manner, rather than build an extensive theoretical economic model (g2) that is based on many assumptions.

Interestingly, when comparing their conclusions against prior literature on the effect of reserve prices in a study that used observational data, the authors mention that they find an opposite effect:

Our results are somewhat inconsistent with those of Bajari and Hortaçsu…. Perhaps Bajari and Hortaçsu have made an inaccurate modeling assumption, or perhaps there is some important difference between bidding for coin sets and bidding for Pokémon cards.

This discrepancy even leads the researchers to propose a new dataset that can help tackle the original goal with less confounding:

A new experiment, auctioning one hundred items each in the $100 range, for example could shed some important light on this question.

This means that the InfoQ of the Pokémon card auction dataset is considered lower than that of a dataset of auctions for more expensive items.

1 The Pokémon trading card game was one of the largest collectible toy crazes of 1999 and 2000. Introduced in early 1999, Pokémon game cards appeal both to game players and to collectors. Source: Katkar and Reiley (2006). © National Bureau of Economic Research.

Case study 2 Predicting the final price of an auction at the start of the auction