A comprehensive guide to automated statistical data cleaning
The production of clean data is a complex and time-consuming process that requires both technical know-how and statistical expertise. Statistical Data Cleaning brings together a wide range of techniques for cleaning textual, numeric or categorical data. This book examines technical data cleaning methods relating to data representation and data structure. A prominent role is given to statistical data validation, data cleaning based on predefined restrictions, and data cleaning strategy.
This book enables data scientists and statistical analysts working with data to deepen their understanding of data cleaning and to upgrade their practical data cleaning skills. It can also be used as material for a course in data cleaning and analysis.
Cover
Title Page
Copyright
Foreword
What You Will Find in this Book
For Whom Is this Book?
Acknowledgments
About the Companion Website
Chapter 1: Data Cleaning
1.1 The Statistical Value Chain
1.2 Notation and Conventions Used in this Book
Chapter 2: A Brief Introduction to R
2.1 R on the Command Line
2.2 Vectors
2.3 Data Frames
2.4 Special Values
2.5 Getting Data into and out of R
2.6 Functions
2.7 Packages Used in this Book
Chapter 3: Technical Representation of Data
3.1 Numeric Data
3.2 Text Data
3.3 Times and Dates
3.4 Notes on Locale Settings
Chapter 4: Data Structure
4.1 Introduction
4.2 Tabular Data
4.3 Matrix Data
4.4 Time Series
4.5 Graph Data
4.6 Web Data
4.7 Other Data
4.8 Tidying Tabular Data
Chapter 5: Cleaning Text Data
5.1 Character Normalization
5.2 Pattern Matching with Regular Expressions
5.3 Common String Processing Tasks in R
5.4 Approximate Text Matching
Chapter 6: Data Validation
6.1 Introduction
6.2 A First Look at the validate Package
6.3 Defining Data Validation
6.4 A Formal Typology of Data Validation Functions
Chapter 7: Localizing Errors in Data Records
7.1 Error Localization
7.2 Error Localization with R
7.3 Error Localization as MIP-Problem
7.4 Numerical Stability Issues
7.5 Practical Issues
7.6 Conclusion
Appendix 7.A: Derivation of Eq. (7.33)
Chapter 8: Rule Set Maintenance and Simplification
8.1 Quality of Validation Rules
8.2 Rules in the Language of Logic
8.3 Rule Set Issues
8.4 Detection and Simplification Procedure
8.5 Conclusion
Chapter 9: Methods Based on Models for Domain Knowledge
9.1 Correction with Data Modifying Rules
9.2 Rule-Based Correction with dcmodify
9.3 Deductive Correction
Chapter 10: Imputation and Adjustment
10.1 Missing Data
10.2 Model-Based Imputation
10.3 Model-Based Imputation in R
10.4 Donor Imputation with R
10.5 Other Methods in the simputation Package
10.6 Imputation Based on the EM Algorithm
10.7 Sampling Variance under Imputation
10.8 Multiple Imputations
10.9 Analytic Approaches to Estimate Variance of Imputation
10.10 Choosing an Imputation Method
10.11 Constraint Value Adjustment
Chapter 11: Example: A Small Data-Cleaning System
11.1 Setup
11.2 Monitoring Changes in Data
11.3 Integration and Automation
References
Index
End User License Agreement
Chapter 1: Data Cleaning
Figure 1.1 Part of a statistical value chain, showing five different levels of statistical value going from raw data to statistical product.
Chapter 3: Technical Representation of Data
Figure 3.1 Terminology of the Unicode Character Encoding Model (UCEM, Whistler et al. (2008)) and terms used in this book. The fine-grained mappings defined in the Unicode standard are typeset in italics. Terminology used in this book is printed in bold. In Unicode, a character encoding system is defined in three steps, where the first step (definition of a character repertoire connected to code points) is fixed in the standard. The ensuing translation steps may be defined by programmers conforming to the standard, or one can use one of the mappings developed by the Unicode consortium, such as UTF-8.
Figure 3.2 Percentages of websites using encodings as a function of time. Shown here are three encodings most popular at the time of writing.
Figure 3.3 Relation between TAI, UTC, and POSIX time.
Chapter 4: Data Structure
Figure 4.1 UK lung deaths.
Figure 4.2 Directed graph.
Chapter 5: Cleaning Text Data
Figure 5.1 Results of exact or inexact matching against a lookup table with raw and normalized text data. The combination of normalization and approximate matching works best.
Figure 5.2 Hierarchical clustering (‘single’ method) of five strings based on the optimal string alignment distance matrix.
Figure 5.3 Benchmark of times to compute string distance measures in the stringdist package (smaller is faster). In this benchmark, the lower triangular of distance matrix was computed 250 times using stringdist::stringdistmatrix. The durations were measured with the microbenchmark package. The strings consisted of randomly generated names of lengths varying between 8 and 28 characters.
Figure 5.4 Benchmarks of computing string distance measures in the stringdist package (smaller is faster). In this benchmark, lower triangular distance matrices were computed between random strings of lengths varying from 8 to 256 characters. The lines represent median relative times over 10 repetitions. Results have been normalized to the maximum median time. Results for the soundex distance were not plotted as it nearly coincides with that of the Hamming distance.
Chapter 6: Data Validation
Figure 6.1 The various times involved in a measurement process. A population member exists over a certain period. At the time of measurement, a value is observed that pertains to a reference period, which may in principle lie before, within, or after the period over which the member exists. Also, instead of a reference period, one may choose a single moment in time.
Figure 6.2 Defining rules and their properties in YAML format.
Figure 6.3 Free-form and YAML-structured rule definitions can be mixed in a single file.
Chapter 7: Localizing Errors in Data Records
Figure 7.1 (a) In gray, the area of valid combinations according to the rule set of Eq. (7.36); black dots represent valid combinations and crosses represent invalid combinations. One of the coordinate axes is part of the valid area. (b) The valid area as defined by two restrictions with an appropriate choice of constant.
Figure 7.2 (a) In gray, the area of valid combinations according to the rule set of Eq. (7.39). Black dots represent valid combinations, and crosses represent invalid combinations. One of the coordinate axes is part of the valid area. (b) The valid area as defined by three restrictions with appropriate choices of constants.
Chapter 9: Methods Based on Models for Domain Knowledge
Figure 9.1 Idempotency of a modifying function. The full rectangle represents the measurement domain. The part of the rectangle outside the shaded region is valid according to the condition of the modifying rule; the shaded region is considered invalid. The modifying function maps an invalid record to a valid one and a valid record to itself.
Figure 9.2 Commutativity of idempotent modifying functions. The full rectangle, including the two marked subsets, corresponds to the measurement domain. The two idempotent modifying functions commute on one of the subsets but not on the other.
Chapter 10: Imputation and Adjustment
Figure 10.1 Percentages of missing values per variable (a) and occurrence of missing data patterns (b) in the retailers dataset, plotted with VIM::aggr.
Figure 10.2 Parallel boxplots, comparing the distribution of staff conditional on the missingness of other variables.
Figure 10.3 Marginplot of other.rev against staff.
Figure 10.4 Three ρ- and ψ-functions used in M-estimation (plotted so that the scales of the axes are comparable). For comparison, the traditional least squares function is plotted with dotted lines.
Figure 10.5 (a) A decision tree for estimating number of staff in the retailers dataset, computed with rpart. (b) The space partitioning. The vertical bars indicate the limits represented in the top node and the subdivision indicated by its left child node. In the middle region, white dots indicate records where size=="sc2".
Figure 10.6 Estimation using multiple imputation.
Figure 10.7 Observed and multiple imputed staff numbers and turnovers. The data is taken from the retailers dataset of the validate package. Imputations are generated with Amelia::amelia.
Figure 10.8 Result of a call to Amelia::overimpute(out, var="staff"). Vertical bars indicate 90% confidence intervals. A perfect model would have all points on the diagonal line. The colors indicate the fraction of missing values in the record to which each plotted value pertains.
Figure 10.9 Valid regions (shaded) for restrictions (10.20) and (10.21). The black dots represent the data points.
Chapter 11: Example: A Small Data-Cleaning System
Figure 11.1 The validation rules in rules.txt.
Figure 11.2 Modifying rules in modify.txt.
Figure 11.3 The retailers data after error localization.
Figure 11.4 A classification of cell status when comparing a dataset after some processing with the dataset before processing. The ‘total’ is equal to the number of cells in a dataset.
Figure 11.5 A classification of changes in rule violation status before and after a data-cleaning step.
Figure 11.6 Progression of the number of violations and the number of unverifiable validations as a function of the step number in the data-cleaning process.
Figure 11.7 The completed data-cleaning script, automated using docopt.
Chapter 3: Technical Representation of Data
Table 3.1 Conventions for integer representation in three-bit representation
Table 3.2 Text constants in R
Table 3.3 Functions opening connections in base R
Table 3.4 ISO 8601 notational formats for calendar dates on the Gregorian calendar
Table 3.5 ISO 8601 notational formats for time of day in the 24-hour system
Table 3.6 A selection of time and date conversion codes accepted by R's time and date conversion functions such as strptime and format
Chapter 4: Data Structure
Table 4.1 Often encountered data types and their structure
Table 4.2 Table ‘Person’ with two records
Table 4.3 Relational operators and their dplyr verb
Table 4.4 Number of lung deaths in “messy” format
Table 4.5 Number of lung deaths in “tidy” format
Chapter 5: Cleaning Text Data
Table 5.1 Character ranges in the POSIX standard (version 2), with ranges corresponding to symbols of the ASCII alphabet
Table 5.2 Extended regular expression notation for repetition operators
Table 5.3 A selection of basic text processing functions in base R and in stringi.
Table 5.4 Edit-based string distances
Table 5.5 Methods supported by the stringdist package
Chapter 6: Data Validation
Table 6.1 Classification of validation functions, based on whether the data being validated come from a single or from multiple domains, times of measurement, statistical objects, or variables
Chapter 7: Localizing Errors in Data Records
Table 7.1 Numerical parameters for MIP-based error localization
Chapter 10: Imputation and Adjustment
Table 10.1 Number of multiple imputations m, as recommended by Graham et al. (2007)
Table 10.2 A classification of imputation methods with examples
Mark van der Loo
Statistics Netherlands, The Netherlands
Edwin de Jonge
Statistics Netherlands, The Netherlands
This edition first published 2018
© 2018 John Wiley and Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Mark van der Loo and Edwin de Jonge to be identified as the authors of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
9600 Garsington Road, Oxford, OX4 2DQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Loo, Mark van der, 1976- author.| Jonge, Edwin de, 1972- author.
Title: Statistical data cleaning with applications in R / by Mark van der Loo, Edwin de Jonge
Description: Hoboken, NJ : John Wiley & Sons, 2018. | Includes bibliographical references and index. |
Identifiers: LCCN 2017049091 (print) | LCCN 2017059014 (ebook) | ISBN 9781118897140 (pdf) | ISBN 9781118897133 (epub) | ISBN 9781118897157 (cloth)
Subjects: LCSH: Statistics-Data processing. | R (Computer program language)
Classification: LCC QA276.45.R3 (ebook) | LCC QA276.45.R3 J663 2018 (print) | DDC 519.50285/5133-dc23
LC record available at https://lccn.loc.gov/2017049091
Cover design by Wiley
Cover image: © enot-poloskun/Gettyimages; © 3alexd/Gettyimages
Data cleaning is often the most time-consuming part of data analysis. Although it has long been recognized as a separate topic in Official Statistics (where it is called ‘data editing’) and has also been studied in relation to databases, literature aimed at the larger statistical community is limited. This is why, when the publisher invited us to expand our tutorial “An introduction to data cleaning with R”, which we developed for the useR!2013 conference, into a book, we grabbed the opportunity with both hands. On the one hand, we felt that some of the methods developed in the Official Statistics community over the last five decades deserved a wider audience. Perhaps this book can help with that. On the other hand, we hope that this book will help bring some (often pre-existing) techniques into the Official Statistics community, as we move from survey-based data sources to administrative and “big” data sources.
For us, it would also be a nice way to systematize our knowledge and the software we have written on this topic. Looking back, we ended up not only writing this book but also redeveloping and generalizing many of the data cleaning R packages we had written before. One reason for this is that we discovered nice ways to generalize and expand our software and methods; another is that we wished to connect to the recently emerged “tidyverse” style of interfaces to R functionality.
This book contains a selection of topics that we found useful while developing data cleaning (data editing) systems. The range is broad, covering topics from computer science, numerical methods, technical standards, statistics and data modeling, and programming.
This book covers topics in “technical data cleaning”, including conversion and interpretation of numerical, text, and date types. The technical standards related to these data types are also covered in some detail. On the data content side of things, topics include data validation (data checking), error localization, various methods for error correction, and missing value imputation.
Wherever possible, the theory discussed in this book is illustrated with executable R code. We have also included exercises throughout the book, which we hope will guide the reader in further understanding both the software and the methods.
The mix of topics reflects both the breadth of the subject and, of course, the interests and expertise of the authors. The list of missing topics is much larger than what is treated, but perhaps the most important omissions are the cleaning of time series and outlier detection.
Readers of this book are expected to have basic knowledge of mathematics and statistics as well as some programming experience. We assume prior knowledge of concepts such as expected value and variance, and of basic calculus and linear algebra. It is beneficial to have at least some knowledge of R, since this is the language used in this book, but for convenience and reference, a short chapter explaining the basics is included.
This book would not have been possible without the work of many others. We would like to thank our colleagues at Statistics Netherlands for fruitful discussions on data validation, imputation, and error localization. Some of the chapters in this book are based on papers and reports written with co-authors. We thank Jeroen Pannekoek, Sander Scholtus, and Jacco Daalmans for their pleasant and fruitful collaboration. We are greatly indebted to the R core team, package developers, and the very supportive R community for their relentless efforts.
Finally, we would like to thank our families for their love and support.
June 2017
Mark and Edwin
Do not forget to visit the companion website for this book:
www.data-cleaning.org
There you will find valuable materials designed to enhance your learning, including:
supplementary materials
The purpose of data cleaning is to bring data up to a level of quality such that it can reliably be used for the production of statistical models or statements. The level of quality needed to create a particular statistical output is determined by a simple cost-benefit question: when is the statistical output fit for use, and how much effort will it cost to bring the data up to that level?
One useful way to get a hold on this question is to think of data analyses in terms of a value chain. A value chain, roughly, consists of a sequence of activities that increase the value of a product step by step. The notion of a statistical value chain has become common in the official statistics community over the past two decades or so, although a single agreed-upon definition seems to be lacking.1 Roughly, then, a statistical value chain is constructed by defining a number of meaningful intermediate data products, for which a chosen set of quality attributes is well described (Renssen and Van Delden 2008). There are many ways to go about this, but for these authors, the picture shown in Figure 1.1 has proven to be fairly generic and useful for organizing thoughts around a statistical production process.
Figure 1.1 Part of a statistical value chain, showing five different levels of statistical value going from raw data to statistical product.
A nice trait of the schema in Figure 1.1 is that it naturally introduces activities that are typically categorized as ‘data cleaning’ into the statistical production process. From the left, we start with raw data. This must be worked up to satisfy (enough) technical standards so it can serve as input for consistency checks, data correction, and imputation procedures. Once this has been achieved, the data may be considered valid (enough) for the production of statistical parameters. These must then still be formatted to be ready for deployment as output.
One should realize that although the schema nicely organizes data analysis activities, in practice, the process is hardly linear. It is more common to clean data, create some aggregates, notice that something is wrong, and go back. The purpose of the value chain is more to keep an overview of where activities take place (e.g., by putting them in separate scripts) than to prescribe a linear order of the actual workflow. In practice, a workflow cycles multiple times through the subsequent stages of the value chain, until the quality of its output is good enough. In the following sections we will discuss each stage in a little more detail.
With raw data, we mean the data as it arrives at the desk of the analyst. The state of such data may of course vary enormously, depending on the data source. In any case, we take it as a given that the person responsible for analysis has little or no influence over how the data was gathered. The first step consists of making the data readily accessible and understandable. To be precise, after the first processing steps we demand that each value in the data is identified with the real-world object it represents (person, company, or something else), that for each value it is known which variable it represents (age, income, etc.), and that each value is stored in the appropriate technical format (number, string, and so on).
Depending on the technical raw data format, the activities necessary to achieve the desired technical format typically include file conversion, string normalization (such as encoding conversion), and standardization and conversion of numerical values. Joining the data against a backbone population register of known statistical objects, possibly using procedures that can handle inexact matches of keys, is also considered a part of such a procedure. These procedures are treated in Chapters 3–5.
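The sketch below illustrates what a few of these steps might look like in R. It is only an illustration under assumptions: the file raw.csv, its columns name and turnover, the assumed latin1 source encoding, and the small register are all hypothetical, and the approximate key matching uses the stringdist package treated in Chapter 5.
# A minimal sketch of typical 'raw data' to 'input data' steps
# (all file and column names below are hypothetical).
library(stringdist)    # approximate string matching (Chapter 5)
raw <- read.csv("raw.csv", stringsAsFactors = FALSE)
# Normalize character encoding to UTF-8 (assuming the source is latin1).
raw$name <- iconv(raw$name, from = "latin1", to = "UTF-8")
# Standardize and convert numeric values: strip thousands separators, then coerce.
raw$turnover <- as.numeric(gsub(",", "", raw$turnover, fixed = TRUE))
# Join against a (toy) register of known statistical objects,
# allowing small spelling differences in the matching key.
register <- data.frame(name = c("ACME Ltd", "Foo Inc"), stringsAsFactors = FALSE)
idx <- amatch(raw$name, register$name, maxDist = 2)
raw$register_name <- register$name[idx]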
Input data are data where each value is stored in the correct type and identified with the variable it represents and the statistical entity it pertains to. In many cases, such a dataset can be represented in tabular format, with rows representing entities and columns representing variables. In the R community, this has come to be known as tidy data (Wickham, 2014b). Here, we leave the format open. Many kinds of data can be usefully represented in the form of a tree or graph (e.g., web pages and XML structures). As long as all the elements are readily identified and of the correct format, a dataset can serve as input data.
Once a dataset is at the level of input data, the treatment of missing values, implausible values, and implausible value combinations must take place. This process is commonly referred to as data editing and imputation. It differs from the previous steps in that it focuses on the consistency of data with respect to domain knowledge. Such domain knowledge can often be expressed as a set of rules, such as age >= 0, mean(profit) > 0, or if (age < 15) has_job == FALSE. A substantial part of this book (Chapters 6–8) is devoted to defining, applying, and maintaining such rules so that data cleaning can be automated and hence executed in a reproducible way. Moreover, in Chapter 7, we look into methodology that allows one to pick out a minimum number of fields in a record that may be altered or imputed such that all the rules can be satisfied. In Chapter 9, we take a formal look at how data modification using knowledge rules can be safely automated, and Chapter 10 treats missing value imputation.
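As a brief preview, such rules can be expressed with the validate package (the subject of Chapter 6); in the sketch below, the toy data frame and its values are invented for illustration.
# Sketch: expressing the example restrictions as validation rules.
library(validate)
rules <- validator(
  age >= 0,
  mean(profit) > 0,
  if (age < 15) has_job == FALSE
)
# A toy dataset with a few deliberate violations.
d <- data.frame(age = c(34, -1, 12),
                profit = c(10, 5, -20),
                has_job = c(TRUE, TRUE, TRUE))
cf <- confront(d, rules)
summary(cf)   # per-rule counts of passes, fails, and missing values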
Data are valid once they are trusted to faithfully represent the variables and objects they are supposed to describe. Making sure that data satisfy the domain knowledge expressed in the form of a set of validation rules is one reproducible way of doing so. Often, this is complemented by some form of expert review, for example, inspection of visualizations or aggregate values by domain experts.
Once data is deemed valid, the statistics can be produced by known modeling and inference techniques. Depending on the preceding data cleaning process, these techniques may need to take those procedures into account, for example, when estimating variance after certain imputation procedures.
Statistics are estimates of the output variables of interest. Often, these are simple aggregates (totals or means), but in principle they can consist of more complex parameters such as regression model coefficients or a trained machine learning model.
Output is where the analysis stops. It is created by taking the statistics and preparing them for dissemination. This may involve technical formatting, for example, to make numbers available through a (web) API, or layout formatting, for example, preparing a report or visualization. In the case of technical formatting, a technical validation step may again be necessary, for example, checking the output against a (JSON or XML) schema. In general, the output of one analyst is raw data for another.
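As a small sketch of such technical output formatting, statistics could, for example, be serialized to JSON with the jsonlite package; the statistic names and values below are invented for illustration.
# Sketch: preparing computed statistics for dissemination as JSON
# (the names and values below are invented for illustration).
library(jsonlite)
stats <- list(
  period        = "2017Q4",
  mean_turnover = 2325.0,
  n_respondents = 60
)
json <- toJSON(stats, auto_unbox = TRUE, pretty = TRUE)
cat(json)
# The resulting string could then be checked against a JSON schema and
# published through a (web) API.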
The topics discussed in this book relate to a variety of subfields in mathematics, logic, statistics, computer science, and programming. This broad range of fields makes coming up with a consistent notation for different variable types and concepts a bit of a challenge, but we have attempted to use a consistent notation throughout.
We follow conventional notation and use ℕ, ℤ, and ℝ to denote the natural, integer, and real numbers. The symbols ∨, ∧, and ¬ stand for logical disjunction, conjunction, and negation, respectively. In the context of logic, ⊕ is used to denote ‘exclusive or’. Sometimes, it is useful to distinguish between a definition and an equality. In these cases, a definition is denoted using ≡.
Vectors are denoted in lowercase bold, usually x. Vectors are column vectors unless noted otherwise. The symbols 1 and 0 denote vectors of which the coefficients are all 1 or all 0, respectively. Matrices are denoted in uppercase bold, usually A. The identity matrix is denoted by I. Transposition is indicated with a superscript T, and matrix multiplication is implied by juxtaposition, for example, Ax. The standard Euclidean norm of a vector x is denoted by ‖x‖, and if another norm is used, this is indicated with a subscript. For example, ‖x‖₁ denotes the L₁ norm of x. In the context of linear algebra, ⊗ and ⊕ denote the direct (tensor) product and the direct sum, respectively.
Random variables are denoted in capitals, usually X. Probabilities and probability densities are both denoted by P. Expected value, variance, and covariance are written as E(X), V(X), and cov(X, Y), respectively. Estimates are accented with a hat; for example, Ê(X) denotes the estimated expected value of X.
R code is presented in fixed width font sections. Output is prefixed with two hashes.
age <- sample(100, 25, replace = TRUE)
mean(age)
## [1] 52.44
