A comprehensive guide to automated statistical data cleaning
The production of clean data is a complex and time-consuming process that requires both technical know-how and statistical expertise. Statistical Data Cleaning brings together a wide range of techniques for cleaning textual, numeric or categorical data. This book examines technical data cleaning methods relating to data representation and data structure. A prominent role is given to statistical data validation, data cleaning based on predefined restrictions, and data cleaning strategy.
This book enables data scientists and statistical analysts working with data to deepen their understanding of data cleaning and to upgrade their practical data cleaning skills. It can also be used as material for a course in data cleaning and analysis.
Cover
Title Page
Copyright
Foreword
What You Will Find in this Book
For Whom Is this Book?
Acknowledgments
About the Companion Website
Chapter 1: Data Cleaning
1.1 The Statistical Value Chain
1.2 Notation and Conventions Used in this Book
Chapter 2: A Brief Introduction to R
2.1 R on the Command Line
2.2 Vectors
2.3 Data Frames
2.4 Special Values
2.5 Getting Data into and out of R
2.6 Functions
2.7 Packages Used in this Book
Chapter 3: Technical Representation of Data
3.1 Numeric Data
3.2 Text Data
3.3 Times and Dates
3.4 Notes on Locale Settings
Chapter 4: Data Structure
4.1 Introduction
4.2 Tabular Data
4.3 Matrix Data
4.4 Time Series
4.5 Graph Data
4.6 Web Data
4.7 Other Data
4.8 Tidying Tabular Data
Chapter 5: Cleaning Text Data
5.1 Character Normalization
5.2 Pattern Matching with Regular Expressions
5.3 Common String Processing Tasks in R
5.4 Approximate Text Matching
Chapter 6: Data Validation
6.1 Introduction
6.2 A First Look at the validate Package
6.3 Defining Data Validation
6.4 A Formal Typology of Data Validation Functions
Chapter 7: Localizing Errors in Data Records
7.1 Error Localization
7.2 Error Localization with R
7.3 Error Localization as MIP-Problem
7.4 Numerical Stability Issues
7.5 Practical Issues
7.6 Conclusion
Appendix 7.A: Derivation of Eq. (7.33)
Chapter 8: Rule Set Maintenance and Simplification
8.1 Quality of Validation Rules
8.2 Rules in the Language of Logic
8.3 Rule Set Issues
8.4 Detection and Simplification Procedure
8.5 Conclusion
Chapter 9: Methods Based on Models for Domain Knowledge
9.1 Correction with Data Modifying Rules
9.2 Rule-Based Correction with dcmodify
9.3 Deductive Correction
Chapter 10: Imputation and Adjustment
10.1 Missing Data
10.2 Model-Based Imputation
10.3 Model-Based Imputation in R
10.4 Donor Imputation with R
10.5 Other Methods in the simputation Package
10.6 Imputation Based on the EM Algorithm
10.7 Sampling Variance under Imputation
10.8 Multiple Imputations
10.9 Analytic Approaches to Estimate Variance of Imputation
10.10 Choosing an Imputation Method
10.11 Constraint Value Adjustment
Chapter 11: Example: A Small Data-Cleaning System
11.1 Setup
11.2 Monitoring Changes in Data
11.3 Integration and Automation
References
Index
End User License Agreement
Chapter 1: Data Cleaning
Figure 1.1 Part of a statistical value chain, showing five different levels of statistical value going from raw data to statistical product.
Chapter 3: Technical Representation of Data
Figure 3.1 Terminology of the Unicode Character Encoding Model (UCEM, Whistler et al. (2008)) and terms used in this book. The fine-grained mappings defined in the Unicode standard are typeset in italics. Terminology used in this book is printed in bold. In Unicode, a character encoding system is defined in three steps, where the first step (definition of a character repertoire connected to code points) is fixed in the standard. The ensuing translation steps may be defined by programmers conforming to the standard, or one can use one of the mappings developed by the Unicode consortium, such as UTF-8.
Figure 3.2 Percentages of websites using encodings as a function of time. Shown here are three encodings most popular at the time of writing.
Figure 3.3 Relation between TAI, UTC, and POSIX time.
Chapter 4: Data Structure
Figure 4.1 UK lung deaths.
Figure 4.2 Directed graph.
Chapter 5: Cleaning Text Data
Figure 5.1 Results of exact or inexact matching against a lookup table with raw and normalized text data. The combination of normalization and approximate matching works best.
Figure 5.2 Hierarchical clustering (‘single’ method) of five strings based on the optimal string alignment distance matrix.
Figure 5.3 Benchmark of times to compute string distance measures in the stringdist package (smaller is faster). In this benchmark, the lower triangular of distance matrix was computed 250 times using stringdist::stringdistmatrix. The durations were measured with the microbenchmark package. The strings consisted of randomly generated names of lengths varying between 8 and 28 characters.
Figure 5.4 Benchmarks of computing string distance measures in the stringdist package (smaller is faster). In this benchmark, lower triangular distance matrices were computed between random strings of lengths varying from 8 to 256 characters. The lines represent median relative times over 10 repetitions. Results have been normalized to the maximum median time. Results for the soundex distance were not plotted as it nearly coincides with that of the Hamming distance.
Chapter 6: Data Validation
Figure 6.1 The various times involved in a measurement process. A population member exists over a certain period. At the time of measurement, a value is observed that pertains to a reference period, which may in principle lie before, within, or after the period over which the member exists. Also, instead of a reference period, one may choose a single moment in time.
Figure 6.2 Defining rules and their properties in YAML format.
Figure 6.3 Free-form and YAML-structured rule definitions can be mixed in a single file.
Chapter 7: Localizing Errors in Data Records
Figure 7.1 (a) In gray, the area of valid combinations according to the rule set of Eq. (7.36); black dots represent valid combinations and crosses represent invalid combinations. One of the coordinate axes is part of the valid area. (b) The valid area as defined by two restrictions with an appropriate choice of constant.
Figure 7.2 (a) In gray, the area of valid combinations according to the rule set of Eq. (7.39). Black dots represent valid combinations, and crosses represent invalid combinations. One of the coordinate axes is part of the valid area. (b) The valid area as defined by three restrictions with appropriate choices of constants.
Chapter 9: Methods Based on Models for Domain Knowledge
Figure 9.1 Idempotency of a modifying function. The full rectangle represents the measurement domain. The part of the rectangle outside the shaded region is valid according to the condition of the modifying rule; the shaded region is considered invalid. The modifying function maps an invalid record to a valid one and a valid record to itself.
Figure 9.2 Commutativity of idempotent modifying functions. The full rectangle, including the two marked subsets, corresponds to the measurement domain. The two idempotent modifying functions commute on one of the subsets but not on the other.
Chapter 10: Imputation and Adjustment
Figure 10.1 Percentages of missing values per variable (a) and occurrence of missing data patterns (b) in the retailers dataset, plotted with VIM::aggr.
Figure 10.2 Parallel boxplots, comparing the distribution of staff conditional on the missingness of other variables.
Figure 10.3 Marginplot of other.rev against staff.
Figure 10.4 Three ρ- and ψ-functions used in M-estimation (plotted so that the scales of the axes are comparable). For comparison, the traditional least squares function is plotted with dotted lines.
Figure 10.5 (a) A decision tree for estimating number of staff in the retailers dataset, computed with rpart. (b) The space partitioning. The vertical bars indicate the limits represented in the top node and the subdivision indicated by its left child node. In the middle region, white dots indicate records where size=="sc2".
Figure 10.6 Estimation using multiple imputation.
Figure 10.7 Observed and multiple imputed staff numbers and turnovers. The data is taken from the retailers dataset of the validate package. Imputations are generated with Amelia::amelia.
Figure 10.8 Result of a call to Amelia::overimpute(out, var="staff"). Vertical bars indicate 90% confidence intervals. A perfect model would have all points on the diagonal line. The colors indicate the fraction of missing values in the record to which each plotted value pertains.
Figure 10.9 Valid regions (shaded) for restrictions (10.20) and (10.21). The black dots represent the data points.
Chapter 11: Example: A Small Data-Cleaning System
Figure 11.1 The validation rules in rules.txt.
Figure 11.2 Modifying rules in modify.txt.
Figure 11.3 The retailers data after error localization.
Figure 11.4 A classification of cell status when comparing a dataset after some processing with the dataset before processing. The ‘total’ is equal to the number of cells in a dataset.
Figure 11.5 A classification of changes in rule violation status before and after a data-cleaning step.
Figure 11.6 Progression of the number of violations and the number of unverifiable validations as a function of the step number in the data-cleaning process.
Figure 11.7 The completed data-cleaning script, automated using docopt.
Chapter 3: Technical Representation of Data
Table 3.1 Conventions for integer representation in three-bit representation
Table 3.2 Text constants in R
Table 3.3 Functions opening connections in base R
Table 3.4 ISO 8601 notational formats for calendar dates on the Gregorian calendar
Table 3.5 ISO 8601 notational formats for time of day in the 24-hour system
Table 3.6 A selection of time and date conversion codes accepted by R's time and date conversion functions such as strptime and format
Chapter 4: Data Structure
Table 4.1 Often encountered data types and their structure
Table 4.2 Table ‘Person’ with two records
Table 4.3 Relational operators and their dplyr verb
Table 4.4 Number of lung deaths in “messy” format
Table 4.5 Number of lung deaths in “tidy” format
Chapter 5: Cleaning Text Data
Table 5.1 Character ranges in the POSIX standard (version 2), with ranges corresponding to symbols of the ASCII alphabet
Table 5.2 Extended regular expression notation for repetition operators
Table 5.3 A selection of basic text processing functions in base R and in stringi.
Table 5.4 Edit-based string distances
Table 5.5 Methods supported by the stringdist package
Chapter 6: Data Validation
Table 6.1 Classification of validation functions, based on whether the data being validated come from a single or from multiple domains, times of measurement, statistical objects, or variables
Chapter 7: Localizing Errors in Data Records
Table 7.1 Numerical parameters for MIP-based error localization
Chapter 10: Imputation and Adjustment
Table 10.1 Number of multiple imputations m, as recommended by Graham et al. (2007)
Table 10.2 A classification of imputation methods with examples
Mark van der Loo
Statistics Netherlands, The Netherlands
Edwin de Jonge
Statistics Netherlands, The Netherlands
This edition first published 2018
© 2018 John Wiley and Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Mark van der Loo and Edwin de Jonge to be identified as the authors of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
9600 Garsington Road, Oxford, OX4 2DQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Loo, Mark van der, 1976- author.| Jonge, Edwin de, 1972- author.
Title: Statistical data cleaning with applications in R / by Mark van der Loo, Edwin de Jonge
Description: Hoboken, NJ : John Wiley & Sons, 2018. | Includes bibliographical references and index. |
Identifiers: LCCN 2017049091 (print) | LCCN 2017059014 (ebook) | ISBN 9781118897140 (pdf) | ISBN 9781118897133 (epub) | ISBN 9781118897157 (cloth)
Subjects: LCSH: Statistics-Data processing. | R (Computer program language)
Classification: LCC QA276.45.R3 (ebook) | LCC QA276.45.R3 J663 2018 (print) | DDC 519.50285/5133-dc23
LC record available at https://lccn.loc.gov/2017049091
Cover design by Wiley
Cover image: © enot-poloskun/Gettyimages; © 3alexd/Gettyimages
Data cleaning is often the most time-consuming part of data analysis. Although it has long been recognized as a separate topic in Official Statistics (where it is called ‘data editing’) and has also been studied in relation to databases, literature aimed at the larger statistical community is limited. This is why, when the publisher invited us to expand our tutorial “An introduction to data cleaning with R”, which we developed for the useR!2013 conference, into a book, we grabbed the opportunity with both hands. On the one hand, we felt that some of the methods developed in the Official Statistics community over the last five decades deserved a wider audience. Perhaps this book can help with that. On the other hand, we hope that this book will help bring some (often pre-existing) techniques into the Official Statistics community, as we move from survey-based data sources to administrative and “big” data sources.
For us, it would also be a nice way to systematize our knowledge and the software we have written on this topic. Looking back, we ended up not only writing this book but also redeveloping and generalizing many of the data cleaning R packages we had written before. One reason for this is that we discovered nice ways to generalize and expand our software and methods; another is that we wished to connect to the recently emerged “tidyverse” style of interfaces to R functionality.
This book contains a selection of topics that we found useful while developing data cleaning (data editing) systems. The range is broad, covering topics from computer science, numerical methods, technical standards, statistics and data modeling, and programming.
This book covers topics in “technical data cleaning”, including conversion and interpretation of numerical, text, and date types. The technical standards related to these data types are also covered in some detail. On the data content side of things, topics include data validation (data checking), error localization, various methods for error correction, and missing value imputation.
Wherever possible, the theory discussed in this book is illustrated with executable R code. We have also included exercises throughout the book, which we hope will guide the reader in further understanding both the software and the methods.
The mix of topics reflects both the breadth of the subject and, of course, the interests and expertise of the authors. The list of missing topics is much larger than what is treated, but perhaps the most important omissions are the cleaning of time series and outlier detection.
Readers of this book are expected to have basic knowledge of mathematics and statistics as well as some programming experience. We assume prior knowledge of concepts such as expected value and variance, and of basic calculus and linear algebra. It is beneficial to have at least some knowledge of R, since this is the language used in this book, but for convenience and reference, a short chapter explaining the basics is included.
This book would not have been possible without the work of many others. We would like to thank our colleagues at Statistics Netherlands for fruitful discussions on data validation, imputation, and error localization. Some of the chapters in this book are based on papers and reports written with co-authors. We thank Jeroen Pannekoek, Sander Scholtus, and Jacco Daalmans for their pleasant and fruitful collaboration. We are greatly indebted to the R core team, package developers, and the very supportive R community for their relentless efforts.
Finally, we would like to thank our families for their love and support.
June 2017
Mark and Edwin
Do not forget to visit the companion website for this book:
www.data-cleaning.org
There you will find valuable materials designed to enhance your learning, including:
supplementary materials
The purpose of data cleaning is to bring data up to a level of quality such that it can reliably be used for the production of statistical models or statements. The level of quality needed to create a particular statistical output is determined by a simple cost-benefit question: when is the statistical output fit for use, and how much effort will it cost to bring the data up to that level?
One useful way to get a hold on this question is to think of data analyses in terms of a value chain. A value chain, roughly, consists of a sequence of activities that increase the value of a product step by step. The notion of a statistical value chain has become common in the official statistics community over the past two decades or so, although a single agreed-upon definition seems to be lacking.1 Roughly, then, a statistical value chain is constructed by defining a number of meaningful intermediate data products, for which a chosen set of quality attributes is well described (Renssen and Van Delden 2008). There are many ways to go about this, but for these authors, the picture shown in Figure 1.1 has proven to be fairly generic and useful for organizing thoughts around a statistical production process.
Figure 1.1 Part of a statistical value chain, showing five different levels of statistical value going from raw data to statistical product.
A nice trait of the schema in Figure 1.1 is that it naturally introduces activities that are typically categorized as ‘data cleaning’ into the statistical production process. From the left, we start with raw data. This must be worked up to satisfy (enough) technical standards so it can serve as input for consistency checks, data correction, and imputation procedures. Once this has been achieved, the data may be considered valid (enough) for the production of statistical parameters. These must then still be formatted to be ready for deployment as output.
One should realize that although the schema nicely organizes data analysis activities, in practice, the process is hardly linear. It is more common to clean data, create some aggregates, notice that something is wrong, and go back. The purpose of the value chain is more to keep an overview of where activities take place (e.g., by putting them in separate scripts) than to prescribe a linear order of the actual workflow. In practice, a workflow cycles multiple times through the subsequent stages of the value chain, until the quality of its output is good enough. In the following sections we will discuss each stage in a little more detail.
With raw data, we mean the data as it arrives at the desk of the analyst. The state of such data may of course vary enormously, depending on the data source. In any case, we take it as a given that the person responsible for analysis has little or no influence over how the data was gathered. The first step consists of making the data readily accessible and understandable. To be precise, after the first processing steps we demand that each value in the data is identified with the real-world object it represents (person, company, or something else), that for each value it is known which variable it represents (age, income, etc.), and that each value is stored in the appropriate technical format (number, string, and so on).
Depending on the technical raw data format, the activities necessary to achieve the desired technical format typically include file conversion, string normalization (such as encoding conversion), and standardization and conversion of numerical values. Joining the data against a backbone population register of known statistical objects, possibly using procedures that can handle inexact matches of keys, is also considered a part of such a procedure. These procedures are treated in Chapters 3–5.
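The sketch below illustrates what a few of these steps might look like in R. It is only an illustration under assumptions: the file raw.csv, its columns name and turnover, the assumed latin1 source encoding, and the small register are all hypothetical, and the approximate key matching uses the stringdist package treated in Chapter 5.
# A minimal sketch of typical 'raw data' to 'input data' steps
# (all file and column names below are hypothetical).
library(stringdist)    # approximate string matching (Chapter 5)
raw <- read.csv("raw.csv", stringsAsFactors = FALSE)
# Normalize character encoding to UTF-8 (assuming the source is latin1).
raw$name <- iconv(raw$name, from = "latin1", to = "UTF-8")
# Standardize and convert numeric values: strip thousands separators, then coerce.
raw$turnover <- as.numeric(gsub(",", "", raw$turnover, fixed = TRUE))
# Join against a (toy) register of known statistical objects,
# allowing small spelling differences in the matching key.
register <- data.frame(name = c("ACME Ltd", "Foo Inc"), stringsAsFactors = FALSE)
idx <- amatch(raw$name, register$name, maxDist = 2)
raw$register_name <- register$name[idx]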
Input data are data where each value is stored in the correct type and identified with the variable it represents and the statistical entity it pertains to. In many cases, such a dataset can be represented in tabular format, with rows representing entities and columns representing variables. In the R community, this has come to be known as tidy data (Wickham, 2014b). Here, we leave the format open. Many kinds of data can be usefully represented in the form of a tree or graph (e.g., web pages and XML structures). As long as all the elements are readily identified and of the correct format, a dataset can serve as input data.
Once a dataset is at the level of input data, the treatment of missing values, implausible values, and implausible value combinations must take place. This process is commonly referred to as data editing and imputation. It differs from the previous steps in that it focuses on the consistency of data with respect to domain knowledge. Such domain knowledge can often be expressed as a set of rules, such as age >= 0, mean(profit) > 0, or if (age < 15) has_job == FALSE. A substantial part of this book (Chapters 6–8) is devoted to defining, applying, and maintaining such rules so that data cleaning can be automated and hence executed in a reproducible way. Moreover, in Chapter 7, we look into methodology that allows one to pick out a minimum number of fields in a record that may be altered or imputed such that all the rules can be satisfied. In Chapter 9, we take a formal look at how data modification using knowledge rules can be safely automated, and Chapter 10 treats missing value imputation.
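As a brief preview, such rules can be expressed with the validate package (the subject of Chapter 6); in the sketch below, the toy data frame and its values are invented for illustration.
# Sketch: expressing the example restrictions as validation rules.
library(validate)
rules <- validator(
  age >= 0,
  mean(profit) > 0,
  if (age < 15) has_job == FALSE
)
# A toy dataset with a few deliberate violations.
d <- data.frame(age = c(34, -1, 12),
                profit = c(10, 5, -20),
                has_job = c(TRUE, TRUE, TRUE))
cf <- confront(d, rules)
summary(cf)   # per-rule counts of passes, fails, and missing values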
Data are valid once they are trusted to faithfully represent the variables and objects they are supposed to describe. Making sure that data satisfy the domain knowledge expressed in the form of a set of validation rules is one reproducible way of doing so. Often, this is complemented by some form of expert review, for example, inspection of visualizations or aggregate values by domain experts.
Once data is deemed valid, the statistics can be produced by known modeling and inference techniques. Depending on the preceding data cleaning process, these techniques may need to take those procedures into account, for example, when estimating variance after certain imputation procedures.
Statistics are estimates of the output variables of interest. Often, these are simple aggregates (totals or means), but in principle they can consist of more complex parameters such as regression model coefficients or a trained machine learning model.
Output is where the analysis stops. It is created by taking the statistics and preparing them for dissemination. This may involve technical formatting, for example, to make numbers available through a (web) API, or layout formatting, for example, preparing a report or visualization. In the case of technical formatting, a technical validation step may again be necessary, for example, checking the output against a (JSON or XML) schema. In general, the output of one analyst is raw data for another.
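As a small sketch of such technical output formatting, statistics could, for example, be serialized to JSON with the jsonlite package; the statistic names and values below are invented for illustration.
# Sketch: preparing computed statistics for dissemination as JSON
# (the names and values below are invented for illustration).
library(jsonlite)
stats <- list(
  period        = "2017Q4",
  mean_turnover = 2325.0,
  n_respondents = 60
)
json <- toJSON(stats, auto_unbox = TRUE, pretty = TRUE)
cat(json)
# The resulting string could then be checked against a JSON schema and
# published through a (web) API.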
The topics discussed in this book relate to a variety of subfields in mathematics, logic, statistics, computer science, and programming. This broad range of fields makes coming up with a consistent notation for different variable types and concepts a bit of a challenge, but we have attempted to use a consistent notation throughout.
We follow conventional notation and use ℕ, ℤ, and ℝ to denote the natural, integer, and real numbers. The symbols ∨, ∧, and ¬ stand for logical disjunction, conjunction, and negation, respectively. In the context of logic, ⊕ is used to denote ‘exclusive or’. Sometimes, it is useful to distinguish between a definition and an equality. In these cases, a definition is denoted using ≡.
Vectors are denoted in lowercase bold, usually x. Vectors are column vectors unless noted otherwise. The symbols 1 and 0 denote vectors of which the coefficients are all 1 or all 0, respectively. Matrices are denoted in uppercase bold, usually A. The identity matrix is denoted by I. Transposition is indicated with a superscript T, and matrix multiplication is implied by juxtaposition, for example, Ax. The standard Euclidean norm of a vector x is denoted by ‖x‖, and if another norm is used, this is indicated with a subscript. For example, ‖x‖₁ denotes the L₁ norm of x. In the context of linear algebra, ⊗ and ⊕ denote the direct (tensor) product and the direct sum, respectively.
Random variables are denoted in capitals, usually X. Probabilities and probability densities are both denoted by P. Expected value, variance, and covariance are written as E(X), V(X), and cov(X, Y), respectively. Estimates are accented with a hat; for example, Ê(X) denotes the estimated expected value of X.
R code is presented in fixed width font sections. Output is prefixed with two hashes.
age <- sample(100, 25, replace = TRUE)
mean(age)
## [1] 52.44
