The only how-to guide offering a unified, systematic approach to acquiring, cleaning, and managing data in R
Every experienced practitioner knows that preparing data for modeling is a painstaking, time-consuming process. Adding to the difficulty is that most modelers learn the steps involved in cleaning and managing data piecemeal, often on the fly, or they develop their own ad hoc methods. This book helps simplify their task by providing a unified, systematic approach to acquiring, modeling, manipulating, cleaning, and maintaining data in R.
Starting with the very basics, data scientists Samuel E. Buttrey and Lyn R. Whitaker walk readers through the entire process. From what data looks like and what it should look like, they progress through all the steps involved in getting data ready for modeling. They describe best practices for acquiring data from numerous sources; explore key issues in data handling, including text/regular expressions, big data, parallel processing, merging, matching, and checking for duplicates; and outline highly efficient and reliable techniques for documenting data and recordkeeping, including audit trails, getting data back out of R, and more.
A Data Scientist's Guide to Acquiring, Cleaning and Managing Data in R is a valuable working resource/bench manual for practitioners who collect and analyze data, lab scientists and research associates of all levels of experience, and graduate-level data mining students.
Page count: 588
Year of publication: 2017
Cover
Title Page
Copyright
Dedication
About the Authors
Preface
Acknowledgments
About the Companion Website
Chapter 1: R
1.1 Introduction
1.2 Data
1.3 The Very Basics of R
1.4 Running an R Session
1.5 Getting Help
1.6 How to Use This Book
Chapter 2: R Data, Part 1: Vectors
2.1 Vectors
2.2 Data Types
2.3 Subsets of Vectors
2.4 Missing Data (NA) and Other Special Values
2.5 The table() Function
2.6 Other Actions on Vectors
2.7 Long Vectors and Big Data
2.8 Chapter Summary and Critical Data Handling Tools
Chapter 3: R Data, Part 2: More Complicated Structures
3.1 Introduction
3.2 Matrices
3.3 Lists
3.4 Data Frames
3.5 Operating on Lists and Data Frames
3.6 Date and Time Objects
3.7 Other Actions on Data Frames
3.8 Handling Big Data
3.9 Chapter Summary and Critical Data Handling Tools
Chapter 4: R Data, Part 3: Text and Factors
4.1 Character Data
4.2 Converting Numbers into Text
4.3 Constructing Character Strings: Paste in Action
4.4 Regular Expressions
4.5 UTF-8 and Other Non-ASCII Characters
4.6 Factors
4.7 R Object Names and Commands as Text
4.8 Chapter Summary and Critical Data Handling Tools
Chapter 5: Writing Functions and Scripts
5.1 Functions
5.2 Scripts and Shell Scripts
5.3 Error Handling and Debugging
5.4 Interacting with the Operating System
5.5 Speeding Things Up
5.6 Chapter Summary and Critical Data Handling Tools
Chapter 6: Getting Data into and out of R
6.1 Reading Tabular ASCII Data into Data Frames
6.2 Reading Large, Non-Tabular, or Non-ASCII Data
6.3 Reading Data From Relational Databases
6.4 Handling Large Numbers of Input Files
6.5 Other Formats
6.6 Reading and Writing R Data Directly
6.7 Chapter Summary and Critical Data Handling Tools
Chapter 7: Data Handling in Practice
7.1 Acquiring and Reading Data
7.2 Cleaning Data
7.3 Combining Data
7.4 Transactional Data
7.5 Preparing Data
7.6 Documentation and Reproducibility
7.7 The Role of Judgment
7.8 Data Cleaning in Action
7.9 Chapter Summary and Critical Data Handling Tools
Chapter 8: Extended Exercise
8.1 Introduction to the Problem
8.2 The Data
8.3 Five Important Fields
8.4 Loan and Application Portfolios
8.5 Scores
8.6 Co-borrower Scores
8.7 Updated KScores
8.8 Loans to Be Excluded
8.9 Response Variable
8.10 Assembling the Final Data Sets
Appendix A: Hints and Pseudocode
A.1 Loan Portfolios
A.2 Scores Database
A.3 Co-borrower Scores
A.4 Updated KScores
A.5 Excluder Files
A.6 Payment Matrix
A.7 Starting the Modeling Process
Bibliography
Index
End User License Agreement
Chapter 5: Writing Functions and Scripts
Figure 5.1 Vectorized and non-vectorized code example.
Chapter 7: Data Handling in Practice
Figure 7.1 Example population flowchart.
Chapter 8: Extended Exercise
Figure 8.1 Schematic of the data cleaning process for the example data.
Chapter 4: R Data, Part 3: Text and Factors
Table 4.1 Special characters in R (POSIX) regular expressions
Chapter 8: Extended Exercise
Table 8.1 Columns required for predictive model
Table 8.2 State abbreviations by census region
Table 8.3 CScore table
Table 8.4 CusApp table
Samuel E. Buttrey and Lyn R. Whitaker
Naval Postgraduate School, California, United States
This edition first published 2018
© 2018 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Samuel E. Buttrey and Lyn R. Whitaker to be identified as the authors of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
9600 Garsington Road, Oxford, OX4 2DQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data applied for
Hardback ISBN: 9781119080022
Cover Design: Wiley
Cover Image: © Nongkran_ch/Gettyimages
To Elinda and Mike
Samuel E. Buttrey received a bachelor's degree in statistics from Princeton University in 1983. After 8 years as a Wall Street computer systems analyst, he returned to graduate school and received MA and PhD degrees in statistics from the University of California at Berkeley, the latter in 1996. In that year, he joined the faculty of the Department of Operations Research at the Naval Postgraduate School in Monterey, California. He has published papers on nearest-neighbor and other classification methods and on applied problems ranging from numismatics and oceanography to human vision. He has also published papers describing his implementations of algorithms in software. His interests include classification, computationally intensive methods, and statistical graphics, and most recently, inter-point distance measures for mixed categorical and numeric data. He lives in Pacific Grove, California, with wife Elinda, son John, and some cats.
Lyn R. Whitaker received a bachelor's degree in genetics in 1978 and a PhD in statistics from the University of California, Davis, in 1985. She was an Assistant Professor in the Department of Statistics and Applied Probability at the University of California at Santa Barbara from 1985 to 1988, and joined the faculty of the Department of Operations Research at the Naval Postgraduate School in 1988. Her interests are applied statistics relevant to defense issues. These include unsupervised methods for large and messy data, the statistical aspects of reliability and survival analysis, and most recently, jointly with Buttrey, development and use of inter-point distances for mixed data types. She resides in Monterey, California, with husband Mike, father Fred, and, occasionally, children Alex, Lee, and Mary.
Statisticians use data to build models, and they use models to describe the world and to make predictions about what will happen next. There are many very good books that describe statistical modeling, but these modeling efforts usually start with a set of “clean,” well-behaved data in which nothing is missing or anomalous.
In real life, data is messy. There will be missing values, impossible values, and typographical errors. Data is gathered from multiple sources, leading to both duplication and inconsistency. Data that should be categorical is coded as numeric; data that should be numeric can appear categorical; data can be hidden inside free-form text; and data can be in the form of dates in a wide variety of possible formats. We estimate that 80% of the time spent on any data analysis problem goes to reading and preparing the data. So, any analyst needs to know how to acquire data and how to prepare it for modeling, and the steps taken should be automatic, as far as possible, and reproducible.
This book describes how to handle data using the R software. R is the most widely used software in statistics, and it has the advantage of being free, open-source, and available on every major computing platform. Whatever software you use, you will find yourself facing the issues of acquiring, cleaning, and merging data, and documenting the steps you took. We hope this book will help you do these things efficiently.
Sam Buttrey and Lyn Whitaker
Monterey, California, USA
November 30, 2016
Our book is about how to use R to process data. We use R because it is powerful, versatile, and extensible. We thank the developers of R for their service to the statistical community in producing a high-quality piece of open-source software. We also thank the long list of colleagues and students who have helped frame our thinking about questions of statistics and data.
Don't forget to visit the companion website for this book:
www.wiley.com/go/buttrey/datascientistsguide
There you will find valuable material designed to enhance your learning, including:
A complete listing of all the R code in the book
Example datasets used in the exercises
