A Data Scientist's Guide to Acquiring, Cleaning, and Managing Data in R - Samuel E. Buttrey - E-Book

Samuel E. Buttrey

Description

The only how-to guide offering a unified, systematic approach to acquiring, cleaning, and managing data in R

Every experienced practitioner knows that preparing data for modeling is a painstaking, time-consuming process. Adding to the difficulty is that most modelers learn the steps involved in cleaning and managing data piecemeal, often on the fly, or they develop their own ad hoc methods. This book helps simplify their task by providing a unified, systematic approach to acquiring, modeling, manipulating, cleaning, and maintaining data in R. 

Starting with the very basics, data scientists Samuel E. Buttrey and Lyn R. Whitaker walk readers through the entire process. From what data looks like and what it should look like, they progress through all the steps involved in getting data ready for modeling. They describe best practices for acquiring data from numerous sources; explore key issues in data handling, including text/regular expressions, big data, parallel processing, merging, matching, and checking for duplicates; and outline highly efficient and reliable techniques for documenting data and recordkeeping, including audit trails, getting data back out of R, and more.

  • The only single-source guide to R data and its preparation, it describes best practices for acquiring, manipulating, cleaning, and maintaining data
  • Begins with the basics and walks readers through all the steps necessary to get data ready for the modeling process
  • Provides expert guidance on how to document the processes described so that they are reproducible
  • Written by seasoned professionals, it provides both introductory and advanced techniques
  • Features case studies with supporting data and R code, hosted on a companion website

A Data Scientist's Guide to Acquiring, Cleaning, and Managing Data in R is a valuable working resource/bench manual for practitioners who collect and analyze data, lab scientists and research associates of all levels of experience, and graduate-level data mining students.

Page count: 588

Year of publication: 2017

Table of Contents

Cover

Title Page

Copyright

Dedication

About the Authors

Preface

Acknowledgments

About the Companion Website

Chapter 1: R

1.1 Introduction

1.2 Data

1.3 The Very Basics of R

1.4 Running an R Session

1.5 Getting Help

1.6 How to Use This Book

Chapter 2: R Data, Part 1: Vectors

2.1 Vectors

2.2 Data Types

2.3 Subsets of Vectors

2.4 Missing Data (NA) and Other Special Values

2.5 The table() Function

2.6 Other Actions on Vectors

2.7 Long Vectors and Big Data

2.8 Chapter Summary and Critical Data Handling Tools

Chapter 3: R Data, Part 2: More Complicated Structures

3.1 Introduction

3.2 Matrices

3.3 Lists

3.4 Data Frames

3.5 Operating on Lists and Data Frames

3.6 Date and Time Objects

3.7 Other Actions on Data Frames

3.8 Handling Big Data

3.9 Chapter Summary and Critical Data Handling Tools

Chapter 4: R Data, Part 3: Text and Factors

4.1 Character Data

4.2 Converting Numbers into Text

4.3 Constructing Character Strings: Paste in Action

4.4 Regular Expressions

4.5 UTF-8 and Other Non-ASCII Characters

4.6 Factors

4.7 R Object Names and Commands as Text

4.8 Chapter Summary and Critical Data Handling Tools

Chapter 5: Writing Functions and Scripts

5.1 Functions

5.2 Scripts and Shell Scripts

5.3 Error Handling and Debugging

5.4 Interacting with the Operating System

5.5 Speeding Things Up

5.6 Chapter Summary and Critical Data Handling Tools

Chapter 6: Getting Data into and out of R

6.1 Reading Tabular ASCII Data into Data Frames

6.2 Reading Large, Non-Tabular, or Non-ASCII Data

6.3 Reading Data From Relational Databases

6.4 Handling Large Numbers of Input Files

6.5 Other Formats

6.6 Reading and Writing R Data Directly

6.7 Chapter Summary and Critical Data Handling Tools

Chapter 7: Data Handling in Practice

7.1 Acquiring and Reading Data

7.2 Cleaning Data

7.3 Combining Data

7.4 Transactional Data

7.5 Preparing Data

7.6 Documentation and Reproducibility

7.7 The Role of Judgment

7.8 Data Cleaning in Action

7.9 Chapter Summary and Critical Data Handling Tools

Chapter 8: Extended Exercise

8.1 Introduction to the Problem

8.2 The Data

8.3 Five Important Fields

8.4 Loan and Application Portfolios

8.5 Scores

8.6 Co-borrower Scores

8.7 Updated KScores

8.8 Loans to Be Excluded

8.9 Response Variable

8.10 Assembling the Final Data Sets

Appendix A: Hints and Pseudocode

A.1 Loan Portfolios

A.2 Scores Database

A.3 Co-borrower Scores

A.4 Updated KScores

A.5 Excluder Files

A.6 Payment Matrix

A.7 Starting the Modeling Process

Bibliography

Index

End User License Agreement

List of Illustrations

Chapter 5: Writing Functions and Scripts

Figure 5.1 Vectorized and non-vectorized code example.

Chapter 7: Data Handling in Practice

Figure 7.1 Example population flowchart.

Chapter 8: Extended Exercise

Figure 8.1 Schematic of the data cleaning process for the example data.

List of Tables

Chapter 4: R Data, Part 3: Text and Factors

Table 4.1 Special characters in R (POSIX) regular expressions

Chapter 8: Extended Exercise

Table 8.1 Columns required for predictive model

Table 8.2 State abbreviations by census region

Table 8.3 CScore table

Table 8.4 CusApp table

A Data Scientist's Guide to Acquiring, Cleaning, and Managing Data in R

Samuel E. Buttrey and Lyn R. Whitaker

Naval Postgraduate School, California, United States

This edition first published 2018

© 2018 John Wiley & Sons Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Samuel E. Buttrey and Lyn R. Whitaker to be identified as the authors of this work has been asserted in accordance with law.

Registered Offices

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

9600 Garsington Road, Oxford, OX4 2DQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data applied for

Hardback ISBN: 9781119080022

Cover Design: Wiley

Cover Image: © Nongkran_ch/Gettyimages

To Elinda and Mike

About the Authors

Samuel E. Buttrey received a bachelor's degree in statistics from Princeton University in 1983. After 8 years as a Wall Street computer systems analyst, he returned to graduate school and received MA and PhD degrees in statistics from the University of California at Berkeley, the latter in 1996. In that year, he joined the faculty of the Department of Operations Research at the Naval Postgraduate School in Monterey, California. He has published papers on nearest-neighbor and other classification methods and on applied problems ranging from numismatics and oceanography to human vision. He has also published papers describing his implementations of algorithms in software. His interests include classification, computationally intensive methods, and statistical graphics, and most recently, inter-point distance measures for mixed categorical and numeric data. He lives in Pacific Grove, California, with wife Elinda, son John, and some cats.

Lyn R. Whitaker received a bachelor's degree in genetics in 1978 and a PhD in statistics from the University of California, Davis, in 1985. She was an Assistant Professor in the Department of Statistics and Applied Probability at the University of California at Santa Barbara from 1985 to 1988, and joined the faculty of the Department of Operations Research at the Naval Postgraduate School in 1988. Her interests are applied statistics relevant to defense issues. These include unsupervised methods for large and messy data, the statistical aspects of reliability and survival analysis, and most recently, jointly with Buttrey, development and use of inter-point distances for mixed data types. She resides in Monterey, California, with husband Mike, father Fred, and, occasionally, children Alex, Lee, and Mary.

Preface

Statisticians use data to build models, and they use models to describe the world and to make predictions about what will happen next. Many very good books describe statistical modeling, but these modeling efforts usually start with a set of “clean,” well-behaved data in which nothing is missing or anomalous.

In real life, data is messy. There will be missing values, impossible values, and typographical errors. Data is gathered from multiple sources, leading to both duplication and inconsistency. Data that should be categorical is coded as numeric; data that should be numeric can appear categorical; data can be hidden inside free-form text; and data can be in the form of dates in a wide variety of possible formats. We estimate that 80% of the time taken in any data analysis problem is taken up just in reading and preparing the data. So, any analyst needs to know how to acquire data and how to prepare it for modeling, and the steps taken should be automatic, as far as possible, and reproducible.
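As a rough illustration of the kinds of problems listed above, the following R sketch builds a small, entirely invented data frame and repairs it. The column names, the region labels, the use of -999 as a missing-value code, and the helper parse_date() are all assumptions made for this example; they are not taken from the book's own code.

```r
# Hypothetical raw data: every name and value here is invented for illustration.
raw <- data.frame(
  id     = c(1, 2, 2, 3),
  weight = c("72.5", "n/a", "n/a", "-999"),  # numeric data stored as text
  region = c(1, 2, 2, 3),                    # categorical data coded as numbers
  visit  = c("2016-11-30", "11/30/2016", "11/30/2016", "30.11.2016"),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows (rows 2 and 3 are identical).
clean <- raw[!duplicated(raw), ]

# Convert text to numeric: "n/a" becomes NA, and the assumed
# sentinel value -999 (an impossible weight) is also set to NA.
clean$weight <- suppressWarnings(as.numeric(clean$weight))
clean$weight[which(clean$weight < 0)] <- NA

# Turn the numeric codes into a labeled factor (labels are assumptions).
clean$region <- factor(clean$region, levels = 1:3,
                       labels = c("North", "South", "West"))

# Hypothetical helper: try several date formats in turn,
# keeping the first one that parses each element.
parse_date <- function(x, formats = c("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")) {
  out <- as.Date(rep(NA_character_, length(x)), format = "%Y-%m-%d")
  for (f in formats) {
    todo <- is.na(out)
    out[todo] <- as.Date(x[todo], format = f)
  }
  out
}
clean$visit <- parse_date(clean$visit)
```

Because every step is written as code rather than done by hand, the same cleaning can be rerun unchanged when the raw data is updated, which is the sense in which the steps are automatic and reproducible.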

This book describes how to handle data using the R software. R is the most widely used software in statistics, and it has the advantage of being free, open-source, and available on every major computing platform. Whatever software you use, you will find yourself facing the issues of acquiring, cleaning, and merging data, and documenting the steps you took. We hope this book will help you do these things efficiently.

Sam Buttrey and Lyn Whitaker

Monterey, California, USA
November 30, 2016

Acknowledgments

Our book is about how to use R to process data. We use R because it is powerful, versatile, and extensible. We thank the developers of R for their service to the statistical community in producing a high-quality, open-source piece of software. We also thank the long list of colleagues and students who have helped frame our thinking about questions of statistics and data.

About the Companion Website

Don't forget to visit the companion website for this book:

www.wiley.com/go/buttrey/datascientistsguide

There you will find valuable material designed to enhance your learning, including:

  • A complete listing of all the R code in the book
  • Example datasets used in the exercises