Data cleaning is the all-important first step to successful data science, data analysis, and machine learning. If you work with any kind of data, this book is your go-to resource, arming you with the insights and heuristics experienced data scientists had to learn the hard way.
In a light-hearted and engaging exploration of different tools, techniques, and datasets real and fictitious, Python veteran David Mertz teaches you the ins and outs of data preparation and the essential questions you should be asking of every piece of data you work with.
Using a mixture of Python, R, and common command-line tools, Cleaning Data for Effective Data Science follows the data cleaning pipeline from start to end, focusing on helping you understand the principles underlying each step of the process. You'll look at data ingestion of a vast range of tabular, hierarchical, and other data formats, impute missing values, detect unreliable data and statistical anomalies, and generate synthetic features. The long-form exercises at the end of each chapter let you get hands-on with the skills you've acquired along the way, also providing a valuable resource for academic courses.
Page count: 646
Publication year: 2021
Cleaning Data for Effective Data Science
Doing the other 80% of the work with Python, R, and command-line tools
David Mertz
BIRMINGHAM—MUMBAI
Copyright © 2021 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Producer: Shailesh Jain
Acquisition Editor – Peer Reviews: Saby D’silva
Project Editor: Rianna Rodrigues
Content Development Editor: Lucy Wan
Copy Editor: Safis Editing
Technical Editor: Aditya Sawant
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Presentation Designer: Pranit Padwal
First published: March 2021
Production reference: 1260321
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80107-129-1
www.packt.com
David Mertz, Ph.D. is the founder of KDM Training, a partnership dedicated to educating developers and data scientists in machine learning and scientific computing. He created a data science training program for Anaconda Inc. and was a senior trainer for them. With the advent of deep neural networks, he has turned to training our robot overlords as well.
He previously worked for 8 years with D. E. Shaw Research and was also a Director of the Python Software Foundation for 6 years. David remains co-chair of its Trademarks Committee and Scientific Python Working Group. His columns, Charming Python and XML Matters, were once the most widely read articles in the Python world.
I give great thanks to those people who have helped make this book better.
First and foremost, I am thankful for the careful attention and insightful suggestions of my development editor Lucy Wan, and technical reviewer Miki Tebeka. Other colleagues and friends who have read and provided helpful comments on parts of this book, while it was in progress, include Micah Dubinko, Vladimir Shulyak, Laura Richter, Alessandra Smith, Mary Ann Sushinsky, Tim Churches, and Paris Finley.
The text in front of you is better for their kindnesses and intelligence; all errors and deficits remain mine entirely.
I also thank the thousands of contributors who have created the Free Software I used in the creation of this book, and in so much other work I do. No proprietary software was used by the author at any point in the production of this book. The operating system, text editors, plot creation tools, fonts, programming languages, shells, command-line tools, and all other software used belongs to our human community rather than to any exclusive private entity.
Miki Tebeka is the CEO of 353solutions, and he has a passion for teaching and mentoring. He teaches many workshops on various technical subjects all over the world and has also mentored many young developers on their way to success. Miki is involved in open source, has several projects of his own, and has contributed to several more, including the Python project and the Go project. He has been writing software for 25 years.
Miki wrote Forging Python, Python Brain Teasers, Go Brain Teasers, and Pandas Brain Teasers, and is a LinkedIn Learning author. He’s an organizer of the Go Israel Meetup, GopherCon Israel, and PyData Israel Conference.
Preface
PART I: Data Ingestion
Tabular Formats
Tidying Up
CSV
Sanity Checks
The Good, the Bad, and the Textual Data
The Bad
The Good
Spreadsheets Considered Harmful
SQL RDBMS
Massaging Data Types
Repeating in R
Where SQL Goes Wrong (and How to Notice It)
Other Formats
HDF5 and NetCDF-4
Tools and Libraries
SQLite
Apache Parquet
Data Frames
Spark/Scala
Pandas and Derived Wrappers
Vaex
Data Frames in R (Tidyverse)
Data Frames in R (data.table)
Bash for Fun
Exercises
Tidy Data from Excel
Tidy Data from SQL
Denouement
Hierarchical Formats
JSON
What JSON Looks Like
NaN Handling and Data Types
JSON Lines
GeoJSON
Tidy Geography
JSON Schema
XML
User Records
Keyhole Markup Language
Configuration Files
INI and Flat Custom Formats
TOML
Yet Another Markup Language
NoSQL Databases
Document-Oriented Databases
Missing Fields
Denormalization and Its Discontents
Key/Value Stores
Exercises
Exploring Filled Area
Create a Relational Model
Denouement
Repurposing Data Sources
Web Scraping
HTML Tables
Non-Tabular Data
Command-Line Scraping
Portable Document Format
Image Formats
Pixel Statistics
Channel Manipulation
Metadata
Binary Serialized Data Structures
Custom Text Formats
A Structured Log
Character Encodings
Exercises
Enhancing the NPY Parser
Scraping Web Traffic
Denouement
PART II: The Vicissitudes of Error
Anomaly Detection
Missing Data
SQL
Hierarchical Formats
Sentinels
Miscoded Data
Fixed Bounds
Outliers
Z-Score
Interquartile Range
Multivariate Outliers
Exercises
A Famous Experiment
Misspelled Words
Denouement
Data Quality
Missing Data
Biasing Trends
Understanding Bias
Detecting Bias
Comparison to Baselines
Benford’s Law
Class Imbalance
Normalization and Scaling
Applying a Machine Learning Model
Scaling Techniques
Factor and Sample Weighting
Cyclicity and Autocorrelation
Domain Knowledge Trends
Discovered Cycles
Bespoke Validation
Collation Validation
Transcription Validation
Exercises
Data Characterization
Oversampled Polls
Denouement
PART III: Rectification and Creation
Value Imputation
Typical-Value Imputation
Typical Tabular Data
Locality Imputation
Trend Imputation
Types of Trends
A Larger Coarse Time Series
Understanding the Data
Removing Unusable Data
Imputing Consistency
Interpolation
Non-Temporal Trends
Sampling
Undersampling
Oversampling
Exercises
Alternate Trend Imputation
Balancing Multiple Features
Denouement
Feature Engineering
Date/Time Fields
Creating Datetimes
Imposing Regularity
Duplicated Timestamps
Adding Timestamps
String Fields
Fuzzy Matching
Explicit Categories
String Vectors
Decompositions
Rotation and Whitening
Dimensionality Reduction
Visualization
Quantization and Binarization
One-Hot Encoding
Polynomial Features
Generating Synthetic Features
Feature Selection
Exercises
Intermittent Occurrences
Characterizing Levels
Denouement
PART IV: Ancillary Matters
Closure
What You Know
What You Don’t Know (Yet)
Glossary
Why subscribe?
Other Books You May Enjoy
Index
