Cleaning Data for Effective Data Science - David Mertz - E-Book

Description

Data cleaning is the all-important first step to successful data science, data analysis, and machine learning. If you work with any kind of data, this book is your go-to resource, arming you with the insights and heuristics experienced data scientists had to learn the hard way.

In a light-hearted and engaging exploration of different tools, techniques, and datasets real and fictitious, Python veteran David Mertz teaches you the ins and outs of data preparation and the essential questions you should be asking of every piece of data you work with.

Using a mixture of Python, R, and common command-line tools, Cleaning Data for Effective Data Science follows the data cleaning pipeline from start to end, focusing on helping you understand the principles underlying each step of the process. You'll look at data ingestion of a vast range of tabular, hierarchical, and other data formats, impute missing values, detect unreliable data and statistical anomalies, and generate synthetic features. The long-form exercises at the end of each chapter let you get hands-on with the skills you've acquired along the way, also providing a valuable resource for academic courses.

You can read this e-book in the Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Pages: 646

Year of publication: 2021



Cleaning Data for Effective Data Science

Doing the other 80% of the work with Python, R, and command-line tools

David Mertz

BIRMINGHAM—MUMBAI

Cleaning Data for Effective Data Science

Copyright © 2021 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Producer: Shailesh Jain

Acquisition Editor – Peer Reviews: Saby D’silva

Project Editor: Rianna Rodrigues

Content Development Editor: Lucy Wan

Copy Editor: Safis Editing

Technical Editor: Aditya Sawant

Proofreader: Safis Editing

Indexer: Priyanka Dhadke

Presentation Designer: Pranit Padwal

First published: March 2021

Production reference: 1260321

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80107-129-1

www.packt.com

Contributors

About the author

David Mertz, Ph.D. is the founder of KDM Training, a partnership dedicated to educating developers and data scientists in machine learning and scientific computing. He created a data science training program for Anaconda Inc. and was a senior trainer for them. With the advent of deep neural networks, he has turned to training our robot overlords as well.

He previously worked for 8 years with D. E. Shaw Research and was also a Director of the Python Software Foundation for 6 years. David remains co-chair of its Trademarks Committee and Scientific Python Working Group. His columns, Charming Python and XML Matters, were once the most widely read articles in the Python world.

I give great thanks to those people who have helped make this book better.

First and foremost, I am thankful for the careful attention and insightful suggestions of my development editor Lucy Wan, and technical reviewer Miki Tebeka. Other colleagues and friends who have read and provided helpful comments on parts of this book, while it was in progress, include Micah Dubinko, Vladimir Shulyak, Laura Richter, Alessandra Smith, Mary Ann Sushinsky, Tim Churches, and Paris Finley.

The text in front of you is better for their kindnesses and intelligence; all errors and deficits remain mine entirely.

I also thank the thousands of contributors who have created the Free Software I used in the creation of this book, and in so much other work I do. No proprietary software was used by the author at any point in the production of this book. The operating system, text editors, plot creation tools, fonts, programming languages, shells, command-line tools, and all other software used belongs to our human community rather than to any exclusive private entity.

About the reviewer

Miki Tebeka is the CEO of 353solutions, and he has a passion for teaching and mentoring. He teaches many workshops on various technical subjects all over the world and has also mentored many young developers on their way to success. Miki is involved in open source, has several projects of his own, and has contributed to several more, including the Python project and the Go project. He has been writing software for 25 years.

Miki wrote Forging Python, Python Brain Teasers, Go Brain Teasers, and Pandas Brain Teasers, and is an author at LinkedIn Learning. He’s an organizer of the Go Israel Meetup, GopherCon Israel, and PyData Israel Conference.

Contents

Preface

PART I: Data Ingestion

Tabular Formats

Tidying Up

CSV

Sanity Checks

The Good, the Bad, and the Textual Data

The Bad

The Good

Spreadsheets Considered Harmful

SQL RDBMS

Massaging Data Types

Repeating in R

Where SQL Goes Wrong (and How to Notice It)

Other Formats

HDF5 and NetCDF-4

Tools and Libraries

SQLite

Apache Parquet

Data Frames

Spark/Scala

Pandas and Derived Wrappers

Vaex

Data Frames in R (Tidyverse)

Data Frames in R (data.table)

Bash for Fun

Exercises

Tidy Data from Excel

Tidy Data from SQL

Denouement

Hierarchical Formats

JSON

What JSON Looks Like

NaN Handling and Data Types

JSON Lines

GeoJSON

Tidy Geography

JSON Schema

XML

User Records

Keyhole Markup Language

Configuration Files

INI and Flat Custom Formats

TOML

Yet Another Markup Language

NoSQL Databases

Document-Oriented Databases

Missing Fields

Denormalization and Its Discontents

Key/Value Stores

Exercises

Exploring Filled Area

Create a Relational Model

Denouement

Repurposing Data Sources

Web Scraping

HTML Tables

Non-Tabular Data

Command-Line Scraping

Portable Document Format

Image Formats

Pixel Statistics

Channel Manipulation

Metadata

Binary Serialized Data Structures

Custom Text Formats

A Structured Log

Character Encodings

Exercises

Enhancing the NPY Parser

Scraping Web Traffic

Denouement

PART II: The Vicissitudes of Error

Anomaly Detection

Missing Data

SQL

Hierarchical Formats

Sentinels

Miscoded Data

Fixed Bounds

Outliers

Z-Score

Interquartile Range

Multivariate Outliers

Exercises

A Famous Experiment

Misspelled Words

Denouement

Data Quality

Missing Data

Biasing Trends

Understanding Bias

Detecting Bias

Comparison to Baselines

Benford’s Law

Class Imbalance

Normalization and Scaling

Applying a Machine Learning Model

Scaling Techniques

Factor and Sample Weighting

Cyclicity and Autocorrelation

Domain Knowledge Trends

Discovered Cycles

Bespoke Validation

Collation Validation

Transcription Validation

Exercises

Data Characterization

Oversampled Polls

Denouement

PART III: Rectification and Creation

Value Imputation

Typical-Value Imputation

Typical Tabular Data

Locality Imputation

Trend Imputation

Types of Trends

A Larger Coarse Time Series

Understanding the Data

Removing Unusable Data

Imputing Consistency

Interpolation

Non-Temporal Trends

Sampling

Undersampling

Oversampling

Exercises

Alternate Trend Imputation

Balancing Multiple Features

Denouement

Feature Engineering

Date/Time Fields

Creating Datetimes

Imposing Regularity

Duplicated Timestamps

Adding Timestamps

String Fields

Fuzzy Matching

Explicit Categories

String Vectors

Decompositions

Rotation and Whitening

Dimensionality Reduction

Visualization

Quantization and Binarization

One-Hot Encoding

Polynomial Features

Generating Synthetic Features

Feature Selection

Exercises

Intermittent Occurrences

Characterizing Levels

Denouement

PART IV: Ancillary Matters

Closure

What You Know

What You Don’t Know (Yet)

Glossary

Why subscribe?

Other Books You May Enjoy

Index

PART I

Data Ingestion