Description

Turn your noisy data into relevant, insight-ready information by leveraging the data wrangling techniques in Python and R

About This Book

  • This easy-to-follow guide takes you through every step of the data wrangling process in the best possible way
  • Work with different types of datasets, and reshape the layout of your data to make it easier for analysis
  • Get simple examples and real-life data wrangling solutions for data pre-processing

Who This Book Is For

If you are a data scientist, data analyst, or a statistician who wants to learn how to wrangle your data for analysis in the best possible manner, this book is for you. As this book covers both R and Python, some understanding of them will be beneficial.

What You Will Learn

  • Read a CSV file into Python and R, and print out some statistics on the data
  • Gain knowledge of the data formats and programming structures involved in retrieving API data
  • Make effective use of regular expressions in the data wrangling process
  • Explore the tools and packages available to prepare numerical data for analysis
  • Find out how to have better control over manipulating the structure of the data
  • Develop the dexterity to programmatically read, audit, correct, and shape data
  • Write and complete programs to take in, format, and output data sets

In Detail

Around 80% of the time in data analysis is spent on cleaning and preparing data for analysis. This is, however, an important task, and a prerequisite to the rest of the data analysis workflow, including visualization, analysis, and reporting. Python and R are popular choices of tools for data analysis, and both have packages for manipulating different kinds of data as per your requirements. This book will show you different data wrangling techniques and how you can leverage the power of Python and R packages to implement them.

You'll start by understanding the data wrangling process and get a solid foundation for working with different types of data. You'll work with different data structures and acquire and parse data from various sources. You'll also see how to reshape the layout of data and how to manipulate, summarize, and join datasets. The book concludes with a quick primer on databases: accessing and processing their data, conducting data exploration, and storing and retrieving data quickly.

The book includes practical examples for each of these topics, using simple, real-world datasets to make them easier to understand. By the end of the book, you'll have a thorough understanding of all the data wrangling concepts and how to implement them in the best possible way.

Style and approach

This is a practical book on data wrangling, designed to give you insight into the practical application of data wrangling techniques. It takes you through complex concepts and tasks in an accessible way, covering a wide range of data wrangling techniques with Python and R.


Page count: 219

Publication year: 2017




Practical Data Wrangling

Expert techniques for transforming your raw data into a valuable source for analytics

Allan Visochek

BIRMINGHAM - MUMBAI

Practical Data Wrangling

 

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: November 2017

 

Production reference: 1141117

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78728-613-9

 

www.packtpub.com

Credits

Author

Allan Visochek

Copy Editors

Tasneem Fatehi

Safis Editing

Reviewer

Adriano Longo

Project Coordinator

Manthan Patel

Commissioning Editor

Amey Varangaonkar

Proofreader

Safis Editing

Acquisition Editor

Malaika Monteiro

Indexer

Pratik Shirodkar

Content Development Editor

Aaryaman Singh

Graphics

Tania Dutta

Technical Editor

Dinesh Chaudhary

Production Coordinator

Shraddha Falebhai

About the Author

Allan Visochek is a freelance web developer and data analyst in New Haven, Connecticut. Outside of work, Allan has a deep interest in machine learning and artificial intelligence.

Allan thoroughly enjoys teaching and sharing knowledge. After graduating from the Udacity Data Analyst Nanodegree program, he was contracted to Udacity for several months as a forum mentor and project reviewer, offering guidance to students working on data analysis projects. He has also written technical content for learntoprogram.tv.

About the Reviewer

Adriano Longo is a freelance data analyst based in the Netherlands with a passion for Neo4j's relationship-oriented data model.

He specializes in querying, processing, and modeling data with Cypher, R, Python, and SQL, and worked on climate prediction models at UEA's Climatic Research Unit before focusing on analytical solutions for the private sector.

Today, Adriano uses Neo4j and linkurious.js to explore the complex web of relationships nefarious actors use to obfuscate their abuse of environmental and financial regulations, making dirty secrets less transparent, one graph at a time.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787286134. If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Programming with Data

Understanding data wrangling

Getting and reading data

Cleaning data

Shaping and structuring data

Storing data

The tools for data wrangling

Python

R

Summary

Introduction to Programming in Python

External resources

Logistical overview

Installation requirements

Using other learning resources

Python 2 versus Python 3

Running programs in Python

Using text editors to write and manage programs

Writing the hello world program

Using the terminal to run programs

Running the Hello World program

What if it didn't work?

Data types, variables, and the Python shell

Numbers - integers and floats

Why integers? 

Strings

Booleans

The print function

Variables

Adding to a variable

Subtracting from a variable

Multiplication

Division

Naming variables

Arrays (lists, if you ask Python)

Dictionaries 

Compound statements

Compound statement syntax and indentation level

For statements and iterables

If statements

Else and elif clauses

Functions

Passing arguments to a function

Returning values from a function

Making annotations within programs

A programmer's resources

Documentation

Online forums and mailing lists

Summary

Reading, Exploring, and Modifying Data - Part I

External resources

Logistical overview

Installation requirements

Data

File system setup

Introducing a basic data wrangling workflow

Introducing the JSON file format

Opening and closing a file in Python using file I/O

The open function and file objects

File structure - best practices to store your data

Opening a file

Reading the contents of a file

Modules in Python

Parsing a JSON file using the json module

Exploring the contents of a data file

Extracting the core content of the data

Listing out all of the variables in the data

Modifying a dataset

Extracting data variables from the original dataset

Using a for loop to iterate over the data

Using a nested for loop to iterate over the data variables

Outputting the modified data to a new file

Specifying input and output file names in the Terminal

Specifying the filenames from the Terminal

Summary

Reading, Exploring, and Modifying Data - Part II

Logistical overview

File system setup

Data

Installing pandas

Understanding the CSV format

Introducing the CSV module

Using the CSV module to read CSV data

Using the CSV module to write CSV data

Using the pandas module to read and process data

Counting the total road length in 2011 revisited

Handling non-standard CSV encoding and dialect

Understanding XML

XML versus JSON

Using the XML module to parse XML data

XPath

Summary

Manipulating Text Data - An Introduction to Regular Expressions

Logistical overview

Data

File structure setup

Understanding the need for pattern recognition

Introducing regular expressions

Writing and using a regular expression

Special characters

Matching whitespace

Matching the start of string

Matching the end of a string

Matching a range of characters

Matching any one of several patterns

Matching a sequence instead of just one character

Putting patterns together

Extracting a pattern from a string

The regex split() function

Python regex documentation

Looking for patterns

Quantifying the existence of patterns

Creating a regular expression to match the street address

Counting the number of matches

Verifying the correctness of the matches

Extracting patterns

Outputting the data to a new file

Summary

Cleaning Numerical Data - An Introduction to R and RStudio

Logistical overview

Data

Directory structure

Installing R and RStudio

Introducing R and RStudio

Familiarizing yourself with RStudio

Running R commands

Setting the working directory

Reading data

The R dataframe

R vectors

Indexing R dataframes

Finding the 2011 total in R

Conducting basic outlier detection and removal

Handling NA values

Deleting missing values

Replacing missing values with a constant

Imputation of missing values

Variable names and contents

Summary

Simplifying Data Manipulation with dplyr

Logistical overview

Data

File system setup

Installing the dplyr and tibble packages

Introducing dplyr

Getting started with dplyr

Chaining operations together

Filtering the rows of a dataframe

Summarizing data by category

Rewriting code using dplyr

Summary

Getting Data from the Web

Logistical overview

Filesystem setup

Installing the requests module

Internet connection

Introducing APIs

Using Python to retrieve data from APIs

Using URL parameters to filter the results

Summary

Working with Large Datasets

Logistical overview 

System requirements

Data

File system setup

Installing MongoDB

Planning out your time

Cleaning up

Understanding computer memory

Understanding databases

Introducing MongoDB

Interfacing with MongoDB from Python

Summary

Preface

Data rarely comes prepared for its end use. For any particular project, there may be too much data, too little data, missing data, erroneous data, poorly structured data, or improperly formatted data. This book is about how to gather the data that is available and produce an output that is ready to be used. In each of the chapters, one or more demonstrations are used to show a new approach to data wrangling.

What this book covers

Chapter 1, Programming with Data, discusses the context of data wrangling and offers a high-level overview of the rest of the book's content.

Section 1: A generalized programming approach to data wrangling

Chapter 2, Introduction to Programming in Python, introduces programming using the Python programming language, which is used in most of the chapters of the book.

Chapter 3, Reading, Exploring, and Modifying Data - Part I, is an overview of the steps for processing a data file and an introduction to JSON data.

Chapter 4, Reading, Exploring, and Modifying Data - Part II, continues from the previous chapter, extending to the CSV and XML data formats.

Chapter 5, Manipulating Text Data - An Introduction to Regular Expressions, is an introduction to regular expressions with the application of extracting street names from street addresses.

Section 2: A formulated approach to data wrangling

Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio, introduces R and RStudio with the application of cleaning numerical data.

Chapter 7, Simplifying Data Manipulation with dplyr, is an introduction to the dplyr package for R, which can be used to express multiple data processing steps elegantly and concisely.

Section 3: Advanced methods for retrieving and storing data

Chapter 8, Getting Data from the Web, is an introduction to APIs. This chapter shows how to extract data from APIs using Python.

Chapter 9, Working with Large Datasets, provides an overview of the issues involved in working with large amounts of data, along with a very brief introduction to MongoDB.

What you need for this book

You will need a Python 3 installation on your computer, and you will need to be able to execute Python from your operating system’s command-line interface. In addition, the following external Python modules will be used:

pandas (Chapters 4 and 5)

requests (Chapter 8)

PyMongo (Chapter 9)

For Chapter 9, you will need to install MongoDB and set up your own local MongoDB server. For Chapters 6 and 7, you will need RStudio and Rbase. Additionally, for Chapter 7, you will need the dplyr and tibble libraries.
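The Python modules above can usually be installed from the command line with pip (the module names below are as published on PyPI; your environment and package manager may differ):

```shell
pip install pandas requests pymongo
```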

Who this book is for

If you are a data scientist, data analyst, or statistician who wants to learn how to wrangle your data for analysis in the best possible manner, this book is for you. As this book covers both R and Python, some understanding of these will be beneficial.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:

1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Data-Wrangling. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/PracticalDataWrangling_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Programming with Data

It takes a lot of time and effort to deliver data in a format that is ready for its end use. Consider, for example, an online gaming site that wants to post the high score for each of its games every month. To make this data available, the site's developers would need to set up a database to keep data on all of the scores. In addition, they would need a system to retrieve the top scores every month from that database and display them to the end users.

For the users of our hypothetical gaming site, getting this month's high scores is fairly straightforward. This is because finding out what the high scores are is a rather general use case. A lot of people will want that specific data in that specific form, so it makes sense to develop a system to deliver the monthly high scores.

Unlike the users of our hypothetical gaming site, data programmers have very specialized use cases for the data that they work with. A data journalist following politics may want to visualize trends in government spending over the last few years. A machine learning engineer working in the medical industry may want to develop an algorithm to predict a patient's likelihood of returning to the hospital after a visit. A statistician working for the board of education may want to investigate the correlation between attendance and test scores. In the gaming site example, a data analyst may want to investigate how the distribution of scores changes based on the time of the day.

A short side note on terminology: data science, as an all-encompassing term, can be a bit elusive. As it is such a new field, the definition of a data scientist can change depending on who you ask. To be more general, the term data programmer will be used in this book to refer to anyone who will find data wrangling useful in their work.

Drawing insight from data requires that all the information that is needed is in a format that you can work with. Organizations that produce data (for example, governments, schools, hospitals, and web applications) can't anticipate the exact information that any given data programmer might need for their work. There are too many possible scenarios to make it worthwhile. Data is therefore generally made available in its raw format. Sometimes this is enough to work with, but usually it is not. Here are some common reasons:

There may be extra steps involved in getting the data

The information needed may be spread across multiple sources

Datasets may be too large to work with in their original format

There may be far more fields or information in a particular dataset than needed

Datasets may have misspellings, missing fields, mixed formats, incorrect entries, outliers, and so on

Datasets may be structured or formatted in a way that is not compatible with a particular application

Due to this, it is often the responsibility of the data programmer to perform the following functions:

Discover and gather the data that is needed (getting data)

Merge data from different sources if necessary (merging data)

Fix flaws in the data entries (cleaning data)

Extract the necessary data and put it in the proper structure (shaping data)

Store it in the proper format for further use (storing data)

This perspective helps give some context to the relevance and importance of data wrangling. Data wrangling is sometimes seen as the grunt work of the data programmer, but it is nevertheless an integral part of drawing insights from data. This book will guide you through the various skill sets, most common tools, and best practices for data wrangling. In the following section, I will break down the tasks involved in data wrangling and provide a broad overview of the rest of the book. I will discuss the following steps in detail and provide some examples:

Getting data

Cleaning data

Merging and shaping data

Storing data

Following the high-level overview, I will briefly discuss Python and R, the tools used in this book to conduct data wrangling. 

Understanding data wrangling

Data wrangling, broadly speaking, is the process of gathering data in its raw form and molding it into a form that is suitable for its end use. Preparing data for its end use can branch out into a number of different tasks based on the exact use case. This can make it rather hard to pin down exactly what data wrangling entails, and formulate how to go about it. Nevertheless, there are a number of common steps in the data wrangling process, as outlined in the following subsections. The approach that I will take in this book is to introduce a number of tools and practices that are often involved in data wrangling. Each of the chapters will consist of one or more exercises and/or projects that will demonstrate the application of a particular tool or approach. 

Getting and reading data

The first step is to retrieve a dataset and open it with a program capable of manipulating the data. The simplest way of retrieving a dataset is to find a data file. Python and R can be used to open, read, modify, and save data stored in static files. In Chapter 3, Reading, Exploring, and Modifying Data - Part I, I will introduce the JSON data format and show how to use Python to read, write, and modify JSON data. In Chapter 4, Reading, Exploring, and Modifying Data - Part II, I will walk through how to use Python to work with data files in the CSV and XML data formats. In Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio, I will introduce R and RStudio, and show how to use R to read and manipulate data.
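As a small taste of what Chapter 3 covers, here is a minimal sketch of parsing JSON in Python with the standard library's json module (the city names and population figures are made up for illustration):

```python
import json

# A small JSON document as a string; reading from a file works the same
# way with json.load(f) on an open file object.
raw = '{"New Haven": {"population": 129779}, "Hartford": {"population": 121054}}'

cities = json.loads(raw)  # parse the JSON text into nested Python dicts
print(cities["New Haven"]["population"])
```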

Larger data sources are often made available through web interfaces called application programming interfaces (APIs). APIs allow you to retrieve specific bits of data from a larger collection of data. Web APIs can be great resources for data that is otherwise hard to get. In Chapter 8, Getting Data from the Web, I discuss APIs in detail and walk through the use of Python to extract data from APIs.
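The pattern usually looks something like the following sketch using the requests module; the endpoint URL and query parameters here are placeholders, not a real API:

```python
import requests

# Placeholder endpoint and parameters -- substitute those of a real API.
URL = "https://api.example.com/v1/scores"
params = {"game": "asteroids", "month": "2017-11"}

def fetch_json(url, params):
    """Request a resource from a web API and return the parsed JSON body."""
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.json()       # parse the JSON body into Python objects

# data = fetch_json(URL, params)  # uncomment against a real endpoint
```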

Another possible source of data is a database. I won't go into detail on the use of databases in this book, though in Chapter 9, Working with Large Datasets, I will show how to interact with a particular database using Python.

Databases are collections of data organized to optimize quick retrieval. They can be particularly useful when we need to work incrementally on very large datasets, and they can, of course, serve as a source of data in their own right.

Cleaning data

When working with data, you can generally expect to find human errors, missing entries, and numerical outliers. These types of errors usually need to be corrected, handled, or removed to prepare a dataset for analysis.
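To make the idea concrete, here is a tiny Python sketch, with made-up values, of two such fixes: removing a missing entry and dropping an implausible outlier. Later chapters perform these tasks with proper tools in R:

```python
import math

# Made-up measurements with a missing entry (NaN) and an obvious outlier.
lengths = [12.1, 11.8, math.nan, 12.4, 990.0]

# Remove missing values, then drop anything implausibly large.
cleaned = [x for x in lengths if not math.isnan(x)]
cleaned = [x for x in cleaned if x < 100]

print(cleaned)  # → [12.1, 11.8, 12.4]
```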

In Chapter 5, Manipulating Text Data - An Introduction to Regular Expressions, I will demonstrate how to use regular expressions, a tool for identifying, extracting, and modifying patterns in text data. The chapter includes a project that uses regular expressions to extract street names from street addresses.
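For a flavor of that project, here is a minimal sketch with made-up addresses and a deliberately simple pattern (real address data needs a more careful expression):

```python
import re

# Made-up addresses; the pattern captures everything after the house number.
addresses = ["123 Elm Street", "77 Whitney Avenue", "9 Church Street"]
street_pattern = re.compile(r"^\d+\s+(.+)$")  # digits, whitespace, street name

streets = []
for address in addresses:
    match = street_pattern.match(address)
    if match:
        streets.append(match.group(1))

print(streets)  # → ['Elm Street', 'Whitney Avenue', 'Church Street']
```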

In Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio, I will demonstrate how to use RStudio to conduct two common tasks for cleaning numerical data: outlier detection and NA handling.

Shaping and structuring data

Preparing data for its end use often requires both structuring and organizing the data in the correct manner. 

To illustrate this, suppose you have a hierarchical dataset of city populations, as shown in Figure 01:

Figure 01: Hierarchical structure of the population of cities

If the goal is to create a histogram of city populations, the preceding format would be hard to work with. Not only is the city population information nested within the data structure, it is nested to varying depths. For the purposes of creating a histogram, it is better to represent the data as a flat list of numbers, as shown in Figure 02:

Figure 02: List of populations for histogram visualization
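That reshaping step can be sketched in Python as follows; the place names and population figures are invented, but the nesting at varying depths mirrors the structure in Figure 01:

```python
# A hypothetical nested structure: populations at varying depths.
nested = {
    "Connecticut": {
        "New Haven County": {"New Haven": 129779, "Hamden": 61476},
        "Hartford County": {"Hartford": 121054},
    },
    "Rhode Island": {"Providence": 179154},
}

def extract_populations(node):
    """Recursively collect the leaf values (populations) from a nested dict."""
    populations = []
    for value in node.values():
        if isinstance(value, dict):
            populations.extend(extract_populations(value))
        else:
            populations.append(value)
    return populations

print(sorted(extract_populations(nested)))  # → [61476, 121054, 129779, 179154]
```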