Turn your noisy data into relevant, insight-ready information by leveraging data wrangling techniques in Python and R
If you are a data scientist, data analyst, or statistician who wants to learn how to wrangle your data for analysis in the best possible manner, this book is for you. As this book covers both R and Python, some understanding of them will be beneficial.
Around 80% of the time in data analysis is spent on cleaning and preparing data. This is, however, an important task, and a prerequisite to the rest of the data analysis workflow, including visualization, analysis, and reporting. Python and R are popular tools for data analysis, and each has packages for manipulating different kinds of data to suit your requirements. This book will show you different data wrangling techniques and how you can leverage the power of Python and R packages to implement them.
You'll start by understanding the data wrangling process and build a solid foundation for working with different types of data. You'll work with different data structures and acquire and parse data from various locations. You'll also see how to reshape the layout of data and manipulate, summarize, and join datasets. Finally, the book concludes with a quick primer on accessing and processing data from the web, conducting data exploration, and storing and retrieving data quickly using databases.
The book includes practical examples for each of these topics, using simple real-world datasets to make them easier to understand. By the end of the book, you'll have a thorough understanding of all the data wrangling concepts and how to implement them in the best possible way.
This is a practical book designed to give you insight into the real-world application of data wrangling. It takes you through complex concepts and tasks in an accessible way, covering a wide range of data wrangling techniques with Python and R.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: November 2017
Production reference: 1141117
ISBN 978-1-78728-613-9
www.packtpub.com
Author
Allan Visochek
Copy Editors
Tasneem Fatehi
Safis Editing
Reviewer
Adriano Longo
Project Coordinator
Manthan Patel
Commissioning Editor
Amey Varangaonkar
Proofreader
Safis Editing
Acquisition Editor
Malaika Monteiro
Indexer
Pratik Shirodkar
Content Development Editor
Aaryaman Singh
Graphics
Tania Dutta
Technical Editor
Dinesh Chaudhary
Production Coordinator
Shraddha Falebhai
Allan Visochek is a freelance web developer and data analyst in New Haven, Connecticut. Outside of work, Allan has a deep interest in machine learning and artificial intelligence.
Allan thoroughly enjoys teaching and sharing knowledge. After graduating from the Udacity Data Analyst Nanodegree program, he was contracted by Udacity for several months as a forum mentor and project reviewer, offering guidance to students working on data analysis projects. He has also written technical content for learntoprogram.tv.
Adriano Longo is a freelance data analyst based in the Netherlands with a passion for Neo4j's relationship-oriented data model.
He specializes in querying, processing, and modeling data with Cypher, R, Python, and SQL, and worked on climate prediction models at UEA's Climatic Research Unit before focusing on analytical solutions for the private sector.
Today, Adriano uses Neo4j and linkurious.js to explore the complex web of relationships that nefarious actors use to obfuscate their abuse of environmental and financial regulations, making dirty secrets more transparent, one graph at a time.
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787286134. If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Programming with Data
Understanding data wrangling
Getting and reading data
Cleaning data
Shaping and structuring data
Storing data
The tools for data wrangling
Python
R
Summary
Introduction to Programming in Python
External resources
Logistical overview
Installation requirements
Using other learning resources
Python 2 versus Python 3
Running programs in Python
Using text editors to write and manage programs
Writing the Hello World program
Using the terminal to run programs
Running the Hello World program
What if it didn't work?
Data types, variables, and the Python shell
Numbers - integers and floats
Why integers?
Strings
Booleans
The print function
Variables
Adding to a variable
Subtracting from a variable
Multiplication
Division
Naming variables
Arrays (lists, if you ask Python)
Dictionaries
Compound statements
Compound statement syntax and indentation level
For statements and iterables
If statements
Else and elif clauses
Functions
Passing arguments to a function
Returning values from a function
Making annotations within programs
A programmer's resources
Documentation
Online forums and mailing lists
Summary
Reading, Exploring, and Modifying Data - Part I
External resources
Logistical overview
Installation requirements
Data
File system setup
Introducing a basic data wrangling workflow
Introducing the JSON file format
Opening and closing a file in Python using file I/O
The open function and file objects
File structure - best practices to store your data
Opening a file
Reading the contents of a file
Modules in Python
Parsing a JSON file using the json module
Exploring the contents of a data file
Extracting the core content of the data
Listing out all of the variables in the data
Modifying a dataset
Extracting data variables from the original dataset
Using a for loop to iterate over the data
Using a nested for loop to iterate over the data variables
Outputting the modified data to a new file
Specifying input and output file names in the Terminal
Specifying the filenames from the Terminal
Summary
Reading, Exploring, and Modifying Data - Part II
Logistical overview
File system setup
Data
Installing pandas
Understanding the CSV format
Introducing the CSV module
Using the CSV module to read CSV data
Using the CSV module to write CSV data
Using the pandas module to read and process data
Counting the total road length in 2011 revisited
Handling non-standard CSV encoding and dialect
Understanding XML
XML versus JSON
Using the XML module to parse XML data
XPath
Summary
Manipulating Text Data - An Introduction to Regular Expressions
Logistical overview
Data
File structure setup
Understanding the need for pattern recognition
Introducing regular expressions
Writing and using a regular expression
Special characters
Matching whitespace
Matching the start of a string
Matching the end of a string
Matching a range of characters
Matching any one of several patterns
Matching a sequence instead of just one character
Putting patterns together
Extracting a pattern from a string
The regex split() function
Python regex documentation
Looking for patterns
Quantifying the existence of patterns
Creating a regular expression to match the street address
Counting the number of matches
Verifying the correctness of the matches
Extracting patterns
Outputting the data to a new file
Summary
Cleaning Numerical Data - An Introduction to R and RStudio
Logistical overview
Data
Directory structure
Installing R and RStudio
Introducing R and RStudio
Familiarizing yourself with RStudio
Running R commands
Setting the working directory
Reading data
The R dataframe
R vectors
Indexing R dataframes
Finding the 2011 total in R
Conducting basic outlier detection and removal
Handling NA values
Deleting missing values
Replacing missing values with a constant
Imputation of missing values
Variable names and contents
Summary
Simplifying Data Manipulation with dplyr
Logistical overview
Data
File system setup
Installing the dplyr and tibble packages
Introducing dplyr
Getting started with dplyr
Chaining operations together
Filtering the rows of a dataframe
Summarizing data by category
Rewriting code using dplyr
Summary
Getting Data from the Web
Logistical overview
File system setup
Installing the requests module
Internet connection
Introducing APIs
Using Python to retrieve data from APIs
Using URL parameters to filter the results
Summary
Working with Large Datasets
Logistical overview
System requirements
Data
File system setup
Installing MongoDB
Planning out your time
Cleaning up
Understanding computer memory
Understanding databases
Introducing MongoDB
Interfacing with MongoDB from Python
Summary
Data rarely comes prepared for its end use. For any particular project, there may be too much data, too little data, missing data, erroneous data, poorly structured data, or improperly formatted data. This book is about how to gather the data that is available and produce an output that is ready to be used. In each of the chapters, one or more demonstrations are used to show a new approach to data wrangling.
Chapter 1, Programming with Data, discusses the context of data wrangling and offers a high-level overview of the rest of the book's content.
Section 1: A generalized programming approach to data wrangling
Chapter 2, Introduction to Programming in Python, introduces programming using the Python programming language, which is used in most of the chapters of this book.
Chapter 3, Reading, Exploring, and Modifying Data - Part I, is an overview of the steps for processing a data file and an introduction to JSON data.
Chapter 4, Reading, Exploring, and Modifying Data - Part II, continues from the previous chapter, extending to the CSV and XML data formats.
Chapter 5, Manipulating Text Data - An Introduction to Regular Expressions, is an introduction to regular expressions with the application of extracting street names from street addresses.
Section 2: A formulated approach to data wrangling
Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio, introduces R and RStudio with the application of cleaning numerical data.
Chapter 7, Simplifying Data Manipulation with dplyr, is an introduction to the dplyr package for R, which can be used to express multiple data processing steps elegantly and concisely.
Section 3: Advanced methods for retrieving and storing data
Chapter 8, Getting Data from the Web, is an introduction to APIs. This chapter shows how to extract data from APIs using Python.
Chapter 9, Working with Large Datasets, provides an overview of the issues that arise when working with large amounts of data, along with a very brief introduction to MongoDB.
You will need a Python 3 installation on your computer, and you will need to be able to execute Python from your operating system’s command-line interface. In addition, the following external Python modules will be used:
pandas (Chapters 4 and 5)
requests (Chapter 8)
PyMongo (Chapter 9)
For Chapter 9, you will need to install MongoDB and set up your own local MongoDB server. For Chapters 6 and 7, you will need R and RStudio. Additionally, for Chapter 7, you will need the dplyr and tibble packages.
If you are a data scientist, data analyst, or statistician who wants to learn how to wrangle your data for analysis in the best possible manner, this book is for you. As this book covers both R and Python, some understanding of these will be beneficial.
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:
1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer over the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Data-Wrangling. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/PracticalDataWrangling_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
It takes a lot of time and effort to deliver data in a format that is ready for its end use. Consider the example of an online gaming site that wants to post the high score for each of its games every month. In order to make this data available, the site's developers would need to set up a database to keep data on all of the scores. In addition, they would need a system to retrieve the top scores from that database every month and display them to the end users.
For the users of our hypothetical gaming site, getting this month's high scores is fairly straightforward. This is because finding out what the high scores are is a rather general use case. A lot of people will want that specific data in that specific form, so it makes sense to develop a system to deliver the monthly high scores.
Unlike the users of our hypothetical gaming site, data programmers have very specialized use cases for the data that they work with. A data journalist following politics may want to visualize trends in government spending over the last few years. A machine learning engineer working in the medical industry may want to develop an algorithm to predict a patient's likelihood of returning to the hospital after a visit. A statistician working for the board of education may want to investigate the correlation between attendance and test scores. In the gaming site example, a data analyst may want to investigate how the distribution of scores changes based on the time of the day.
Drawing insight from data requires that all of the necessary information be available in a format you can work with. Organizations that produce data (for example, governments, schools, hospitals, and web applications) can't anticipate the exact information that any given data programmer might need for their work; there are too many possible scenarios to make it worthwhile. Data is therefore generally made available in its raw format. Sometimes this is enough to work with, but usually it is not. Here are some common reasons:
There may be extra steps involved in getting the data
The information needed may be spread across multiple sources
Datasets may be too large to work with in their original format
There may be far more fields or information in a particular dataset than needed
Datasets may have misspellings, missing fields, mixed formats, incorrect entries, outliers, and so on
Datasets may be structured or formatted in a way that is not compatible with a particular application
Due to this, it is often the responsibility of the data programmer to perform the following functions:
Discover and gather the data that is needed (getting data)
Merge data from different sources if necessary (merging data)
Fix flaws in the data entries (cleaning data)
Extract the necessary data and put it in the proper structure (shaping data)
Store it in the proper format for further use (storing data)
This perspective helps give some context to the relevance and importance of data wrangling. Data wrangling is sometimes seen as the grunt work of the data programmer, but it is nevertheless an integral part of drawing insights from data. This book will guide you through the various skill sets, most common tools, and best practices for data wrangling. In the following section, I will break down the tasks involved in data wrangling and provide a broad overview of the rest of the book. I will discuss the following steps in detail and provide some examples:
Getting data
Cleaning data
Merging and shaping data
Storing data
Following the high-level overview, I will briefly discuss Python and R, the tools used in this book to conduct data wrangling.
Data wrangling, broadly speaking, is the process of gathering data in its raw form and molding it into a form that is suitable for its end use. Preparing data for its end use can branch out into a number of different tasks based on the exact use case. This can make it rather hard to pin down exactly what data wrangling entails, and formulate how to go about it. Nevertheless, there are a number of common steps in the data wrangling process, as outlined in the following subsections. The approach that I will take in this book is to introduce a number of tools and practices that are often involved in data wrangling. Each of the chapters will consist of one or more exercises and/or projects that will demonstrate the application of a particular tool or approach.
The first step is to retrieve a dataset and open it with a program capable of manipulating the data. The simplest way of retrieving a dataset is to find a data file. Python and R can be used to open, read, modify, and save data stored in static files. In Chapter 3, Reading, Exploring, and Modifying Data - Part I, I will introduce the JSON data format and show how to use Python to read, write, and modify JSON data. In Chapter 4, Reading, Exploring, and Modifying Data - Part II, I will walk through how to use Python to work with data files in the CSV and XML formats. In Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio, I will introduce R and RStudio and show how to use R to read and manipulate data.
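As a small preview of what this looks like in practice, here is a minimal sketch of reading a JSON file with Python's built-in json module; the file name data.json and its contents are hypothetical placeholders:

import json

# Open the file, parse its JSON contents, and close the file automatically
with open('data.json', 'r') as infile:
    data = json.load(infile)

# data is now an ordinary Python object (typically a dict or a list)
print(type(data))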
Larger data sources are often made available through web interfaces called application programming interfaces (APIs). APIs allow you to retrieve specific bits of data from a larger collection of data. Web APIs can be great resources for data that is otherwise hard to get. In Chapter 8, Getting Data from the Web, I will discuss APIs in detail and walk through using Python to extract data from them.
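To give a rough idea of the pattern ahead of Chapter 8, the following sketch retrieves data from a web API with the requests module; the URL and parameter names are hypothetical placeholders, not a real API:

import requests

# Ask a (hypothetical) web API for a specific slice of its data
response = requests.get(
    'https://api.example.com/v1/records',
    params={'year': 2016, 'format': 'json'},
)

# Many APIs return JSON, which requests can parse directly
records = response.json()
print(len(records))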
Another possible source of data is a database. I won't go into detail on the use of databases in this book, though in Chapter 9, Working with Large Datasets, I will show how to interact with a particular database using Python.
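For illustration only, here is what a minimal interaction with MongoDB from Python might look like, using the PyMongo module that appears in Chapter 9 and assuming a MongoDB server is running locally; the database and collection names are made up:

from pymongo import MongoClient

# Connect to a MongoDB server on the local machine (default port)
client = MongoClient('localhost', 27017)
collection = client['wrangling_example']['records']

# Store a document, then retrieve it by one of its fields
collection.insert_one({'name': 'example', 'value': 42})
print(collection.find_one({'name': 'example'}))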
When working with data, you can generally expect to find human errors, missing entries, and numerical outliers. These types of errors usually need to be corrected, handled, or removed to prepare a dataset for analysis.
In Chapter 5, Manipulating Text Data - An Introduction to Regular Expressions, I will demonstrate how to use regular expressions, a tool for identifying, extracting, and modifying patterns in text data. The chapter includes a project that uses regular expressions to extract street names from street addresses.
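As a taste of what regular expressions can do, the following sketch uses Python's built-in re module to pull a street name out of an address; the pattern and the sample address are simplified stand-ins for the fuller treatment in Chapter 5:

import re

address = '1600 Pennsylvania Avenue'

# Match a leading house number and whitespace, then capture the rest
match = re.match(r'\d+\s+(.+)', address)
if match:
    print(match.group(1))  # prints: Pennsylvania Avenue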
In Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio, I will demonstrate how to use R to conduct two common tasks for cleaning numerical data: outlier detection and handling NA values.
Preparing data for its end use often requires both structuring and organizing the data in the correct manner.
To illustrate this, suppose you have a hierarchical dataset of city populations, as shown in Figure 01:
If the goal is to create a histogram of city populations, the previous data format would be hard to work with. Not only are the city populations nested within the data structure, but they are nested to varying depths. For the purposes of creating a histogram, it is better to represent the data as a flat list of numbers, as shown in Figure 02:
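Since the figures are not reproduced here, the following minimal Python sketch illustrates the same idea with made-up data: the populations sit at varying depths inside the nested structure (as in Figure 01), and a recursive walk pulls them out into the flat list needed for a histogram (as in Figure 02):

# A hypothetical hierarchical dataset of city populations (Figure 01)
data = {
    'United States': {
        'Connecticut': {'New Haven': 130000, 'Hartford': 123000},
        'California': {'Los Angeles': 3976000},
    },
    'Monaco': 38695,  # some entries are nested less deeply than others
}

# Recursively walk the structure, collecting every number (Figure 02)
def collect_populations(node):
    if isinstance(node, dict):
        populations = []
        for child in node.values():
            populations.extend(collect_populations(child))
        return populations
    return [node]

print(collect_populations(data))  # [130000, 123000, 3976000, 38695]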
