Over 80 recipes to help you breeze through your data analysis projects using R
This book is for data scientists, analysts, and even enthusiasts who want to learn and implement various data analysis techniques using R in a practical way. Those looking for quick, handy solutions to common tasks and challenges in data analysis will find this book very useful. Basic knowledge of statistics and R programming is assumed.
Data analytics with R has emerged as a very important focus for organizations of all kinds. R enables even those with only an intuitive grasp of the underlying concepts, without a deep mathematical background, to unleash powerful and detailed examinations of their data.
This book will show you how to put your R data analysis skills to practical use, with recipes catering to basic as well as advanced data analysis tasks. From acquiring your data and preparing it for analysis to the more complex data analysis techniques, the book will show you how to implement each technique in the best possible manner. You will also visualize your data using popular R packages such as ggplot2 and gain hidden insights from it. Starting with basic data analysis concepts such as handling your data and creating basic plots, you will master more advanced techniques such as performing cluster analysis and generating effective analysis reports and visualizations. Throughout the book, you will get to know the common problems and obstacles you might encounter while implementing each of the data analysis techniques in R, along with ways to overcome them in the easiest possible way.
By the end of this book, you will have all the knowledge you need to become an expert in data analysis with R, and put your skills to test in real-world scenarios.
Page count: 464
Publication year: 2017
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: May 2015
Second Edition: September 2017
Production reference: 1150917
ISBN 978-1-78712-447-9
www.packtpub.com
Author
Kuntal Ganguly
Copy Editor
Manisha Sinha
Reviewers
Davor Lozić
Daniel Alvarez Rojas
Project Coordinator
Manthan Patel
Commissioning Editor
Amey Varangaonkar
Proofreader
Safis Editing
Acquisition Editor
Tushar Gupta
Indexer
Tejal Daruwale Soni
Content Development Editor
Tejas Limkar
Graphics
Tania Dutta
Technical Editor
Sagar Sawant
Production Coordinator
Deepika Naik
Kuntal Ganguly is a big data analytics engineer focused on building large-scale, data-driven systems using big data frameworks and machine learning. He has around seven years of experience building big data and machine learning applications.
Kuntal helps AWS customers build real-time analytics systems using AWS services and open source Hadoop ecosystem technologies such as Spark, Kafka, Storm, and Flink, along with machine learning and deep learning frameworks.
Kuntal enjoys hands-on software development, and has single-handedly conceived, architected, developed, and deployed several large-scale distributed applications. Besides being an open source contributor, he is a machine learning and deep learning practitioner and is very passionate about building intelligent applications.
I am grateful to my mother, Chitra Ganguly, and father, Gopal Ganguly, for their love and support and for teaching me much about hard work, and even the little I have absorbed has helped me immensely throughout my life. I would also like to thank all my friends, colleagues, and mentors that I've had over the years.
You can reach Kuntal on LinkedIn at https://in.linkedin.com/in/kuntal-ganguly-59564088
Davor Lozić is a senior software engineer interested in various subjects, especially computer security, algorithms, and data structures. He manages teams of 15+ engineers and is a part-time assistant professor who lectures about database systems, Java, and interoperability. You can visit his website at http://warriorkitty.com and contact him from there. He likes cats! If you want to talk about any aspect of technology or if you have funny pictures of cats, feel free to contact him.
Daniel Alvarez Rojas is currently a data scientist at Hova Health, an IT/consulting company in the health sector. With experience in statistics, marketing, and BI, Daniel holds a BA in Business and Marketing and works in government consulting, helping health managers and directors make data-driven decisions to solve industry challenges. He has spent years as an analyst in logistics companies, working on optimization and predictive models.
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review. If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Acquire and Prepare the Ingredients - Your Data
Introduction
Working with data
Reading data from CSV files
Getting ready
How to do it...
How it works...
There's more...
Handling different column delimiters
Handling column headers/variable names
Handling missing values
Reading strings as characters and not as factors
Reading data directly from a website
Reading XML data
Getting ready
How to do it...
How it works...
There's more...
Extracting HTML table data from a web page
Extracting a single HTML table from a web page
Reading JSON data
Getting ready
How to do it...
How it works...
Reading data from fixed-width formatted files
Getting ready
How to do it...
How it works...
There's more...
Files with headers
Excluding columns from data
Reading data from R files and R libraries
Getting ready
How to do it...
How it works...
There's more...
Saving all objects in a session
Saving objects selectively in a session
Attaching/detaching R data files to an environment
Listing all datasets in loaded packages
Removing cases with missing values
Getting ready
How to do it...
How it works...
There's more...
Eliminating cases with NA for selected variables
Finding cases that have no missing values
Converting specific values to NA
Excluding NA values from computations
Replacing missing values with the mean
Getting ready
How to do it...
How it works...
There's more...
Imputing random values sampled from non-missing values
Removing duplicate cases
Getting ready
How to do it...
How it works...
There's more...
Identifying duplicates without deleting them
Rescaling a variable to specified min-max range
Getting ready
How to do it...
How it works...
There's more...
Rescaling many variables at once
See also
Normalizing or standardizing data in a data frame
Getting ready
How to do it...
How it works...
There's more...
Standardizing several variables simultaneously
See also
Binning numerical data
Getting ready
How to do it...
How it works...
There's more...
Creating a specified number of intervals automatically
Creating dummies for categorical variables
Getting ready
How to do it...
How it works...
There's more...
Choosing which variables to create dummies for
Handling missing data
Getting ready
How to do it...
How it works...
There's more...
Understanding missing data pattern
Correcting data
Getting ready
How to do it...
How it works...
There's more...
Combining multiple columns to single columns
Splitting single column to multiple columns
Imputing data
Getting ready
How to do it...
How it works...
There's more...
Detecting outliers
Getting ready
How to do it...
How it works...
There's more...
Treating the outliers with mean/median imputation
Handling extreme values with capping
Transforming and binning values
Outlier detection with LOF
What's in There - Exploratory Data Analysis
Introduction
Creating standard data summaries
Getting ready
How to do it...
How it works...
There's more...
Using the str() function for an overview of a data frame
Computing the summary and the str() function for a single variable
Finding other measures
Extracting a subset of a dataset
Getting ready
How to do it...
How it works...
There's more...
Excluding columns
Selecting based on multiple values
Selecting using logical vector
Splitting a dataset
Getting ready
How to do it...
How it works...
Creating random data partitions
Getting ready
How to do it...
Case 1 - Numerical target variable and two partitions
Case 2 - Numerical target variable and three partitions
Case 3 - Categorical target variable and two partitions
Case 4 - Categorical target variable and three partitions
How it works...
There's more...
Using a convenience function for partitioning
Sampling from a set of values
Generating standard plots, such as histograms, boxplots, and scatterplots
Getting ready
How to do it...
Creating histograms
Creating boxplots
Creating scatterplots
Creating scatterplot matrices
How it works...
Histograms
Boxplots
There's more...
Overlay a density plot on a histogram
Overlay a regression line on a scatterplot
Color specific points on a scatterplot
Generating multiple plots on a grid
Getting ready
How to do it...
How it works...
Graphics parameters
Creating plots with the lattice package
Getting ready
How to do it...
How it works...
There's more...
Adding flair to your graphs
See also
Creating charts that facilitate comparisons
Getting ready
How to do it...
Using base plotting system
How it works...
There's more...
Creating beanplots with the beanplot package
See also
Creating charts that help to visualize possible causality
Getting ready
How to do it...
How it works...
See also
Where Does It Belong? Classification
Introduction
Generating error/classification confusion matrices
Getting ready
How to do it...
How it works...
There's more...
Visualizing the error/classification confusion matrix
Comparing the model's performance for different classes
Principal Component Analysis
Getting ready
How to do it...
How it works...
Generating receiver operating characteristic charts
Getting ready
How to do it...
How it works...
There's more...
Using arbitrary class labels
Building, plotting, and evaluating with classification trees
Getting ready
How to do it...
How it works...
There's more...
Computing raw probabilities
Creating the ROC chart
See also
Using random forest models for classification
Getting ready
How to do it...
How it works...
There's more...
Computing raw probabilities
Generating the ROC chart
Specifying cutoffs for classification
See also
Classifying using the support vector machine approach
Getting ready
How to do it...
How it works...
There's more...
Controlling the scaling of variables
Determining the type of SVM model
Assigning weights to the classes
Choosing the cost of SVM
Tuning the SVM
See also
Classifying using the Naive Bayes approach
Getting ready
How to do it...
How it works...
See also
Classifying using the KNN approach
Getting ready
How to do it...
How it works...
There's more...
Automating the process of running KNN for many k values
Selecting appropriate values of k using caret
Using KNN to compute raw probabilities instead of classifications
Using neural networks for classification
Getting ready
How to do it...
How it works...
There's more...
Exercising greater control over nnet
Generating raw probabilities and plotting the ROC curve
Classifying using linear discriminant function analysis
Getting ready
How to do it...
How it works...
There's more...
Using the formula interface for lda
See also
Classifying using logistic regression
Getting ready
How to do it...
How it works...
Text classification for sentiment analysis
Getting ready
How to do it...
How it works...
Give Me a Number - Regression
Introduction
Computing the root-mean-square error
Getting ready
How to do it...
How it works...
There's more...
Using a convenience function to compute the RMS error
Building KNN models for regression
Getting ready
How to do it...
How it works...
There's more...
Running KNN with cross-validation in place of a validation partition
Using a convenience function to run KNN
Using a convenience function to run KNN for multiple k values
See also
Performing linear regression
Getting ready
How to do it...
How it works...
There's more...
Forcing lm to use a specific factor level as the reference
Using other options in the formula expression for linear models
See also
Performing variable selection in linear regression
Getting ready
How to do it...
How it works...
See also
Building regression trees
Getting ready
How to do it...
How it works...
There's more...
Generating regression trees for data with categorical predictors
Generating regression trees using the ensemble method - Bagging and Boosting
See also
Building random forest models for regression
Getting ready
How to do it...
How it works...
There's more...
Controlling forest generation
See also
Using neural networks for regression
Getting ready
How to do it...
How it works...
See also
Performing k-fold cross-validation
Getting ready
How to do it...
How it works...
See also
Performing leave-one-out cross-validation to limit overfitting
How to do it...
How it works...
See also
Can you Simplify That? Data Reduction Techniques
Introduction
Performing cluster analysis using hierarchical clustering
Getting ready
How to do it...
How it works...
There's more...
Cutting trees into clusters
Getting ready
How to do it...
How it works...
Performing cluster analysis using partitioning clustering
Getting ready
How to do it...
How it works...
There's more...
Image segmentation using mini-batch K-means
Getting ready
How to do it...
Partitioning around medoids
Getting ready
How to do it...
How it works...
Clustering large applications
Getting ready
How to do it...
How it works...
Performing cluster validation
Getting ready
How to do it...
How it works...
Performing advanced clustering
Density-based spatial clustering of applications with noise
Getting ready
How to do it...
How it works...
Model-based clustering with the EM algorithm
Getting ready
How to do it...
How it works...
Reducing dimensionality with principal component analysis
Getting ready
How to do it...
How it works...
Lessons from History - Time Series Analysis
Introduction
Exploring finance datasets
Getting ready
How to do it...
How it works...
There's more...
Creating and examining date objects
Getting ready
How to do it...
How it works...
Operating on date objects
Getting ready
How to do it...
How it works...
See also
Performing preliminary analyses on time series data
Getting ready
How to do it...
How it works...
See also
Using time series objects
Getting ready
How to do it...
How it works...
See also
Decomposing time series
Getting ready
How to do it...
How it works...
See also
Filtering time series data
Getting ready
How to do it...
How it works...
See also
Smoothing and forecasting using the Holt-Winters method
Getting ready
How to do it...
How it works...
See also
Building an automated ARIMA model
Getting ready
How to do it...
How it works...
See also
How does it look? - Advanced data visualization
Introduction
Creating scatter plots
Getting ready
How to do it...
How it works...
There's more...
Graph using qplot
Creating line graphs
Getting ready
How to do it...
How it works...
Creating bar graphs
Getting ready
How to do it...
Creating bar charts with ggplot2
How it works...
Making distribution plots
Getting ready
How to do it...
How it works...
Creating mosaic graphs
Getting ready
How to do it...
How it works...
Making treemaps
Getting ready
How to do it...
How it works...
Plotting a correlations matrix
Getting ready
How to do it...
How it works...
There's more...
Visualizing a correlation matrix with ggplot2
Creating heatmaps
Getting ready
How to do it...
How it works...
There's more...
Plotting a heatmap over geospatial data
See also
Plotting network graphs
Getting ready
How to do it...
How it works...
See also
Labeling and legends
Getting ready
How to do it...
How it works...
Coloring and themes
Getting ready
How to do it...
How it works...
Creating multivariate plots
Getting ready
How to do it...
How it works...
There's more...
Multivariate plots with the GGally package
Creating 3D graphs and animation
Getting ready
How to do it...
How it works...
There's more...
Adding text to an existing 3D plot
Using a 3D histogram
Using a line graph
Selecting a graphics device
Getting ready
How to do it...
How it works...
This may also interest you - Building Recommendations
Introduction
Building collaborative filtering systems
Getting ready
How to do it...
How it works...
There's more...
Using collaborative filtering on binary data
Performing content-based systems
Getting ready
How to do it...
How it works...
Building hybrid systems
Getting ready
How to do it...
How it works...
Performing similarity measures
Getting ready
How to do it...
How it works...
Application of ML algorithms - image recognition system
Getting ready
How to do it...
How it works...
Evaluating models and optimization
Getting ready
How to do it...
How it works...
There's more...
Identifying a suitable model
Optimizing parameters
A practical example - fraud detection system
Getting ready
How to do it...
How it works...
It's All About Your Connections - Social Network Analysis
Introduction
Downloading social network data using public APIs
Getting ready
How to do it...
How it works...
See also
Creating adjacency matrices and edge lists
Getting ready
How to do it...
How it works...
See also
Plotting social network data
Getting ready
How to do it...
How it works...
There's more...
Specifying plotting preferences
Plotting directed graphs
Creating a graph object with weights
Extracting the network as an adjacency matrix from the graph object
Extracting an adjacency matrix with weights
Extracting an edge list from a graph object
Creating a bipartite network graph
Generating projections of a bipartite network
Computing important network metrics
Getting ready
How to do it...
How it works...
There's more...
Getting edge sequences
Getting immediate and distant neighbors
Adding vertices or nodes
Adding edges
Deleting isolates from a graph
Creating subgraphs
Cluster analysis
Getting ready
How to do it...
How it works...
Force layout
Getting ready
How to do it...
How it works...
There's more...
Force Atlas 2
YiFan Hu layout
Getting ready
How to do it...
How it works...
There's more...
Put Your Best Foot Forward - Document and Present Your Analysis
Introduction
Generating reports of your data analysis with R Markdown and knitr
Getting ready
How to do it...
How it works...
There's more...
Using the render function
Adding output options
Creating interactive web applications with shiny
Getting ready
How to do it...
How it works...
There's more...
Adding images
Adding HTML
Adding tab sets
Adding a dynamic UI
Creating a single-file web application
Dynamic integration of Shiny with knitr
Creating PDF presentations of your analysis with R presentation
Getting ready
How to do it...
How it works...
There's more...
Using hyperlinks
Controlling the display
Enhancing the look of the presentation
Generating dynamic reports
Getting ready
How to do it...
How it works...
Work Smarter, Not Harder - Efficient and Elegant R Code
Introduction
Exploiting vectorized operations
Getting ready
How to do it...
How it works...
There's more...
Processing entire rows or columns using the apply function
Getting ready
How to do it...
How it works...
There's more...
Using apply on a three-dimensional array
Applying a function to all elements of a collection with lapply and sapply
Getting ready
How to do it...
How it works...
There's more...
Dynamic output
One caution
Applying functions to subsets of a vector
Getting ready
How to do it...
How it works...
There's more...
Applying a function on groups from a data frame
Using the split-apply-combine strategy with plyr
Getting ready
How to do it...
How it works...
There's more...
Adding a new column using transform or mutate
Using summarize along with the plyr function
Concatenating the list of data frames into a big data frame
Common grouping functions in plyr
Split-apply-combine with dplyr
Slicing, dicing, and combining data with data tables
Getting ready
How to do it...
How it works...
There's more...
Adding multiple aggregated columns
Counting groups
Deleting a column
Joining data tables
Using symbols
Where in the World? Geospatial Analysis
Introduction
Downloading and plotting a Google map of an area
Getting ready
How to do it...
How it works...
There's more...
Saving the downloaded map as an image file
Getting a satellite image
Overlaying data on the downloaded Google map
Getting ready
How to do it...
How it works...
There's more...
Importing ESRI shape files to R
Getting ready
How to do it...
How it works...
Using the sp package to plot geographic data
Getting ready
How to do it...
How it works...
Getting maps from the maps package
Getting ready
How to do it...
How it works...
Creating spatial data frames from regular data frames containing spatial and other data
Getting ready
How to do it...
How it works...
Creating spatial data frames by combining regular data frames with spatial objects
Getting ready
How to do it...
How it works...
Adding variables to an existing spatial data frame
Getting ready
How to do it...
How it works...
Spatial data analysis with R and QGIS
Getting ready
How to do it...
How it works...
Playing Nice - Connecting to Other Systems
Introduction
Using Java objects in R
Getting ready
How to do it...
How it works...
There's more...
Checking JVM properties
Displaying available methods
Using JRI to call R functions from Java
Getting ready
How to do it...
How it works...
There's more...
Using Rserve to call R functions from Java
Getting ready
How to do it...
How it works...
There's more...
Retrieving an array from R
Executing R scripts from Java
Getting ready
How to do it...
How it works...
Using the xlsx package to connect to Excel
Getting ready
How to do it...
How it works...
Reading data from relational databases - MySQL
Getting ready
How to do it...
Using RODBC
Using RMySQL
Using RJDBC
How it works...
Using RODBC
Using RMySQL
Using RJDBC
There's more...
Fetching all rows
When the SQL query is long
Reading data from NoSQL databases - MongoDB
Getting ready
How to do it...
How it works...
There's more...
Finding the most severe crime zone
Plotting the crimes on the Chicago map
Working with in-memory data processing with Apache Spark
Getting ready
How to do it...
How it works...
There's more...
Classification with SparkR
Movie lens recommendation system with SparkR
Data analytics with R has emerged as a very important topic for organizations of all kinds. R enables even those with only an intuitive grasp of the underlying concepts, without a deep mathematical background, to unleash powerful and detailed examinations of their data. This book empowers you by showing you ways to use R to generate professional analysis reports. The book also teaches you how to quickly adapt the example code for your own needs and save yourself the time needed to construct code from scratch.
Chapter 1, Acquire and Prepare the Ingredients – Reading Your Data, provides the recipes to acquire, format, and cleanse data from multiple formats. Handling missing values, standardizing datasets, and transforming between numerical and categorical data are also covered.
Chapter 2, What's in There? – Exploratory Data Analysis, shows you how to perform exploratory data analysis and find underlying patterns to understand your dataset before getting into the analysis process.
Chapter 3, Where does it belong? - Classification, covers several classification techniques, from basic classification trees, logistic regression, and support vector machines to text classification with Naive Bayes for sentiment analysis.
Chapter 4, Give me a number - Regression, covers several algorithms for data prediction, such as linear regression, random forests, neural networks, and regression trees.
Chapter 5, Can you simplify that? – Data Reduction Techniques, covers code recipes for data reduction and clustering. We explore the different clustering algorithms in a practical way.
Chapter 6, Lessons from history - Time Series Analysis, explores how to work with financial time series data, how to visualize it, and how to perform predictions using the ARIMA algorithms.
Chapter 7, How does it look? - Advanced data visualization, explores how to make attractive visualizations, 3D graphs, and advanced maps.
Chapter 8, This May Also Interest You – Building Recommendation Systems, guides you step by step through applying machine learning and data mining techniques and building and optimizing recommender models, followed by a practical fraud detection example.
Chapter 9, It's all about Connections – Social Network Analysis, explores how to acquire, visualize, and cluster social network data using public APIs.
Chapter 10, Put Your Best Foot Forward – Document and Present Your Analysis, shows you how to present and share the results of your data analysis. It includes recipes that use R Markdown, knitr, and Shiny to create reports and dynamic dashboards.
Chapter 11, Work Smarter, Not Harder – Efficient and Elegant R Code, covers recipes to handle large datasets using the apply family of functions, the plyr package, and data tables to slice and dice data.
Chapter 12, Where in the World? – Geospatial Analysis, teaches you how to perform geospatial data analysis with tools such as Google Maps and QGIS from within R. It covers how to import maps and overlay your own data on them.
Chapter 13, Playing Nice – Working with External Data Sources, shows you how to work with external data sources such as Excel, MySQL, and MongoDB, and how to perform large-scale, in-memory data processing using Apache Spark.
To prepare your system environment to run the code examples in this book, the following software is required:
R base
MS Office
Apache Java (JDK)
MySQL
MongoDB
Although all the code should run on the R command line, I suggest you run the code in this book from RStudio rather than the R command line, as it is an easy-to-use IDE. Also, a few chapters specifically require RStudio.
This book is for data science professionals or analysts who perform data analysis tasks and want a quick, handy reference that addresses the pain points that crop up while implementing them in R. Those who wish to have an edge over other data analysis professionals will find this book quite useful.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive."
A block of code is set as follows:
[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)
Any command-line input or output is written as follows:
# cp /usr/src/asterisk-addons/configs/cdr_mysql.conf.sample /etc/asterisk/cdr_mysql.conf
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking the Next button moves you to the next screen."
Feedback from our readers is always welcome. Let us know what you think about this book--what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/R-Data-Analysis-Cookbook-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/RDataAnalysisCookbookSecondEdition_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books--maybe a mistake in the text or the code--we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
In this chapter, we will cover:
Working with data
Reading data from CSV files
Reading XML data
Reading JSON data
Reading data from fixed-width formatted files
Reading data from R files and R libraries
Removing cases with missing values
Replacing missing values with the mean
Removing duplicate cases
Rescaling a variable to specified min-max range
Normalizing or standardizing data in a data frame
Binning numerical data
Creating dummies for categorical variables
Handling missing data
Correcting data
Imputing data
Detecting outliers
Data is everywhere, and the amount of digital data in existence is growing rapidly; it is projected to reach 180 zettabytes by 2025. Data science is a field that tries to extract insights and meaningful information from structured and unstructured data through various stages, such as asking questions, getting the data, exploring the data, modeling the data, and communicating the results.
Data scientists or analysts often need to load or collect data from various sources, in different input formats, into R. Although R has its own native data format, data usually exists in text formats such as Comma Separated Values (CSV), JavaScript Object Notation (JSON), and Extensible Markup Language (XML). This chapter provides recipes to load such data into your R system for processing.
Raw, real-world datasets are often messy, with missing values, unusable formats, and outliers. Very rarely can we start analyzing data immediately after loading it. Often, we need to preprocess the data to clean, impute, wrangle, and transform it before embarking on analysis. This chapter provides recipes for common cleaning, missing value imputation, outlier detection, and preprocessing steps.
In the wild, datasets come in many different formats, but each computer program expects your data to be organized in a well-defined structure.
As a result, every data science project begins with the same tasks: gather the data, view the data, clean the data, correct or change its layout to make it tidy, handle missing values and outliers, model the data, and evaluate the results.
With R, you can do everything, from collecting your data (from the web or a database) to cleaning it, transforming it, visualizing it, modeling it, and running statistical tests on it.
CSV formats are best used to represent sets or sequences of records in which each record has an identical list of fields. This corresponds to a single relation in a relational database, or to data (though not calculations) in a typical spreadsheet.
If you have not already downloaded the files for this chapter, do it now and ensure that the auto-mpg.csv file is in your R working directory.
The read.csv() function creates a data frame from the data in the .csv file. If we pass header=TRUE, then the function uses the very first row to name the variables in the resulting data frame:
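> # A sketch of the recipe's loading step, assumed from the surrounding text:
> auto <- read.csv("auto-mpg.csv", header=TRUE)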
> names(auto)
[1] "No"           "mpg"          "cylinders"
[4] "displacement" "horsepower"   "weight"
[7] "acceleration" "model_year"   "car_name"
The header and sep parameters allow us to specify whether the .csv file has headers and the character used in the file to separate fields. The header=TRUE and sep="," parameters are the defaults for the read.csv() function; we can omit these in the code example.
The read.csv() function is a specialized form of read.table(). The latter uses whitespace as the default field separator. We will discuss a few important optional arguments to these functions.
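As an illustration, a whitespace-delimited copy of the same data (a hypothetical auto-mpg.txt file) could be read with the read.table() defaults:

> auto <- read.table("auto-mpg.txt", header=TRUE)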
In regions where a comma is used as the decimal separator, the .csv files use ";" as the field delimiter. While dealing with such data files, use read.csv2() to load data into R.
Alternatively, you can use the read.csv("<file name>", sep=";", dec=",") command.
Use sep="t" for tab-delimited files.
By default, R treats strings as factors (categorical variables). In some situations, you may want to leave them as character strings. Use stringsAsFactors=FALSE to achieve this:
> auto <- read.csv("auto-mpg.csv",stringsAsFactors=FALSE)
However, to selectively treat variables as characters, you can load the file with the defaults (that is, read all strings as factors) and then use as.character() to convert the requisite factor variables to characters.
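A minimal sketch of this selective conversion, assuming we want the car_name column of the auto data to remain a character string:

> auto <- read.csv("auto-mpg.csv")
> auto$car_name <- as.character(auto$car_name)
> str(auto$car_name)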
If the data file is available on the web, you can load it directly into R, instead of downloading and saving it locally before loading it into R:
> dat <- read.csv("http://www.exploredata.net/ftp/WHO.csv")
You may sometimes need to extract data from websites. Many providers also supply data in XML and JSON formats. In this recipe, we learn about reading XML data.
Make sure you have downloaded the files for this chapter and that the cd_catalog.xml and WorldPopulation-wiki.htm files are in your R working directory. If the XML package is not already installed in your R environment, install the package now, as follows:
> install.packages("XML")
XML data can be read by following these steps:
Load the library and initialize:
> library(XML)
> url <- "cd_catalog.xml"
Parse the XML file and get the root node:
> xmldoc <- xmlParse(url)
> rootNode <- xmlRoot(xmldoc)
> rootNode[1]
Extract the XML data:
> data <- xmlSApply(rootNode,function(x) xmlSApply(x, xmlValue))
Convert the extracted data into a data frame:
> cd.catalog <- data.frame(t(data),row.names=NULL)
Verify the results:
> cd.catalog[1:2,]
The xmlParse function returns an object of the XMLInternalDocument class, which is a C-level internal data structure.
The xmlRoot() function gets access to the root node and its elements. Let us check the first element of the root node:
> rootNode[1]
$CD
<CD>
  <TITLE>Empire Burlesque</TITLE>
  <ARTIST>Bob Dylan</ARTIST>
  <COUNTRY>USA</COUNTRY>
  <COMPANY>Columbia</COMPANY>
  <PRICE>10.90</PRICE>
  <YEAR>1985</YEAR>
</CD>

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
To extract data from the root node, we use the xmlSApply() function iteratively over all the children of the root node. The xmlSApply function returns a matrix.
To convert the preceding matrix into a data frame, we transpose the matrix using the t() function and then extract the first two rows from the cd.catalog data frame:
> cd.catalog[1:2,]
             TITLE       ARTIST COUNTRY     COMPANY PRICE YEAR
1 Empire Burlesque    Bob Dylan     USA    Columbia 10.90 1985
2  Hide your heart Bonnie Tyler      UK CBS Records  9.90 1988
XML data can be deeply nested and hence can become complex to extract. Knowledge of XPath is helpful to access specific XML tags. R provides several functions, such as xpathSApply and getNodeSet, to locate specific elements.
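For instance, a hedged sketch that pulls only the title values out of the CD catalog with xpathSApply (the //CD/TITLE path is inferred from the document structure shown earlier):

> titles <- xpathSApply(xmldoc, "//CD/TITLE", xmlValue)
> titles[1:2]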
Though it is possible to treat HTML data as a specialized form of XML, R provides specific functions to extract data from HTML tables, as follows:
> url <- "WorldPopulation-wiki.htm" > tables <- readHTMLTable(url) > world.pop <- tables[[6]]
The readHTMLTable() function parses the web page and returns a list of all the tables that are found on the page. For tables that have an id attribute, the function uses the id attribute as the name of that list element.
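A quick sketch to inspect what was found before picking a table:

> length(tables)
> names(tables)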
We are interested in extracting the "10 most populous countries" table; on this page it is the sixth entry in the returned list, so we use tables[[6]].
A single table can be extracted using the following command:
> table <- readHTMLTable(url,which=5)
Specify which to get data from a specific table. R returns a data frame.
Several RESTful web services return data in JSON format, in some ways simpler and more efficient than XML. This recipe shows you how to read JSON data.
R provides several packages to read JSON data, but we will use the jsonlite package. Install the package in your R environment, as follows:
> install.packages("jsonlite")
If you have not already downloaded the files for this chapter, do it now and ensure that the students.json and student-courses.json files are in your R working directory.
Once the files are ready, load the jsonlite package and read the files as follows:
Load the library:
> library(jsonlite)
Load the JSON data from the files:
> dat.1 <- fromJSON("students.json")
> dat.2 <- fromJSON("student-courses.json")
Load the JSON document from the web:
> url <- "http://finance.yahoo.com/webservice/v1/symbols/allcurrencies/quote?format=json" > jsonDoc <- fromJSON(url)
Extract the data into a data frame:
> dat <- jsonDoc$list$resources$resource$fields
Verify the results:
> dat[1:2,]
> dat.1[1:3,]
> dat.2[,c(1,2,4:5)]
The jsonlite package provides two key functions: fromJSON and toJSON.
The fromJSON function can load data either directly from a file or from a web page, as the preceding steps 2 and 3 show. If you get errors in downloading content directly from the web, install and load the httr package.
Depending on the structure of the JSON document, loading the data can vary in complexity.
If given a URL, the fromJSON function returns a list object. In step 4 of the preceding recipe, we saw how to extract the enclosed data frame from this list.
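Going the other way, the toJSON function mentioned earlier serializes an R object to JSON text; a minimal illustration with made-up data:

> toJSON(data.frame(id=c(1,2), name=c("Amy","Bob")))
[{"id":1,"name":"Amy"},{"id":2,"name":"Bob"}]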
In fixed-width formatted files, columns have fixed widths; if a data element does not use up the entire allotted column width, then the element is padded with spaces to make up the specified width. To read fixed-width text files, specify the columns either by column widths or by starting positions.
Download the files for this chapter and store the student-fwf.txt file in your R working directory.
Read the fixed-width formatted file as follows:
> student <- read.fwf("student-fwf.txt", widths=c(4,15,20,15,4), col.names=c("id","name","email","major","year"))
In the student-fwf.txt file, the first column occupies 4 character positions, the second 15, and so on. The c(4,15,20,15,4) expression specifies the widths of the 5 columns in the data file.
We can use the optional col.names argument to supply our own variable names.
The read.fwf() function has several optional arguments that come in handy. We discuss a few of these, as follows:
Files with headers use the following command:
> student <- read.fwf("student-fwf-header.txt", widths=c(4,15,20,15,4), header=TRUE, sep="t",skip=2)
If header=TRUE, the first row of the file is interpreted as having the column headers. Column headers, if present, need to be separated by the specified sep argument. The sep argument only applies to the header row.
The skip argument denotes the number of lines to skip; in this recipe, the first two lines are skipped.
To exclude a column, make the column width negative. Thus, to exclude the email column, we will specify its width as -20 and also remove the column name from the col.names vector, as follows:
> student <- read.fwf("student-fwf.txt",widths=c(4,15,-20,15,4), col.names=c("id","name","major","year"))
During data analysis, you will create several R objects. You can save these in the native R data format and retrieve them later as needed.
First, create and save the R objects interactively, as shown in the following code. Make sure you have write access to the R working directory.
> customer <- c("John", "Peter", "Jane")
> orderdate <- as.Date(c('2014-10-1','2014-1-2','2014-7-6'))
> orderamount <- c(280, 100.50, 40.25)
> order <- data.frame(customer,orderdate,orderamount)
> names <- c("John", "Joan")
> save(order, names, file="test.Rdata")
> saveRDS(order,file="order.rds")
> remove(order)
After the objects are saved, the final remove() call deletes the order object from the current session; the saved copies on disk remain intact.
To be able to read data from R files and libraries, follow these steps:
Load data from the R data files into memory:
> load("test.Rdata") > ord <- readRDS("order.rds")
The datasets package is loaded in the R environment by default and contains the iris and cars datasets. To load these datasets into memory, use the following code:
> data(iris)
> data(list = c("cars", "iris"))
The first command loads only the iris dataset, and the second loads both the cars and iris datasets.
The save() function saves the serialized version of the objects supplied as arguments along with the object name. The subsequent load() function restores the saved objects, with the same object names that they were saved with, to the global environment by default. If there are existing objects with the same names in that environment, they will be replaced without any warnings.
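A short sketch of this silent replacement, using the objects created earlier in this recipe:

> names <- c("Changed")
> load("test.Rdata")
> names
[1] "John" "Joan"

The current value of names is replaced by the saved one without any notice.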
The saveRDS() function saves only one object. It saves the serialized version of the object and not the object name. Hence, with the readRDS() function, the saved object can be restored into a variable with a different name from when it was saved.
The preceding recipe has shown you how to read saved R objects. We see more options in this section.
To save objects selectively, use the following commands:
> odd <- c(1,3,5,7)
> even <- c(2,4,6,8)
> save(list=c("odd","even"),file="OddEven.Rdata")
The list argument specifies a character vector containing the names of the objects to be saved. Subsequently, loading data from the OddEven.Rdata file creates both odd and even objects. The saveRDS() function can save only one object at a time.
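To confirm, restoring the file brings back both vectors:

> load("OddEven.Rdata")
> odd
[1] 1 3 5 7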
While loading .Rdata files, if we want to be notified when objects with the same names already exist in the environment, we can use:
> attach("order.Rdata")
The order.Rdata file contains an object named order. If an object named order already exists in the environment, we get the following message:
The following object is masked _by_ .GlobalEnv: order
All the datasets in the currently loaded packages can be listed using the following command:
> data()
Datasets come with varying amounts of missing data. When we have abundant data, we sometimes (not always) want to eliminate the cases that have missing values for one or more variables. This recipe applies when we want to eliminate cases that have any missing values, as well as when we want to selectively eliminate cases that have missing values for a specific variable alone.
Download the missing-data.csv file from the code files for this chapter to your R working directory. Read the data from the missing-data.csv file, while taking care to identify the string used in the input file for missing values. In our file, missing values are shown with empty strings:
> dat <- read.csv("missing-data.csv", na.strings="")
To get a data frame that has only the cases with no missing values for any variable, use the na.omit() function:
> dat.cleaned <- na.omit(dat)
Now dat.cleaned contains only those cases from dat that have no missing values in any of the variables.
The na.omit() function internally uses the is.na() function, which allows us to check whether a value is NA. When applied to a single value, it returns a Boolean value. When applied to a collection, it returns a vector:
> is.na(dat[4,2])
[1] TRUE
> is.na(dat$Income)
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[10] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
You will sometimes need to do more than just eliminate the cases with any missing values. We discuss some options in this section.
We might sometimes want to selectively eliminate cases that have NA only for a specific variable. The example data frame has two missing values for Income. To get a data frame with only these two cases removed, use:
> dat.income.cleaned <- dat[!is.na(dat$Income),]
> nrow(dat.income.cleaned)
[1] 25
The complete.cases() function takes a data frame or table as its argument and returns a Boolean vector with TRUE for rows that have no missing values, and FALSE otherwise:
> complete.cases(dat)
 [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
[10]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
[19]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
Rows 4, 6, 13, and 17 have at least one missing value. Instead of using the na.omit() function, we can do the following as well:
> dat.cleaned <- dat[complete.cases(dat),]
> nrow(dat.cleaned)
[1] 23
