R Data Analysis Cookbook - Second Edition

Kuntal Ganguly

Description

Over 80 recipes to help you breeze through your data analysis projects using R

About This Book

  • Analyse your data using popular R packages such as ggplot2, with ready-to-use and customizable recipes
  • Find meaningful insights from your data and generate dynamic reports
  • A practical guide to help you put your R data analysis skills to use

Who This Book Is For

This book is for data scientists, analysts, and enthusiasts who want to learn and implement various data analysis techniques using R in a practical way. Those looking for quick, handy solutions to common tasks and challenges in data analysis will find this book very useful. Basic knowledge of statistics and R programming is assumed.

What You Will Learn

  • Acquire, format, and visualize your data using R
  • Perform exploratory data analysis using R
  • Get an introduction to machine learning algorithms such as classification and regression
  • Get started with social network analysis
  • Generate dynamic reports with Shiny
  • Get started with geospatial analysis
  • Handle large data with R using Spark and MongoDB
  • Build recommendation systems: collaborative filtering, content-based, and hybrid
  • Work through real-world dataset examples: fraud detection and image recognition

In Detail

Data analytics with R has emerged as a very important focus for organizations of all kinds. R enables even those with only an intuitive grasp of the underlying concepts, without a deep mathematical background, to unleash powerful and detailed examinations of their data.

This book will show you how you can put your data analysis skills in R to practical use, with recipes catering to basic as well as advanced data analysis tasks. From acquiring your data and preparing it for analysis to the more complex data analysis techniques, the book will show you how to implement each technique in the best possible manner. You will also visualize your data using popular R packages such as ggplot2 and gain hidden insights from it. Starting with basic data analysis concepts, such as handling your data and creating basic plots, you will master more advanced techniques, such as performing cluster analysis and generating effective analysis reports and visualizations. Throughout the book, you will get to know the common problems and obstacles you might encounter while implementing each of the data analysis techniques in R, along with ways to overcome them in the easiest possible way.

By the end of this book, you will have all the knowledge you need to become an expert in data analysis with R, and put your skills to the test in real-world scenarios.

Style and Approach

  • Hands-on recipes to walk through data science challenges using R
  • Your one-stop solution to the common and not-so-common pain points you will encounter while solving real-world data analysis problems
  • Addressing those pain points at every step, this is a book that you must have on your shelf




R Data Analysis Cookbook

Second Edition

A journey from data computation to data-driven insights

Kuntal Ganguly

BIRMINGHAM - MUMBAI

R Data Analysis Cookbook

Second Edition

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

 

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

 

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: May 2015

Second Edition: September 2017

Production reference: 1150917

 

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78712-447-9

 

www.packtpub.com

Credits

Author

Kuntal Ganguly

 

Copy Editor

Manisha Sinha

Reviewers

Davor Lozić

Daniel Alvarez Rojas

 

Project Coordinator

Manthan Patel

Commissioning Editor

Amey Varangaonkar

Proofreader

Safis Editing

Acquisition Editor

Tushar Gupta

Indexer

Tejal Daruwale Soni

Content Development Editor

Tejas Limkar

Graphics

Tania Dutta

Technical Editor

Sagar Sawant

Production Coordinator

Deepika Naik

About the Author

Kuntal Ganguly is a big data analytics engineer focused on building large-scale data-driven systems using big data frameworks and machine learning. He has around seven years of experience in building several big data and machine learning applications.

Kuntal provides solutions to AWS customers in building real-time analytics systems using AWS services and open source Hadoop ecosystem technologies such as Spark, Kafka, Storm, and Flink, along with machine learning and deep learning frameworks.

Kuntal enjoys hands-on software development, and has single-handedly conceived, architected, developed, and deployed several large-scale distributed applications. Besides being an open source contributor, he is a machine learning and deep learning practitioner and is very passionate about building intelligent applications.

I am grateful to my mother, Chitra Ganguly, and father, Gopal Ganguly, for their love and support and for teaching me much about hard work, and even the little I have absorbed has helped me immensely throughout my life. I would also like to thank all my friends, colleagues, and mentors that I've had over the years.

 

You can reach Kuntal on LinkedIn at https://in.linkedin.com/in/kuntal-ganguly-59564088

I believe that data science and artificial intelligence will give us superpowers.

About the Reviewers

Davor Lozić is a senior software engineer interested in various subjects, especially computer security, algorithms, and data structures. He manages teams of 15+ engineers and is a part-time assistant professor who lectures on database systems, Java, and interoperability. You can visit his website at http://warriorkitty.com and contact him there. He likes cats! If you want to talk about any aspect of technology, or if you have funny pictures of cats, feel free to contact him.

 

 

 

Daniel Alvarez Rojas is currently a data scientist at Hova Health, an IT/consulting company in the health sector. With experience in statistics, marketing, and BI, Daniel holds a BA in Business and Marketing and works in government consulting, helping health managers and directors make data-driven decisions to solve industry challenges. He has spent years as an analyst in logistics companies, working on optimization and predictive models.

 

 

I extend my deepest gratitude to my family: my parents, Daniel and Sara, for always supporting me, and my brothers, Abdiel and Amy, for urging me to be the best example that I can be for them. To Alessandra, for all the love and wisdom. To Hector, for being, more than a friend, a mentor. To Adrian, for opening the doors to a new stage.

 

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

  

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review. If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Acquire and Prepare the Ingredients - Your Data

Introduction

Working with data

Reading data from CSV files

Getting ready

How to do it...

How it works...

There's more...

Handling different column delimiters

Handling column headers/variable names

Handling missing values

Reading strings as characters and not as factors

Reading data directly from a website

Reading XML data

Getting ready

How to do it...

How it works...

There's more...

Extracting HTML table data from a web page

Extracting a single HTML table from a web page

Reading JSON data

Getting ready

How to do it...

How it works...

Reading data from fixed-width formatted files

Getting ready

How to do it...

How it works...

There's more...

Files with headers

Excluding columns from data

Reading data from R files and R libraries

Getting ready

How to do it...

How it works...

There's more...

Saving all objects in a session

Saving objects selectively in a session

Attaching/detaching R data files to an environment

Listing all datasets in loaded packages

Removing cases with missing values

Getting ready

How to do it...

How it works...

There's more...

Eliminating cases with NA for selected variables

Finding cases that have no missing values

Converting specific values to NA

Excluding NA values from computations

Replacing missing values with the mean

Getting ready

How to do it...

How it works...

There's more...

Imputing random values sampled from non-missing values

Removing duplicate cases

Getting ready

How to do it...

How it works...

There's more...

Identifying duplicates without deleting them

Rescaling a variable to a specified min-max range

Getting ready

How to do it...

How it works...

There's more...

Rescaling many variables at once

See also

Normalizing or standardizing data in a data frame

Getting ready

How to do it...

How it works...

There's more...

Standardizing several variables simultaneously

See also

Binning numerical data

Getting ready

How to do it...

How it works...

There's more...

Creating a specified number of intervals automatically

Creating dummies for categorical variables

Getting ready

How to do it...

How it works...

There's more...

Choosing which variables to create dummies for

Handling missing data

Getting ready

How to do it...

How it works...

There's more...

Understanding missing data patterns

Correcting data

Getting ready

How to do it...

How it works...

There's more...

Combining multiple columns into a single column

Splitting a single column into multiple columns

Imputing data

Getting ready

How to do it...

How it works...

There's more...

Detecting outliers

Getting ready

How to do it...

How it works...

There's more...

Treating the outliers with mean/median imputation

Handling extreme values with capping

Transforming and binning values

Outlier detection with LOF

What's in There - Exploratory Data Analysis

Introduction

Creating standard data summaries

Getting ready

How to do it...

How it works...

There's more...

Using the str() function for an overview of a data frame

Computing the summary and the str() function for a single variable

Finding other measures

Extracting a subset of a dataset

Getting ready

How to do it...

How it works...

There's more...

Excluding columns

Selecting based on multiple values

Selecting using logical vector

Splitting a dataset

Getting ready

How to do it...

How it works...

Creating random data partitions

Getting ready

How to do it...

Case 1 - Numerical target variable and two partitions

Case 2 - Numerical target variable and three partitions

Case 3 - Categorical target variable and two partitions

Case 4 - Categorical target variable and three partitions

How it works...

There's more...

Using a convenience function for partitioning

Sampling from a set of values

Generating standard plots, such as histograms, boxplots, and scatterplots

Getting ready

How to do it...

Creating histograms

Creating boxplots

Creating scatterplots

Creating scatterplot matrices

How it works...

Histograms

Boxplots

There's more...

Overlay a density plot on a histogram

Overlay a regression line on a scatterplot

Color specific points on a scatterplot

Generating multiple plots on a grid

Getting ready

How to do it...

How it works...

Graphics parameters

Creating plots with the lattice package

Getting ready

How to do it...

How it works...

There's more...

Adding flair to your graphs

See also

Creating charts that facilitate comparisons

Getting ready

How to do it...

Using base plotting system

How it works...

There's more...

Creating beanplots with the beanplot package

See also

Creating charts that help to visualize possible causality

Getting ready

How to do it...

How it works...

See also

Where Does It Belong? Classification

Introduction

Generating error/classification confusion matrices

Getting ready

How to do it...

How it works...

There's more...

Visualizing the error/classification confusion matrix

Comparing the model's performance for different classes

Principal Component Analysis

Getting ready

How to do it...

How it works...

Generating receiver operating characteristic charts

Getting ready

How to do it...

How it works...

There's more...

Using arbitrary class labels

Building, plotting, and evaluating with classification trees

Getting ready

How to do it...

How it works...

There's more...

Computing raw probabilities

Creating the ROC chart

See also

Using random forest models for classification

Getting ready

How to do it...

How it works...

There's more...

Computing raw probabilities

Generating the ROC chart

Specifying cutoffs for classification

See also

Classifying using the support vector machine approach

Getting ready

How to do it...

How it works...

There's more...

Controlling the scaling of variables

Determining the type of SVM model

Assigning weights to the classes

Choosing the cost of SVM

Tuning the SVM

See also

Classifying using the Naive Bayes approach

Getting ready

How to do it...

How it works...

See also

Classifying using the KNN approach

Getting ready

How to do it...

How it works...

There's more...

Automating the process of running KNN for many k values

Selecting appropriate values of k using caret

Using KNN to compute raw probabilities instead of classifications

Using neural networks for classification

Getting ready

How to do it...

How it works...

There's more...

Exercising greater control over nnet

Generating raw probabilities and plotting the ROC curve

Classifying using linear discriminant function analysis

Getting ready

How to do it...

How it works...

There's more...

Using the formula interface for lda

See also

Classifying using logistic regression

Getting ready

How to do it...

How it works...

Text classification for sentiment analysis

Getting ready

How to do it...

How it works...

Give Me a Number - Regression

Introduction

Computing the root-mean-square error

Getting ready

How to do it...

How it works...

There's more...

Using a convenience function to compute the RMS error

Building KNN models for regression

Getting ready

How to do it...

How it works...

There's more...

Running KNN with cross-validation in place of a validation partition

Using a convenience function to run KNN

Using a convenience function to run KNN for multiple k values

See also

Performing linear regression

Getting ready

How to do it...

How it works...

There's more...

Forcing lm to use a specific factor level as the reference

Using other options in the formula expression for linear models

See also

Performing variable selection in linear regression

Getting ready

How to do it...

How it works...

See also

Building regression trees

Getting ready

How to do it...

How it works...

There's more...

Generating regression trees for data with categorical predictors

Generating regression trees using the ensemble method - Bagging and Boosting

See also

Building random forest models for regression

Getting ready

How to do it...

How it works...

There's more...

Controlling forest generation

See also

Using neural networks for regression

Getting ready

How to do it...

How it works...

See also

Performing k-fold cross-validation

Getting ready

How to do it...

How it works...

See also

Performing leave-one-out cross-validation to limit overfitting

How to do it...

How it works...

See also

Can you Simplify That? Data Reduction Techniques

Introduction

Performing cluster analysis using hierarchical clustering

Getting ready

How to do it...

How it works...

There's more...

Cutting trees into clusters

Getting ready

How to do it...

How it works...

Performing cluster analysis using partitioning clustering

Getting ready

How to do it...

How it works...

There's more...

Image segmentation using mini-batch K-means

Getting ready

How to do it...

Partitioning around medoids

Getting ready

How to do it...

How it works...

Clustering large applications

Getting ready

How to do it...

How it works...

Performing cluster validation

Getting ready

How to do it...

How it works...

Performing advanced clustering

Density-based spatial clustering of applications with noise

Getting ready

How to do it...

How it works...

Model-based clustering with the EM algorithm

Getting ready

How to do it...

How it works...

Reducing dimensionality with principal component analysis

Getting ready

How to do it...

How it works...

Lessons from History - Time Series Analysis

Introduction

Exploring finance datasets

Getting ready

How to do it...

How it works...

There's more...

Creating and examining date objects

Getting ready

How to do it...

How it works...

Operating on date objects

Getting ready

How to do it...

How it works...

See also

Performing preliminary analyses on time series data

Getting ready

How to do it...

How it works...

See also

Using time series objects

Getting ready

How to do it...

How it works...

See also

Decomposing time series

Getting ready

How to do it...

How it works...

See also

Filtering time series data

Getting ready

How to do it...

How it works...

See also

Smoothing and forecasting using the Holt-Winters method

Getting ready

How to do it...

How it works...

See also

Building an automated ARIMA model

Getting ready

How to do it...

How it works...

See also

How does it look? - Advanced data visualization

Introduction

Creating scatter plots

Getting ready

How to do it...

How it works...

There's more...

Graph using qplot

Creating line graphs

Getting ready

How to do it...

How it works...

Creating bar graphs

Getting ready

How to do it...

Creating bar charts with ggplot2

How it works...

Making distribution plots

Getting ready

How to do it...

How it works...

Creating mosaic graphs

Getting ready

How to do it...

How it works...

Making treemaps

Getting ready

How to do it...

How it works...

Plotting a correlation matrix

Getting ready

How to do it...

How it works...

There's more...

Visualizing a correlation matrix with ggplot2

Creating heatmaps

Getting ready

How to do it...

How it works...

There's more...

Plotting a heatmap over geospatial data

See also

Plotting network graphs

Getting ready

How to do it...

How it works...

See also

Labeling and legends

Getting ready

How to do it...

How it works...

Coloring and themes

Getting ready

How to do it...

How it works...

Creating multivariate plots

Getting ready

How to do it...

How it works...

There's more...

Multivariate plots with the GGally package

Creating 3D graphs and animation

Getting ready

How to do it...

How it works...

There's more...

Adding text to an existing 3D plot

Using a 3D histogram

Using a line graph

Selecting a graphics device

Getting ready

How to do it...

How it works...

This may also interest you - Building Recommendations

Introduction

Building collaborative filtering systems

Getting ready

How to do it...

How it works...

There's more...

Using collaborative filtering on binary data

Performing content-based systems

Getting ready

How to do it...

How it works...

Building hybrid systems

Getting ready

How to do it...

How it works...

Performing similarity measures

Getting ready

How to do it...

How it works...

Application of ML algorithms - image recognition system

Getting ready

How to do it...

How it works...

Evaluating models and optimization

Getting ready

How to do it...

How it works...

There's more...

Identifying a suitable model

Optimizing parameters

A practical example - fraud detection system

Getting ready

How to do it...

How it works...

It's All About Your Connections - Social Network Analysis

Introduction

Downloading social network data using public APIs

Getting ready

How to do it...

How it works...

See also

Creating adjacency matrices and edge lists

Getting ready

How to do it...

How it works...

See also

Plotting social network data

Getting ready

How to do it...

How it works...

There's more...

Specifying plotting preferences

Plotting directed graphs

Creating a graph object with weights

Extracting the network as an adjacency matrix from the graph object

Extracting an adjacency matrix with weights

Extracting an edge list from a graph object

Creating a bipartite network graph

Generating projections of a bipartite network

Computing important network metrics

Getting ready

How to do it...

How it works...

There's more...

Getting edge sequences

Getting immediate and distant neighbors

Adding vertices or nodes

Adding edges

Deleting isolates from a graph

Creating subgraphs

Cluster analysis

Getting ready

How to do it...

How it works...

Force layout

Getting ready

How to do it...

How it works...

There's more...

Force Atlas 2

YiFan Hu layout

Getting ready

How to do it...

How it works...

There's more...

Put Your Best Foot Forward - Document and Present Your Analysis

Introduction

Generating reports of your data analysis with R Markdown and knitr

Getting ready

How to do it...

How it works...

There's more...

Using the render function

Adding output options

Creating interactive web applications with shiny

Getting ready

How to do it...

How it works...

There's more...

Adding images

Adding HTML

Adding tab sets

Adding a dynamic UI

Creating a single-file web application

Dynamic integration of Shiny with knitr

Creating PDF presentations of your analysis with R presentation

Getting ready

How to do it...

How it works...

There's more...

Using hyperlinks

Controlling the display

Enhancing the look of the presentation

Generating dynamic reports

Getting ready

How to do it...

How it works...

Work Smarter, Not Harder - Efficient and Elegant R Code

Introduction

Exploiting vectorized operations

Getting ready

How to do it...

How it works...

There's more...

Processing entire rows or columns using the apply function

Getting ready

How to do it...

How it works...

There's more...

Using apply on a three-dimensional array

Applying a function to all elements of a collection with lapply and sapply

Getting ready

How to do it...

How it works...

There's more...

Dynamic output

One caution

Applying functions to subsets of a vector

Getting ready

How to do it...

How it works...

There's more...

Applying a function on groups from a data frame

Using the split-apply-combine strategy with plyr

Getting ready

How to do it...

How it works...

There's more...

Adding a new column using transform or mutate

Using summarize along with the plyr function

Concatenating the list of data frames into a big data frame

Common grouping functions in plyr

Split-apply-combine with dplyr

Slicing, dicing, and combining data with data tables

Getting ready

How to do it...

How it works...

There's more...

Adding multiple aggregated columns

Counting groups

Deleting a column

Joining data tables

Using symbols

Where in the World? Geospatial Analysis

Introduction

Downloading and plotting a Google map of an area

Getting ready

How to do it...

How it works...

There's more...

Saving the downloaded map as an image file

Getting a satellite image

Overlaying data on the downloaded Google map

Getting ready

How to do it...

How it works...

There's more...

Importing ESRI shape files to R

Getting ready

How to do it...

How it works...

Using the sp package to plot geographic data

Getting ready

How to do it...

How it works...

Getting maps from the maps package

Getting ready

How to do it...

How it works...

Creating spatial data frames from regular data frames containing spatial and other data

Getting ready

How to do it...

How it works...

Creating spatial data frames by combining regular data frames with spatial objects

Getting ready

How to do it...

How it works...

Adding variables to an existing spatial data frame

Getting ready

How to do it...

How it works...

Spatial data analysis with R and QGIS

Getting ready

How to do it...

How it works...

Playing Nice - Connecting to Other Systems

Introduction

Using Java objects in R

Getting ready

How to do it...

How it works...

There's more...

Checking JVM properties

Displaying available methods

Using JRI to call R functions from Java

Getting ready

How to do it...

How it works...

There's more...

Using Rserve to call R functions from Java

Getting ready

How to do it...

How it works...

There's more...

Retrieving an array from R

Executing R scripts from Java

Getting ready

How to do it...

How it works...

Using the xlsx package to connect to Excel

Getting ready

How to do it...

How it works...

Reading data from relational databases - MySQL

Getting ready

How to do it...

Using RODBC

Using RMySQL

Using RJDBC

How it works...

Using RODBC

Using RMySQL

Using RJDBC

There's more...

Fetching all rows

When the SQL query is long

Reading data from NoSQL databases - MongoDB

Getting ready

How to do it...

How it works...

There's more...

Finding the most severe crime zone

Plotting the crimes on the Chicago map

Working with in-memory data processing with Apache Spark

Getting ready

How to do it...

How it works...

There's more...

Classification with SparkR

MovieLens recommendation system with SparkR

Preface

Data analytics with R has emerged as a very important topic for organizations of all kinds. R enables even those with only an intuitive grasp of the underlying concepts, without a deep mathematical background, to unleash powerful and detailed examinations of their data. This book empowers you by showing you ways to use R to generate professional analysis reports. The book also teaches you how to quickly adapt the example code for your own needs and save yourself the time needed to construct code from scratch.

What this book covers

Chapter 1, Acquire and Prepare the Ingredients - Your Data, provides recipes to acquire, format, and cleanse data from multiple formats. Handling missing values, standardizing datasets, and transforming between numerical and categorical data are also covered.

Chapter 2, What's in There? - Exploratory Data Analysis, shows you how to perform exploratory data analysis and find underlying patterns in order to understand your dataset before getting into the analysis process.

Chapter 3, Where Does It Belong? - Classification, covers several classification techniques, from basic classification trees, logistic regression, and support vector machines to text classification with Naive Bayes for sentiment analysis.

Chapter 4, Give Me a Number - Regression, covers several algorithms for data prediction, such as linear regression, random forests, neural networks, and regression trees.

Chapter 5, Can You Simplify That? - Data Reduction Techniques, covers code recipes for data reduction and clustering. We explore the different clustering algorithms in a practical way.

Chapter 6, Lessons from History - Time Series Analysis, explores how to work with financial time series data, how to visualize it, and how to perform predictions using ARIMA models.

Chapter 7, How Does It Look? - Advanced Data Visualization, explores how to make attractive visualizations, 3D graphs, and advanced maps.

Chapter 8, This May Also Interest You - Building Recommendations, guides you step by step through applying machine learning and data mining techniques to build and optimize recommender models, followed by a practical fraud detection example.

Chapter 9, It's All About Your Connections - Social Network Analysis, explores how to acquire, visualize, and cluster social network data using public APIs.

Chapter 10, Put Your Best Foot Forward - Document and Present Your Analysis, shows you how to present and share the results of your data analysis. It includes recipes that use R Markdown, knitr, and Shiny to create reports and dynamic dashboards.

Chapter 11, Work Smarter, Not Harder - Efficient and Elegant R Code, covers recipes to handle large datasets using the apply family of functions, the plyr package, and data tables to slice and dice data.

Chapter 12, Where in the World? - Geospatial Analysis, teaches you how to perform geospatial data analysis with tools such as Google Maps and QGIS from R. It covers how to import maps and overlay your own data on them.

Chapter 13, Playing Nice - Connecting to Other Systems, shows you how to work with external data sources such as Excel, MySQL, and MongoDB, and how to perform large-scale, in-memory data processing using Apache Spark.

What you need for this book

To set up a system environment capable of running the code examples in this book, the following software is required:

R base

MS Office

Java (JDK)

MySQL

MongoDB

Although all the code should run on the R command line, I suggest you run the code in this book from RStudio instead, as it is an easy-to-use IDE. Also, a few chapters specifically require RStudio.

Who this book is for

This book is for data science professionals or analysts who have performed basic data analysis tasks and now want a quick reference that addresses the pain points that crop up while implementing data analysis in R. Those who wish to have an edge over other data analysis professionals will find this book quite useful.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive."

A block of code is set as follows:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows:

# cp /usr/src/asterisk-addons/configs/cdr_mysql.conf.sample /etc/asterisk/cdr_mysql.conf

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking the Next button moves you to the next screen."

Warnings or important notes appear in a box like this.
Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book--what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your email address and password.

2. Hover the mouse pointer on the SUPPORT tab at the top.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box.

5. Select the book for which you're looking to download the code files.

6. Choose from the drop-down menu where you purchased this book from.

7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/R-Data-Analysis-Cookbook-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/RDataAnalysisCookbookSecondEdition_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books--maybe a mistake in the text or the code--we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

 

Acquire and Prepare the Ingredients - Your Data

In this chapter, we will cover:

Working with data

Reading data from CSV files

Reading XML data

Reading JSON data

Reading data from fixed-width formatted files

Reading data from R files and R libraries

Removing cases with missing values

Replacing missing values with the mean

Removing duplicate cases

Rescaling a variable to a specified min-max range

Normalizing or standardizing data in a data frame

Binning numerical data

Creating dummies for categorical variables

Handling missing data

Correcting data

Imputing data

Detecting outliers

Introduction

Data is everywhere, and the amount of digital data in existence is growing rapidly; it is projected to reach 180 zettabytes by 2025. Data science is a field that tries to extract insights and meaningful information from structured and unstructured data through various stages, such as asking questions, getting the data, exploring the data, modeling the data, and communicating the results, as shown in the following diagram:

Data scientists or analysts often need to load or collect data from various sources, in different input formats, into R. Although R has its own native data format, data usually exists in text formats such as Comma Separated Values (CSV), JavaScript Object Notation (JSON), and Extensible Markup Language (XML). This chapter provides recipes to load such data into your R system for processing.

Raw, real-world datasets are often messy, with missing values, unusable formats, and outliers. Very rarely can we start analyzing data immediately after loading it. Often, we will need to preprocess the data to clean, impute, wrangle, and transform it before embarking on analysis. This chapter provides recipes for some common cleaning, missing value imputation, outlier detection, and preprocessing steps.

Working with data

In the wild, datasets come in many different formats, but each computer program expects your data to be organized in a well-defined structure.

As a result, every data science project begins with the same tasks: gathering the data, viewing it, cleaning it, correcting or changing its layout to make it tidy, handling missing values and outliers, modeling the data, and evaluating the results.

With R, you can do everything from collecting your data (from the web or a database) to cleaning it, transforming it, visualizing it, modelling it, and running statistical tests on it.

Reading data from CSV files

CSV formats are best used to represent sets or sequences of records in which each record has an identical list of fields. This corresponds to a single relation in a relational database, or to data (though not calculations) in a typical spreadsheet.

Getting ready

If you have not already downloaded the files for this chapter, do it now and ensure that the auto-mpg.csv file is in your R working directory.
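
How to do it...

The recipe itself is a single call: read the data from the auto-mpg.csv file into a data frame named auto (the object used in the next section):

> auto <- read.csv("auto-mpg.csv", header=TRUE)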

How it works...

The read.csv() function creates a data frame from the data in the .csv file. If we pass header=TRUE, then the function uses the very first row to name the variables in the resulting data frame:

> names(auto)
[1] "No"           "mpg"          "cylinders"
[4] "displacement" "horsepower"   "weight"
[7] "acceleration" "model_year"   "car_name"

The header and sep parameters allow us to specify whether the .csv file has headers and the character used in the file to separate fields. The header=TRUE and sep="," parameters are the defaults for the read.csv() function; we can omit these in the code example.
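
For instance, because these are the defaults, the following two calls are equivalent:

> auto <- read.csv("auto-mpg.csv", header=TRUE, sep=",")
> auto <- read.csv("auto-mpg.csv")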

There's more...

The read.csv() function is a specialized form of read.table(). The latter uses whitespace as the default field separator. We will discuss a few important optional arguments to these functions.

Handling different column delimiters

In regions where a comma is used as the decimal separator, the .csv files use ";" as the field delimiter. While dealing with such data files, use read.csv2() to load data into R.

Alternatively, you can use the read.csv("<file name>", sep=";", dec=",") command.

Use sep="\t" for tab-delimited files.
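
Alternatively, base R provides the read.delim() function, whose default separator is already a tab. As a small sketch (the auto-mpg.tsv filename here is hypothetical):

> dat <- read.delim("auto-mpg.tsv")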

Reading strings as characters and not as factors

By default, R treats strings as factors (categorical variables). In some situations, you may want to leave them as character strings. Use stringsAsFactors=FALSE to achieve this:

> auto <- read.csv("auto-mpg.csv",stringsAsFactors=FALSE)

However, to selectively treat variables as characters, you can load the file with the defaults (that is, read all strings as factors) and then use as.character() to convert the requisite factor variables to characters.
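
For example, assuming the auto data frame was loaded with the defaults, the following converts just the car_name column back to a character vector (note that from R 4.0.0 onward, stringsAsFactors defaults to FALSE, so strings are no longer read as factors automatically):

> auto$car_name <- as.character(auto$car_name)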

Reading data directly from a website

If the data file is available on the web, you can load it directly into R, instead of downloading and saving it locally before loading it into R:

> dat <- read.csv("http://www.exploredata.net/ftp/WHO.csv")

Reading XML data

You may sometimes need to extract data from websites. Many providers also supply data in XML and JSON formats. In this recipe, we learn about reading XML data.

Getting ready

Make sure you have downloaded the files for this chapter and that the cd_catalog.xml and WorldPopulation-wiki.htm files are in your R working directory. If the XML package is not already installed in your R environment, install the package now, as follows:

> install.packages("XML")

How to do it...

XML data can be read by following these steps:

Load the library and initialize:

> library(XML)
> url <- "cd_catalog.xml"

Parse the XML file and get the root node:

> xmldoc <- xmlParse(url)
> rootNode <- xmlRoot(xmldoc)
> rootNode[1]

Extract the XML data:

> data <- xmlSApply(rootNode,function(x) xmlSApply(x, xmlValue))

Convert the extracted data into a data frame:

> cd.catalog <- data.frame(t(data),row.names=NULL)

Verify the results:

> cd.catalog[1:2,]

How it works...

The xmlParse function returns an object of the XMLInternalDocument class, which is a C-level internal data structure.

The xmlRoot() function gets access to the root node and its elements. Let us check the first element of the root node:

> rootNode[1]
$CD
<CD>
  <TITLE>Empire Burlesque</TITLE>
  <ARTIST>Bob Dylan</ARTIST>
  <COUNTRY>USA</COUNTRY>
  <COMPANY>Columbia</COMPANY>
  <PRICE>10.90</PRICE>
  <YEAR>1985</YEAR>
</CD>

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"

To extract data from the root node, we use the xmlSApply() function iteratively over all the children of the root node. The xmlSApply function returns a matrix.

To convert the preceding matrix into a data frame, we transpose the matrix using the t() function and then extract the first two rows from the cd.catalog data frame:

> cd.catalog[1:2,]
             TITLE       ARTIST COUNTRY     COMPANY PRICE YEAR
1 Empire Burlesque    Bob Dylan     USA    Columbia 10.90 1985
2  Hide your heart Bonnie Tyler      UK CBS Records  9.90 1988

There's more...

XML data can be deeply nested and hence can become complex to extract. Knowledge of XPath is helpful to access specific XML tags. R provides several functions, such as xpathSApply and getNodeSet, to locate specific elements.
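
For example, using the xmldoc object parsed earlier in this recipe, an XPath expression can extract all the album titles in one step:

> titles <- xpathSApply(xmldoc, "//CD/TITLE", xmlValue)
> titles[1:2]
[1] "Empire Burlesque" "Hide your heart"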

Extracting HTML table data from a web page

Though it is possible to treat HTML data as a specialized form of XML, R provides specific functions to extract data from HTML tables, as follows:

> url <- "WorldPopulation-wiki.htm"
> tables <- readHTMLTable(url)
> world.pop <- tables[[6]]

The readHTMLTable() function parses the web page and returns a list of all the tables that are found on the page. For tables that have an id attribute, the function uses the id attribute as the name of that list element.

We are interested in extracting the "10 most populous countries" table, which is the sixth table on the page, so we use tables[[6]].

Extracting a single HTML table from a web page

A single table can be extracted using the following command:

> table <- readHTMLTable(url,which=5)

Specify the which argument to extract data from a specific table. R returns a data frame.

Reading JSON data

Several RESTful web services return data in JSON format, which is in some ways simpler and more efficient than XML. This recipe shows you how to read JSON data.

Getting ready

R provides several packages to read JSON data, but we will use the jsonlite package. Install the package in your R environment, as follows:

> install.packages("jsonlite")

If you have not already downloaded the files for this chapter, do it now and ensure that the students.json and student-courses.json files are in your R working directory.

How to do it...

Once the files are ready, load the jsonlite package and read the files as follows:

Load the library:

> library(jsonlite)

Load the JSON data from the files:

> dat.1 <- fromJSON("students.json")
> dat.2 <- fromJSON("student-courses.json")

Load the JSON document from the web:

> url <- "http://finance.yahoo.com/webservice/v1/symbols/allcurrencies/quote?format=json"
> jsonDoc <- fromJSON(url)

Extract the data into data frames:

> dat <- jsonDoc$list$resources$resource$fields

Verify the results:

> dat[1:2,]
> dat.1[1:3,]
> dat.2[,c(1,2,4:5)]

How it works...

The jsonlite package provides two key functions: fromJSON and toJSON.

The fromJSON function can load data either directly from a file or from a web page, as the preceding steps 2 and 3 show. If you get errors in downloading content directly from the web, install and load the httr package.

Depending on the structure of the JSON document, loading the data can vary in complexity.

If given a URL, the fromJSON function returns a list object. Step 4 of the preceding recipe shows how to extract the enclosed data frame.
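
As a minimal illustration of the simplest case, a JSON array of records maps directly to a data frame:

> json.str <- '[{"name":"John","score":90},{"name":"Jane","score":95}]'
> fromJSON(json.str)
  name score
1 John    90
2 Jane    95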

Reading data from fixed-width formatted files

In fixed-width formatted files, columns have fixed widths; if a data element does not use up the entire allotted column width, then the element is padded with spaces to make up the specified width. To read fixed-width text files, specify the columns either by column widths or by starting positions.

Getting ready

Download the files for this chapter and store the student-fwf.txt file in your R working directory.

How to do it...

Read the fixed-width formatted file as follows:

> student <- read.fwf("student-fwf.txt", widths=c(4,15,20,15,4), col.names=c("id","name","email","major","year"))

How it works...

In the student-fwf.txt file, the first column occupies 4 character positions, the second 15, and so on. The c(4,15,20,15,4) expression specifies the widths of the 5 columns in the data file.

We can use the optional col.names argument to supply our own variable names.

There's more...

The read.fwf() function has several optional arguments that come in handy. We discuss a few of these, as follows:

Files with headers

Files with headers use the following command:

> student <- read.fwf("student-fwf-header.txt", widths=c(4,15,20,15,4), header=TRUE, sep="\t", skip=2)

If header=TRUE, the first row of the file is interpreted as containing the column headers. Column headers, if present, need to be separated by the character specified in the sep argument. The sep argument applies only to the header row.

The skip argument denotes the number of lines to skip; in this recipe, the first two lines are skipped.

Excluding columns from data

To exclude a column, make the column width negative. Thus, to exclude the email column, we will specify its width as -20 and also remove the column name from the col.names vector, as follows:

> student <- read.fwf("student-fwf.txt",widths=c(4,15,-20,15,4), col.names=c("id","name","major","year"))

Reading data from R files and R libraries

During data analysis, you will create several R objects. You can save these in the native R data format and retrieve them later as needed.

Getting ready

First, create and save the R objects interactively, as shown in the following code. Make sure you have write access to the R working directory.

> customer <- c("John", "Peter", "Jane")
> orderdate <- as.Date(c('2014-10-1','2014-1-2','2014-7-6'))
> orderamount <- c(280, 100.50, 40.25)
> order <- data.frame(customer,orderdate,orderamount)
> names <- c("John", "Joan")
> save(order, names, file="test.Rdata")
> saveRDS(order, file="order.rds")
> remove(order)

After the objects have been saved, the remove() function deletes the order object from the current session.

How to do it...

To be able to read data from R files and libraries, follow these steps:

Load data from the R data files into memory:

> load("test.Rdata")
> ord <- readRDS("order.rds")

The datasets package is loaded in the R environment by default and contains the iris and cars datasets. To load these datasets into memory, use the following code:

> data(iris)
> data(list=c("cars","iris"))

The first command loads only the iris dataset, and the second loads both the cars and iris datasets.

How it works...

The save() function saves the serialized version of the objects supplied as arguments along with the object name. The subsequent load() function restores the saved objects, with the same object names that they were saved with, to the global environment by default. If there are existing objects with the same names in that environment, they will be replaced without any warnings.

The saveRDS() function saves only one object. It saves the serialized version of the object and not the object name. Hence, with the readRDS() function, the saved object can be restored into a variable with a different name from when it was saved.
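
The following short session illustrates the difference (assuming the test.Rdata and order.rds files created earlier in this recipe exist):

> restored <- load("test.Rdata")   # load() invisibly returns the names of the restored objects
> restored
[1] "order" "names"
> ord <- readRDS("order.rds")      # readRDS() lets us pick any variable name
> identical(order, ord)
[1] TRUE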

There's more...

The preceding recipe has shown you how to read saved R objects. We see more options in this section.

Saving objects selectively in a session

To save objects selectively, use the following commands:

> odd <- c(1,3,5,7)
> even <- c(2,4,6,8)
> save(list=c("odd","even"), file="OddEven.Rdata")

The list argument specifies a character vector containing the names of the objects to be saved. Subsequently, loading data from the OddEven.Rdata file creates both odd and even objects. The saveRDS() function can save only one object at a time.

Attaching/detaching R data files to an environment

While loading Rdata files, if we want to be notified whether objects with the same names already exist in the environment, we can use:

> attach("order.Rdata")

The order.Rdata file contains an object named order. If an object named order already exists in the environment, we will get the following error:

The following object is masked _by_ .GlobalEnv: order

Listing all datasets in loaded packages

All the datasets in the loaded packages can be listed using the following command:

> data()
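
To restrict the listing to a particular package, pass the package argument; for example:

> data(package="datasets")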

Removing cases with missing values

Datasets come with varying amounts of missing data. When we have abundant data, we sometimes (not always) want to eliminate the cases that have missing values for one or more variables. This recipe applies when we want to eliminate cases that have any missing values, as well as when we want to selectively eliminate cases that have missing values for a specific variable alone.

Getting ready

Download the missing-data.csv file from the code files for this chapter to your R working directory. Read the data from the missing-data.csv file, while taking care to identify the string used in the input file for missing values. In our file, missing values are shown with empty strings:

> dat <- read.csv("missing-data.csv", na.strings="")

How to do it...

To get a data frame that has only the cases with no missing values for any variable, use the na.omit() function:

> dat.cleaned <- na.omit(dat)

Now dat.cleaned contains only those cases from dat that have no missing values in any of the variables.

How it works...

The na.omit() function internally uses the is.na() function, which tells us whether its argument is NA. When applied to a single value, it returns a Boolean value. When applied to a collection, it returns a vector:

> is.na(dat[4,2])
[1] TRUE
> is.na(dat$Income)
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[10] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
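
Since is.na() returns a logical vector (or matrix) and TRUE counts as 1 in arithmetic, it combines nicely with other functions; as a quick sketch, this counts the missing values in each column of dat:

> colSums(is.na(dat))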

There's more...

You will sometimes need to do more than just eliminate the cases with any missing values. We discuss some options in this section.

Eliminating cases with NA for selected variables

We might sometimes want to selectively eliminate cases that have NA only for a specific variable. The example data frame has two missing values for Income. To get a data frame with only these two cases removed, use:

> dat.income.cleaned <- dat[!is.na(dat$Income),]
> nrow(dat.income.cleaned)
[1] 25

Finding cases that have no missing values

The complete.cases() function takes a data frame or table as its argument and returns a Boolean vector with TRUE for rows that have no missing values, and FALSE otherwise:

> complete.cases(dat)
 [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
[10]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
[19]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Rows 4, 6, 13, and 17 have at least one missing value. Instead of using the na.omit() function, we can do the following as well:

> dat.cleaned <- dat[complete.cases(dat),]
> nrow(dat.cleaned)
[1] 23

Converting specific values to NA