Machine Learning with R Cookbook, Second Edition - AshishSingh Bhatia - E-Book

Description

Explore over 110 recipes to analyze data and build predictive models with simple and easy-to-use R code



About This Book



  • Apply R to simplify predictive modeling with short and simple code
  • Use machine learning to solve problems ranging from small to big data
  • Build training and testing datasets and apply different classification methods


Who This Book Is For



This book is for data science professionals, data analysts, or people who have used R for data analysis and machine learning and who now wish to become the go-to person for machine learning with R. Those who wish to improve the efficiency of their machine learning models and need to work with different kinds of datasets will find this book very insightful.



What You Will Learn



  • Create and inspect transaction datasets and perform association analysis with the Apriori algorithm
  • Visualize patterns and associations using a range of graphs and find frequent item-sets using the Eclat algorithm
  • Compare differences between each regression method to discover how they solve problems
  • Detect and impute missing values in air quality data
  • Predict possible churn users with the classification approach
  • Plot the autocorrelation function with time series analysis
  • Use the Cox proportional hazards model for survival analysis
  • Implement the clustering method to segment customer data
  • Compress images with the dimension reduction method
  • Incorporate R and Hadoop to solve machine learning problems on big data


In Detail



Big data has become a popular buzzword across many industries. An increasing number of people have been exposed to the term and are looking at how to leverage big data in their own businesses, to improve sales and profitability. However, collecting, aggregating, and visualizing data is just one part of the equation. Being able to extract useful information from data is another task, and a much more challenging one. Machine Learning with R Cookbook, Second Edition uses a practical approach to teach you how to perform machine learning with R. Each chapter is divided into several simple recipes. Through the step-by-step instructions provided in each recipe, you will be able to construct a predictive model by using a variety of machine learning packages. In this book, you will first learn to set up the R environment and use simple R commands to explore data. The next topics cover how to perform statistical analysis and machine learning analysis and how to assess the resulting models, which is covered in detail later in the book. You'll also learn how to integrate R and Hadoop to create a big data analysis platform. The detailed illustrations provide all the information required to start applying machine learning to individual projects. With Machine Learning with R Cookbook, machine learning has never been easier.



Style and approach



This is an easy-to-follow guide packed with hands-on examples of machine learning tasks. Each topic includes step-by-step instructions on tackling difficulties faced when applying R to machine learning.


Page count: 490

Year of publication: 2017




Machine Learning with R Cookbook

Second Edition

Analyze data and build predictive models

AshishSingh Bhatia

Yu-Wei, Chiu (David Chiu)

BIRMINGHAM - MUMBAI

Machine Learning with R Cookbook

Second Edition

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: March 2015

Second edition: October 2017

 

Production reference: 1171017

 

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

 

ISBN 978-1-78728-439-5

 

www.packtpub.com

Credits

Authors: AshishSingh Bhatia; Yu-Wei, Chiu (David Chiu)
Copy Editor: Safis Editing
Reviewers: Ratanlal Mahanta; Saibal Dutta
Project Coordinator: Kinjal Bari
Commissioning Editor: Veena Pagare
Proofreader: Safis Editing
Acquisition Editor: Divya Poojari
Indexer: Francy Puthiry
Content Development Editor: Trusha Shriyan
Graphics: Kirk D'Penha
Technical Editor: Akash Patel
Production Coordinator: Aparna Bhagat

About the Authors

AshishSingh Bhatia is a reader and learner at his core. He has more than 11 years of experience in different IT sectors, encompassing training, development, and management. He has worked in many domains, such as software development, ERP, banking, and training. He is passionate about Python and Java, and recently he has been exploring R. He is mostly involved in web and mobile development in various capacities. He likes to explore new technologies and share his views and thoughts through various online media and magazines. He believes in sharing his experience with the new generation and also takes an active part in training and teaching.

First and foremost, I would like to thank God almighty. I would like to thank my father, mother, brother, and friends. I am also thankful to the whole team at PacktPub, especially Divya and Trusha. My special thanks go to my mother, Smt. Ravindrakaur Bhatia, for guiding and motivating me when it was required most. I also want to take this opportunity to show my gratitude to Mitesh Soni, who introduced me to Packt and started the ball rolling.

Thanks to all who are directly or indirectly involved in this endeavor.

 

 

Yu-Wei, Chiu (David Chiu) is the founder of LargitData Company. He has previously worked for Trend Micro as a software engineer, with the responsibility of building up big data platforms for business intelligence and customer relationship management systems. In addition to being a startup entrepreneur and data scientist, he specializes in using Spark and Hadoop to process big data and apply data mining techniques to data analysis. Yu-Wei is also a professional lecturer, and has delivered talks on Python, R, Hadoop, and tech talks at a variety of conferences.

In 2013, Yu-Wei reviewed Bioinformatics with R Cookbook, a book compiled for Packt Publishing.

He feels immense gratitude to his family and friends for supporting and encouraging him to complete this book. Here, he sincerely says thanks to his mother, Ming-Yang Huang (Miranda Huang); his mentor, Man-Kwan Shan; proofreader of this book, Brendan Fisher; Taiwan R User Group; Data Science Program (DSP); and more friends who have offered their support.

About the Reviewers

Ratanlal Mahanta has several years of experience in the modeling and simulation of quantitative trading. He works as a senior quantitative analyst at GPSK Investment Group, Kolkata. Ratanlal holds a master's degree in computational finance, and his research areas include quant trading, optimal execution, machine learning, and high-frequency trading.

He has also reviewed Mastering R for Quantitative Finance, Mastering Scientific Computing with R, Machine Learning with R Cookbook, Mastering Python for Data Science, and Building a Recommendation System with R, all by Packt Publishing.

 

 

Saibal Dutta has been working as an analytical consultant in SAS Research and Development. He is also pursuing a PhD in data mining and machine learning from the Indian Institute of Technology, Kharagpur. He holds a Master of Technology in electronics and communication from the National Institute of Technology, Rourkela. He has worked at TATA Communications, Pune, and HCL Technologies Limited, Noida, as a consultant. In his 7 years of consulting experience, he has been associated with global players such as IKEA (in Sweden) and Pearson (in the U.S.). His passion for entrepreneurship has led him to start his own start-up in the field of data analytics, which is in the bootstrapping stage. His areas of expertise include data mining, machine learning, image processing, and business consultation.

I would like to thank my advisor, Prof. Sujoy Bhattacharya; all my colleagues, especially Ashwin Deokar, Lokesh Nagar, Savita Angadi, and Swarup De; and my family and friends, especially Madhuparna Bit, for their encouragement, support, and inspiration.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review.

If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Sections

Getting ready

How to do it…

How it works…

There's more…

See also

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Practical Machine Learning with R

Introduction

Downloading and installing R

Getting ready

How to do it...

How it works...

See also

Downloading and installing RStudio

Getting ready

How to do it...

How it works...

See also

Installing and loading packages

Getting ready

How to do it...

How it works...

See also

Understanding of basic data structures

Data types

Data structures

Vectors

How to do it...

How it works...

Lists

How to do it...

How it works...

Array

How to do it...

How it works...

Matrix

How to do it...

DataFrame

How to do it...

Basic commands for subsetting

How to do it...

Data input

Reading and writing data

Getting ready

How to do it...

How it works...

There's more...

Manipulating data

Getting ready

How to do it...

How it works...

There's more...

Applying basic statistics

Getting ready

How to do it...

How it works...

There's more...

Visualizing data

Getting ready

How to do it...

How it works...

See also

Getting a dataset for machine learning

Getting ready

How to do it...

How it works...

See also

Data Exploration with Air Quality Datasets

Introduction

Using air quality dataset

Getting ready

How to do it...

How it works...

There's more...

Converting attributes to factor

Getting ready

How to do it...

How it works...

There's more...

Detecting missing values

Getting ready

How to do it...

How it works...

There's more...

Imputing missing values

Getting ready

How to do it...

How it works...

Exploring and visualizing data

Getting ready

How to do it...

Predicting values from datasets

Getting ready

How to do it...

How it works...

Analyzing Time Series Data

Introduction

Looking at time series data

Getting ready

How to do it...

How it works...

See also

Plotting and forecasting time series data

Getting ready

How to do it...

How it works...

See also

Extracting, subsetting, merging, filling, and padding

Getting ready

How to do it...

How it works...

See also

Successive differences and moving averages

Getting ready

How to do it...

How it works...

See also

Exponential smoothing

Getting ready

How to do it...

How it works...

See also

Plotting the autocorrelation function

Getting ready

How to do it...

How it works...

See also

R and Statistics

Introduction

Understanding data sampling in R

Getting ready

How to do it...

How it works...

See also

Operating a probability distribution in R

Getting ready

How to do it...

How it works...

There's more...

Working with univariate descriptive statistics in R

Getting ready

How to do it...

How it works...

There's more...

Performing correlations and multivariate analysis

Getting ready

How to do it...

How it works...

See also

Conducting an exact binomial test

Getting ready

How to do it...

How it works...

See also

Performing a student's t-test

Getting ready

How to do it...

How it works...

See also

Performing the Kolmogorov-Smirnov test

Getting ready

How to do it...

How it works...

See also

Understanding the Wilcoxon Rank Sum and Signed Rank test

Getting ready

How to do it...

How it works...

See also

Working with Pearson's Chi-squared test

Getting ready

How to do it...

How it works...

There's more...

Conducting a one-way ANOVA

Getting ready

How to do it...

How it works...

There's more...

Performing a two-way ANOVA

Getting ready

How to do it...

How it works...

See also

Understanding Regression Analysis

Introduction

Different types of regression

Fitting a linear regression model with lm

Getting ready

How to do it...

How it works...

There's more...

Summarizing linear model fits

Getting ready

How to do it...

How it works...

See also

Using linear regression to predict unknown values

Getting ready

How to do it...

How it works...

See also

Generating a diagnostic plot of a fitted model

Getting ready

How to do it...

How it works...

There's more...

Fitting multiple regression

Getting ready

How to do it...

How it works...

Summarizing multiple regression

Getting ready

How to do it...

How it works...

See also

Using multiple regression to predict unknown values

Getting ready

How to do it...

How it works...

See also

Fitting a polynomial regression model with lm

Getting ready

How to do it...

How it works...

There's more...

Fitting a robust linear regression model with rlm

Getting ready

How to do it...

How it works...

There's more...

Studying a case of linear regression on SLID data

Getting ready

How to do it...

How it works...

See also

Applying the Gaussian model for generalized linear regression

Getting ready

How to do it...

How it works...

See also

Applying the Poisson model for generalized linear regression

Getting ready

How to do it...

How it works...

See also

Applying the Binomial model for generalized linear regression

Getting ready

How to do it...

How it works...

See also

Fitting a generalized additive model to data

Getting ready

How to do it...

How it works...

See also

Visualizing a generalized additive model

Getting ready

How to do it...

How it works...

There's more...

Diagnosing a generalized additive model

Getting ready

How to do it...

How it works...

There's more...

Survival Analysis

Introduction

Loading and observing data

Getting ready

How to do it...

How it works...

There's more...

Viewing the summary of survival analysis

Getting ready

How to do it...

How it works...

Visualizing the Survival Curve

Getting ready

How to do it...

How it works...

Using the log-rank test

Getting ready

How to do it...

How it works...

Using the COX proportional hazard model

Getting ready

How to do it...

How it works...

Nelson-Aalen Estimator of cumulative hazard

Getting ready

How to do it...

How it works...

See also

Classification 1 - Tree, Lazy, and Probabilistic

Introduction

Preparing the training and testing datasets

Getting ready

How to do it...

How it works...

There's more...

Building a classification model with recursive partitioning trees

Getting ready

How to do it...

How it works...

See also

Visualizing a recursive partitioning tree

Getting ready

How to do it...

How it works...

See also

Measuring the prediction performance of a recursive partitioning tree

Getting ready

How to do it...

How it works...

See also

Pruning a recursive partitioning tree

Getting ready

How to do it...

How it works...

See also

Handling missing data and split and surrogate variables

Getting ready

How to do it...

How it works...

See also

Building a classification model with a conditional inference tree

Getting ready

How to do it...

How it works...

See also

Control parameters in conditional inference trees

Getting ready

How to do it...

How it works...

See also

Visualizing a conditional inference tree

Getting ready

How to do it...

How it works...

See also

Measuring the prediction performance of a conditional inference tree

Getting ready

How to do it...

How it works...

See also

Classifying data with the k-nearest neighbor classifier

Getting ready

How to do it...

How it works...

See also

Classifying data with logistic regression

Getting ready

How to do it...

How it works...

See also

Classifying data with the Naïve Bayes classifier

Getting ready

How to do it...

How it works...

See also

Classification 2 - Neural Network and SVM

Introduction

Classifying data with a support vector machine

Getting ready

How to do it...

How it works...

See also

Choosing the cost of a support vector machine

Getting ready

How to do it...

How it works...

See also

Visualizing an SVM fit

Getting ready

How to do it...

How it works...

See also

Predicting labels based on a model trained by a support vector machine

Getting ready

How to do it...

How it works...

There's more...

Tuning a support vector machine

Getting ready

How to do it...

How it works...

See also

The basics of neural network

Getting ready

How to do it...

Training a neural network with neuralnet

Getting ready

How to do it...

How it works...

See also

Visualizing a neural network trained by neuralnet

Getting ready

How to do it...

How it works...

See also

Predicting labels based on a model trained by neuralnet

Getting ready

How to do it...

How it works...

See also

Training a neural network with nnet

Getting ready

How to do it...

How it works...

See also

Predicting labels based on a model trained by nnet

Getting ready

How to do it...

How it works...

See also

Model Evaluation

Introduction

Why do models need to be evaluated?

Different methods of model evaluation

Estimating model performance with k-fold cross-validation

Getting ready

How to do it...

How it works...

There's more...

Estimating model performance with Leave One Out Cross Validation

Getting ready

How to do it...

How it works...

See also

Performing cross-validation with the e1071 package

Getting ready

How to do it...

How it works...

See also

Performing cross-validation with the caret package

Getting ready

How to do it...

How it works...

See also

Ranking the variable importance with the caret package

Getting ready

How to do it...

How it works...

There's more...

Ranking the variable importance with the rminer package

Getting ready

How to do it...

How it works...

See also

Finding highly correlated features with the caret package

Getting ready

How to do it...

How it works...

See also

Selecting features using the caret package

Getting ready

How to do it...

How it works...

See also

Measuring the performance of the regression model

Getting ready

How to do it...

How it works...

There's more...

Measuring prediction performance with a confusion matrix

Getting ready

How to do it...

How it works...

See also

Measuring prediction performance using ROCR

Getting ready

How to do it...

How it works...

See also

Comparing an ROC curve using the caret package

Getting ready

How to do it...

How it works...

See also

Measuring performance differences between models with the caret package

Getting ready

How to do it...

How it works...

See also

Ensemble Learning

Introduction

Using the Super Learner algorithm

Getting ready

How to do it...

How it works...

Using ensemble to train and test

Getting ready

How to do it...

How it works...

Classifying data with the bagging method

Getting ready

How to do it...

How it works...

There's more...

Performing cross-validation with the bagging method

Getting ready

How to do it...

How it works...

See also

Classifying data with the boosting method

Getting ready

How to do it...

How it works...

There's more...

Performing cross-validation with the boosting method

Getting ready

How to do it...

How it works...

See also

Classifying data with gradient boosting

Getting ready

How to do it...

How it works...

There's more...

Calculating the margins of a classifier

Getting ready

How to do it...

How it works...

See also

Calculating the error evolution of the ensemble method

Getting ready

How to do it...

How it works...

See also

Classifying data with random forest

Getting ready

How to do it...

How it works...

There's more...

Estimating the prediction errors of different classifiers

Getting ready

How to do it...

How it works...

See also

Clustering

Introduction

Clustering data with hierarchical clustering

Getting ready

How to do it...

How it works...

There's more...

Cutting trees into clusters

Getting ready

How to do it...

How it works...

There's more...

Clustering data with the k-means method

Getting ready

How to do it...

How it works...

See also

Drawing a bivariate cluster plot

Getting ready

How to do it...

How it works...

There's more...

Comparing clustering methods

Getting ready

How to do it...

How it works...

See also

Extracting silhouette information from clustering

Getting ready

How to do it...

How it works...

See also

Obtaining the optimum number of clusters for k-means

Getting ready

How to do it...

How it works...

See also

Clustering data with the density-based method

Getting ready

How to do it...

How it works...

See also

Clustering data with the model-based method

Getting ready

How to do it...

How it works...

See also

Visualizing a dissimilarity matrix

Getting ready

How to do it...

How it works...

There's more...

Validating clusters externally

Getting ready

How to do it...

How it works...

See also

Association Analysis and Sequence Mining

Introduction

Transforming data into transactions

Getting ready

How to do it...

How it works...

See also

Displaying transactions and associations

Getting ready

How to do it...

How it works...

See also

Mining associations with the Apriori rule

Getting ready

How to do it...

How it works...

See also

Pruning redundant rules

Getting ready

How to do it...

How it works...

See also

Visualizing association rules

Getting ready

How to do it...

How it works...

See also

Mining frequent itemsets with Eclat

Getting ready

How to do it...

How it works...

See also

Creating transactions with temporal information

Getting ready

How to do it...

How it works...

See also

Mining frequent sequential patterns with cSPADE

Getting ready

How to do it...

How it works...

See also

Using the TraMineR package for sequence analysis

Getting ready

How to do it...

How it works...

Visualizing sequence, Chronogram, and Traversal Statistics

Getting ready

How to do it...

How it works...

See also

Dimension Reduction

Introduction

Why to reduce the dimension?

Performing feature selection with FSelector

Getting ready

How to do it...

How it works...

See also

Performing dimension reduction with PCA

Getting ready

How to do it...

How it works...

There's more...

Determining the number of principal components using the scree test

Getting ready

How to do it...

How it works...

There's more...

Determining the number of principal components using the Kaiser method

Getting ready

How to do it...

How it works...

See also

Visualizing multivariate data using biplot

Getting ready

How to do it...

How it works...

There's more...

Performing dimension reduction with MDS

Getting ready

How to do it...

How it works...

There's more...

Reducing dimensions with SVD

Getting ready

How to do it...

How it works...

See also

Compressing images with SVD

Getting ready

How to do it...

How it works...

See also

Performing nonlinear dimension reduction with ISOMAP

Getting ready

How to do it...

How it works...

There's more...

Performing nonlinear dimension reduction with Local Linear Embedding

Getting ready

How to do it...

How it works...

See also

Big Data Analysis (R and Hadoop)

Introduction

Preparing the RHadoop environment

Getting ready

How to do it...

How it works...

See also

Installing rmr2

Getting ready

How to do it...

How it works...

See also

Installing rhdfs

Getting ready

How to do it...

How it works...

See also

Operating HDFS with rhdfs

Getting ready

How to do it...

How it works...

See also

Implementing a word count problem with RHadoop

Getting ready

How to do it...

How it works...

See also

Comparing the performance between an R MapReduce program and a standard R program

Getting ready

How to do it...

How it works...

See also

Testing and debugging the rmr2 program

Getting ready

How to do it...

How it works...

See also

Installing plyrmr

Getting ready

How to do it...

How it works...

See also

Manipulating data with plyrmr

Getting ready

How to do it...

How it works...

See also

Conducting machine learning with RHadoop

Getting ready

How to do it...

How it works...

See also

Configuring RHadoop clusters on Amazon EMR

Getting ready

How to do it...

How it works...

See also

Preface

Big data has become a popular buzzword across many industries. An increasing number of people have been exposed to the term and are looking at how to leverage big data in their own businesses, to improve sales and profitability. However, collecting, aggregating, and visualizing data is just one part of the equation. Being able to extract useful information from data is another task, and much more challenging.

Traditionally, most researchers perform statistical analysis using historical samples of data. The main downside of this process is that conclusions drawn from statistical analysis are limited. In fact, researchers usually struggle to uncover hidden patterns and unknown correlations from target data. Aside from applying statistical analysis, machine learning has emerged as an alternative. Feeding the data into a learning algorithm can yield a more accurate predictive model. Through machine learning, the analysis of business operations and processes is not limited to human-scale thinking. Machine-scale analysis enables businesses to discover hidden value in big data.

R is one of the most widely used tools for machine learning and data analysis. In addition to being popular among data scientists, R is open source and free for all users. The R programming language offers a variety of learning packages and visualization functions, which enable users to analyze data on the fly. Any user can perform machine learning with R on their dataset without knowing every detail of the mathematical models behind the analysis.

Machine Learning with R Cookbook takes a practical approach to teaching you how to perform machine learning with R. Each of the 14 chapters is divided into several simple recipes. Through the step-by-step instructions provided in each recipe, the reader can construct a predictive model by using a variety of machine learning packages.

What this book covers

Chapter 1, Practical Machine Learning with R, shows how to install and set up the R environment. It covers package installation, basic syntax, and data types, followed by reading and writing data from various sources. It also covers basic statistics and visualization using R.

Chapter 2, Data Exploration with Air Quality Datasets, shows what real data looks like in R. It covers loading, exploring, and visualizing the data.

Chapter 3, Analyzing Time Series Data, introduces a different type of data, in which each observation is tied to a point in time. It covers how to handle time series in R.

Chapter 4, R and Statistics, covers data sampling, probability distributions, univariate descriptive statistics, correlation, multivariate analysis, and linear regression, as well as the exact binomial test, Student's t-test, the Kolmogorov-Smirnov test, the Wilcoxon Rank Sum and Signed Rank test, Pearson's Chi-squared test, one-way ANOVA, and two-way ANOVA.

Chapter 5, Understanding Regression Analysis, introduces supervised learning as a way to analyze the relationship between dependent and independent variables. It covers different types of distribution models, followed by the generalized additive model.

Chapter 6, Survival Analysis, shows how to analyze data where the outcome variable is the time until an event occurs, an approach widely used in clinical trials.

Chapter 7, Classification 1 – Tree, Lazy, and Probabilistic, deals with classification models built from a training dataset whose categories are already known.

Chapter 8, Classification 2 – Neural Network and SVM, shows how to train a support vector machine and a neural network, and how to visualize and tune both.

Chapter 9, Model Evaluation, shows how to evaluate the performance of a fitted model.

Chapter 10, Ensemble Learning, shows how to use bagging and boosting to classify data and how to perform cross-validation to estimate the error rate. It also covers random forest.

Chapter 11, Clustering, covers grouping similar objects, a technique widely used in business applications. It covers four clustering techniques and validating clusters internally.

Chapter 12, Association Analysis and Sequence Mining, covers finding hidden relationships within a transaction dataset. It shows how to create and inspect a transaction dataset, perform association analysis with the Apriori algorithm, visualize associations in various graph formats, and find frequent itemsets using the Eclat algorithm.

Chapter 13, Dimension Reduction, shows how to deal with redundant data and remove irrelevant data. It shows how to perform feature ranking and selection, feature extraction, and dimension reduction using linear and nonlinear methods.

Chapter 14, Big Data Analysis (R and Hadoop), shows how R can be used with big data. It covers preparing the Hadoop environment, performing MapReduce from R, operating HDFS, and performing common data operations.

What you need for this book

All the examples covered in this book have been tested on R version 3.4.1 and RStudio version 1.0.153. Chapter 1, Practical Machine Learning with R, covers how to download and install them.

Who this book is for

This book is for data science professionals, data analysts, or anyone who has used R for data analysis and machine learning, and now wishes to become the go-to person for machine learning with R. Those who wish to improve the efficiency of their machine learning models and need to work with different kinds of datasets will find this book quite insightful.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also).

To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready

This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.

See also

This section provides helpful links to other useful information for the recipe.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows:

"The acf function will plot the correlation between all pairs of data points with lagged values."

A block of code is set as follows:

> install.packages("forecast")
> require(forecast)
> forecast(my_series, 4)

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Click on Download R for Windows, as shown in the following screenshot."

Warnings or important notes appear in a box like this.
Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

Log in or register to our website using your e-mail address and password.

Hover the mouse pointer on the SUPPORT tab at the top.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box.

Select the book for which you're looking to download the code files.

Choose from the drop-down menu where you purchased this book from.

Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-with-R-Cookbook-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Practical Machine Learning with R

In this chapter, we will cover the following topics:

Downloading and installing R

Downloading and installing RStudio

Installing and loading packages

Understanding basic data structures

Basic commands for subsetting

Reading and writing data

Manipulating data

Applying basic statistics

Visualizing data

Getting a dataset for machine learning

Introduction

The aim of machine learning is to uncover hidden patterns and unknown correlations, and to find useful information from data. In addition to this, through incorporation with data analysis, machine learning can be used to perform predictive analysis. With machine learning, the analysis of business operations and processes is not limited to human scale thinking; machine scale analysis enables businesses to capture hidden values in big data.

Machine learning has similarities to the human reasoning process. Unlike traditional analysis, in which the generated model cannot evolve as data is accumulated, machine learning can learn from the data that is processed and analyzed. In other words, the more data that is processed, the more it can learn.

R, a GNU project and a dialect of the S language, is a powerful statistical language that can be used to manipulate and analyze data. Additionally, R provides many machine learning packages and visualization functions, which enable users to analyze data on the fly. Most importantly, R is open source and free.

Using R greatly simplifies machine learning. All you need to know is how each algorithm can solve your problem and then you can simply use a written package to quickly generate prediction models on data with a few command lines. For example, you can perform Naïve Bayes for spam mail filtering, conduct k-means clustering for customer segmentation, use linear regression to forecast house prices, or implement a hidden Markov model to predict the stock market, as shown in the following screenshot:

Stock market prediction using R

Moreover, you can perform nonlinear dimension reduction to calculate the dissimilarity of image data and visualize the clustered graph, as shown in the following screenshot. All you need to do is follow the recipes provided in this book:

A clustered graph of face image data
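
As a rough illustration of how few lines such a model can take, the following sketch fits a linear regression with base R. The built-in cars dataset and the variable names fit and predicted are stand-ins chosen for this example (house price data is not reproduced here), so treat it as a minimal sketch rather than one of the recipes:

```r
# Fit a simple linear regression on the built-in cars dataset:
# stopping distance (dist) modeled as a function of speed.
fit <- lm(dist ~ speed, data = cars)

# Inspect the estimated intercept and slope.
coef(fit)

# Predict the stopping distance for a new, hypothetical speed value.
predicted <- predict(fit, newdata = data.frame(speed = 15))
predicted
```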

This chapter serves as an overall introduction to machine learning and R; the first few recipes introduce how to set up the R environment and the integrated development environment, RStudio. After setting up the environment, the following recipe introduces package installation and loading. In order to understand how data analysis is practiced using R, the next four recipes cover data read/write, data manipulation, basic statistics, and data visualization using R. The last recipe in the chapter lists useful data sources and resources.

Downloading and installing R

To use R, you must first install it on your computer. This recipe gives detailed instructions on how to download and install R.

Getting ready

If you are new to the R language, you can find a detailed introduction, language history, and functionality on the official website (http://www.r-project.org/). When you are ready to download and install R, please access the following link: http://cran.r-project.org/.

How to do it...

Please perform the following steps to download and install R for Windows and macOS:

Go to the R project website, http://www.r-project.org/, and click on the download R link (that is, http://cran.r-project.org/mirrors.html):

R Project home page

You may select the mirror location closest to you:

CRAN mirrors

Select the correct download link based on your operating system:

Click on the download link based on your OS

As the installation of R differs for Windows and macOS, the steps required to install R for each OS are provided here.

For Windows:

Click on Download R for Windows, as shown in the following screenshot, and then click on base:

Click on Download R 3.x.x for Windows:

The installation file should be downloaded. Once the download is finished, you can double-click on the installation file and begin installing R. It will ask you to select the setup language:

Installation step - Selecting Language

The next screen will be an installation screen; click on Next on all screens to complete the installation. Once installed, you can see the shortcut icon on the desktop:

R icon for 32 bit and 64 bit on desktop

Double-click on the icon and it will open the R Console:

The Windows R Console

For macOS:

Go to Download R for (Mac) OS X, as shown in the following screenshot.

Click on the latest version (the R-3.4.1.pkg file) according to your macOS version:

Double-click on the downloaded installation file (.pkg extension) and begin to install R. Leave all the installation options as the default settings if you do not want to make any changes:

Follow the onscreen instructions through Introduction, Read Me, License, Destination Select, Installation Type, Installation, and Summary, and click on Continue to complete the installation.

After the file is installed, you can use Spotlight search or go to the Applications folder to find R:

Use Spotlight search to find R

Click on R to open R Console:

As an alternative to downloading a Mac .pkg file to install R, Mac users can also install R using Homebrew:

Download XQuartz-2.X.X.dmg from https://xquartz.macosforge.org/landing/.

Double-click on the .dmg file to mount it.

Update brew with the following command line:

$ brew update

Clone the repository and symlink all its formulae to homebrew/science:

$ brew tap homebrew/science

Install gfortran:

$ brew install gfortran

Install R:

$ brew install R

For Linux users, there are precompiled binaries for Debian, RedHat, SUSE, and Ubuntu. Alternatively, you can install R from source code. Besides downloading precompiled binaries, you can install R for Linux through a package manager. Here are the installation steps for Ubuntu, CentOS, and Fedora.

Downloading and installing R on Ubuntu:

Add the CRAN entry to the /etc/apt/sources.list file, replacing the placeholders in angle brackets with the appropriate values:

$ sudo sh -c "echo 'deb http://<cran mirror site url>/bin/linux/ubuntu <ubuntu version>/' >> /etc/apt/sources.list"

Then, update the repository:

$ sudo apt-get update

Install R with the following command:

$ sudo apt-get install r-base

Start R in the command line:

$ R

Downloading and installing R on CentOS 5:

Get the rpm for the RHEL EPEL repository of CentOS 5:

$ wget http://dl.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm

Install the CentOS 5 RHEL EPEL repository:

$ sudo rpm -Uvh epel-release-5-4.noarch.rpm

Update the installed packages:

$ sudo yum update

Install R through the repository:

$ sudo yum install R

Start R in the command line:

$ R

Downloading and installing R on CentOS 6:

Get the rpm for the RHEL EPEL repository of CentOS 6:

$ wget http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

Install the CentOS 6 RHEL EPEL repository:

$ sudo rpm -Uvh epel-release-6-8.noarch.rpm

Update the installed packages:

$ sudo yum update

Install R through the repository:

$ sudo yum install R

Start R in the command line:

$ R

Downloading and installing R on Fedora (latest version):

$ sudo dnf install R

This will install R and all its dependencies.

How it works...

CRAN provides precompiled binaries for Linux, macOS, and Windows. For macOS and Windows users, the installation procedures are straightforward. You can generally follow the onscreen instructions to complete the installation. Linux users can use the package manager provided for each platform to install R, or build R from the source code.
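
Once the installation finishes, a quick check from the R console can confirm which version is running. This is a small sketch using base R only; the variable name meets_book_version is an assumption made for illustration, and 3.4.1 is the version the book's examples were tested on:

```r
# Print the full version string of the running R installation.
R.version.string

# Compare the running version with the version used in this book.
meets_book_version <- getRversion() >= "3.4.1"
meets_book_version

# Show the directories in which packages will be installed.
.libPaths()
```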

See also

For those planning to build R from the source code, refer to R Installation and Administration (http://cran.r-project.org/doc/manuals/R-admin.html), which illustrates how to install R on a variety of platforms.

Downloading and installing RStudio

To write an R script, one can use R Console, R Commander, or any text editor (such as Emacs, Vim, or Sublime Text). However, the assistance of RStudio, an integrated development environment (IDE) for R, can make development a lot easier.

RStudio provides comprehensive facilities for software development. Built-in features, such as syntax highlighting, code completion, and smart indentation, help maximize productivity. To make R programming more manageable, RStudio also integrates the main interface into a four-panel layout. It includes an interactive R Console, a tabbed source code editor, a panel for the currently active objects/history, and a tabbed panel for the file browser/plot window/package install window/R help window. Moreover, RStudio is open source and is available for many platforms, such as Windows, macOS, and Linux. This recipe shows how to download and install RStudio.

Getting ready

RStudio requires a working R installation; when RStudio loads, it must be able to locate a version of R. You must therefore have completed the previous recipe with R installed on your OS before proceeding to install RStudio.

How to do it...

Perform the following steps to download and install RStudio for Windows and macOS users:

Access RStudio's official site by using the following URL: http://www.rstudio.com/products/RStudio/

RStudio home page

For the desktop version installation, click on RStudio Desktop under the Desktop section. It will redirect you to the bottom of the home page:

Click on the DOWNLOAD RSTUDIO DESKTOP button (http://www.rstudio.com/products/rstudio/download/). It will display the download page, with the options of an open source license and a commercial license. Scroll down to RStudio Desktop Open Source License and click on the DOWNLOAD button:

RStudio Download page

It will display different installers for different OS types. Select the appropriate option and download RStudio:

RStudio Download page

Install RStudio by double-clicking on the downloaded packages. For Windows users, follow the onscreen instructions to install the application:

RStudio Installation page

For Mac users, simply drag the RStudio icon to the Applications folder.

Start RStudio:

The RStudio console

Perform the following steps for downloading and installing RStudio for Ubuntu/Debian and RedHat/CentOS users:

For Debian(6+)/Ubuntu(10.04+) 32 bit:

$ wget http://download1.rstudio.org/rstudio-0.98.1091-i386.deb

$ sudo gdebi rstudio-0.98.1091-i386.deb

For Debian(6+)/Ubuntu(10.04+) 64 bit:

$ wget http://download1.rstudio.org/rstudio-0.98.1091-amd64.deb

$ sudo gdebi rstudio-0.98.1091-amd64.deb

For RedHat/CentOS(5.4+) 32 bit:

$ wget http://download1.rstudio.org/rstudio-0.98.1091-i686.rpm

$ sudo yum install --nogpgcheck rstudio-0.98.1091-i686.rpm

For RedHat/CentOS(5.4+) 64 bit:

$ wget http://download1.rstudio.org/rstudio-0.98.1091-x86_64.rpm

$ sudo yum install --nogpgcheck rstudio-0.98.1091-x86_64.rpm

How it works...

The RStudio program can be run on the desktop or through a web browser. The desktop version is available for the Windows, macOS, and Linux platforms with similar operations across all platforms. For Windows and macOS users, after downloading the precompiled package of RStudio, follow the onscreen instructions, shown in the preceding steps, to complete the installation. Linux users may use the package management system provided for installation.

See also

In addition to the desktop version, users may install a server version to provide access to multiple users. The server version provides a URL that users can access to use the RStudio resources. To install RStudio Server, please refer to the following link: http://www.rstudio.com/ide/download/server.html. This page provides installation instructions for the following Linux distributions: Debian (6+), Ubuntu (10.04+), RedHat, and CentOS (5.4+).

For other Linux distributions, you can build RStudio from the source code.

Installing and loading packages

After successfully installing R, users can download, install, and update packages from the repositories. As R allows users to create their own packages, official and non-official repositories are provided to manage these user-created packages. CRAN is the official R package repository. Currently, the CRAN package repository features 11,589 available packages (as of 10/11/2017). Through the use of the packages provided on CRAN, users may extend the functionality of R to machine learning, statistics, and related purposes. CRAN is a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. You may select the closest CRAN mirror to your location to download packages.

Getting ready

Start an R session on your host computer.

How to do it...

Perform the following steps to install and load R packages:

List the installed packages:

> library()

Set the default CRAN mirror:

> chooseCRANmirror()

R will return a list of CRAN mirrors, and then ask the user to either type a mirror ID to select it, or enter zero to exit:

Install a package from CRAN; take package e1071 as an example:

> install.packages("e1071")

Update a package from CRAN; take package e1071 as an example:

> update.packages(oldPkgs = "e1071")

Load the package:

> library(e1071)

If you would like to view the documentation of the package, you can use the help function:

> help(package ="e1071")

If you would like to view the documentation of a function, you can also use the help function:

> help(svm, e1071)

Alternatively, you can use the help shortcut, ?, to view the help document for this function:

> ?e1071::svm

If the function does not provide any documentation, you may want to search the supplied documentation for a given keyword. For example, if you wish to search for documentation related to svm:

> help.search("svm")

Alternatively, you can use ?? as the shortcut for help.search:

> ??svm

To view the arguments taken by a function, simply use the args function. For example, if you would like to know the arguments taken by the lm function:

> args(lm)

Some packages will provide examples and demos; you can use example or demo to view an example or demo. For example, one can view an example of the lm function and a demo of the graphics package by typing the following commands:

> example(lm)

> demo(graphics)

To view all the available demos, you may use the demo function to list all of them:

> demo()

How it works...

This recipe first introduces how to view installed packages, install packages from CRAN, and load new packages. Before installing packages, those of you who are interested in the listing of CRAN packages can refer to http://cran.r-project.org/web/packages/available_packages_by_name.html.

When a package is installed, documentation related to the package is also provided. You are, therefore, able to view the documentation or the related help pages of installed packages and functions. Additionally, demos and examples are provided by packages that can help users understand the capability of the installed package.
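
The install-then-load steps above are often combined so that a package is installed only when it is missing. The following is a minimal sketch, not part of the recipe: ensure_package is a hypothetical helper name, and the stats package is used only because it ships with R, so nothing is downloaded when the example runs:

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Requires a CRAN mirror to be configured (see chooseCRANmirror()).
    install.packages(pkg)
  }
  # Report whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# stats ships with R, so this call does not trigger an installation.
available <- ensure_package("stats")
available

# For a CRAN package, the same pattern would be:
# ensure_package("e1071")
# library(e1071)
```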

See also

Besides installing packages from CRAN, there are other R package repositories, including Crantastic, a community site for rating and reviewing CRAN packages, and R-Forge, a central platform for the collaborative development of R packages. In addition to this, Bioconductor provides R packages for the analysis of genomic data.

If you would like to find relevant functions and packages, please visit the list of task views at http://cran.r-project.org/web/views/, or search for keywords at http://rseek.org.

Understanding basic data structures

Ensure you have completed the previous recipes by installing R on your operating system.

Data types

You need a brief idea of the basic data types and structures in R in order to grasp all the recipes in this book. This section gives you an overview of them and gets you ready to use R. R supports the basic data types found in most other programming and scripting languages. In simple words, data can be of numeric, character, date, and logical type. As the name suggests, numeric covers all types of numbers, while logical allows only TRUE and FALSE. To check the type of data, use the class function, which displays the class of the data.
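
As a brief sketch of the class function applied to the types named above (the variable names and the sample date are arbitrary choices for illustration):

```r
num_value  <- 123                     # numbers are "numeric" by default
int_value  <- 123L                    # the L suffix creates an "integer"
lgl_value  <- TRUE
date_value <- as.Date("2017-10-11")   # ISO format: year-month-day

class(num_value)   # "numeric"
class(int_value)   # "integer"
class(lgl_value)   # "logical"
class(date_value)  # "Date"
```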

Perform the following task in the R Console or RStudio:

> x=123
> class(x)
Output:
[1] "numeric"
> x="ABC"
> class(x)
Output:
[1] "character"

Data structures

R supports different types of data structures to store and process data. The following is a list of basic and commonly used data structures used in R:

Vectors

List

Array

Matrix

DataFrames

Vectors

A vector is a container that stores data of the same type. It can be thought of as a traditional array in other programming languages. It is not to be confused with a mathematical vector, which has rows and columns. To create a vector, use the c() function, which combines its arguments. One of the beautiful features of vectors is that any operation performed on a vector is performed on each of its elements. If a vector consists of three elements, adding two will increase every element by two.

How it works...

Printing a vector starts with the index [1], which shows that the elements of a vector are indexed and that indexing starts from 1, not from 0 as in many other languages. Any operation done on a vector is applied to its individual elements, so a multiplication is applied to each element of the vector. If a vector is passed as an argument to many of the built-in functions, the function is applied to the individual elements. This is powerful because it removes the need to write loops for such operations. A vector changes its type on the basis of the data it holds and the operations applied to it. Using x==2 checks each element of the vector for equality with two and returns a vector of logical values, that is, TRUE or FALSE. There are many other ways of creating a vector; one such way is shown in creating vector t.
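
The behavior described above can be sketched as follows. This is an illustrative example under assumed values, not the book's original listing: the contents of x and the way t is created are assumptions.

```r
# Create a numeric vector with c().
x <- c(1, 2, 3)
x          # printing starts with [1]

# Arithmetic is applied to every element; no loop is needed.
x + 2      # 3 4 5
x * 2      # 2 4 6

# Indexing starts at 1.
x[1]       # 1

# A comparison returns a logical vector.
x == 2     # FALSE TRUE FALSE

# Mixing types coerces all elements to a common type.
mixed <- c(1, "a")
class(mixed)   # "character"

# Another way of creating a vector: the colon operator builds a sequence.
t <- 1:5
t          # 1 2 3 4 5
```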

Lists

Unlike a vector, a list can store any type of data. A list is, again, a container that can store arbitrary data. A list can contain another list, a vector, or any other data structure. To create a list, the list function is used.

How it works...

A list, as said, can contain anything; we start with a simple example that stores some elements in a list using the list function. In the next step, we create a list with vectors as its elements. So, y is a list whose first element is a vector of 1, 2, and 3 and whose second element is a vector of A, B, and C.
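
A minimal sketch of the two steps described above. The element values of x are assumptions made for illustration; y follows the description in the text:

```r
# A list can hold elements of different types.
x <- list(1, "A", TRUE)
length(x)     # 3

# A list whose elements are vectors.
y <- list(c(1, 2, 3), c("A", "B", "C"))

# Double brackets extract an element itself.
y[[1]]        # 1 2 3
y[[2]][2]     # "B"

# Single brackets return a sub-list instead.
class(y[1])   # "list"
```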

Array

An array is nothing but a multidimensional vector, and can store only data of the same type. The dimensions of the array are specified using dim.

How it works...

Creating an array is straightforward. Use the array function and provide the values together with the dimensions; for example, giving the number of rows and columns creates a two-dimensional array.
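
As a sketch of the idea, the example below builds a two-dimensional and a three-dimensional array with the array function and its dim argument. The values and the names a and b are arbitrary choices for illustration:

```r
# Two rows and three columns; values are filled column by column.
a <- array(1:6, dim = c(2, 3))
a
dim(a)       # 2 3
a[2, 3]      # 6

# Three dimensions: 2 rows, 3 columns, 4 layers.
b <- array(1:24, dim = c(2, 3, 4))
dim(b)       # 2 3 4
b[1, 1, 2]   # 7
```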

Matrix

A matrix is like a DataFrame, with the constraint that every element must be of the same type.
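
A small sketch of creating a matrix with the matrix function. The name m matches the matrix that the subsetting recipe assumes to exist, but its contents here are hypothetical:

```r
# Fill a 2 x 3 matrix column by column (the default).
m <- matrix(1:6, nrow = 2, ncol = 3)
m
m[1, 2]            # 3

# Fill the same values row by row instead.
m_by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_by_row[1, 2]     # 2

# Every element shares one type: mixing types coerces them all.
mixed <- matrix(c(1, "a", 2, "b"), nrow = 2)
class(mixed[1, 1]) # "character"
```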

DataFrame

A DataFrame can be seen as an Excel spreadsheet, with rows and columns, where every column can have a different data type. In R, each column of a DataFrame is a vector.
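
The following sketch creates a small DataFrame with the data.frame function. The column names No, Name, and Attendance follow the subsetting output shown later in this chapter; the Attendance values are hypothetical:

```r
# Each column is a vector; columns may have different types.
d <- data.frame(
  No = c(1, 2, 3),
  Name = c("A", "B", "C"),
  Attendance = c(TRUE, FALSE, TRUE)  # hypothetical values
)

# Inspect the structure and the class of every column.
str(d)
sapply(d, class)

dim(d)       # 3 3
```

Whether Name is stored as a factor or as character depends on the R version: before R 4.0.0, data.frame converted strings to factors by default.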

Basic commands for subsetting

R provides various methods for slicing data or extracting a subset of it.

How to do it...

Perform the following steps to see subsetting. It is assumed that the DataFrame d and matrix m exist from the previous exercise:

> d$No # Slice the column
Output:
[1] 1 2 3
> d$Name # Slice the column
Output:
[1] A B C
> d$Name[1]
Output:
[1] A
> d[2,] # get Row
Output:
No Name Attendance