Advanced Machine Learning with R

Cory Lesmeister

Description

Master machine learning techniques with real-world projects that interface R with TensorFlow, H2O, MXNet, and other frameworks


Key Features:


Gain expertise in machine learning, deep learning and other techniques
Build intelligent end-to-end projects for finance, social media, and a variety of domains
Implement multi-class classification, regression, and clustering


Book Description:


R is one of the most popular languages when it comes to exploring the mathematical side of machine learning and easily performing computational statistics.


This Learning Path shows you how to leverage the R ecosystem to build efficient machine learning applications that carry out intelligent tasks within your organization. You'll tackle realistic projects such as building powerful machine learning models with ensembles to predict employee attrition. You'll explore different clustering techniques to segment customers using wholesale data and use TensorFlow and Keras-R for performing advanced computations. You’ll also be introduced to reinforcement learning along with its various use cases and models. Additionally, it shows you how some of these black-box models can be diagnosed and understood.


By the end of this Learning Path, you’ll be equipped with the skills you need to deploy machine learning techniques in your own projects.


This Learning Path includes content from the following Packt products:


R Machine Learning Projects by Dr. Sunil Kumar Chinnamgari
Mastering Machine Learning with R - Third Edition by Cory Lesmeister


What you will learn:


Develop a joke recommendation engine to recommend jokes that match users' tastes
Build autoencoders for credit card fraud detection
Work with image recognition and convolutional neural networks
Make predictions for casino slot machines using reinforcement learning
Implement NLP techniques for sentiment analysis and customer segmentation
Produce simple and effective data visualizations for improved insights
Use NLP to extract insights from text
Implement tree-based classifiers, including random forest and boosted trees


Who this book is for:


If you are a data analyst, data scientist, or machine learning developer, this is an ideal Learning Path for you. Each project will help you test your skills in implementing machine learning algorithms and techniques. A basic understanding of machine learning and working knowledge of R programming is necessary to get the most out of this Learning Path.






Advanced Machine Learning with R

Tackle data analytics and machine learning challenges and build complex applications with R 3.5

Cory Lesmeister
Dr. Sunil Kumar Chinnamgari

BIRMINGHAM - MUMBAI

Advanced Machine Learning with R

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors nor Packt Publishing or its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: May 2019

Production reference: 1160519

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-83864-177-1

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Contributors

About the authors

Cory Lesmeister has over fourteen years of quantitative experience and is currently a senior data scientist for the Advanced Analytics team at Cummins, Inc. in Columbus, Indiana. Cory spent 16 years at Eli Lilly and Company in sales, market research, Lean Six Sigma, marketing analytics, and new product forecasting. He also has several years of experience in the insurance and banking industries, both as a consultant and as a manager of marketing analytics. A former US Army active duty and reserve officer, Cory was stationed in Baghdad, Iraq, in 2009 serving as the strategic advisor to the 29,000-person Iraqi Oil Police, succeeding where others failed by acquiring and delivering promised equipment to help the country secure and protect its oil infrastructure. Cory has a BBA in Aviation Administration from the University of North Dakota and a commercial helicopter license.

Dr. Sunil Kumar Chinnamgari has a Ph.D. in computer science (specializing in machine learning and natural language processing). He is an AI researcher with more than 14 years of industry experience. Currently, he works in the capacity of a lead data scientist with a US financial giant. He has published several research papers in Scopus and IEEE journals and is a frequent speaker at various meet-ups. He is an avid coder and has won multiple hackathons. In his spare time, Sunil likes to teach, travel, and spend time with family.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Advanced Machine Learning with R

About Packt

Why subscribe?

Packt.com

Contributors

About the authors

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Reviews

Preparing and Understanding Data

Overview

Reading the data

Handling duplicate observations

Descriptive statistics

Exploring categorical variables

Handling missing values

Zero and near-zero variance features

Treating the data

Correlation and linearity

Summary

Linear Regression

Univariate linear regression

Building a univariate model

Reviewing model assumptions

Multivariate linear regression

Loading and preparing the data

Modeling and evaluation – stepwise regression

Modeling and evaluation – MARS

Reverse transformation of natural log predictions

Summary

Logistic Regression

Classification methods and linear regression

Logistic regression

Model training and evaluation

Training a logistic regression algorithm

Weight of evidence and information value

Feature selection

Cross-validation and logistic regression

Multivariate adaptive regression splines

Model comparison

Summary

Advanced Feature Selection in Linear Models

Regularization overview

Ridge regression

LASSO

Elastic net

Data creation

Modeling and evaluation

Ridge regression

LASSO

Elastic net

Summary

K-Nearest Neighbors and Support Vector Machines

K-nearest neighbors

Support vector machines

Manipulating data

Dataset creation

Data preparation

Modeling and evaluation

KNN modeling

Support vector machine

Summary

Tree-Based Classification

An overview of the techniques

Understanding a regression tree

Classification trees

Random forest

Gradient boosting

Datasets and modeling

Classification tree

Random forest

Extreme gradient boosting – classification

Feature selection with random forests

Summary

Neural Networks and Deep Learning

Introduction to neural networks

Deep learning – a not-so-deep overview

Deep learning resources and advanced methods

Creating a simple neural network

Data understanding and preparation

Modeling and evaluation

An example of deep learning

Keras and TensorFlow background

Loading the data

Creating the model function

Model training

Summary

Creating Ensembles and Multiclass Methods

Ensembles

Data understanding

Modeling and evaluation

Random forest model

Creating an ensemble

Summary

Cluster Analysis

Hierarchical clustering

Distance calculations

K-means clustering

Gower and PAM

Gower

PAM

Random forest

Dataset background

Data understanding and preparation

Modeling 

Hierarchical clustering

K-means clustering

Gower and PAM

Random forest and PAM

Summary

Principal Component Analysis

An overview of the principal components

Rotation

Data

Data loading and review

Training and testing datasets

PCA modeling

Component extraction

Orthogonal rotation and interpretation

Creating scores from the components

Regression with MARS

Test data evaluation

Summary

Association Analysis

An overview of association analysis

Creating transactional data

Data understanding

Data preparation

Modeling and evaluation

Summary

Time Series and Causality

Univariate time series analysis

Understanding Granger causality

Time series data

Data exploration

Modeling and evaluation

Univariate time series forecasting

Examining the causality

Linear regression

Vector autoregression

Summary

Text Mining

Text mining framework and methods

Topic models

Other quantitative analysis

Data overview

Data frame creation

Word frequency

Word frequency in all addresses

Lincoln's word frequency

Sentiment analysis

N-grams

Topic models

Classifying text

Data preparation

LASSO model

Additional quantitative analysis

Summary

Exploring the Machine Learning Landscape

ML versus software engineering

Types of ML methods

Supervised learning

Unsupervised learning

Semi-supervised learning

Reinforcement learning

Transfer learning

ML terminology – a quick review

Deep learning

Big data

Natural language processing

Computer vision

Cost function

Model accuracy

Confusion matrix

Predictor variables

Response variable

Dimensionality reduction

Class imbalance problem

Model bias and variance

Underfitting and overfitting

Data preprocessing

Holdout sample

Hyperparameter tuning

Performance metrics

Feature engineering

Model interpretability

ML project pipeline

Business understanding

Understanding and sourcing the data

Preparing the data 

Model building and evaluation

Model deployment

Learning paradigm

Datasets

Summary

Predicting Employee Attrition Using Ensemble Models

Philosophy behind ensembling 

Getting started

Understanding the attrition problem and the dataset 

K-nearest neighbors model for benchmarking the performance

Bagging

Bagged classification and regression trees (treeBag) implementation

Support vector machine bagging (SVMBag) implementation

Naive Bayes (nbBag) bagging implementation

Randomization with random forests

Implementing an attrition prediction model with random forests

Boosting 

The GBM implementation

Building attrition prediction model with XGBoost

Stacking 

Building attrition prediction model with stacking

Summary

Implementing a Jokes Recommendation Engine

Fundamental aspects of recommendation engines

Recommendation engine categories

Content-based filtering

Collaborative filtering

Hybrid filtering

Getting started

Understanding the Jokes recommendation problem and the dataset

Converting the DataFrame

Dividing the DataFrame

Building a recommendation system with an item-based collaborative filtering technique

Building a recommendation system with a user-based collaborative filtering technique

Building a recommendation system based on an association-rule mining technique

The Apriori algorithm

Content-based recommendation engine

Differentiating between ITCF and content-based recommendations

Building a hybrid recommendation system for Jokes recommendations

Summary

References

Sentiment Analysis of Amazon Reviews with NLP

The sentiment analysis problem

Getting started

Understanding the Amazon reviews dataset

Building a text sentiment classifier with the BoW approach

Pros and cons of the BoW approach

Understanding word embedding

Building a text sentiment classifier with pretrained word2vec word embedding based on Reuters news corpus

Building a text sentiment classifier with GloVe word embedding

Building a text sentiment classifier with fastText

Summary

Customer Segmentation Using Wholesale Data

Understanding customer segmentation

Understanding the wholesale customer dataset and the segmentation problem

Categories of clustering algorithms

Identifying the customer segments in wholesale customer data using k-means clustering

Working mechanics of the k-means algorithm

Identifying the customer segments in the wholesale customer data using DIANA

Identifying the customer segments in the wholesale customers data using AGNES

Summary

Image Recognition Using Deep Neural Networks

Technical requirements

Understanding computer vision

Achieving computer vision with deep learning

Convolutional Neural Networks

Layers of CNNs

Introduction to the MXNet framework

Understanding the MNIST dataset

Implementing a deep learning network for handwritten digit recognition

Implementing dropout to avoid overfitting

Implementing the LeNet architecture with the MXNet library

Implementing computer vision with pretrained models

Summary

Credit Card Fraud Detection Using Autoencoders

Machine learning in credit card fraud detection

Autoencoders explained

Types of AEs based on hidden layers

Types of AEs based on restrictions

Applications of AEs

The credit card fraud dataset

Building AEs with the H2O library in R

Autoencoder code implementation for credit card fraud detection

Summary

Automatic Prose Generation with Recurrent Neural Networks

Understanding language models

Exploring recurrent neural networks

Comparison of feedforward neural networks and RNNs

Backpropagation through time

Problems and solutions to gradients in RNN

Exploding gradients

Vanishing gradients

Building an automated prose generator with an RNN

Implementing the project

Summary

Winning the Casino Slot Machines with Reinforcement Learning

Understanding RL

Comparison of RL with other ML algorithms

Terminology of RL

The multi-arm bandit problem

Strategies for solving MABP

The epsilon-greedy algorithm

Boltzmann or softmax exploration

Decayed epsilon greedy

The upper confidence bound algorithm

Thompson sampling

Multi-arm bandit – real-world use cases

Solving the MABP with UCB and Thompson sampling algorithms

Summary

Creating a Package

Creating a new package

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

R is one of the most popular languages when it comes to exploring the mathematical side of machine learning and easily performing computational statistics.

This Learning Path shows you how to leverage the R ecosystem to build efficient machine learning applications that carry out intelligent tasks within your organization. You'll tackle realistic projects such as building powerful machine learning models with ensembles to predict employee attrition. You'll explore different clustering techniques to segment customers using wholesale data and use TensorFlow and Keras-R for performing advanced computations. Each chapter will help you implement advanced machine learning algorithms using real-world examples. You’ll also be introduced to reinforcement learning along with its various use cases and models. Additionally, this book provides you with a glimpse into how some of these black-box models can be diagnosed and understood.

By the end of this Learning Path, you’ll be equipped with the skills you need to deploy machine learning techniques in your own projects.

Who this book is for

If you’re a data analyst, data scientist, or machine learning developer who wants to master machine learning techniques using R, this is an ideal Learning Path for you. Each project will help you test your skills in implementing machine learning algorithms and techniques. A basic understanding of machine learning and working knowledge of R programming is necessary to get the most out of this Learning Path.

What this book covers

Chapter 1, Preparing and Understanding Data, covers the loading of data and demonstrates how to obtain an understanding of its structure and dimensions, as well as how to install the necessary packages.

Chapter 2, Linear Regression, provides you with a solid foundation before learning advanced methods such as Support Vector Machines and Gradient Boosting. No foundation is more solid than least squares linear regression.

Chapter 3, Logistic Regression, presents a discussion on how logistic regression and discriminant analysis are used in order to predict a categorical outcome. Multivariate adaptive regression splines have been added. This technique performs well, handles non-linearity, and is easy to explain.

Chapter 4, Advanced Feature Selection in Linear Models, shows regularization techniques that help improve predictive ability and interpretability, as feature selection is a critical and often extremely challenging component of machine learning. It includes these techniques not only for regression but also for a classification problem.

Chapter 5, K-Nearest Neighbors and Support Vector Machines, begins the exploration of the more advanced and nonlinear techniques. The real power of machine learning will be unveiled.

Chapter 6, Tree-Based Classification, offers some of the most powerful predictive abilities of all the machine learning techniques, especially for classification problems. Single decision trees will be discussed along with the more advanced random forests and boosted trees. It also contains very popular techniques provided by the xgboost package.

Chapter 7, Neural Networks and Deep Learning, shows some of the most exciting machine learning methods currently used. Inspired by how the brain works, neural networks and their more recent and advanced offshoot, Deep Learning, will be put to the test. It also includes code for the H2O package, including hyperparameter search.

Chapter 8, Creating Ensembles and Multiclass Methods, has completely new content, using several great packages.

Chapter 9, Cluster Analysis,  covers unsupervised learning. Instead of trying to make a prediction, the goal will focus on uncovering the latent structure of observations. Three clustering methods will be discussed: hierarchical, k-means, and partitioning around medoids. It also includes the methodology for executing unsupervised learning with random forests.

Chapter 10, Principal Component Analysis, continues the examination of unsupervised learning with principal components analysis, which is used to uncover the latent structure of the features. Once this is done, the new features will be used in a supervised learning exercise.

Chapter 11, Association Analysis, explains association analysis, which applies not only to making recommendations, product placement, and promotional pricing, but can also be used in manufacturing, web usage, and healthcare.

Chapter 12, Time Series and Causality,  discusses univariate forecast models, bivariate regression, and Granger causality models, including an analysis of carbon emissions and climate change, along with a demonstration of different causality test methods.

Chapter 13, Text Mining, demonstrates a framework for quantitative text mining and the building of topic models. Along with time series, the world of data contains vast volumes of data in a textual format. With so much data as text, it is critically important to understand how to manipulate, code, and analyze the data in order to provide meaningful insights.

Chapter 14, Exploring the Machine Learning Landscape, will briefly review the various ML concepts that a practitioner must know. In this chapter, we will cover topics such as supervised learning, reinforcement learning, unsupervised learning, and real-world ML use cases.

Chapter 15, Predicting Employee Attrition Using Ensemble Models, covers the creation of powerful ML models through ensemble learning. We will introduce the problem at hand and then attempt to explore the dataset with exploratory data analysis (EDA). Then, in the preprocessing phase, we will create new features using prior domain experience. Once the dataset is fully prepared, models will be created using multiple ensemble techniques, such as bagging, boosting, stacking, and randomization. Lastly, we will deploy the final selected model to production.

Chapter 16, Implementing a Joke Recommendation Engine, introduces recommendation engines. We start by understanding the concepts and types of collaborative filtering algorithms. We will then build a recommendation engine to provide personalized joke recommendations using collaborative filtering approaches such as user-based collaborative filters and item-based collaborative filters.  Apart from this, we will be exploring various libraries available in R that can be used to build recommendation systems.

Chapter 17, Sentiment Analysis of Amazon Reviews with NLP, covers sentiment analysis, which entails finding the sentiment of a sentence and labeling it as positive, negative, or neutral, and covers the various techniques that can be used to analyze text. We will understand text-mining concepts and the various ways that text is labeled based on the tone. Apart from using various popular R text-mining libraries to preprocess the reviews to be classified, we will also be leveraging a wide range of text representations, such as bag of words, word2vec, fastText, and GloVe.

Chapter 18, Customer Segmentation Using Wholesale Data, covers the segmentation, grouping, or clustering of customers, which can be achieved through unsupervised learning. In this chapter, we learn the various techniques of customer segmentation. We will be applying advanced clustering techniques, such as k-means, DIANA, and AGNES. We will explore the ML techniques for dealing with such ambiguity and have ML find out the number of groups possible based on the underlying characteristics of the input data. Evaluating the output of the clustering algorithms is an area that is often challenging to practitioners.

Chapter 19, Image Recognition Using Deep Neural Networks, covers convolutional neural networks (CNNs). We explore why CNNs work so well with computer vision problems such as object detection. We will learn about all of these concepts by applying a CNN in the building of a multi-class classification model on a popular open dataset called MNIST. We will learn about the various preprocessing techniques that can be applied to the image data in order to use the data with deep learning models.  

Chapter 20, Credit Card Fraud Detection Using Autoencoders, covers autoencoders and how they are different from the other deep learning networks, such as recurrent neural networks (RNNs) and CNNs. We will learn about autoencoders by implementing a project that identifies credit card fraud. We will become familiar with dimensionality reduction and how it can be used to identify credit card fraud.

Chapter 21, Automatic Prose Generation with Recurrent Neural Networks, introduces deep neural networks (DNNs). We will implement a neural network from scratch and will learn how to apply an RNN by doing a project. We will create an application based on a long short-term memory (LSTM) network, a variant of RNNs, that generates text automatically. To accomplish this task, we make use of the MXNet framework, which extends its support for the R language to perform deep learning.

Chapter 22, Winning the Casino Slot Machines with Reinforcement Learning, begins with an explanation of RL. We discuss the various concepts of RL, including strategies for solving what is called the multi-arm bandit problem. We implement a project that uses UCB and Thompson sampling techniques in order to solve the multi-arm bandit problem.

Appendix, Creating a Package, covers creating a new package.

To get the most out of this book

Assuming the reader has a working knowledge of R and of basic statistics, this book will provide the skills and tools required to get the reader up and running with R and ML as quickly and painlessly as possible. There will probably always be detractors who complain that it does not offer enough math or does not do this, or that, or the other thing, but my answer to that is that these books already exist! Why duplicate what has already been done, and very well, for that matter? Again, I have sought to provide something different, something to hold the reader's attention and allow them to succeed in this competitive and rapidly changing field.

The projects covered in this book are intended to expose you to practical knowledge on the implementation of various ML techniques to real-world problems. A good working knowledge of R and a basic understanding of ML are a must prior to starting these projects.

It should also be noted that the code for the projects is implemented using R version 3.5.2 (2018-12-20), nicknamed Eggshell Igloo. The project code has been successfully tested on Linux Mint 18.3 Sylvia. There is no reason to believe that the code does not work on other platforms, such as Windows; however, this is not something that has been tested by the author.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packt.com.

2. Select the SUPPORT tab.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Advanced-Machine-Learning-with-R. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in the text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

html, body, #map { height: 100%; margin: 0; padding: 0}

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)

exten => s,102,Voicemail(b100)

exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows:

$ mkdir css

$ cd css

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Preparing and Understanding Data

"We've got to use every piece of data and piece of information, and hopefully that will help us be accurate with our player evaluation. For us, that's our lifeblood."
– Billy Beane, General Manager Oakland Athletics, subject of the book Moneyball

Research consistently shows that machine learning and data science practitioners spend most of their time manipulating data and preparing it for analysis. Indeed, many find it the most tedious and least enjoyable part of their work. Numerous companies are offering solutions to the problem but, in my opinion, results at this point are varied. Therefore, in this first chapter, I shall endeavor to provide a way of tackling the problem that will ease the burden of getting your data ready for machine learning. The methodology introduced in this chapter will serve as the foundation for data preparation and for understanding many of the subsequent chapters. I propose that once you become comfortable with this tried and true process, it may very well become your favorite part of machine learning—as it is for me.

The following are the topics that we'll cover in this chapter:

Overview 

Reading the data

Handling duplicate observations

Descriptive statistics

Exploring categorical variables

Handling missing values

Zero and near-zero variance features

Treating the data

Correlation and linearity

Overview

If you haven't been exposed to large, messy datasets, then be patient, for it's only a matter of time. If you've encountered such data, has it been in a domain where you have little subject matter expertise? If not, then once again I proffer that it's only a matter of time. Some of the common problems that make up the term messy data include the following:

Missing or invalid values

Novel levels in a categorical feature that show up in algorithm production

High cardinality in categorical features such as zip codes

High dimensionality

Duplicate observations

So this begs the question: what are we to do? Well, first we need to look at the critical tasks that need to be performed during this phase of the process. The following tasks serve as the foundation for building a learning algorithm. They're from the paper by SPSS, CRISP-DM 1.0, a step-by-step data-mining guide available at https://the-modeling-agency.com/crisp-dm.pdf:

Data understanding:

Collect

Describe

Explore

Verify

Data preparation:

Select

Clean

Construct

Integrate

Format

Certainly this is an excellent enumeration of the process, but what do we really need to do? I propose that, in practical terms we can all relate to, the following must be done once the data is joined and loaded into your machine, cloud, or whatever you use:

Understand the data structure

Dedupe observations

Eliminate zero variance features and low variance features as desired

Handle missing values

Create dummy features (one-hot encoding); a brief sketch follows this list

Examine and deal with highly correlated features and those with perfect linear relationships

Scale as necessary

Create other features as desired
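As a quick illustration of the dummy feature step in the preceding list, base R's model.matrix() can one-hot encode a categorical column. This is just a toy sketch with a made-up feature; the actual treatment of the Gettysburg data happens later in the chapter:

# a toy categorical feature, not from the Gettysburg data
df <- data.frame(type = c("infantry", "cavalry", "artillery"))
# one indicator column per level; dropping the intercept with -1 keeps all levels
model.matrix(~ type - 1, data = df)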

Many feel that this is a daunting task. I don't and, in fact, I quite enjoy it. If done correctly and with a judicious application of judgment, it should reduce the amount of time spent at this first stage of a project and facilitate training your learning algorithm. None of the previous steps are challenging, but it can take quite a bit of time to write the code to perform each task.

Well, that's the benefit of this chapter. The example to follow will walk you through the tasks and the R code that accomplishes them. The code is flexible enough that you should be able to apply it to your projects. Additionally, it will help you gain an understanding of the data to the point where you can intelligently discuss it with Subject Matter Experts (SMEs) if, in fact, they're available.

In the practical exercise that follows, we'll work with a small dataset. However, it suffers from all of the problems described earlier. Don't let the small size fool you, as we'll take what we learn here and use it for the more massive datasets to come in subsequent chapters.

As background, the data we'll use I put together painstakingly by hand. It's the Order of Battle for the opposing armies at the Battle of Gettysburg, fought during the American Civil War, July 1st-3rd, 1863, and the casualties reported by the end of the day on July 3rd. I purposely chose this data because I'm reasonably sure you know very little about it. Don't worry, I'm the SME on the battle here and will walk you through it every step of the way. The one thing that we won't cover in this chapter is dealing with large volumes of textual features, which we'll discuss later in this book. Enough said already; let's get started!

The source used in the creation of the dataset is The Gettysburg Campaign in Numbers and Losses: Synopses, Orders of Battle, Strengths, Casualties, and Maps, June 9-July 14, 1863, by J. David Petruzzi and Steven A. Stanley.

Reading the data

This first task will load the data and show how to get a high-level understanding of its structure and dimensions, as well as how to install the necessary packages.

You have two ways to access the data, which resides on GitHub. You can download gettysburg.csv directly from the site at this link: https://github.com/PacktPublishing/Advanced-Machine-Learning-with-R/blob/master/Data/gettysburg.csv, or you can use the RCurl package. An example of how to use the package is available here: https://github.com/opetchey/RREEBES/wiki/Reading-data-and-code-from-an-online-github-repository.

Let's assume you have the file in your working directory, so let's begin by installing the necessary packages:

install.packages("caret")install.packages("janitor")install.packages("readr")install.packages("sjmisc")install.packages("skimr")install.packages("tidyverse")install.packages("vtreat")

Let me make a quick note about how I've learned (the hard way) to correctly write code. With the packages installed, we could now specifically call the libraries into the R environment. However, it's a best practice, and necessary when putting code into production, that a function that isn't in base R be specified. First, this helps you and unfortunate others to read your code with an understanding of which library is mapped to a specific function. It also eliminates potential errors because different packages can give different functions the same name. The example that comes to my mind is the tsoutliers() function. The function is available in the forecast package and was in the tsoutliers package during earlier versions. Now I know this extra typing might seem unwieldy and unnecessary, but once you discipline yourself to do it, you'll find that it's well worth the effort.
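Here's a quick toy illustration of that namespace-qualified style; it assumes the forecast package is installed, which isn't part of this chapter's package list:

x <- ts(c(rnorm(100), 25))  # a short series with one obvious outlier appended
forecast::tsoutliers(x)     # explicit that we mean the forecast package's tsoutliers()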

There's one library we'll call and that's magrittr, which allows the use of a pipe-operator, %>%, to chain code together:

library(magrittr)
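The pipe simply feeds the result on its left into the function on its right as its first argument. A throwaway illustration on a built-in dataset, not the Gettysburg data:

mtcars %>% subset(cyl == 4) %>% nrow()  # count the four-cylinder cars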

We're now ready to load the .csv file. In doing so, let's utilize the read_csv() function from readr as it's faster than base R and creates a tibble dataframe. In most cases, using tibbles in a tidyverse style is easier to write and understand. If you want to learn all the benefits of tidyverse, check out their website: tidyverse.org.

The only thing we need to specify in the function is our filename:

gettysburg <- read_csv("~/gettysburg.csv")

Here's a look at the column (feature) names:

colnames(gettysburg)

[1] "type" "state" "regiment_or_battery" "brigade"

[5] "division" "corps" "army" "july1_Commander"

[9] "Cdr_casualty" "men" "killed" "wounded"

[13] "captured" "missing" "total_casualties" "3inch_rifles"

[17] "4.5inch_rifles" "10lb_parrots" "12lb_howitzers" "12lb_napoleons"

[21] "6lb_howitzers" "24lb_howitzers" "20lb_parrots" "12lb_whitworths"

[25] "14lb_rifles" "total_guns"

We have 26 features in this data, and some of you are asking yourselves things like, what the heck is a 20 pound parrot? If you put it in a search engine, you'll probably end up with the bird and not the 20-pound Parrott rifled artillery gun. You can see the dimensions of the data in RStudio in your Global Environment view, or you can dig on your own to see that there are 590 observations:

dim(gettysburg)

[1] 590 26

In RStudio, you can click on the tibble name in the Global Environment or run the View(tibblename) code and it'll open a spreadsheet of all of the data.

So we have 590 observations of 26 features, but this data suffers from the issues that permeate large and complex data. Next, we'll explore if there're any duplicate observations and how to deal with them efficiently.
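A minimal sketch of deduplication with the janitor and dplyr packages installed earlier might look like the following; the chapter's exact approach may differ:

dupes <- janitor::get_dupes(gettysburg)    # rows that appear more than once, with a dupe_count column
gettysburg <- dplyr::distinct(gettysburg)  # keep a single copy of each duplicated row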

Handling missing values

Dealing with missing values can be a little tricky as there's a number of ways to approach the task. We've already seen in the section on descriptive statistics that there're missing values. First of all, let's get a full accounting of the missing quantity by feature, then we shall discuss how to deal with them. What I'm going to demonstrate in the following is how to put the count by feature into a dataframe that we can explore within RStudio:

na_count <- sapply(gettysburg, function(y) sum(length(which(is.na(y)))))
na_df <- data.frame(na_count)
View(na_df)

The following is a screenshot produced by the preceding code, after sorting the dataframe by descending count:

You can clearly see the count of missing values by feature; the feature with the most missing values is, ironically, named missing, with a total of 17 observations.

So what should we do here or, more appropriately, what can we do here? There're several choices:

Do nothing: However, some R functions will omit NAs and some functions will fail and produce an error.

Omit all observations with NAs: In massive datasets, this may make sense, but we run the risk of losing information.

Impute values: This could be something as simple as substituting the median value for the missing one (a brief sketch follows this list) or creating an algorithm to impute the values.

Dummy coding: Turn the missing value into something such as 0 or -999, and code a dummy feature where, if the feature for a specific observation is missing, the dummy is coded 1; otherwise, it's coded 0.
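To illustrate the imputation option, a minimal median-imputation sketch on a single numeric feature might look like this; the men column is used purely as an example:

# replace any missing values in men with that feature's median
gettysburg$men[is.na(gettysburg$men)] <- median(gettysburg$men, na.rm = TRUE)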

I could devote an entire chapter, indeed a whole book on the subject, delving into missing at random and others, but I was trained—and, in fact, shall insist—on the latter method. It's never failed me and the others can be a bit problematic. The benefit of dummy coding—or indicator coding, if you prefer—is that you don't lose information. In fact, missing-ness might be an essential feature in and of itself.

For a full discussion on the handling of missing values, you can reference the following articles: http://www.stat.columbia.edu/~gelman/arm/missing.pdf and https://pdfs.semanticscholar.org/4172/f558219b94f850c6567f93fa60dee7e65139.pdf.

So, here's an example of how I manually code a dummy feature and turn the NAs into zeroes:

gettysburg$missing_isNA <- ifelse(is.na(gettysburg$missing), 1, 0)
gettysburg$missing[is.na(gettysburg$missing)] <- 0

The first line of code creates a dummy feature for the missing feature, and the second changes any NAs in missing to zero. In the upcoming section, where the dataset is fully processed (treated), the other missing values will be imputed.
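As a rough preview of that treatment step, a vtreat design-and-prepare pass can look something like the following; the arguments shown here are illustrative rather than the chapter's exact call:

# design a treatment plan (a "Z" plan requires no outcome variable)
treatment_plan <- vtreat::designTreatmentsZ(gettysburg, colnames(gettysburg))
# apply the plan: numeric NAs are imputed and missing-ness indicator columns are added
gettysburg_treated <- vtreat::prepare(treatment_plan, gettysburg)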

Summary

This chapter looked at the common problems in the large, messy datasets typical of machine learning projects. These include, but are not limited to, the following:

Missing or invalid values

Novel levels in a categorical feature that show up in algorithm production

High cardinality in categorical features such as zip code

High dimensionality

Duplicate observations

This chapter provided a disciplined approach to dealing with these problems by showing how to explore the data, treat it, and create a dataframe that you can use for developing your learning algorithm. It's also flexible enough that you can modify the code to suit your circumstances. This methodology should make what many feel is the most arduous, time-consuming, and least enjoyable part of the job an easy task.

With this task behind us, we can now get started on our first modeling task using linear regression in the following chapter.

Linear Regression

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem."
– John Tukey

It's essential that we get started with a simple yet extremely effective technique that's been used for a long time: linear regression. Albert Einstein is believed to have remarked at one time or another that things should be made as simple as possible, but no simpler. This is sage advice and a good rule of thumb in the development of algorithms for machine learning. Considering the other techniques that we'll discuss later, there's no simpler model than tried and tested linear regression, which uses the least squares approach to predict a quantitative outcome. We can consider it to be the foundation of all the methods that we'll discuss later, many of which are mere extensions. If you can master the linear regression method, well then quite frankly I believe you can master the rest of this book. Therefore, let's consider this as a good starting point for our journey towards becoming a machine learning guru.

This chapter covers introductory material and an expert in this subject can skip ahead to the next topic. Otherwise, ensure that you thoroughly understand this topic before venturing to other, more complex learning methods. I believe you'll discover that many of your projects can be addressed by just applying what's discussed in the following sections. Linear regression is probably the most straightforward model to explain to your customers, most of whom will have at least a cursory understanding of R-squared. Many of them will have been exposed to it at great depth and hence will be comfortable with variable contribution, collinearity, and the like.

The following are the topics that we'll be covering in this chapter:

Univariate linear regression

Multivariate linear regression

Univariate linear regression

We begin by looking at a simple way to predict a quantitative response, Y, with one predictor variable, x, assuming that Y has a linear relationship with x. The model for this can be written as follows:

$Y = \beta_0 + \beta_1 x + e$

We can state it as follows: the expected value of Y is a function of the parameters $\beta_0$ (the intercept) plus $\beta_1$ (the slope) times x, plus an error term e. The least squares approach chooses the model parameters that minimize the Residual Sum of Squares (RSS) of the predicted y values versus the actual Y values. For a simple example, let's say we have the actual values of Y1 and Y2 equal to 10 and 20 respectively, along with the predictions of y1 and y2 as 12 and 18. To calculate RSS, we add the squared differences:

$RSS = (Y_1 - y_1)^2 + (Y_2 - y_2)^2$

This, with simple substitution, yields the following:

$RSS = (10 - 12)^2 + (20 - 18)^2 = 4 + 4 = 8$

Before we begin with an application, I want to point out that if you read the headlines of various research breakthroughs, you should do so with a jaded eye and a skeptical mind as the conclusion put forth by the media may not be valid. As we shall see, R—and any other software, for that matter—will give us a solution regardless of the input. However, just because the math makes sense and a high correlation or R-squared statistic is reported doesn't mean that the conclusion is valid.

To drive this point home, let's have a look at the famous Anscombe dataset, which is available in R. The statistician Francis Anscombe produced this set to highlight the importance of data visualization and outliers when analyzing data. It consists of four pairs of X and Y variables that have the same statistical properties but when plotted show something very different. I've used the data to train colleagues and to educate business partners on the hazards of fixating on statistics without exploring the data and checking assumptions. I think this is an excellent place to start should you have a similar need. It's a brief digression before moving on to serious modeling:

> #call up and explore the data
> data(anscombe)
> attach(anscombe)
> anscombe
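Each of the four x-y pairs reports nearly the same correlation, roughly 0.816, despite looking completely different when plotted; here's a quick check:

> # columns 1 to 4 are x1-x4 and columns 5 to 8 are y1-y4
> sapply(1:4, function(i) cor(anscombe[, i], anscombe[, i + 4]))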