Master machine learning techniques with real-world projects that interface R with TensorFlow, H2O, MXNet, and other libraries
Key Features:
Gain expertise in machine learning, deep learning, and other techniques
Build intelligent end-to-end projects for finance, social media, and a variety of domains
Implement multi-class classification, regression, and clustering
Book Description:
R is one of the most popular languages when it comes to exploring the mathematical side of machine learning and easily performing computational statistics.
This Learning Path shows you how to leverage the R ecosystem to build efficient machine learning applications that carry out intelligent tasks within your organization. You'll tackle realistic projects such as building powerful machine learning models with ensembles to predict employee attrition. You'll explore different clustering techniques to segment customers using wholesale data and use TensorFlow and Keras-R for performing advanced computations. You’ll also be introduced to reinforcement learning along with its various use cases and models. Additionally, it shows you how some of these black-box models can be diagnosed and understood.
By the end of this Learning Path, you’ll be equipped with the skills you need to deploy machine learning techniques in your own projects.
This Learning Path includes content from the following Packt products:
R Machine Learning Projects by Dr. Sunil Kumar Chinnamgari
Mastering Machine Learning with R - Third Edition by Cory Lesmeister
What you will learn:
Develop a joke recommendation engine to recommend jokes that match users’ tastes
Build autoencoders for credit card fraud detection
Work with image recognition and convolutional neural networks
Make predictions for casino slot machines using reinforcement learning
Implement NLP techniques for sentiment analysis and customer segmentation
Produce simple and effective data visualizations for improved insights
Use NLP to extract insights from text
Implement tree-based classifiers, including random forests and boosted trees
Who this book is for:
If you are a data analyst, data scientist, or machine learning developer, this is an ideal Learning Path for you. Each project will help you test your skills in implementing machine learning algorithms and techniques. A basic understanding of machine learning and working knowledge of R programming is necessary to get the most out of this Learning Path.
Page count: 755
Year of publication: 2019
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors nor Packt Publishing or its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: May 2019
Production reference: 1160519
Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-83864-177-1
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Cory Lesmeister has over fourteen years of quantitative experience and is currently a senior data scientist for the Advanced Analytics team at Cummins, Inc. in Columbus, Indiana. Cory spent 16 years at Eli Lilly and Company in sales, market research, Lean Six Sigma, marketing analytics, and new product forecasting. He also has several years of experience in the insurance and banking industries, both as a consultant and as a manager of marketing analytics. A former US Army active duty and reserve officer, Cory was stationed in Baghdad, Iraq, in 2009 serving as the strategic advisor to the 29,000-person Iraqi Oil Police, succeeding where others failed by acquiring and delivering promised equipment to help the country secure and protect its oil infrastructure. Cory has a BBA in Aviation Administration from the University of North Dakota and a commercial helicopter license.
Dr. Sunil Kumar Chinnamgari has a Ph.D. in computer science (specializing in machine learning and natural language processing). He is an AI researcher with more than 14 years of industry experience. Currently, he works in the capacity of a lead data scientist with a US financial giant. He has published several research papers in Scopus and IEEE journals and is a frequent speaker at various meet-ups. He is an avid coder and has won multiple hackathons. In his spare time, Sunil likes to teach, travel, and spend time with family.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Advanced Machine Learning with R
About Packt
Why subscribe?
Packt.com
Contributors
About the authors
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Reviews
Preparing and Understanding Data
Overview
Reading the data
Handling duplicate observations
Descriptive statistics
Exploring categorical variables
Handling missing values
Zero and near-zero variance features
Treating the data
Correlation and linearity
Summary
Linear Regression
Univariate linear regression
Building a univariate model
Reviewing model assumptions
Multivariate linear regression
Loading and preparing the data
Modeling and evaluation – stepwise regression
Modeling and evaluation – MARS
Reverse transformation of natural log predictions
Summary
Logistic Regression
Classification methods and linear regression
Logistic regression
Model training and evaluation
Training a logistic regression algorithm
Weight of evidence and information value
Feature selection
Cross-validation and logistic regression
Multivariate adaptive regression splines
Model comparison
Summary
Advanced Feature Selection in Linear Models
Regularization overview
Ridge regression
LASSO
Elastic net
Data creation
Modeling and evaluation
Ridge regression
LASSO
Elastic net
Summary
K-Nearest Neighbors and Support Vector Machines
K-nearest neighbors
Support vector machines
Manipulating data
Dataset creation
Data preparation
Modeling and evaluation
KNN modeling
Support vector machine
Summary
Tree-Based Classification
An overview of the techniques
Understanding a regression tree
Classification trees
Random forest
Gradient boosting
Datasets and modeling
Classification tree
Random forest
Extreme gradient boosting – classification
Feature selection with random forests
Summary
Neural Networks and Deep Learning
Introduction to neural networks
Deep learning – a not-so-deep overview
Deep learning resources and advanced methods
Creating a simple neural network
Data understanding and preparation
Modeling and evaluation
An example of deep learning
Keras and TensorFlow background
Loading the data
Creating the model function
Model training
Summary
Creating Ensembles and Multiclass Methods
Ensembles
Data understanding
Modeling and evaluation
Random forest model
Creating an ensemble
Summary
Cluster Analysis
Hierarchical clustering
Distance calculations
K-means clustering
Gower and PAM
Gower
PAM
Random forest
Dataset background
Data understanding and preparation
Modeling 
Hierarchical clustering
K-means clustering
Gower and PAM
Random forest and PAM
Summary
Principal Component Analysis
An overview of the principal components
Rotation
Data
Data loading and review
Training and testing datasets
PCA modeling
Component extraction
Orthogonal rotation and interpretation
Creating scores from the components
Regression with MARS
Test data evaluation
Summary
Association Analysis
An overview of association analysis
Creating transactional data
Data understanding
Data preparation
Modeling and evaluation
Summary
Time Series and Causality
Univariate time series analysis
Understanding Granger causality
Time series data
Data exploration
Modeling and evaluation
Univariate time series forecasting
Examining the causality
Linear regression
Vector autoregression
Summary
Text Mining
Text mining framework and methods
Topic models
Other quantitative analysis
Data overview
Data frame creation
Word frequency
Word frequency in all addresses
Lincoln's word frequency
Sentiment analysis
N-grams
Topic models
Classifying text
Data preparation
LASSO model
Additional quantitative analysis
Summary
Exploring the Machine Learning Landscape
ML versus software engineering
Types of ML methods
Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
Transfer learning
ML terminology – a quick review
Deep learning
Big data
Natural language processing
Computer vision
Cost function
Model accuracy
Confusion matrix
Predictor variables
Response variable
Dimensionality reduction
Class imbalance problem
Model bias and variance
Underfitting and overfitting
Data preprocessing
Holdout sample
Hyperparameter tuning
Performance metrics
Feature engineering
Model interpretability
ML project pipeline
Business understanding
Understanding and sourcing the data
Preparing the data 
Model building and evaluation
Model deployment
Learning paradigm
Datasets
Summary
Predicting Employee Attrition Using Ensemble Models
Philosophy behind ensembling 
Getting started
Understanding the attrition problem and the dataset 
K-nearest neighbors model for benchmarking the performance
Bagging
Bagged classification and regression trees (treeBag) implementation
Support vector machine bagging (SVMBag) implementation
Naive Bayes (nbBag) bagging implementation
Randomization with random forests
Implementing an attrition prediction model with random forests
Boosting 
The GBM implementation
Building attrition prediction model with XGBoost
Stacking 
Building attrition prediction model with stacking
Summary
Implementing a Jokes Recommendation Engine
Fundamental aspects of recommendation engines
Recommendation engine categories
Content-based filtering
Collaborative filtering
Hybrid filtering
Getting started
Understanding the Jokes recommendation problem and the dataset
Converting the DataFrame
Dividing the DataFrame
Building a recommendation system with an item-based collaborative filtering technique
Building a recommendation system with a user-based collaborative filtering technique
Building a recommendation system based on an association-rule mining technique
The Apriori algorithm
Content-based recommendation engine
Differentiating between ITCF and content-based recommendations
Building a hybrid recommendation system for Jokes recommendations
Summary
References
Sentiment Analysis of Amazon Reviews with NLP
The sentiment analysis problem
Getting started
Understanding the Amazon reviews dataset
Building a text sentiment classifier with the BoW approach
Pros and cons of the BoW approach
Understanding word embedding
Building a text sentiment classifier with pretrained word2vec word embedding based on Reuters news corpus
Building a text sentiment classifier with GloVe word embedding
Building a text sentiment classifier with fastText
Summary
Customer Segmentation Using Wholesale Data
Understanding customer segmentation
Understanding the wholesale customer dataset and the segmentation problem
Categories of clustering algorithms
Identifying the customer segments in wholesale customer data using k-means clustering
Working mechanics of the k-means algorithm
Identifying the customer segments in the wholesale customer data using DIANA
Identifying the customer segments in the wholesale customers data using AGNES
Summary
Image Recognition Using Deep Neural Networks
Technical requirements
Understanding computer vision
Achieving computer vision with deep learning
Convolutional Neural Networks
Layers of CNNs
Introduction to the MXNet framework
Understanding the MNIST dataset
Implementing a deep learning network for handwritten digit recognition
Implementing dropout to avoid overfitting
Implementing the LeNet architecture with the MXNet library
Implementing computer vision with pretrained models
Summary
Credit Card Fraud Detection Using Autoencoders
Machine learning in credit card fraud detection
Autoencoders explained
Types of AEs based on hidden layers
Types of AEs based on restrictions
Applications of AEs
The credit card fraud dataset
Building AEs with the H2O library in R
Autoencoder code implementation for credit card fraud detection
Summary
Automatic Prose Generation with Recurrent Neural Networks
Understanding language models
Exploring recurrent neural networks
Comparison of feedforward neural networks and RNNs
Backpropagation through time
Problems and solutions to gradients in RNN
Exploding gradients
Vanishing gradients
Building an automated prose generator with an RNN
Implementing the project
Summary
Winning the Casino Slot Machines with Reinforcement Learning
Understanding RL
Comparison of RL with other ML algorithms
Terminology of RL
The multi-arm bandit problem
Strategies for solving MABP
The epsilon-greedy algorithm
Boltzmann or softmax exploration
Decayed epsilon greedy
The upper confidence bound algorithm
Thompson sampling
Multi-arm bandit – real-world use cases
Solving the MABP with UCB and Thompson sampling algorithms
Summary
Creating a Package
Creating a new package
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
R is one of the most popular languages when it comes to exploring the mathematical side of machine learning and easily performing computational statistics.
This Learning Path shows you how to leverage the R ecosystem to build efficient machine learning applications that carry out intelligent tasks within your organization. You'll tackle realistic projects such as building powerful machine learning models with ensembles to predict employee attrition. You'll explore different clustering techniques to segment customers using wholesale data and use TensorFlow and Keras-R for performing advanced computations. Each chapter will help you implement advanced machine learning algorithms using real-world examples. You’ll also be introduced to reinforcement learning along with its various use cases and models. Additionally, this book provides you with a glimpse into how some of these black-box models can be diagnosed and understood.
By the end of this Learning Path, you’ll be equipped with the skills you need to deploy machine learning techniques in your own projects.
If you’re a data analyst, data scientist, or machine learning developer who wants to master machine learning techniques using R, this is an ideal Learning Path for you. Each project will help you test your skills in implementing machine learning algorithms and techniques. A basic understanding of machine learning and working knowledge of R programming is necessary to get the most out of this Learning Path.
Chapter 1, Preparing and Understanding Data, covers the loading of data and demonstrates how to obtain an understanding of its structure and dimensions, as well as how to install the necessary packages.
Chapter 2, Linear Regression, provides you with a solid foundation before you learn advanced methods such as support vector machines and gradient boosting. No foundation is more solid than least squares linear regression.
Chapter 3, Logistic Regression, presents a discussion on how logistic regression and discriminant analysis are used to predict a categorical outcome. Multivariate adaptive regression splines have been added; this technique performs well, handles non-linearity, and is easy to explain.
Chapter 4, Advanced Feature Selection in Linear Models, shows regularization techniques that help improve predictive ability and interpretability, since feature selection is a critical and often extremely challenging component of machine learning. It includes techniques not only for regression but also for classification problems.
Chapter 5, K-Nearest Neighbors and Support Vector Machines, begins the exploration of the more advanced and nonlinear techniques. The real power of machine learning will be unveiled.
Chapter 6, Tree-Based Classification, covers techniques with some of the most powerful predictive abilities of all the machine learning methods, especially for classification problems. Single decision trees will be discussed along with the more advanced random forests and boosted trees. It also covers the very popular techniques provided by the XGBoost package.
Chapter 7, Neural Networks and Deep Learning, shows some of the most exciting machine learning methods currently used. Inspired by how the brain works, neural networks and their more recent and advanced offshoot, Deep Learning, will be put to the test. It also includes code for the H2O package, including hyperparameter search.
Chapter 8, Creating Ensembles and Multiclass Methods, has completely new content, involving the utilization of several great packages.
Chapter 9, Cluster Analysis, covers unsupervised learning. Instead of trying to make a prediction, the goal will focus on uncovering the latent structure of observations. Three clustering methods will be discussed: hierarchical, k-means, and partitioning around medoids. It also includes the methodology for executing unsupervised learning with random forests.
Chapter 10, Principal Component Analysis, continues the examination of unsupervised learning with principal components analysis, which is used to uncover the latent structure of the features. Once this is done, the new features will be used in a supervised learning exercise.
Chapter 11, Association Analysis, explains association analysis, which applies not only to making recommendations, product placement, and promotional pricing, but can also be used in manufacturing, web usage, and healthcare.
Chapter 12, Time Series and Causality, discusses univariate forecast models, bivariate regression, and Granger causality models, including an analysis of carbon emissions and climate change, along with a demonstration of different causality test methods.
Chapter 13, Text Mining, demonstrates a framework for quantitative text mining and the building of topic models. Along with time series, the world of data contains vast volumes of data in a textual format. With so much data as text, it is critically important to understand how to manipulate, code, and analyze the data in order to provide meaningful insights.
Chapter 14, Exploring the Machine Learning Landscape, will briefly review the various ML concepts that a practitioner must know. In this chapter, we will cover topics such as supervised learning, reinforcement learning, unsupervised learning, and real-world ML use cases.
Chapter 15, Predicting Employee Attrition Using Ensemble Models, covers the creation of powerful ML models through ensemble learning. We will introduce the problem at hand and then explore the dataset with exploratory data analysis (EDA). Then, in the preprocessing phase, we will create new features using prior domain experience. Once the dataset is fully prepared, models will be created using multiple ensemble techniques, such as bagging, boosting, stacking, and randomization. Lastly, we will deploy the final selected model to production.
Chapter 16, Implementing a Joke Recommendation Engine, introduces recommendation engines. We start by understanding the concepts and types of collaborative filtering algorithms. We will then build a recommendation engine to provide personalized joke recommendations using collaborative filtering approaches such as user-based collaborative filters and item-based collaborative filters. Apart from this, we will be exploring various libraries available in R that can be used to build recommendation systems.
Chapter 17, Sentiment Analysis of Amazon Reviews with NLP, covers sentiment analysis, which entails finding the sentiment of a sentence and labeling it as positive, negative, or neutral, and covers the various techniques that can be used to analyze text. We will understand text-mining concepts and the various ways that text is labeled based on the tone. Apart from using various popular R text-mining libraries to preprocess the reviews to be classified, we will also be leveraging a wide range of text representations, such as bag of words, word2vec, fastText, and GloVe.
Chapter 18, Customer Segmentation Using Wholesale Data, covers the segmentation, grouping, or clustering of customers, which can be achieved through unsupervised learning. In this chapter, we learn the various techniques of customer segmentation. We will be applying advanced clustering techniques, such as k-means, DIANA, and AGNES. Since the number of groups is rarely known in advance, we will explore ML techniques for dealing with this ambiguity and have ML find the number of groups possible based on the underlying characteristics of the input data. Evaluating the output of clustering algorithms is an area that is often challenging to practitioners.
Chapter 19, Image Recognition Using Deep Neural Networks, covers convolutional neural networks (CNNs). We explore why CNNs work so well with computer vision problems such as object detection. We will learn about all of these concepts by applying a CNN in the building of a multi-class classification model on a popular open dataset called MNIST. We will learn about the various preprocessing techniques that can be applied to the image data in order to use the data with deep learning models.
Chapter 20, Credit Card Fraud Detection Using Autoencoders, covers autoencoders and how they differ from other deep learning networks, such as recurrent neural networks (RNNs) and CNNs. We will learn about autoencoders by implementing a project that identifies credit card fraud. We will become familiar with dimensionality reduction and how it can be used for credit card fraud detection.
Chapter 21, Automatic Prose Generation with Recurrent Neural Networks, introduces deep neural networks (DNNs). We will implement a neural network from scratch and will learn how to apply an RNN by doing a project. We will create an application based on a long short-term memory (LSTM) network, a variant of RNNs, that generates text automatically. To accomplish this task, we make use of the MXNet framework, which extends its support to the R language for performing deep learning.
Chapter 22, Winning the Casino Slot Machines with Reinforcement Learning, begins with an explanation of RL. We discuss the various concepts of RL, including strategies for solving what is called the multi-arm bandit problem. We implement a project that uses the UCB and Thompson sampling techniques to solve the multi-arm bandit problem.
Appendix, Creating a Package, walks through creating your own R package.
Assuming the reader has a working knowledge of R and of basic statistics, this book will provide the skills and tools required to get the reader up and running with R and ML as quickly and painlessly as possible. There will probably always be detractors who complain that it does not offer enough math or does not do this, or that, or the other thing, but my answer to that is that these books already exist! Why duplicate what has already been done, and very well, for that matter? Again, I have sought to provide something different, something to hold the reader's attention and allow them to succeed in this competitive and rapidly changing field.
The projects covered in this book are intended to expose you to practical knowledge of implementing various ML techniques for real-world problems. It is expected that you have a good working knowledge of R and some basic understanding of ML; this is a must prior to starting these projects.
It should also be noted that the code for the projects is implemented using R version 3.5.2 (2018-12-20), nicknamed Eggshell Igloo. The project code has been successfully tested on Linux Mint 18.3 Sylvia. There is no reason to believe that the code does not work on other platforms, such as Windows; however, this is not something that has been tested by the author.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register at www.packt.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Advanced-Machine-Learning-with-R. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in the text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."
A block of code is set as follows:
html, body, #map { height: 100%; margin: 0; padding: 0}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)
Any command-line input or output is written as follows:
$ mkdir css
$ cd css
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Research consistently shows that machine learning and data science practitioners spend most of their time manipulating data and preparing it for analysis. Indeed, many find it the most tedious and least enjoyable part of their work. Numerous companies are offering solutions to the problem but, in my opinion, results at this point are varied. Therefore, in this first chapter, I shall endeavor to provide a way of tackling the problem that will ease the burden of getting your data ready for machine learning. The methodology introduced in this chapter will serve as the foundation for data preparation and for understanding many of the subsequent chapters. I propose that once you become comfortable with this tried and true process, it may very well become your favorite part of machine learning—as it is for me.
The following are the topics that we'll cover in this chapter:
Overview
Reading the data
Handling duplicate observations
Descriptive statistics
Exploring categorical variables
Handling missing values
Zero and near-zero variance features
Treating the data
Correlation and linearity
If you haven't been exposed to large, messy datasets, then be patient, for it's only a matter of time. If you've encountered such data, has it been in a domain where you have little subject matter expertise? If not, then once again I proffer that it's only a matter of time. Some of the common problems that fall under the term messy data include the following:
Missing or invalid values
Novel levels in a categorical feature that show up in algorithm production
High cardinality in categorical features such as zip codes
High dimensionality
Duplicate observations
So this begs the question: what are we to do? Well, first we need to look at the critical tasks that need to be performed during this phase of the process. The following tasks serve as the foundation for building a learning algorithm. They're from the SPSS paper CRISP-DM 1.0, a step-by-step data-mining guide available at https://the-modeling-agency.com/crisp-dm.pdf:
Data understanding:
Collect
Describe
Explore
Verify
Data preparation:
Select
Clean
Construct
Integrate
Format
Certainly this is an excellent enumeration of the process, but what do we really need to do? I propose that, in practical terms we can all relate to, the following must be done once the data is joined and loaded into your machine, cloud, or whatever you use:
Understand the data structure
Dedupe observations
Eliminate zero variance features and low variance features as desired
Handle missing values
Create dummy features (one-hot encoding)
Examine and deal with highly correlated features and those with perfect linear relationships
Scale as necessary
Create other features as desired
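To make a couple of these steps concrete, here's a minimal sketch in R, assuming a hypothetical dataframe named df that contains some numeric features (the caret functions shown are from the package we install shortly; the Gettysburg data we'll actually work with is loaded later in the chapter):
# Drop zero and near-zero variance features (nzv_cols holds the column positions)
nzv_cols <- caret::nearZeroVar(df)
if (length(nzv_cols) > 0) {
  df <- df[, -nzv_cols]
}
# Flag highly correlated numeric features; the 0.9 cutoff is a judgment call
num_df <- df[, sapply(df, is.numeric)]
cor_mat <- cor(num_df, use = "pairwise.complete.obs")
high_cor <- caret::findCorrelation(cor_mat, cutoff = 0.9)
colnames(num_df)[high_cor]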
Many feel that this is a daunting task. I don't and, in fact, I quite enjoy it. If done correctly and with a judicious application of judgment, it should reduce the amount of time spent at this first stage of a project and facilitate training your learning algorithm. None of the previous steps are challenging, but it can take quite a bit of time to write the code to perform each task.
Well, that's the benefit of this chapter. The example to follow will walk you through the tasks and the R code that accomplishes them. The code is flexible enough that you should be able to apply it to your own projects. Additionally, it will help you gain an understanding of the data to the point where you can intelligently discuss it with Subject Matter Experts (SMEs), if, in fact, they're available.
In the practical exercise that follows, we'll work with a small dataset. However, it suffers from all of the problems described earlier. Don't let the small size fool you, as we'll take what we learn here and use it for the more massive datasets to come in subsequent chapters.
As background, the data we'll use is something I put together painstakingly by hand. It's the Order of Battle for the opposing armies at the Battle of Gettysburg, fought during the American Civil War, July 1st-3rd, 1863, along with the casualties reported by the end of the day on July 3rd. I purposely chose this data because I'm reasonably sure you know very little about it. Don't worry, I'm the SME on the battle here and will walk you through it every step of the way. The one thing that we won't cover in this chapter is dealing with large volumes of textual features, which we'll discuss later in this book. Enough said already; let's get started!
This first task will load the data and show how to get a high-level understanding of its structure and dimensions, as well as how to install the necessary packages.
You have two ways to access the data, which resides on GitHub. You can download gettysburg.csv directly from the site at this link: https://github.com/PacktPublishing/Advanced-Machine-Learning-with-R/blob/master/Data/gettysburg.csv, or you can use the RCurl package. An example of how to use the package is available here: https://github.com/opetchey/RREEBES/wiki/Reading-data-and-code-from-an-online-github-repository.
Let's assume you have the file in your working directory, so let's begin by installing the necessary packages:
install.packages("caret")
install.packages("janitor")
install.packages("readr")
install.packages("sjmisc")
install.packages("skimr")
install.packages("tidyverse")
install.packages("vtreat")
Let me make a quick note about something I've learned (the hard way) about writing code correctly. With the packages installed, we could now specifically call the libraries into the R environment. However, it's a best practice, and necessary when putting code into production, that any function that isn't in base R be specified with its package, as in package::function(). First, this helps you, and unfortunate others, to read your code with an understanding of which library is mapped to a specific function. It also eliminates potential errors, because different packages can give the same name to different functions. The example that comes to my mind is the tsoutliers() function. The function is available in the forecast package and was in the tsoutliers package in earlier versions. Now, I know this extra typing might seem unwieldy and unnecessary, but once you discipline yourself to do it, you'll find that it's well worth the effort.
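As a quick illustration of that convention (this assumes the forecast package is installed and uses the built-in AirPassengers series, since our data isn't loaded yet), the qualified call leaves no doubt about which package supplies the function:
# Explicit: it's obvious where tsoutliers() comes from
forecast::tsoutliers(AirPassengers)
# Ambiguous: depends on which libraries happen to be attached
library(forecast)
tsoutliers(AirPassengers)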
There's one library we'll call and that's magrittr, which allows the use of a pipe-operator, %>%, to chain code together:
library(magrittr)
We're now ready to load the .csv file. In doing so, let's utilize the read_csv() function from readr as it's faster than base R and creates a tibble dataframe. In most cases, using tibbles in a tidyverse style is easier to write and understand. If you want to learn all the benefits of tidyverse, check out their website: tidyverse.org.
The only thing we need to specify in the function is our filename:
gettysburg <- read_csv("~/gettysburg.csv")
Here's a look at the column (feature) names:
colnames(gettysburg)
[1] "type" "state" "regiment_or_battery" "brigade"
[5] "division" "corps" "army" "july1_Commander"
[9] "Cdr_casualty" "men" "killed" "wounded"
[13] "captured" "missing" "total_casualties" "3inch_rifles"
[17] "4.5inch_rifles" "10lb_parrots" "12lb_howitzers" "12lb_napoleons"
[21] "6lb_howitzers" "24lb_howitzers" "20lb_parrots" "12lb_whitworths"
[25] "14lb_rifles" "total_guns"
We have 26 features in this data, and some of you are asking yourselves things like, what the heck is a 20-pound parrot? If you put it in a search engine, you'll probably end up with the bird and not the 20-pound Parrott rifled artillery gun. You can see the dimensions of the data in RStudio in your Global Environment view, or you can dig on your own to see that there're 590 observations:
dim(gettysburg)
[1] 590 26
So we have 590 observations of 26 features, but this data suffers from the issues that permeate large and complex data. Next, we'll explore if there're any duplicate observations and how to deal with them efficiently.
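As a minimal sketch of that step, using the janitor and dplyr packages installed earlier (dplyr is part of the tidyverse), duplicates can be surfaced for inspection and then dropped:
# Surface any fully duplicated rows for inspection
dupes <- janitor::get_dupes(gettysburg)
# Keep only one copy of each observation
gettysburg <- dplyr::distinct(gettysburg)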
Dealing with missing values can be a little tricky as there's a number of ways to approach the task. We've already seen in the section on descriptive statistics that there're missing values. First of all, let's get a full accounting of the missing quantity by feature, then we shall discuss how to deal with them. What I'm going to demonstrate in the following is how to put the count by feature into a dataframe that we can explore within RStudio:
na_count <- sapply(gettysburg, function(y) sum(length(which(is.na(y)))))
na_df <- data.frame(na_count)
View(na_df)
The following is a screenshot produced by the preceding code, after sorting the dataframe by descending count:
You can clearly see the count of missing values by feature; the feature with the most missing values is, ironically, named missing, with a total of 17 observations.
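If you just want the counts at the console rather than in a viewable dataframe, a base R one-liner gives the same information (shown here simply as an alternative to the preceding approach):
sort(colSums(is.na(gettysburg)), decreasing = TRUE)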
So what should we do here or, more appropriately, what can we do here? There're several choices:
Do nothing: However, some R functions will omit NAs and some functions will fail and produce an error.
Omit all observations with NAs: In massive datasets, this may make sense, but we run the risk of losing information.
Impute values: This could be something as simple as substituting the median value for the missing one, or creating an algorithm to impute the values.
Dummy coding: Turn the missing value into something such as 0 or -999, and code a dummy feature where, if the feature for a specific observation is missing, the dummy is coded 1; otherwise, it's coded 0.
I could devote an entire chapter, indeed a whole book on the subject, delving into missing at random and others, but I was trained—and, in fact, shall insist—on the latter method. It's never failed me and the others can be a bit problematic. The benefit of dummy coding—or indicator coding, if you prefer—is that you don't lose information. In fact, missing-ness might be an essential feature in and of itself.
So, here's an example of how I manually code a dummy feature and turn the NAs into zeroes:
gettysburg$missing_isNA <- ifelse(is.na(gettysburg$missing), 1, 0)
gettysburg$missing[is.na(gettysburg$missing)] <- 0
The first line of code creates a dummy feature for the missing feature, and the second changes any NAs in missing to zero. In the upcoming section, where the dataset is fully processed (treated), the other missing values will be imputed.
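For comparison only, the simple imputation option from the earlier list would look something like the following, using the wounded feature purely for illustration; it isn't the approach taken in this chapter:
# Illustrative only: substitute the median for any missing values in wounded
gettysburg$wounded[is.na(gettysburg$wounded)] <- median(gettysburg$wounded, na.rm = TRUE)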
This chapter looked at the problems common to the large, messy datasets found in machine learning projects. These include, but are not limited to, the following:
Missing or invalid values
Novel levels in a categorical feature that show up in algorithm production
High cardinality in categorical features such as zip code
High dimensionality
Duplicate observations
This chapter provided a disciplined approach to dealing with these problems by showing how to explore the data, treat it, and create a dataframe that you can use for developing your learning algorithm. The code is also flexible enough that you can modify it to suit your circumstances. This methodology should make what many feel is the most arduous, time-consuming, and least enjoyable part of the job an easy task.
With this task behind us, we can now get started on our first modeling task using linear regression in the following chapter.
It's essential that we get started with a simple yet extremely effective technique that's been used for a long time: linear regression. Albert Einstein is believed to have remarked at one time or another that things should be made as simple as possible, but no simpler. This is sage advice and a good rule of thumb in the development of algorithms for machine learning. Considering the other techniques that we'll discuss later, there's no simpler model than tried and tested linear regression, which uses the least squares approach to predict a quantitative outcome. We can consider it to be the foundation of all the methods that we'll discuss later, many of which are mere extensions. If you can master the linear regression method, well then quite frankly I believe you can master the rest of this book. Therefore, let's consider this as a good starting point for our journey towards becoming a machine learning guru.
This chapter covers introductory material and an expert in this subject can skip ahead to the next topic. Otherwise, ensure that you thoroughly understand this topic before venturing to other, more complex learning methods. I believe you'll discover that many of your projects can be addressed by just applying what's discussed in the following sections. Linear regression is probably the most straightforward model to explain to your customers, most of whom will have at least a cursory understanding of R-squared. Many of them will have been exposed to it at great depth and hence will be comfortable with variable contribution, collinearity, and the like.
The following are the topics that we'll be covering in this chapter:
Univariate linear regression
Multivariate linear regression
We begin by looking at a simple way to predict a quantitative response, Y, with one predictor variable, x, assuming that Y has a linear relationship with x. The model for this can be written as follows:

Y = β0 + β1x + e
We can state it as follows: the expected value of Y is a function of the parameters β0 (the intercept) plus β1 (the slope) times x, plus an error term e. The least squares approach chooses the model parameters that minimize the Residual Sum of Squares (RSS) of the predicted y values versus the actual Y values. For a simple example, let's say we have the actual values of Y1 and Y2 equal to 10 and 20 respectively, along with the predictions of y1 and y2 as 12 and 18. To calculate RSS, we add the squared differences:

RSS = (Y1 - y1)² + (Y2 - y2)²
This, with simple substitution, yields the following:

RSS = (10 - 12)² + (20 - 18)² = 4 + 4 = 8
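As a quick check of that arithmetic in R, using the made-up values above:
actual <- c(10, 20)
predicted <- c(12, 18)
sum((actual - predicted)^2)
# [1] 8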
Before we begin with an application, I want to point out that if you read the headlines of various research breakthroughs, you should do so with a jaded eye and a skeptical mind as the conclusion put forth by the media may not be valid. As we shall see, R—and any other software, for that matter—will give us a solution regardless of the input. However, just because the math makes sense and a high correlation or R-squared statistic is reported doesn't mean that the conclusion is valid.
To drive this point home, let's have a look at the famous Anscombe dataset, which is available in R. The statistician Francis Anscombe produced this set to highlight the importance of data visualization and outliers when analyzing data. It consists of four pairs of X and Y variables that have the same statistical properties but when plotted show something very different. I've used the data to train colleagues and to educate business partners on the hazards of fixating on statistics without exploring the data and checking assumptions. I think this is an excellent place to start should you have a similar need. It's a brief digression before moving on to serious modeling:
> #call up and explore the data
> data(anscombe)
> attach(anscombe)
> anscombe