R: Unleash Machine Learning Techniques - Raghav Bali - E-Book

R: Unleash Machine Learning Techniques E-Book

Raghav Bali

Description

Find out how to build smarter machine learning systems with R. Follow this three module course to become a more fluent machine learning practitioner.

About This Book

  • Build your confidence with R and find out how to solve a huge range of data-related problems
  • Get to grips with some of the most important machine learning techniques being used by data scientists and analysts across industries today
  • Don't just learn – apply your knowledge by following featured practical projects covering everything from financial modeling to social media analysis

Who This Book Is For

Aimed at intermediate-to-advanced practitioners (especially data scientists) who already work in the field of data science.

What You Will Learn

  • Get to grips with R techniques to clean and prepare your data for analysis, and visualize your results
  • Implement R machine learning algorithms from scratch and be amazed to see the algorithms in action
  • Solve interesting real-world problems using machine learning and R as the journey unfolds
  • Write reusable code and build complete machine learning systems from the ground up
  • Learn specialized machine learning techniques for text mining, social network data, big data, and more
  • Discover the different types of machine learning models and learn which is best to meet your data needs and solve your analysis problems
  • Evaluate and improve the performance of machine learning models

In Detail

R is the established language of data analysts and statisticians around the world. And you shouldn't be afraid to use it...

This Learning Path will take you through the fundamentals of R and demonstrate how to use the language to solve a diverse range of challenges through machine learning. Accessible yet comprehensive, it provides you with everything you need to become a more fluent data professional, and more confident with R.

In the first module you'll get to grips with the fundamentals of R. This means you'll be taking a look at some of the details of how the language works, before seeing how to put your knowledge into practice to build some simple machine learning projects that could prove useful for a range of real world problems.

For the following two modules we'll begin to investigate machine learning algorithms in more detail. To build upon the basics, you'll get to work on three different projects that will test your skills. Covering some of the most important algorithms and featuring some of the most popular R packages, they're all focused on solving real problems in different areas, ranging from finance to social media.

This Learning Path has been curated from three Packt products:

  • R Machine Learning By Example By Raghav Bali, Dipanjan Sarkar
  • Machine Learning with R - Second Edition By Brett Lantz
  • Mastering Machine Learning with R By Cory Lesmeister

Style and approach

This is an enticing learning path that starts from the very basics and gradually picks up pace as the story unfolds. Each concept is first defined succinctly in its larger context, followed by a detailed explanation of its application. Each topic is explained with the help of a project that solves a real-world problem through hands-on work, giving you deep insight into the world of machine learning.


Page count: 1440

Year of publication: 2016




Table of Contents

R: Unleash Machine Learning Techniques
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
I. Module 1
1. Getting Started with R and Machine Learning
Delving into the basics of R
Using R as a scientific calculator
Operating on vectors
Special values
Data structures in R
Vectors
Creating vectors
Indexing and naming vectors
Arrays and matrices
Creating arrays and matrices
Names and dimensions
Matrix operations
Lists
Creating and indexing lists
Combining and converting lists
Data frames
Creating data frames
Operating on data frames
Working with functions
Built-in functions
User-defined functions
Passing functions as arguments
Controlling code flow
Working with if, if-else, and ifelse
Working with switch
Loops
Advanced constructs
lapply and sapply
apply
tapply
mapply
Next steps with R
Getting help
Handling packages
Machine learning basics
Machine learning – what does it really mean?
Machine learning – how is it used in the world?
Types of machine learning algorithms
Supervised machine learning algorithms
Unsupervised machine learning algorithms
Popular machine learning packages in R
Summary
2. Let's Help Machines Learn
Understanding machine learning
Algorithms in machine learning
Perceptron
Families of algorithms
Supervised learning algorithms
Linear regression
K-Nearest Neighbors (KNN)
Collecting and exploring data
Normalizing data
Creating training and test data sets
Learning from data/training the model
Evaluating the model
Unsupervised learning algorithms
Apriori algorithm
K-Means
Summary
3. Predicting Customer Shopping Trends with Market Basket Analysis
Detecting and predicting trends
Market basket analysis
What does market basket analysis actually mean?
Core concepts and definitions
Techniques used for analysis
Making data driven decisions
Evaluating a product contingency matrix
Getting the data
Analyzing and visualizing the data
Global recommendations
Advanced contingency matrices
Frequent itemset generation
Getting started
Data retrieval and transformation
Building an itemset association matrix
Creating a frequent itemsets generation workflow
Detecting shopping trends
Association rule mining
Loading dependencies and data
Exploratory analysis
Detecting and predicting shopping trends
Visualizing association rules
Summary
4. Building a Product Recommendation System
Understanding recommendation systems
Issues with recommendation systems
Collaborative filters
Core concepts and definitions
The collaborative filtering algorithm
Predictions
Recommendations
Similarity
Building a recommender engine
Matrix factorization
Implementation
Result interpretation
Production ready recommender engines
Extract, transform, and analyze
Model preparation and prediction
Model evaluation
Summary
5. Credit Risk Detection and Prediction – Descriptive Analytics
Types of analytics
Our next challenge
What is credit risk?
Getting the data
Data preprocessing
Dealing with missing values
Datatype conversions
Data analysis and transformation
Building analysis utilities
Analyzing the dataset
Saving the transformed dataset
Next steps
Feature sets
Machine learning algorithms
Summary
6. Credit Risk Detection and Prediction – Predictive Analytics
Predictive analytics
How to predict credit risk
Important concepts in predictive modeling
Preparing the data
Building predictive models
Evaluating predictive models
Getting the data
Data preprocessing
Feature selection
Modeling using logistic regression
Modeling using support vector machines
Modeling using decision trees
Modeling using random forests
Modeling using neural networks
Model comparison and selection
Summary
7. Social Media Analysis – Analyzing Twitter Data
Social networks (Twitter)
Data mining @social networks
Mining social network data
Data and visualization
Word clouds
Treemaps
Pixel-oriented maps
Other visualizations
Getting started with Twitter APIs
Overview
Registering the application
Connect/authenticate
Extracting sample tweets
Twitter data mining
Frequent words and associations
Popular devices
Hierarchical clustering
Topic modeling
Challenges with social network data mining
References
Summary
8. Sentiment Analysis of Twitter Data
Understanding Sentiment Analysis
Key concepts of sentiment analysis
Subjectivity
Sentiment polarity
Opinion summarization
Feature extraction
Approaches
Applications
Challenges
Sentiment analysis upon Tweets
Polarity analysis
Classification-based algorithms
Labeled dataset
Support Vector Machines
Ensemble methods
Boosting
Cross-validation
Summary
II. Module 2
1. Introducing Machine Learning
The origins of machine learning
Uses and abuses of machine learning
Machine learning successes
The limits of machine learning
Machine learning ethics
How machines learn
Data storage
Abstraction
Generalization
Evaluation
Machine learning in practice
Types of input data
Types of machine learning algorithms
Matching input data to algorithms
Machine learning with R
Installing R packages
Loading and unloading R packages
Summary
2. Managing and Understanding Data
R data structures
Vectors
Factors
Lists
Data frames
Matrixes and arrays
Managing data with R
Saving, loading, and removing R data structures
Importing and saving data from CSV files
Exploring and understanding data
Exploring the structure of data
Exploring numeric variables
Measuring the central tendency – mean and median
Measuring spread – quartiles and the five-number summary
Visualizing numeric variables – boxplots
Visualizing numeric variables – histograms
Understanding numeric data – uniform and normal distributions
Measuring spread – variance and standard deviation
Exploring categorical variables
Measuring the central tendency – the mode
Exploring relationships between variables
Visualizing relationships – scatterplots
Examining relationships – two-way cross-tabulations
Summary
3. Lazy Learning – Classification Using Nearest Neighbors
Understanding nearest neighbor classification
The k-NN algorithm
Measuring similarity with distance
Choosing an appropriate k
Preparing data for use with k-NN
Why is the k-NN algorithm lazy?
Example – diagnosing breast cancer with the k-NN algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Transformation – normalizing numeric data
Data preparation – creating training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Transformation – z-score standardization
Testing alternative values of k
Summary
4. Probabilistic Learning – Classification Using Naive Bayes
Understanding Naive Bayes
Basic concepts of Bayesian methods
Understanding probability
Understanding joint probability
Computing conditional probability with Bayes' theorem
The Naive Bayes algorithm
Classification with Naive Bayes
The Laplace estimator
Using numeric features with Naive Bayes
Example – filtering mobile phone spam with the Naive Bayes algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – cleaning and standardizing text data
Data preparation – splitting text documents into words
Data preparation – creating training and test datasets
Visualizing text data – word clouds
Data preparation – creating indicator features for frequent words
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
5. Divide and Conquer – Classification Using Decision Trees and Rules
Understanding decision trees
Divide and conquer
The C5.0 decision tree algorithm
Choosing the best split
Pruning the decision tree
Example – identifying risky bank loans using C5.0 decision trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating random training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Boosting the accuracy of decision trees
Making some mistakes more costly than others
Understanding classification rules
Separate and conquer
The 1R algorithm
The RIPPER algorithm
Rules from decision trees
What makes trees and rules greedy?
Example – identifying poisonous mushrooms with rule learners
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
6. Forecasting Numeric Data – Regression Methods
Understanding regression
Simple linear regression
Ordinary least squares estimation
Correlations
Multiple linear regression
Example – predicting medical expenses using linear regression
Step 1 – collecting data
Step 2 – exploring and preparing the data
Exploring relationships among features – the correlation matrix
Visualizing relationships among features – the scatterplot matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Model specification – adding non-linear relationships
Transformation – converting a numeric variable to a binary indicator
Model specification – adding interaction effects
Putting it all together – an improved regression model
Understanding regression trees and model trees
Adding regression to trees
Example – estimating the quality of wines with regression trees and model trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Visualizing decision trees
Step 4 – evaluating model performance
Measuring performance with the mean absolute error
Step 5 – improving model performance
Summary
7. Black Box Methods – Neural Networks and Support Vector Machines
Understanding neural networks
From biological to artificial neurons
Activation functions
Network topology
The number of layers
The direction of information travel
The number of nodes in each layer
Training neural networks with backpropagation
Example – Modeling the strength of concrete with ANNs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Understanding Support Vector Machines
Classification with hyperplanes
The case of linearly separable data
The case of nonlinearly separable data
Using kernels for non-linear spaces
Example – performing OCR with SVMs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
8. Finding Patterns – Market Basket Analysis Using Association Rules
Understanding association rules
The Apriori algorithm for association rule learning
Measuring rule interest – support and confidence
Building a set of rules with the Apriori principle
Example – identifying frequently purchased groceries with association rules
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating a sparse matrix for transaction data
Visualizing item support – item frequency plots
Visualizing the transaction data – plotting the sparse matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Sorting the set of association rules
Taking subsets of association rules
Saving association rules to a file or data frame
Summary
9. Finding Groups of Data – Clustering with k-means
Understanding clustering
Clustering as a machine learning task
The k-means clustering algorithm
Using distance to assign and update clusters
Choosing the appropriate number of clusters
Example – finding teen market segments using k-means clustering
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – dummy coding missing values
Data preparation – imputing the missing values
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
10. Evaluating Model Performance
Measuring performance for classification
Working with classification prediction data in R
A closer look at confusion matrices
Using confusion matrices to measure performance
Beyond accuracy – other measures of performance
The kappa statistic
Sensitivity and specificity
Precision and recall
The F-measure
Visualizing performance trade-offs
ROC curves
Estimating future performance
The holdout method
Cross-validation
Bootstrap sampling
Summary
11. Improving Model Performance
Tuning stock models for better performance
Using caret for automated parameter tuning
Creating a simple tuned model
Customizing the tuning process
Improving model performance with meta-learning
Understanding ensembles
Bagging
Boosting
Random forests
Training random forests
Evaluating random forest performance
Summary
12. Specialized Machine Learning Topics
Working with proprietary files and databases
Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
Querying data in SQL databases
Working with online data and services
Downloading the complete text of web pages
Scraping data from web pages
Parsing XML documents
Parsing JSON from web APIs
Working with domain-specific data
Analyzing bioinformatics data
Analyzing and visualizing network data
Improving the performance of R
Managing very large datasets
Generalizing tabular data structures with dplyr
Making data frames faster with data.table
Creating disk-based data frames with ff
Using massive matrices with bigmemory
Learning faster with parallel computing
Measuring execution time
Working in parallel with multicore and snow
Taking advantage of parallel with foreach and doParallel
Parallel cloud computing with MapReduce and Hadoop
GPU computing
Deploying optimized learning algorithms
Building bigger regression models with biglm
Growing bigger and faster random forests with bigrf
Training and evaluating models in parallel with caret
Summary
III. Module 3
1. A Process for Success
The process
Business understanding
Identify the business objective
Assess the situation
Determine the analytical goals
Produce a project plan
Data understanding
Data preparation
Modeling
Evaluation
Deployment
Algorithm flowchart
Summary
2. Linear Regression – The Blocking and Tackling of Machine Learning
Univariate linear regression
Business understanding
Multivariate linear regression
Business understanding
Data understanding and preparation
Modeling and evaluation
Other linear model considerations
Qualitative feature
Interaction term
Summary
3. Logistic Regression and Discriminant Analysis
Classification methods and linear regression
Logistic regression
Business understanding
Data understanding and preparation
Modeling and evaluation
The logistic regression model
Logistic regression with cross-validation
Discriminant analysis overview
Discriminant analysis application
Model selection
Summary
4. Advanced Feature Selection in Linear Models
Regularization in a nutshell
Ridge regression
LASSO
Elastic net
Business case
Business understanding
Data understanding and preparation
Modeling and evaluation
Best subsets
Ridge regression
LASSO
Elastic net
Cross-validation with glmnet
Model selection
Summary
5. More Classification Techniques – K-Nearest Neighbors and Support Vector Machines
K-Nearest Neighbors
Support Vector Machines
Business case
Business understanding
Data understanding and preparation
Modeling and evaluation
KNN modeling
SVM modeling
Model selection
Feature selection for SVMs
Summary
6. Classification and Regression Trees
Introduction
An overview of the techniques
Regression trees
Classification trees
Random forest
Gradient boosting
Business case
Modeling and evaluation
Regression tree
Classification tree
Random forest regression
Random forest classification
Gradient boosting regression
Gradient boosting classification
Model selection
Summary
7. Neural Networks
Neural network
Deep learning, a not-so-deep overview
Business understanding
Data understanding and preparation
Modeling and evaluation
An example of deep learning
H2O background
Data preparation and uploading it to H2O
Create train and test datasets
Modeling
Summary
8. Cluster Analysis
Hierarchical clustering
Distance calculations
K-means clustering
Gower and partitioning around medoids
Gower
PAM
Business understanding
Data understanding and preparation
Modeling and evaluation
Hierarchical clustering
K-means clustering
Clustering with mixed data
Summary
9. Principal Components Analysis
An overview of the principal components
Rotation
Business understanding
Data understanding and preparation
Modeling and evaluation
Component extraction
Orthogonal rotation and interpretation
Creating factor scores from the components
Regression analysis
Summary
10. Market Basket Analysis and Recommendation Engines
An overview of a market basket analysis
Business understanding
Data understanding and preparation
Modeling and evaluation
An overview of a recommendation engine
User-based collaborative filtering
Item-based collaborative filtering
Singular value decomposition and principal components analysis
Business understanding and recommendations
Data understanding, preparation, and recommendations
Modeling, evaluation, and recommendations
Summary
11. Time Series and Causality
Univariate time series analysis
Bivariate regression
Granger causality
Business understanding
Data understanding and preparation
Modeling and evaluation
Univariate time series forecasting
Time series regression
Examining the causality
Summary
12. Text Mining
Text mining framework and methods
Topic models
Other quantitative analyses
Business understanding
Data understanding and preparation
Modeling and evaluation
Word frequency and topic models
Additional quantitative analysis
Summary
A. R Fundamentals
Introduction
Getting R up and running
Using R
Data frames and matrices
Summary stats
Installing and loading the R packages
Summary
A. Bibliography
Index

R: Unleash Machine Learning Techniques

Find out how to build smarter machine learning systems with R. Follow this three module course to become a more fluent machine learning practitioner

A course in three modules

BIRMINGHAM - MUMBAI

R: Unleash Machine Learning Techniques

Copyright © 2016 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Published on: September 2016

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78712-734-0

www.packtpub.com

Credits

Authors

Raghav Bali

Dipanjan Sarkar

Brett Lantz

Cory Lesmeister

Reviewers

Alexey Grigorev

Vijayakumar Nattamai Jawaharlal

Kent S. Johnson

Mzabalazo Z. Ngwenya

Anuj Saxena

Vikram Dhillon

Miro Kopecky

Pavan Narayanan

Doug Ortiz

Shivani Rao, PhD

Content Development Editor

Parshva Sheth

Graphics

Abhinash Sahu

Production Coordinator

Melwyn Dsa

Preface

 

"He who defends everything, defends nothing."

  --Frederick the Great

Machine learning is a very broad topic. The following quote sums it up nicely: "The first problem facing you is the bewildering variety of learning algorithms available. Which one to use? There are literally thousands available, and hundreds more are published each year." (Domingos, P., 2012.) It would therefore be irresponsible to try to cover everything in the chapters that follow because, to paraphrase Frederick the Great, we would achieve nothing. With this constraint in mind, we hope to provide a solid foundation of algorithms and business considerations that will allow the reader to walk away and, first of all, take on any machine learning task with complete confidence, and secondly, be able to help themselves in figuring out other algorithms and topics. Essentially, if this course significantly helps you to help yourself, then I would consider this a victory. Don't think of this course as a destination but rather as a path to self-discovery.

What this learning path covers

Module 1, R Machine Learning By Example, Data science and machine learning are some of the top buzzwords in the technical world today. From retail stores to Fortune 500 companies, everyone is working hard to make machine learning give them data-driven insights to grow their businesses. With powerful data manipulation features, machine learning packages, and an active developer community, R empowers users to build sophisticated machine learning systems to solve real-world data problems. This module takes you on a data-driven journey that starts with the very basics of R and machine learning and gradually builds upon the concepts to work on projects that tackle real-world problems.

Module 2, Machine Learning with R, Machine learning, at its core, is concerned with the algorithms that transform information into actionable intelligence. This fact makes machine learning well-suited to the present-day era of big data. Without machine learning, it would be nearly impossible to keep up with the massive stream of information. Given the growing prominence of R—a cross-platform, zero-cost statistical programming environment—there has never been a better time to start using machine learning. R offers a powerful but easy-to-learn set of tools that can assist you with finding data insights. By combining hands-on case studies with the essential theory that you need to understand how things work under the hood, this book provides all the knowledge that you will need to start applying machine learning to your own projects.

Module 3, Mastering Machine Learning with R, The world of R can be as bewildering as the world of machine learning! There is a seemingly endless number of R packages, with a plethora of blogs, websites, discussions, and papers of varying quality and complexity from the community that supports R. This is a great reservoir of information and probably R's greatest strength, but I've always believed that an entity's greatest strength can also be its greatest weakness. R's vast community of knowledge can quickly overwhelm and/or sidetrack you and your efforts. Show me a problem and give me ten different R programmers and I'll show you ten different ways the code is written to solve the problem. As I've written each chapter, I've endeavored to capture the critical elements that can assist you in using R to understand, prepare, and model the data. I am no R programming expert by any stretch of the imagination, but again, I like to think that I can provide a solid foundation herein.

Another thing that lit a fire under me to write this book was an incident that happened in the hallways of a former employer a couple of years ago. My team had an IT contractor to support the management of our databases. As we were walking and chatting about big data and the like, he mentioned that he had bought a book about machine learning with R and another about machine learning with Python. He stated that he could do all the programming, but all of the statistics made absolutely no sense to him. I have kept this conversation at the back of my mind throughout the writing process. It has been a very challenging task to balance the technical and theoretical with the practical. One could, and probably someone has, turned the theory of each chapter into its own book. I used a heuristic of sorts to aid me in deciding whether a formula or technical aspect was in scope: would this help me or the readers in discussions with team members and business leaders? If I felt it might help, I would strive to provide the necessary details. I also made a conscious effort to keep the datasets used in the practical exercises large enough to be interesting but small enough to allow you to gain insight without becoming overwhelmed.

This book is not about big data, but make no mistake about it, the methods and concepts that we will discuss can be scaled to big data. In short, this module will appeal to a broad group of individuals, from IT experts seeking to understand and interpret machine learning algorithms to statistical gurus desiring to incorporate the power of R into their analysis. However, even those that are well-versed in both IT and statistics—experts if you will—should be able to pick up quite a few tips and tricks to assist them in their efforts.

What you need for this learning path

This software applies to all the chapters of the book:

  • Windows / Mac OS X / Linux
  • R 3.2.0 (or higher)
  • RStudio Desktop 0.99 (or higher)

There are no specific hardware requirements, since R can run on any PC running Mac OS X, Linux, or Windows, but at least 4 GB of physical memory is recommended to run some of the iterative algorithms smoothly.

Who this learning path is for

Aimed at intermediate-to-advanced practitioners (especially data scientists) who already work in the field of data science.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at

https://github.com/PacktPublishing/R-Maching-Learning-Techniques

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Part I. Module 1

R Machine Learning By Example

Understand the fundamentals of machine learning with R and build your own dynamic algorithms to tackle complicated real-world problems successfully

Chapter 1. Getting Started with R and Machine Learning

This introductory chapter will get you started with the basics of R, which include various constructs, useful data structures, loops, and vectorization. If you are already an R wizard, you can skim through these sections and dive right into the next part, which talks about what machine learning actually represents as a domain and the main areas it encompasses. We will also talk about the different machine learning techniques and algorithms used in each area. Finally, we will conclude by looking at some of the most popular machine learning packages in R, some of which we will be using in the subsequent chapters.

If you are a data or machine learning enthusiast, you have surely heard by now that Harvard Business Review called data scientist the sexiest job of the 21st century.

Note

Reference: https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

There is a huge demand in the current market for data scientists, primarily because their main job is to gather crucial insights and information from both unstructured and structured data to help their business and organization grow strategically.

Some of you might be wondering how machine learning or R relates to all this. To be a successful data scientist, one of the major tools you need in your toolbox is a powerful language capable of performing complex statistical calculations, working with various types of data, and building models that help you uncover previously unknown insights. R is very well suited to that. Machine learning forms the foundation of the skills you need to become a data analyst or data scientist; this includes using various techniques to build models that extract insights from data.

This book will provide you with some of the essential tools you need to be well versed with both R and machine learning by not only looking at concepts but also applying those concepts in real-world examples. Enough talk; now let's get started on our journey into the world of machine learning with R!

In this chapter, we will cover the following aspects:

  • Delving into the basics of R
  • Understanding the data structures in R
  • Working with functions
  • Controlling code flow
  • Taking further steps with R
  • Understanding machine learning basics
  • Familiarizing yourself with popular machine learning packages in R

Delving into the basics of R

It is assumed here that you are at least familiar with the basics of R or have worked with R before. Hence, we won't be talking much about downloading and installation; there are plenty of resources on the web that cover this. I recommend that you use RStudio, an Integrated Development Environment (IDE) that is much more capable than the base R Graphical User Interface (GUI). You can visit https://www.rstudio.com/ to get more information about it.

Note

For details about the R project, you can visit https://www.r-project.org/ to get an overview of the language. Besides this, R has a vast arsenal of wonderful packages at its disposal and you can view everything related to R and its packages at https://cran.r-project.org/ which contains all the archives.

You must already be familiar with the R interactive interpreter, often called a Read-Evaluate-Print Loop (REPL). This interpreter acts like any command line interface which asks for input and starts with a > character, which indicates that R is waiting for your input. If your input spans multiple lines, like when you are writing a function, you will see a + prompt in each subsequent line, which means that you didn't finish typing the complete expression and R is asking you to provide the rest of the expression.

R can also read and execute complete files of commands and functions, which are saved with an .R extension. Usually, any big application consists of several .R files. Each file has its own role in the application and is often called a module. We will be exploring some of the main features and capabilities of R in the following sections.
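
As a minimal sketch of running a file of commands, the snippet below writes a tiny script to a temporary file and loads it with source(). The script contents and the function name square_plus_one are hypothetical and exist only for illustration; they are not part of the book's code bundle.

```r
# Write a tiny script to a temporary .R file (hypothetical contents).
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "square_plus_one <- function(x) {",
  "  x ^ 2 + 1",
  "}"
), script_path)

# source() reads the file and evaluates every expression in it.
source(script_path)

# The function defined in the file is now available in the session.
result <- square_plus_one(1:3)
print(result)

# Remove the temporary file once it is no longer needed.
unlink(script_path)
```

In a real project the path would point to one of your own .R files rather than a temporary file.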

Using R as a scientific calculator

The most basic constructs in R include variables and arithmetic operators which can be used to perform simple mathematical operations like a calculator or even complex statistical calculations.

> 5 + 6
[1] 11
> 3 * 2
[1] 6
> 1 / 0
[1] Inf

Remember that even a single number in R is a vector. This applies to the results in the previous code snippet as well: each one has a leading [1], which marks the first element of a vector of length 1.
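
A quick way to check this yourself is sketched below using only base R functions; the variable name result is arbitrary.

```r
# A single arithmetic result is still a vector of length 1.
result <- 5 + 6

print(is.vector(result))  # is the result a vector?
print(length(result))     # how many elements does it hold?
print(result[1])          # the first (and only) element
```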

You can also assign values to variables and operate on them just like any other programming language.

> num <- 6
> num ^ 2
[1] 36
> num  # a variable changes value only on re-assignment
[1] 6
> num <- num ^ 2 * 5 + 10 / 3
> num
[1] 183.3333
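
The last assignment above relies on operator precedence. As a small sketch, the same expression can be written with explicit parentheses to show the order in which R evaluates it; the names implicit and explicit are used only for this illustration.

```r
num <- 6

# ^ binds tighter than * and /, which in turn bind tighter than +.
implicit <- num ^ 2 * 5 + 10 / 3
explicit <- ((num ^ 2) * 5) + (10 / 3)   # 36 * 5 + 3.3333... = 183.3333...

print(identical(implicit, explicit))
print(round(explicit, 4))
```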

Operating on vectors

The most basic data structure in R is a vector. Even a single number is a vector, as we saw in the earlier example. A vector is an ordered sequence of values, and we can create one using the : operator or the c function, which concatenates its arguments into a vector.

> x <- 1:5
> x
[1] 1 2 3 4 5
> y <- c(6, 7, 8, 9, 10)
> y
[1] 6 7 8 9 10
> z <- x + y
> z
[1] 7 9 11 13 15

You can clearly see in the previous code snippet that we added two vectors together without any loop, using just the + operator. This is known as vectorization, and we will discuss it in more detail later on. Some more operations on vectors are shown next:

> c(1,3,5,7,9) * 2
[1] 2 6 10 14 18
> c(1,3,5,7,9) * c(2, 4)  # here the second vector gets recycled
[1] 2 12 10 28 18
> factorial(1:5)
[1] 1 2 6 24 120
> exp(2:10) # exponential function
[1] 7.389056 20.085537 54.598150 148.413159 403.428793 1096.633158
[7] 2980.957987 8103.083928 22026.465795
> cos(c(0, pi/4)) # cosine function
[1] 1.0000000 0.7071068
> sqrt(c(1, 4, 9, 16))
[1] 1 2 3 4
> sum(1:10)
[1] 55

You might be puzzled by the multiplication c(1, 3, 5, 7, 9) * c(2, 4), where we multiplied a longer vector by a shorter one and still got a result. R also raises a warning in this case, although it is not shown in the output above. Since the two vectors are not equal in length, the shorter vector c(2, 4) is recycled, or repeated, to become c(2, 4, 2, 4, 2) and is then multiplied element by element with the first vector c(1, 3, 5, 7, 9) to give the final result vector, c(2, 12, 10, 28, 18). The other functions shown here are standard functions available in base R, along with many others.
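
The recycling step can be spelled out by hand, as in the following sketch. It uses only base R; the variable names long, short, recycled, manual, implicit, and msg are chosen for this illustration.

```r
long  <- c(1, 3, 5, 7, 9)
short <- c(2, 4)

# Repeat `short` until it is as long as `long`, which is what R does implicitly.
recycled <- rep(short, length.out = length(long))
print(recycled)

# Multiply element by element using the explicitly recycled vector.
manual <- long * recycled

# Let R recycle on its own; suppressWarnings() hides the length-mismatch warning.
implicit <- suppressWarnings(long * short)
print(identical(manual, implicit))

# Capture the warning text instead of letting it print to the console.
msg <- tryCatch(long * short, warning = function(w) conditionMessage(w))
print(msg)
```

No warning is raised when the longer length is an exact multiple of the shorter one, so silent recycling is worth watching for when vectors are expected to have equal lengths.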

Special values

Since you will be dealing with a lot of messy and dirty data in data analysis and machine learning, it is important to remember some of the special values in R so that you don't get too surprised later on if one of them pops up.

> 1 / 0
[1] Inf
> 0 / 0
[1] NaN
> Inf / NaN
[1] NaN
> Inf / Inf
[1] NaN
> log(Inf)
[1] Inf
> Inf + NA
[1] NA

The main values which should concern you here are Inf which stands for Infinity, NaN which is Not a Number, and NA which indicates a value that is missing or Not Available. The following code snippet shows some logical tests on these special values and their results. Do remember that TRUE and FALSE are logical data type values, similar to other programming languages.

> vec <- c(0, Inf, NaN, NA)
> is.finite(vec)
[1] TRUE FALSE FALSE FALSE
> is.nan(vec)
[1] FALSE FALSE TRUE FALSE
> is.na(vec)
[1] FALSE FALSE TRUE TRUE
> is.infinite(vec)
[1] FALSE TRUE FALSE FALSE

The function names are largely self-explanatory: is.finite and is.infinite indicate which values are finite and which are infinite, while is.nan and is.na check for NaN and NA values respectively. Note that is.na also returns TRUE for NaN, as the output above shows. Some of these functions are very useful when cleaning dirty data.
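
As a rough sketch of how these tests help with cleaning, the snippet below filters a made-up vector of measurements. The vector readings and its values are hypothetical and serve only as an illustration.

```r
# Hypothetical measurements containing several special values.
readings <- c(12.5, NA, 7.1, Inf, NaN, 3.4, -Inf)

# Keep only ordinary finite numbers; NA, NaN, Inf and -Inf are all dropped.
clean <- readings[is.finite(readings)]
print(clean)

# Count each kind of problem value. is.na() is TRUE for NaN as well,
# so the NaN count is subtracted to get the number of plain NA values.
n_nan <- sum(is.nan(readings))
n_na  <- sum(is.na(readings)) - n_nan
n_inf <- sum(is.infinite(readings))
print(c(NA_count = n_na, NaN_count = n_nan, Inf_count = n_inf))

# Summary statistics are now safe to compute on the cleaned vector.
print(mean(clean))
```

Whether dropping such values is appropriate depends on the data; in some analyses it may be preferable to keep missing values and handle them explicitly.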