R Machine Learning Projects - Dr. Sunil Kumar Chinnamgari - E-Book

R Machine Learning Projects E-Book

Dr. Sunil Kumar Chinnamgari

Description

Master a range of machine learning domains with real-world projects using TensorFlow for R, H2O, MXNet, and more

Key Features



  • Master machine learning, deep learning, and predictive modeling concepts in R 3.5
  • Build intelligent end-to-end projects for finance, retail, social media, and a variety of domains
  • Implement smart cognitive models with helpful tips and best practices

Book Description



R is one of the most popular languages when it comes to performing computational statistics (statistical computing) easily and exploring the mathematical side of machine learning. With this book, you will leverage the R ecosystem to build efficient machine learning applications that carry out intelligent tasks within your organization.

This book will help you test your knowledge and skills, guiding you through machine learning projects that range from easy to complex. You will first learn how to build powerful machine learning models with ensembles to predict employee attrition. Next, you'll implement a joke recommendation engine and learn how to perform sentiment analysis on Amazon reviews. You'll also explore different clustering techniques to segment customers using wholesale data. In addition to this, the book will get you acquainted with credit card fraud detection using autoencoders, and reinforcement learning to make predictions and win on a casino slot machine.

By the end of the book, you will be equipped to confidently perform complex tasks to build research and commercial projects for automated operations.

What you will learn



  • Explore deep neural networks and various frameworks that can be used in R
  • Develop a joke recommendation engine to recommend jokes that match users' tastes
  • Create powerful ML models with ensembles to predict employee attrition
  • Build autoencoders for credit card fraud detection
  • Work with image recognition and convolutional neural networks
  • Make predictions for casino slot machine using reinforcement learning
  • Implement NLP techniques for sentiment analysis and customer segmentation

Who this book is for



If you're a data analyst, data scientist, or machine learning developer who wants to master machine learning concepts using R by building real-world projects, this is the book for you. Each project will help you test your skills in implementing machine learning algorithms and techniques. A basic understanding of machine learning and working knowledge of R programming is necessary to get the most out of this book.

Number of pages: 399

Year of publication: 2019




R Machine Learning Projects
Implement supervised, unsupervised, and reinforcement learning techniques using R 3.5

Dr. Sunil Kumar Chinnamgari

BIRMINGHAM - MUMBAI

R Machine Learning Projects

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty
Acquisition Editor: Yogesh Deokar
Content Development Editor: Atikho Sapuni Rishana
Technical Editor: Vibhuti Gawde
Copy Editor: Safis Editing
Project Coordinator: Kirti Pisat
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Jisha Chirayil
Production Coordinator: Tejal Daruwale Soni

First published: January 2019

Production reference: 1100119

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78980-794-3

www.packtpub.com

 
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Dedicated to my loving wife, HimaBindu. Sometimes I wonder what am I without you; I've never told you this, but you actually define me.
I'd like to thank my dear friend Nanditha Siva for being there by my side in the difficult times. You are the one who has always had faith in me that "I am something".

Contributors

About the author

Dr. Sunil Kumar Chinnamgari has a PhD in computer science (specializing in machine learning and natural language processing). He is an AI researcher with more than 14 years of industry experience. Currently, he works in the capacity of a lead data scientist with a US financial giant. He has published several research papers in Scopus and IEEE journals, and is a frequent speaker at various meet-ups. He is an avid coder and has won multiple hackathons. In his spare time, Sunil likes to teach, travel, and spend time with family.

About the reviewers

Davor Lozić is a senior software engineer interested in various subjects, especially computer security, algorithms, and data structures. He manages teams of more than 15 engineers and is a professor teaching about database systems. You can contact him at [email protected]. He likes cats! If you want to talk about any aspect of technology, or if you have funny pictures of cats, feel free to contact him.

Giuseppe Ciaburro holds a PhD in environmental technical physics and two master's degrees. His research focuses on machine learning applications in the study of urban sound environments. He works at Built Environment Control Laboratory, Università degli Studi della Campania Luigi Vanvitelli (Italy). He has over 15 years of work experience in programming (in Python, R, and MATLAB), first in the field of combustion and then in acoustics and noise control. He has several publications to his credit.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

R Machine Learning Projects

About Packt

Why subscribe?

Packt.com

Dedication

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Exploring the Machine Learning Landscape

ML versus software engineering

Types of ML methods

Supervised learning

Unsupervised learning

Semi-supervised learning

Reinforcement learning

Transfer learning

ML terminology – a quick review

Deep learning

Big data

Natural language processing

Computer vision

Cost function

Model accuracy

Confusion matrix

Predictor variables

Response variable

Dimensionality reduction

Class imbalance problem

Model bias and variance

Underfitting and overfitting

Data preprocessing

Holdout sample

Hyperparameter tuning

Performance metrics

Feature engineering

Model interpretability

ML project pipeline

Business understanding

Understanding and sourcing the data

Preparing the data 

Model building and evaluation

Model deployment

Learning paradigm

Datasets

Summary

Predicting Employee Attrition Using Ensemble Models

Philosophy behind ensembling 

Getting started

Understanding the attrition problem and the dataset 

K-nearest neighbors model for benchmarking the performance

Bagging

Bagged classification and regression trees (treeBag) implementation

Support vector machine bagging (SVMBag) implementation

Naive Bayes (nbBag) bagging implementation

Randomization with random forests

Implementing an attrition prediction model with random forests

Boosting 

The GBM implementation

Building attrition prediction model with XGBoost

Stacking 

Building attrition prediction model with stacking

Summary

Implementing a Jokes Recommendation Engine

Fundamental aspects of recommendation engines

Recommendation engine categories

Content-based filtering

Collaborative filtering

Hybrid filtering

Getting started

Understanding the Jokes recommendation problem and the dataset

Converting the DataFrame

Dividing the DataFrame

Building a recommendation system with an item-based collaborative filtering technique

Building a recommendation system with a user-based collaborative filtering technique

Building a recommendation system based on an association-rule mining technique

The Apriori algorithm

Content-based recommendation engine

Differentiating between ITCF and content-based recommendations

Building a hybrid recommendation system for Jokes recommendations

Summary

References

Sentiment Analysis of Amazon Reviews with NLP

The sentiment analysis problem

Getting started

Understanding the Amazon reviews dataset

Building a text sentiment classifier with the BoW approach

Pros and cons of the BoW approach

Understanding word embedding

Building a text sentiment classifier with pretrained word2vec word embedding based on Reuters news corpus

Building a text sentiment classifier with GloVe word embedding

Building a text sentiment classifier with fastText

Summary

Customer Segmentation Using Wholesale Data

Understanding customer segmentation

Understanding the wholesale customer dataset and the segmentation problem

Categories of clustering algorithms

Identifying the customer segments in wholesale customer data using k-means clustering

Working mechanics of the k-means algorithm

Identifying the customer segments in the wholesale customer data using DIANA

Identifying the customer segments in the wholesale customers data using AGNES

Summary

Image Recognition Using Deep Neural Networks

Technical requirements

Understanding computer vision

Achieving computer vision with deep learning

Convolutional Neural Networks

Layers of CNNs

Introduction to the MXNet framework

Understanding the MNIST dataset

Implementing a deep learning network for handwritten digit recognition

Implementing dropout to avoid overfitting

Implementing the LeNet architecture with the MXNet library

Implementing computer vision with pretrained models

Summary

Credit Card Fraud Detection Using Autoencoders

Machine learning in credit card fraud detection

Autoencoders explained

Types of AEs based on hidden layers

Types of AEs based on restrictions

Applications of AEs

The credit card fraud dataset

Building AEs with the H2O library in R

Autoencoder code implementation for credit card fraud detection

Summary

Automatic Prose Generation with Recurrent Neural Networks

Understanding language models

Exploring recurrent neural networks

Comparison of feedforward neural networks and RNNs

Backpropagation through time

Problems and solutions to gradients in RNN

Exploding gradients

Vanishing gradients

Building an automated prose generator with an RNN

Implementing the project

Summary

Winning the Casino Slot Machines with Reinforcement Learning

Understanding RL

Comparison of RL with other ML algorithms

Terminology of RL

The multi-arm bandit problem

Strategies for solving MABP

The epsilon-greedy algorithm

Boltzmann or softmax exploration

Decayed epsilon greedy

The upper confidence bound algorithm

Thompson sampling

Multi-arm bandit – real-world use cases

Solving the MABP with UCB and Thompson sampling algorithms

Summary

The Road Ahead

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

R is one of the most popular languages when it comes to performing computational statistics (statistical computing) easily and exploring the mathematical side of machine learning. With this book, you will leverage the R ecosystem to build efficient machine learning applications that carry out intelligent tasks within your organization.

This book will help you test your knowledge and skills, guiding you through ML projects that range from easy to complex. You will first learn how to build powerful machine learning models with ensembles to predict employee attrition. Next, you’ll implement a joke recommendation engine and perform sentiment analysis on Amazon reviews. You’ll also explore different clustering techniques to segment customers using wholesale data. In addition to this, the book will get you acquainted with credit card fraud detection using autoencoders, and reinforcement learning to make predictions and win on a casino slot machine.

By the end of the book, you will be equipped to confidently perform complex tasks to build research and commercial projects for automated operations.

Who this book is for

This book is for data analysts, data scientists, and ML developers who wish to master the concepts of ML using R by building real-world projects. Each project will help you test your expertise to implement the working mechanisms of ML algorithms and techniques. A basic understanding of ML and a working knowledge of R programming is a must.

What this book covers

Chapter 1, Exploring the Machine Learning Landscape, will briefly review the various ML concepts that a practitioner must know. In this chapter, we will cover topics such as supervised learning, reinforcement learning, unsupervised learning, and real-world ML use cases.

Chapter 2, Predicting Employee Attrition Using Ensemble Models, covers the creation of powerful ML models through ensemble learning. The project covered in this chapter is from the human resources domain. Retention of talented employees is a key challenge faced by corporations. If we could predict an employee's attrition well in advance, the human resources or management team could act to prevent the potential attrition from becoming real. As it happens, it is possible to predict employee attrition through the application of ML. This chapter makes use of an IBM-curated public dataset that provides a pseudo employee attrition population and characteristics.

We start the chapter with an introduction to the problem at hand and then explore the dataset with exploratory data analysis (EDA). The next step is the preprocessing phase, which includes the creation of new features using prior domain experience. Once the dataset is fully prepared, models will be created using multiple ensemble techniques, such as bagging, boosting, stacking, and randomization. Lastly, we will deploy the selected model to production. We will also learn about the concepts underlying the various ensemble techniques used to create the models.
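To make the idea of bagging concrete, here is a minimal sketch. It is not the chapter's implementation: it assumes the rpart package (normally shipped with R) is available and uses the built-in iris data as a stand-in for the attrition dataset. Each tree is trained on a bootstrap resample of the training rows, and the ensemble predicts by majority vote.

```r
library(rpart)  # decision trees; a recommended package that normally ships with R

set.seed(1)
train_rows <- sample(nrow(iris), 100)   # iris is only a stand-in dataset
train <- iris[train_rows, ]
test  <- iris[-train_rows, ]

n_models <- 25  # hypothetical ensemble size chosen for illustration

# Train each tree on a bootstrap resample and collect its test predictions.
# The result is a character matrix: one row per test case, one column per tree.
votes <- sapply(seq_len(n_models), function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit  <- rpart(Species ~ ., data = boot, method = "class")
  as.character(predict(fit, newdata = test, type = "class"))
})

# Majority vote across the trees for every test case
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
bagged_accuracy <- mean(bagged_pred == test$Species)
print(bagged_accuracy)
```

Averaging many trees grown on different resamples tends to reduce the variance of a single tree, which is the motivation for bagging.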

Chapter 3, Implementing a Joke Recommendation Engine, introduces recommendation engines, which are designed to predict the ratings that a user would give to content such as movies and music. Based on what a user has previously liked or seen and using other profiling attributes, a recommendation engine suggests new content that the user might like. Such engines have gained a lot of significance in recent years. We explore the exciting area of recommendation systems by working on a joke recommendation engine project. In this chapter, we start by understanding the concepts and types of collaborative filtering algorithms. We will then build a recommendation engine to provide personalized joke recommendations using collaborative filtering approaches such as user-based collaborative filters and item-based collaborative filters. The dataset used for this project is an open dataset called the Jester jokes dataset. Apart from this, we will explore various libraries available in R that can be used to build recommendation systems, and we will compare the performance obtained from these approaches. Additionally, we leverage market basket analysis, a popular technique in the marketing domain, to discern relationships between various jokes.

Chapter 4, Sentiment Analysis of Amazon Reviews with NLP, covers sentiment analysis, which entails finding the sentiment of a sentence and labeling it as positive, negative, or neutral. This chapter introduces sentiment analysis and covers the various techniques that can be used to analyze text. We will understand text-mining concepts and the various ways that text is labeled based on the tone.

We will apply sentiment analysis to Amazon product review data. This dataset contains millions of Amazon customer reviews and star ratings. It is a classification task where we will be categorizing each review as positive, negative, or neutral depending on the tone. Apart from using various popular R text-mining libraries to preprocess the reviews to be classified, we will also be leveraging a wide range of text representations, such as bag of words, word2vec, fastText, and GloVe. Each of the text representations is then used as input for ML algorithms to perform classification. In the course of implementing each of these techniques, we will also learn about the concepts behind them and explore other instances where we could successfully apply them.

Chapter 5, Customer Segmentation Using Wholesale Data, covers the segmentation, grouping, or clustering of customers, which can be achieved through unsupervised learning. We explore the various aspects of customer grouping in this chapter. Customer segmentation is an important tool used by product sellers to understand their customers and gather information. Customers can be segmented based on different criteria, such as age and spending patterns. In this chapter, we learn the various techniques of customer segmentation. For the project, we use a dataset containing wholesale transactions. This dataset is available in the UCI Machine Learning Repository. We will be applying advanced clustering techniques, such as k-means, DIANA, and AGNES. At times, we will not know the number of groups that exist in the dataset at hand. We will explore the ML techniques for dealing with such ambiguity and have ML find out the number of groups possible based on the underlying characteristics of the input data. Evaluating the output of the clustering algorithms is an area that is often challenging to practitioners. We also explore this area so as to have a well-rounded understanding of applying clustering algorithms to real-world problems.
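As an illustration of letting the data suggest the number of groups, the following sketch uses base R's kmeans() on simulated data. The two spending columns, the group centers, and the choice of three groups are assumptions made for this example; they are not taken from the wholesale dataset.

```r
set.seed(123)

# Simulated spend in two hypothetical product categories,
# drawn from three well-separated groups of 50 customers each
group_centers <- rbind(c(2, 2), c(8, 3), c(5, 9))
spend <- do.call(rbind, lapply(1:3, function(g) {
  cbind(fresh   = rnorm(50, mean = group_centers[g, 1], sd = 0.6),
        grocery = rnorm(50, mean = group_centers[g, 2], sd = 0.6))
}))

# Standardize so that both columns contribute equally to the distances
spend_scaled <- scale(spend)

# Total within-cluster sum of squares for k = 1..6 (the "elbow" curve)
wss <- sapply(1:6, function(k) {
  kmeans(spend_scaled, centers = k, nstart = 20)$tot.withinss
})
print(round(wss, 1))

# Fit the final model with the k suggested by the bend in the curve
fit <- kmeans(spend_scaled, centers = 3, nstart = 20)
print(table(fit$cluster))
```

The value of k after which the within-cluster sum of squares stops dropping sharply is a common, though informal, heuristic for the number of segments.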

Chapter 6, Image Recognition Using Deep Neural Networks, covers convolutional neural networks (CNNs), which are a type of deep neural network and are popular in computer vision applications. In this chapter, we learn about the fundamental concepts underlying CNNs. We explore why CNNs work so well with computer vision problems such as object detection. We discuss the aspects of transfer learning and how it works in tandem with CNNs to solve computer vision problems. As elsewhere in the book, we'll be going by the philosophy of learning by doing. We will learn about all of these concepts by applying a CNN in the building of a multi-class classification model on a popular open dataset called MNIST. The objective of the project is to classify given images of handwritten digits. The project explores the methodology for creating features from raw images. We will learn about the various preprocessing techniques that can be applied to the image data in order to use the data with deep learning models.

Chapter 7, Credit Card Fraud Detection Using Autoencoders, covers autoencoders, which are yet another type of unsupervised deep learning network. We start the chapter by understanding autoencoders and how they are different from the other deep learning networks, such as recurrent neural networks (RNNs) and CNNs. We will learn about autoencoders by implementing a project that identifies credit card fraud. Credit card companies are constantly seeking ways to detect credit card fraud. Fraud detection is a key aspect for banks to protect their revenues. It can be achieved through the application of ML in the finance domain for the specific fraud detection problem. Fraud is usually an anomalous event that requires immediate action. In this chapter, we will use an autoencoder to detect fraud. Autoencoders are neural networks that contain a bottleneck layer whose dimensionality is smaller than the input data. In this chapter, we will become familiar with dimensionality reduction and how it can be used to identify credit card fraud. For the project, we will be using the H2O deep learning framework in tandem with R. As far as the dataset is concerned, we use an open dataset that contains credit card transactions of European card holders from September 2013. There are a total of 284,807 transactions, out of which 492 are fraudulent.
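The bottleneck idea can be sketched without a neural network. The example below is an assumption-laden stand-in, not the H2O autoencoder used in the chapter: it uses principal component analysis (prcomp) as a linear analogue of an encoder and decoder, runs on simulated data rather than the credit card dataset, and flags the rows with the largest reconstruction error. It also works out the class imbalance quoted above.

```r
# Class imbalance quoted in the text: 492 frauds in 284,807 transactions
fraud_rate <- 492 / 284807
print(round(100 * fraud_rate, 3))   # percentage of fraudulent transactions

set.seed(7)
# Simulated data: 500 "normal" rows lying close to a 2-D subspace of a
# 5-D space, followed by 5 anomalous rows that ignore that structure
latent    <- matrix(rnorm(500 * 2), ncol = 2)
loadings  <- matrix(rnorm(2 * 5), nrow = 2)
normal    <- latent %*% loadings + matrix(rnorm(500 * 5, sd = 0.1), ncol = 5)
anomalies <- matrix(rnorm(5 * 5, sd = 3), ncol = 5)
x <- rbind(normal, anomalies)

# "Train" the linear bottleneck on normal data only
pca <- prcomp(normal, center = TRUE, scale. = FALSE)
k <- 2                                               # bottleneck width (assumed)

centered      <- sweep(x, 2, pca$center)             # subtract training means
codes         <- centered %*% pca$rotation[, 1:k]    # encode: 5-D to 2-D
reconstructed <- codes %*% t(pca$rotation[, 1:k])    # decode: 2-D back to 5-D
recon_error   <- rowSums((centered - reconstructed)^2)

# Rows with the largest reconstruction error are the anomaly candidates
flagged <- order(recon_error, decreasing = TRUE)[1:5]
print(flagged)
```

Rows that do not follow the structure learned from normal data reconstruct poorly, which is the same signal a trained autoencoder would use to flag suspicious transactions.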

Chapter 8, Automatic Prose Generation with Recurrent Neural Networks, introduces some deep neural networks (DNNs) that have recently received a lot of attention. This is due to their success in obtaining great results in various areas of ML, from face recognition and object detection to music generation and neural art. This chapter introduces the concepts necessary for understanding deep learning. We discuss the nuts and bolts of neural networks, such as neurons, hidden layers, various activation functions, techniques for dealing with problems faced in neural networks, and the optimization algorithms used to obtain the weights of a network. We will also implement a neural network from scratch to demonstrate these concepts. The content of this chapter will help us gain foundational knowledge of neural networks. Then, we will learn how to apply an RNN by doing a project. It has long been thought that creative tasks such as authoring stories, writing poems, and painting pictures can only be achieved by humans. This is no longer true, thanks to deep learning! Technology can now accomplish creative tasks. We will create an application that generates text automatically, based on a long short-term memory (LSTM) network, a variant of RNNs. To accomplish this task, we make use of the MXNet framework, which supports the R language for deep learning. In the course of implementing this project, we will also learn more about the concepts surrounding RNNs and LSTMs.

Chapter 9, Winning the Casino Slot Machines with Reinforcement Learning, begins with an explanation of RL. We discuss the various concepts of RL, including strategies for solving what is called the multi-arm bandit problem. We implement a project that uses UCB and Thompson sampling techniques in order to solve the multi-arm bandit problem.
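For a first impression of the multi-arm bandit setting, here is a minimal Thompson sampling sketch in base R. The three payout probabilities and the number of rounds are hypothetical values chosen for illustration, and UCB is not shown. Each arm keeps a Beta posterior over its win probability; in every round we draw one sample per arm and pull the arm with the highest draw.

```r
set.seed(2019)

true_payout <- c(0.05, 0.10, 0.20)  # hypothetical win probabilities, unknown to the player
n_arms   <- length(true_payout)
n_rounds <- 2000

wins   <- rep(0, n_arms)
losses <- rep(0, n_arms)

for (round in seq_len(n_rounds)) {
  # One draw per arm from its Beta(wins + 1, losses + 1) posterior
  sampled <- rbeta(n_arms, wins + 1, losses + 1)
  arm <- which.max(sampled)

  # Pull the chosen arm and observe a 0/1 reward
  reward <- rbinom(1, size = 1, prob = true_payout[arm])
  wins[arm]   <- wins[arm] + reward
  losses[arm] <- losses[arm] + (1 - reward)
}

pulls <- wins + losses
print(pulls)                               # how often each arm was played
print(round(wins / pmax(pulls, 1), 3))     # observed win rate per arm
```

Because arms with uncertain posteriors still produce high draws now and then, the strategy keeps exploring early on and gradually concentrates its pulls on the arm that appears to pay best.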

Appendix, The Road Ahead, briefly discusses the advancements in the ML world and the need to stay on top of them.

To get the most out of this book

The projects covered in this book are intended to give you practical knowledge of applying various ML techniques to real-world problems. You are expected to have a good working knowledge of R and a basic understanding of ML before starting the projects.

It should also be noted that the code for the projects is implemented using R version 3.5.2 (2018-12-20), nicknamed Eggshell Igloo. The project code has been successfully tested on Linux Mint 18.3 Sylvia. There is no reason to believe that the code does not work on other platforms, such as Windows; however, this is not something that has been tested by the author.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packt.com.

2. Select the SUPPORT tab.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/R-Machine-Learning-Projects. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781789807943_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The rsample library incorporates this dataset, and we can make use of this dataset directly from the library."

A block of code is set as follows:

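# Set the working directory to the chapter folder
setwd("~/Desktop/chapter 2")
# Load rsample and the attrition dataset it provides
library(rsample)
data(attrition)
# Inspect the structure, then work on a copy
str(attrition)
mydata <- attrition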

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "You may recollect the Customers Who Bought This Item Also Bought This heading on Amazon (or any e-commerce site) where recommendations are shown."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Exploring the Machine Learning Landscape

Machine learning (ML) is an amazing subfield of Artificial Intelligence (AI) that tries to mimic the learning behavior of humans. Similar to the way a baby learns by observing the examples it encounters, an ML algorithm learns the outcome or response to a future incident by observing the data points that are provided as input to it.

In this chapter, we will cover the following topics:

ML versus software engineering

Types of ML methods

ML terminology—a quick review

ML project pipeline

Learning paradigm

Datasets

ML versus software engineering

With many people transitioning from traditional software engineering practice to ML, it is important to understand the underlying difference between the two areas. Superficially, both seem to generate some sort of code to perform a particular task. The difference is that in software engineering a programmer explicitly writes a program with various responses based on several conditions, whereas an ML algorithm infers the rules of the game by observing the input examples. The learned rules are then used to make decisions when new input data is fed to the system.

As you can observe in the following diagram, automatically inferring the actions from data without manual intervention is the key differentiator between ML and traditional programming:

Another key differentiator of ML from traditional programming is that the knowledge acquired through ML is able to generalize beyond the training samples by successfully interpreting data that the algorithm has never seen before, while a program coded in traditional programming can only perform the responses that were included as part of the code.
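The contrast between a hand-written rule and a learned rule can be sketched in a few lines of R. This is only an illustration on the built-in iris data: the 2.5 cutoff in the hand-written rule and the helper names are hypothetical, and the "learning algorithm" is a deliberately simple search for the cutoff that makes the fewest mistakes on the training rows.

```r
is_setosa <- iris$Species == "setosa"
train_idx <- seq(1, nrow(iris), by = 2)   # odd rows are used for training
test_idx  <- seq(2, nrow(iris), by = 2)   # even rows are never seen during training

# Traditional programming: the programmer fixes the rule in code.
# The 2.5 cutoff is a hypothetical value picked by hand.
classify_by_rule <- function(petal_length) {
  ifelse(petal_length < 2.5, "setosa", "other")
}

# ML: the cutoff is inferred from labeled examples by trying every
# observed value and keeping the one with the lowest training error.
learn_cutoff <- function(x, y) {
  candidates <- sort(unique(x))
  errors <- sapply(candidates, function(cutoff) mean((x < cutoff) != y))
  candidates[which.min(errors)]
}

learned_cutoff <- learn_cutoff(iris$Petal.Length[train_idx], is_setosa[train_idx])

# Apply the learned rule to data the algorithm has never seen
predicted <- iris$Petal.Length[test_idx] < learned_cutoff
accuracy  <- mean(predicted == is_setosa[test_idx])

print(learned_cutoff)
print(accuracy)
print(classify_by_rule(c(1.4, 4.7)))
```

If the data changes, the learned cutoff changes with it after retraining, while the hand-written rule stays the same until someone edits the code.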

Yet another differentiator is that in software engineering, there are certain specific ways to solve a problem at hand. Given an algorithm developed based on certain assumptions about the inputs and the conditions incorporated, you can guarantee the output that will be obtained for a given input. In the ML world, it is not possible to provide such assurances about the output obtained from the algorithms. It is also very difficult in the ML world to confirm whether a particular technique is better than another without actually trying both techniques on the dataset for the problem at hand.
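Trying two techniques on the same data can look like the sketch below. It assumes the MASS and class packages (both normally shipped with R) and uses iris purely as an example dataset; the two models, the 105/45 split, and k = 5 are arbitrary choices for illustration.

```r
library(MASS)   # lda(): linear discriminant analysis
library(class)  # knn(): k-nearest neighbors

set.seed(42)
train_rows <- sample(nrow(iris), size = 105)   # roughly 70% for training
train <- iris[train_rows, ]
test  <- iris[-train_rows, ]                   # holdout sample

# Technique 1: linear discriminant analysis
lda_fit  <- lda(Species ~ ., data = train)
lda_pred <- predict(lda_fit, newdata = test)$class

# Technique 2: k-nearest neighbors with k = 5 (an arbitrary choice)
knn_pred <- knn(train = train[, 1:4], test = test[, 1:4],
                cl = train$Species, k = 5)

# Compare both techniques on the same holdout rows
accuracy <- c(
  lda = mean(lda_pred == test$Species),
  knn = mean(knn_pred == test$Species)
)
print(accuracy)
```

Which model scores higher depends on the dataset and even on the random split, which is why the comparison has to be run rather than assumed.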

ML and software engineering are not the same! ML projects may involve some software engineering in them, but ML cannot be considered to be the same as software engineering.

While more than one formal definition of ML exists, the following are a few key definitions encountered often:

"Machine learning is the science of getting computers to act without being explicitly programmed." - Stanford
"Machine learning is based on algorithms that can learn from data without relying on rules-based programming." - McKinsey and Co.

With the rise of data as the fuel of the future, the terms AI, ML, data mining, data science, and data analytics are used interchangeably by industry practitioners. It is important to understand the key differences between these terms to avoid confusion.

The terms AI, ML, data mining, data science, and data analytics, though used interchangeably, are not the same!

Let's take a look at the following terms:

AI: AI is a paradigm in which machines are able to perform tasks in a smart way. Note that this definition does not specify whether the smartness of machines is achieved manually or automatically. Therefore, it is safe to assume that even a program written with several if...else or switch...case statements and then embedded in a machine to carry out tasks may be considered to be AI.

ML: ML, on the other hand, is a way for the machine to achieve smartness by learning from the data that is provided as input and, thereby, we have a smart machine performing a task. It may be observed that ML achieves the same objective of AI except that the smartness is achieved automatically. Therefore, it can be concluded that ML is simply a way to achieve AI.

Data mining: Data mining is a specific field that focuses on discovering the unknown properties of the datasets. The primary objective of data mining is to extract rules from large amounts of data provided as input, whereas in ML, an algorithm not only infers rules from the data input, but also uses the rules to perform predictions on any new, incoming data.

Data analytics: Data analytics is a field that encompasses fundamental descriptive statistics, data visualization, and the communication of conclusions drawn from the data points. Data analytics may be considered to be a basic level within data science. It is normal for practitioners to perform data analytics on the input data provided for data mining or ML exercises. Such analysis of data is generally termed exploratory data analysis (EDA).
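As a rough illustration of what EDA looks like in practice, the following sketch computes descriptive statistics on R's built-in iris dataset. The choice of dataset and of these particular summaries is an assumption made for the example; it is not taken from the projects in this book.

```r
# Exploratory data analysis on the built-in iris dataset.
data(iris)

# Structure: variable names, types, and a preview of the values.
str(iris)

# Descriptive statistics (min, quartiles, mean, max) for every column.
summary(iris)

# Mean of each numeric measurement, computed per species.
group_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(group_means)

# Pairwise correlations between the four numeric measurements.
round(cor(iris[, 1:4]), 2)
```

Summaries like these are usually the first step before any data mining or ML technique is applied to the data.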

Data science: Data science is an umbrella term that includes data analytics, data mining, ML, and any specific domain expertise pertaining to the field of work. Data science is a concept that includes several aspects of handling the data such as acquiring the data from one or more sources, data cleansing, data preparation, and creating new data points based on existing data. It includes performing data analytics. It also encompasses using one or more data mining or ML techniques on the data to infer knowledge to create an algorithm that performs a task on unseen data. This concept also includes deploying the algorithm in a way that it is useful to perform the designated tasks in the future.

The following is a Venn diagram which demonstrates the skills required by a professional working in the data science ambit. It has three circles, each of which defines a specific skill that a data science professional should have:

Let's explore the following skills mentioned in the preceding diagram:

Math & Statistics Knowledge: This skill is required to analyze the statistical properties of the data.

Hacking Skills: Programming skills play a key role in processing the data quickly. They are also needed to apply the ML algorithm and create an output that will perform the prediction on unseen data.

Substantive Expertise: This skill refers to the domain expertise in the field of the problem at hand. It helps the professional to be able to provide proper inputs to the system from which it can learn and to assess the appropriateness of the inputs and results obtained.

To be a successful data science professional, you need math and programming skills as well as knowledge of the business domain.

As we can see, AI, data science, data analytics, data mining, and ML are all interlinked. All of these areas are among the most in-demand domains in the industry right now. The right skill sets in combination with real-world experience will lead to a strong career in these areas. As ML forms the core of this space, the next section explores the various types of ML methods that may be applied to several real-world problems.

ML is everywhere! Most of the time, we may be using something that is ML-based but don’t realize its existence or the influence that it has on our lives! Let's explore together some very popular devices or applications that we experience on a daily basis, which are powered by ML:

Virtual personal assistants (VPAs) such as Google Allo, Alexa, Google Now, Google Home, Siri, and so on

Smart maps that show you traffic predictions, given your source and destination

Demand-based price surging in Uber or similar transportation services

Automated video surveillance in airports, railway stations, and other public places

Face recognition of individuals in pictures posted on social media sites such as Facebook

Personalized news feeds served to you on Facebook

Advertisements served to you on YouTube

People you may know suggestions on Facebook and other similar sites

Job recommendations on LinkedIn, based on your profile

Automated responses on Google Mail

Chatbots that you converse with in online customer support forums

Search engine results filtering

Email spam filtering

Of course, the list does not end here. The applications mentioned are just a few of the basic ones that illustrate the influence that ML has on our lives today. It is no exaggeration to say that there is hardly a subject area that ML has not touched!

The topics in this section are by no means an exhaustive description of ML, but just a quick touch point to get us started on a journey of exploration. Now that we have a basic understanding of what ML is and where it can be applied, let's delve deeper into other ML-related topics in the next section.

Types of ML methods

Several types of tasks that aim at solving real-world problems can be achieved thanks to ML. An ML method generally means a group of specific types of algorithms that are suitable for solving a particular kind of problem and the method addresses any constraints that the problem brings along with it. For example, a constraint of a particular problem could be the availability of labeled data that can be provided as input to the learning algorithm.

Essentially, the popular ML methods are supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, and transfer learning. The rest of this section details each of these methods.

Supervised learning

A supervised learning algorithm is applied when one is very clear about the result that needs to be achieved from a problem but unsure about the relationships in the data that affect the output. We would like the ML algorithm that we apply to the data to perceive these relationships between different data elements so as to achieve the desired output.

The concept can be better explained with an example. Prior to extending a loan, a bank would like to predict whether the applicant will pay the loan back. In this case, the problem is very clear. If a loan is extended to a prospective customer X, there are two possibilities: X successfully repays the loan, or X does not. The bank would like to use ML to identify the category into which customer X falls: a successful repayer of the loan or a repayment defaulter.

While the problem definition that is to be solved is clear, please note that the features of a customer that will contribute to successful loan repayment or non-repayment are not clear and this is something we would like the ML algorithm to learn by observing the patterns in the data.

The major challenge here is that we need to provide input data that represents both customers who repaid their loans successfully and customers who failed to repay. The bank can simply look at its historical data to get the records of customers in both categories and then label each record as paid or unpaid, as appropriate.

The records, thus labeled, now become input to a supervised learning algorithm so that it can learn the patterns of both categories of customers. The process of learning from the labeled data is called training and the output obtained (algorithm) from the learning process is called a model. Ideally, the bank would keep some part of the labeled data aside from training data so as to be able to test the model created, and this data is termed as test data. It should be no surprise that the labeled data that is used for training the model is called training data.

Once the model has been built, it is tested with the test data to check that it yields a satisfactory level of performance; if it does not, model-building iterations are carried out until the desired model performance is obtained. The model that achieves the desired performance on test data can then be used by the bank to infer whether a new loan applicant is likely to default and, in turn, to make a better decision about extending a loan to that applicant.
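The training and testing workflow described above can be sketched with base R. Everything about the data here is an assumption made for illustration: the features (`income`, `debt`), the way the labels are generated, the 70/30 split, and the choice of logistic regression as the supervised algorithm.

```r
set.seed(42)
n <- 500

# Hypothetical applicant features (synthetic data, for illustration only).
income <- rnorm(n, mean = 50, sd = 15)   # annual income, in thousands
debt   <- rnorm(n, mean = 20, sd = 8)    # existing debt, in thousands

# Synthetic labels: higher income and lower debt make repayment more likely.
p_repay <- plogis(0.08 * (income - 50) - 0.15 * (debt - 20))
repaid  <- rbinom(n, size = 1, prob = p_repay)
loans   <- data.frame(income, debt, repaid)

# Keep 30% of the labeled records aside as test data.
test_idx <- sample(n, size = round(0.3 * n))
train <- loans[-test_idx, ]
test  <- loans[test_idx, ]

# Training: fit a logistic regression model on the training data.
model <- glm(repaid ~ income + debt, data = train, family = binomial)

# Testing: predict on the held-out records and measure accuracy.
prob     <- predict(model, newdata = test, type = "response")
pred     <- ifelse(prob > 0.5, 1, 0)
accuracy <- mean(pred == test$repaid)
accuracy
```

If the accuracy on the test data is not satisfactory, the model-building step is repeated, for example with different features or a different algorithm, before the model is used on new applicants.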

In a nutshell, supervised ML algorithms are employed when the objective is very clear and labeled data is available as input for the algorithm to learn the patterns from. The following diagram summarizes the supervised learning process:

Supervised learning can be further divided into two categories, namely classification and regression. The prediction of a bank loan defaulter explained in this section is an example of classification and it aims to predict a label of a nominal type such as yes or no. On the other hand, it is also possible to predict numeric values (continuous values) and this type of prediction is called regression. An example of regression is predicting the monthly rental of a home in a prime location of a city based on features such as the demand for houses in the area, the number of bedrooms, the dimensions of the house, and accessibility to public transportation.
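For the regression case, a minimal sketch with base R's `lm()` could look as follows. The rental data is synthetic, and the features (`bedrooms`, `area`) and the coefficients used to generate the rent are assumptions chosen only to make the example self-contained.

```r
set.seed(1)
n <- 200

# Hypothetical home features (synthetic data).
bedrooms <- sample(1:4, n, replace = TRUE)
area     <- rnorm(n, mean = 900, sd = 200)   # square feet

# Synthetic monthly rent: a linear signal plus random noise.
rent  <- 300 + 250 * bedrooms + 1.2 * area + rnorm(n, sd = 100)
homes <- data.frame(bedrooms, area, rent)

# Fit a linear regression model: the target is a continuous value.
fit <- lm(rent ~ bedrooms + area, data = homes)
coef(fit)

# Predict the rent of a new, unseen home.
predicted_rent <- predict(fit, newdata = data.frame(bedrooms = 3, area = 1000))
predicted_rent
```

Unlike the loan example, the prediction here is a number rather than a yes or no label, which is what distinguishes regression from classification.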

Several supervised learning algorithms exist, and a few popularly known algorithms in this area include classification and regression trees (CART), logistic regression, linear regression, Naive Bayes, neural networks, k-nearest neighbors (KNN), and support vector machine (SVM).

Unsupervised learning

The availability of labeled data is not very common and manually labeling data is also not cheap. This is the situation where unsupervised learning comes into play.

For example, one small boutique firm wants to roll out a promotion to its customers, who are registered on their Facebook page. While the business objective is clear—that a promotion needs to be rolled out to customers—it is unclear as to which customer falls under which group. Unlike the supervised learning method where prior knowledge existed in terms of bad debtors and good debtors, in this case there are no such clues.

When the customer information is given as input to an unsupervised learning algorithm, it tries to identify the patterns in the data and thereby groups customers with similar kinds of attributes.

Birds of a feather flock together is the principle followed in customer grouping with unsupervised learning.

The reasoning behind the formation of these organic groups from the grouping exercise may not be very intuitive. It may take some research to identify the factors that contributed to the gathering of a set of customers in a group. Most of the time, this research is manual and the data points in each group need verifying. This research may form the basis to determine the groups to which the particular promotion at hand needs to be rolled out. This application of unsupervised learning is called clustering. The following diagram shows the application of unsupervised ML to cluster the data points:

There are a number of clustering algorithms. Popular ones include k-means clustering, k-modes clustering, hierarchical clustering, and fuzzy clustering.
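As a minimal sketch of clustering, the following uses base R's `kmeans()` on synthetic customer data. The two features (`spend`, `visits`), the two hidden groups, and the choice of two clusters are assumptions made for the example; in a real exercise the number of clusters is usually not known in advance.

```r
set.seed(7)

# Synthetic customers: two hypothetical features, two hidden groups of 50 each.
spend  <- c(rnorm(50, mean = 20, sd = 3),   rnorm(50, mean = 80, sd = 3))
visits <- c(rnorm(50, mean = 2,  sd = 0.5), rnorm(50, mean = 10, sd = 0.5))
customers <- data.frame(spend, visits)

# Scale the features so both contribute comparably to the distances,
# then group the customers into two clusters. No labels are supplied.
km <- kmeans(scale(customers), centers = 2, nstart = 10)

# Size of each cluster and the average profile of its members.
table(km$cluster)
aggregate(customers, by = list(cluster = km$cluster), FUN = mean)
```

The per-cluster averages are the starting point for the manual research mentioned above: they hint at what the members of each group have in common.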

Other forms of unsupervised learning do exist. For example, in the retail industry, an unsupervised learning method called association rule mining is applied to customer purchases to identify the goods that are purchased together. In this case, unlike supervised learning, there is no need for labels at all. The task only requires the ML algorithm to identify the latent associations between the products that are billed together by customers. Having the information from association rule mining helps retailers place the products that are bought together in proximity. The idea is that customers can be intuitively encouraged to buy the extra products.

Apriori, equivalence class transformation (Eclat), and frequent pattern growth (FP-growth) are popular among the several algorithms that exist to perform association rule mining.
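Association rule mining algorithms rank candidate rules using measures such as support and confidence. The following base R sketch computes both for a handful of hypothetical baskets; it illustrates the measures only and is not an implementation of any of the algorithms named above. The basket contents and the helper names `support` and `confidence` are made up for the example.

```r
# Hypothetical shopping baskets (one character vector per transaction).
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "jam"),
  c("milk", "butter"),
  c("bread", "butter", "jam")
)

# Support: fraction of baskets that contain every item in `items`.
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs).
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

support(c("bread", "butter"))   # 0.6: three of the five baskets hold both
confidence("bread", "butter")   # 0.75: three of the four bread baskets hold butter
```

A retailer would typically keep only the rules whose support and confidence exceed chosen thresholds and use them to decide which products to place together.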

Yet another form of unsupervised learning is anomaly detection or outlier detection. The goal of the exercise is to identify data points that do not belong to the rest of the elements that are given as input to the unsupervised learning algorithm. Similar to association rule mining, due to the nature of the problem at hand, there is no requirement for labels to be made use of by the algorithm to achieve the goal.

Fraud detection is an important application of anomaly detection in the credit card industry. Credit card transactions are monitored in real time and any spurious transaction patterns are flagged immediately to avoid losses to the credit card user as well as the credit card provider. The unusual pattern that is monitored for could be a huge transaction in a foreign currency rather than the currency in which the particular customer normally transacts. It could also be transactions in physical stores located on two different continents on the same day. The general idea is to flag a pattern that deviates from the norm.

K-means clustering and one-class SVM are two well-known unsupervised ML algorithms that are used to observe abnormalities in the population.
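One simple way to turn the clustering idea into an anomaly detector is to measure how far each data point lies from its cluster centroid and flag the farthest points. The sketch below uses a single centroid, which is what k-means with one cluster reduces to. The transaction data is synthetic, the two injected outliers are assumptions, and flagging exactly the two farthest points is an arbitrary choice made for the example.

```r
set.seed(3)

# Synthetic transactions: 200 routine ones plus two injected outliers
# (rows 201 and 202: very large amounts at unusual hours).
amount <- c(rnorm(200, mean = 50, sd = 10), 400, 520)
hour   <- c(rnorm(200, mean = 14, sd = 2), 3, 4)
tx <- scale(cbind(amount, hour))

# Centroid of all transactions (the single-cluster k-means solution).
centroid <- colMeans(tx)

# Euclidean distance of every transaction from the centroid.
dist_to_centroid <- sqrt(rowSums(sweep(tx, 2, centroid)^2))

# Flag the two transactions that deviate most from the norm.
flagged <- order(dist_to_centroid, decreasing = TRUE)[1:2]
flagged
```

No labels are used at any point: the transactions are flagged purely because they deviate from the bulk of the data.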

Overall, it may be understood that unsupervised learning is unarguably a very important method, given that labeled data used for training is a scarce resource.

Semi-supervised learning

Semi-supervised learning is a hybrid of both supervised and unsupervised methods. ML requires large amounts of data for training. Most of the time, the performance of a model improves as the amount of data used for training it increases.

In niche domains such as medical imaging, a large amount of image data (MRIs, X-rays, CT scans) is available. However, the time and availability of qualified radiologists to label these images are scarce. In this situation, we might end up with only a few images labeled by radiologists.

Semi-supervised learning takes advantage of the few labeled images by building an initial model that is used to label the large amount of unlabeled data that exists in the domain. Once the large amount of labeled data is available, a supervised ML algorithm may be used to train and create a final model that is used for prediction tasks on the unseen data. The following diagram illustrates the steps involved in semi-supervised learning:

Speech analysis, protein synthesis, and web content classification are areas where large amounts of unlabeled data and smaller amounts of labeled data are available. Semi-supervised learning is applied in these areas with successful results.

Generative adversarial networks (GANs), semi-supervised support vector machines (S3VMs), graph-based methods, and Markov chain methods are well-known methods among others in the semi-supervised ML area.
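The label-then-retrain idea behind semi-supervised learning can be sketched in base R as follows. This is a minimal pseudo-labeling illustration on synthetic data, not one of the methods named above: the two features, the 20 labeled records, and the nearest-centroid classifier (with the hypothetical helpers `fit_centroids` and `predict_centroids`) are all assumptions chosen to keep the example short.

```r
set.seed(11)
n <- 300

# Synthetic two-class data; in practice most of these labels would be unknown.
y  <- rep(c(0, 1), each = n / 2)
x1 <- rnorm(n, mean = 2 * y)
x2 <- rnorm(n, mean = 2 * y)
d  <- data.frame(x1, x2, y)

# Pretend an expert labeled only 10 records per class.
labeled_idx <- c(sample(which(y == 0), 10), sample(which(y == 1), 10))
labeled   <- d[labeled_idx, ]
unlabeled <- d[-labeled_idx, ]
true_y    <- unlabeled$y     # hidden from the learner, kept only for checking
unlabeled$y <- NA

# A deliberately simple classifier: assign a point to the nearest class centroid.
fit_centroids <- function(df) {
  rbind("0" = colMeans(df[df$y == 0, c("x1", "x2")]),
        "1" = colMeans(df[df$y == 1, c("x1", "x2")]))
}
predict_centroids <- function(cen, df) {
  d0 <- (df$x1 - cen["0", "x1"])^2 + (df$x2 - cen["0", "x2"])^2
  d1 <- (df$x1 - cen["1", "x1"])^2 + (df$x2 - cen["1", "x2"])^2
  as.numeric(d1 < d0)
}

# Step 1: build an initial model from the few labeled records.
initial <- fit_centroids(labeled)

# Step 2: use it to pseudo-label the large pool of unlabeled records.
unlabeled$y <- predict_centroids(initial, unlabeled)

# Step 3: train the final model on labeled plus pseudo-labeled records.
final <- fit_centroids(rbind(labeled, unlabeled))

# Because the data is synthetic, the pseudo-labels can be checked.
pseudo_label_accuracy <- mean(unlabeled$y == true_y)
pseudo_label_accuracy
```

In a real setting the true labels of the unlabeled pool are not available, so the quality of the pseudo-labels has to be judged indirectly, for example on a small labeled validation set.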

Reinforcement learning

Reinforcement learning (RL) is an ML method that is neither supervised learning nor unsupervised learning. In this method, a reward definition is provided as input to this kind of a learning algorithm at the start. As the algorithm is not provided with labeled data for training, this type of learning algorithm cannot be categorized as supervised learning. On the other hand, it is not categorized as unsupervised learning, as the algorithm is fed with information on reward definition that guides the algorithm through taking the steps to solve the problem at hand.

Reinforcement learning aims to improve the strategies used to solve any problem continuously by relying on the feedback received. The goal is to maximize the rewards while taking steps to solve the problem. The rewards obtained are computed by the algorithm itself going by the rewards and penalty definitions. The idea is to achieve optimal steps that maximize the rewards to solve the problem at hand.

The following diagram is an illustration depicting a robot automatically determining the ideal behavior through a reinforcement learning method within the specific context of fire:

A machine outplaying humans at Atari video games is regarded as one of the foremost success stories of reinforcement learning. To achieve this feat, the algorithm played a large number of games, observing the screen and the score, and learned the steps to take to maximize the reward. The reward in this case is the final score. After learning, the algorithm simply chose, at each step of the game, the action that eventually maximized the score obtained.

Though it might appear that reinforcement learning can be applied to game scenarios only, there are numerous use cases for this method in industry as well. The following are three such use cases:

Dynamic pricing of goods and services based on spontaneous supply and demand, targeted at achieving profit maximization, is achieved through a variant of reinforcement learning called Q-learning.

Effective use of space in warehouses is a key challenge faced by inventory management professionals. Market demand fluctuations, the large availability of inventory stocks, and delays in refilling the inventory are the key constraints that affect space utilization. Reinforcement learning algorithms are used to optimize the time to procure inventory as well as to reduce the time to retrieve the goods from warehouses, thereby directly impacting the space management issue referred to as a problem in the inventory management area.

Prolonged treatments and differential drug administration are required in medical science to treat diseases such as cancer. The treatments are highly personalized, based on the characteristics of the patient. Treatment often involves variations of the treatment strategy at various stages. This kind of treatment plan is typically referred to as a dynamic treatment regime (DTR). Reinforcement learning helps with processing clinical trial data to come up with the appropriate personalized DTR for the patient, based on the characteristics of the patient that are fed in as inputs to the reinforcement learning algorithm.

There are four very popular reinforcement learning algorithms, namely Q-learning, state-action-reward-state-action (SARSA), deep Q network (DQN), and deep deterministic policy gradient (DDPG).
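To make the reward-driven loop concrete, here is a minimal tabular Q-learning sketch in base R. The environment is a hypothetical five-state corridor in which the agent starts at state 1 and receives a reward of 1 only on reaching state 5; the corridor, the reward definition, and the values of `alpha`, `gamma`, and `epsilon` are all assumptions made for this illustration.

```r
set.seed(5)

n_states <- 5            # states 1..5; state 5 is the goal (terminal)
actions  <- c(-1, 1)     # action 1 = move left, action 2 = move right
Q <- matrix(0, nrow = n_states, ncol = length(actions))

alpha   <- 0.5           # learning rate (assumed)
gamma   <- 0.9           # discount factor (assumed)
epsilon <- 0.2           # exploration probability (assumed)

for (episode in 1:200) {
  s <- 1
  while (s != n_states) {
    # Epsilon-greedy action selection; ties are broken at random.
    if (runif(1) < epsilon || Q[s, 1] == Q[s, 2]) {
      a <- sample(1:2, 1)
    } else {
      a <- which.max(Q[s, ])
    }

    # Environment step: move within the corridor and compute the reward.
    s_next <- min(max(s + actions[a], 1), n_states)
    reward <- if (s_next == n_states) 1 else 0

    # Q-learning update: move Q[s, a] toward
    # reward + discounted value of the best action in the next state.
    Q[s, a] <- Q[s, a] + alpha * (reward + gamma * max(Q[s_next, ]) - Q[s, a])
    s <- s_next
  }
}

# Greedy policy for the non-terminal states (2 means "move right").
policy <- apply(Q[1:(n_states - 1), ], 1, which.max)
policy
```

The agent is never told that moving right is correct. It only receives the reward definition, and the preference for moving right emerges from the feedback accumulated in the Q table.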