Mastering Predictive Analytics with R - Second Edition

James D. Miller
Description

Master the craft of predictive modeling in R by developing strategy, intuition, and a solid foundation in essential concepts

About This Book

  • Grasping the major methods of predictive modeling and moving beyond black box thinking to a deeper level of understanding
  • Leveraging the flexibility and modularity of R to experiment with a range of different techniques and data types
  • Packed with practical advice and tips explaining important concepts and best practices to help you understand quickly and easily

Who This Book Is For

Although budding data scientists, predictive modelers, or quantitative analysts with only basic exposure to R and statistics will find this book useful, experienced data science professionals wishing to attain master-level status will also find it extremely valuable. This book assumes familiarity with the fundamentals of R, such as the main data types, simple functions, and how to move data around. Although no prior experience with machine learning or predictive modeling is required, some of the advanced topics covered will require more than novice exposure.

What You Will Learn

  • Master the steps involved in the predictive modeling process
  • Grow your expertise in using R and its diverse range of packages
  • Learn how to classify predictive models and distinguish which models are suitable for a particular problem
  • Understand the steps for tidying data and improving performance metrics
  • Recognize the assumptions, strengths, and weaknesses of a predictive model
  • Understand how and why each predictive model works in R
  • Select appropriate metrics to assess the performance of different types of predictive model
  • Explore word embedding and recurrent neural networks in R
  • Train models in R that can work on very large datasets

In Detail

R offers a free and open source environment that is perfect for both learning and deploying predictive modeling solutions. With its constantly growing community and plethora of packages, R offers the functionality to deal with a truly vast array of problems.

The book begins with a dedicated chapter on the language of models and the predictive modeling process. You will understand the learning curve and the process of tidying data. Each subsequent chapter tackles a particular type of model, such as neural networks, and focuses on the three important questions of how the model works, how to use R to train it, and how to measure and assess its performance using real-world datasets. How do you train models that can handle really large datasets? This book will also show you just that. Finally, you will tackle the really important topic of deep learning by implementing applications on word embedding and recurrent neural networks.

By the end of this book, you will have explored and tested the most popular modeling techniques in use on real-world datasets and mastered a diverse range of techniques in predictive analytics using R.

Style and approach

This book takes a step-by-step approach in explaining the intermediate to advanced concepts in predictive analytics. Every concept is explained in depth, supplemented with practical examples applicable in a real-world setting.




Table of Contents

Mastering Predictive Analytics with R Second Edition
Credits
About the Authors
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Gearing Up for Predictive Modeling
Models
Learning from data
The core components of a model
Our first model – k-nearest neighbors
Types of model
Supervised, unsupervised, semi-supervised, and reinforcement learning models
Parametric and nonparametric models
Regression and classification models
Real-time and batch machine learning models
The process of predictive modeling
Defining the model's objective
Collecting the data
Picking a model
Pre-processing the data
Exploratory data analysis
Feature transformations
Encoding categorical features
Missing data
Outliers
Removing problematic features
Feature engineering and dimensionality reduction
Training and assessing the model
Repeating with different models and final model selection
Deploying the model
Summary
2. Tidying Data and Measuring Performance
Getting started
Tidying data
Categorizing data quality
The first step
The next step
The final step
Performance metrics
Assessing regression models
Assessing classification models
Assessing binary classification models
Cross-validation
Learning curves
Plot and ping
Summary
3. Linear Regression
Introduction to linear regression
Assumptions of linear regression
Simple linear regression
Estimating the regression coefficients
Multiple linear regression
Predicting CPU performance
Predicting the price of used cars
Assessing linear regression models
Residual analysis
Significance tests for linear regression
Performance metrics for linear regression
Comparing different regression models
Test set performance
Problems with linear regression
Multicollinearity
Outliers
Feature selection
Regularization
Ridge regression
Least absolute shrinkage and selection operator (lasso)
Implementing regularization in R
Polynomial regression
Summary
4. Generalized Linear Models
Classifying with linear regression
Introduction to logistic regression
Generalized linear models
Interpreting coefficients in logistic regression
Assumptions of logistic regression
Maximum likelihood estimation
Predicting heart disease
Assessing logistic regression models
Model deviance
Test set performance
Regularization with the lasso
Classification metrics
Extensions of the binary logistic classifier
Multinomial logistic regression
Predicting glass type
Ordinal logistic regression
Predicting wine quality
Poisson regression
Negative Binomial regression
Summary
5. Neural Networks
The biological neuron
The artificial neuron
Stochastic gradient descent
Gradient descent and local minima
The perceptron algorithm
Linear separation
The logistic neuron
Multilayer perceptron networks
Training multilayer perceptron networks
The back propagation algorithm
Predicting the energy efficiency of buildings
Evaluating multilayer perceptrons for regression
Predicting glass type revisited
Predicting handwritten digits
Receiver operating characteristic curves
Radial basis function networks
Summary
6. Support Vector Machines
Maximal margin classification
Support vector classification
Inner products
Kernels and support vector machines
Predicting chemical biodegradation
Predicting credit scores
Multiclass classification with support vector machines
Summary
7. Tree-Based Methods
The intuition for tree models
Algorithms for training decision trees
Classification and regression trees
CART regression trees
Tree pruning
Missing data
Regression model trees
CART classification trees
C5.0
Predicting class membership on synthetic 2D data
Predicting the authenticity of banknotes
Predicting complex skill learning
Tuning model parameters in CART trees
Variable importance in tree models
Regression model trees in action
Improvements to the M5 model
Summary
8. Dimensionality Reduction
Defining DR
Correlated data analyses
Scatterplots
Causation
The degree of correlation
Reporting on correlation
Principal component analysis
Using R to understand PCA
Independent component analysis
Defining independence
ICA pre-processing
Factor analysis
Explore and confirm
Using R for factor analysis
The output
NNMF
Summary
9. Ensemble Methods
Bagging
Margins and out-of-bag observations
Predicting complex skill learning with bagging
Predicting heart disease with bagging
Limitations of bagging
Boosting
AdaBoost
AdaBoost for binary classification
Predicting atmospheric gamma ray radiation
Predicting complex skill learning with boosting
Limitations of boosting
Random forests
The importance of variables in random forests
XGBoost
Summary
10. Probabilistic Graphical Models
A little graph theory
Bayes' theorem
Conditional independence
Bayesian networks
The Naïve Bayes classifier
Predicting the sentiment of movie reviews
Predicting promoter gene sequences
Predicting letter patterns in English words
Summary
11. Topic Modeling
An overview of topic modeling
Latent Dirichlet Allocation
The Dirichlet distribution
The generative process
Fitting an LDA model
Modeling the topics of online news stories
Model stability
Finding the number of topics
Topic distributions
Word distributions
LDA extensions
Modeling tweet topics
Word clouding
Summary
12. Recommendation Systems
Rating matrix
Measuring user similarity
Collaborative filtering
User-based collaborative filtering
Item-based collaborative filtering
Singular value decomposition
Predicting recommendations for movies and jokes
Loading and pre-processing the data
Exploring the data
Evaluating binary top-N recommendations
Evaluating non-binary top-N recommendations
Evaluating individual predictions
Other approaches to recommendation systems
Summary
13. Scaling Up
Starting the project
Data definition
Experience
Data of scale – big data
Using Excel to gauge your data
Characteristics of big data
Volume
Varieties
Sources and spans
Structure
Statistical noise
Training models at scale
Pain by phase
Specific challenges
Heterogeneity
Scale
Location
Timeliness
Privacy
Collaborations
Reproducibility
A path forward
Opportunities
Bigger data, bigger hardware
Breaking up
Sampling
Aggregation
Dimensional reduction
Alternatives
Chunking
Alternative language integrations
Summary
14. Deep Learning
Machine learning or deep learning
What is deep learning?
An alternative to manual instruction
Growing importance
Deeper data?
Deep learning for IoT
Use cases
Word embedding
Word prediction
Word vectors
Numerical representations of contextual similarities
Netflix learns
Implementations
Deep learning architectures
Artificial neural networks
Recurrent neural networks
Summary
Index

Mastering Predictive Analytics with R Second Edition

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: June 2015

Second edition: August 2017

Production reference: 1140817

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78712-139-3

www.packtpub.com

Credits

Authors

James D. Miller

Rui Miguel Forte

Reviewer

Davor Lozić

Commissioning Editor

Amey Varangaonkar

Acquisition Editor

Divya Poojari

Content Development Editor

Deepti Thore

Technical Editor

Nilesh Sawakhande

Copy Editor

Safis Editing

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Pratik Shirodkar

Graphics

Tania Dutta

Production Coordinator

Shantanu Zagade

Cover Work

Shantanu Zagade

About the Authors

James D. Miller is an IBM-certified expert, creative innovator, accomplished director, senior project leader, and application/system architect. He has over 35 years of extensive experience in application and system design and development across multiple platforms and technologies. His experience includes introducing customers to new technologies and platforms, and integrating with IBM Watson Analytics, Cognos BI, and TM1. He has worked in web architecture design, systems analysis, GUI design and testing, database modeling, design and development of OLAP, web, and mainframe applications, systems utilization, IBM Watson Analytics, IBM Cognos BI and TM1 (TM1 rules, TI, TM1Web, and Planning Manager), Cognos Framework Manager, dynaSight - ArcPlan, ASP, DHTML, XML, IIS, MS Visual Basic and VBA, Visual Studio, Perl, Splunk, WebSuite, MS SQL Server, Oracle, and Sybase server. James's responsibilities have also included all aspects of Windows and SQL solution development and design, such as analysis; GUI (and website) design; data modeling; table, screen/form, and script development; SQL (and remote stored procedures and triggers) development and testing; test preparation; and the management and training of programming staff.

His other experience includes the development of ETL infrastructures, such as data transfer automation between mainframe (DB2, Lawson, Great Plains, and so on) system and client/server SQL Server, web-based applications, and the integration of enterprise applications and data sources. James has been a web application development manager responsible for the design, development, QA, and delivery of multiple websites, including online trading applications and warehouse process control and scheduling systems, as well as administrative and control applications. He was also responsible for the design, development, and administration of a web-based financial reporting system for a 450-million dollar organization, reporting directly to the CFO and his executive team.

Furthermore, he has been responsible for managing and directing multiple resources in various management roles, including as project and team leader, lead developer, and application development director. James has authored Cognos TM1 Developers Certification Guide, Mastering Splunk, and a number of white papers on best practices, including Establishing a Center of Excellence. He continues to post blogs on a number of relevant topics based on personal experiences and industry best practices. James is a perpetual learner, continuing to pursue new experiences and certifications. He currently holds the following technical certifications: IBM Certified Business Analyst - Cognos TM1 IBM Cognos TM1 Master 385 Certification (perfect score of 100%), IBM Certified Advanced Solution Expert - Cognos TM1, IBM Cognos TM1 10.1 Administrator Certification C2020-703 (perfect score of 100%), IBM OpenPages Developer Fundamentals C2020-001-ENU (98% in exam), IBM Cognos 10 BI Administrator C2020-622 (98% in exam), and IBM Cognos 10 BI Professional C2020-180.

He specializes in the evaluation and introduction of innovative and disruptive technologies, cloud migration, IBM Watson Analytics, Cognos BI and TM1 application design and development, OLAP, Visual Basic, SQL Server, forecasting and planning, international application development, business intelligence, project development and delivery, and process improvement.

I'd like to thank Nanette L. Miller and remind her that "Your destiny is my destiny. Your happiness is my happiness." I'd also like to thank Shelby Elizabeth and Paige Christina, who are both women of strength and beauty and who, I have no doubt, will have a lasting, loving effect on the world.

Rui Miguel Forte is currently the chief data scientist at Workable. He was born and raised in Greece and studied in the UK. He is an experienced data scientist, with over 10 years of work experience in a diverse array of industries spanning mobile marketing, health informatics, education technology, and human resources technology. His projects have included predictive modeling of user behavior in mobile marketing promotions, speaker intent identification in an intelligent tutor, information extraction techniques for job applicant resumes, and fraud detection for job scams. He currently teaches R, MongoDB, and other data science technologies to graduate students in the Business Analytics MSc program at the Athens University of Economics and Business. In addition, he has lectured at a number of seminars, specialization programs, and R schools for working data science professionals in Athens.

His core programming knowledge is in R and Java, and he has extensive experience of a variety of database technologies, such as Oracle, PostgreSQL, MongoDB, and HBase. He holds a master's degree in Electrical and Electronic Engineering from Imperial College London and is currently researching machine learning applications in information extraction and natural language processing.

Behind every great adventure is a good story, and writing a book is no exception. Many people contributed to making this book a reality. I would like to thank the many students I have taught at AUEB, whose dedication and support have been nothing short of overwhelming. They can rest assured that I have learned just as much from them as they have learned from me, if not more. I also want to thank Damianos Chatziantoniou for conceiving a pioneering graduate data science program in Greece. Workable has been a crucible for working alongside incredibly talented and passionate engineers on exciting data science projects that help businesses around the globe. For this, I would like to thank my colleagues and in particular the founders, Nick and Spyros, who created a diamond in the rough. I would like to thank Subho, Govindan, and all the folks at Packt for their professionalism and patience. My family and extended family have been an incredible source of support on this project. In particular, I would like to thank my father, Libanio, for inspiring me to pursue a career in the sciences, and my mother, Marianthi, for always believing in me far more than anyone else ever could. My wife, Despoina, patiently and fiercely stood by my side even as this book kept me away from her during her first pregnancy. Last but not least, my baby daughter slept quietly and kept a cherubic vigil over her father during the book review phase. She helped me in ways words cannot describe.

About the Reviewer

Davor Lozić is a senior software engineer interested in various subjects, especially computer security, algorithms, and data structures. He manages a team of more than 15 engineers and is a part-time assistant professor who lectures about database systems and interoperability. You can visit his website at http://warriorkitty.com. He likes cats! If you want to talk about any aspect of technology or if you have funny pictures of cats, feel free to contact him.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787121399.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Preface

Predictive analytics incorporates a variety of statistical techniques from predictive modeling, machine learning, and data mining that aim to analyze current and historical facts to produce results referred to as predictions about the future or otherwise unknown events.

R is an open source programming language that is widely used among statisticians and data miners for predictive modeling and data mining. With its constantly growing community and plethora of packages, R offers the functionality to deal with a truly vast array of problems.

This book builds upon its first edition and is meant to be both a guide and a reference for readers wanting to move beyond the basics of predictive modeling. The book begins with a dedicated chapter on the language of models as well as the predictive modeling process. Each subsequent chapter tackles a particular type of model, such as neural networks, and focuses on the three important questions of how the model works, how to use R to train it, and how to measure and assess its performance using real-world datasets.

This second edition provides up-to-date in-depth information on topics such as Performance Metrics and Learning Curves, Polynomial Regression, Poisson and Negative Binomial Regression, back-propagation, Radial Basis Function Networks, and more. A chapter has also been added that focuses on working with very large datasets. By the end of this book, you will have explored and tested the most popular modeling techniques in use on real-world datasets and mastered a diverse range of techniques in predictive analytics.

What this book covers

Chapter 1, Gearing Up for Predictive Modeling, helps you set up and get ready to start looking at individual models and case studies, then describes the process of predictive modeling in a series of steps, and introduces several fundamental distinctions.

Chapter 2, Tidying Data and Measuring Performance, covers performance metrics, learning curves, and a process for tidying data.

Chapter 3, Linear Regression, explains the classic starting point for predictive modeling; it starts from the simplest single-variable model, moves on to multiple regression and over-fitting, and describes regularized extensions of linear regression.

Chapter 4, Generalized Linear Models, follows on from linear regression, and in this chapter, introduces logistic regression as a form of binary classification, extends this to multinomial logistic regression, and uses these as a platform to present the concepts of sensitivity and specificity.

Chapter 5, Neural Networks, explains that the model of logistic regression can be seen as a single layer perceptron. This chapter discusses neural networks as an extension of this idea, along with their origins and explores their power.

Chapter 6, Support Vector Machines, covers a method of transforming data into a different space using a kernel function and as an attempt to find a decision line that maximizes the margin between the classes.

Chapter 7, Tree-Based Methods, presents various popularly used tree-based methods, such as decision trees and the famous C5.0 algorithm. Regression trees are also covered, as well as random forests, which link ahead to the treatment of bagging in Chapter 9, Ensemble Methods. Cross-validation methods for evaluating predictors are presented in the context of these tree-based methods.

Chapter 8, Dimensionality Reduction, covers PCA, ICA, Factor analysis, and Non-negative Matrix factorization.

Chapter 9, Ensemble Methods, discusses methods for combining either many predictors, or multiple trained versions of the same predictor. This chapter introduces the important notions of bagging and boosting and how to use the AdaBoost algorithm to improve performance on one of the previously analyzed datasets using a single classifier.

Chapter 10, Probabilistic Graphical Models, introduces the Naive Bayes classifier as the simplest graphical model following a discussion of conditional probability and Bayes' rule. The Naive Bayes classifier is showcased in the context of sentiment analysis. Hidden Markov Models are also introduced and demonstrated through the task of next word prediction.

Chapter 11, Topic Modeling, provides step-by-step instructions for making predictions on topic models. It will also demonstrate methods of dimensionality reduction to summarize and simplify the data.

Chapter 12, Recommendation Systems, explores different approaches to building recommender systems in R, using nearest neighbor approaches, clustering, and algorithms such as collaborative filtering.

Chapter 13, Scaling Up, explains working with very large datasets, including some worked examples of how to train some models we've seen so far with very large datasets.

Chapter 14, Deep Learning, tackles the really important topic of deep learning using examples such as word embedding and recurrent neural networks (RNNs).

What you need for this book

In order to work with and to run the code examples found in this book, the following should be noted:

  • R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and MacOS. R can be downloaded from a variety of locations, including https://www.rstudio.com/products/rstudio/download.
  • R includes extensive accommodations for accessing documentation and searching for help. A good source of information is http://www.r-project.org/help.html.
  • The capabilities of R are extended through user-created packages. Various packages are referred to and used throughout this book, and the features of and access to each will be detailed as they are introduced. For example, the wordcloud package is introduced in Chapter 11, Topic Modeling, to plot a cloud of words shared across documents. It can be found at https://cran.r-project.org/web/packages/wordcloud/index.html.
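As a quick illustration of the package workflow (a minimal sketch; wordcloud is the CRAN package mentioned above), a package is installed once and then loaded in each session:

# Install the wordcloud package from CRAN (needed only once)
install.packages("wordcloud")

# Load the package into the current R session
library(wordcloud)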

Who this book is for

It would be helpful if the reader has had some experience with predictive analytics and the R programming language; however, this book will also be of value to readers who are new to these topics but are keen to get started as quickly as possible.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Predictive-Analytics-with-R-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringPredictiveAnalyticswithRSecondEdition_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Types of model

With a broad idea of the basic components of a model, we are ready to explore some of the common distinctions that modelers use to categorize different models.

Supervised, unsupervised, semi-supervised, and reinforcement learning models

We've already looked at the iris dataset, which consists of four features and one output variable, namely the species variable. Having the output variable available for all the observations in the training data is the defining characteristic of the supervised learning setting, which represents the most frequent scenario encountered. In a nutshell, the advantage of training a model under the supervised learning setting is that we have the correct answer that we should be predicting for the data points in our training data. As we saw in the previous section, kNN is a model that uses supervised learning, because the model makes its prediction for an input point by combining the values of the output variable for a small number of neighbors to that point. In this book, we will primarily focus on supervised learning.
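As a minimal sketch of this supervised setting, the following trains and evaluates a kNN classifier on the iris data using the class package (which ships with standard R distributions); the split sizes and the value of k here are illustrative choices:

# Supervised learning with kNN on iris: every training row is labeled
library(class)

set.seed(1)
train_idx <- sample(nrow(iris), 100)       # 100 rows for training, 50 held out
train_x   <- iris[train_idx, 1:4]          # the four input features
test_x    <- iris[-train_idx, 1:4]
train_y   <- iris$Species[train_idx]       # the labeled output variable

# Predict each held-out species from its 5 nearest labeled neighbors
pred <- knn(train_x, test_x, cl = train_y, k = 5)
mean(pred == iris$Species[-train_idx])     # proportion classified correctly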

Using the availability of the value of the output variable as a way to discriminate between different models, we can also envisage a second scenario in which the output variable is not specified. This is known as the unsupervised learning setting. An unsupervised version of the iris dataset would consist of only the four features. If we don't have the species output variable available to us, then we clearly have no idea as to which species each observation refers to. Indeed, we won't know how many species of flower are represented in the dataset, or how many observations belong to each species. At first glance, it would seem that without this information, no useful predictive task could be carried out. In fact, what we can do is examine the data and create groups of observations based on how similar they are to each other, using the four features available to us. This process is known as clustering. One benefit of clustering is that we can discover natural groups of data points in our data; for example, we might be able to discover that the flower samples in an unsupervised version of our iris set form three distinct groups that correspond to three different species.
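A minimal sketch of the unsupervised counterpart, assuming base R's kmeans function: we drop the species labels entirely, ask for three clusters, and only afterwards compare the discovered groups against the withheld labels:

# Unsupervised learning: cluster the four features with no labels at all
features <- iris[, 1:4]

set.seed(1)
clusters <- kmeans(features, centers = 3)

# The species column was never shown to the model; use it only to inspect
# how well the discovered groups line up with the actual species
table(clusters$cluster, iris$Species)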

Between unsupervised and supervised methods, which are two absolutes in terms of the availability of the output variable, reside the semi-supervised and reinforcement learning settings. Semi-supervised models are built using data for which a (typically quite small) fraction contains the values for the output variable, while the rest of the data is completely unlabeled. Many such models first use the labeled portion of the dataset to train the model coarsely, and then incorporate the unlabeled data by assigning it the labels predicted by the model trained up to that point.
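A rough sketch of that self-training idea, reusing the kNN classifier from earlier (the 15-row labeled fraction is an illustrative assumption):

# Semi-supervised self-training: train coarsely on the labeled fraction,
# then pseudo-label the rest with the model's own predictions
library(class)

set.seed(1)
labeled_idx <- sample(nrow(iris), 15)      # pretend only 15 rows are labeled
x <- iris[, 1:4]
y <- iris$Species

# Step 1: coarse model from the small labeled portion
pseudo <- knn(x[labeled_idx, ], x[-labeled_idx, ], cl = y[labeled_idx], k = 3)

# Step 2: combine true labels and pseudo-labels into a full training set
all_x <- rbind(x[labeled_idx, ], x[-labeled_idx, ])
all_y <- factor(c(as.character(y[labeled_idx]), as.character(pseudo)))
# all_x and all_y can now train a second, hopefully less coarse, model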

In a reinforcement learning setting, the output variable is not available, but other information that is directly linked with the output variable is provided. One example is predicting the next best move to win a chess game, based on data from complete chess games. Individual chess moves do not have output values in the training data, but for every game, the collective sequence of moves for each player resulted in either a win or a loss. Due to space constraints, the semi-supervised and reinforcement settings aren't covered in this book.

Parametric and nonparametric models

In a previous section, we noted how most of the models we will encounter are parametric models, and we saw an example of a simple linear model. Parametric models have the characteristic that they tend to define a functional form. This means that they reduce the problem of selecting between all possible functions for the target function to a particular family of functions that form a parameter set. Selecting the specific function that will define the model essentially involves selecting precise values for the parameters. So, returning to our example of a three-feature linear model, we can see that we have the two following possible choices of parameters (the choices are infinite, of course; here we just demonstrate two specific ones, where (a₀, a₁, a₂, a₃) and (b₀, b₁, b₂, b₃) are two different fixed sets of coefficient values):

Y₁ = a₀ + a₁X₁ + a₂X₂ + a₃X₃

Y₂ = b₀ + b₁X₁ + b₂X₂ + b₃X₃

Here, we have used a subscript on the output Y variable to denote the two different possible models. Which of these might be a better choice? The answer is that it depends on the data. If we apply each of our models on the observations in our dataset, we will get the predicted output for every observation. With supervised learning, every observation in our training data is labeled with the correct value of the output variable. To assess our model's goodness of fit, we can define an error function that measures the degree to which our predicted outputs differ from the correct outputs. We then use this to pick between our two candidate models in this case, but more generally to iteratively improve a model by moving through a sequence of progressively better candidate models.
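To make this concrete, here is a small sketch in R that scores two candidate coefficient settings with a squared-error function; the coefficient values and the simulated observations are purely illustrative:

# An error function: mean squared difference between truth and prediction
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Two candidate parameter choices for the three-feature linear model
model_1 <- function(X) 1.0 + 2.0 * X[, 1] - 0.5 * X[, 2] + 3.0 * X[, 3]
model_2 <- function(X) 0.5 + 1.5 * X[, 1] + 1.0 * X[, 2] - 2.0 * X[, 3]

# Simulated labeled observations standing in for a real training set
set.seed(1)
X <- matrix(rnorm(300), ncol = 3)
y <- 1.1 + 2.1 * X[, 1] - 0.4 * X[, 2] + 2.9 * X[, 3] + rnorm(100, sd = 0.5)

# The candidate with the lower error is the better fit on this data
c(mse(y, model_1(X)), mse(y, model_2(X)))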

Some parametric models are more flexible than linear models, meaning that they can be used to capture a greater variety of possible functions. Linear models, which require that the output be a linearly weighted combination of the input features, are considered strict. We can intuitively see that a more flexible model is more likely to allow us to approximate our input data with greater accuracy; however, when we look at overfitting, we'll see that this is not always a good thing. Models that are more flexible also tend to be more complex and, thus, training them often proves to be harder than training less flexible models.

Models are not necessarily parameterized; in fact, the class of models that have no parameters is known (unsurprisingly) as nonparametric models. Nonparametric models generally make no assumptions about the particular form of the output function. There are different ways of constructing a target function without parameters. Splines are a common example of a nonparametric model. The key idea behind splines is that we envisage the output function, whose form is unknown to us, as being defined exactly at the points that correspond to all the observations in our training data. Between the points, the function is locally interpolated using smooth polynomial functions. Essentially, the output function is built in a piecewise manner in the space between the points in our training data. Unlike most models, splines guarantee 100% accuracy on the training data, whereas it is perfectly normal for a model to make some errors on its training data. Another good example of a nonparametric model is the k-nearest neighbor algorithm that we've already seen.
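The interpolation property described above can be seen directly with base R's splinefun, which fits an interpolating cubic spline through a set of points (the data below are simulated for illustration):

# An interpolating spline is defined exactly at every training point
set.seed(1)
x <- sort(runif(10, 0, 10))
y <- sin(x) + rnorm(10, sd = 0.2)

f <- splinefun(x, y)             # piecewise cubic polynomials between points

all.equal(f(x), y)               # TRUE: 100% accuracy on the training data
curve(f(x), from = 0, to = 10)   # the smooth function built between points
points(x, y, pch = 19)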

Regression and classification models

The distinction between regression and classification models has to do with the type of output we are trying to predict, and is generally relevant to supervised learning. Regression models try to predict a numerical or quantitative value, such as the stock market index, the amount of rainfall, or the cost of a project. Classification models try to predict a value from a finite (though still possibly large) set of classes or categories. Examples of this include predicting the topic of a website, the next word that will be typed by a user, a person's gender, or whether a patient has a particular disease given a series of symptoms. The majority of models that we will study in this book fall quite neatly into one of these two categories, although a few, such as neural networks, can be adapted to solve both types of problem. It is important to stress here that the distinction made is on the output only, and not on whether the feature values that are used to predict the output are quantitative or qualitative themselves. In general, features can be encoded in a way that allows both qualitative and quantitative features to be used in regression and classification models alike. Earlier, when we built a kNN model to predict the species of iris based on measurements of flower samples, we were solving a classification problem as our species output variable could take only one of three distinct labels.

The kNN approach can also be used in a regression setting; in this case, the model combines the numerical values of the output variable for the selected nearest neighbors by taking the mean or median in order to make its final prediction. Thus, kNN is also a model that can be used in both regression and classification settings.
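A hand-rolled sketch of kNN regression follows; knn_regress is a hypothetical helper written for illustration, not a library function:

# kNN regression: average the output values of the k nearest neighbors
knn_regress <- function(train_x, train_y, query, k = 5) {
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))  # Euclidean distances
  mean(train_y[order(dists)[1:k]])                    # mean of k nearest outputs
}

# Example: predict Petal.Width from the other three iris measurements
train_x <- as.matrix(iris[, 1:3])
train_y <- iris$Petal.Width

knn_regress(train_x, train_y, query = c(5.8, 3.0, 4.3), k = 7)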

Real-time and batch machine learning models

Predictive models can use real-time machine learning or they can involve batch learning. The term real-time machine learning can refer to two different scenarios, although it certainly does not refer simply to making a prediction in real time, that is, within a predefined time limit that is typically small. For example, once trained, a neural network model can produce its prediction of the output using only a few computations (depending on the number of inputs and network layers). This is not, however, what we mean when we talk about real-time machine learning.