Mastering Predictive Analytics with R - Second Edition

James D. Miller
Description

Master the craft of predictive modeling in R by developing strategy, intuition, and a solid foundation in essential concepts

About This Book

  • Grasping the major methods of predictive modeling and moving beyond black box thinking to a deeper level of understanding
  • Leveraging the flexibility and modularity of R to experiment with a range of different techniques and data types
  • Packed with practical advice and tips explaining important concepts and best practices to help you understand quickly and easily

Who This Book Is For

Although budding data scientists, predictive modelers, or quantitative analysts with only basic exposure to R and statistics will find this book useful, experienced data science professionals wishing to attain master-level status will also find it extremely valuable. This book assumes familiarity with the fundamentals of R, such as the main data types, simple functions, and how to move data around. Although no prior experience with machine learning or predictive modeling is required, some of the advanced topics covered will require more than novice exposure.

What You Will Learn

  • Master the steps involved in the predictive modeling process
  • Grow your expertise in using R and its diverse range of packages
  • Learn how to classify predictive models and distinguish which models are suitable for a particular problem
  • Understand the steps for tidying data and improving performance metrics
  • Recognize the assumptions, strengths, and weaknesses of a predictive model
  • Understand how and why each predictive model works in R
  • Select appropriate metrics to assess the performance of different types of predictive model
  • Explore word embedding and recurrent neural networks in R
  • Train models in R that can work on very large datasets

In Detail

R offers a free and open source environment that is perfect for both learning and deploying predictive modeling solutions. With its constantly growing community and plethora of packages, R offers the functionality to deal with a truly vast array of problems.

The book begins with a dedicated chapter on the language of models and the predictive modeling process. You will understand the learning curve and the process of tidying data. Each subsequent chapter tackles a particular type of model, such as neural networks, and focuses on the three important questions of how the model works, how to use R to train it, and how to measure and assess its performance using real-world datasets. How do you train models that can handle really large datasets? This book will also show you just that. Finally, you will tackle the really important topic of deep learning by implementing applications on word embedding and recurrent neural networks.

By the end of this book, you will have explored and tested the most popular modeling techniques in use on real-world datasets and mastered a diverse range of techniques in predictive analytics using R.

Style and approach

This book takes a step-by-step approach in explaining the intermediate to advanced concepts in predictive analytics. Every concept is explained in depth, supplemented with practical examples applicable in a real-world setting.




Table of Contents

Mastering Predictive Analytics with R Second Edition
Credits
About the Authors
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Gearing Up for Predictive Modeling
Models
Learning from data
The core components of a model
Our first model – k-nearest neighbors
Types of model
Supervised, unsupervised, semi-supervised, and reinforcement learning models
Parametric and nonparametric models
Regression and classification models
Real-time and batch machine learning models
The process of predictive modeling
Defining the model's objective
Collecting the data
Picking a model
Pre-processing the data
Exploratory data analysis
Feature transformations
Encoding categorical features
Missing data
Outliers
Removing problematic features
Feature engineering and dimensionality reduction
Training and assessing the model
Repeating with different models and final model selection
Deploying the model
Summary
2. Tidying Data and Measuring Performance
Getting started
Tidying data
Categorizing data quality
The first step
The next step
The final step
Performance metrics
Assessing regression models
Assessing classification models
Assessing binary classification models
Cross-validation
Learning curves
Plot and ping
Summary
3. Linear Regression
Introduction to linear regression
Assumptions of linear regression
Simple linear regression
Estimating the regression coefficients
Multiple linear regression
Predicting CPU performance
Predicting the price of used cars
Assessing linear regression models
Residual analysis
Significance tests for linear regression
Performance metrics for linear regression
Comparing different regression models
Test set performance
Problems with linear regression
Multicollinearity
Outliers
Feature selection
Regularization
Ridge regression
Least absolute shrinkage and selection operator (lasso)
Implementing regularization in R
Polynomial regression
Summary
4. Generalized Linear Models
Classifying with linear regression
Introduction to logistic regression
Generalized linear models
Interpreting coefficients in logistic regression
Assumptions of logistic regression
Maximum likelihood estimation
Predicting heart disease
Assessing logistic regression models
Model deviance
Test set performance
Regularization with the lasso
Classification metrics
Extensions of the binary logistic classifier
Multinomial logistic regression
Predicting glass type
Ordinal logistic regression
Predicting wine quality
Poisson regression
Negative Binomial regression
Summary
5. Neural Networks
The biological neuron
The artificial neuron
Stochastic gradient descent
Gradient descent and local minima
The perceptron algorithm
Linear separation
The logistic neuron
Multilayer perceptron networks
Training multilayer perceptron networks
The back propagation algorithm
Predicting the energy efficiency of buildings
Evaluating multilayer perceptrons for regression
Predicting glass type revisited
Predicting handwritten digits
Receiver operating characteristic curves
Radial basis function networks
Summary
6. Support Vector Machines
Maximal margin classification
Support vector classification
Inner products
Kernels and support vector machines
Predicting chemical biodegradation
Predicting credit scores
Multiclass classification with support vector machines
Summary
7. Tree-Based Methods
The intuition for tree models
Algorithms for training decision trees
Classification and regression trees
CART regression trees
Tree pruning
Missing data
Regression model trees
CART classification trees
C5.0
Predicting class membership on synthetic 2D data
Predicting the authenticity of banknotes
Predicting complex skill learning
Tuning model parameters in CART trees
Variable importance in tree models
Regression model trees in action
Improvements to the M5 model
Summary
8. Dimensionality Reduction
Defining DR
Correlated data analyses
Scatterplots
Causation
The degree of correlation
Reporting on correlation
Principal component analysis
Using R to understand PCA
Independent component analysis
Defining independence
ICA pre-processing
Factor analysis
Explore and confirm
Using R for factor analysis
The output
NNMF
Summary
9. Ensemble Methods
Bagging
Margins and out-of-bag observations
Predicting complex skill learning with bagging
Predicting heart disease with bagging
Limitations of bagging
Boosting
AdaBoost
AdaBoost for binary classification
Predicting atmospheric gamma ray radiation
Predicting complex skill learning with boosting
Limitations of boosting
Random forests
The importance of variables in random forests
XGBoost
Summary
10. Probabilistic Graphical Models
A little graph theory
Bayes' theorem
Conditional independence
Bayesian networks
The Naïve Bayes classifier
Predicting the sentiment of movie reviews
Predicting promoter gene sequences
Predicting letter patterns in English words
Summary
11. Topic Modeling
An overview of topic modeling
Latent Dirichlet Allocation
The Dirichlet distribution
The generative process
Fitting an LDA model
Modeling the topics of online news stories
Model stability
Finding the number of topics
Topic distributions
Word distributions
LDA extensions
Modeling tweet topics
Word clouding
Summary
12. Recommendation Systems
Rating matrix
Measuring user similarity
Collaborative filtering
User-based collaborative filtering
Item-based collaborative filtering
Singular value decomposition
Predicting recommendations for movies and jokes
Loading and pre-processing the data
Exploring the data
Evaluating binary top-N recommendations
Evaluating non-binary top-N recommendations
Evaluating individual predictions
Other approaches to recommendation systems
Summary
13. Scaling Up
Starting the project
Data definition
Experience
Data of scale – big data
Using Excel to gauge your data
Characteristics of big data
Volume
Varieties
Sources and spans
Structure
Statistical noise
Training models at scale
Pain by phase
Specific challenges
Heterogeneity
Scale
Location
Timeliness
Privacy
Collaborations
Reproducibility
A path forward
Opportunities
Bigger data, bigger hardware
Breaking up
Sampling
Aggregation
Dimensional reduction
Alternatives
Chunking
Alternative language integrations
Summary
14. Deep Learning
Machine learning or deep learning
What is deep learning?
An alternative to manual instruction
Growing importance
Deeper data?
Deep learning for IoT
Use cases
Word embedding
Word prediction
Word vectors
Numerical representations of contextual similarities
Netflix learns
Implementations
Deep learning architectures
Artificial neural networks
Recurrent neural networks
Summary
Index

Mastering Predictive Analytics with R Second Edition

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: June 2015

Second edition: August 2017

Production reference: 1140817

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78712-139-3

www.packtpub.com

Credits

Authors

James D. Miller

Rui Miguel Forte

Reviewer

Davor Lozić

Commissioning Editor

Amey Varangaonkar

Acquisition Editor

Divya Poojari

Content Development Editor

Deepti Thore

Technical Editor

Nilesh Sawakhande

Copy Editor

Safis Editing

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Pratik Shirodkar

Graphics

Tania Dutta

Production Coordinator

Shantanu Zagade

Cover Work

Shantanu Zagade

About the Authors

James D. Miller is an IBM-certified expert, creative innovator, accomplished director, senior project leader, and application/system architect. He has over 35 years of extensive experience in application and system design and development across multiple platforms and technologies. His experience includes introducing customers to new technologies and platforms, and integrating with IBM Watson Analytics, Cognos BI, and TM1. He has worked in web architecture design, systems analysis, GUI design and testing, database modeling, design and development of OLAP, web, and mainframe applications, systems utilization, IBM Watson Analytics, IBM Cognos BI and TM1 (TM1 rules, TI, TM1Web, and Planning Manager), Cognos Framework Manager, dynaSight - ArcPlan, ASP, DHTML, XML, IIS, MS Visual Basic and VBA, Visual Studio, Perl, Splunk, WebSuite, MS SQL Server, Oracle, and Sybase server. James's responsibilities have also included all aspects of Windows and SQL solution development and design, such as analysis; GUI (and website) design; data modeling; table, screen/form, and script development; SQL (and remote stored procedures and triggers) development and testing; test preparation; and the management and training of programming staff.

His other experience includes the development of ETL infrastructures, such as data transfer automation between mainframe (DB2, Lawson, Great Plains, and so on) system and client/server SQL Server, web-based applications, and the integration of enterprise applications and data sources. James has been a web application development manager responsible for the design, development, QA, and delivery of multiple websites, including online trading applications and warehouse process control and scheduling systems, as well as administrative and control applications. He was also responsible for the design, development, and administration of a web-based financial reporting system for a 450-million dollar organization, reporting directly to the CFO and his executive team.

Furthermore, he has been responsible for managing and directing multiple resources in various management roles, including as project and team leader, lead developer, and application development director. James has authored Cognos TM1 Developers Certification Guide, Mastering Splunk, and a number of white papers on best practices, including Establishing a Center of Excellence. He continues to post blogs on a number of relevant topics based on personal experiences and industry best practices. James is a perpetual learner, continuing to pursue new experiences and certifications. He currently holds the following technical certifications: IBM Certified Business Analyst - Cognos TM1 IBM Cognos TM1 Master 385 Certification (perfect score of 100%), IBM Certified Advanced Solution Expert - Cognos TM1, IBM Cognos TM1 10.1 Administrator Certification C2020-703 (perfect score of 100%), IBM OpenPages Developer Fundamentals C2020-001-ENU (98% in exam), IBM Cognos 10 BI Administrator C2020-622 (98% in exam), and IBM Cognos 10 BI Professional C2020-180.

He specializes in the evaluation and introduction of innovative and disruptive technologies, cloud migration, IBM Watson Analytics, Cognos BI and TM1 application design and development, OLAP, Visual Basic, SQL Server, forecasting and planning, international application development, business intelligence, project development and delivery, and process improvement.

I'd like to thank Nanette L. Miller and remind her that "Your destiny is my destiny. Your happiness is my happiness." I'd also like to thank Shelby Elizabeth and Paige Christina, who are both women of strength and beauty and who, I have no doubt, will have a lasting, loving effect on the world.

Rui Miguel Forte is currently the chief data scientist at Workable. He was born and raised in Greece and studied in the UK. He is an experienced data scientist, with over 10 years of work experience in a diverse array of industries spanning mobile marketing, health informatics, education technology, and human resources technology. His projects have included predictive modeling of user behavior in mobile marketing promotions, speaker intent identification in an intelligent tutor, information extraction techniques for job applicant resumes, and fraud detection for job scams. He currently teaches R, MongoDB, and other data science technologies to graduate students in the Business Analytics MSc program at the Athens University of Economics and Business. In addition, he has lectured at a number of seminars, specialization programs, and R schools for working data science professionals in Athens.

His core programming knowledge is in R and Java, and he has extensive experience of a variety of database technologies, such as Oracle, PostgreSQL, MongoDB, and HBase. He holds a master's degree in Electrical and Electronic Engineering from Imperial College London and is currently researching machine learning applications in information extraction and natural language processing.

Behind every great adventure is a good story, and writing a book is no exception. Many people contributed to making this book a reality. I would like to thank the many students I have taught at AUEB, whose dedication and support have been nothing short of overwhelming. They can rest assured that I have learned just as much from them as they have learned from me, if not more. I also want to thank Damianos Chatziantoniou for conceiving a pioneering graduate data science program in Greece. Workable has been a crucible for working alongside incredibly talented and passionate engineers on exciting data science projects that help businesses around the globe. For this, I would like to thank my colleagues and in particular the founders, Nick and Spyros, who created a diamond in the rough. I would like to thank Subho, Govindan, and all the folks at Packt for their professionalism and patience. My family and extended family have been an incredible source of support on this project. In particular, I would like to thank my father, Libanio, for inspiring me to pursue a career in the sciences, and my mother, Marianthi, for always believing in me far more than anyone else ever could. My wife, Despoina, patiently and fiercely stood by my side even as this book kept me away from her during her first pregnancy. Last but not least, my baby daughter slept quietly and kept a cherubic vigil over her father during the book review phase. She helped me in ways words cannot describe.

About the Reviewer

Davor Lozić is a senior software engineer interested in various subjects, especially computer security, algorithms, and data structures. He manages a team of more than 15 engineers and is a part-time assistant professor who lectures about database systems and interoperability. You can visit his website at http://warriorkitty.com. He likes cats! If you want to talk about any aspect of technology or if you have funny pictures of cats, feel free to contact him.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787121399.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Preface

Predictive analytics incorporates a variety of statistical techniques from predictive modeling, machine learning, and data mining that aim to analyze current and historical facts to produce results referred to as predictions about the future or otherwise unknown events.

R is an open source programming language that is widely used among statisticians and data miners for predictive modeling and data mining. With its constantly growing community and plethora of packages, R offers the functionality to deal with a truly vast array of problems.

This book builds upon its first edition and is meant to be both a guide and a reference for readers wanting to move beyond the basics of predictive modeling. The book begins with a dedicated chapter on the language of models as well as the predictive modeling process. Each subsequent chapter tackles a particular type of model, such as neural networks, and focuses on the three important questions of how the model works, how to use R to train it, and how to measure and assess its performance using real-world datasets.

This second edition provides up-to-date in-depth information on topics such as Performance Metrics and Learning Curves, Polynomial Regression, Poisson and Negative Binomial Regression, back-propagation, Radial Basis Function Networks, and more. A chapter has also been added that focuses on working with very large datasets. By the end of this book, you will have explored and tested the most popular modeling techniques in use on real-world datasets and mastered a diverse range of techniques in predictive analytics.

What this book covers

Chapter 1, Gearing Up for Predictive Modeling, helps you set up and get ready to start looking at individual models and case studies, then describes the process of predictive modeling in a series of steps, and introduces several fundamental distinctions.

Chapter 2, Tidying Data and Measuring Performance, covers performance metrics, learning curves, and a process for tidying data.

Chapter 3, Linear Regression, explains the classic starting point for predictive modeling; it starts from the simplest single-variable model, moves on to multiple regression and over-fitting, and describes regularized extensions of linear regression.

Chapter 4, Generalized Linear Models, follows on from linear regression, and in this chapter, introduces logistic regression as a form of binary classification, extends this to multinomial logistic regression, and uses these as a platform to present the concepts of sensitivity and specificity.

Chapter 5, Neural Networks, explains that the model of logistic regression can be seen as a single layer perceptron. This chapter discusses neural networks as an extension of this idea, along with their origins and explores their power.

Chapter 6, Support Vector Machines, covers a method of transforming data into a different space using a kernel function and as an attempt to find a decision line that maximizes the margin between the classes.

Chapter 7, Tree-Based Methods, presents various popularly used tree-based methods, such as decision trees and the famous C5.0 algorithm. Regression trees are also covered, as well as random forests, which link ahead to the treatment of bagging in Chapter 9, Ensemble Methods. Cross-validation methods for evaluating predictors are presented in the context of these tree-based methods.

Chapter 8, Dimensionality Reduction, covers PCA, ICA, Factor analysis, and Non-negative Matrix factorization.

Chapter 9, Ensemble Methods, discusses methods for combining either many predictors, or multiple trained versions of the same predictor. This chapter introduces the important notions of bagging and boosting and how to use the AdaBoost algorithm to improve performance on one of the previously analyzed datasets using a single classifier.

Chapter 10, Probabilistic Graphical Models, introduces the Naive Bayes classifier as the simplest graphical model following a discussion of conditional probability and Bayes' rule. The Naive Bayes classifier is showcased in the context of sentiment analysis. Hidden Markov Models are also introduced and demonstrated through the task of next word prediction.

Chapter 11, Topic Modeling, provides step-by-step instructions for making predictions on topic models. It will also demonstrate methods of dimensionality reduction to summarize and simplify the data.

Chapter 12, Recommendation Systems, explores different approaches to building recommender systems in R, using nearest neighbor approaches, clustering, and algorithms such as collaborative filtering.

Chapter 13, Scaling Up, explains working with very large datasets, including some worked examples of how to train some models we've seen so far with very large datasets.

Chapter 14, Deep Learning, tackles the really important topic of deep learning using examples such as word embedding and recurrent neural networks (RNNs).

What you need for this book

In order to work with and to run the code examples found in this book, the following should be noted:

  • R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and MacOS. R can be downloaded from a variety of locations, including https://www.rstudio.com/products/rstudio/download.
  • R includes extensive accommodations for accessing documentation and searching for help. A good source of information is http://www.r-project.org/help.html.
  • The capabilities of R are extended through user-created packages. Various packages are referred to and used throughout this book, and the features of and access to each will be detailed as they are introduced. For example, the wordcloud package is introduced in Chapter 11, Topic Modeling, to plot a cloud of words shared across documents. It can be found at https://cran.r-project.org/web/packages/wordcloud/index.html.
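As a quick illustration of the package workflow (a minimal sketch; wordcloud is the CRAN package mentioned above), a package is installed once and then loaded in each session:

# Install the wordcloud package from CRAN (needed only once)
install.packages("wordcloud")

# Load the package into the current R session
library(wordcloud)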

Who this book is for

It would be helpful if the reader has had some experience with predictive analytics and the R programming language; however, this book will also be of value to readers who are new to these topics but are keen to get started as quickly as possible.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Predictive-Analytics-with-R-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringPredictiveAnalyticswithRSecondEdition_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Types of model

With a broad idea of the basic components of a model, we are ready to explore some of the common distinctions that modelers use to categorize different models.

Supervised, unsupervised, semi-supervised, and reinforcement learning models

We've already looked at the iris dataset, which consists of four features and one output variable, namely the species variable. Having the output variable available for all the observations in the training data is the defining characteristic of the supervised learning setting, which represents the most frequent scenario encountered. In a nutshell, the advantage of training a model under the supervised learning setting is that we have the correct answer that we should be predicting for the data points in our training data. As we saw in the previous section, kNN is a model that uses supervised learning, because the model makes its prediction for an input point by combining the values of the output variable for a small number of neighbors to that point. In this book, we will primarily focus on supervised learning.
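As a minimal sketch of this supervised setting, the following trains and evaluates a kNN classifier on the iris data using the class package (which ships with standard R distributions); the split sizes and the value of k here are illustrative choices:

# Supervised learning with kNN on iris: every training row is labeled
library(class)

set.seed(1)
train_idx <- sample(nrow(iris), 100)       # 100 rows for training, 50 held out
train_x   <- iris[train_idx, 1:4]          # the four input features
test_x    <- iris[-train_idx, 1:4]
train_y   <- iris$Species[train_idx]       # the labeled output variable

# Predict each held-out species from its 5 nearest labeled neighbors
pred <- knn(train_x, test_x, cl = train_y, k = 5)
mean(pred == iris$Species[-train_idx])     # proportion classified correctly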

Using the availability of the value of the output variable as a way to discriminate between different models, we can also envisage a second scenario in which the output variable is not specified. This is known as the unsupervised learning setting. An unsupervised version of the iris dataset would consist of only the four features. If we don't have the species output variable available to us, then we clearly have no idea as to which species each observation refers to. Indeed, we won't know how many species of flower are represented in the dataset, or how many observations belong to each species. At first glance, it would seem that without this information, no useful predictive task could be carried out. In fact, what we can do is examine the data and create groups of observations based on how similar they are to each other, using the four features available to us. This process is known as clustering. One benefit of clustering is that we can discover natural groups of data points in our data; for example, we might be able to discover that the flower samples in an unsupervised version of our iris set form three distinct groups that correspond to three different species.
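A minimal sketch of the unsupervised counterpart, assuming base R's kmeans function: we drop the species labels entirely, ask for three clusters, and only afterwards compare the discovered groups against the withheld labels:

# Unsupervised learning: cluster the four features with no labels at all
features <- iris[, 1:4]

set.seed(1)
clusters <- kmeans(features, centers = 3)

# The species column was never shown to the model; use it only to inspect
# how well the discovered groups line up with the actual species
table(clusters$cluster, iris$Species)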

Between unsupervised and supervised methods, which are two absolutes in terms of the availability of the output variable, reside the semi-supervised and reinforcement learning settings. Semi-supervised models are built using data for which a (typically quite small) fraction contains the values for the output variable, while the rest of the data is completely unlabeled. Many such models first use the labeled portion of the dataset to train the model coarsely, and then incorporate the unlabeled data by assigning it the labels predicted by the model trained up to that point.
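A rough sketch of that self-training idea, reusing the kNN classifier from earlier (the 15-row labeled fraction is an illustrative assumption):

# Semi-supervised self-training: train coarsely on the labeled fraction,
# then pseudo-label the rest with the model's own predictions
library(class)

set.seed(1)
labeled_idx <- sample(nrow(iris), 15)      # pretend only 15 rows are labeled
x <- iris[, 1:4]
y <- iris$Species

# Step 1: coarse model from the small labeled portion
pseudo <- knn(x[labeled_idx, ], x[-labeled_idx, ], cl = y[labeled_idx], k = 3)

# Step 2: combine true labels and pseudo-labels into a full training set
all_x <- rbind(x[labeled_idx, ], x[-labeled_idx, ])
all_y <- factor(c(as.character(y[labeled_idx]), as.character(pseudo)))
# all_x and all_y can now train a second, hopefully less coarse, model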

In a reinforcement learning setting, the output variable is not available, but other information that is directly linked with the output variable is provided. One example is predicting the next best move to win a chess game, based on data from complete chess games. Individual chess moves do not have output values in the training data, but for every game, the collective sequence of moves for each player resulted in either a win or a loss. Due to space constraints, the semi-supervised and reinforcement settings aren't covered in this book.

Parametric and nonparametric models

In a previous section, we noted how most of the models we will encounter are parametric models, and we saw an example of a simple linear model. Parametric models have the characteristic that they tend to define a functional form. This means that they reduce the problem of selecting between all possible functions for the target function to a particular family of functions that form a parameter set. Selecting the specific function that will define the model essentially involves selecting precise values for the parameters. So, returning to our example of a three-feature linear model, we can see that we have the two following possible choices of parameters (the choices are infinite, of course; here we just demonstrate two specific ones, where (a₀, a₁, a₂, a₃) and (b₀, b₁, b₂, b₃) are two different fixed sets of coefficient values):

Y₁ = a₀ + a₁X₁ + a₂X₂ + a₃X₃

Y₂ = b₀ + b₁X₁ + b₂X₂ + b₃X₃

Here, we have used a subscript on the output Y variable to denote the two different possible models. Which of these might be a better choice? The answer is that it depends on the data. If we apply each of our models on the observations in our dataset, we will get the predicted output for every observation. With supervised learning, every observation in our training data is labeled with the correct value of the output variable. To assess our model's goodness of fit, we can define an error function that measures the degree to which our predicted outputs differ from the correct outputs. We then use this to pick between our two candidate models in this case, but more generally to iteratively improve a model by moving through a sequence of progressively better candidate models.
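To make this concrete, here is a small sketch in R that scores two candidate coefficient settings with a squared-error function; the coefficient values and the simulated observations are purely illustrative:

# An error function: mean squared difference between truth and prediction
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Two candidate parameter choices for the three-feature linear model
model_1 <- function(X) 1.0 + 2.0 * X[, 1] - 0.5 * X[, 2] + 3.0 * X[, 3]
model_2 <- function(X) 0.5 + 1.5 * X[, 1] + 1.0 * X[, 2] - 2.0 * X[, 3]

# Simulated labeled observations standing in for a real training set
set.seed(1)
X <- matrix(rnorm(300), ncol = 3)
y <- 1.1 + 2.1 * X[, 1] - 0.4 * X[, 2] + 2.9 * X[, 3] + rnorm(100, sd = 0.5)

# The candidate with the lower error is the better fit on this data
c(mse(y, model_1(X)), mse(y, model_2(X)))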

Some parametric models are more flexible than linear models, meaning that they can be used to capture a greater variety of possible functions. Linear models, which require that the output be a linearly weighted combination of the input features, are considered strict. We can intuitively see that a more flexible model is more likely to allow us to approximate our input data with greater accuracy; however, when we look at overfitting, we'll see that this is not always a good thing. Models that are more flexible also tend to be more complex and, thus, training them often proves to be harder than training less flexible models.

Models are not necessarily parameterized; in fact, the class of models that have no parameters is known (unsurprisingly) as nonparametric models. Nonparametric models generally make no assumptions about the particular form of the output function. There are different ways of constructing a target function without parameters. Splines are a common example of a nonparametric model. The key idea behind splines is that we envisage the output function, whose form is unknown to us, as being defined exactly at the points that correspond to all the observations in our training data. Between the points, the function is locally interpolated using smooth polynomial functions. Essentially, the output function is built in a piecewise manner in the space between the points in our training data. Unlike most models, splines guarantee 100% accuracy on the training data, whereas it is perfectly normal for a model to make some errors on its training data. Another good example of a nonparametric model is the k-nearest neighbor algorithm that we've already seen.
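The interpolation property described above can be seen directly with base R's splinefun, which fits an interpolating cubic spline through a set of points (the data below are simulated for illustration):

# An interpolating spline is defined exactly at every training point
set.seed(1)
x <- sort(runif(10, 0, 10))
y <- sin(x) + rnorm(10, sd = 0.2)

f <- splinefun(x, y)             # piecewise cubic polynomials between points

all.equal(f(x), y)               # TRUE: 100% accuracy on the training data
curve(f(x), from = 0, to = 10)   # the smooth function built between points
points(x, y, pch = 19)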

Regression and classification models

The distinction between regression and classification models has to do with the type of output we are trying to predict, and is generally relevant to supervised learning. Regression models try to predict a numerical or quantitative value, such as the stock market index, the amount of rainfall, or the cost of a project. Classification models try to predict a value from a finite (though still possibly large) set of classes or categories. Examples of this include predicting the topic of a website, the next word that will be typed by a user, a person's gender, or whether a patient has a particular disease given a series of symptoms. The majority of models that we will study in this book fall quite neatly into one of these two categories, although a few, such as neural networks, can be adapted to solve both types of problem. It is important to stress here that the distinction made is on the output only, and not on whether the feature values that are used to predict the output are quantitative or qualitative themselves. In general, features can be encoded in a way that allows both qualitative and quantitative features to be used in regression and classification models alike. Earlier, when we built a kNN model to predict the species of iris based on measurements of flower samples, we were solving a classification problem as our species output variable could take only one of three distinct labels.

The kNN approach can also be used in a regression setting; in this case, the model combines the numerical values of the output variable for the selected nearest neighbors by taking the mean or median in order to make its final prediction. Thus, kNN is also a model that can be used in both regression and classification settings.
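A hand-rolled sketch of kNN regression follows; knn_regress is a hypothetical helper written for illustration, not a library function:

# kNN regression: average the output values of the k nearest neighbors
knn_regress <- function(train_x, train_y, query, k = 5) {
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))  # Euclidean distances
  mean(train_y[order(dists)[1:k]])                    # mean of k nearest outputs
}

# Example: predict Petal.Width from the other three iris measurements
train_x <- as.matrix(iris[, 1:3])
train_y <- iris$Petal.Width

knn_regress(train_x, train_y, query = c(5.8, 3.0, 4.3), k = 7)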

Real-time and batch machine learning models

Predictive models can use real-time machine learning or they can involve batch learning. The term real-time machine learning can refer to two different scenarios, although it certainly does not refer simply to making a prediction in real time, that is, within a predefined time limit that is typically small. For example, once trained, a neural network model can produce its prediction of the output using only a few computations (depending on the number of inputs and network layers). This is not, however, what we mean when we talk about real-time machine learning.