Mastering Machine Learning Algorithms

Giuseppe Bonaccorso
Description

Explore and master the most important algorithms for solving complex machine learning problems.

Key Features

  • Discover high-performing machine learning algorithms and understand how they work in depth
  • One-stop solution to mastering supervised, unsupervised, and semi-supervised machine learning algorithms and their implementation
  • Master concepts related to algorithm tuning, parameter optimization, and more

Book Description

Machine learning is a subset of AI that aims to make modern-day computer systems smarter and more intelligent. The real power of machine learning resides in its algorithms, which allow machines to handle even the most difficult problems. However, as technology advances and data requirements grow, machines will have to become smarter than they are today to meet overwhelming data needs; mastering these algorithms and using them optimally is the need of the hour.

Mastering Machine Learning Algorithms is your complete guide to quickly getting to grips with popular machine learning algorithms. You will be introduced to the most widely used algorithms in supervised, unsupervised, and semi-supervised machine learning, and will learn how to use them in the best possible manner. Ranging from Bayesian models to the MCMC algorithm to Hidden Markov models, this book will teach you how to extract features from your dataset and perform dimensionality reduction by making use of Python-based libraries such as scikit-learn. You will also learn how to use Keras and TensorFlow to train effective neural networks.

If you are looking for a single resource to study, implement, and solve end-to-end machine learning problems and use-cases, this is the book you need.

What you will learn

  • Explore how an ML model can be trained, optimized, and evaluated
  • Understand how to create and learn static and dynamic probabilistic models
  • Successfully cluster high-dimensional data and evaluate model accuracy
  • Discover how artificial neural networks work and how to train, optimize, and validate them
  • Work with Autoencoders and Generative Adversarial Networks
  • Apply label spreading and propagation to large datasets
  • Explore the most important Reinforcement Learning techniques

Who this book is for

This book is an ideal and relevant source of content for data science professionals who want to delve into complex machine learning algorithms, calibrate models, and improve the predictions of the trained model. A basic knowledge of machine learning is preferred to get the best out of this guide.

Giuseppe Bonaccorso is an experienced team leader/manager in Artificial Intelligence and Machine/Deep Learning solution design, management, and delivery. He got his M.Sc.Eng. in Electronics Engineering in 2005 from the University of Catania, Italy, and continued his studies at the University of Rome Tor Vergata, Italy, and the University of Essex, UK. His main interests include Machine/Deep Learning, Reinforcement Learning, bio-inspired adaptive systems, and Natural Language Processing.

Page count: 696

Publication year: 2018




Mastering Machine Learning Algorithms

Expert techniques to implement popular machine learning algorithms and fine-tune your models

Giuseppe Bonaccorso

BIRMINGHAM - MUMBAI

Mastering Machine Learning Algorithms

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Pravin Dhandre
Acquisition Editor: Divya Poojari
Content Development Editor: Eisha Dsouza
Technical Editors: Jovita Alva, Ishita Vora
Copy Editor: Safis Editing
Project Coordinator: Shweta H Birwatkar
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Graphics: Jisha Chirayil
Production Coordinator: Aparna Bhagat

First published: May 2018

Production reference: 1240518

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78862-111-3

www.packtpub.com

To my parents, who always supported me in the journey of life!
– Giuseppe Bonaccorso
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Giuseppe Bonaccorso is an experienced team leader/manager in AI, machine/deep learning solution design, management, and delivery. He got his M.Sc.Eng. in Electronics in 2005 from the University of Catania, Italy, and continued his studies at the University of Rome Tor Vergata and the University of Essex, UK. His main interests include machine/deep learning, reinforcement learning, big data, bio-inspired adaptive systems, cryptocurrencies, and NLP.

I want to thank the people who have been close to me and have supported me, especially my parents, who never stopped encouraging me.

About the reviewer

Francesco Azzola is an electronics engineer with over 15 years of experience in computer programming. He is the author of Android Things Projects by Packt. He loves creating IoT projects using Arduino, Raspberry Pi, Android, and other IoT platforms, and is interested in the convergence of IoT and mobile applications. He is certified in SCEA, SCWCD, and SCJP.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Mastering Machine Learning Algorithms

Dedication

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the author

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Machine Learning Model Fundamentals

Models and data

Zero-centering and whitening

Training and validation sets

Cross-validation

Features of a machine learning model

Capacity of a model

Vapnik-Chervonenkis capacity

Bias of an estimator

Underfitting

Variance of an estimator

Overfitting

The Cramér-Rao bound

Loss and cost functions

Examples of cost functions

Mean squared error

Huber cost function

Hinge cost function

Categorical cross-entropy

Regularization

Ridge

Lasso

ElasticNet

Early stopping

Summary

Introduction to Semi-Supervised Learning

Semi-supervised scenario

Transductive learning

Inductive learning

Semi-supervised assumptions

Smoothness assumption

Cluster assumption

Manifold assumption

Generative Gaussian mixtures

Example of a generative Gaussian mixture

Weighted log-likelihood

Contrastive pessimistic likelihood estimation

Example of contrastive pessimistic likelihood estimation

Semi-supervised Support Vector Machines (S3VM)

Example of S3VM

Transductive Support Vector Machines (TSVM)

Example of TSVM

Summary

Graph-Based Semi-Supervised Learning

Label propagation

Example of label propagation

Label propagation in Scikit-Learn

Label spreading

Example of label spreading

Label propagation based on Markov random walks

Example of label propagation based on Markov random walks

Manifold learning

Isomap

Example of Isomap

Locally linear embedding

Example of locally linear embedding

Laplacian Spectral Embedding

Example of Laplacian Spectral Embedding

t-SNE

Example of t-distributed stochastic neighbor embedding 

Summary

Bayesian Networks and Hidden Markov Models

Conditional probabilities and Bayes' theorem

Bayesian networks

Sampling from a Bayesian network

Direct sampling

Example of direct sampling

A gentle introduction to Markov chains

Gibbs sampling

Metropolis-Hastings sampling

Example of Metropolis-Hastings sampling

Sampling example using PyMC3

Hidden Markov Models (HMMs)

Forward-backward algorithm

Forward phase

Backward phase

HMM parameter estimation

Example of HMM training with hmmlearn

Viterbi algorithm

Finding the most likely hidden state sequence with hmmlearn

Summary

EM Algorithm and Applications

MLE and MAP learning

EM algorithm

An example of parameter estimation

Gaussian mixture

An example of Gaussian Mixtures using Scikit-Learn

Factor analysis

An example of factor analysis with Scikit-Learn

Principal Component Analysis

An example of PCA with Scikit-Learn

Independent component analysis

An example of FastICA with Scikit-Learn

Addendum to HMMs

Summary

Hebbian Learning and Self-Organizing Maps

Hebb's rule

Analysis of the covariance rule

Example of covariance rule application

Weight vector stabilization and Oja's rule

Sanger's network

Example of Sanger's network

Rubner-Tavan's network

Example of Rubner-Tavan's network

Self-organizing maps

Example of SOM

Summary

Clustering Algorithms

k-Nearest Neighbors

KD Trees

Ball Trees

Example of KNN with Scikit-Learn

K-means

K-means++

Example of K-means with Scikit-Learn

Evaluation metrics

Homogeneity score

Completeness score

Adjusted Rand Index

Silhouette score

Fuzzy C-means

Example of fuzzy C-means with Scikit-Fuzzy

Spectral clustering

Example of spectral clustering with Scikit-Learn

Summary

Ensemble Learning

Ensemble learning fundamentals

Random forests

Example of random forest with Scikit-Learn

AdaBoost

AdaBoost.SAMME

AdaBoost.SAMME.R

AdaBoost.R2

Example of AdaBoost with Scikit-Learn

Gradient boosting

Example of gradient tree boosting with Scikit-Learn

Ensembles of voting classifiers

Example of voting classifiers with Scikit-Learn

Ensemble learning as model selection

Summary

Neural Networks for Machine Learning

The basic artificial neuron

Perceptron

Example of a perceptron with Scikit-Learn

Multilayer perceptrons

Activation functions

Sigmoid and hyperbolic tangent

Rectifier activation functions

Softmax

Back-propagation algorithm

Stochastic gradient descent

Weight initialization

Example of MLP with Keras

Optimization algorithms

Gradient perturbation

Momentum and Nesterov momentum

SGD with momentum in Keras

RMSProp

RMSProp with Keras

Adam

Adam with Keras

AdaGrad

AdaGrad with Keras

AdaDelta

AdaDelta with Keras

Regularization and dropout

Dropout

Example of dropout with Keras

Batch normalization

Example of batch normalization with Keras

Summary

Advanced Neural Models

Deep convolutional networks

Convolutions

Bidimensional discrete convolutions

Strides and padding

Atrous convolution

Separable convolution

Transpose convolution

Pooling layers

Other useful layers

Examples of deep convolutional networks with Keras

Example of a deep convolutional network with Keras and data augmentation

Recurrent networks

Backpropagation through time (BPTT)

LSTM

GRU

Example of an LSTM network with Keras

Transfer learning

Summary

Autoencoders

Autoencoders

An example of a deep convolutional autoencoder with TensorFlow

Denoising autoencoders

An example of a denoising autoencoder with TensorFlow

Sparse autoencoders

Adding sparseness to the Fashion MNIST deep convolutional autoencoder

Variational autoencoders

An example of a variational autoencoder with TensorFlow

Summary

Generative Adversarial Networks

Adversarial training

Example of DCGAN with TensorFlow

Wasserstein GAN (WGAN)

Example of WGAN with TensorFlow

Summary

Deep Belief Networks

MRF

RBMs

DBNs

Example of unsupervised DBN in Python

Example of Supervised DBN with Python

Summary

Introduction to Reinforcement Learning

Reinforcement Learning fundamentals

Environment

Rewards

Checkerboard environment in Python

Policy

Policy iteration

Policy iteration in the checkerboard environment

Value iteration

Value iteration in the checkerboard environment

TD(0) algorithm

TD(0) in the checkerboard environment

Summary

Advanced Policy Estimation Algorithms

TD(λ) algorithm

TD(λ) in a more complex Checkerboard environment

Actor-Critic TD(0) in the checkerboard environment

SARSA algorithm

SARSA in the checkerboard environment

Q-learning

Q-learning in the checkerboard environment

Q-learning using a neural network

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

In the last few years, machine learning has become a more and more important field in the majority of industries. Many tasks once considered impossible to automate are now completely managed by computers, allowing human beings to focus on more creative tasks. This revolution has been made possible by the dramatic improvement of standard algorithms, together with a continuous reduction in hardware prices. The complexity that was a huge obstacle only a decade ago is now a problem that even a personal computer can solve. The general availability of high-level open source frameworks has allowed everybody to design and train extremely powerful models.

The main goal of this book is to introduce the reader to complex techniques (such as semi-supervised and manifold learning, probabilistic models, and neural networks), balancing mathematical theory with practical examples written in Python. I wanted to keep a pragmatic approach, focusing on the applications but not neglecting the necessary theoretical foundation. In my opinion, a good knowledge of this field can be acquired only by understanding the underlying logic, which is always expressed using mathematical concepts. This extra effort is rewarded with a more solid awareness of every specific choice and helps the reader understand how to apply, modify, and improve all the algorithms in specific business contexts.

Machine learning is an extremely wide field and it's impossible to cover all the topics in a book. In this case, I've done my best to cover a selection of algorithms belonging to supervised, semi-supervised, unsupervised, and Reinforcement Learning, providing all the references necessary to further explore each of them. The examples have been designed to be easy to understand without any deep insight into the code; in fact, I believe it's more important to show the general cases and let the reader improve and adapt them to cope with particular scenarios. I apologize for any mistakes: even though many revisions have been made, it's possible that some details (both in the formulas and in the code) slipped through. I hope this book will be the starting point for many professionals struggling to enter this fascinating world with a pragmatic and business-oriented viewpoint!

Who this book is for

The ideal audience for this book is computer science students and professionals who want to acquire detailed knowledge of complex machine learning algorithms and applications. The approach is always pragmatic; however, the theoretical part requires some advanced mathematical skills that all graduates (in computer science, engineering, mathematics, or science) should have acquired. The book can also be utilized by more business-oriented professionals (such as CPOs and product managers) to understand how machine learning can be employed to improve existing products and businesses.

What this book covers

Chapter 1, Machine Learning Model Fundamentals, explains the most important theoretical concepts regarding machine learning models, including bias, variance, overfitting, underfitting, data normalization, and cost functions. It can be skipped by those readers with a strong knowledge of these concepts.

Chapter 2, Introduction to Semi-Supervised Learning, introduces the reader to the main elements of semi-supervised learning, focusing on inductive and transductive learning algorithms.

Chapter 3, Graph-Based Semi-Supervised Learning, continues the exploration of semi-supervised learning algorithms belonging to the families of graph-based and manifold learning models. Label propagation and non-linear dimensionality reduction are analyzed in different contexts, providing some effective solutions that can be immediately exploited using Scikit-Learn functionalities.

Chapter 4, Bayesian Networks and Hidden Markov Models, introduces the concepts of probabilistic modeling using directed acyclic graphs, Markov chains, and sequential processes.

Chapter 5, EM Algorithm and Applications, explains the generic structure of the Expectation-Maximization (EM) algorithm. We discuss some common applications, such as Gaussian mixture, Principal Component Analysis, Factor Analysis, and Independent Component Analysis. This chapter requires deep mathematical knowledge; however, the reader can skip the proofs and concentrate on the final results.

Chapter 6, Hebbian Learning and Self-Organizing Maps, introduces Hebb's rule, which is one of the oldest neuro-scientific concepts and whose applications are incredibly powerful. The chapter explains how a single neuron works and presents two complex models (Sanger's network and Rubner-Tavan's network) that can perform a Principal Component Analysis without explicitly computing the input covariance matrix.

Chapter 7, Clustering Algorithms, introduces some common and important unsupervised algorithms, such as k-Nearest Neighbors (based on KD Trees and Ball Trees), K-means (with K-means++ initialization), fuzzy C-means, and spectral clustering. Some important metrics (such as Silhouette score/plots) are also analyzed.

Chapter 8, Ensemble Learning, explains the main concepts of ensemble learning (bagging, boosting, and stacking), focusing on Random Forests, AdaBoost (with its variants), Gradient Boosting, and Voting Classifiers.

Chapter 9, Neural Networks for Machine Learning, introduces the concepts of neural computation, starting with the behavior of a perceptron and continuing the analysis of multilayer perceptrons, activation functions, back-propagation, stochastic gradient descent (and the most important optimization algorithms), regularization, dropout, and batch normalization.

Chapter 10, Advanced Neural Models, continues the explanation of the most important deep learning methods focusing on convolutional networks, recurrent networks, LSTM, and GRU.

Chapter 11, Autoencoders, explains the main concepts of an autoencoder, discussing its application in dimensionality reduction, denoising, and data generation (variational autoencoders).

Chapter 12, Generative Adversarial Networks, explains the concept of adversarial training. We focus on Deep Convolutional GANs and Wasserstein GANs. Both techniques are extremely powerful generative models that can learn the structure of an input data distribution and generate brand new samples without any additional information.

Chapter 13, Deep Belief Networks, introduces the concepts of Markov random fields, Restricted Boltzmann Machines, and Deep Belief Networks. These models can be employed both in supervised and unsupervised scenarios with excellent performance.

Chapter 14, Introduction to Reinforcement Learning, explains the main concepts of Reinforcement Learning (agent, policy, environment, reward, and value) and applies them to introduce policy and value iteration algorithms and Temporal-Difference Learning (TD(0)). The examples are based on a custom checkerboard environment.

Chapter 15, Advanced Policy Estimation Algorithms, extends the concepts defined in the previous chapter, discussing the TD(λ) algorithm, TD(0) Actor-Critic, SARSA, and Q-Learning. A basic example of Deep Q-Learning is also presented to allow the reader to immediately apply these concepts to more complex environments.

To get the most out of this book

There are no strict prerequisites for this book; however, it's important to have basic-intermediate Python knowledge with a specific focus on NumPy. Whenever necessary, I will provide instructions/references to install specific packages and exploit more advanced functionalities. As Python is based on semantic indentation, the published version can contain incorrect newlines that raise exceptions when executing the code. For this reason, I invite all readers without deep knowledge of this language to refer to the original source code provided with the book.

All the examples are based on Python 3.5+. I suggest using the Anaconda distribution (https://www.anaconda.com/download/), which is probably the most complete and powerful one for scientific projects. The majority of the required packages are already built in and it's very easy to install the new ones (sometimes with optimized versions). However, any other Python distribution can be used. Moreover, I invite readers to test the examples using Jupyter (formerly known as IPython) notebooks so as to avoid rerunning the whole example when a change is made. If instead an IDE is preferred, I suggest PyCharm, which offers many built-in functionalities that are very helpful in data-oriented and scientific projects (such as the internal Matplotlib viewer).

A good mathematics background is necessary to fully understand the theoretical part. In particular, basic skills in probability theory, calculus, and linear algebra are required. However, I advise you not to give up when a concept seems too difficult. The reference sections contain many useful books, and the majority of concepts are explained quite well on Wikipedia too. When something unknown is encountered, I suggest reading the specific documentation before continuing. In many cases, it's not necessary to have complete knowledge and even an introductory paragraph can be enough to understand their rationale.

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Machine-Learning-Algorithms. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/MasteringMachineLearningAlgorithms_ColorImages.pdf.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Machine Learning Model Fundamentals

Machine learning models are mathematical systems that share many common features. Even if they have sometimes been defined only from a theoretical viewpoint, advances in research allow us to apply several concepts to better understand the behavior of complex systems such as deep neural networks. In this chapter, we're going to introduce and discuss some fundamental elements that some skilled readers may already know, but that, at the same time, offer several possible interpretations and applications.

In particular, in this chapter we're discussing the main elements of:

Data-generating processes

Finite datasets

Training and test split strategies

Cross-validation

Capacity, bias, and variance of a model

Vapnik-Chervonenkis theory

Cramér-Rao bound

Underfitting and overfitting

Loss and cost functions

Regularization

Models and data

Machine learning algorithms work with data. They create associations, find relationships, discover patterns, generate new samples, and more, working with well-defined datasets. Unfortunately, sometimes the assumptions or the conditions imposed on them are not clear, and a lengthy training process can result in a complete validation failure. Even if this condition is stronger in deep learning contexts, we can think of a model as a gray box (some transparency is guaranteed by the simplicity of many common algorithms), where a vectorial input, X, is transformed into a vectorial output, Y:

Schema of a generic model parameterized with the vector θ

In the previous diagram, the model has been represented by a pseudo-function that depends on a set of parameters defined by the vector θ. In this section, we are only considering parametric models, although there's a family of algorithms that are called non-parametric, because they are based only on the structure of the data. We're going to discuss some of them in upcoming chapters.

The task of a parametric learning process is therefore to find the best parameter set that maximizes a target function whose value is proportional to the accuracy (or the error, if we are trying to minimize it) of the model given a specific input X and output Y. This definition is not very rigorous, and it will be improved in the following sections; however, it's useful as a way to understand the context we're working in.
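As a minimal illustration of this idea (hypothetical noisy linear data, plain NumPy; the model, learning rate, and iteration count are arbitrary choices, not part of any specific algorithm discussed later), the parametric learning process can be sketched as a search for the θ that minimizes a cost:

```python
import numpy as np

# Hypothetical data generated by a noisy linear process y = 2x + 1 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=100)
Y = 2.0 * X + 1.0 + rng.normal(0.0, 0.1, size=100)

# Parametric model f(x; theta) = theta[0] * x + theta[1]
def predict(theta, x):
    return theta[0] * x + theta[1]

# Mean squared error as the target function to minimize
def mse(theta):
    return np.mean((predict(theta, X) - Y) ** 2)

# Simple gradient descent on theta
theta = np.zeros(2)
lr = 0.5
for _ in range(500):
    err = predict(theta, X) - Y
    grad = np.array([np.mean(2 * err * X), np.mean(2 * err)])
    theta -= lr * grad

print(theta)  # close to the generating parameters [2.0, 1.0]
```

The same search could be performed by any optimizer; gradient descent is used here only because it is the simplest choice for a differentiable cost.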

Then, the first question to ask is: What is the nature of X? A machine learning problem is focused on learning abstract relationships that allow a consistent generalization when new samples are provided. More specifically, we can define a stochastic data generating process pdata with an associated joint probability distribution:

p_data(x, y)

Sometimes, it's useful to express the joint probability p(x, y) as a product of the conditional p(y|x), which expresses the probability of a label given a sample, and the marginal probability of the samples p(x). This expression is particularly useful when the prior probability p(x) is known in semi-supervised contexts, or when we are interested in solving problems using the Expectation Maximization (EM) algorithm. We're going to discuss this approach in upcoming chapters.

In many cases, we are not able to derive a precise distribution; however, when considering a dataset, we always assume that it's drawn from the original data-generating distribution. This condition isn't a purely theoretical assumption, because, as we're going to see, whenever our data points are drawn from different distributions, the accuracy of the model can dramatically decrease.
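This degradation is easy to reproduce with a sketch (synthetic Gaussian data; a nearest-centroid classifier is chosen only for brevity, not because the book prescribes it): a model trained on samples from one data-generating distribution loses most of its accuracy on points drawn from a shifted one:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(mean_a, mean_b, n=200):
    # Two Gaussian classes centered at mean_a and mean_b
    Xa = rng.normal(mean_a, 0.5, size=(n, 2))
    Xb = rng.normal(mean_b, 0.5, size=(n, 2))
    return np.vstack([Xa, Xb]), np.array([0] * n + [1] * n)

# "Train" a nearest-centroid classifier on the original distribution
X_train, y_train = sample(mean_a=(-1, -1), mean_b=(1, 1))
c0 = X_train[y_train == 0].mean(axis=0)
c1 = X_train[y_train == 1].mean(axis=0)

def predict(X):
    d0 = np.linalg.norm(X - c0, axis=1)
    d1 = np.linalg.norm(X - c1, axis=1)
    return (d1 < d0).astype(int)

# Test set drawn from the same distribution: high accuracy
X_same, y_same = sample(mean_a=(-1, -1), mean_b=(1, 1))
acc_same = np.mean(predict(X_same) == y_same)

# Test set drawn from a shifted distribution: accuracy collapses
X_shift, y_shift = sample(mean_a=(1, 1), mean_b=(3, 3))
acc_shift = np.mean(predict(X_shift) == y_shift)

print(acc_same, acc_shift)
```

The model itself is unchanged between the two evaluations; only the sampling distribution moved, which is exactly the condition described above.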

If we sample N independent and identically distributed (i.i.d.) values from pdata, we can create a finite dataset X made up of k-dimensional real vectors:

X = {x_1, x_2, ..., x_N}, where x_i ∈ ℝ^k

In a supervised scenario, we also need the corresponding labels (with t output values):

Y = {y_1, y_2, ..., y_N}, where y_i ∈ ℝ^t

When the output has more than two classes, there are different possible strategies to manage the problem. In classical machine learning, one of the most common approaches is One-vs-All, which is based on training N different binary classifiers, where each label is evaluated against all the remaining ones. In this way, N-1 checks are performed to determine the right class. With shallow and deep neural networks, instead, it's preferable to use a softmax function to represent the output probability distribution for all classes:

y_i = softmax(z_i) = e^(z_i) / Σ_j e^(z_j)

This kind of output (z_i represents the intermediate values, and the sum of the terms is normalized to 1) can be easily managed using the cross-entropy cost function (see the corresponding paragraph in the Loss and cost functions section).
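A plain NumPy sketch of such an output layer (the subtraction of the maximum is a standard numerical-stability trick, not part of the mathematical definition):

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(p, true_class):
    # Negative log-probability assigned to the correct class
    return -np.log(p[true_class])

z = np.array([2.0, 1.0, 0.1])  # intermediate values z_i
p = softmax(z)

print(p, p.sum())           # a probability distribution summing to 1
print(cross_entropy(p, 0))  # low cost when class 0 receives high probability
```

The cost is small when the probability mass is concentrated on the true class and grows without bound as that probability approaches zero, which is what makes it a convenient training signal.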

Features of a machine learning model

In this section, we're going to consider supervised models, and try to determine how it's possible to measure their theoretical potential accuracy and their ability to generalize correctly over all possible samples drawn from pdata. The majority of these concepts were developed before the deep learning age, but they continue to have an enormous influence on research projects. The idea of capacity, for example, is an open-ended question that neuroscientists keep on asking themselves about the human brain. Modern deep learning models with dozens of layers and millions of parameters reopened the theoretical question from a mathematical viewpoint. Together with this, other elements, like the limits for the variance of an estimator, again attracted the limelight because the algorithms are becoming more and more powerful, and performances that were once considered far from any feasible solution are now a reality. Being able to train a model so as to exploit its full capacity, maximize its generalization ability, and increase the accuracy, overcoming even human performance, is what a deep learning engineer nowadays has to expect from their work.

Capacity of a model

If we consider a supervised model as a set of parameterized functions, we can define representational capacity as the intrinsic ability of a certain generic function to map a relatively large number of data distributions. To understand this concept, let's consider a function f(x) that admits infinite derivatives, and rewrite it as a Taylor expansion:

We can decide to take only the first n terms, so as to obtain an n-degree polynomial function. Consider a simple bi-dimensional scenario with six functions (starting from a linear one); we can observe the different behavior with a small set of data points:

Different behavior produced by six polynomial separating curves

The ability to rapidly change the curvature is proportional to the degree. If we choose a linear classifier, we can only modify its slope (the example is always in a bi-dimensional space) and the intercept. Instead, if we pick a higher-degree function, we have more possibilities to bend the curvature when it's necessary. If we consider n=1 and n=2 in the plot (on the top-right, they are the first and the second functions), with n=1, we can include the dot corresponding to x=11, but this choice has a negative impact on the dot at x=5.
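This effect can be verified numerically. In the following sketch (the data points are hypothetical, not those of the plot), polynomials of increasing degree are fitted with least squares, and the residual error can only decrease as the degree, and hence the representational capacity, grows:

```python
import numpy as np

# Hypothetical bi-dimensional dataset (the points are illustrative)
x = np.array([1.0, 2.0, 3.0, 5.0, 7.0, 9.0, 11.0])
y = np.array([1.0, 1.5, 1.2, 3.0, 2.0, 4.5, 9.0])

residuals = {}
for n in (1, 2, 5):
    coeffs = np.polyfit(x, y, n)                  # n-degree least-squares fit
    y_hat = np.polyval(coeffs, x)
    residuals[n] = float(np.sum((y - y_hat) ** 2))
```

Since the polynomial families are nested, the higher-degree models can always reproduce the lower-degree fit, so the residual is monotonically non-increasing in n; whether the extra capacity generalizes is a separate question, discussed later in this section.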

Only a parameterized non-linear function can solve this problem efficiently, because this simple problem requires a representational capacity higher than the one provided by linear classifiers. Another classic example is the XOR function. For a long time, several researchers opposed perceptrons (linear neural networks) because they weren't able to classify a dataset generated by the XOR function. Fortunately, the introduction of multilayer perceptrons, with non-linear activation functions, allowed us to overcome this problem, and many others whose complexity is beyond the possibilities of any classic machine learning model.

Vapnik-Chervonenkis capacity

 A mathematical formalization of the capacity of a classifier is provided by the Vapnik-Chervonenkis theory. To introduce the definition, it's first necessary to define the concept of shattering. If we have a class of sets C and a set M, we say that C shatters M if:

In other words, given any subset of M, it can be obtained as the intersection of a particular instance of C (cj) and M itself. If we now consider a model as a parameterized function:

We want to determine its capacity in relation to a finite dataset X:

According to the Vapnik-Chervonenkis theory, we can say that the model f shatters X if there are no classification errors for every possible label assignment. Therefore, we can define the Vapnik-Chervonenkis-capacity or VC-capacity (sometimes called VC-dimension) as the maximum cardinality of a subset of X so that f can shatter it.

For example, if we consider a linear classifier in a bi-dimensional space, the VC-capacity is equal to 3, because it's always possible to label three samples so that f shatters them; however, it's impossible to do it in all situations where N > 3. The XOR problem is an example that needs a VC-capacity higher than 3. Let's explore the following plot:

XOR problem with different separating curves

This particular label choice makes the set non-linearly separable. The only way to overcome this problem is to use higher-order (or non-linear) functions. The curved lines (belonging to a classifier whose VC-capacity is greater than 3) can separate both the upper-left and the lower-right regions from the remaining space, but no straight line can do the same (although it can always separate one point from the other three).
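Both statements can be verified with a brute-force check. In the following sketch, a linear SVM with a very large C (used here as a stand-in for a pure hard-margin separator) reaches a perfect training accuracy if and only if the labeling is linearly separable:

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def linearly_separable(points, labels):
    # A linear SVM with a very large C approximates a hard-margin
    # separator: 100% training accuracy means a separating line exists
    clf = SVC(kernel='linear', C=1e6)
    clf.fit(points, labels)
    return clf.score(points, labels) == 1.0

# Three non-collinear points: every two-class labeling can be shattered
three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
all_separable = all(
    linearly_separable(three, np.array(lab))
    for lab in itertools.product([0, 1], repeat=3)
    if len(set(lab)) == 2)

# The four XOR points with the XOR labeling cannot be shattered
xor_x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
xor_y = np.array([0, 1, 1, 0])
xor_separable = linearly_separable(xor_x, xor_y)
```

Every labeling of the three points is separable (consistent with a VC-capacity of 3 in the plane), while the XOR labeling is not.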

Bias of an estimator

Let's now consider a parameterized model with a single vectorial parameter (this isn't a limitation, but only a didactic choice):

The goal of a learning process is to estimate the parameter θ so as, for example, to maximize the accuracy of a classification. We define the bias of an estimator (in relation to a parameter θ):

In other words, the bias is the difference between the expected value of the estimation and the real parameter value. Remember that the estimation is a function of X, and cannot be considered a constant in the sum.

An estimator is said to be unbiased if:

Moreover, the estimator is defined as consistent if the sequence of estimations converges (at least with probability 1) to the real value when k → ∞:

Given a dataset X whose samples are drawn from pdata, the accuracy of an estimator is inversely proportional to its bias. Low-bias (or unbiased) estimators are able to map the dataset X with high precision levels, while high-bias estimators are very likely to have a capacity that isn't high enough for the problem to be solved, and therefore their ability to detect the whole dynamic is poor.

Let's now compute the derivative of the bias with respect to the vector θ (it will be useful later):

Consider that the last equation, thanks to the linearity of E[•], holds also if we add a term that doesn't depend on x to the estimation of θ. In fact, in line with the laws of probability, it's easy to verify that:
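A classic didactic sketch of the definition (not related to any specific model) is the maximum likelihood estimator of the variance of a Gaussian, which divides by n and is therefore biased, while the version dividing by n-1 is unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0       # hypothetical variance of the data generating process
n, trials = 10, 20000

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), size=n)
    biased.append(np.var(x))            # divides by n:   E[.] = (n-1)/n * var
    unbiased.append(np.var(x, ddof=1))  # divides by n-1: E[.] = var

bias_of_biased = np.mean(biased) - true_var      # approx -var/n = -0.4
bias_of_unbiased = np.mean(unbiased) - true_var  # approx 0
```

Averaging many estimations approximates the expected value in the definition: the biased estimator systematically underestimates the true variance by about var/n, while the unbiased one is centered on it.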

Underfitting

A model with a high bias is likely to underfit the training set. Let's consider the scenario shown in the following graph:

Underfitted classifier: the curve cannot correctly separate the two classes

Even if the problem is very hard, we could try to adopt a linear model and, at the end of the training process, the slope and the intercept of the separating line would be about -1 and 0 (as shown in the plot); however, if we measure the accuracy, we discover that it's close to 0! Independently of the number of iterations, this model will never be able to learn the association between X and Y. This condition is called underfitting, and its major indicator is a very low training accuracy. Even if some data preprocessing steps can improve the accuracy, when a model is underfitted, the only valid solution is to adopt a higher-capacity model.
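The situation can be reproduced with scikit-learn on a hypothetical dataset made of two concentric circles, which no straight line can separate: the linear model underfits, while a higher-capacity one solves the problem:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two concentric circles: non-linearly separable by construction
X, Y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# A linear model underfits: very low training accuracy
linear_acc = LogisticRegression().fit(X, Y).score(X, Y)

# A higher-capacity (kernelized) model can learn the association
rbf_acc = SVC(kernel='rbf').fit(X, Y).score(X, Y)
```

No amount of extra training iterations helps the linear model here; only increasing the capacity (in this case, through an RBF kernel) does.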

In a machine learning task, our goal is to achieve the maximum accuracy, starting from the training set and then moving on to the validation set. More formally, we can say that we want to improve our models so as to get as close as possible to the Bayes accuracy. This is not a well-defined value, but a theoretical upper limit that it is possible to achieve using an estimator. In the following diagram, we see a representation of this process:

Accuracy level diagram

Bayes accuracy is often a purely theoretical limit and, for many tasks, it's almost impossible to achieve, even using biological systems; however, advancements in the field of deep learning allow us to create models that have a target accuracy slightly below the Bayes one. In general, there's no closed form for determining the Bayes accuracy, therefore human abilities are considered as a benchmark. In the previous classification example, a human being is immediately able to distinguish among different dot classes, but the problem can be very hard for a limited-capacity classifier. Some of the models we're going to discuss can solve this problem with a very high target accuracy, but at this point, we run into another risk that can be understood after defining the concept of variance of an estimator.

Variance of an estimator

At the beginning of this chapter, we have defined the data generating process pdata, and we have assumed that our dataset X has been drawn from this distribution; however, we don't want to learn existing relationships limited to X, but we expect our model to be able to generalize correctly to any other subset drawn from pdata. A good measure of this ability is provided by the variance of the estimator:

The variance can also be defined as the square of the standard error (analogously to the standard deviation). A high variance implies dramatic changes in accuracy when new subsets are selected, because the model has probably reached a very high training accuracy through over-learning a limited set of relationships, and it has almost completely lost its ability to generalize.
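As a didactic sketch (unrelated to any specific classifier), we can compare two unbiased estimators of the mean of a Gaussian: the sample mean and the sample median. Both converge to the true value, but the median fluctuates more across resampled datasets; its asymptotic variance is pi*sigma^2/(2n) against sigma^2/n for the mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 20000

means = np.empty(trials)
medians = np.empty(trials)
for i in range(trials):
    x = rng.normal(0.0, 1.0, size=n)   # standard Gaussian samples
    means[i] = x.mean()
    medians[i] = np.median(x)

var_mean = means.var()      # approx 1/n = 0.01
var_median = medians.var()  # approx pi/(2n), about 57% larger
```

An estimator with a larger variance gives answers that change more from one subset to another, which is exactly the instability we want to avoid in a trained model.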

Overfitting

If underfitting is the consequence of low capacity and high bias, overfitting is the phenomenon revealed by a high variance. In general, we can observe a very high training accuracy (even close to the Bayes level), together with a poor validation accuracy. This means that the capacity of the model is high enough or even excessive for the task (the higher the capacity, the higher the probability of large variances), and that the training set isn't a good representation of pdata. To understand the problem, consider the following classification scenarios:

Acceptable fitting (left), overfitted classifier (right)

The left plot has been obtained using logistic regression, while, for the right one, the algorithm is an SVM with a sixth-degree polynomial kernel. If we consider the second model, the decision boundaries seem much more precise, with some samples lying just beyond them. Considering the shapes of the two subsets, it would be possible to say that a non-linear SVM can better capture the dynamics; however, if we sample another dataset from pdata and the diagonal tail becomes wider, logistic regression continues to classify the points correctly, while the SVM accuracy decreases dramatically. The second model is very likely to be overfitted, and some corrections are necessary. When the validation accuracy is much lower than the training one, a good strategy is to increase the number of training samples so as to better represent the real pdata. In fact, it can happen that a training set is built starting from a hypothetical distribution that doesn't reflect the real one; or the number of samples used for validation is too high, reducing the amount of information carried by the remaining samples. Cross-validation is a good way to assess the quality of datasets, but it can always happen that we find completely new subsets (for example, generated when the application is deployed in a production environment) that are misclassified, even if they were supposed to belong to pdata. If it's not possible to enlarge the training set, data augmentation could be a valid solution, because it allows the creation of artificial samples (for images, it's possible to mirror, rotate, or blur them) starting from the information stored in the known ones. Other strategies to prevent overfitting are based on a technique called regularization, which we're going to discuss in the last part of this chapter. For now, we can say that the effect of regularization is similar to a partial linearization, which implies a capacity reduction with a consequent variance decrease.
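The train-validation gap can be reproduced with a short sketch (the dataset and the hyperparameters are hypothetical; an RBF kernel with a very large gamma is used here as a stand-in for an excessively flexible classifier):

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy dataset: a very flexible model can memorize the noise
X, Y = make_moons(n_samples=200, noise=0.35, random_state=0)
Xtr, Xts, Ytr, Yts = train_test_split(X, Y, test_size=0.5, random_state=0)

logreg = LogisticRegression().fit(Xtr, Ytr)
flexible = SVC(kernel='rbf', gamma=100.0, C=1e4).fit(Xtr, Ytr)

# Gap between training and validation accuracy (overfitting indicator)
gap_logreg = logreg.score(Xtr, Ytr) - logreg.score(Xts, Yts)
gap_flexible = flexible.score(Xtr, Ytr) - flexible.score(Xts, Yts)
```

The flexible model reaches an almost perfect training accuracy but pays a much larger gap on the held-out half, which is the signature of a high variance.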

The Cramér-Rao bound

Even if it's theoretically possible to create an unbiased model (at least asymptotically), this is not true for the variance. To understand this concept, it's necessary to introduce an important definition: the Fisher information. If we have a parameterized model and a data generating process pdata, we can define the likelihood function by considering the following parameters:

This function allows measuring how well the model describes the original data generating process. The shape of the likelihood can vary substantially, from well-defined, peaked curves, to almost flat surfaces. Let's consider the following graph, showing two examples based on a single parameter:

Very peaked likelihood (left), flatter likelihood (right)

We can immediately understand that, in the first case, the maximum likelihood can easily be reached by gradient ascent, because the surface is very peaked. In the second case, instead, the gradient magnitude is smaller, and it's rather easy to stop before reaching the actual maximum because of numerical imprecision or tolerances. In the worst cases, the surface can be almost flat in very large regions, with a corresponding gradient close to zero. Of course, we'd like to always work with very sharp and peaked likelihood functions, because they carry more information about their maximum. More formally, the Fisher information quantifies this value. For a single parameter, it is defined as follows:

The Fisher information is an unbounded, non-negative number that is proportional to the amount of information carried by the log-likelihood; the use of the logarithm has no impact on the gradient ascent, but it simplifies complex expressions by turning products into sums. This value can be interpreted as the speed of the gradient when the function is reaching the maximum; therefore, higher values imply better approximations, while a hypothetical value of zero means that the probability of determining the right parameter estimation is also null.
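For instance, for a Bernoulli distribution with parameter theta, the Fisher information has the closed form 1/(theta(1-theta)). In this sketch, the same value is also estimated as the expected squared score (the derivative of the log-likelihood), matching the definition above:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.3          # hypothetical Bernoulli parameter
trials = 200000

# Score: d/dtheta log p(x; theta) = 1/theta for x=1, -1/(1-theta) for x=0
x = rng.random(trials) < theta
score = np.where(x, 1.0 / theta, -1.0 / (1.0 - theta))

fisher_mc = np.mean(score ** 2)                # Monte Carlo estimate
fisher_exact = 1.0 / (theta * (1.0 - theta))   # closed form, about 4.76
```

The expected score itself is zero at the true parameter; it's the expected squared score, the Fisher information, that quantifies how sharply the log-likelihood peaks around theta.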

When working with a set of K parameters, the Fisher information becomes a positive semidefinite matrix:

This matrix is symmetric, and also has another important property: when a value is zero, it means that the corresponding couple of parameters is orthogonal for the purpose of the maximum likelihood estimation, and they can be considered separately. In many real cases, if a value is close to zero, it indicates a very low correlation between the parameters and, even if it's not mathematically rigorous, it's possible to decouple them anyway.

At this point, it's possible to introduce the Cramér-Rao bound, which states that for every unbiased estimator that adopts x (with probability distribution p(x; θ)) as a measure set, the variance of any estimator of θ is always lower-bounded according to the following inequality:

In fact, considering initially a generic estimator and exploiting Cauchy-Schwarz inequality with the variance and the Fisher information (which are both expressed as expected values), we obtain:

Now, if we use the expression for derivatives of the bias with respect to θ, considering that the expected value of the estimation of θ doesn't depend on x, we can rewrite the right side of the inequality as:

If the estimator is unbiased, the derivative on the right side is equal to zero, therefore, we get:

In other words, we can try to reduce the variance, but it will always be lower-bounded by the inverse of the Fisher information. Therefore, given a dataset and a model, there's always a limit to the ability to generalize. In some cases, this measure is easy to determine; however, its real value is theoretical, because it provides the likelihood function with another fundamental property: it carries all the information needed to estimate the worst case for the variance. This is not surprising: when we discussed the capacity of a model, we saw how different functions could lead to higher or lower accuracies. If the training accuracy is high enough, this means that the capacity is appropriate or even excessive for the problem; however, we haven't considered the role of the likelihood p(X|θ).
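As a sketch of the bound in a case where it is easy to determine: for the mean of a Gaussian with known sigma, the Fisher information of n i.i.d. samples is n/sigma^2, so the Cramér-Rao bound is sigma^2/n, and the sample mean (an unbiased estimator) attains it exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, trials = 2.0, 1.5, 50, 20000

fisher = n / sigma ** 2          # Fisher information of the Gaussian mean
cramer_rao = 1.0 / fisher        # lower bound: sigma^2/n = 0.045

# Empirical variance of the sample mean over many simulated datasets
estimates = np.array([rng.normal(mu, sigma, size=n).mean()
                      for _ in range(trials)])
var_mean = estimates.var()
```

No unbiased estimator of mu can do better than this variance; an estimator that attains the bound, like the sample mean here, is called efficient.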

High-capacity models, in particular with small or low-informative datasets, can lead to flat likelihood surfaces with a higher probability than lower-capacity models. Therefore, the Fisher information tends to become smaller, because there are more and more parameter sets that yield similar probabilities, and this, at the end of the day, leads to higher variances and an increased risk of overfitting. To conclude this section, it's useful to consider a general empirical rule derived from Occam's razor principle: whenever a simpler model can explain a phenomenon with enough accuracy, it doesn't make sense to increase its capacity. A simpler model is always preferable (when the performance is good and it represents the specific problem accurately), because it's normally faster both in the training and in the inference phases, and more efficient. When talking about deep neural networks, this principle can be applied in a more precise way, because it's easier to increase or decrease the number of layers and neurons until the desired accuracy has been achieved.

Loss and cost functions

At the beginning of this chapter, we discussed the concept of a generic target function to optimize in order to solve a machine learning problem. More formally, in a supervised scenario, where we have finite datasets X and Y:

We can define the generic loss function for a single sample as:

J is a function of the whole parameter set, and must be proportional to the error between the true label and the predicted one. Another important property is convexity. In many real cases, this is an almost impossible condition; however, it's always useful to look for convex loss functions, because they can be easily optimized through the gradient descent method. We're going to discuss this topic in Chapter 9, Neural Networks for Machine Learning. However, for now, it's useful to consider a loss function as an intermediary between our training process and a pure mathematical optimization. The missing link is the complete data. As already discussed, X is drawn from pdata, so it should represent the true distribution. Therefore, when minimizing the loss function, we're considering a potential subset of points, and never the whole real dataset. In many cases, this isn't a limitation, because, if the bias is null and the variance is small enough, the resulting model will show a good generalization ability (high training and validation accuracy); however, considering the data generating process, it's useful to introduce another measure called expected risk:

This value can be interpreted as an average of the loss function over all possible samples drawn from pdata. Minimizing the expected risk implies the maximization of the global accuracy. When working with a finite number of training samples, instead, it's common to define a cost function (often called a loss function as well, and not to be confused with the log-likelihood):

This is the actual function that we're going to minimize and, divided by the number of samples (a factor that doesn't have any impact), it's also called empirical risk, because it's an approximation (based on real data) of the expected risk. In other words, we want to find a set of parameters so that:
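A sketch with a squared per-sample loss (the labels and predictions are hypothetical) shows how the cost function and the empirical risk are built from the individual losses:

```python
import numpy as np

def loss(y_true, y_pred):
    # Hypothetical per-sample squared loss L(y, y_hat)
    return (y_true - y_pred) ** 2

y = np.array([1.0, 0.0, 1.0, 1.0])       # true labels
y_hat = np.array([0.9, 0.2, 0.6, 1.0])   # model predictions

cost = np.sum(loss(y, y_hat))        # global cost function over the dataset
empirical_risk = cost / len(y)       # cost divided by the number of samples
```

Minimizing the cost or the empirical risk yields the same parameter set, since the 1/N factor doesn't affect the location of the minimum.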

When the cost function has more than two parameters, it's very difficult and perhaps even impossible to understand its internal structure; however, we can analyze some potential conditions using a bidimensional diagram:

 Different kinds of points in a bidimensional scenario

The different situations we can observe are:

- The starting point, where the cost function is usually very high due to the error.
- Local minima, where the gradient is null (and the second derivative is positive). They are candidates for the optimal parameter set, but unfortunately, if the concavity isn't too deep, an inertial movement or some noise can easily move the point away.
- Ridges (or local maxima), where the gradient is null, and the second derivative is negative. They are unstable points, because a minimum perturbation allows escaping, reaching lower-cost areas.
- Plateaus, or regions where the surface is almost flat and the gradient is close to zero. The only way to escape a plateau is to keep some residual kinetic energy; we're going to discuss this concept when talking about neural optimization algorithms (Chapter 9, Neural Networks for Machine Learning).
- The global minimum, the point we want to reach to optimize the cost function.