Machine learning has gained tremendous popularity thanks to its powerful and fast predictions on large datasets. However, the true force behind its output is the set of complex algorithms, grounded in substantial statistical analysis, that churn through large datasets and generate insight.
This second edition of Machine Learning Algorithms walks you through the most prominent recent developments in machine learning algorithms, which constitute major contributions to the machine learning process, and helps you strengthen and master statistical interpretation across the areas of supervised, semi-supervised, and reinforcement learning. Once the core concepts of an algorithm have been covered, you'll explore real-world examples based on the most widespread libraries, such as scikit-learn, NLTK, TensorFlow, and Keras. You will discover new topics such as principal component analysis (PCA), independent component analysis (ICA), Bayesian regression, discriminant analysis, advanced clustering, and Gaussian mixtures.
By the end of this book, you will have studied machine learning algorithms and be able to put them into production to make your machine learning applications more innovative.
Page count: 560
Publication year: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Divya Poojari
Content Development Editor: Eisha Dsouza
Technical Editor: Jovita Alva
Copy Editor: Safis Editing
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Jisha Chirayil
Production Coordinator: Nilesh Mohite
First published: July 2017
Second edition: August 2018
Production reference: 1280818
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78934-799-9
www.packtpub.com
To my family and to all the people who always believed in me and encouraged me in this long journey!
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Giuseppe Bonaccorso is an experienced team leader/manager in AI, machine/deep learning solution design, management, and delivery. He got his MScEng in electronics in 2005 from the University of Catania, Italy, and continued his studies at the University of Rome Tor Vergata and the University of Essex, UK. His main interests include machine/deep learning, reinforcement learning, big data, bio-inspired adaptive systems, cryptocurrencies, and NLP.
Doug Ortiz is an experienced enterprise cloud, big data, data analytics, and solutions architect who has architected, designed, developed, re-engineered, and integrated enterprise solutions. Other expertise includes Amazon Web Services, Azure, Google Cloud, business intelligence, Hadoop, Spark, NoSQL databases, and SharePoint, to name a few.
He is the founder of Illustris, LLC and is reachable at [email protected].
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Machine Learning Algorithms Second Edition
Dedication
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
A Gentle Introduction to Machine Learning
Introduction – classic and adaptive machines
Descriptive analysis
Predictive analysis
Only learning matters
Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
Computational neuroscience
Beyond machine learning – deep learning and bio-inspired adaptive systems
Machine learning and big data
Summary
Important Elements in Machine Learning
Data formats
Multiclass strategies
One-vs-all
One-vs-one
Learnability
Underfitting and overfitting
Error measures and cost functions
PAC learning
Introduction to statistical learning concepts
MAP learning
Maximum likelihood learning
Class balancing
Resampling with replacement
SMOTE resampling
Elements of information theory
Entropy
Cross-entropy and mutual information
Divergence measures between two probability distributions
Summary
Feature Selection and Feature Engineering
scikit-learn toy datasets
Creating training and test sets
Managing categorical data
Managing missing features
Data scaling and normalization
Whitening
Feature selection and filtering
Principal Component Analysis
Non-Negative Matrix Factorization
Sparse PCA
Kernel PCA
Independent Component Analysis
Atom extraction and dictionary learning
Visualizing high-dimensional datasets using t-SNE
Summary
Regression Algorithms
Linear models for regression
A bidimensional example
Linear regression with scikit-learn and higher dimensionality
R2 score
Explained variance
Regressor analytic expression
Ridge, Lasso, and ElasticNet
Ridge
Lasso
ElasticNet
Robust regression
RANSAC
Huber regression
Bayesian regression
Polynomial regression
Isotonic regression
Summary
Linear Classification Algorithms
Linear classification
Logistic regression
Implementation and optimizations
Stochastic gradient descent algorithms
Passive-aggressive algorithms
Passive-aggressive regression
Finding the optimal hyperparameters through a grid search
Classification metrics
Confusion matrix
Precision
Recall
F-Beta
Cohen's Kappa
Global classification report
Learning curve
ROC curve
Summary
Naive Bayes and Discriminant Analysis
Bayes' theorem
Naive Bayes classifiers
Naive Bayes in scikit-learn
Bernoulli Naive Bayes
Multinomial Naive Bayes
An example of Multinomial Naive Bayes for text classification
Gaussian Naive Bayes
Discriminant analysis
Summary
Support Vector Machines
Linear SVM
SVMs with scikit-learn
Linear classification
Kernel-based classification
Radial Basis Function
Polynomial kernel
Sigmoid kernel
Custom kernels
Non-linear examples
ν-Support Vector Machines
Support Vector Regression
An example of SVR with the Airfoil Self-Noise dataset
Introducing semi-supervised Support Vector Machines (S3VM)
Summary
Decision Trees and Ensemble Learning
Binary Decision Trees
Binary decisions
Impurity measures
Gini impurity index
Cross-entropy impurity index
Misclassification impurity index
Feature importance
Decision Tree classification with scikit-learn
Decision Tree regression
Example of Decision Tree regression with the Concrete Compressive Strength dataset
Introduction to Ensemble Learning
Random Forests
Feature importance in Random Forests
AdaBoost
Gradient Tree Boosting
Voting classifier
Summary
Clustering Fundamentals
Clustering basics
k-NN
Gaussian mixture
Finding the optimal number of components
K-means
Finding the optimal number of clusters
Optimizing the inertia
Silhouette score
Calinski-Harabasz index
Cluster instability
Evaluation methods based on the ground truth
Homogeneity
Completeness
Adjusted Rand Index
Summary
Advanced Clustering
DBSCAN
Spectral Clustering
Online Clustering
Mini-batch K-means
BIRCH
Biclustering
Summary
Hierarchical Clustering
Hierarchical strategies
Agglomerative Clustering
Dendrograms
Agglomerative Clustering in scikit-learn
Connectivity constraints
Summary
Introducing Recommendation Systems
Naive user-based systems
Implementing a user-based system with scikit-learn
Content-based systems
Model-free (or memory-based) collaborative filtering
Model-based collaborative filtering
Singular value decomposition strategy
Alternating least squares strategy
ALS with Apache Spark MLlib
Summary
Introducing Natural Language Processing
NLTK and built-in corpora
Corpora examples
The Bag-of-Words strategy
Tokenizing
Sentence tokenizing
Word tokenizing
Stopword removal
Language detection
Stemming
Vectorizing
Count vectorizing
N-grams
TF-IDF vectorizing
Part-of-Speech
Named Entity Recognition
A sample text classifier based on the Reuters corpus
Summary
Topic Modeling and Sentiment Analysis in NLP
Topic modeling
Latent Semantic Analysis
Probabilistic Latent Semantic Analysis
Latent Dirichlet Allocation
Introducing Word2vec with Gensim
Sentiment analysis
VADER sentiment analysis with NLTK
Summary
Introducing Neural Networks
Deep learning at a glance
Artificial neural networks
MLPs with Keras
Interfacing Keras to scikit-learn
Summary
Advanced Deep Learning Models
Deep model layers
Fully connected layers
Convolutional layers
Dropout layers
Batch normalization layers
Recurrent Neural Networks
An example of a deep convolutional network with Keras
An example of an LSTM network with Keras
A brief introduction to TensorFlow
Computing gradients
Logistic regression
Classification with a multilayer perceptron
Image convolution
Summary
Creating a Machine Learning Architecture
Machine learning architectures
Data collection
Normalization and regularization
Dimensionality reduction
Data augmentation
Data conversion
Modeling/grid search/cross-validation
Visualization
GPU support
A brief introduction to distributed architectures
Scikit-learn tools for machine learning architectures
Pipelines
Feature unions
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
This book is an introduction to the world of machine learning, a topic that is becoming more and more important, not only for IT professionals and analysts but also for all the data scientists and engineers who want to exploit the enormous power of techniques such as predictive analysis, classification, clustering, and natural language processing. In order to facilitate the learning process, all theoretical elements are followed by concrete examples based on Python.
A basic but solid understanding of this topic requires a foundation in mathematics, which is necessary not only to explain the algorithms, but also to let the reader understand how it's possible to tune the hyperparameters in order to attain the best possible accuracy. Of course, it's impossible to cover all the details with the appropriate precision. For this reason, some topics are only briefly described, limiting the theory to the results without providing any of the workings. In this way, the reader has the twofold opportunity to focus on the fundamental concepts (without too many mathematical complications) and, through the references, examine in depth all the elements that are of interest.
The chapters can be read in no particular order, skipping the topics that you already know. Whenever necessary, there are references to the chapters where some concepts are explained. I apologize in advance for any imprecision, typos or mistakes, and I'd like to thank all the Packt editors for their collaboration and constant attention.
This book is for machine learning engineers, data engineers, and data scientists who want to build a strong foundation in the field of predictive analytics and machine learning. Familiarity with Python would be an added advantage and will enable you to get the most out of this book.
Chapter 1, A Gentle Introduction to Machine Learning, introduces the world of machine learning, explaining the fundamental concepts of the most important approaches to creating intelligent applications and focusing on the different kinds of learning methods.
Chapter 2, Important Elements in Machine Learning, explains the mathematical concepts regarding the most common machine learning problems, including the concept of learnability and some important elements of information theory. This chapter contains theoretical elements, but it's extremely helpful if you are learning this topic from scratch because it provides an insight into the most important mathematical tools employed in the majority of algorithms.
Chapter 3, Feature Selection and Feature Engineering, describes the most important techniques for preprocessing a dataset, selecting the most informative features, and reducing the original dimensionality.
Chapter 4, Regression Algorithms, describes the linear regression algorithm and its optimizations: Ridge, Lasso, and ElasticNet. It continues with more advanced models that can be employed to solve non-linear regression problems or to mitigate the effect of outliers.
Chapter 5, Linear Classification Algorithms, introduces the concept of linear classification, focusing on logistic regression, perceptrons, stochastic gradient descent algorithms, and passive-aggressive algorithms. The second part of the chapter covers the most important evaluation metrics, which are used to measure the performance of a model and find the optimal hyperparameter set.
Chapter 6, Naive Bayes and Discriminant Analysis, explains the Bayes probability theory and describes the structure of the most diffused Naive Bayes classifiers. In the second part, linear and quadratic discriminant analysis is analyzed with some concrete examples.
Chapter 7, Support Vector Machines, introduces the SVM family of algorithms, focusing on both linear and non-linear classification problems thanks to the employment of the kernel trick. The last part of the chapter covers support vector regression and more complex classification models.
Chapter 8, Decision Trees and Ensemble Learning, explains the concept of a hierarchical decision process and describes the concepts of decision tree classification, random forests, bootstrapped and bagged trees, and voting classifiers.
Chapter 9, Clustering Fundamentals, introduces the concept of clustering, describing the Gaussian mixture, K-Nearest Neighbors, and K-means algorithms. The last part of the chapter covers different approaches to determining the optimal number of clusters and measuring the performance of a model.
Chapter 10, Advanced Clustering, introduces more complex clustering techniques (DBSCAN, Spectral Clustering, and Biclustering) that can be employed when the dataset structure is non-convex. In the second part of the chapter, two online clustering algorithms (mini-batch K-means and BIRCH) are introduced.
Chapter 11, Hierarchical Clustering, continues the explanation of more complex clustering algorithms started in the previous chapter and introduces the concepts of agglomerative clustering and dendrograms.
Chapter 12, Introducing Recommendation Systems, explains the most widespread algorithms employed in recommender systems: content- and user-based strategies, collaborative filtering, and alternating least squares. A complete example based on Apache Spark shows how to process very large datasets using the ALS algorithm.
Chapter 13, Introducing Natural Language Processing, explains the concept of the Bag-of-Words strategy and introduces the most important techniques required to efficiently process natural language datasets (tokenizing, stemming, stopword removal, tagging, and vectorizing). An example of a classifier based on the Reuters dataset is also discussed in the last part of the chapter.
Chapter 14, Topic Modeling and Sentiment Analysis in NLP, introduces the concept of topic modeling and describes the most important algorithms, such as latent semantic analysis (both deterministic and probabilistic) and latent Dirichlet allocation. The second part of the chapter covers the problem of word embedding and sentiment analysis, explaining the most diffused approaches to address it.
Chapter 15, Introducing Neural Networks, introduces the world of deep learning, explaining the concept of neural networks and computational graphs. In the second part of the chapter, the high-level deep learning framework Keras is presented with a concrete example of a Multi-layer Perceptron.
Chapter 16, Advanced Deep Learning Models, explains the basic functionalities of the most important deep learning layers, with Keras examples of deep convolutional networks and recurrent (LSTM) networks for time-series processing. In the second part of the chapter, the TensorFlow framework is briefly introduced, along with some examples that expose some of its basic functionalities.
Chapter 17, Creating a Machine Learning Architecture, explains how to define a complete machine learning pipeline, focusing on the peculiarities and drawbacks of each step.
To fully understand all the algorithms in this book, it's important to have a basic knowledge of linear algebra, probability theory, and calculus.
All practical examples are written in Python and use the scikit-learn machine learning framework, Natural Language Toolkit (NLTK), Crab, langdetect, Spark (PySpark), Gensim, Keras, and TensorFlow (deep learning frameworks). These are available for Linux, macOS, and Windows, with Python 2.7 and 3.3+. When a particular framework is employed for a specific task, detailed instructions and references will be provided. All the examples from chapters 1 to 14 can be executed using Python 2.7 (while TensorFlow requires Python 3.5+); however, I highly suggest using a Python 3.5+ distribution. The most common choice for data science and machine learning is Anaconda (https://www.anaconda.com/download/), which already contains all the most important packages.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-Algorithms-Second-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/MachineLearningAlgorithmsSecondEdition_ColorImages.pdf.
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
In the last few years, machine learning has become one of the most important and prolific IT and artificial intelligence branches. It's not surprising that its applications are becoming more widespread day by day in every business sector, always with new and more powerful tools and results. Open source, production-ready frameworks, together with hundreds of papers published every month, are contributing to one of the most pervasive democratization processes in IT history. But why is machine learning so important and valuable?
In this chapter, we are going to discuss the following:
The difference between classic systems and adaptive ones
The general concept of learning, providing a few examples of different approaches
Why bio-inspired systems and computational neuroscience allowed a dramatic improvement in performance
The relationship between big data and machine learning
Since time immemorial, human beings have built tools and machines to simplify their work and reduce the overall effort needed to complete many different tasks. Even without knowing any physical laws, they invented levers (formally described for the first time by Archimedes), instruments, and more complex machines to carry out longer and more sophisticated procedures. Hammering a nail became easier and less painful thanks to a simple trick, and so did moving heavy stones or wood using a cart. But what's the difference between these two examples? Even if the latter is still a simple machine, its complexity allows a person to carry out a composite task without thinking about each step. Some fundamental mechanical laws play a primary role in allowing a horizontal force to counteract gravity efficiently, but neither human beings nor horses or oxen knew anything about them. Primitive people simply observed how an ingenious trick (the wheel) could improve their lives.
The lesson we've learned is that a machine is never efficient or trendy without a concrete possibility of using it pragmatically. A machine is immediately considered useful and destined to be continuously improved if its users can easily understand what tasks can be completed with less effort or automatically. In the latter case, some intelligence seems to appear next to cogs, wheels, or axles. So, a further step can be added to our evolution list: automatic machines, built (nowadays, we'd say programmed) to accomplish specific goals by transforming energy into work. Windmills and watermills are examples of elementary tools that are able to carry out complete tasks with minimal (compared to a direct activity) human control.
In the following diagram, there's a generic representation of a classical system that receives some input values, processes them, and produces output results:
But again, what's the key to the success of a mill? It's no exaggeration to say that human beings have tried to transfer some intelligence into their tools since the dawn of technology. Both the water in a river and the wind show a behavior that we can simply call flowing. They have a lot of energy to give us free of charge, but a machine should have some awareness to facilitate this process. A wheel can turn around a fixed axle millions of times, but the wind must find a suitable surface to push on. The answer seems obvious, but you should try to think about people without any knowledge or experience; even if implicitly, they started a brand new approach to technology. If you prefer to reserve the word intelligence for more recent results, it's possible to say that the path started with tools, moved first to simple machines, and then to smarter ones.
Without further intermediate (but no less important) steps, we can jump to our epoch and change the scope of our discussion. Programmable computers are widespread, flexible, and more and more powerful instruments; moreover, the diffusion of the internet has allowed us to share software applications and related information with minimal effort. The word-processing software that I'm using, my email client, a web browser, and many other common tools running on the same machine are all examples of such flexibility. It's undeniable that the IT revolution dramatically changed our lives and sometimes improved our daily jobs, but without machine learning (and all its applications), there are still many tasks that seem far outside the computer's domain. Spam filtering, Natural Language Processing (NLP), visual tracking with a webcam or a smartphone, and predictive analysis are only a few applications that have revolutionized human-machine interaction and increased our expectations. In many cases, they transformed our electronic tools into actual cognitive extensions that are changing the way we interact with many daily situations. They achieved this goal by filling the gap between human perception, language, reasoning, and models on one side, and artificial instruments on the other.
Here's a schematic representation of an adaptive system:
Such a system isn't based on static or permanent structures (model parameters and architectures), but rather on a continuous ability to adapt its behavior to external signals (datasets or real-time inputs) and, like a human being, to predict the future using uncertain and fragmentary pieces of information.
Before moving on with a more specific discussion, let's briefly define the different kinds of system analysis that can be performed. These techniques are often structured as a sequence of specific operations whose goal is to increase the overall domain knowledge and allow us to answer specific questions; however, in some cases, it's possible to limit the process to a single step in order to meet specific business needs. I always suggest briefly considering them all, because many particular operations make sense only when some conditions are met. A clear understanding of the problem and its implications is the best way to make the right decisions, also taking into consideration possible future developments.
Before trying any machine learning solution, it's necessary to create an abstract description of the context. The best way to achieve this goal is to define a mathematical model, which has the advantage of being immediately comprehensible to anybody (assuming some basic knowledge). However, the goal of descriptive analysis is to find an accurate description of the phenomena that are observed and to validate all the hypotheses. Let's suppose that our task is to optimize the supply chain of a large store. We start collecting data about purchases and sales and, after a discussion with a manager, we define the generic hypothesis that the sales volume increases during the day before the weekend. This means that our model should be based on a periodicity. A descriptive analysis has the task of validating it, but also of discovering all those other particular features that were initially neglected.
At the end of this stage, we should know, for example, whether the time series (let's suppose we consider only one variable) is periodic, whether it has a trend, whether it's possible to find a set of standard rules, and so forth. A further step (that I prefer to consider as a whole with this one) is to define a diagnostic model that must be able to connect all the effects with precise causes. This process seems to go in the opposite direction, but its goal is very close to that of descriptive analysis. In fact, whenever we describe a phenomenon, we are naturally driven to find a rational reason that justifies each specific step. Let's suppose that, after having observed the periodicity in our time series, we find a sequence that doesn't obey this rule. The goal of diagnostic analysis is to give a suitable answer (that is, the store is open on Sunday). This new piece of information enriches our knowledge and specializes it: now, we can state that the series is periodic only when there is a day off, and therefore (clearly, this is a trivial example) we don't expect an increase in sales before a working day. As many machine learning models have specific prerequisites, a descriptive analysis allows us to immediately understand whether a model will perform poorly or whether it's the best choice considering all the known factors. In all of the examples we will look at, we are going to perform a brief descriptive analysis by defining the features of each dataset and what we can observe. As the goal of this book is to focus on adaptive systems, we don't have space for a complete description, but I always invite the reader to imagine new possible scenarios, performing a virtual analysis before defining the models.
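As a purely illustrative sketch (the data, names, and values here are invented and not taken from the book), the weekly periodicity hypothesis can be checked descriptively on a synthetic sales series by comparing autocorrelations at different lags:

```python
import numpy as np

# Hypothetical example: synthetic daily sales with a peak before the day off,
# plus Gaussian noise. A high autocorrelation at lag 7 suggests weekly periodicity.
rng = np.random.default_rng(1000)
days = np.arange(28 * 7)
sales = 100.0 + 40.0 * (days % 7 == 5) + rng.normal(0.0, 5.0, size=days.shape)

def autocorr(x, lag):
    # Normalized autocorrelation of a 1-D series at a given positive lag
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# The weekly lag should dominate a non-aligned lag if the hypothesis holds
print(autocorr(sales, 7), autocorr(sales, 3))
```

If the lag-7 value is close to 1 while other lags stay near zero (or negative), the descriptive stage has validated the periodicity hypothesis before any predictive model is built.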
The goal of machine learning is mostly related to this precise stage. In fact, once we have defined a model of our system, we need to infer its future states, given some initial conditions. This process is based on discovering the rules that underlie the phenomenon, so as to push them forward in time (in the case of a time series) and observe the results. Of course, the goal of a predictive model is to minimize the error between the actual and predicted values, considering all possible interfering factors.
In the example of the large store, a good model should be able to forecast a peak before a day off and normal behavior in all the other cases. Moreover, once a predictive model has been defined and trained, it can be used as a fundamental part of a decision-based process. In this case, the prediction must be turned into a suggested prescription. For example, the object detector of a self-driving car can be extremely accurate and detect an obstacle in time. However, what is the best action to perform in order to achieve a specific goal? According to the prediction (position, size, speed, and so on), another model must be able to pick the action that minimizes the risk of damage and maximizes the probability of a safe movement. This is a common task in reinforcement learning, but it's also extremely useful whenever a manager has to make a decision in a context where there are many factors. The resultant model is, hence, a pipeline that is fed with raw inputs and uses the single outcomes as inputs for subsequent models. Returning to our initial example, the store manager is not interested in discovering the hidden oscillations, but in the right volume of goods that he has to order every day. Therefore, the first step is predictive analysis, while the second is a prescriptive one, which can take into account many factors that are discarded by the previous model (that is, different suppliers can have shorter or longer delivery times, or they can apply discounts according to the volume).
So, the manager will probably define a goal in terms of a function to maximize (or minimize), and the model has to find the best amount of goods to order so as to fulfill the main requirement (which, of course, is availability, and depends on the sales prediction). In the remaining part of this book, we are going to discuss many solutions to specific problems, focusing on the predictive stage. But, in order to move on, we need to define what learning means and why it's so important in an ever-growing number of business contexts.
What exactly does learning mean? Simply put, we can say that learning is the ability to change according to external stimuli and remember most of our previous experiences. So, machine learning is an engineering approach that gives maximum importance to every technique that increases or improves the capacity to change adaptively. A mechanical watch, for example, is an extraordinary artifact, but its structure obeys stationary laws and becomes useless if something external changes. This ability is peculiar to animals and, in particular, to human beings; according to Darwin's theory, it's also a key success factor for the survival and evolution of all species. Machines, even if they don't evolve autonomously, seem to obey the same law.
Therefore, the main goal of machine learning is to study, engineer, and improve mathematical models that can be trained (once or continuously) with context-related data (provided by a generic environment) to infer the future and to make decisions without complete knowledge of all influencing elements (external factors). In other words, an agent (which is a software entity that receives information from an environment, picks the best action to reach a specific goal, and observes the results of it) adopts a statistical learning approach, trying to determine the right probability distributions, and use them to compute the action (value or decision) that is most likely to be successful (with the fewest errors).
I prefer using the term inference instead of prediction, if only to avoid the weird (but not so uncommon) idea that machine learning is a sort of modern magic. Moreover, it's possible to introduce a fundamental statement: an algorithm can extrapolate general laws and learn their structure with relatively high precision, but only if they affect the actual data. So, the term prediction can be freely used, but with the same meaning adopted in physics or systems theory. Even in the most complex scenarios, such as image classification with convolutional neural networks, every piece of information (geometry, color, peculiar features, contrast, and so on) is already present in the data, and the model has to be flexible enough to extract it and learn it permanently.
In the following sections, we will give you a brief description of some common approaches to machine learning. Mathematical models, algorithms, and practical examples will be discussed in later chapters.
A supervised scenario is characterized by the concept of a teacher or supervisor, whose main task is to provide the agent with a precise measure of its error (directly comparable with output values). With actual algorithms, this function is provided by a training set made up of pairs (input and expected output). Starting from this information, the agent can correct its parameters so as to reduce the magnitude of a global loss function. After each iteration, if the algorithm is flexible enough and the data elements are coherent, the overall accuracy increases and the difference between the predicted and expected values becomes close to zero. Of course, in a supervised scenario, the goal is to train a system that must also work with samples it has never seen before. So, it's necessary to allow the model to develop a generalization ability and avoid a common problem called overfitting, which causes overlearning due to an excessive capacity (we're going to discuss this in more detail in the following chapters; for now, we can say that one of the main effects of this problem is the ability to correctly predict only the samples used for training, while the error rate for the remaining ones is always very high).
In the following graph, a few training points are marked with circles, and the thin blue line represents a perfect generalization (in this case, the connection is a simple segment):
Two different models are trained with the same dataset (corresponding to the two thicker lines). The former is unacceptable because it cannot generalize and capture the fastest dynamics (in terms of frequency), while the latter seems to be a very good compromise: it follows the original trend and retains a residual ability to generalize correctly in a predictive analysis.
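The trade-off between underfitting and overfitting can be sketched numerically. The following is only an illustrative toy (synthetic noisy samples of a sine wave, arbitrary polynomial degrees chosen for the example): a model with too little capacity misses the dynamics, while a model with excessive capacity drives the training error down at the expense of out-of-sample accuracy.

```python
import numpy as np

# Synthetic data (invented for illustration): noisy samples of a smooth signal
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 20)
x_test = np.linspace(0.0, 1.0, 200)
true_f = lambda x: np.sin(2.0 * np.pi * x)
y_train = true_f(x_train) + rng.normal(0.0, 0.1, x_train.shape)

errors = {}
for degree in (1, 3, 9):
    # Higher degree = higher capacity; least squares polynomial fit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - true_f(x_test)) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree={degree}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

The training error always shrinks as the degree grows, but that is no guarantee of good generalization; only the error on unseen points reveals whether the model has captured the underlying trend or merely memorized the noise.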
Formally, the previous example is called regression because it's based on continuous output values. Instead, if there is only a discrete number of possible outcomes (called categories), the process becomes a classification. Sometimes, instead of predicting the actual category, it's better to determine its probability distribution. For example, an algorithm can be trained to recognize a handwritten alphabetical letter, so its output is categorical (in English, there'll be 26 allowed symbols). On the other hand, even for human beings, such a process can lead to more than one probable outcome when the visual representation of a letter isn't clear enough to belong to a single category. This means that the actual output is better described by a discrete probability distribution (for example, with 26 continuous values normalized so that they always sum up to 1).
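The idea of predicting a probability distribution instead of a single category can be shown with a hedged sketch (a hypothetical three-class toy problem rather than the 26 letters, with all data values invented). Many scikit-learn classifiers expose this through `predict_proba`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature, three hypothetical categories (toy data for illustration)
X = np.array([[0.0], [0.4], [0.6], [1.0], [1.4], [1.6], [2.0], [2.4]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Instead of a hard label, ask for the discrete distribution over categories
probs = clf.predict_proba(np.array([[0.5], [1.5]]))
print(probs)              # one row per sample, one column per class
print(probs.sum(axis=1))  # each row is normalized, so it sums to 1
```

A sample lying between two regions yields comparable probabilities for both classes, exactly the ambiguous situation described above for a malformed letter.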
In the following graph, there's an example of the classification of elements with two features. The majority of algorithms try to find the best separating hyperplane (in this case, it's a linear problem) by imposing different conditions. However, the goal is always the same: reducing the number of misclassifications and increasing robustness to noise. For example, look at the triangular point that is closest to the plane (its coordinates are about [5.1, 3.0]). If the magnitude of the second feature were affected by noise, so that its value became somewhat smaller than 3.0, a slightly higher hyperplane could wrongly classify it. We're going to discuss some powerful techniques to solve these problems in later chapters:
Common supervised learning applications include the following:
Predictive analysis based on regression or categorical classification
Spam detection
Pattern detection
NLP
Sentiment analysis
Automatic image classification
Automatic sequence processing (for example, music or speech)
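The separating-hyperplane idea from the classification discussion above can be sketched with a linear SVM on two well-separated synthetic blobs (all data and parameters here are invented for illustration; SVMs themselves are covered in Chapter 7):

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic 2D classes (hypothetical data, clearly separated)
rng = np.random.default_rng(1)
class_a = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 50 + [1] * 50)

# A linear kernel yields a separating hyperplane w·x + b = 0
clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane normal:", w, "intercept:", b)
print("training accuracy:", clf.score(X, y))
```

The coefficients `w` and `b` fully describe the decision boundary; a noisy point near it is exactly the fragile case discussed above, which is why maximizing the margin matters.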
This approach is based on the absence of any supervisor, and therefore of absolute error measures. It's useful when it's necessary to learn how a set of elements can be grouped (clustered) according to their similarity (or a distance measure). For example, looking at the previous graph, a human being can immediately identify two sets without considering the colors or the shapes. In fact, the circular dots (as well as the triangular ones) determine a coherent set; each set is separated from the other by a distance far greater than the internal separation of its own points. Using a metaphor, an ideal scenario is a sea with a few islands that can be separated from each other considering only their mutual position and internal cohesion. Clearly, unsupervised learning provides an implicit descriptive analysis, because all the pieces of information discovered by the clustering algorithm can be used to obtain complete insight into the dataset. In fact, all objects share a subset of features, while they differ from other viewpoints. The aggregation process is also aimed at extending the characteristics of some points to their neighbors, assuming that the similarity is not limited to some specific features. For example, in a recommendation engine, a group of users can be clustered according to the preferences expressed for some books. If the chosen criteria detect some analogies between users A and B, we can share the non-overlapping elements between them. Therefore, if A has read a book that can be suitable for B, we are implicitly authorized to recommend it. In this case, the decision is made by considering a goal (sharing the features) and a descriptive analysis. However, as the model can (and should) manage unknown users too, its purpose is also predictive.
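The book-recommendation idea can be sketched with k-means (one of many possible clustering algorithms; the rating matrix below is entirely invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows = users, columns = books, values = hypothetical 0-5 ratings
ratings = np.array([
    [5, 4, 0, 0],   # user A: likes books 0 and 1
    [4, 5, 0, 1],   # user B: similar tastes to A
    [0, 0, 5, 4],   # user C: likes books 2 and 3
    [1, 0, 4, 5],   # user D: similar tastes to C
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ratings)
print(km.labels_)  # A/B end up in one cluster, C/D in the other

# If A and B share a cluster, A's highly-rated books become candidates for B
same_cluster = km.labels_[0] == km.labels_[1]
print("recommend A's books to B:", same_cluster)
```

The cluster labels are the implicit descriptive analysis mentioned above; the recommendation step turns them into a predictive/prescriptive decision.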
In the following graph, each ellipse represents a cluster, and all the points inside its area can be labeled in the same way. There are also boundary points (such as the triangles overlapping the circle area) that need a specific criterion (normally a trade-off distance measure) to determine the corresponding cluster. Just as in classification with ambiguities (such as a P and a malformed R), a good clustering approach should consider the presence of outliers and treat them so as to increase both the internal coherence (visually, this means picking a subdivision that maximizes the local density) and the separation among clusters.
For example, it's possible to give priority to the distance between a single point and a centroid, or the average distance among points belonging to the same cluster and different ones. In this graph, all boundary triangles are close to each other, so the nearest neighbor is another triangle. However, in real-life problems, there are often boundary areas where there's a partial overlap, meaning that some points have a high degree of uncertainty due to their feature values:
Another interpretation can be expressed by using probability distributions. If you look at the ellipses, they represent the area of multivariate Gaussians bound between a minimum and maximum variance. Considering the whole domain, a point (for example, a blue star) could potentially belong to all clusters, but the probability given by the first one (lower-left corner) is the highest, and so this determines the membership. Once the variance and mean (in other words, the shape) of all Gaussians become stable, each boundary point is automatically captured by a single Gaussian distribution (except in the case of equal probabilities). Technically, we say that such an approach maximizes the likelihood of a Gaussian mixture given a certain dataset. This is a very important statistical learning concept that spans many different applications, so it will be examined in more depth in the next chapter, Chapter 2, Important Elements in Machine Learning. Moreover, we're going to discuss some common clustering methodologies, considering both strong and weak points, and compare their performances for various test distributions.
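The Gaussian mixture interpretation above can be sketched with scikit-learn's `GaussianMixture` on synthetic data (the two blobs and the boundary point are invented for illustration; the fitting procedure itself is discussed in later chapters):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic Gaussian blobs (hypothetical data)
rng = np.random.default_rng(0)
cluster_1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
cluster_2 = rng.normal(loc=[4.0, 4.0], scale=0.5, size=(100, 2))
X = np.vstack([cluster_1, cluster_2])

# Fit a two-component mixture by maximum likelihood (EM under the hood)
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# A boundary point has a soft membership over both Gaussians
point = np.array([[2.0, 2.1]])
probs = gm.predict_proba(point)[0]
print(probs, probs.sum())  # the memberships always sum to 1
```

The component with the highest posterior probability determines the membership, exactly as described for the boundary points in the graph.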
Other important techniques involve the use of both labeled and unlabeled data. This approach is therefore called semi-supervised and can be adopted when it's necessary to categorize a large amount of data with a few complete (labeled) examples or when there's the need to impose some constraints to a clustering algorithm (for example, assigning some elements to a specific cluster or excluding others).
Common unsupervised applications include the following:
Object segmentation (for example, users, products, movies, songs, and so on)
Similarity detection
Automatic labeling
Recommendation engines
There are many problems where the number of labeled samples is very small compared with the potential number of elements. A direct supervised approach is infeasible because the data used to train the model couldn't be representative of the whole distribution, so it's necessary to find a trade-off between a supervised and an unsupervised strategy. Semi-supervised learning has been studied mainly to solve these kinds of problems. The topic is a little more advanced and won't be covered in this book (the reader who is interested can check out Mastering Machine Learning Algorithms, Bonaccorso G., Packt Publishing). However, the main goals that a semi-supervised learning approach pursues are as follows:
The propagation of labels to unlabeled samples, considering the graph of the whole dataset. The samples with labels become attractors that extend their influence to the neighbors until an equilibrium point is reached.
Training a classification model (in general, Support Vector Machines (SVM); see Chapter 7, Support Vector Machines, for further information) using the labeled samples to enforce the conditions necessary for a good separation, while trying to exploit the unlabeled samples as balancers, whose influence must be mediated by the labeled ones. Semi-supervised SVMs can perform extremely well when the dataset contains only a few labeled samples, and they dramatically reduce the burden of building and managing very large labeled datasets.
Non-linear dimensionality reduction considering the graph structure of the dataset. This is one of the most challenging problems due to the constraints existing in high-dimensional datasets (that is, images). Finding a low-dimensional distribution that represents the original one while minimizing the discrepancy is a fundamental task, necessary to visualize structures with more than three dimensions. Moreover, the ability to reduce the dimensionality without a significant information loss is a key element whenever it's necessary to work with simpler models. In this book, we are going to discuss some common linear techniques (such as Principal Component Analysis (PCA)), which will allow the reader to understand when some features can be removed without impacting the final accuracy, but with a training speed gain.
It should now be clear that semi-supervised learning exploits the ability to find separating hyperplanes (classification) together with the automatic discovery of structural relationships (clustering). Without loss of generality, we could say that the real supervisor, in this case, is the data graph (representing the relationships), which corrects the decisions according to the underlying informational layer. To better understand the logic, imagine that we have a set of users, but only 1% of them have been labeled (for simplicity, let's suppose that they are uniformly distributed). Our goal is to find the most accurate labels for the remaining part. A clustering algorithm can rearrange the structure according to the similarities (as the labeled samples are uniformly distributed, we can expect to find unlabeled neighbors whose center is a labeled sample). Under some assumptions, we can propagate the center's label to the neighbors, repeating this process until every sample becomes stable. At this point, the whole dataset is labeled and it's possible to employ other algorithms to perform specific operations. Clearly, this is only an example, but in real life it's extremely common to find scenarios where the cost of labeling millions of samples is not justified, considering the accuracy achievable by semi-supervised methods.
Even if there are no actual supervisors, reinforcement learning is also based on feedback provided by the environment. However, in this case, the information is more qualitative and doesn't help the agent determine a precise measure of its error. In reinforcement learning, this feedback is usually called a reward (sometimes, a negative one is defined as a penalty), and it's useful for understanding whether a certain action performed in a state is positive or not. The sequence of the most useful actions is a policy that the agent has to learn in order to always make the best decision in terms of the highest immediate and cumulative (future) reward. In other words, an action can also be imperfect, but in terms of a global policy, it has to offer the highest total reward. This concept is based on the idea that a rational agent always pursues the objectives that can increase its wealth. The ability to see over a distant horizon is a distinctive mark of advanced agents, while short-sighted ones are often unable to correctly evaluate the consequences of their immediate actions, so their strategies are always sub-optimal.
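The reward/policy loop can be sketched with tabular Q-learning, one classical reinforcement learning algorithm, on a toy environment invented for illustration (a deterministic five-state corridor where only reaching the last state yields a reward):

```python
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
goal = n_states - 1                 # reward only when the last state is reached
q = np.zeros((n_states, n_actions)) # table of action-value estimates
alpha, gamma, epsilon = 0.5, 0.9, 0.2
rng = np.random.default_rng(0)

for _ in range(500):                # episodes
    s = 0
    while s != goal:
        # epsilon-greedy: mostly exploit, sometimes explore
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(q[s]))
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == goal else 0.0
        # Q-learning update: move toward reward + discounted future value
        q[s, a] += alpha * (r + gamma * q[s_next].max() - q[s, a])
        s = s_next

policy = q.argmax(axis=1)
print(policy)   # the learned policy moves right, toward the goal
```

Notice that every intermediate action earns zero immediate reward: the agent still learns to move right because the discounted future reward propagates backwards through the table, which is precisely the "distant horizon" idea above.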
Reinforcement learning is particularly efficient when the environment is not completely deterministic, when it's often very dynamic, and when it's impossible to have a precise error measure. During the last few years, many classical algorithms have been applied to deep neural networks to learn the best policy for playing Atari video games and to teach an agent how to associate the right action with an input representing the state (usually, this is a screenshot or a memory dump).
In the following diagram, there's a schematic representation of a deep neural network that's been trained to play a famous Atari game:
As input, there are one or more subsequent screenshots (this can often be enough to capture the temporal dynamics as well). They are processed using different layers (discussed briefly later) to produce an output that represents the policy for a specific state transition. After applying this policy, the game produces feedback (a reward or penalty), and this result is used to refine the output until it becomes stable (so the states are correctly recognized and the suggested action is always the best one) and the total reward exceeds a predefined threshold.
We're going to discuss some examples of reinforcement learning in Chapter 15, Introducing Neural Networks, and Chapter 16, Advanced Deep Learning Models, which are dedicated to introducing deep learning and TensorFlow. However, some common examples are as follows:
Automatic robot control
Game solving
Stock trade analysis based on feedback signals
It's not surprising that many machine learning algorithms have been defined and refined thanks to the contribution of research in the field of computational neuroscience. On the other hand, the most diffused adaptive systems are animals, whose nervous systems allow effective interaction with the environment. From a mechanistic viewpoint, we need to assume that all the processes working inside the gigantic network of neurons are responsible for all computational features, starting from low-level perception and progressing up to the highest abstractions, such as language, logical reasoning, and artistic creation.
At the beginning of the twentieth century, Ramón y Cajal and Golgi discovered the structure of nerve cells, the neurons, but a full understanding of their purely computational behavior came much later. Both scientists drew sketches representing the input units (dendrites), the body (soma), the main channel (axon), and the output gates (synapses); however, neither the dynamics nor the learning mechanism of a group of cells was fully comprehended. The neuroscientific community was convinced that learning was equivalent to a continuous and structural change, but it wasn't able to define exactly what was changing during a learning process. In 1949, the Canadian psychologist Donald Hebb proposed his famous rule (a broader discussion can be found in Mastering Machine Learning Algorithms, Bonaccorso G., Packt Publishing, 2018 and Theoretical Neuroscience, Dayan P., Abbott L. F., The MIT Press, 2005), which is focused on the synaptic plasticity of neurons. In other words, the changing element is the number and nature of the output gates that connect a unit to a large number of other neurons. Hebb understood that if a neuron produces a spike and a synapse propagates it to another neuron that behaves in the same way, the connection is strengthened; otherwise, it's weakened. This can seem like a very simplistic explanation, but it allows you to understand how elementary neural aggregates can perform operations such as detecting the borders of an object, denoising a signal, or even finding the dimensions with maximum variance (PCA).
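Hebb's rule is simple enough to sketch in a few lines. The following toy (all activity values are invented, and a simple multiplicative form of the rule is assumed) shows a single synaptic weight growing when the pre- and post-synaptic activities are correlated, the "fire together, wire together" behavior described above:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.01   # learning rate
w = 0.0      # strength of the synapse between the two units

# Hypothetical zero-mean activities: post-synaptic activity tracks pre-synaptic
pre = rng.normal(size=1000)
post = 0.8 * pre + rng.normal(scale=0.2, size=1000)

for x, y in zip(pre, post):
    # Hebbian update: the weight change is proportional to the co-activity
    w += eta * x * y

print(w)  # positive: the correlated connection has been strengthened
```

With anti-correlated activities the same loop would drive `w` negative, weakening the connection; stabilized variants of this update (such as Oja's rule) are what link Hebbian learning to PCA.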
The research in this field has continued until today, and many companies, together with high-level universities, are currently involved in studying the computational behavior of the brain using the most advanced neuroimaging technologies available. The discoveries are sometimes surprising, as they confirm what was only imagined and never observed. In particular, some areas of the brain can easily manage supervised and unsupervised problems, while others exploit reinforcement learning to predict the most likely future perceptions. For example, an animal quickly learns to associate the sound of steps with the possibility of facing a predator, and learns how to behave accordingly. In the same way, the inputs coming from its eyes are processed so as to extract all those pieces of information that are useful for detecting objects. This denoising procedure is extremely common in machine learning and, surprisingly, many algorithms achieve the same goal that an animal brain does! Of course, the complexity of human minds is beyond any complete explanation, but the possibility of double-checking these intuitions using computer software has dramatically increased the research speed. At the end of this book, we are going to discuss the basics of deep learning, which is the most advanced branch of machine learning. However, I invite the reader to try to understand all the dynamics (even when they seem very abstract) because the underlying logic is always based on very simple and natural mechanisms, and your brain is very likely to perform the same operations that you're learning about while you read!
During the last few years, thanks to more powerful and cheaper computers, many researchers started adopting complex (deep) neural architectures to achieve goals that were unimaginable only two decades ago. Since 1957, when Rosenblatt invented the first perceptron, interest in neural networks has grown more and more. However, many limitations (concerning memory and CPU speed) prevented massive research and hid lots of potential applications of these kinds of algorithms.
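Rosenblatt's perceptron is simple enough to implement from scratch. The following hedged sketch trains it on the logical AND function (a linearly separable toy problem chosen for illustration; the learning rate and epoch count are arbitrary):

```python
import numpy as np

# Truth table of logical AND: linearly separable, so the perceptron converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights
b = 0.0           # bias
eta = 0.1         # learning rate

for _ in range(20):  # epochs
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        # Perceptron rule: adjust the weights only when a mistake is made
        w += eta * (target - pred) * xi
        b += eta * (target - pred)

preds = [1 if xi @ w + b > 0 else 0 for xi in X]
print(preds)  # [0, 0, 0, 1]
```

The same rule fails on non-separable problems such as XOR, which is exactly the limitation that motivated multi-layer networks and, eventually, deep learning.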
In the last decade, many researchers started training bigger and bigger models, built with several different layers (that's why this approach is called deep learning), in order to solve new challenging problems. The availability of cheap and fast computers allowed them to get results in acceptable timeframes and to use very large datasets (made up of images, texts, and animations). This effort led to impressive results, in particular for classification based on photo elements and real-time intelligent interaction using reinforcement learning.
The idea behind these techniques is to create algorithms that work like a brain, and many important advancements in this field have been achieved thanks to the contribution of neuroscience and cognitive psychology. In particular, there's a growing interest in pattern recognition and associative memories whose structure and functioning are similar to what happens in the neocortex. Such an approach also allows the use of simpler algorithms, called model-free algorithms; these aren't based on any mathematical-physical formulation of a particular problem, but rather on generic learning techniques and repeated experiences.
Of course, testing different architectures and optimization algorithms is rather simpler (and can be done with parallel processing) than defining a complex model (which is also more difficult to adapt to different contexts). Moreover, deep learning has shown better performance than other approaches, even without a context-based model. This suggests that, in many cases, it's better to have a less precise decision made with uncertainty than a precise one determined by the output of a very complex model (which is often not so fast). For animals, this is often a matter of life and death, and if they succeed, it is thanks to an implicit renunciation of some precision.
Common deep learning applications include the following:
Image classification
Real-time visual tracking
Autonomous car driving
Robot control
Logistic optimization
Bioinformatics
Speech recognition and Natural Language Understanding (NLU)
Natural Language Generation (NLG) and speech synthesis
Many of these problems can also be solved by using classic approaches that are sometimes much more complex, but deep learning outperformed them all. Moreover, it allowed extending their application to contexts initially considered extremely complex, such as autonomous cars or real-time visual object identification.
This book covers, in detail, only some classical algorithms; however, there are many resources that can be read both as an introduction and for a more advanced insight.
Another area that can be exploited using machine learning is big data. After the first release of Apache Hadoop, which implemented an efficient MapReduce algorithm, the amount of information managed in different business contexts grew exponentially. At the same time, the opportunity to use it for machine learning purposes arose and several applications such as mass collaborative filtering became a reality.
Imagine an online store with 1 million users and only 1,000 products. Consider a matrix where each user is associated with every product by an implicit or explicit ranking. This matrix will contain 1,000,000 x 1,000 cells, and even if the number of products is very limited, any operation performed on it will be slow and memory-consuming. Instead, using a cluster, together with parallel algorithms, such a problem disappears, and operations with a higher dimensionality can be carried out in a very short time.
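Even before reaching for a cluster, the memory problem can often be tamed by exploiting the structure of the data: a ranking matrix is mostly empty, so a sparse representation stores only the observed ratings. The following hedged sketch (the rating counts are invented; duplicates in the random indices are simply summed) uses SciPy's CSR format:

```python
import numpy as np
from scipy.sparse import csr_matrix

n_users, n_products = 1_000_000, 1_000
n_ratings = 5_000_000   # hypothetical: about 5 ratings per user on average

# Random (user, product, rating) triples standing in for real data
rng = np.random.default_rng(0)
rows = rng.integers(0, n_users, n_ratings)
cols = rng.integers(0, n_products, n_ratings)
vals = rng.integers(1, 6, n_ratings).astype(np.float32)

# Only the non-zero cells are stored
R = csr_matrix((vals, (rows, cols)), shape=(n_users, n_products))

dense_bytes = n_users * n_products * 4   # a dense float32 matrix: ~4 GB
sparse_bytes = R.data.nbytes + R.indices.nbytes + R.indptr.nbytes
print(f"dense: {dense_bytes / 1e9:.1f} GB, sparse: {sparse_bytes / 1e6:.0f} MB")
```

A sparse matrix fits comfortably in the memory of a laptop, while the dense equivalent would already push a single machine toward its limits; when even the sparse form no longer fits, the distributed approach described next becomes necessary.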
Think about training an image classifier with 1 million samples. A single instance needs to iterate several times, processing small batches of pictures. Even if this problem can be performed using a streaming approach (with a limited amount of memory), it's not surprising to wait even for a few days before the model begins to perform well. Adopting a big data approach instead, it's possible to asynchronously train several local models, periodically share the updates, and re-synchronize them all with a master model. This technique has also been exploited to solve some reinforcement learning problems, where many agents (often managed by different threads) played the same game, providing their periodical contribution to a global intelligence.
Not every machine learning problem is suitable for big data, and not all big datasets are really useful when training models. However, their conjunction in particular situations can lead to extraordinary results by removing many limitations that often affect smaller scenarios. Unfortunately, both machine learning and big data are topics subject to continuous hype, hence one of the tasks that an engineer/scientist has to accomplish is understanding when a particular technology is really helpful and when its burden can be heavier than the benefits. Modern computers often have enough resources to process datasets that, a few years ago, were easily considered big data. Therefore, I invite the reader to carefully analyze each situation and think about the problem from a business viewpoint as well. A Spark cluster has a cost that is sometimes completely unjustified. I've personally seen clusters of two medium machines running tasks that a laptop could have carried out even faster. Hence, always perform a descriptive/prescriptive analysis of the problem and the data, trying to focus on the following:
The current situation
Objectives (what do we need to achieve?)
Data and dimensionality (do we work with batch data? Do we have incoming streams?)
Acceptable delays (do we need real-time? Is it possible to process once a day/week?)
Big data solutions are justified, for example, when the following is the case:
The dataset cannot fit in the memory of a high-end machine
The incoming data flow is huge, continuous, and needs prompt computations (for example, clickstreams, web analytics, message dispatching, and so on)
It's not possible to split the data into small chunks because the acceptable delays are minimal (this piece of information must be mathematically quantified)
The operations can be parallelized efficiently (nowadays, many important algorithms have been implemented in distributed frameworks, but there are still tasks that cannot be processed by using parallel architectures)
In the chapter dedicated to recommendation systems, Chapter 12, Introduction to Recommendation Systems, we're going to discuss how to implement collaborative filtering using Apache Spark. The same framework will also be adopted for an example of Naive Bayes classification.
In this chapter, we introduced the concept of adaptive systems; they can learn from their experiences and modify their behavior in order to maximize the possibility of reaching a specific goal. Machine learning is the name given to a set of techniques that allow you to implement adaptive algorithms to make predictions and auto-organize input data according to their common features.
