Keras Reinforcement Learning Projects

Giuseppe Ciaburro

Description

Reinforcement learning has evolved considerably in the last few years and has proven to be a successful technique for building smart, intelligent AI systems. Keras Reinforcement Learning Projects brings human-level performance to your applications using the algorithms and techniques of reinforcement learning, coupled with Keras, a library designed for fast experimentation.

The book begins by getting you up and running with the concepts of reinforcement learning using Keras. You’ll learn how to simulate a random walk using Markov chains and select the best portfolio using dynamic programming (DP) and Python. You’ll also explore projects such as forecasting stock prices using Monte Carlo methods, building a delivery vehicle routing application using Temporal Difference (TD) learning algorithms, and balancing a rotating mechanical system using Markov decision processes.

Once you’ve understood the basics, you’ll move on to modeling a Segway as an inverted pendulum system, running a robot control system using deep reinforcement learning, and building a handwritten digit recognition model in Python using an image dataset. Finally, you’ll learn to play the board game Go with the help of Q-learning and reinforcement learning algorithms.

By the end of this book, you’ll not only have gained hands-on experience with the concepts, algorithms, and techniques of reinforcement learning, but you’ll also be all set to explore the world of AI.





Keras Reinforcement Learning Projects
9 projects exploring popular reinforcement learning techniques to build self-learning agents
Giuseppe Ciaburro
BIRMINGHAM - MUMBAI

Keras Reinforcement Learning Projects

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Pravin Dhandre
Acquisition Editor: Dayne Castelino
Content Development Editor: Karan Thakkar
Technical Editor: Nilesh Sawakhande
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Jisha Chirayil
Production Coordinator: Jyoti Chauhan

First published: September 2018

Production reference: 2091018

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78934-209-3

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Giuseppe Ciaburro holds a PhD in environmental technical physics and two master's degrees. His research focuses on machine learning applications in the study of urban sound environments. He works at Built Environment Control Laboratory – Università degli Studi della Campania Luigi Vanvitelli (Italy). He has over 15 years of work experience in programming (in Python, R, and MATLAB), first in the field of combustion and then in acoustics and noise control. He has several publications to his credit.

About the reviewer

Sudharsan Ravichandiran is a data scientist, researcher, artificial intelligence enthusiast, and YouTuber (search for Sudharsan reinforcement learning). He completed his bachelor's degree in information technology at Anna University. His research focuses on practical implementations of deep learning and reinforcement learning, including natural language processing and computer vision. He is an open source contributor and loves answering questions on Stack Overflow. He is also the author of the bestseller Hands-On Reinforcement Learning with Python, published by Packt.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Keras Reinforcement Learning Projects

Packt Upsell

Why subscribe?

Packt.com

Contributors

About the author

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Overview of Keras Reinforcement Learning

Basic concepts of machine learning

Discovering the different types of machine learning

Supervised learning

Unsupervised learning

Reinforcement learning

Building machine learning models step by step

Getting started with reinforcement learning

Agent-environment interface

Markov Decision Process

Discounted cumulative reward

Exploration versus exploitation

Reinforcement learning algorithms

Dynamic Programming

Monte Carlo methods

Temporal difference learning

SARSA

Q-learning

Deep Q-learning

Summary

Simulating Random Walks

Random walks

One-dimensional random walk

Simulating 1D random walk

Markov chains

Stochastic process

Probability calculation

Markov chain definition

Transition matrix

Transition diagram

Weather forecasting with Markov chains

Generating pseudorandom text with Markov chains

Summary

Optimal Portfolio Selection

Dynamic Programming

Divide and conquer versus Dynamic Programming

Memoization

Dynamic Programming in reinforcement-learning applications

Optimizing a financial portfolio

Optimization techniques

Solving the knapsack problem using Dynamic Programming

Different approaches to the problem

Brute force

Greedy algorithms

Dynamic Programming

Summary

Forecasting Stock Market Prices

Monte Carlo methods

Historical background

Basic concepts of the Monte Carlo simulation

Monte Carlo applications

Numerical integration using the Monte Carlo method

Monte Carlo for prediction and control

Amazon stock price prediction using Python

Exploratory analysis

The Geometric Brownian motion model

Monte Carlo simulation

Summary

Delivery Vehicle Routing Application

Temporal difference learning

SARSA

Q-learning

Basics of graph theory

The adjacency matrix

Adjacency lists

Graphs as data structures in Python

Graphs using the NetworkX package

Finding the shortest path

The Dijkstra algorithm

The Dijkstra algorithm using the NetworkX package

The Google Maps algorithm

The Vehicle Routing Problem

Summary

Continuous Balancing of a Rotating Mechanical System

Neural network basic concepts

The Keras neural network model

Classifying breast cancer using the neural network

Deep reinforcement learning

The Keras–RL package

Continuous control with deep reinforcement learning

Summary

Dynamic Modeling of a Segway as an Inverted Pendulum System

How Segways work

System modeling basics

OpenAI Gym

OpenAI Gym methods

OpenAI Gym installation

The CartPole system

Q-learning solution

Deep Q-learning solution

Summary

Robot Control System Using Deep Reinforcement Learning

Robot control

Robotics overview

Robot evolution

First-generation robots

Second-generation robots

Third-generation robots

Fourth-generation robots

Robot autonomy

Robot mobility

Automatic control

Control architectures

The FrozenLake environment

The Q-learning solution

A Deep Q-learning solution

Summary

Handwritten Digit Recognizer

Handwritten digit recognition

Optical Character Recognition

Computer vision

Handwritten digit recognition using an autoencoder

Loading data

Model architecture

Deep autoencoder Q-learning

Summary

Playing the Board Game Go

Game theory

Basic concepts

Game types

Cooperative games

Symmetrical games

Zero-sum games

Sequential games

Game theory applications

Prisoner's dilemma

Stag hunt

Chicken game

The Go game

Basic rules of the game

Scoring rules

The AlphaGo project

The AlphaGo algorithm

Monte Carlo Tree Search

Convolutional networks

Summary

What's Next?

Reinforcement-learning applications in real life

DeepMind AlphaZero

IBM Watson

The Unity Machine Learning Agents toolkit

FANUC industrial robots

Automated trading systems using reinforcement learning

Next steps for reinforcement learning

Inverse reinforcement learning

Learning by demonstration

Deep Deterministic Policy Gradients

Reinforcement learning from human preferences

Hindsight Experience Replay

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

This book brings human-level performance into your applications using the algorithms and techniques of reinforcement learning, coupled with Keras, a library for fast experimentation. Some of the projects covered are a delivery vehicle routing application, forecasting stock market prices, a robot control system, optimal portfolio selection, and the dynamic modeling of a Segway. Throughout this book, you will get your hands dirty with the most popular algorithms, such as the Markov Decision Process, the Monte Carlo method, and Q-learning, so that you will be equipped with the statistical tools needed to get better results.

Who this book is for

This book suits data scientists, machine learning engineers, and AI engineers who want to understand reinforcement learning by developing practical projects.

A sound knowledge of machine learning and a basic familiarity with Keras is all you need to get started with this book.

What this book covers

Chapter 1, Overview of Keras Reinforcement Learning, will get you ready to enjoy reinforcement learning using Keras, looking at topics ranging from the basic concepts right to the building of models. By the end of this chapter, you will be ready to dive into working on real-world projects.

Chapter 2, Simulating Random Walks, will have you simulate a random walk using Markov chains through a Python code implementation.

Chapter 3, Optimal Portfolio Selection, explores how to select the optimal portfolio using dynamic programming through a Python code implementation.

Chapter 4, Forecasting Stock Market Prices, guides you in using the Monte Carlo methods to forecast stock market prices.

Chapter 5, Delivery Vehicle Routing Application, shows how to use Temporal Difference (TD) learning algorithms to manage warehouse operations through Python and the Keras library.

Chapter 6, Continuous Balancing of a Rotating Mechanical System, helps you to use deep reinforcement learning methods to balance a rotating mechanical system.

Chapter 7, Dynamic Modeling of a Segway as an Inverted Pendulum System, teaches you the basic concepts of Q-learning and how to use this technique to control a mechanical system.

Chapter 8, A Robot Control System Using Deep Reinforcement Learning, will confront you with the problem of robot navigation in simple maze-like environments where the robot has to rely on its on-board sensors to perform navigation tasks.

Chapter 9, Handwritten Digit Recognizer, shows how to set up a handwritten digit recognition model in Python using an image dataset.

Chapter 10, Playing the Board Game Go, explores how reinforcement learning algorithms were used to address a problem in game theory.

Chapter 11, What's Next?, gives a good understanding of the real-life challenges in building and deploying machine learning models, and explores additional resources and technologies that will help sharpen your machine learning skills.

To get the most out of this book

In this book, reinforcement learning algorithms are implemented in Python. To reproduce the many examples in this book, you need to possess a good knowledge of the Python environment. We have used Python 3.6 and above to build various applications. In that spirit, we have tried to keep all of the code as friendly and readable as possible. We feel that this will enable you to easily understand the code and readily use it in different scenarios.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packt.com.

Select the SUPPORT tab.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Keras-Reinforcement-Learning-Projects. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789342093_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "To calculate the logarithm of returns, we will use the log() function from numpy."

A block of code is set as follows:

plt.figure(figsize=(10,5))
plt.plot(dataset)
plt.show()

Any command-line input or output is written as follows:

git clone https://github.com/openai/gym

cd gym

pip install -e .

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Overview of Keras Reinforcement Learning

Nowadays, most computing is based on symbolic processing: the problem is first encoded in a set of variables and then processed using an explicit algorithm that, for each possible input, produces an adequate output. However, there are problems for which resolution with an explicit algorithm is inefficient or even unnatural, such as speech recognition; tackling this kind of problem with the classic approach is inefficient. This and similar problems, such as the autonomous navigation of a robot or voice assistance in performing an operation, belong to a very diverse set of problems that can be addressed directly through solutions based on reinforcement learning.

Reinforcement learning is a very exciting part of machine learning, used in applications ranging from autonomous cars to playing games. Reinforcement learning aims to create algorithms that can learn and adapt to environmental changes. To do this, we use external feedback signals (reward signals) generated by the environment according to the choices made by the algorithm. A correct choice will result in a reward, while an incorrect choice will lead to a penalization of the system. All of this is in order to achieve the best result obtainable.

The topics covered in this chapter are the following:

An overview of machine learning

Reinforcement learning

Markov Decision Process (MDP)

Temporal Difference (TD) learning

Q-learning

Deep Q-learning networks

At the end of the chapter, you will be fully introduced to the power of reinforcement learning and will learn the different approaches to this technique. Several reinforcement learning methods will be covered.

Basic concepts of machine learning

Machine learning is a multidisciplinary field created at the intersection of, and by the synergy between, computer science, statistics, neurobiology, and control theory. Its emergence has played a key role in several fields and has fundamentally changed the vision of software programming. If the question before was how to program a computer, the question now becomes how computers will program themselves. Thus, it is clear that machine learning is a basic method that allows a computer to have its own intelligence.

As might be expected, machine learning interconnects and coexists with the study of, and research on, human learning. Like humans, whose brain and neurons are the foundation of insight, Artificial Neural Networks (ANNs) are the basis of any decision-making activity of the computer.

Machine learning refers to the ability to learn from experience without any outside help, which is what we humans do, in most cases. Why should it not be the same for machines?

From a set of data, we can find a model that approximates the set through the use of machine learning. For example, we can identify a correspondence between input variables and output variables for a given system. One way to do this is to postulate the existence of some mechanism for the parametric generation of the data, without, however, knowing the exact values of the parameters. This process typically makes reference to statistical techniques, such as the following:

Induction

Deduction

Abduction

The extraction of general laws from a set of observed data is called induction; it is opposed to deduction, in which, starting from general laws, we want to predict the value of a set of variables. Induction is the fundamental mechanism underlying the scientific method in which we want to derive general laws, typically described in a mathematical language, starting from the observation of phenomena.

This observation includes the measurement of a set of variables and, therefore, the acquisition of data that describes the observed phenomena. Then, the resultant model can be used to make predictions on additional data. The overall process in which, starting from a set of observations, we want to make predictions for new situations is called inference. Therefore, inductive learning starts from observations arising from the surrounding environment and generalizes, obtaining knowledge that will be valid for cases not yet observed; at least, we hope so.
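To make this induce-then-infer cycle concrete, here is a minimal Python sketch; the observations and the choice of a straight-line model are purely illustrative assumptions. A general law is induced from a handful of data points and is then used to predict an unseen case.

import numpy as np

# Hypothetical observations of an input variable x and an output variable y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Induction: derive a general law (here, a straight line y = a*x + b) from the observed data
a, b = np.polyfit(x, y, deg=1)
print(f"Induced model: y = {a:.2f}*x + {b:.2f}")

# Inference: use the induced model to predict the output for a not-yet-observed input
x_new = 6.0
print(f"Predicted y for x = {x_new}: {a * x_new + b:.2f}")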

Inductive learning is based on learning by example: knowledge gained by starting from a set of positive examples that are instances of the concept to be learned, and negative examples that are non-instances of the concept. In this regard, Galileo Galilei's (1564-1642) phrase is particularly clear:

"Knowledge forms part of the experience from which hypotheses are derived, based on quantitative data, which must be verified through experiments, also mentally, understood as artificial relationships between quantified variables, to arrive at the formulation of the law in the form of an equation."

The following diagram consists of a flowchart showing inductive and deductive learning:

A question arises spontaneously: why do machine learning systems work where traditional algorithms fail? The reasons for the failure of traditional algorithms are numerous and typically include the following:

Difficulty in problem formalization: For example, each of us can recognize our friends by their voices, but probably none of us can describe a sequence of computational steps that enables the speaker to be recognized from recorded sound.

High number of variables at play: When considering the problem of recognizing characters in a document, specifying all the parameters that are thought to be involved can be particularly complex. In addition, the same formalization applied in the same context but to a different language could prove inadequate.

Lack of theory: Imagine you have to predict exactly the performance of financial markets in the absence of specific mathematical laws.

Need for customization: The distinction between interesting and uninteresting features depends significantly on the perception of the individual user.

A quick analysis of these issues highlights the lack of experience in all cases.

Discovering the different types of machine learning

The power of machine learning is due to the quality of its algorithms, which have been improved and updated over the years; these are divided into several main types depending on the nature of the signal used for learning or the type of feedback adopted by the system.

They are as follows:

Supervised learning: The algorithm generates a function that links input values to a desired output through the observation of a set of examples in which each data input has its relative output data; this is used to construct predictive models.

Unsupervised learning: The algorithm tries to derive knowledge from a general input without the help of a set of pre-classified examples, and is used to build descriptive models. A typical example of the application of these algorithms is found in search engines.

Reinforcement learning: The algorithm is able to learn depending on the changes that occur in the environment in which it is performed. In fact, since every action has some effect on the environment concerned, the algorithm is driven by feedback from that same environment. Some of these algorithms are used in speech or text recognition.

The subdivision that we have just proposed does not prohibit the use of hybrid approaches between some or all of these different areas, which have often recorded good results.

Supervised learning

Supervised learning is a machine learning technique that aims to program a computer system so that it can resolve the relevant tasks automatically. To do this, the input data is collected in a set I (typically of vectors), the set of output data is fixed as set O, and finally a function f is defined that associates each input with the correct answer. Such a collection of examples is called a training set. This workflow is presented in the following diagram:

All supervised learning algorithms are based on the following thesis: if an algorithm is provided with an adequate number of examples, it will be able to create a derived function B that approximates the desired function A.

If the approximation of the desired function is adequate, then when input data is offered to the derived function, it should be able to provide output responses similar to those provided by the desired function, and therefore acceptable. These algorithms are based on the following concept: similar inputs correspond to similar outputs.

Generally, in the real world, this assumption is not valid; however, some situations exist in which it is acceptable. Clearly, the proper functioning of such algorithms depends significantly on the input data. If there are only a few training inputs, the algorithm may not have enough experience to provide a correct output. Conversely, too many inputs may make it excessively slow, since a derived function generated from a large number of inputs takes longer to train.

Moreover, experience shows that this type of algorithm is very sensitive to noise; even a few pieces of incorrect data can make the entire system unreliable and lead to wrong decisions.

In supervised learning, it's possible to split problems based on the nature of the data. If the output value is categorical, such as membership/non-membership of a certain class, then it is a classification problem. If the output is a continuous real value in a certain range, then it is a regression problem.
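As a minimal illustration of this distinction, the following sketch uses randomly generated, purely illustrative data and a small Keras feed-forward network; the layer sizes and training settings are arbitrary assumptions. The same kind of model is adapted to the two cases: a sigmoid output with a cross-entropy loss for classification, and a linear output with a mean squared error loss for regression.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Purely illustrative training set: 100 examples with 4 input features
X = np.random.rand(100, 4)
y_class = np.random.randint(0, 2, size=(100, 1))  # categorical target: classification
y_reg = np.random.rand(100, 1)                    # continuous target: regression

# Classification: sigmoid output and a cross-entropy loss
clf = Sequential([Dense(8, activation='relu', input_dim=4),
                  Dense(1, activation='sigmoid')])
clf.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
clf.fit(X, y_class, epochs=5, verbose=0)

# Regression: linear output and a mean squared error loss
reg = Sequential([Dense(8, activation='relu', input_dim=4),
                  Dense(1)])
reg.compile(optimizer='adam', loss='mse')
reg.fit(X, y_reg, epochs=5, verbose=0)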

Unsupervised learning

The aim of unsupervised learning is to automatically extract information from databases. This process occurs without a priori knowledge of the contents to be analyzed. Unlike supervised learning, there is no information on the membership classes of examples, or more generally on the output corresponding to a certain input. The goal is to get a model that is able to discover interesting properties: groups with similar characteristics (clustering), for instance. Search engines are an example of an application of these algorithms. Given one or more keywords, they are able to create a list of links related to our search.

The validity of these algorithms depends on the usefulness of the information they can extract from the databases. These algorithms work by comparing data and looking for similarities or differences. Available data concerns only the set of features that describe each example.

The following diagram shows supervised learning (on the left) and unsupervised learning examples (on the right):

These algorithms are highly efficient with numeric data, but much less accurate with non-numeric data. Generally, they work properly in the presence of data that is clearly identifiable and contains an order or a clear grouping.
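The following minimal sketch illustrates the idea with k-means clustering from scikit-learn; the two artificial clouds of points are an assumption made purely for illustration. No labels are supplied, and the algorithm discovers the grouping on its own.

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled, purely illustrative data: two clouds of points in two dimensions
rng = np.random.RandomState(0)
data = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
                  rng.normal(loc=3.0, scale=0.5, size=(50, 2))])

# No target values are provided: the algorithm groups the points by similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:10])      # cluster assigned to the first ten points
print(kmeans.cluster_centers_)  # centers of the discovered groups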

Reinforcement learning

Reinforcement learning aims to create algorithms that can learn and adapt to environmental changes. This programming technique is based on the concept of receiving external stimuli, the nature of which depends on the algorithm choices. A correct choice will involve a reward, while an incorrect choice will lead to a penalty. The goal of the system is to achieve the best possible result, of course.

In supervised learning, there is a teacher that tells the system the correct output (learning with a teacher). This is not always possible. Often, we have only qualitative information (sometimes binary, right/wrong, or success/failure).

The information available is called the reinforcement signal. However, it does not give any indication of how to update the agent's behavior (that is, its weights); you cannot define a cost function or a gradient. The goal of the system is to create smart agents that have machinery able to learn from their experience.

Building machine learning models step by step

When developing an application that uses machine learning, we will follow a procedure characterized by the following steps:

Collecting the data: Everything starts from the data, no doubt about it; but one might wonder where so much data comes from. In practice, it is collected through lengthy procedures that may, for example, derive from measurement campaigns or face-to-face interviews. In all cases, the data is collected in a database so that it can then be analyzed to derive knowledge.

Preparing the data: We have collected the data; now, we have to prepare it for the next step. Once we have this data, we must make sure it is in a format usable by the algorithm we want to use. To do this, you may need to do some formatting. Recall that some algorithms need data in an integer format, whereas others require data in the form of strings, and others still need it to be in a special format. We will get to this later, but the specific formatting is usually simple compared to the data collection.

Exploring the data: At this point, we can look at the data to verify that it is actually usable and that we do not have a bunch of empty values. In this step, through the use of plots, we can recognize patterns and spot data points that are vastly different from the rest of the set. Plotting data in one, two, or three dimensions can also help.

Training the algorithm: Now, let's get serious. In this step, the machine learning algorithm works on the definition of the model and therefore deals with the training. The model starts to extract knowledge from the large amounts of data that we have available, and from which nothing has been explained so far. For unsupervised learning, there's no training step because you don't have a target value.

Testing the algorithm: In this step, we use the information learned in the previous step to see if the model actually works. The evaluation of an algorithm is for seeing how well the model approximates the real system. In the case of supervised learning, we have some known values that we can use to evaluate the algorithm. In unsupervised learning, we may need to use some other metrics to evaluate success. In both cases, if we are not satisfied, we can return to the previous steps, change some things, and retry the test.

Evaluating the algorithm: We have reached the point where we can apply what has been done so far. We can assess the approximation ability of the model by applying it to real data. The model, previously trained and tested, is then put to use in this phase.

Improving algorithm performance: Finally, we can focus on the finishing steps. We've verified that the model works, we have evaluated the performance, and now we are ready to analyze the whole process to identify any possible room for improvement.

Before applying the machine learning algorithm to our data, it is appropriate to devote some time to the workflow setting.
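As a rough sketch of how these steps map onto code, the following Python outline uses a hypothetical random dataset and a simple scikit-learn classifier; it is meant only to show the shape of the workflow, not a complete project.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2: collect and prepare the data (here, purely illustrative random data)
X = np.random.rand(200, 3)
y = (X.sum(axis=1) > 1.5).astype(int)

# Step 3: explore the data (shapes, missing values, simple statistics)
print(X.shape, np.isnan(X).any(), y.mean())

# Step 4: train the algorithm on a portion of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Steps 5-6: test and evaluate the algorithm on data it has never seen
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 7: improve performance by tuning parameters, adding features, and repeating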

Getting started with reinforcement learning

Reinforcement learning aims to create algorithms that can learn and adapt to environmental changes. This programming technique is based on the concept of receiving external stimuli that depend on the actions chosen by the agent. A correct choice will involve a reward, while an incorrect choice will lead to a penalty. The goal of the system is to achieve the best possible result, of course.

These mechanisms derive from the basic concepts of machine learning (learning from experience), in an attempt to simulate human behavior. In fact, in our mind, we activate brain mechanisms that lead us to chase and repeat what, in us, produces feelings of gratification and wellbeing. Whenever we experience moments of pleasure (food, sex, love, and so on), some substances are produced in our brains that work by reinforcing that same stimulus, emphasizing it.

Along with this mechanism of neurochemical reinforcement, an important role is represented by memory. In fact, the memory collects the experience of the subject in order to be able to repeat it in the future. Evolution has endowed us with this mechanism to push us to repeat gratifying experiences in the direction of the best solutions.

This is why we so powerfully remember the most important experiences of our life: experiences, especially those that are powerfully rewarding, are impressed in memories and condition our future explorations. Previously, we have seen that learning from experience can be simulated by a numerical algorithm in various ways, depending on the nature of the signal used for learning and the type of feedback adopted by the system.

The following diagram shows a flowchart that displays an agent's interaction with the environment in a reinforcement learning setting:

Scientific literature has taken an uncertain stance on the classification of learning by reinforcement as a paradigm. In fact, in the initial phase of the literature, it was considered a special case of supervised learning, after which it was fully promoted as the third paradigm of machine learning algorithms. It is applied in different contexts in which supervised learning is inefficient: the problems of interaction with the environment are a clear example.

The following list shows the steps to follow to correctly apply a reinforcement learning algorithm:

Preparation of the agent

Observation of the environment

Selection of the optimal strategy

Execution of actions

Calculation of the corresponding reward (or penalty)

Development of updating strategies (if necessary)

Repetition of steps 2 through 5 iteratively until the agent learns the optimal strategies
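The following minimal sketch shows how these steps translate into code for a simple tabular Q-learning agent, assuming the classic OpenAI Gym API used later in this book (where reset() returns a state and step() returns four values); the environment name and the learning parameters are illustrative choices.

import gym
import numpy as np

env = gym.make('FrozenLake-v0')  # step 1: prepare the agent and the environment
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma = 0.8, 0.95         # illustrative learning rate and discount factor

for episode in range(2000):
    state = env.reset()          # step 2: observe the environment
    done = False
    while not done:
        # step 3: select a strategy (greedy choice with a little noise to keep exploring)
        action = np.argmax(Q[state] + np.random.randn(env.action_space.n) * 0.1)
        next_state, reward, done, info = env.step(action)  # steps 4-5: act and collect the reward
        # step 6: update the strategy from the observed reward
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state       # step 7: repeat until the agent learns the optimal strategy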

Reinforcement learning is based on a theory from psychology, elaborated following a series of experiments performed on animals. In particular, Edward Thorndike (American psychologist) noted that if a cat is given a reward immediately after the execution of a behavior considered correct, then this increases the probability that this behavior will repeat itself. On the other hand, in the face of unwanted behavior, the application of a punishment decreases the probability of a repetition of the error.

On the basis of this theory, after defining a goal to be achieved, reinforcement learning tries to maximize the rewards received for the execution of the action, or set of actions, that allows the designated goal to be reached.

Agent-environment interface

Reinforcement learning can be seen as a special case of the interaction problem, in terms of achieving a goal. The entity that must reach the goal is called an agent. The entity with which the agent must interact is called the environment, which corresponds to everything that is external to the agent.

So far, we have focused on the term agent, but what does it represent? A software agent is an entity that performs services on behalf of another program, usually automatically and invisibly. These pieces of software are also called smart agents.

What follows is a list of the most important features of an agent:

It can choose an action on the environment from either a continuous or a discrete set.

The action depends on the situation. The situation is summarized in the system state.

The agent continuously monitors the environment (input) and continuously changes its state.

The choice of the action is not trivial and requires a certain degree of intelligence.

The agent has a smart memory.

The agent has a goal-directed behavior, but acts in an uncertain environment that is not known a priori or only partially known. An agent learns by interacting with the environment. Planning can be developed while learning about the environment through the measurements made by the agent itself. This strategy is close to trial-and-error theory.

Trial and error is a fundamental method of problem solving. It is characterized by repeated, varied attempts that are continued until success, or until the agent stops trying.

The agent-environment interaction is continuous: the agent chooses an action to be taken, and in response, the environment changes state, presenting a new situation to be faced.

In the particular case of reinforcement learning, the environment provides the agent with a reward. It is essential that the source of the reward is the environment to avoid the formation, within the agent, of a personal reinforcement mechanism that would compromise learning.

The value of the reward is proportional to the influence that the action has in reaching the objective, so it is positive or high in the case of a correct action, or negative or low for an incorrect action.

In the following list are some examples of real life in which there is an interaction between agent and environment to solve a problem:

A chess player, for each move, has information on the configurations of pieces that can be created, and on the possible countermoves of the opponent.

A little giraffe, in just a few hours, learns to get up and run.

A truly autonomous robot learns to move around a room in order to get out of it; for example, the Roomba robot vacuum.

The parameters of a refinery (oil pressure, flow, and so on) are set in real time, so as to obtain the maximum yield or maximum quality. For example, if particularly dense oil arrives, then the flow rate to the plant is modified to allow an adequate refining.

All the examples that we examined have the following characteristics in common:

Interaction with the environment

A specific goal that the agent wants to achieve

Uncertainty or partial knowledge of the environment

From the analysis of these examples, it is possible to make the following observations:

The agent learns from its own experience.

The actions change the state (the situation), and the possibilities for future choices change (delayed reward).

The effect of an action cannot be completely predicted.

The agent has a global assessment of its behavior.

It must exploit this information to improve its choices. Choices improve with experience.

Problems can have a finite or infinite time horizon.

Essentially, the agent receives sensations from the environment through its sensors. Depending on these sensations, the agent decides what actions to take in the environment. Based on the immediate result of its actions, the agent can be rewarded.

If you want to use an automatic learning method, you need to give a formal description of the environment. It is not important to know exactly how the environment is made; what is interesting is to make general assumptions about the properties that the environment has. In reinforcement learning, it is usually assumed that the environment can be described by an MDP.
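To make this assumption concrete, the following sketch encodes a tiny, entirely hypothetical MDP as plain Python dictionaries, listing for each state and action the possible next states with their transition probabilities and rewards; sampling from it plays the role of the environment.

import random

# Hypothetical two-state MDP: P[state][action] = [(probability, next_state, reward), ...]
P = {
    'idle': {
        'work': [(0.9, 'working', 1.0), (0.1, 'idle', 0.0)],
        'rest': [(1.0, 'idle', 0.0)],
    },
    'working': {
        'work': [(0.8, 'working', 2.0), (0.2, 'idle', -1.0)],
        'rest': [(1.0, 'idle', 0.5)],
    },
}

def step(state, action):
    # Sample a transition: this is the environment's response to the agent's action
    transitions = P[state][action]
    weights = [p for p, _, _ in transitions]
    _, next_state, reward = random.choices(transitions, weights=weights, k=1)[0]
    return next_state, reward

print(step('idle', 'work'))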

Exploration versus exploitation

Ideally, the agent should associate each action with its respective reward r, in order to then choose the most rewarding behavior for achieving the goal. This approach, however, is impracticable for complex problems in which the number of states is particularly high and, consequently, the number of possible associations increases exponentially.

This problem is called the exploration-exploitation dilemma. Ideally, the agent must explore all possible actions for each state, finding the one that is actually rewarded the most, in order to exploit it in achieving its goal.

Thus, decision-making involves a fundamental choice:

Exploitation: Make the best decision, given current information

Exploration: Collect more information

In this process, the best long-term strategy can lead to considerable sacrifices in the short term. Therefore, it is necessary to gather enough information to make the best decisions.

The exploration-exploitation dilemma makes itself known whenever we try to learn something new. Often, we have to decide whether to choose what we already know (exploitation), leaving our cultural baggage unaltered, or choosing something new and learning more in this way (exploration). The second choice puts us at the risk of making the wrong choices. This is an experience that we often face; think, for example, about the choices we make in a restaurant when we are asked to choose between the dishes on the menu:

We can choose something that we already know and that, in the past, has given us back a known reward with gratification (exploitation), such as pizza (who does not know the goodness of a margherita pizza?)

We can try something new that we have never tasted before and see what we get (exploration), such as lasagna (alas, not everyone knows the magic taste of a plate of lasagna)

The choice we will make will depend on many boundary conditions: the price of the dishes, the level of hunger, knowledge of the dishes, and so on. What is important is that the study of the best way to make these kinds of choices has demonstrated that optimal learning sometimes requires us to make bad choices. This means that, sometimes, you have to choose to avoid the action you deem most rewarding and take an action that you feel is less rewarding. The logic is that these actions are necessary to obtain a long-term benefit: sometimes, you need to get your hands dirty to learn more.
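A common, simple way to manage this trade-off is the epsilon-greedy rule: with a small probability the agent explores a random action, and otherwise it exploits the action it currently believes is best. The following self-contained sketch uses invented action values purely for illustration.

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best known action
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # exploration: try something new
    return int(np.argmax(q_values))              # exploitation: best known action

# Hypothetical estimated values of four actions (for example, four dishes on a menu)
q_values = [1.0, 0.5, 2.0, 0.2]
choices = [epsilon_greedy(q_values, epsilon=0.1) for _ in range(1000)]
print("Fraction of greedy choices:", choices.count(2) / 1000)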

The following are more examples of this exploration-exploitation trade-off in real-life cases:

Selection of a store:

Exploitation: Go to your favorite store

Exploration: Try a new store