Reinforcement learning has evolved considerably in the last few years and has proven to be a successful technique for building smart, intelligent AI systems. Keras Reinforcement Learning Projects brings human-level performance to your applications using reinforcement learning algorithms and techniques, coupled with Keras, a library built for fast experimentation.
The book begins by getting you up and running with the concepts of reinforcement learning using Keras. You'll learn how to simulate a random walk using Markov chains and select the best portfolio using dynamic programming (DP) and Python. You'll also explore projects such as forecasting stock prices using Monte Carlo methods, building a delivery vehicle routing application using Temporal Difference (TD) learning algorithms, and balancing a rotating mechanical system using Markov decision processes.
Once you've understood the basics, you'll move on to modeling a Segway, running a robot control system using deep reinforcement learning, and building a handwritten digit recognition model in Python using an image dataset. Finally, you'll learn to play the board game Go with the help of Q-learning and reinforcement learning algorithms.
By the end of this book, you'll not only have gained hands-on experience with the concepts, algorithms, and techniques of reinforcement learning, but you'll also be ready to explore the wider world of AI.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Dayne Castelino
Content Development Editor: Karan Thakkar
Technical Editor: Nilesh Sawakhande
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Jisha Chirayil
Production Coordinator: Jyoti Chauhan
First published: September 2018
Production reference: 2091018
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78934-209-3
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Giuseppe Ciaburro holds a PhD in environmental technical physics and two master's degrees. His research focuses on machine learning applications in the study of urban sound environments. He works at Built Environment Control Laboratory – Università degli Studi della Campania Luigi Vanvitelli (Italy). He has over 15 years of work experience in programming (in Python, R, and MATLAB), first in the field of combustion and then in acoustics and noise control. He has several publications to his credit.
Sudharsan Ravichandiran is a data scientist, researcher, artificial intelligence enthusiast, and YouTuber (search for Sudharsan reinforcement learning). He completed his bachelor's in information technology at Anna University. His research focuses on practical implementations of deep learning and reinforcement learning, including natural language processing and computer vision. He is an open source contributor and loves answering questions on Stack Overflow. He also authored the bestseller Hands-On Reinforcement Learning with Python, published by Packt.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Keras Reinforcement Learning Projects
Packt Upsell
Why subscribe?
Packt.com
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Overview of Keras Reinforcement Learning
Basic concepts of machine learning
Discovering the different types of machine learning
Supervised learning
Unsupervised learning
Reinforcement learning
Building machine learning models step by step
Getting started with reinforcement learning
Agent-environment interface
Markov Decision Process
Discounted cumulative reward
Exploration versus exploitation
Reinforcement learning algorithms
Dynamic Programming
Monte Carlo methods
Temporal difference learning
SARSA
Q-learning
Deep Q-learning
Summary
Simulating Random Walks
Random walks
One-dimensional random walk
Simulating 1D random walk
Markov chains
Stochastic process
Probability calculation
Markov chain definition
Transition matrix
Transition diagram
Weather forecasting with Markov chains
Generating pseudorandom text with Markov chains
Summary
Optimal Portfolio Selection
Dynamic Programming
Divide and conquer versus Dynamic Programming
Memoization
Dynamic Programming in reinforcement-learning applications
Optimizing a financial portfolio
Optimization techniques
Solving the knapsack problem using Dynamic Programming
Different approaches to the problem
Brute force
Greedy algorithms
Dynamic Programming
Summary
Forecasting Stock Market Prices
Monte Carlo methods
Historical background
Basic concepts of the Monte Carlo simulation
Monte Carlo applications
Numerical integration using the Monte Carlo method
Monte Carlo for prediction and control
Amazon stock price prediction using Python
Exploratory analysis
The Geometric Brownian motion model
Monte Carlo simulation
Summary
Delivery Vehicle Routing Application
Temporal difference learning
SARSA
Q-learning
Basics of graph theory
The adjacency matrix
Adjacency lists
Graphs as data structures in Python
Graphs using the NetworkX package
Finding the shortest path
The Dijkstra algorithm
The Dijkstra algorithm using the NetworkX package
The Google Maps algorithm
The Vehicle Routing Problem
Summary
Continuous Balancing of a Rotating Mechanical System
Neural network basic concepts
The Keras neural network model
Classifying breast cancer using the neural network
Deep reinforcement learning
The Keras–RL package
Continuous control with deep reinforcement learning
Summary
Dynamic Modeling of a Segway as an Inverted Pendulum System
How Segways work
System modeling basics
OpenAI Gym
OpenAI Gym methods
OpenAI Gym installation
The CartPole system
Q-learning solution
Deep Q-learning solution
Summary
Robot Control System Using Deep Reinforcement Learning
Robot control
Robotics overview
Robot evolution
First-generation robots
Second-generation robots
Third-generation robots
Fourth-generation robots
Robot autonomy
Robot mobility
Automatic control
Control architectures
The FrozenLake environment
The Q-learning solution
A Deep Q-learning solution
Summary
Handwritten Digit Recognizer
Handwritten digit recognition
Optical Character Recognition
Computer vision
Handwritten digit recognition using an autoencoder
Loading data
Model architecture
Deep autoencoder Q-learning
Summary
Playing the Board Game Go
Game theory
Basic concepts
Game types
Cooperative games
Symmetrical games
Zero-sum games
Sequential games
Game theory applications
Prisoner's dilemma
Stag hunt
Chicken game
The Go game
Basic rules of the game
Scoring rules
The AlphaGo project
The AlphaGo algorithm
Monte Carlo Tree Search
Convolutional networks
Summary
What's Next?
Reinforcement-learning applications in real life
DeepMind AlphaZero
IBM Watson
The Unity Machine Learning Agents toolkit
FANUC industrial robots
Automated trading systems using reinforcement learning
Next steps for reinforcement learning
Inverse reinforcement learning
Learning by demonstration
Deep Deterministic Policy Gradients
Reinforcement learning from human preferences
Hindsight Experience Replay
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
This book brings human-level performance to your applications using the algorithms and techniques of reinforcement learning, coupled with Keras, a library built for fast experimentation. Some of the projects covered are a delivery vehicle routing application, forecasting stock market prices, a robot control system, optimal portfolio selection, and the dynamic modeling of a Segway. Throughout this book, you will get your hands dirty with the most popular algorithms, such as the Markov Decision Process, Monte Carlo methods, and Q-learning, so that you are equipped with the statistical tools needed to get better results.
This book suits data scientists, machine learning engineers, and AI engineers who want to understand reinforcement learning by developing practical projects.
A sound knowledge of machine learning and a basic familiarity with Keras are all you need to get started with this book.
Chapter 1, Overview of Keras Reinforcement Learning, will get you ready to enjoy reinforcement learning using Keras, looking at topics ranging from the basic concepts right to the building of models. By the end of this chapter, you will be ready to dive into working on real-world projects.
Chapter 2, Simulating Random Walks, will have you simulate a random walk using Markov chains through a Python code implementation.
Chapter 3, Optimal Portfolio Selection, explores how to select the optimal portfolio using dynamic programming through a Python code implementation.
Chapter 4, Forecasting Stock Market Prices, guides you in using the Monte Carlo methods to forecast stock market prices.
Chapter 5, Delivery Vehicle Routing Application, shows how to use Temporal Difference (TD) learning algorithms to manage warehouse operations through Python and the Keras library.
Chapter 6, Continuous Balancing of a Rotating Mechanical System, helps you to use deep reinforcement learning methods to balance a rotating mechanical system.
Chapter 7, Dynamic Modeling of a Segway as an Inverted Pendulum System, teaches you the basic concepts of Q-learning and how to use this technique to control a mechanical system.
Chapter 8, A Robot Control System Using Deep Reinforcement Learning, will confront you with the problem of robot navigation in simple maze-like environments where the robot has to rely on its on-board sensors to perform navigation tasks.
Chapter 9, Handwritten Digit Recognizer, shows how to set up a handwritten digit recognition model in Python using an image dataset.
Chapter 10, Playing the Board Game Go, explores how reinforcement learning algorithms were used to address a problem in game theory.
Chapter 11, What's Next?, gives a good understanding of the real-life challenges in building and deploying machine learning models, and explores additional resources and technologies that will help sharpen your machine learning skills.
In this book, reinforcement learning algorithms are implemented in Python. To reproduce the many examples in this book, you need to possess a good knowledge of the Python environment. We have used Python 3.6 and above to build various applications. In that spirit, we have tried to keep all of the code as friendly and readable as possible. We feel that this will enable you to easily understand the code and readily use it in different scenarios.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Keras-Reinforcement-Learning-Projects. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789342093_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "To calculate the logarithm of returns, we will use the log() function from numpy."
A block of code is set as follows:
plt.figure(figsize=(10,5))
plt.plot(dataset)
plt.show()
Any command-line input or output is written as follows:
git clone https://github.com/openai/gym
cd gym
pip install -e .
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Nowadays, most computers are based on symbolic processing: the problem is first encoded in a set of variables and then processed using an explicit algorithm that, for each possible input of the problem, offers an adequate output. However, there are problems for which resolution with an explicit algorithm is inefficient or even unnatural, for example with a speech recognizer; tackling this kind of problem with the classic approach is inefficient. This and other similar problems, such as the autonomous navigation of a robot or voice assistance in performing an operation, are part of a very diverse set of problems that can be addressed directly through solutions based on reinforcement learning.
Reinforcement learning is a very exciting part of machine learning, used in applications ranging from autonomous cars to playing games. Reinforcement learning aims to create algorithms that can learn and adapt to environmental changes. To do this, we use external feedback signals (reward signals) generated by the environment according to the choices made by the algorithm. A correct choice will result in a reward, while an incorrect choice will lead to a penalization of the system, all in order to achieve the best obtainable result.
The topics covered in this chapter are the following:
An overview of machine learning
Reinforcement learning
Markov Decision Process (MDP)
Temporal difference (TD) learning
Q-learning
Deep Q-learning networks
By the end of the chapter, you will be fully introduced to the power of reinforcement learning and will have learned about the different approaches to this technique. Several reinforcement learning methods will be covered.
Machine learning is a multidisciplinary field created at the intersection of, and by the synergy between, computer science, statistics, neurobiology, and control theory. Its emergence has played a key role in several fields and has fundamentally changed the vision of software programming. If the question before was how to program a computer, the question now becomes how computers will program themselves. Thus, it is clear that machine learning is a basic method that allows a computer to have its own intelligence.
As might be expected, machine learning interconnects and coexists with the study of, and research on, human learning. Like humans, whose brain and neurons are the foundation of insight, Artificial Neural Networks (ANNs) are the basis of any decision-making activity of the computer.
Machine learning refers to the ability to learn from experience without any outside help, which is what we humans do, in most cases. Why should it not be the same for machines?
From a set of data, we can use machine learning to find a model that approximates the set. For example, we can identify a correspondence between input variables and output variables for a given system. One way to do this is to postulate the existence of some kind of mechanism for the parametric generation of the data, without, however, knowing the exact values of the parameters. This process typically makes reference to statistical techniques, such as the following:
Induction
Deduction
Abduction
The extraction of general laws from a set of observed data is called induction; it is opposed to deduction, in which, starting from general laws, we want to predict the value of a set of variables. Induction is the fundamental mechanism underlying the scientific method in which we want to derive general laws, typically described in a mathematical language, starting from the observation of phenomena.
This observation includes the measurement of a set of variables and, therefore, the acquisition of data that describes the observed phenomena. Then, the resultant model can be used to make predictions on additional data. The overall process in which, starting from a set of observations, we want to make predictions for new situations, is called inference. Therefore, inductive learning starts from observations arising from the surrounding environment and generalizes obtaining knowledge that will be valid for not-yet-observed cases; at least we hope so.
Inductive learning is based on learning by example: knowledge gained by starting from a set of positive examples that are instances of the concept to be learned, and negative examples that are non-instances of the concept. In this regard, Galileo Galilei's (1564-1642) phrase is particularly clear:
The following diagram consists of a flowchart showing inductive and deductive learning:
A question arises spontaneously: why do machine learning systems work where traditional algorithms fail? The reasons for the failure of traditional algorithms are numerous and typically include the following:
Difficulty in problem formalization: For example, each of us can recognize our friends from their voices, but probably none of us could describe a sequence of computational steps that would enable a computer to recognize a speaker from recorded sound.
High number of variables at play: When considering the problem of recognizing characters from a document, specifying all the parameters that are thought to be involved can be particularly complex. In addition, the same formalization applied in the same context but to a different idiom could prove inadequate.
Lack of theory: Imagine you have to predict exactly the performance of financial markets in the absence of specific mathematical laws.
Need for customization: The distinction between interesting and uninteresting features depends significantly on the perception of the individual user.
A quick analysis of these issues highlights the lack of experience in all cases.
The power of machine learning is due to the quality of its algorithms, which have been improved and updated over the years; these are divided into several main types depending on the nature of the signal used for learning or the type of feedback adopted by the system.
They are as follows:
Supervised learning: The algorithm generates a function that links input values to a desired output through the observation of a set of examples in which each data input has its relative output data; this is used to construct predictive models.
Unsupervised learning: The algorithm tries to derive knowledge from a general input without the help of a set of pre-classified examples; this is used to build descriptive models. A typical example of the application of these algorithms is found in search engines.
Reinforcement learning: The algorithm is able to learn depending on the changes that occur in the environment in which it is performed. In fact, since every action has some effect on the environment concerned, the algorithm is driven by feedback from that same environment. Some of these algorithms are used in speech or text recognition.
The subdivision that we have just proposed does not prohibit the use of hybrid approaches between some or all of these different areas, which have often recorded good results.
Supervised learning is a machine learning technique that aims to program a computer system so that it can resolve the relevant tasks automatically. To do this, the input data is collected in a set I (typically of vectors). Then, the set of output data is fixed as a set O, and finally, a function f is defined that associates each input with the correct answer. Such information is called a training set. This workflow is presented in the following diagram:
All supervised learning algorithms are based on the following thesis: if an algorithm is provided with an adequate number of examples, it will be able to create a derived function B that will approximate the desired function A.
If the approximation of the desired function is adequate, then when the input data is offered to the derived function, this function should be able to provide output responses similar to those provided by the desired function and then acceptable. These algorithms are based on the following concept: similar inputs correspond to similar outputs.
Generally, in the real world, this assumption is not valid; however, some situations exist in which it is acceptable. Clearly, the proper functioning of such algorithms depends significantly on the input data. If there are only a few training inputs, the algorithm might not have enough experience to provide a correct output. Conversely, too many inputs may make it excessively slow, since the derived function generated from a large number of inputs increases the training time.
Moreover, experience shows that this type of algorithm is very sensitive to noise; even a few pieces of incorrect data can make the entire system unreliable and lead to wrong decisions.
In supervised learning, it's possible to split problems based on the nature of the data. If the output value is categorical, such as membership/non-membership of a certain class, then it is a classification problem. If the output is a continuous real value in a certain range, then it is a regression problem.
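To make the supervised setting concrete, here is a minimal sketch in Python using Keras; the dataset, its size, and the network architecture are hypothetical choices for illustration, not taken from the book's projects. It fits a small classifier to labeled examples; for a regression problem, the output layer and loss function would change.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Hypothetical training set: 100 examples with 4 input features and binary labels
X = np.random.rand(100, 4)
y = (X.sum(axis=1) > 2.0).astype(int)

# A small feedforward network: the derived function B approximating the desired function A
model = Sequential()
model.add(Dense(8, activation='relu', input_dim=4))
model.add(Dense(1, activation='sigmoid'))  # categorical output -> classification

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=20, batch_size=10, verbose=0)

# For regression, the last layer would use a linear activation and a loss such as 'mse'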
The aim of unsupervised learning is to automatically extract information from databases. This process occurs without a priori knowledge of the contents to be analyzed. Unlike supervised learning, there is no information on the membership classes of examples, or more generally on the output corresponding to a certain input. The goal is to get a model that is able to discover interesting properties: groups with similar characteristics (clustering), for instance. Search engines are an example of an application of these algorithms. Given one or more keywords, they are able to create a list of links related to our search.
The validity of these algorithms depends on the usefulness of the information they can extract from the databases. These algorithms work by comparing data and looking for similarities or differences. Available data concerns only the set of features that describe each example.
The following diagram shows supervised learning (on the left) and unsupervised learning examples (on the right):
These algorithms show great efficiency with elements of a numeric type, but are much less accurate with non-numeric data. Generally, they work properly in the presence of data that is clearly identifiable and contains an order or a clear grouping.
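As a minimal illustration of unsupervised learning, the following sketch groups unlabeled points with k-means clustering. It uses scikit-learn, which is not required by the book but is a common choice; the data is synthetic and purely illustrative.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled data: two clouds of points in two dimensions
data = np.vstack([np.random.randn(100, 2),
                  np.random.randn(100, 2) + [5, 5]])

# The algorithm groups examples by similarity alone; no output labels are given
kmeans = KMeans(n_clusters=2, random_state=0).fit(data)
print(kmeans.labels_[:10])       # cluster assigned to the first ten points
print(kmeans.cluster_centers_)   # centers of the discovered groups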
Reinforcement learning aims to create algorithms that can learn and adapt to environmental changes. This programming technique is based on the concept of receiving external stimuli, the nature of which depends on the algorithm choices. A correct choice will involve a reward, while an incorrect choice will lead to a penalty. The goal of the system is to achieve the best possible result, of course.
In supervised learning, there is a teacher that tells the system the correct output (learning with a teacher). This is not always possible. Often, we have only qualitative information (sometimes binary, right/wrong, or success/failure).
The available information is called the reinforcement signal. However, this signal does not carry any information on how to update the agent's behavior (that is, its weights); you cannot define a cost function or a gradient. The goal of the system is to create smart agents that have the machinery needed to learn from their own experience.
When developing an application that uses machine learning, we will follow a procedure characterized by the following steps:
Collecting the data: Everything starts from the data, no doubt about it; but one might wonder where so much data comes from. In practice, it is collected through lengthy procedures that may, for example, derive from measurement campaigns or face-to-face interviews. In all cases, the data is collected in a database so that it can then be analyzed to derive knowledge.
Preparing the data: We have collected the data; now we have to prepare it for the next step. Once we have the data, we must make sure it is in a format usable by the algorithm we want to use. To do this, you may need to do some formatting. Recall that some algorithms need data in an integer format, whereas others require data in the form of strings, and others still need it to be in a special format. We will get to this later; the specific formatting is usually simple compared to the data collection.
Exploring the data: At this point, we can look at the data to verify that it is actually usable and that we do not have a bunch of empty values. In this step, through the use of plots, we can recognize patterns and spot data points that are vastly different from the rest of the set. Plotting data in one, two, or three dimensions can also help.
Training the algorithm: Now, let's get serious. In this step, the machine learning algorithm works on the definition of the model and therefore deals with the training. The model starts to extract knowledge from the large amounts of data that we have available, about which nothing has been explained so far. For unsupervised learning, there's no training step because you don't have a target value.
Testing the algorithm: In this step, we use the information learned in the previous step to see if the model actually works. The evaluation of an algorithm is for seeing how well the model approximates the real system. In the case of supervised learning, we have some known values that we can use to evaluate the algorithm. In unsupervised learning, we may need to use some other metrics to evaluate success. In both cases, if we are not satisfied, we can return to the previous steps, change some things, and retry the test.
Evaluating the algorithm: We have reached the point where we can apply what has been done so far. We can assess the approximation ability of the model by applying it to real data. The model, previously trained and tested, is then evaluated in this phase.
Improving algorithm performance: Finally, we can focus on the finishing steps. We've verified that the model works, we have evaluated its performance, and now we are ready to analyze the whole process to identify any possible room for improvement.
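The following minimal sketch runs through these steps with Keras on a small synthetic dataset; the data, architecture, and number of epochs are assumptions made only to illustrate the workflow, not the book's own projects.

import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense

# 1-2. Collect and prepare the data (here, a hypothetical synthetic dataset)
X = np.random.rand(500, 3)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

# 3. Explore the data with a simple plot to spot anomalies or patterns
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.show()

# 4. Train the algorithm
model = Sequential()
model.add(Dense(8, activation='relu', input_dim=3))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=16, verbose=0)

# 5-6. Test and evaluate the model on data it has never seen during training
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', accuracy)

# 7. Improve performance: adjust the architecture, epochs, or data preparation and repeat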
Reinforcement learning aims to create algorithms that can learn and adapt to environmental changes. This programming technique is based on the concept of receiving external stimuli that depend on the actions chosen by the agent. A correct choice will involve a reward, while an incorrect choice will lead to a penalty. The goal of the system is to achieve the best possible result, of course.
These mechanisms derive from the basic concepts of machine learning (learning from experience), in an attempt to simulate human behavior. In fact, in our mind, we activate brain mechanisms that lead us to chase and repeat what, in us, produces feelings of gratification and wellbeing. Whenever we experience moments of pleasure (food, sex, love, and so on), some substances are produced in our brains that work by reinforcing that same stimulus, emphasizing it.
Along with this mechanism of neurochemical reinforcement, an important role is represented by memory. In fact, the memory collects the experience of the subject in order to be able to repeat it in the future. Evolution has endowed us with this mechanism to push us to repeat gratifying experiences in the direction of the best solutions.
This is why we so powerfully remember the most important experiences of our life: experiences, especially those that are powerfully rewarding, are impressed in memories and condition our future explorations. Previously, we have seen that learning from experience can be simulated by a numerical algorithm in various ways, depending on the nature of the signal used for learning and the type of feedback adopted by the system.
The following diagram shows a flowchart that displays an agent's interaction with the environment in a reinforcement learning setting:
The scientific literature has taken an uncertain stance on the classification of reinforcement learning as a paradigm: in the early literature, it was considered a special case of supervised learning, after which it was fully promoted to the third paradigm of machine learning algorithms. It is applied in different contexts in which supervised learning is inefficient: problems of interaction with the environment are a clear example.
The following list shows the steps to follow to correctly apply a reinforcement learning algorithm:
Preparation of the agent
Observation of the environment
Selection of the optimal strategy
Execution of actions
Calculation of the corresponding reward (or penalty)
Development of updating strategies (if necessary)
Repetition of steps 2 through 5 iteratively until the agent learns the optimal strategies
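The loop below is a minimal sketch of these steps using the OpenAI Gym CartPole environment, which is covered later in the book. The agent here simply acts at random, so no real strategy update takes place; the sketch only shows where each step of the list fits in code.

import gym

env = gym.make('CartPole-v0')

for episode in range(5):
    state = env.reset()        # steps 1-2: prepare the agent, observe the environment
    done = False
    total_reward = 0
    while not done:
        action = env.action_space.sample()             # steps 3-4: select and execute an action
        state, reward, done, info = env.step(action)   # step 5: receive the reward (or penalty)
        total_reward += reward
        # step 6: a learning agent would update its strategy here
    print('Episode', episode, 'total reward:', total_reward)   # step 7: repeat over episodes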
Reinforcement learning is based on a theory from psychology, elaborated following a series of experiments performed on animals. In particular, Edward Thorndike (American psychologist) noted that if a cat is given a reward immediately after the execution of a behavior considered correct, then this increases the probability that this behavior will repeat itself. On the other hand, in the face of unwanted behavior, the application of a punishment decreases the probability of a repetition of the error.
On the basis of this theory, after defining a goal to be achieved, reinforcement learning tries to maximize the rewards received for the execution of the action or set of actions that allow the agent to reach the designated goal.
Reinforcement learning can be seen as a special case of the interaction problem, in terms of achieving a goal. The entity that must reach the goal is called an agent. The entity with which the agent must interact is called the environment, which corresponds to everything that is external to the agent.
So far, we have focused on the term agent, but what does it represent? An agent is a software entity that performs services on behalf of another program, usually automatically and invisibly. Such pieces of software are also called smart agents.
What follows is a list of the most important features of an agent:
It can choose from a continuous or a discrete set of actions to perform on the environment.
The action depends on the situation, and the situation is summarized in the system state.
The agent continuously monitors the environment (input) and continuously changes its state.
The choice of action is not trivial and requires a certain degree of intelligence.
The agent has a smart memory.
The agent has a goal-directed behavior, but acts in an uncertain environment that is not known a priori or only partially known. An agent learns by interacting with the environment. Planning can be developed while learning about the environment through the measurements made by the agent itself. This strategy is close to trial-and-error theory.
The agent-environment interaction is continuous: the agent chooses an action to be taken, and in response, the environment changes state, presenting a new situation to be faced.
In the particular case of reinforcement learning, the environment provides the agent with a reward. It is essential that the source of the reward is the environment to avoid the formation, within the agent, of a personal reinforcement mechanism that would compromise learning.
The value of the reward is proportional to the influence that the action has in reaching the objective, so it is positive or high in the case of a correct action, or negative or low for an incorrect action.
In the following list are some examples of real life in which there is an interaction between agent and environment to solve a problem:
A chess player, for each move, has information on the configurations of pieces that can be created, and on the possible countermoves of the opponent.
A little giraffe, in just a few hours, learns to get up and run.
A truly autonomous robot learns to move around a room in order to get out of it; for example, a Roomba robot vacuum.
The parameters of a refinery (oil pressure, flow, and so on) are set in real time so as to obtain the maximum yield or the maximum quality. For example, if particularly dense oil arrives, the flow rate to the plant is modified to allow adequate refining.
All the examples that we examined have the following characteristics in common:
Interaction with the environment
A specific goal that the agent wants to achieve
Uncertainty or partial knowledge of the environment
From the analysis of these examples, it is possible to make the following observations:
The agent learns from its own experience.
The actions change the state (the situation), and thus the possibilities of choice in the future also change (delayed reward).
The effect of an action cannot be completely predicted.
The agent has a global assessment of its behavior.
It must exploit this information to improve its choices. Choices improve with experience.
Problems can have a finite or infinite time horizon.
Essentially, the agent receives sensations from the environment through its sensors. Depending on these sensations, the agent decides what actions to take in the environment. Based on the immediate result of its actions, the agent can be rewarded.
If you want to use an automatic learning method, you need to give a formal description of the environment. It is not important to know exactly how the environment is made; what is interesting is to make general assumptions about the properties that the environment has. In reinforcement learning, it is usually assumed that the environment can be described by an MDP.
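As a rough illustration, an MDP can be written down as a set of states, a set of actions, transition probabilities, rewards, and a discount factor. The tiny two-state example below is hypothetical and only shows one possible way of representing these elements in Python:

# transitions[state][action] = list of (probability, next_state, reward) tuples
states = ['s0', 's1']
actions = ['a0', 'a1']
transitions = {
    's0': {'a0': [(0.8, 's0', 0.0), (0.2, 's1', 1.0)],
           'a1': [(1.0, 's1', 0.5)]},
    's1': {'a0': [(1.0, 's0', 0.0)],
           'a1': [(0.6, 's1', 1.0), (0.4, 's0', -1.0)]},
}
gamma = 0.9  # discount factor applied to future rewards

# Expected immediate reward of taking action 'a1' in state 's1'
expected_r = sum(p * r for p, _, r in transitions['s1']['a1'])
print(expected_r)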
Ideally, the agent must associate each action a with its respective reward r, in order to then choose the most rewarding behavior for achieving the goal. This approach is therefore impracticable for complex problems in which the number of states is particularly high and, consequently, the possible associations increase exponentially.
This problem is called the exploration-exploitation dilemma. Ideally, the agent must explore all possible actions for each state, finding the one that is actually the most rewarding, in order to exploit it when pursuing its goal.
Thus, decision-making involves a fundamental choice:
Exploitation: Make the best decision, given current information
Exploration: Collect more information
In this process, the best long-term strategy can lead to considerable sacrifices in the short term. Therefore, it is necessary to gather enough information to make the best decisions.
The exploration-exploitation dilemma makes itself known whenever we try to learn something new. Often, we have to decide whether to choose what we already know (exploitation), leaving our cultural baggage unaltered, or choosing something new and learning more in this way (exploration). The second choice puts us at the risk of making the wrong choices. This is an experience that we often face; think, for example, about the choices we make in a restaurant when we are asked to choose between the dishes on the menu:
We can choose something that we already know and that, in the past, has given us back a known reward with gratification (exploitation), such as pizza (who does not know the goodness of a margherita pizza?)
We can try something new that we have never tasted before and see what we get (exploration), such as lasagna (alas, not everyone knows the magic taste of a plate of lasagna)
The choice we will make will depend on many boundary conditions: the price of the dishes, the level of hunger, knowledge of the dishes, and so on. What is important is that the study of the best way to make these kinds of choices has demonstrated that optimal learning sometimes requires us to make bad choices. This means that, sometimes, you have to choose to avoid the action you deem most rewarding and take an action that you feel is less rewarding. The logic is that these actions are necessary to obtain a long-term benefit: sometimes, you need to get your hands dirty to learn more.
The following are more examples of adopting this technique for real-life cases:
Selection of a store:
Exploitation: Go to your favorite store
Exploration: Try a new store
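A common way to balance the two choices is an epsilon-greedy rule: with a small probability we explore a random option, otherwise we exploit the option with the best estimated reward so far. The sketch below applies this rule to the store example above; the reward estimates and the value of epsilon are assumptions made purely for illustration.

import random

estimated_reward = {'favorite store': 0.8, 'new store': 0.0}  # assumed initial estimates
epsilon = 0.1  # probability of exploring

def choose_store():
    if random.random() < epsilon:
        return random.choice(list(estimated_reward))           # exploration: try anything
    return max(estimated_reward, key=estimated_reward.get)     # exploitation: best known option

print(choose_store())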