Develop self-learning algorithms and agents using TensorFlow and other Python tools, frameworks, and libraries
Key Features
Book Description
Reinforcement Learning (RL) is a popular and promising branch of AI that involves making smarter models and agents that can automatically determine ideal behavior based on changing requirements. This book will help you master RL algorithms and understand their implementation as you build self-learning agents.
Starting with an introduction to the tools, libraries, and setup needed to work in the RL environment, this book covers the building blocks of RL and delves into value-based methods, such as the application of Q-learning and SARSA algorithms. You'll learn how to use a combination of Q-learning and neural networks to solve complex problems. Furthermore, you'll study policy gradient methods, TRPO, and PPO, to improve performance and stability, before moving on to the DDPG and TD3 deterministic algorithms. This book also covers how imitation learning techniques work and how DAgger can teach an agent to fly. You'll discover evolutionary strategies and black-box optimization techniques, and see how they can improve RL algorithms. Finally, you'll get to grips with exploration approaches, such as UCB and UCB1, and develop a meta-algorithm called ESBAS.
By the end of the book, you'll have worked with key RL algorithms to overcome challenges in real-world applications, and be part of the RL research community.
What you will learn
Who this book is for
If you are an AI researcher, deep learning user, or anyone who wants to learn reinforcement learning from scratch, this book is for you. You'll also find this reinforcement learning book useful if you want to learn about the advancements in the field. Working knowledge of Python is necessary.
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Winston Christopher
Content Development Editor: Roshan Kumar
Senior Editor: Jack Cummings
Technical Editor: Joseph Sunil
Copy Editor: Safis Editing
Project Coordinator: Kirti Pisat
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Nilesh Mohite
First published: October 2019
Production reference: 1181019
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78913-111-6
www.packt.com
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Andrea Lonza is a deep learning engineer with a great passion for artificial intelligence and a desire to create machines that act intelligently. He has acquired expert knowledge in reinforcement learning, natural language processing, and computer vision through academic and industrial machine learning projects. He has also participated in several Kaggle competitions, achieving high results. He is always looking for compelling challenges and loves to prove himself.
Greg Walters has been involved with computers and computer programming since 1972. He is extremely well-versed in Visual Basic, Visual Basic .NET, Python and SQL using MySQL, SQLite, Microsoft SQL Server, Oracle, C++, Delphi, Modula-2, Pascal, C, 80x86 Assembler, COBOL, and Fortran. He is a programming trainer and has trained numerous people on many pieces of computer software, including MySQL, Open Database Connectivity, Quattro Pro, Corel Draw!, Paradox, Microsoft Word, Excel, DOS, Windows 3.11, Windows for Workgroups, Windows 95, Windows NT, Windows 2000, Windows XP, and Linux. He is retired and, in his spare time, is a musician and loves to cook, but he is also open to working as a freelancer on various projects.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Reinforcement Learning Algorithms with Python
Dedication
About Packt
Why subscribe?
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Section 1: Algorithms and Environments
The Landscape of Reinforcement Learning
An introduction to RL
Comparing RL and supervised learning
History of RL
Deep RL
Elements of RL
Policy
The value function
Reward
Model
Applications of RL
Games 
Robotics and Industry 4.0
Machine learning
Economics and finance
Healthcare
Intelligent transportation systems
Energy optimization and smart grid
Summary
Questions
Further reading
Implementing RL Cycle and OpenAI Gym
Setting up the environment
Installing OpenAI Gym 
Installing Roboschool 
OpenAI Gym and RL cycles
Developing an RL cycle
Getting used to spaces
Development of ML models using TensorFlow
Tensor
Constant
Placeholder
Variable
Creating a graph
Simple linear regression example
Introducing TensorBoard
Types of RL environments
Why different environments?
Open source environments
Summary
Questions
Further reading
Solving Problems with Dynamic Programming
MDP
Policy
Return
Value functions
Bellman equation
Categorizing RL algorithms
Model-free algorithms
Value-based algorithms
Policy gradient algorithms
Actor-Critic algorithms
Hybrid algorithms
Model-based RL
Algorithm diversity
Dynamic programming
Policy evaluation and policy improvement
Policy iteration
Policy iteration applied to FrozenLake
Value iteration
Value iteration applied to FrozenLake
Summary
Questions
Further reading
Section 2: Model-Free RL Algorithms
Q-Learning and SARSA Applications
Learning without a model
User experience
Policy evaluation
The exploration problem
Why explore?
How to explore
TD learning
TD update
Policy improvement
Comparing Monte Carlo and TD
SARSA
The algorithm
Applying SARSA to Taxi-v2
Q-learning
Theory
The algorithm
Applying Q-learning to Taxi-v2
Comparing SARSA and Q-learning
Summary
Questions
Deep Q-Network
Deep neural networks and Q-learning
Function approximation 
Q-learning with neural networks
Deep Q-learning instabilities
DQN
The solution
Replay memory
The target network
The DQN algorithm
The loss function
Pseudocode
Model architecture
DQN applied to Pong
Atari games 
Preprocessing
DQN implementation
DNNs
The experience buffer
The computational graph and training loop
Results
DQN variations
Double DQN
DDQN implementation
Results
Dueling DQN
Dueling DQN implementation
Results
N-step DQN
Implementation
Results
Summary
Questions
Further reading
Learning Stochastic and PG Optimization
Policy gradient methods
The gradient of the policy
Policy gradient theorem
Computing the gradient
The policy
On-policy PG
Understanding the REINFORCE algorithm
Implementing REINFORCE
Landing a spacecraft using REINFORCE
Analyzing the results
REINFORCE with baseline
Implementing REINFORCE with baseline
Learning the AC algorithm
Using a critic to help an actor to learn
The n-step AC model
The AC implementation
Landing a spacecraft using AC 
Advanced AC, and tips and tricks
Summary
Questions
Further reading
TRPO and PPO Implementation
Roboschool
Control a continuous system
Natural policy gradient
Intuition behind NPG
A bit of math
FIM and KL divergence
Natural gradient complications
Trust region policy optimization
The TRPO algorithm
Implementation of the TRPO algorithm
Application of TRPO
Proximal Policy Optimization
A quick overview
The PPO algorithm
Implementation of PPO
PPO application
Summary
Questions
Further reading
DDPG and TD3 Applications
Combining policy gradient optimization with Q-learning
Deterministic policy gradient
Deep deterministic policy gradient
The DDPG algorithm
DDPG implementation
Applying DDPG to BipedalWalker-v2
Twin delayed deep deterministic policy gradient (TD3)
Addressing overestimation bias
Implementation of TD3
Addressing variance reduction
Delayed policy updates
Target regularization
Applying TD3 to BipedalWalker
Summary
Questions
Further reading
Section 3: Beyond Model-Free Algorithms and Improvements
Model-Based RL
Model-based methods
A broad perspective on model-based learning
A known model
Unknown model
Advantages and disadvantages
Combining model-based with model-free learning
A useful combination
Building a model from images
ME-TRPO applied to an inverted pendulum
Understanding ME-TRPO
Implementing ME-TRPO
Experimenting with RoboSchool
Results on RoboSchoolInvertedPendulum
Summary
Questions
Further reading
Imitation Learning with the DAgger Algorithm
Technical requirements
Installation of Flappy Bird
The imitation approach
The driving assistant example
Comparing IL and RL
The role of the expert in imitation learning
The IL structure
Comparing active with passive imitation
Playing Flappy Bird
How to use the environment
Understanding the dataset aggregation algorithm
The DAgger algorithm
Implementation of DAgger
Loading the expert inference model
Creating the learner's computational graph
Creating a DAgger loop
Analyzing the results on Flappy Bird
IRL
Summary
Questions
Further reading
Understanding Black-Box Optimization Algorithms
Beyond RL
A brief recap of RL
The alternative
EAs
The core of EAs
Genetic algorithms
Evolution strategies
CMA-ES
ES versus RL
Scalable evolution strategies
The core
Parallelizing ES
Other tricks
Pseudocode
Scalable implementation
The main function
Workers
Applying scalable ES to LunarLander
Summary
Questions
Further reading
Developing the ESBAS Algorithm
Exploration versus exploitation
Multi-armed bandit
Approaches to exploration
The ε-greedy strategy
The UCB algorithm
UCB1
Exploration complexity
Epochal stochastic bandit algorithm selection
Unboxing algorithm selection
Under the hood of ESBAS
Implementation
Solving Acrobot
Results
Summary
Questions
Further reading
Practical Implementation for Resolving RL Challenges
Best practices of deep RL
Choosing the appropriate algorithm
From zero to one
Challenges in deep RL
Stability and reproducibility
Efficiency
Generalization
Advanced techniques
Unsupervised RL
Intrinsic reward
Transfer learning
Types of transfer learning
1-task learning
Multi-task learning
RL in the real world
Facing real-world challenges
Bridging the gap between simulation and the real world
Creating your own environment
Future of RL and its impact on society
Summary
Questions
Further reading
Assessments
Other Books You May Enjoy
Leave a review - let other readers know what you think
Reinforcement learning (RL) is a popular and promising branch of artificial intelligence that involves making smarter models and agents that can automatically determine ideal behavior based on changing requirements. Reinforcement Learning Algorithms with Python will help you master RL algorithms and understand their implementation as you build self-learning agents.

Starting with an introduction to the tools, libraries, and setup needed to work in the RL environment, this book covers the building blocks of RL and delves into value-based methods such as the application of Q-learning and SARSA algorithms. You'll learn how to use a combination of Q-learning and neural networks to solve complex problems. Furthermore, you'll study policy gradient methods, TRPO, and PPO, to improve performance and stability, before moving on to the DDPG and TD3 deterministic algorithms. This book also covers how imitation learning techniques work and how DAgger can teach an agent to fly. You'll discover evolutionary strategies and black-box optimization techniques. Finally, you'll get to grips with exploration approaches such as UCB and UCB1 and develop a meta-algorithm called ESBAS.

By the end of the book, you'll have worked with key RL algorithms to overcome challenges in real-world applications, and you'll be part of the RL research community.
If you are an AI researcher, deep learning user, or anyone who wants to learn RL from scratch, this book is for you. You'll also find this RL book useful if you want to learn about the advancements in the field. Working knowledge of Python is necessary.
Chapter 1, The Landscape of Reinforcement Learning, gives you an insight into RL. It describes the problems that RL is good at solving and the applications where RL algorithms are already adopted. It also introduces the tools, the libraries, and the setup needed for the completion of the projects in the following chapters.
Chapter 2, Implementing RL Cycle and OpenAI Gym, describes the main cycle of the RL algorithms, the toolkit used to develop the algorithms, and the different types of environments. You will be able to develop a random agent using the OpenAI Gym interface to play CartPole using random actions. You will also learn how to use the OpenAI Gym interface to run other environments.
Chapter 3, Solving Problems with Dynamic Programming, introduces to you the core ideas, terminology, and approaches of RL. You will learn about the main blocks of RL and develop a general idea about how RL algorithms can be created to solve a problem. You will also learn the differences between model-based and model-free algorithms and the categorization of reinforcement learning algorithms. Dynamic programming will be used to solve the game FrozenLake.
Chapter 4, Q-Learning and SARSA Applications, talks about value-based methods, in particular Q-learning and SARSA, two algorithms that differ from dynamic programming and scale well on large problems. To become confident with these algorithms, you will apply them to the FrozenLake game and study the differences from dynamic programming.
Chapter 5, Deep Q-Networks, describes how neural networks and convolutional neural networks (CNNs) in particular are applied to Q-learning. You'll learn why the combination of Q-learning and neural networks produces incredible results and how its use can open the door to a much larger variety of problems. Furthermore, you'll apply the DQN to an Atari game using the OpenAI Gym interface.
Chapter 6, Learning Stochastic and PG Optimization, introduces a new family of model-free algorithms: policy gradient methods. You will learn the differences between policy gradient and value-based methods, and you'll learn about their strengths and weaknesses. Then you will implement the REINFORCE and Actor-Critic algorithms to solve a new game called LunarLander.
Chapter 7, TRPO and PPO Implementation, proposes a modification of policy gradient methods using new mechanisms to control the improvement of the policy. These mechanisms are used to improve the stability and convergence of policy gradient algorithms. In particular, you'll learn and implement two main policy gradient methods that use these techniques, namely TRPO and PPO. You will implement them on Roboschool, an environment with a continuous action space.
Chapter 8, DDPG and TD3 Applications, introduces a new category of algorithms called deterministic policy algorithms that combine both policy gradient and Q-learning. You will learn about the underlying concepts and implement DDPG and TD3, two deep deterministic algorithms, on a new environment.
Chapter 9, Model-Based RL, illustrates RL algorithms that learn a model of the environment to plan future actions or to learn a policy. You will be taught how they work, their strengths, and why they are preferred in many situations. To master them, you will implement a model-based algorithm on Roboschool.
Chapter 10, Imitation Learning with the DAgger Algorithm, explains how imitation learning works and how it can be applied and adapted to a problem. You will learn about the most well-known imitation learning algorithm, DAgger. To become confident with it, you will implement it to speed up the learning process of an agent on FlappyBird.
Chapter 11, Understanding Black-Box Optimization Algorithms, explores evolutionary algorithms, a class of black-box optimization algorithms that don't rely on backpropagation. These algorithms are gaining interest because of their fast training and easy parallelization across hundreds or thousands of cores. This chapter provides a theoretical and practical background of these algorithms by focusing particularly on the Evolution Strategy algorithm, a type of evolutionary algorithm.
Chapter 12, Developing the ESBAS Algorithm, introduces the important exploration-exploitation dilemma, which is specific to RL. The dilemma is demonstrated using the multi-armed bandit problem and is solved using approaches such as UCB and UCB1. Then, you will learn about the problem of algorithm selection and develop a meta-algorithm called ESBAS. This algorithm uses UCB1 to select the most appropriate RL algorithm for each situation.
Chapter 13, Practical Implementation for Resolving RL Challenges, takes a look at the major challenges in this field and explains some practices and methods to overcome them. You will also learn about some of the challenges of applying RL to real-world problems, the future developments of deep RL, and their social impact on the world.
Working knowledge of Python is necessary. Knowledge of RL and the various tools used for it will also be beneficial.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the Support tab.
3. Click on Code Downloads.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Reinforcement-Learning-Algorithms-with-Python. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781789131116_ColorImages.pdf.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
This section is an introduction to reinforcement learning. It includes building the theoretical foundation and setting up the environment that is needed in the upcoming chapters.
This section includes the following chapters:
Chapter 1, The Landscape of Reinforcement Learning
Chapter 2, Implementing RL Cycle and OpenAI Gym
Chapter 3, Solving Problems with Dynamic Programming
Humans and animals learn through a process of trial and error. This process is based on our reward mechanisms, which provide a response to our behaviors. The goal of this process is, through multiple repetitions, to incentivize the repetition of actions that trigger positive responses and to disincentivize the repetition of actions that trigger negative ones. Through the trial and error mechanism, we learn to interact with the people and world around us, and to pursue complex, meaningful goals, rather than immediate gratification.
Learning through interaction and experience is essential. Imagine having to learn to play football by only looking at other people playing it. If you took to the field to play a football match based on this learning experience, you would probably perform incredibly poorly.
This was demonstrated in the mid-20th century, notably by Richard Held and Alan Hein's 1963 study on two kittens, both of which were raised on a carousel. One kitten was able to move freely (actively), while the other was restrained and moved following the active kitten (passively). When both kittens were introduced to light, only the kitten that had been able to move actively developed functioning depth perception and motor skills, while the passive kitten did not. This was notably demonstrated by the absence of the passive kitten's blink reflex toward incoming objects. What this rather crude experiment demonstrated is that visual stimulation alone is not enough: physical interaction with the environment is necessary for animals to learn.
Inspired by how animals and humans learn, reinforcement learning (RL) is built around the idea of trial and error from active interactions with the environment. In particular, with RL, an agent learns incrementally as it interacts with the world. In this way, it's possible to train a computer to learn and behave in a rudimentary, yet similar way to how humans do.
This book is all about reinforcement learning. The intent of the book is to give you the best possible understanding of this field with a hands-on approach. In the first chapters, you'll start by learning the most fundamental concepts of reinforcement learning. As you grasp these concepts, we'll start developing our first reinforcement learning algorithms. Then, as the book progresses, you'll create more powerful and complex algorithms to solve more interesting and compelling problems. You'll see that reinforcement learning is very broad and that there exist many algorithms that tackle a variety of problems in different ways. Nevertheless, we'll do our best to provide you with a simple but complete description of all the ideas, alongside a clear and practical implementation of the algorithms.
To start with, in this chapter, you'll familiarize yourself with the fundamental concepts of RL, the distinctions between different approaches, and the key concepts of policy, value function, reward, and model of the environment. You'll also learn about the history and applications of RL.
The following topics will be covered in this chapter:
An introduction to RL
Elements of RL
Applications of RL
RL is an area of machine learning that deals with sequential decision-making, aimed at reaching a desired goal. An RL problem is constituted by a decision-maker, called the Agent, and the physical or virtual world in which the agent operates, known as the Environment. The agent interacts with the environment by taking an Action, which produces an effect. In response, the environment feeds back to the agent a new State and a Reward. These two signals are the consequences of the action taken by the agent. In particular, the reward is a value indicating how good or bad the action was, and the state is the current representation of the agent and the environment. This cycle is shown in the following diagram:
In this diagram, the agent is represented by PacMan, which, based on the current state of the environment, chooses which action to take. Its behavior influences the environment, for example, its own position and that of the enemies, and these changes are returned by the environment in the form of a new state and a reward. This cycle is repeated until the game ends.
The ultimate goal of the agent is to maximize the total reward accumulated during its lifetime. Let's simplify the notation: if a_t is the action at time t and r_t is the reward at time t, then the agent will take actions a_1, a_2, ..., a_T to maximize the sum of all rewards, r_1 + r_2 + ... + r_T.
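To make this cycle concrete, here is a minimal sketch of the interaction loop using the OpenAI Gym toolkit that is introduced in Chapter 2, Implementing RL Cycle and OpenAI Gym. The agent here is just a placeholder that samples random actions, and CartPole-v1 is used only as a convenient example environment:

import gym

env = gym.make('CartPole-v1')      # create the environment
state = env.reset()                # get the initial state

done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()               # a random "policy"
    state, reward, done, info = env.step(action)     # the environment returns the new state and reward
    total_reward += reward                           # accumulate the reward over the episode

print('Total reward collected in the episode:', total_reward)
env.close()

A learning agent would replace the random action with one chosen by its policy, which is exactly what the algorithms in this book do.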
To maximize the cumulative reward, the agent has to learn the best behavior in every situation. To do so, the agent has to optimize for a long-term horizon while taking care of every single action. In environments with many discrete or continuous states and actions, learning is difficult because the agent has to account for every situation. To make the problem harder, RL can have very sparse and delayed rewards, making the learning process more arduous.
To give an example of an RL problem while explaining the complexity of a sparse reward, consider the well-known story of two siblings, Hansel and Gretel. Their parents led them into the forest to abandon them, but Hansel, who knew of their intentions, had taken a slice of bread with him when they left the house and managed to leave a trail of breadcrumbs that would lead him and his sister home. In the RL framework, the agents are Hansel and Gretel, and the environment is the forest. A reward of +1 is obtained for every crumb of bread reached and a reward of +10 is acquired when they reach home. In this case, the denser the trail of bread, the easier it will be for the siblings to find their way home. This is because to go from one piece of bread to another, they have to explore a smaller area. Unfortunately, sparse rewards are far more common than dense rewards in the real world.
An important characteristic of RL is that it can deal with environments that are dynamic, uncertain, and non-deterministic. These qualities are essential for the adoption of RL in the real world. The following points are examples of how real-world problems can be reframed in RL settings:
Self-driving cars are a popular, yet difficult, concept to approach with RL. This is because of the many aspects to be taken into consideration while driving on the road (such as pedestrians, other cars, bikes, and traffic lights) and the highly uncertain environment. In this case, the self-driving car is the agent that can act on the steering wheel, accelerator, and brakes. The environment is the world around it. Obviously, the agent cannot be aware of the whole world around it, as it can only capture limited information via its sensors (for example, the camera, radar, and GPS). The goal of the self-driving car is to reach the destination in the minimum amount of time while following the rules of the road and without damaging anything. Consequently, the agent can receive a negative reward if a negative event occurs and a positive reward can be received in proportion to the driving time when the agent reaches its destination.
In the game of chess, the goal is to checkmate the opponent's king. In an RL framework, the player is the agent and the environment is the current state of the board. The agent is allowed to move the game pieces according to how each piece is allowed to move. As a result of an action, the environment returns a positive or negative reward corresponding to a win or a loss for the agent. In all other situations, the reward is 0 and the next state is the state of the board after the opponent has moved. Unlike the self-driving car example, here, the environment state equals the agent state. In other words, the agent has a perfect view of the environment.
RL and supervised learning are similar, yet different, paradigms to learn from data. Many problems can be tackled with both supervised learning and RL; however, in most cases, they are suited to solve different tasks.
Supervised learning learns to generalize from a fixed dataset with a limited amount of data consisting of examples. Each example is composed of the input and the desired output (or label) that provides immediate learning feedback.
In comparison, RL is more focused on sequential actions that you can take in a particular situation. In this case, the only supervision provided is the reward signal. There's no correct action to take in a circumstance, as in the supervised settings.
RL can be viewed as a more general and complete framework for learning. The major characteristics that are unique to RL are as follows:
The reward could be dense, sparse, or very delayed. In many cases, the reward is obtained only at the end of the task (for example, in the game of chess).
The problem is sequential and time-dependent; actions will affect the next actions, which, in turn, influence the possible rewards and states.
An agent has to take actions with a higher potential to achieve a goal (exploitation), but it should also try different actions to ensure that other parts of the environment are explored (exploration). This problem is called the exploration-exploitation dilemma (or exploration-exploitation trade-off), and it captures the difficult task of balancing the exploration and exploitation of the environment (a minimal sketch of one common exploration strategy follows this list). This is also very important because, unlike supervised learning, RL can influence the environment, since it is free to collect new data as long as it deems this useful.
The environment is stochastic and nondeterministic, and the agent has to take this into consideration when learning and predicting the next action. In fact, we'll see that many of the RL components can be designed to either output a single deterministic value or a range of values along with their probability.
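To make the exploration-exploitation trade-off from the list above more concrete, here is a minimal, hypothetical sketch of the ε-greedy strategy, one of the exploration approaches covered later in the book: with probability eps the agent explores by picking a random action; otherwise, it exploits the action with the highest estimated value. The q_values array is an illustrative placeholder for those estimates:

import numpy as np

def eps_greedy(q_values, eps=0.1):
    # Explore: with probability eps, pick a random action
    if np.random.uniform(0, 1) < eps:
        return np.random.randint(len(q_values))
    # Exploit: otherwise, pick the action with the highest estimated value
    return int(np.argmax(q_values))

q_values = np.array([1.0, 2.5, 0.3])   # hypothetical value estimates for three actions
action = eps_greedy(q_values, eps=0.1)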
The third type of learning is unsupervised learning, and this is used to identify patterns in data without giving any supervised information. Data compression, clustering, and generative models are examples of unsupervised learning. It can also be adopted in RL settings in order to explore and learn about the environment. The combination of unsupervised learning and RL is called unsupervised RL. In this case, no reward is given, and the agent could generate an intrinsic motivation to favor new situations in which it can explore the environment.
The first mathematical foundation of RL was built during the 1960s and 1970s in the field of optimal control. This addressed the problem of minimizing a measure of a dynamic system's behavior over time. The method involved solving a set of equations using the known dynamics of the system. During this time, the key concept of a Markov decision process (MDP) was introduced. This provides a general framework for modeling decision-making in stochastic situations. During these years, a solution method for optimal control called dynamic programming (DP) was introduced. DP is a method that breaks down a complex problem into a collection of simpler subproblems in order to solve an MDP.
Note that DP only provides an easier way to solve optimal control for systems with known dynamics; there is no learning involved. It also suffers from the problem of the curse of dimensionality because the computational requirements grow exponentially with the number of states.
Even if these methods don't involve learning, as noted by Richard S. Sutton and Andrew G. Barto, we must consider the solution methods of optimal control, such as DP, to also be RL methods.
In the 1980s, the concept of learning by temporally successive predictions—the so-called temporal difference learning (TD learning) method—was finally introduced. TD learning introduced a new family of powerful algorithms that will be explained in this book.
The first problems solved with TD learning were small enough to be represented in tables or arrays. These methods are called tabular methods; they often find the optimal solution, but they are not scalable. In fact, many RL tasks involve huge state spaces, making tabular methods impossible to adopt. In such problems, function approximation is used to find a good approximate solution with fewer computational resources.
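As a rough, hypothetical illustration of the difference, a tabular method stores one value for every state-action pair, which is only feasible when both sets are small and discrete. The sizes and the generic update below are illustrative placeholders; the actual algorithms are the subject of Chapter 4, Q-Learning and SARSA Applications:

import numpy as np

n_states, n_actions = 16, 4            # a hypothetical small, discrete problem
Q = np.zeros((n_states, n_actions))    # one table entry per state-action pair

def tabular_update(Q, state, action, target, lr=0.1):
    # Move the stored estimate a small step toward a target value
    Q[state, action] += lr * (target - Q[state, action])
    return Q

With millions of states, or continuous ones, such a table becomes intractable, which is exactly where function approximation comes in.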
The adoption of function approximation and, in particular, of artificial neural networks (and deep neural networks) in RL is not trivial; however, as shown on many occasions, they are able to achieve amazing results. The use of deep learning in RL is called deep reinforcement learning (deep RL) and it has achieved great popularity ever since a deep RL algorithm named deep Q-network (DQN) displayed a superhuman ability to play Atari games from raw images in 2015. Another striking achievement of deep RL came with AlphaGo in 2016, which became the first program to beat Lee Sedol, a professional human Go player and 18-time world champion. These breakthroughs not only showed that machines can perform better than humans in high-dimensional spaces (using the same perception as humans with respect to images), but also that they can behave in interesting ways. An example of this is the creative shortcut found by a deep RL system while playing Breakout, an Atari arcade game in which the player has to destroy all the bricks, as shown in the following image. The agent found that just by creating a tunnel on the left-hand side of the bricks and by sending the ball in that direction, it could destroy many more bricks and thus increase its overall score with just one move.
There are many other interesting cases where agents exhibit superb behaviors or strategies that weren't known to humans, such as a move performed by AlphaGo while playing Go against Lee Sedol. From a human perspective, that move seemed nonsensical, but it ultimately allowed AlphaGo to win the game (the move is known as move 37).
Nowadays, when dealing with high-dimensional state or action spaces, the use of deep neural networks as function approximations becomes almost a default choice. Deep RL has been applied to more challenging problems, such as data center energy optimization, self-driving cars, multi-period portfolio optimization, and robotics, just to name a few.
Now you could ask yourself: why does deep learning combined with RL perform so well? Well, the main answer is that deep learning can tackle problems with a high-dimensional state space. Before the advent of deep RL, state spaces had to be broken down into simpler representations, called features. These were difficult to design and, in some cases, only an expert could do it. Now, using deep neural networks such as a convolutional neural network (CNN) or a recurrent neural network (RNN), RL can learn different levels of abstraction directly from raw pixels or sequential data (such as natural language). This configuration is shown in the following diagram:
Furthermore, deep RL problems can now be solved completely in an end-to-end fashion. Before the deep learning era, an RL algorithm involved two distinct pipelines: one to deal with the perception of the system and one to be responsible for the decision-making. Now, with deep RL algorithms, these processes are joined and are trained end-to-end, from the raw pixels straight to the action. For example, as shown in the preceding diagram, it's possible to train Pacman end-to-end using a CNN to process the visual component and a fully connected neural network (FNN) to translate the output of the CNN into an action.
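The following is a minimal sketch, not the book's implementation, of such an end-to-end architecture written with tf.keras: a stack of convolutional layers processes raw frames and a fully connected head maps the extracted features to one score per action. The input shape, layer sizes, and number of actions are illustrative assumptions:

import tensorflow as tf

num_actions = 4   # hypothetical number of actions available in the game

model = tf.keras.Sequential([
    # The CNN part: extracts visual features from a stack of 4 grayscale 84x84 frames
    tf.keras.layers.Conv2D(32, 8, strides=4, activation='relu', input_shape=(84, 84, 4)),
    tf.keras.layers.Conv2D(64, 4, strides=2, activation='relu'),
    tf.keras.layers.Conv2D(64, 3, strides=1, activation='relu'),
    tf.keras.layers.Flatten(),
    # The fully connected part: translates the visual features into one score per action
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(num_actions)
])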
Nowadays, deep RL is a very hot topic. The principal reason for this is that deep RL is thought to be the type of technology that will enable us to build highly intelligent machines. As proof, two of the most renowned AI companies working on solving intelligence problems, namely DeepMind and OpenAI, are heavily invested in RL research.
Besides the huge steps achieved with deep RL, there is a long way to go. There are many challenges that still need to be addressed, some of which are listed as follows:
Deep RL is far too slow to learn compared to humans.
Transfer learning in RL is still an open problem.
The reward function is difficult to design and define.
RL agents struggle to learn in highly complex and dynamic environments such as the physical world.
Nonetheless, the research in this field is growing at a fast rate and companies are starting to adopt RL in their products.
As we know, an agent interacts with its environment by means of actions. These will cause the environment to change and to feed back to the agent a reward that is proportional to the quality of the action, along with the new state of the agent. Through trial and error, the agent incrementally learns the best action to take in every situation so that, in the long run, it will achieve a larger cumulative reward. In the RL framework, the choice of the action in a particular state is made by a policy, and the cumulative reward that is achievable from that state is called the value function. In brief, if an agent wants to behave optimally, then in every situation, the policy has to select the action that will bring it to the next state with the highest value. Now, let's take a deeper look at these fundamental concepts.
The policy defines how the agent selects an action given a state. The policy chooses the action that maximizes the cumulative reward from that state, not the one with the largest immediate reward. It takes care of the long-term goal of the agent. For example, if a car has another 30 km to go before reaching its destination, but only has 10 km of range left and the next gas stations are 1 km and 60 km away, then the policy will choose to refuel at the first gas station (1 km away) in order to not run out of gas. This decision is not optimal in the immediate term, as it takes some time to refuel, but it ensures that the goal is ultimately accomplished.
The following diagram shows a simple example where an actor moving in a 4 x 4 grid has to go toward the star while avoiding the spirals. The actions recommended by a policy are indicated by an arrow pointing in the direction of the move. The diagram on the left shows a random initial policy, while the diagram on the right shows the final optimal policy. In a situation with two equally optimal actions, the agent can arbitrarily choose which action to take:
An important distinction is between stochastic policies and deterministic policies. In the deterministic case, the policy provides a single deterministic action to take. On the other hand, in the stochastic case, the policy provides a probability for each action. The concept of the probability of an action is useful because it takes into consideration the dynamicity of the environment and helps its exploration.
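The following minimal sketch contrasts the two cases for a small discrete set of actions: the deterministic policy always returns the single best action, while the stochastic policy turns the action preferences into probabilities and samples from them. The preference values are purely illustrative:

import numpy as np

preferences = np.array([2.0, 1.0, 0.5])   # hypothetical scores for three actions

def deterministic_policy(preferences):
    # Always return the single action with the highest score
    return int(np.argmax(preferences))

def stochastic_policy(preferences):
    # Convert the scores into probabilities (softmax) and sample an action
    probs = np.exp(preferences) / np.sum(np.exp(preferences))
    return int(np.random.choice(len(preferences), p=probs))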
One way to classify RL algorithms is based on how policies are improved during learning. The simpler case is when the policy that acts on the environment is the same as the one that is improved while learning. Another way to say this is that the policy learns from the same data that it generates. These algorithms are called on-policy. Off-policy algorithms, in comparison, involve two policies: one that acts on the environment and another that learns but is not actually used to act. The former is called the behavior policy, while the latter is called the target policy. The goal of the behavior policy is to interact with and collect information about the environment in order to improve the passive target policy. Off-policy algorithms, as we will see in the coming chapters, are more unstable and difficult to design than on-policy algorithms, but they are more sample efficient, meaning that they require less experience to learn.
To better understand these two concepts, we can think of someone who has to learn a new skill. If the person behaves as on-policy algorithms do, then every time they try a sequence of actions, they'll change their belief and behavior in accordance with the reward accumulated. In comparison, if the person behaves as an off-policy algorithm, they (the target policy) can also learn by looking at an old video of themselves (the behavior policy) doing the same skill—that is, they can use old experiences to help them to improve.
Policy gradient methods are a family of RL algorithms that learn a parametrized policy (such as a deep neural network) directly from the gradient of the performance with respect to the policy parameters. These algorithms have many advantages, including the ability to deal with continuous actions and to explore the environment with different levels of granularity. They will be presented in greater detail in Chapter 6, Learning Stochastic and PG Optimization, Chapter 7, TRPO and PPO Implementation, and Chapter 8, DDPG and TD3 Applications.
The value function represents the long-term quality of a state. It is the cumulative reward that is expected in the future if the agent starts from a given state. While the reward measures the immediate performance, the value function measures the performance in the long run. This means that a high reward doesn't necessarily imply a high value function, and a low reward doesn't necessarily imply a low value function.
Moreover, the value function can be a function of the state or of the state-action pair. The former case is called a state-value function, while the latter is called an action-value function:
Here, the diagram shows the final state values (on the left side) and the corresponding optimal policy (on the right side).
Using the same gridworld example used to illustrate the concept of policy, we can show the state-value function. First of all, we can assume a reward of 0 in each situation except for when the agent reaches the star, gaining a reward of +1. Moreover, let's assume that a strong wind moves the agent in another direction with a probability of 0.33. In this case, the state values will be similar to those shown in the left-hand side of the preceding diagram. An optimal policy will choose the actions that will bring it to the next state with the highest state value, as shown in the right-hand side of the preceding diagram.
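As a toy illustration of this idea, the following sketch stores a state value for each cell of a tiny gridworld and shows how a greedy policy would pick the action that leads to the neighboring state with the highest value. The values and the transition function are hypothetical placeholders, not the ones from the diagram:

# Hypothetical state values for a tiny 2x2 gridworld (state -> value)
state_values = {(0, 0): 0.5, (0, 1): 0.8, (1, 0): 0.3, (1, 1): 1.0}

def next_state(state, action):
    # Deterministic transitions: move right or down, staying inside the grid
    moves = {'right': (0, 1), 'down': (1, 0)}
    row, col = state[0] + moves[action][0], state[1] + moves[action][1]
    return (min(row, 1), min(col, 1))

def greedy_action(state, actions=('right', 'down')):
    # Choose the action whose successor state has the highest value
    return max(actions, key=lambda a: state_values[next_state(state, a)])

print(greedy_action((0, 0)))   # prints 'right', the move toward the higher-valued states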
Action-value methods (or value-function methods) are the other big family of RL algorithms. These methods learn an action-value function and use it to choose the actions to take. Starting from Chapter 3, Solving Problems with Dynamic Programming, you'll learn more about these algorithms. It's worth noting that some policy gradient methods, in order to combine the advantages of both approaches, can also use a value function to learn the appropriate policy. These methods are called actor-critic methods. The following diagram shows the three main families of RL algorithms:
At each timestep, that is, after each move of the agent, the environment sends back a number that indicates how good that action was. This is called the reward. As we have already mentioned, the end goal of the agent is to maximize the cumulative reward obtained during its interaction with the environment.
In the literature, the reward is assumed to be part of the environment, but that's not strictly true in reality. The reward can come from the agent too, but never from its decision-making part. For this reason, and to simplify the formulation, the reward is always considered to be sent by the environment.
The reward is the only supervision signal injected into the RL cycle, and it is essential to design the reward correctly in order to obtain an agent that behaves well. If the reward has flaws, the agent may find and exploit them, and end up with the wrong behavior. For example, Coast Runners is a boat-racing game in which the goal is to finish ahead of the other players. During the route, the boats are rewarded for hitting targets. Some folks at OpenAI trained an agent with RL to play it. They found that, instead of racing to the finish line as fast as possible, the trained boat drove in a circle to capture re-populating targets while crashing and catching fire. In this way, the boat found a way to maximize the total reward without acting as expected. This behavior was due to an incorrect balance between short-term and long-term rewards.
The reward can appear with different frequencies depending on the environment. A frequent reward is called a dense reward; however, if it is seen only a few times during a game, or only at its end, it is called a sparse reward. In the latter case, it could be very difficult for an agent to catch the reward and find the optimal actions.
Imitation learning and inverse RL are two powerful techniques that deal with the absence of a reward in the environment. Imitation learning uses an expert demonstration to map states to actions. On the other hand, inverse RL deduces the reward function from an expert optimal behavior. Imitation learning and inverse RL will be studied in Chapter 10, Imitation Learning with the DAgger Algorithm.
The model is an optional component of the agent, meaning that it is not required in order to find a policy for the environment. The model details how the environment behaves, predicting the next state and the reward, given a state and an action. If the model is known, planning algorithms can be used to interact with the model and recommend future actions. For example, in environments with discrete actions, potential trajectories can be simulated using lookahead searches (for instance, Monte Carlo tree search).
The model of the environment could either be given in advance or learned through interactions with it. If the environment is complex, it's a good idea to approximate it using deep neural networks. RL algorithms that use an already known model of the environment, or learn one, are called model-based methods. These solutions are opposed to model-free methods and will be explained in more detail in Chapter 9, Model-Based RL.
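The following sketch shows, under very simplified assumptions, how a model could be used for a one-step lookahead: the model predicts the next state and reward for each candidate action, and the planner picks the action with the best predicted outcome. The model function here is a hypothetical placeholder, not a learned one:

def model(state, action):
    # Hypothetical dynamics: predicts (next_state, reward) for a state and an action
    next_state = state + action
    reward = -abs(next_state)            # toy reward: stay as close to zero as possible
    return next_state, reward

def plan_one_step(state, actions=(-1, 0, 1)):
    # Simulate every candidate action with the model and pick the best predicted reward
    return max(actions, key=lambda a: model(state, a)[1])

best_action = plan_one_step(state=2)     # -> -1, the action that moves the state toward zero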
RL has been applied to a wide variety of fields, including robotics, finance, healthcare, and intelligent transportation systems. In general, these applications can be grouped into three major areas: automatic machines (such as autonomous vehicles, smart grids, and robotics), optimization processes (for example, planned maintenance, supply chains, and process planning), and control (for example, fault detection and quality control).
In the beginning, RL was only ever applied to simple problems, but deep RL opened the road to different problems, making it possible to deal with more complex tasks. Nowadays, deep RL has been showing some very promising results. Unfortunately, many of these breakthroughs are limited to research applications or games, and, in many situations, it is not easy to bridge the gap between purely research-oriented applications and industry problems. Despite this, more companies are moving toward the adoption of RL in their industries and products.
We will now take a look at the principal fields that are already adopting or will benefit from RL.
Games are a perfect testbed for RL because they are created in order to challenge human capabilities, and, to complete them, skills common to the human brain are required (such as memory, reasoning, and coordination). Consequently, a computer that can play on the same level or better than a human must possess the same qualities. Moreover, games are easy to reproduce and can be easily simulated in computers. Video games proved to be very difficult to solve because of their partial observability (that is, only a fraction of the game is visible) and their huge search space (that is, it's impossible for a computer to simulate all possible configurations).
A breakthrough in games occurred in 2016, when AlphaGo beat Lee Sedol at the ancient game of Go. This win defied the predictions of the time: it was thought that no computer would be able to beat an expert Go player for another 10 years. AlphaGo used both RL and supervised learning to learn from professional human games. A few years after that match, the next version, named AlphaGo Zero, beat AlphaGo 100 games to 0. AlphaGo Zero learned to play Go in only three days through self-play.
To capture the messiness and continuous nature of the real world, a team of five neural networks named OpenAI Five was trained to play Dota 2, a real-time strategy game in which two teams (each with five players) play against each other. The steep learning curve of this game is due to the long time horizons (a game lasts 45 minutes on average, with thousands of actions), the partial observability (each player can only see a small area around themselves), and the high-dimensional continuous action and observation spaces. In 2018, OpenAI Five played against top Dota 2 players at The International, losing the match but showing notable collaboration and strategic skills. Finally, on April 13, 2019, OpenAI Five officially defeated the world champions in the game, becoming the first AI to beat a professional team in an esports game.
RL in industrial robotics is a very active area of research, as it is a natural application of this paradigm in the real world. The potential and benefits of intelligent industrial robots are huge and extensive. RL enables Industry 4.0 (referred to as the fourth industrial revolution) with intelligent devices, systems, and robots that perform highly complex and rational operations. Systems for predictive maintenance, real-time diagnosis, and the management of manufacturing activities can be integrated for better control and productivity.
Thanks to the flexibility of RL, it can be employed not only in standalone tasks but also as a sort of fine-tuning method for supervised learning algorithms. In many natural language processing (NLP) and computer vision tasks, the metric to optimize isn't differentiable, so to address the problem in supervised settings with neural networks, an auxiliary differentiable loss function is needed. However, the discrepancy between the two loss functions penalizes the final performance. One way to deal with this is to first train the system using supervised learning with the auxiliary loss function, and then use RL to fine-tune the network with respect to the final metric. For instance, this process can be of benefit in subfields such as machine translation and question answering, where the evaluation metrics are complex and not differentiable.
Furthermore, RL can solve NLP problems such as dialogue systems and text generation. Computer vision, localization, motion analysis, visual control, and visual tracking can all be trained with deep RL.
Deep learning removes the heavy task of manual feature engineering, but it still requires the manual design of the neural network architecture. This is tedious work involving many parts that have to be combined in the best possible way. So, why can we not automate it? Well, actually, we can.
