Deep Reinforcement Learning Hands-On - Maxim Lapan - E-Book

Description

Start your journey into reinforcement learning (RL) and reward yourself with the third edition of Deep Reinforcement Learning Hands-On. This book takes you through the basics of RL to more advanced concepts with the help of various applications, including game playing, discrete optimization, stock trading, and web browser navigation. By walking you through landmark research papers in the field, this deep RL book will equip you with practical knowledge of RL and the theoretical foundation to understand and implement most modern RL papers.
The book retains its approach of providing concise and easy-to-follow explanations from the previous editions. You'll work through practical and diverse examples, from grid environments and games to stock trading and RL agents in web environments, to give you a well-rounded understanding of RL, its capabilities, and its use cases. You'll learn about key topics, such as deep Q-networks (DQNs), policy gradient methods, continuous control problems, and highly scalable, non-gradient methods.
If you want to learn about RL through a practical approach using OpenAI Gym and PyTorch, concise explanations, and the incremental development of topics, then Deep Reinforcement Learning Hands-On, Third Edition, is your ideal companion.

You can read the e-book in the Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 981

Publication year: 2024

Deep Reinforcement Learning Hands-On

Third Edition

A practical and easy-to-follow guide to RL from Q-learning and DQNs to PPO and RLHF

Maxim Lapan

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Lead Senior Publishing Product Manager: Bhavesh Amin

Acquisition Editor – Peer Reviews: Swaroop Singh

Project Editor: Amisha Vathare

Development Editor: Shruti Menon

Copy Editor: Safis Editing

Technical Editor: Kushal Sharma

Indexer: Hemangini Bari

Proofreader: Safis Editing

Presentation Designer: Pranit Padwal

Developer Relations Marketing Executive: Anamika Singh

First published: June 2018

Second edition: January 2020

Third edition: November 2024

Production reference: 1071124

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-83588-270-2

www.packt.com

To my loving friends and family

– Maxim Lapan

Contributors

About the author

Maxim Lapan is a deep learning and machine learning enthusiast. His background and 20 years of work expertise as a software developer and a systems architect cover everything from low-level Linux kernel driver development to performance optimization and the design of distributed applications working on thousands of servers. With extensive work experience in big data, machine learning, deep learning, and large parallel distributed HPC and non-HPC systems, he has the ability to explain complex concepts using simple words and clear examples. His current area of interest is the practical applications of deep learning, such as deep natural language processing, large language models, and deep reinforcement learning.

Maxim lives in Gronau (North Rhine-Westphalia), Germany, with his family.

I’d like to thank my family: my wife, Olga, and my children, Ksenia, Julia, and Fedor, for their patience and support. The third edition was written during a turbulent time for our family, and without you, this would not have been doable. I also want to thank my business partner, Arnout. Thanks for your patience and support!

About the reviewers

Daniel Armstrong is an applied data scientist specializing in natural language processing, conversational AI, and computer vision. With experience in the finance and consumer goods industries, he develops AI-powered solutions that enhance customer experiences and streamline operations. He is well versed in machine learning, deep learning, and knowledge management. Currently, Daniel is focusing on LLM-powered agents that integrate structured dialogue models and logical reasoning frameworks to build context-aware systems for automation and decision-making.

Michael Yurushkin holds a PhD in computer science. He is an expert in machine learning and software development, and is passionate about creating products using cutting-edge technology. Michael is an associate professor at Southern Federal University (SFedU), where he teaches deep learning with applications in computer vision and natural language processing, while also supervising student research. Michael enjoys playing chess and currently holds the title of Candidate Master.

Join our community on Discord

Read this book alongside other users, deep learning experts, and the author himself. Ask questions, provide solutions to other readers, chat with the author via Ask Me Anything sessions, and much more. Scan the QR code or visit the link to join the community: https://packt.link/rl

Contents

Preface

Why I wrote this book

The approach

Who this book is for

What this book covers

To get the most out of this book

Changes in the third edition

Part 1 Introduction to RL

What Is Reinforcement Learning?

Supervised learning

Unsupervised learning

Reinforcement learning

Complications in RL

RL formalisms

Reward

The agent

The environment

Actions

Observations

The theoretical foundations of RL

Markov decision processes

The Markov process

Markov reward processes

Adding actions to MDP

Policy

Summary

OpenAI Gym API and Gymnasium

The anatomy of the agent

Hardware and software requirements

The OpenAI Gym API and Gymnasium

The action space

The observation space

The environment

Creating an environment

The CartPole session

The random CartPole agent

Extra Gym API functionality

Wrappers

Rendering the environment

More wrappers

Summary

Deep Learning with PyTorch

Tensors

The creation of tensors

Scalar tensors

Tensor operations

GPU tensors

Gradients

Tensors and gradients

NN building blocks

Custom layers

Loss functions and optimizers

Loss functions

Optimizers

Monitoring with TensorBoard

TensorBoard 101

Plotting metrics

GAN on Atari images

PyTorch Ignite

Ignite concepts

GAN training on Atari using Ignite

Summary

The Cross-Entropy Method

The taxonomy of RL methods

The cross-entropy method in practice

The cross-entropy method on CartPole

The cross-entropy method on FrozenLake

The theoretical background of the cross-entropy method

Summary

Part 2 Value-based methods

Tabular Learning and the Bellman Equation

Value, state, and optimality

The Bellman equation of optimality

The value of the action

The value iteration method

Value iteration in practice

Q-iteration for FrozenLake

Summary

Deep Q-Networks

Real-life value iteration

Tabular Q-learning

Deep Q-learning

Interaction with the environment

SGD optimization

Correlation between steps

The Markov property

The final form of DQN training

DQN on Pong

Wrappers

The DQN model

Training

Running and performance

Your model in action

Things to try

Summary

Higher-Level RL Libraries

Why RL libraries?

The PTAN library

Action selectors

The agent

DQNAgent

PolicyAgent

Experience source

Toy environment

The ExperienceSource class

The ExperienceSourceFirstLast class

Experience replay buffers

The TargetNet class

Ignite helpers

The PTAN CartPole solver

Other RL libraries

Summary

DQN Extensions

Basic DQN

Common library

Implementation

Hyperparameter tuning

Results with common parameters

Tuned baseline DQN

N-step DQN

Implementation

Results

Hyperparameter tuning

Double DQN

Implementation

Results

Hyperparameter tuning

Noisy networks

Implementation

Results

Hyperparameter tuning

Prioritized replay buffer

Implementation

Results

Hyperparameter tuning

Dueling DQN

Implementation

Results

Hyperparameter tuning

Categorical DQN

Implementation

Results

Hyperparameter tuning

Combining everything

Results

Hyperparameter tuning

Summary

Ways to Speed Up RL

Why speed matters

Baseline

The computation graph in PyTorch

Several environments

Playing and training in separate processes

Tweaking wrappers

Benchmark results

Summary

Stocks Trading Using RL

Why trading?

Problem statement and key decisions

Data

The trading environment

Models

Training code

Results

The feed-forward model

The convolution model

Things to try

Summary

Part 3 Policy-based methods

Policy Gradients

Values and policy

Why the policy?

Policy representation

Policy gradients

The REINFORCE method

The CartPole example

Results

Policy-based versus value-based methods

REINFORCE issues

Full episodes are required

High gradient variance

Exploration problems

High correlation of samples

Policy gradient methods on CartPole

Implementation

Results

Policy gradient methods on Pong

Implementation

Results

Summary

Actor-Critic Method: A2C and A3C

Variance reduction

CartPole variance

Advantage actor-critic (A2C)

A2C on Pong

Results

Asynchronous Advantage Actor-Critic (A3C)

Correlation and sample efficiency

Adding an extra “A” to A2C

A3C with data parallelism

Results

A3C with gradient parallelism

Implementation

Results

Summary

The TextWorld Environment

Interactive fiction

The environment

Installation

Game generation

Observation and action spaces

Extra game information

The deep NLP basics

Recurrent Neural Networks (RNNs)

Word embedding

The Encoder-Decoder architecture

Transformers

Baseline DQN

Observation preprocessing

Embeddings and encoders

The DQN model and the agent

Training code

Training results

Tweaking observations

Tracking visited rooms

Relative actions

Objective in observation

Transformers

ChatGPT

Setup

Interactive mode

ChatGPT API

Summary

Web Navigation

The evolution of web navigation

Browser automation and RL

Challenges in browser automation

The MiniWoB benchmark

MiniWoB++

Installation

Actions and observations

Simple example

The simple clicking approach

Grid actions

The RL part of our implementation

The model and training code

Training results

Simple clicking limitations

Adding text description

Implementation

Results

Human demonstrations

Recording the demonstrations

Training with demonstrations

Results

Things to try

Summary

Part 4 Advanced RL

Continuous Action Space

Why a continuous space?

The action space

Environments

The A2C method

Implementation

Results

Using models and recording videos

Deep deterministic policy gradients

Exploration

Implementation

Results and video

Distributional policy gradients

Architecture

Implementation

Results

Things to try

Summary

Trust Region Methods

Environments

The A2C baseline

Implementation

Results

Video recording

PPO

Implementation

Results

TRPO

Implementation

Results

ACKTR

Implementation

Results

SAC

Implementation

Results

Overall results

Summary

Black-Box Optimizations in RL

Black-box methods

Evolution strategies

Implementing ES on CartPole

CartPole results

ES on HalfCheetah

Implementing ES on HalfCheetah

HalfCheetah results

Genetic algorithms

GA on CartPole

GA tweaks

Deep GA

Novelty search

GA on HalfCheetah

Implementation

Results

Summary

Advanced Exploration

Why exploration is important

What’s wrong with 𝜖-greedy?

Alternative ways of exploration

Noisy networks

Count-based methods

Prediction-based methods

MountainCar experiments

DQN + 𝜖-greedy

DQN + noisy networks

DQN + state counts

PPO method

PPO + Noisy Networks

PPO + state counts

PPO + network distillation

Comparison of methods

Atari experiments

DQN + 𝜖-greedy

DQN + noisy networks

PPO

Summary

Reinforcement Learning with Human Feedback

Reward functions in complex environments

Theoretical background

Method overview

RLHF and LLMs

RLHF experiments

Initial training using A2C

Labeling process

Reward model training

Combining A2C with the reward model

Fine-tuning with 100 labels

The second round of the experiment

The third round of the experiment

Overall results

Summary

AlphaGo Zero and MuZero

Comparing model-based and model-free methods

Model-based methods for board games

The AlphaGo Zero method

Overview

MCTS

Self-play

Training and evaluation

Connect 4 with AlphaGo Zero

The game model

Implementing MCTS

The model

Training

Testing and comparison

Results

MuZero

High-level model

Training process

Connect 4 with MuZero

Hyperparameters and MCTS tree nodes

Models

MCTS search

Training data and gameplay

MuZero results

MuZero and Atari

Summary

RL in Discrete Optimization

The Rubik’s cube and discrete optimization

Optimality and God’s number

Approaches to cube solving

Actions

States

The training process

The NN architecture

The training

The model application

Results

The code outline

Cube environments

Training

The search process

The experiment results

The 2 × 2 cube

The 3 × 3 cube

Further improvements and experiments

Summary

Multi-Agent RL

What is multi-agent RL?

Getting started with the environment

An overview of MAgent

Installing MAgent

Setting up a random environment

Deep Q-network for tigers

Understanding the code

Training and results

Collaboration by the tigers

Training both tigers and deer

The battle environment

Summary

Bibliography

Index

Part 1

Introduction to RL

1

What Is Reinforcement Learning?

The automatic learning of optimal decisions over time is a general and common problem that has been studied in many scientific and engineering fields. In our changing world, even problems that look like static input-output problems can become dynamic if time is taken into account. For example, imagine that you want to solve the simple supervised learning problem of pet image classification with two target classes—dog and cat. You gather the training dataset and implement the classifier using your favorite deep learning toolkit. After the training and validation, the model demonstrates excellent performance. Great! You deploy it and leave it running for a while. However, after a vacation at some seaside resort, you return to discover that dog grooming fashions have changed and a significant portion of your queries are now misclassified, so you need to update your training images and repeat the process again. Not so great!

This example is intended to show that even simple machine learning (ML) problems often have a hidden time dimension. This is frequently overlooked and might become an issue in a production system. This can be addressed by reinforcement learning (RL), a subfield of ML, which is an approach that natively incorporates an extra dimension (which is usually time, but not necessarily) into learning equations. This places RL much closer to how people understand artificial intelligence (AI). In this chapter, we will discuss RL in more detail and you will become familiar with the following:

How RL is related to and differs from other ML disciplines: supervised and unsupervised learning

What the main RL formalisms are and how they are related to each other

Theoretical foundations of RL: Markov processes (MPs), Markov reward processes (MRPs), and Markov decision processes (MDPs)

Supervised learning

You may be familiar with the notion of supervised learning, which is the most studied and well-known ML problem. Its basic question is, how do you automatically build a function that maps some input into some output when given a set of example pairs? It sounds simple in those terms, but the problem includes many tricky questions that computers have only recently started to address with some success. There are lots of examples of supervised learning problems, including the following:

Text classification: Is this email message spam or not?

Image classification and object location: Does this image contain a picture of a cat, dog, or something else?

Regression problems: Given the information from weather sensors, what will be the weather tomorrow?

Sentiment analysis: What is the customer satisfaction level of this review?

These questions may look different, but they share the same idea — we have many examples of input and desired output, and we want to learn how to generate the output for some future, currently unseen input. The name supervised comes from the fact that we learn from known answers provided by a “ground truth” data source.
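As a minimal sketch of "building a function from example pairs", the following plain-Python snippet fits a straight line to a handful of input-output pairs and then predicts the output for an unseen input. The toy data and the names `fit_line` and `predict` are illustrative assumptions, not something taken from the book's code.

```python
def fit_line(pairs):
    """Least-squares fit of y = a * x + b to a list of (x, y) example pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


# Hypothetical "ground truth" examples, generated here by y = 2x + 1
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
a, b = fit_line(examples)


def predict(x):
    """Apply the learned mapping to a new, previously unseen input."""
    return a * x + b


print(predict(10.0))
```

Real supervised learning problems replace the straight line with a far more flexible function (for example, a neural network), but the workflow of learning from labeled pairs and then predicting on new inputs is the same.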

Unsupervised learning

At the other extreme, we have the so-called unsupervised learning, which assumes no supervision and has no known labels assigned to our data. The main objective is to learn some hidden structure of the dataset at hand. One common example of such an approach to learning is the clustering of data. This happens when our algorithm tries to combine data items into a set of clusters, which can reveal relationships in data. For instance, you might want to find similar images or clients with common behavior patterns.

Another unsupervised learning method that is becoming more and more popular is generative adversarial networks (GANs). Here, we have two competing neural networks (NNs): the first network tries to generate fake data to fool the second network, while the second network tries to discriminate artificially generated data from data sampled from our dataset. Over time, both networks become more and more skillful in their tasks by capturing subtle specific patterns in the dataset.

Reinforcement learning

RL is the third camp and lies somewhere in between full supervision and a complete lack of predefined labels. On the one hand, it uses many well-established methods of supervised learning, such as deep neural networks for function approximation, stochastic gradient descent, and backpropagation, to learn data representation. On the other hand, it usually applies them in a different way.

In the next two sections of the chapter, we will explore specific details of the RL approach, including assumptions and abstractions in its strict mathematical form. For now, to compare RL with supervised and unsupervised learning, we will take a less formal, but more easily understood, path.

Imagine that you have an agent that needs to take actions in some environment. Both “agent” and “environment” will be defined in detail later in this chapter. A robot mouse in a maze is a good example, but you can also imagine an automatic helicopter trying to perform a roll, or a chess program learning how to beat a grandmaster. Let’s go with the robot mouse for simplicity.

Figure 1.1: The robot mouse maze world

In this case, the environment is a maze with food at some points and electricity at others. The robot mouse is the agent that can take actions, such as turn left/right and move forward. At each moment, it can observe the full state of the maze to make a decision about the actions to take. The robot mouse tries to find as much food as possible while avoiding getting an electric shock whenever possible. These food and electricity signals stand as the reward that is given to the agent (robot mouse) by the environment as additional feedback about the agent’s actions. The reward is a very important concept in RL, and we will talk about it later in the chapter. For now, it is enough for you to know that the final goal of the agent is to maximize its reward as much as possible. In our particular example, the robot mouse could suffer a slight electric shock as a short-term setback to get to a place with plenty of food in the long term — this would be a better result for the robot mouse than just standing still and gaining nothing.

We don’t want to hard-code knowledge about the environment and the best actions to take in every specific situation into the robot mouse — it will take too much effort and may become useless even with a slight maze change. What we want is to have some magic set of methods that will allow our robot mouse to learn on its own how to avoid electricity and gather as much food as possible. RL is exactly this magic toolbox and it behaves differently from supervised and unsupervised learning methods; it doesn’t work with predefined labels in the way that supervised learning does. Nobody labels all the images that the robot sees as good or bad, or gives it the best direction to turn in.

However, we’re not completely blind as in an unsupervised learning setup: we have a reward system. The reward can be positive from gathering the food, negative from electric shocks, or neutral when nothing special happens. By observing the reward and relating it to the actions taken, our agent learns how to perform an action better, gather more food, and get fewer electric shocks. Of course, RL’s generality and flexibility come at a price. RL is considered to be a much more challenging area than supervised or unsupervised learning. Let’s quickly discuss what makes RL tricky.

Complications in RL

The first thing to note is that observations in RL depend on an agent’s behavior and, to some extent, they are the result of this behavior. If your agent decides to do inefficient things, then the observations will tell you nothing about what it has done wrong and what should be done to improve the outcome (the agent will just get negative feedback all the time). If the agent is stubborn and keeps making mistakes, then the observations will give the false impression that there is no way to get a larger reward (life is suffering), which could be totally wrong.

In ML terms, this can be rephrased as having non-IID data. The abbreviation IID stands for independent and identically distributed, a requirement for most supervised learning methods.

The second thing that complicates our agent’s life is that it needs to not only exploit the knowledge it has learned, but actively explore the environment, because maybe doing things differently will significantly improve the outcome. The problem is that too much exploration may also seriously decrease the reward (not to mention the agent can actually forget what it has learned before), so we need to find a balance between these two activities somehow. This exploration/exploitation dilemma is one of the open fundamental questions in RL. People face this choice all the time — should I go to an already known place for dinner or try this fancy new restaurant? How frequently should I change jobs? Should I study a new field or keep working in my area? There are no universal answers to these questions.
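One simple and widely used way to balance the two activities is the epsilon-greedy rule: with a small probability the agent tries a random action, and otherwise it takes the action it currently believes is best. The text has not introduced this rule at this point, so treat the following as a hedged sketch on a toy "multi-armed bandit" problem. The reward distributions, the function names, and the parameter values are all assumptions made for illustration.

```python
import random


def epsilon_greedy(value_estimates, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(value_estimates))  # explore: random action
    # exploit: index of the largest estimated value
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])


def run_bandit(true_means, epsilon, steps, seed=0):
    """Play a toy bandit whose rewards are Gaussian around hidden true_means."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    total_reward = 0.0
    for _ in range(steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # incremental average of the rewards seen so far for this action
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward


estimates, counts, total_reward = run_bandit([0.2, 0.5, 0.8], epsilon=0.1, steps=1000)
print("times each action was tried:", counts)
print("estimated action values:", [round(v, 2) for v in estimates])
```

Setting `epsilon=0.0` gives a purely exploiting agent that can lock onto a mediocre action forever, while `epsilon=1.0` gives a purely exploring agent that never uses what it has learned.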

The third complication lies in the fact that reward can be seriously delayed after actions. In chess, for example, one single strong move in the middle of the game can shift the balance. During learning, we need to discover such causalities, which can be tricky to discern during the flow of time and our actions.

However, despite all these obstacles and complications, RL has seen huge improvements in recent years and is becoming more and more active as a field of research and practical application.

Interested in learning more? Let’s dive into the details and look at RL formalisms and play rules.

RL formalisms

Every scientific and engineering field has its own assumptions and limitations. Earlier in this chapter, we discussed supervised learning, in which such assumptions are the knowledge of input-output pairs. You have no labels for your data? You need to figure out how to obtain labels or try to use some other theory. This doesn’t make supervised learning good or bad; it just makes it inapplicable to your problem.

There are many historical examples of practical and theoretical breakthroughs that have occurred when somebody tried to challenge rules in a creative way. However, we also must understand our limitations. It’s important to know and understand game rules for various methods, as it can save you tons of time in advance. Of course, such formalisms exist for RL, and we will spend the rest of this book analyzing them from various angles.

The following diagram shows two major RL entities — agent and environment — and their communication channels — actions, reward, and observations:

Figure 1.2: RL entities and their communication channels

We will discuss them in detail in the next few sections.

Reward

First, let’s return to the notion of reward. In RL, it’s just a scalar value we obtain periodically from the environment. As mentioned, reward can be positive or negative, large or small, but it’s just a number. The purpose of reward is to tell our agent how well it has behaved. We don’t define how frequently the agent receives this reward; it can be every second or once in an agent’s lifetime, although it’s common practice to receive rewards at fixed time intervals or at every environment interaction, just for convenience. In the case of once-in-a-lifetime reward systems, all rewards except the last one will be zero.

As I stated, the purpose of reward is to give an agent feedback about its success, and it’s a central thing in RL. Basically, the term reinforcement comes from the fact that reward obtained by an agent should reinforce its behavior in a positive or negative way. Reward is local, meaning that it reflects the benefits and losses achieved by the agent so far. Of course, getting a large reward for some action doesn’t mean that, a second later, you won’t face dramatic consequences as a result of your previous decisions. It’s like robbing a bank — it could look like a good idea until you think about the consequences.

What an agent is trying to achieve is the largest accumulated reward over its sequence of actions. To give you a better understanding of reward, here is a list of some concrete examples with their rewards:

Financial trading: An amount of profit is a reward for a trader buying and selling stocks.

Chess: Reward is obtained at the end of the game as a win, loss, or draw. Of course, it’s up to interpretation. For me, for example, achieving a draw in a match against a chess grandmaster would be a huge reward. In practice, we need to specify the exact reward value, but it could be a fairly complicated expression. For instance, in the case of chess, the reward could be proportional to the opponent’s strength.

Dopamine system in the brain: There is a part of the brain (limbic system) that produces dopamine every time it needs to send a positive signal to the rest of the brain. High concentrations of dopamine lead to a sense of pleasure, which reinforces activities considered by this system to be good. Unfortunately, the limbic system is ancient in terms of the things it considers good — food, reproduction, and safety — but that is a totally different story!

Computer games: They usually give obvious feedback to the player, which is either the number of enemies killed or a score gathered. Note in this example that reward is already accumulated, so the RL reward for arcade games should be the derivative of the score, that is, +1 every time a new enemy is killed, −N if the player was killed by the enemy, and 0 at all other time steps.

Web navigation: There are problems, with high practical value, that require the automated extraction of information available on the web. Search engines are trying to solve this task in general, but sometimes, to get to the data you’re looking for, you need to fill in some forms or navigate through a series of links, or complete CAPTCHAs, which can be difficult for search engines to do. There is an RL-based approach to those tasks in which the reward is the information or the outcome that you need to get.

NN architecture search: RL can be used for NN architecture optimization where the quality of models is crucial and people work hard to gain an extra 1% on target metrics. In this use case, the aim is to get the best performance metric on some dataset by tweaking the number of layers or their parameters, adding extra bypass connections, or making other changes to the NN architecture. The reward in this case is the performance (accuracy or another measure showing how accurate the NN predictions are).

Dog training: If you have ever tried to train a dog, you know that you need to give it something tasty (but not too much) every time it does the thing you’ve asked. It’s also common to reprimand your pet a bit (negative reward) when it doesn’t follow your orders, although recent studies have shown that this isn’t as effective as a positive reward.

School marks: We all have experience here! School marks are a reward system designed to give pupils feedback about their studying.

As you can see from the preceding examples, the notion of reward is a very general indication of the agent’s performance, and it can be found or artificially injected into lots of practical problems around us.
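The computer games example above notes that a game score is already accumulated, so the per-step reward should be the change in the score. A minimal sketch of that conversion follows. The score sequence is made up, the function name `score_to_rewards` is hypothetical, and the sketch assumes the score starts at zero and ignores the separate penalty for the player being killed.

```python
def score_to_rewards(scores):
    """Convert the cumulative score seen after each step into per-step rewards."""
    rewards = []
    previous = 0  # assumption: the game starts with a score of zero
    for score in scores:
        rewards.append(score - previous)
        previous = score
    return rewards


# Hypothetical cumulative score after each of six time steps
scores = [0, 0, 1, 1, 2, 5]
rewards = score_to_rewards(scores)
print(rewards)       # [0, 0, 1, 0, 1, 3]
print(sum(rewards))  # 5
```

Summing the per-step rewards recovers the final score, which is the accumulated reward the agent is trying to maximize.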

The agent

An agent is somebody or something who/that interacts with the environment by executing certain actions, making observations, and receiving eventual rewards for this. In most practical RL scenarios, the agent is our piece of software that is supposed to solve some problem in a more-or-less efficient way. For our initial set of eight examples, the agents will be as follows:

Financial trading: A trading system or a trader making decisions about order execution (buying, selling, or doing nothing).

Chess: A player or a computer program.

Dopamine system: The brain itself, which, according to sensory data, decides whether it was a good experience.

Computer games: The player who enjoys the game or the computer program. (Andrej Karpathy once tweeted that “we were supposed to make AI do all the work and we play games but we do all the work and the AI is playing games!”).

Web navigation: The software that tells the browser which links to click on, where to move the mouse, or which text to enter.

NN architecture search: The software that controls the concrete architecture of the NN being evaluated.

Dog training: You make decisions about the actions (feeding/reprimand), so the agent is you. But in principle, your dog could also be seen as the agent: the dog is trying to maximize the reward (food and/or attention) by behaving properly. Strictly speaking, here we have a “multi-agent RL” setup, which is briefly discussed in Chapter 22.

School: Student/pupil.

The environment

The environment is everything outside of an agent. In the most general sense, it’s the rest of the universe, but this goes slightly overboard and exceeds the capacity of even tomorrow’s computers, so in practice we usually rely on common sense to decide what counts as the environment.

The agent’s communication with the environment is limited to reward (obtained from the environment), actions (executed by the agent and sent to the environment), and observations (some information besides the reward that the agent receives from the environment). We have discussed rewards already, so let’s talk about actions and observations next. We will identify the environment for each of our examples when we discuss the observations.
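The three communication channels can be sketched as a small interaction loop. The classes below are hypothetical toy stand-ins written for illustration (they are not the Gym or Gymnasium API): the environment hands out an observation, accepts an action, and returns a random reward, while the agent simply picks actions at random and accumulates the reward.

```python
import random


class ToyEnvironment:
    """Hypothetical environment whose episode lasts a fixed number of steps."""

    def __init__(self, steps=10, seed=0):
        self.steps_left = steps
        self.rng = random.Random(seed)

    def get_observation(self):
        # Observation channel: a dummy vector, since this toy has no real state
        return [0.0, 0.0, 0.0]

    def get_actions(self):
        # The set of actions the agent may execute
        return [0, 1]

    def is_done(self):
        return self.steps_left == 0

    def action(self, action):
        # Action channel in, reward channel out
        if self.is_done():
            raise RuntimeError("Episode is over")
        self.steps_left -= 1
        return self.rng.random()  # random reward, independent of the action


class RandomAgent:
    """Hypothetical agent that ignores observations and acts randomly."""

    def __init__(self, seed=1):
        self.total_reward = 0.0
        self.rng = random.Random(seed)

    def step(self, env):
        observation = env.get_observation()  # received but unused here
        action = self.rng.choice(env.get_actions())
        reward = env.action(action)
        self.total_reward += reward


env = ToyEnvironment(steps=10)
agent = RandomAgent()
while not env.is_done():
    agent.step(env)
print(f"Total reward: {agent.total_reward:.3f}")
```

A learning agent would differ only inside `step`: it would use the observation to choose the action and use the reward to improve its future choices.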

Actions

Actions are things that an agent can do in the environment. Actions can, for example, be piece moves on the board (if it’s a board game), or doing homework (in the case of school). They can be as simple as move pawn one space forward or as complicated as build a profitable startup company.

In RL, we distinguish between two types of actions: discrete and continuous. Discrete actions form a finite set of mutually exclusive things an agent can do, such as move left or right. Continuous actions have some value attached to them, such as a car’s turn the wheel action, which has an angle and direction of steering. Different angles could lead to a different scenario a second later, so just turn the wheel is definitely not enough.
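To make the distinction concrete, here is a small sketch of how the two kinds of actions might be represented in code. The names `Move` and `Steer`, the sign convention, and the ±45 degree steering limit are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Move(Enum):
    """Discrete action: a finite set of mutually exclusive choices."""
    LEFT = 0
    RIGHT = 1


@dataclass(frozen=True)
class Steer:
    """Continuous action: turning the wheel carries a real-valued angle."""
    angle_deg: float  # negative = left, positive = right (assumed convention)

    def __post_init__(self):
        if not -45.0 <= self.angle_deg <= 45.0:  # assumed steering limits
            raise ValueError("angle out of range")


discrete_actions = list(Move)        # the whole action set can be enumerated
slight_right = Steer(angle_deg=2.5)  # one of infinitely many possible values
sharp_left = Steer(angle_deg=-30.0)
print(discrete_actions, slight_right, sharp_left)
```

A discrete action set can be listed in full, whereas a continuous action is described by a range of valid values.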

To give concrete examples, let’s look at the actions in our scenarios:

Financial trading: Actions are decisions to buy or sell stock. “Do nothing and wait” also is an action.

Chess: Actions are valid piece moves according to the current board’s position.

Dopamine system: Actions are the things that you are doing.

Computer games: Actions are pushing buttons. They could also be continuous, such as turning the steering wheel in a car simulator.

Web navigation: Actions could be mouse clicks, scrolling, and text typing.

NN architecture search: Actions are changes in NN architecture, which could be discrete (count of layers in the network) or continuous (probability in the dropout layer).

Dog training: Actions are everything you can do with your dog — giving a piece of tasty food, petting, even saying “good dog!” in a kind voice.

School: Actions are marks and many other, more informal signals, such as praise for successes or extra homework.

Observations

Observations of the environment form the second information channel for an agent, with the first being the reward. You may be wondering why we need a separate data source. The answer is convenience. Observations are pieces of information that the environment provides the agent with that indicate what’s going on around the agent.

Observations may be relevant to the upcoming reward (such as seeing a bank notification about being paid) or may not be. Observations can even include reward information in some vague or obfuscated form, such as score numbers on a computer game’s screen. Score numbers are just pixels, but potentially, we could convert them into reward values; it’s not a very complex task for modern computer vision techniques.

On the other hand, reward shouldn’t be seen as a secondary or unimportant thing — reward is the main force that drives the agent’s learning process. If a reward is wrong, noisy, or just slightly off course from the primary objective, then there is a chance that training will go in the wrong direction.

It’s also important to distinguish between an environment’s state and observations. The state of an environment most of the time is internal to the environment and potentially includes every atom in the universe, which makes it impossible to measure everything about the environment. Even if we limit the environment’s state to be small enough, most of the time, it will be either not possible to get full information about it or our measurements will contain noise. This is completely fine, though, and RL was created to support such cases natively. To illustrate the difference, let’s return to our set of examples:

Financial trading: Here, the environment is the whole financial market and everything that influences it. This is a huge list of things, such as the latest news, economic and political conditions, weather, food supplies, and Twitter/X trends. Even your decision to stay home today can potentially indirectly influence the world’s financial system (if you believe in the “butterfly effect”). However, our observations are limited to stock prices, news, and so on. We don’t have access to most of the environment’s state, which makes financial forecasting such a nontrivial thing.

Chess: The environment here is your board plus your opponent, which includes their chess skills, mood, brain state, chosen tactics, and so on. Observations are what you see (your current chess position), but, at some levels of play, knowledge of psychology and the ability to read an opponent’s mood could increase your chances.

Dopamine system: The environment here is your brain plus your nervous system and your organs’ states plus the whole world you can perceive. Observations are the inner brain state and signals coming from your senses.

Computer game: Here, the environment is your computer’s state, including all memory and disk data. For networked games, you need to include other computers plus all Internet infrastructure between them and your machine. Observations are a screen’s pixels and sound only. These pixels are not a tiny amount of information (it has been estimated that the total number of possible moderate-size images (1024×768) is significantly larger than the number of atoms in our galaxy), but the whole environment state is definitely larger.

Web navigation: The environment here is the Internet, including all the network infrastructure between the computer on which our agent works and the web server, which is a really huge system that includes millions and millions of different components. The observation is normally the web page that is loaded in the browser.

NN architecture search: In this example, the environment is fairly simple and includes the NN toolkit that performs the particular NN evaluation and the dataset that is used to obtain the performance metric. In comparison to the Internet, this looks like a tiny toy environment. Observations might be different and include some information about testing, such as loss convergence dynamics or other metrics obtained from the evaluation step.

Dog training: Here, the environment is your dog (including its hardly observable inner reactions, mood, and life experiences) and everything around it, including other dogs and even a cat hiding in a bush. Observations are signals from your senses and memory.

School: The environment here is the school itself, the education system of the country, society, and the cultural legacy. Observations are the same as for the dog training example — the student’s senses and memory.
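The difference between state and observation in the examples above can be sketched in a few lines. This is a hypothetical illustration loosely based on the financial trading case: the field names, the values, and the Gaussian noise model are assumptions, not part of any real trading system.

```python
import random

rng = random.Random(42)

# Full environment state (hypothetical): most of it is hidden from the agent.
state = {
    "price": 100.0,            # observable, but only with measurement noise
    "order_book_depth": 5000,  # hidden
    "trader_sentiment": -0.3,  # hidden
}


def observe(state: dict, noise_std: float = 0.5) -> dict:
    """Return what the agent sees: a noisy view of part of the state."""
    return {"price": state["price"] + rng.gauss(0.0, noise_std)}


observation = observe(state)
print(observation)
```

The agent has to act on `observation` alone; the hidden parts of `state` still influence what happens next.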

This is our “mise en scène” and we will play around with it in the rest of this book. You will have already noticed that the RL model is extremely flexible and general, and it can be applied to a variety of scenarios. Let’s now look at how RL is related to other disciplines, before diving into the details of the RL model.

There are many other areas that contribute or relate to RL. The most significant are shown in the following diagram, which includes six large domains heavily overlapping each other on the methods and specific topics related to decision-making (shown inside the inner circle).

Figure 1.3: Various domains in RL

At the intersection of all those related, but still different, scientific areas sits RL, which is so general and flexible that it can take the best available information from these varying domains:

ML: RL, being a subfield of ML, borrows lots of its machinery, tricks, and techniques from ML. Basically, the goal of RL is to learn how an agent should behave when it is given imperfect observational data.

Engineering (especially optimal control): This helps with taking a sequence of optimal actions to get the best result.

Neuroscience: We used the dopamine system as our example, and it has been shown that the human brain acts similarly to the RL model.

Psychology: This studies behavior in various conditions, such as how people react and adapt, which is close to the RL topic.

Economics: One of the important topics in economics is how to maximize reward under imperfect knowledge and the changing conditions of the real world.

Mathematics: This works with idealized systems and also devotes significant attention to finding and reaching the optimal conditions in the field of operations research.

In the next part of the chapter, you will become familiar with the theoretical foundations of RL, which will make it possible to start moving toward the methods used to solve the RL problem. The upcoming section is important for understanding the rest of the book.

The theoretical foundations of RL

In this section, I will introduce you to the mathematical representation and notation of the formalisms (reward, agent, actions, observations, and environment) that we just discussed. Then, using this as a knowledge base, we will explore the second-order notions of the RL language, including state, episode, history, value, and gain, which will be used repeatedly to describe different methods later in the book.

Markov decision processes

Before that, we will cover Markov decision processes (MDPs), which will be described like a Russian matryoshka doll: we will start from the simplest case of a Markov process (MP), then extend that with rewards, which will turn it into a Markov reward process (MRP). Then, we will put this idea into an extra envelope by adding actions, which will lead us to an MDP.

MPs and MDPs are widely used in computer science and other engineering fields. So, reading this chapter will be useful for you not only for RL contexts but also for a much wider range of topics. If you’re already familiar with MDPs, then you can quickly skim this chapter, paying attention only to the terminology definitions, as we will use them later on.

The Markov process

Let’s start with the simplest concept in the Markov family: the MP, which is also known as the Markov chain. Imagine that you have some system in front of you that you can only observe. What you observe are called states, and the system can switch between states according to some laws of dynamics (most of the time unknown to you). Again, you cannot influence the system, but can only watch the states changing. All possible states for a system form a set called the state space. For MPs, we require this set of states to be finite (but it can be extremely large to compensate for this limitation). Your observations form a sequence of states or a chain (that’s why MPs are also called Markov chains).

For example, looking at the simplest model of the weather in some city, we can observe the current day as sunny or rainy, which is our state space. A sequence of observations over time forms a chain of states, such as [sunny, sunny, rainy, sunny, ...], and this is called history. To call such a system an MP, it needs to fulfill the Markov property, which means that the future system dynamics from any state have to depend on this state only. The main point of the Markov property is to make every observable state self-contained to describe the future of the system. In other words, the Markov property requires the states of the system to be distinguishable from each other and unique. In this case, only one state is required to model the future dynamics of the system and not the whole history or, say, the last N states.
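In symbols, the Markov property can be stated as follows. The notation is an assumption introduced here for illustration: s_t denotes the state observed at time step t.

```latex
P(s_{t+1} \mid s_t) = P(s_{t+1} \mid s_1, s_2, \ldots, s_t)
```

That is, conditioning on the whole history gives the same prediction as conditioning on the current state alone.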

In the case of our toy weather example, the Markov property limits our model to represent only the cases when a sunny day can be followed by a rainy one with the same probability, regardless of the number of sunny days we’ve seen in the past. It’s not a very realistic model as, from common sense, we know that the chance of rain tomorrow depends not only on the current conditions but on a large number of other factors, such as the season, our latitude, and the presence of mountains and sea nearby. Some studies suggest that even solar activity influences the weather. So, our example is really naïve, but it’s important to understand the limitations and make conscious decisions about them.

Of course, if we want to make our model more complex, we can always do this by extending our state space, which will allow us to capture more dependencies in the model at the cost of a larger state space. For example, if you want to capture separately the probability of rainy days during summer and winter, then you can include the season in your state.

In this case, your state space will be [sunny+summer, sunny+winter, rainy+summer, rainy+winter] and so on.
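As a small sketch of this idea, the extended state space is simply the Cartesian product of the two factors. The variable names are chosen for this example only.

```python
from itertools import product

weather = ["sunny", "rainy"]
seasons = ["summer", "winter"]

# Every combination of weather and season becomes one state.
state_space = [f"{w}+{s}" for w, s in product(weather, seasons)]
print(state_space)
# ['sunny+summer', 'sunny+winter', 'rainy+summer', 'rainy+winter']

# The number of transition probabilities grows quadratically with the
# number of states: 16 entries here instead of 4.
print(len(state_space) ** 2)
```

Each additional factor multiplies the number of states, which is the cost of capturing more dependencies.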

As your system model complies with the Markov property, you can capture transition probabilities with a transition matrix, which is a square matrix of the size N × N, where N is the number of states in our model. Every cell in row i and column j of the matrix contains the probability of the system transitioning from state i to state j.

For example, in our sunny/rainy example, the transition matrix could be as follows:

From \ To   Sunny   Rainy
Sunny       0.8     0.2
Rainy       0.1     0.9

In this case, if we have a sunny day, then there is an 80% chance that the next day will be sunny and a 20% chance that the next day will be rainy. If we observe a rainy day, then there is a 10% probability that the weather will become better and a 90% probability of the next day being rainy.
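The following sketch shows one way to work with this matrix in plain Python. The function names `two_step` and `sample_chain` are made up for this illustration, and the two-day-ahead calculation is an extra worked example that follows from the matrix rather than something stated in the text.

```python
import random

states = ["sunny", "rainy"]
# T[i][j] is the probability of moving from state i to state j.
T = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.1, "rainy": 0.9},
}

# Each row is a probability distribution, so it must sum to 1.
for row in T.values():
    assert abs(sum(row.values()) - 1.0) < 1e-9


def two_step(start: str) -> dict:
    """Distribution over the weather two days after `start`."""
    return {
        end: sum(T[start][mid] * T[mid][end] for mid in states)
        for end in states
    }


def sample_chain(start: str, length: int, rng: random.Random) -> list:
    """Sample a chain of `length` states by repeatedly drawing from T."""
    chain = [start]
    while len(chain) < length:
        row = T[chain[-1]]
        chain.append(rng.choices(list(row), weights=list(row.values()))[0])
    return chain


print(two_step("sunny"))  # approximately {'sunny': 0.66, 'rainy': 0.34}
print(sample_chain("sunny", 7, random.Random(0)))
```

Starting from a sunny day, the chance of sun two days later is 0.8 × 0.8 + 0.2 × 0.1 = 0.66, because the intermediate day can be either sunny or rainy.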

So, that’s it. The formal definition of an MP is as follows:

A set of states (S) that a system can be in

A transition matrix (T), with transition probabilities, which defines the system dynamics

A useful visual representation of an MP is a graph with nodes corresponding to system states and edges, labeled with probabilities representing a possible transition from state to state. If the probability of a transition is 0, we don’t draw an edge (there is no way to go from one state to another). This kind of representation is also widely used in finite state machine representation, which is studied in automata theory. For our sunny/rainy weather model, the graph is as shown here:

Figure 1.4: The sunny/rainy weather model

Again, we’re talking about observation only. There is no way for us to influence the weather, so we just observe it and record our observations.

To give you a more complicated example, let’s consider another model called Office Worker (Dilbert, the main character in Scott Adams’ famous cartoons, is a good example). His state space in our example has the following states:

Home: He’s not at the office

Computer: He’s working on his computer at the office

Coffee: He’s drinking coffee at the office

Chat: He’s discussing something with colleagues at the office

The state transition graph is shown in the following figure:

Figure 1.5: The state transition graph for our office worker

We assume that our office worker’s weekday usually starts from the Home state and that he starts his day with Coffee without exception (no Home → Computer edge and no Home → Chat edge). The preceding diagram also shows that workdays always end (that is, going to the Home state) from the Computer state.

The transition matrix for the diagram above is as follows:

From \ To   Home   Coffee   Chat   Computer
Home        60%    40%      0%     0%
Coffee      0%     10%      70%    20%
Chat        0%     20%      50%    30%
Computer    20%    20%      10%    50%

The transition probabilities could be placed directly on the state transition graph, as shown in Figure 1.6.

Figure 1.6: The state transition graph with transition probabilities
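As a rough sketch, the office worker model can be written down and sampled in a few lines of Python. The probabilities are taken from the transition matrix above; the function name `sample_episode` and the fixed random seed are assumptions for this example.

```python
import random

STATES = ["Home", "Coffee", "Chat", "Computer"]
# Rows are the current state; columns follow the order of STATES.
T = {
    "Home":     [0.6, 0.4, 0.0, 0.0],
    "Coffee":   [0.0, 0.1, 0.7, 0.2],
    "Chat":     [0.0, 0.2, 0.5, 0.3],
    "Computer": [0.2, 0.2, 0.1, 0.5],
}

# Sanity check: every row must be a valid probability distribution.
for state, row in T.items():
    assert abs(sum(row) - 1.0) < 1e-9, state


def sample_episode(start: str, length: int, rng: random.Random) -> list:
    """Generate a sequence of `length` states starting from `start`."""
    episode = [start]
    while len(episode) < length:
        weights = T[episode[-1]]
        episode.append(rng.choices(STATES, weights=weights)[0])
    return episode


episode = sample_episode("Home", 9, random.Random(1))
print(" -> ".join(episode))
```

Transitions with zero probability, such as Home to Chat, never appear in a sampled sequence, which mirrors the missing edges in the graph.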

In practice, we rarely have the luxury of knowing the exact transition matrix. A much more real-world situation is when we only have observations of our system’s states, which are also called episodes:

Home → Coffee → Coffee → Chat → Chat → Coffee → Computer → Computer → Home

Computer → Computer → Chat → Chat → Coffee → Computer → Computer → Computer

Home → Home → Coffee → Chat → Computer → Coffee → Coffee

It’s not complicated to estimate the transition matrix from our observations — we just count all the transitions from every state and normalize them to a sum of 1. The more observation data we have, the closer our estimation will be to the true underlying model.
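The counting procedure just described can be sketched as follows, using the three episodes listed above. The variable names are illustrative, and this is only one straightforward way to implement the count-and-normalize idea.

```python
from collections import Counter, defaultdict

episodes = [
    ["Home", "Coffee", "Coffee", "Chat", "Chat", "Coffee",
     "Computer", "Computer", "Home"],
    ["Computer", "Computer", "Chat", "Chat", "Coffee",
     "Computer", "Computer", "Computer"],
    ["Home", "Home", "Coffee", "Chat", "Computer", "Coffee", "Coffee"],
]

# Count every observed transition (state -> next state).
counts = defaultdict(Counter)
for episode in episodes:
    for current, nxt in zip(episode, episode[1:]):
        counts[current][nxt] += 1

# Normalize each row so that its probabilities sum to 1.
estimate = {
    state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
    for state, row in counts.items()
}

for state, row in estimate.items():
    print(state, {nxt: round(p, 2) for nxt, p in row.items()})
```

With so little data, the estimate is still far from the true matrix: for example, Home → Coffee comes out as 2/3 rather than 40%. Transitions that were never observed are simply absent from `estimate`, which is equivalent to estimating their probability as 0.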

It’s also worth noting that the Markov property implies stationarity (which means, the underlying transition distribution for any state does not change over time). Non-stationarity