Reinforcement Learning and Stochastic Optimization (E-Book)

Warren B. Powell

Description

REINFORCEMENT LEARNING AND STOCHASTIC OPTIMIZATION

Clearing the jungle of stochastic optimization

Sequential decision problems, which consist of "decision, information, decision, information," are ubiquitous, spanning virtually every human activity, ranging from business applications, health (personal and public health, and medical decision making), energy, and the sciences to all fields of engineering, finance, and e-commerce. The diversity of applications has attracted the attention of at least 15 distinct fields of research, using eight distinct notational systems, which produced a vast array of analytical tools. A byproduct is that powerful tools developed in one community may be unknown to other communities.

Reinforcement Learning and Stochastic Optimization offers a single canonical framework that can model any sequential decision problem using five core components: state variables, decision variables, exogenous information variables, the transition function, and the objective function. This book highlights twelve types of uncertainty that might enter any model and pulls together the diverse set of methods for making decisions, known as policies, into four fundamental classes that span every method suggested in the academic literature or used in practice.

Reinforcement Learning and Stochastic Optimization is the first book to provide a balanced treatment of the different methods for modeling and solving sequential decision problems, following the style used by most books on machine learning, optimization, and simulation. The presentation is designed for readers with a course in probability and statistics and an interest in modeling and applications. Linear programming is occasionally used for specific problem classes. The book is designed for readers who are new to the field, as well as those with some background in optimization under uncertainty.

Throughout this book, readers will find references to over 100 different applications, spanning pure learning problems, dynamic resource allocation problems, general state-dependent problems, and hybrid learning/resource allocation problems such as those that arose in the COVID pandemic. There are 370 exercises, organized into seven groups ranging from review questions, modeling, computation, problem solving, theory, and programming exercises to a "diary problem" that a reader chooses at the beginning of the book and uses as a basis for questions throughout the rest of the book.


Page count: 1914

Year of publication: 2022




Reinforcement Learning and Stochastic Optimization

A Unified Framework for Sequential Decisions

Warren B. Powell
Princeton University
Princeton, NJ

This edition first published 2022

© 2022 John Wiley & Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Warren B. Powell to be identified as the author of this work has been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office

111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

The contents of this work are intended to further general scientific research, understanding, and discussion only and are not intended and should not be relied upon as recommending or promoting scientific method, diagnosis, or treatment by physicians for any particular patient. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of medicines, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each medicine, equipment, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

A catalogue record for this book is available from the Library of Congress

Hardback ISBN: 9781119815037; ePub ISBN: 9781119815051; ePDF ISBN: 9781119815044; Obook ISBN: 9781119815068.

Cover image: © DeniseSmit/Getty Images

Cover design by Wiley

Set in 9.5/12.5pt STIXTwoText by Integra Software Services Pvt. Ltd, Pondicherry, India

Contents

Cover

Title page

Copyright

Preface

Acknowledgments

Part I – Introduction

1 Sequential Decision Problems

1.1 The Audience

1.2 The Communities of Sequential Decision Problems

1.3 Our Universal Modeling Framework

1.4 Designing Policies for Sequential Decision Problems

1.4.1 Policy Search

1.4.2 Policies Based on Lookahead Approximations

1.4.3 Mixing and Matching

1.4.4 Optimality of the Four Classes

1.4.5 Pulling it All Together

1.5 Learning

1.6 Themes

1.6.1 Blending Learning and Optimization

1.6.2 Bridging Machine Learning to Sequential Decisions

1.6.3 From Deterministic to Stochastic Optimization

1.6.4 From Single to Multiple Agents

1.7 Our Modeling Approach

1.8 How to Read this Book

1.8.1 Organization of Topics

1.8.2 How to Read Each Chapter

1.8.3 Organization of Exercises

1.9 Bibliographic Notes

Exercises

Bibliography

2 Canonical Problems and Applications

2.1 Canonical Problems

2.1.1 Stochastic Search – Derivative-based and Derivative-free

2.1.1.1 Derivative-based Stochastic Search

2.1.1.2 Derivative-free Stochastic Search

2.1.2 Decision Trees

2.1.3 Markov Decision Processes

2.1.4 Optimal Control

2.1.5 Approximate Dynamic Programming

2.1.6 Reinforcement Learning

2.1.7 Optimal Stopping

2.1.8 Stochastic Programming

2.1.9 The Multiarmed Bandit Problem

2.1.10 Simulation Optimization

2.1.11 Active Learning

2.1.12 Chance-constrained Programming

2.1.13 Model Predictive Control

2.1.14 Robust Optimization

2.2 A Universal Modeling Framework for Sequential Decision Problems

2.2.1 Our Universal Model for Sequential Decision Problems

2.2.2 A Compact Modeling Presentation

2.2.3 MDP/RL vs. Optimal Control Modeling Frameworks

2.3 Applications

2.3.1 The Newsvendor Problems

2.3.1.1 Basic Newsvendor – Final Reward

2.3.1.2 Basic Newsvendor – Cumulative Reward

2.3.1.3 Contextual Newsvendor

2.3.1.4 Multidimensional Newsvendor Problems

2.3.2 Inventory/Storage Problems

2.3.2.1 Inventory Without Lags

2.3.2.2 Inventory Planning with Forecasts

2.3.2.3 Lagged Decisions

2.3.3 Shortest Path Problems

2.3.3.1 A Deterministic Shortest Path Problem

2.3.3.2 A Stochastic Shortest Path Problem

2.3.3.3 A Dynamic Shortest Path Problem

2.3.3.4 A Robust Shortest Path Problem

2.3.4 Some Fleet Management Problems

2.3.4.1 The Nomadic Trucker

2.3.4.2 From One Driver to a Fleet

2.3.5 Pricing

2.3.6 Medical Decision Making

2.3.7 Scientific Exploration

2.3.8 Machine Learning vs. Sequential Decision Problems

2.4 Bibliographic Notes

Exercises

Bibliography

3 Online Learning

3.1 Machine Learning for Sequential Decisions

3.1.1 Observations and Data in Stochastic Optimization

3.1.2 Indexing Input x^n and Response y^(n+1)

3.1.3 Functions We are Learning

3.1.4 Sequential Learning: From Very Little Data to … More Data

3.1.5 Approximation Strategies

3.1.6 From Data Analytics to Decision Analytics

3.1.7 Batch vs. Online Learning

3.2 Adaptive Learning Using Exponential Smoothing

3.3 Lookup Tables with Frequentist Updating

3.4 Lookup Tables with Bayesian Updating

3.4.1 The Updating Equations for Independent Beliefs

3.4.2 Updating for Correlated Beliefs

3.4.3 Gaussian Process Regression

3.5 Computing Bias and Variance*

3.6 Lookup Tables and Aggregation*

3.6.1 Hierarchical Aggregation

3.6.2 Estimates of Different Levels of Aggregation

3.6.3 Combining Multiple Levels of Aggregation

3.7 Linear Parametric Models

3.7.1 Linear Regression Review

3.7.2 Sparse Additive Models and Lasso

3.8 Recursive Least Squares for Linear Models

3.8.1 Recursive Least Squares for Stationary Data

3.8.2 Recursive Least Squares for Nonstationary Data*

3.8.3 Recursive Estimation Using Multiple Observations*

3.9 Nonlinear Parametric Models

3.9.1 Maximum Likelihood Estimation

3.9.2 Sampled Belief Models

3.9.3 Neural Networks – Parametric*

3.9.4 Limitations of Neural Networks

3.10 Nonparametric Models*

3.10.1 K-Nearest Neighbor

3.10.2 Kernel Regression

3.10.3 Local Polynomial Regression

3.10.4 Deep Neural Networks

3.10.5 Support Vector Machines

3.10.6 Indexed Functions, Tree Structures, and Clustering

3.10.7 Comments on Nonparametric Models

3.11 Nonstationary Learning*

3.11.1 Nonstationary Learning I – Martingale Truth

3.11.2 Nonstationary Learning II – Transient Truth

3.11.3 Learning Processes

3.12 The Curse of Dimensionality

3.13 Designing Approximation Architectures in Adaptive Learning

3.14 Why Does It Work?**

3.14.1 Derivation of the Recursive Estimation Equations

3.14.2 The Sherman-Morrison Updating Formula

3.14.3 Correlations in Hierarchical Estimation

3.14.4 Proof of Proposition 3.14.1

3.15 Bibliographic Notes

Exercises

Bibliography

4 Introduction to Stochastic Search

4.1 Illustrations of the Basic Stochastic Optimization Problem

4.2 Deterministic Methods

4.2.1 A “Stochastic” Shortest Path Problem

4.2.2 A Newsvendor Problem with Known Distribution

4.2.3 Chance-Constrained Optimization

4.2.4 Optimal Control

4.2.5 Discrete Markov Decision Processes

4.2.6 Remarks

4.3 Sampled Models

4.3.1 Formulating a Sampled Model

4.3.1.1 A Sampled Stochastic Linear Program

4.3.1.2 Sampled Chance-Constrained Models

4.3.1.3 Sampled Parametric Models

4.3.2 Convergence

4.3.3 Creating a Sampled Model

4.3.4 Decomposition Strategies*

4.4 Adaptive Learning Algorithms

4.4.1 Modeling Adaptive Learning Problems

4.4.2 Online vs. Offline Applications

4.4.2.1 Machine Learning

4.4.2.2 Optimization

4.4.3 Objective Functions for Learning

4.4.4 Designing Policies

4.5 Closing Remarks

4.6 Bibliographic Notes

Exercises

Bibliography

Part II – Stochastic Search

5 Derivative-Based Stochastic Search

5.1 Some Sample Applications

5.2 Modeling Uncertainty

5.2.1 Training Uncertainty W^1, …, W^N

5.2.2 Model Uncertainty S_0

5.2.3 Testing Uncertainty

5.2.4 Policy Evaluation

5.2.5 Closing Notes

5.3 Stochastic Gradient Methods

5.3.1 A Stochastic Gradient Algorithm

5.3.2 Introduction to Stepsizes

5.3.3 Evaluating a Stochastic Gradient Algorithm

5.3.4 A Note on Notation

5.4 Styles of Gradients

5.4.1 Gradient Smoothing

5.4.2 Second-Order Methods

5.4.3 Finite Differences

5.4.4 SPSA

5.4.5 Constrained Problems

5.5 Parameter Optimization for Neural Networks*

5.5.1 Computing the Gradient

5.5.2 The Stochastic Gradient Algorithm

5.6 Stochastic Gradient Algorithm as a Sequential Decision Problem

5.7 Empirical Issues

5.8 Transient Problems*

5.9 Theoretical Performance*

5.10 Why Does it Work?

5.10.1 Some Probabilistic Preliminaries

5.10.2 An Older Proof*

5.10.3 A More Modern Proof**

5.11 Bibliographic Notes

Exercises

Bibliography

6 Stepsize Policies

6.1 Deterministic Stepsize Policies

6.1.1 Properties for Convergence

6.1.2 A Collection of Deterministic Policies

6.1.2.1 Constant Stepsizes

6.1.2.2 Generalized Harmonic Stepsizes

6.1.2.3 Polynomial Learning Rates

6.1.2.4 McClain’s Formula

6.1.2.5 Search-then-Converge Learning Policy

6.2 Adaptive Stepsize Policies

6.2.1 The Case for Adaptive Stepsizes

6.2.2 Convergence Conditions

6.2.3 A Collection of Stochastic Policies

6.2.3.1 Kesten’s Rule

6.2.3.2 Trigg’s Formula

6.2.3.3 Stochastic Gradient Adaptive Stepsize Rule

6.2.3.4 ADAM

6.2.3.5 AdaGrad

6.2.3.6 RMSProp

6.2.4 Experimental Notes

6.3 Optimal Stepsize Policies*

6.3.1 Optimal Stepsizes for Stationary Data

6.3.2 Optimal Stepsizes for Nonstationary Data – I

6.3.3 Optimal Stepsizes for Nonstationary Data – II

6.4 Optimal Stepsizes for Approximate Value Iteration*

6.5 Convergence

6.6 Guidelines for Choosing Stepsize Policies

6.7 Why Does it Work?*

6.7.1 Proof of BAKF Stepsize

6.8 Bibliographic Notes

Exercises

Bibliography

7 Derivative-Free Stochastic Search

7.1 Overview of Derivative-free Stochastic Search

7.1.1 Applications and Time Scales

7.1.2 The Communities of Derivative-free Stochastic Search

7.1.3 The Multiarmed Bandit Story

7.1.4 From Passive Learning to Active Learning to Bandit Problems

7.2 Modeling Derivative-free Stochastic Search

7.2.1 The Universal Model

7.2.2 Illustration: Optimizing a Manufacturing Process

7.2.3 Major Problem Classes

7.3 Designing Policies

7.4 Policy Function Approximations

7.5 Cost Function Approximations

7.6 VFA-based Policies

7.6.1 An Optimal Policy

7.6.2 Beta-Bernoulli Belief Model

7.6.3 Backward Approximate Dynamic Programming

7.6.4 Gittins Indices for Learning in Steady State

7.7 Direct Lookahead Policies

7.7.1 When do we Need Lookahead Policies?

7.7.2 Single Period Lookahead Policies

7.7.3 Restricted Multiperiod Lookahead

7.7.4 Multiperiod Deterministic Lookahead

7.7.5 Multiperiod Stochastic Lookahead Policies

7.7.6 Hybrid Direct Lookahead

7.8 The Knowledge Gradient (Continued)*

7.8.1 The Belief Model

7.8.2 The Knowledge Gradient for Maximizing Final Reward

7.8.3 The Knowledge Gradient for Maximizing Cumulative Reward

7.8.4 The Knowledge Gradient for Sampled Belief Model*

7.8.5 Knowledge Gradient for Correlated Beliefs

7.9 Learning in Batches

7.10 Simulation Optimization*

7.10.1 An Indifference Zone Algorithm

7.10.2 Optimal Computing Budget Allocation

7.11 Evaluating Policies

7.11.1 Alternative Performance Metrics*

7.11.2 Perspectives of Optimality*

7.12 Designing Policies

7.12.1 Characteristics of a Policy

7.12.2 The Effect of Scaling

7.12.3 Tuning

7.13 Extensions*

7.13.1 Learning in Nonstationary Settings

7.13.2 Strategies for Designing Time-dependent Policies

7.13.3 A Transient Learning Model

7.13.4 The Knowledge Gradient for Transient Problems

7.13.5 Learning with Large or Continuous Choice Sets

7.13.6 Learning with Exogenous State Information – the Contextual Bandit Problem

7.13.7 State-dependent vs. State-independent Problems

7.14 Bibliographic Notes

Exercises

Bibliography

Part III – State-dependent Problems

8 State-dependent Problems

8.1 Graph Problems

8.1.1 A Stochastic Shortest Path Problem

8.1.2 The Nomadic Trucker

8.1.3 The Transformer Replacement Problem

8.1.4 Asset Valuation

8.2 Inventory Problems

8.2.1 A Basic Inventory Problem

8.2.2 The Inventory Problem – II

8.2.3 The Lagged Asset Acquisition Problem

8.2.4 The Batch Replenishment Problem

8.3 Complex Resource Allocation Problems

8.3.1 The Dynamic Assignment Problem

8.3.2 The Blood Management Problem

8.4 State-dependent Learning Problems

8.4.1 Medical Decision Making

8.4.2 Laboratory Experimentation

8.4.3 Bidding for Ad-clicks

8.4.4 An Information-collecting Shortest Path Problem

8.5 A Sequence of Problem Classes

8.6 Bibliographic Notes

Exercises

Bibliography

9 Modeling Sequential Decision Problems

9.1 A Simple Modeling Illustration

9.2 Notational Style

9.3 Modeling Time

9.4 The States of Our System

9.4.1 Defining the State Variable

9.4.2 The Three States of Our System

9.4.3 Initial State S_0 vs. Subsequent States S_t, t > 0

9.4.4 Lagged State Variables*

9.4.5 The Post-decision State Variable*

9.4.6 A Shortest Path Illustration

9.4.7 Belief States*

9.4.8 Latent Variables*

9.4.9 Rolling Forecasts*

9.4.10 Flat vs. Factored State Representations*

9.4.11 A Programmer’s Perspective of State Variables

9.5 Modeling Decisions

9.5.1 Types of Decisions

9.5.2 Initial Decision x0 vs. Subsequent Decisions xt, t > 0

9.5.3 Strategic, Tactical, and Execution Decisions

9.5.4 Constraints

9.5.5 Introducing Policies

9.6 The Exogenous Information Process

9.6.1 Basic Notation for Information Processes

9.6.2 Outcomes and Scenarios

9.6.3 Lagged Information Processes*

9.6.4 Models of Information Processes*

9.6.5 Supervisory Processes*

9.7 The Transition Function

9.7.1 A General Model

9.7.2 Model-free Dynamic Programming

9.7.3 Exogenous Transitions

9.8 The Objective Function

9.8.1 The Performance Metric

9.8.2 Optimizing the Policy

9.8.3 Dependence of Optimal Policy on S0

9.8.4 State-dependent Variations

9.8.5 Uncertainty Operators

9.9 Illustration: An Energy Storage Model

9.9.1 With a Time-series Price Model

9.9.2 With Passive Learning

9.9.3 With Active Learning

9.9.4 With Rolling Forecasts

9.10 Base Models and Lookahead Models

9.11 A Classification of Problems*

9.12 Policy Evaluation*

9.13 Advanced Probabilistic Modeling Concepts**

9.13.1 A Measure-theoretic View of Information**

9.13.2 Policies and Measurability

9.14 Looking Forward

9.15 Bibliographic Notes

Exercises

Bibliography

10 Uncertainty Modeling

10.1 Sources of Uncertainty

10.1.1 Observational Errors

10.1.2 Exogenous Uncertainty

10.1.3 Prognostic Uncertainty

10.1.4 Inferential (or Diagnostic) Uncertainty

10.1.5 Experimental Variability

10.1.6 Model Uncertainty

10.1.7 Transitional Uncertainty

10.1.8 Control/implementation Uncertainty

10.1.9 Communication Errors and Biases

10.1.10 Algorithmic Instability

10.1.11 Goal Uncertainty

10.1.12 Political/regulatory Uncertainty

10.1.13 Discussion

10.2 A Modeling Case Study: The COVID Pandemic

10.3 Stochastic Modeling

10.3.1 Sampling Exogenous Information

10.3.2 Types of Distributions

10.3.3 Modeling Sample Paths

10.3.4 State-action-dependent Processes

10.3.5 Modeling Correlations

10.4 Monte Carlo Simulation

10.4.1 Generating Uniform [0, 1] Random Variables

10.4.2 Uniform and Normal Random Variable

10.4.3 Generating Random Variables from Inverse Cumulative Distributions

10.4.4 Inverse Cumulative From Quantile Distributions

10.4.5 Distributions with Uncertain Parameters

10.5 Case Study: Modeling Electricity Prices

10.5.1 Mean Reversion

10.5.2 Jump-diffusion Models

10.5.3 Quantile Distributions

10.5.4 Regime Shifting

10.5.5 Crossing Times

10.6 Sampling vs. Sampled Models

10.6.1 Iterative Sampling: A Stochastic Gradient Algorithm

10.6.2 Static Sampling: Solving a Sampled Model

10.6.3 Sampled Representation with Bayesian Updating

10.7 Closing Notes

10.8 Bibliographic Notes

Exercises

Bibliography

11 Designing Policies

11.1 From Optimization to Machine Learning to Sequential Decision Problems

11.2 The Classes of Policies

11.3 Policy Function Approximations

11.4 Cost Function Approximations

11.5 Value Function Approximations

11.6 Direct Lookahead Approximations

11.6.1 The Basic Idea

11.6.2 Modeling the Lookahead Problem

11.6.3 The Policy-Within-a-Policy

11.7 Hybrid Strategies

11.7.1 Cost Function Approximation with Policy Function Approximations

11.7.2 Lookahead Policies with Value Function Approximations

11.7.3 Lookahead Policies with Cost Function Approximations

11.7.4 Tree Search with Rollout Heuristic and a Lookup Table Policy

11.7.5 Value Function Approximation with Policy Function Approximation

11.7.6 Fitting Value Functions Using ADP and Policy Search

11.8 Randomized Policies

11.9 Illustration: An Energy Storage Model Revisited

11.9.1 Policy Function Approximation

11.9.2 Cost Function Approximation

11.9.3 Value Function Approximation

11.9.4 Deterministic Lookahead

11.9.5 Hybrid Lookahead-Cost Function Approximation

11.9.6 Experimental Testing

11.10 Choosing the Policy Class

11.10.1 The Policy Classes

11.10.2 Policy Complexity-Computational Tradeoffs

11.10.3 Screening Questions

11.11 Policy Evaluation

11.12 Parameter Tuning

11.12.1 The Soft Issues

11.12.2 Searching Across Policy Classes

11.13 Bibliographic Notes

Exercises

Bibliography

Part IV – Policy Search

12 Policy Function Approximations and Policy Search

12.1 Policy Search as a Sequential Decision Problem

12.2 Classes of Policy Function Approximations

12.2.1 Lookup Table Policies

12.2.2 Boltzmann Policies for Discrete Actions

12.2.3 Linear Decision Rules

12.2.4 Monotone Policies

12.2.5 Nonlinear Policies

12.2.6 Nonparametric/Locally Linear Policies

12.2.7 Contextual Policies

12.3 Problem Characteristics

12.4 Flavors of Policy Search

12.5 Policy Search with Numerical Derivatives

12.6 Derivative-Free Methods for Policy Search

12.6.1 Belief Models

12.6.2 Learning Through Perturbed PFAs

12.6.3 Learning CFAs

12.6.4 DLA Using the Knowledge Gradient

12.6.5 Comments

12.7 Exact Derivatives for Continuous Sequential Problems*

12.8 Exact Derivatives for Discrete Dynamic Programs**

12.8.1 A Stochastic Policy

12.8.2 The Objective Function

12.8.3 The Policy Gradient Theorem

12.8.4 Computing the Policy Gradient

12.9 Supervised Learning

12.10 Why Does it Work?

12.10.1 Derivation of the Policy Gradient Theorem

12.11 Bibliographic Notes

Exercises

Bibliography

13 Cost Function Approximations

13.1 General Formulation for Parametric CFA

13.2 Objective-Modified CFAs

13.2.1 Linear Cost Function Correction

13.2.2 CFAs for Dynamic Assignment Problems

13.2.3 Dynamic Shortest Paths

13.2.4 Dynamic Trading Policy

13.2.5 Discussion

13.3 Constraint-Modified CFAs

13.3.1 General Formulation of Constraint-Modified CFAs

13.3.2 A Blood Management Problem

13.3.3 An Energy Storage Example with Rolling Forecasts

13.4 Bibliographic Notes

Exercises

Bibliography

Part V – Lookahead Policies

14 Exact Dynamic Programming

14.1 Discrete Dynamic Programming

14.2 The Optimality Equations

14.2.1 Bellman’s Equations

14.2.2 Computing the Transition Matrix

14.2.3 Random Contributions

14.2.4 Bellman’s Equation Using Operator Notation*

14.3 Finite Horizon Problems

14.4 Continuous Problems with Exact Solutions

14.4.1 The Gambling Problem

14.4.2 The Continuous Budgeting Problem

14.5 Infinite Horizon Problems*

14.6 Value Iteration for Infinite Horizon Problems*

14.6.1 A Gauss-Seidel Variation

14.6.2 Relative Value Iteration

14.6.3 Bounds and Rates of Convergence

14.7 Policy Iteration for Infinite Horizon Problems*

14.8 Hybrid Value-Policy Iteration*

14.9 Average Reward Dynamic Programming*

14.10 The Linear Programming Method for Dynamic Programs**

14.11 Linear Quadratic Regulation

14.12 Why Does it Work?**

14.12.1 The Optimality Equations

14.12.2 Convergence of Value Iteration

14.12.3 Monotonicity of Value Iteration

14.12.4 Bounding the Error from Value Iteration

14.12.5 Randomized Policies

14.13 Bibliographic Notes

Exercises

Bibliography

15 Backward Approximate Dynamic Programming

15.1 Backward Approximate Dynamic Programming for Finite Horizon Problems

15.1.1 Some Preliminaries

15.1.2 Backward ADP Using Lookup Tables

15.1.3 Backward ADP Algorithm with Continuous Approximations

15.2 Fitted Value Iteration for Infinite Horizon Problems

15.3 Value Function Approximation Strategies

15.3.1 Linear Models

15.3.2 Monotone Functions

15.3.3 Other Approximation Models

15.4 Computational Observations

15.4.1 Experimental Benchmarking of Backward ADP

15.4.2 Computational Notes

15.5 Bibliographic Notes

Exercises

Bibliography

16 Forward ADP I: The Value of a Policy

16.1 Sampling the Value of a Policy

16.1.1 Direct Policy Evaluation for Finite Horizon Problems

16.1.2 Policy Evaluation for Infinite Horizon Problems

16.1.3 Temporal Difference Updates

16.1.4 TD(λ)

16.1.5 TD(0) and Approximate Value Iteration

16.1.6 TD Learning for Infinite Horizon Problems

16.2 Stochastic Approximation Methods

16.3 Bellman’s Equation Using a Linear Model*

16.3.1 A Matrix-based Derivation**

16.3.2 A Simulation-based Implementation

16.3.3 Least Squares Temporal Difference Learning (LSTD)

16.3.4 Least Squares Policy Evaluation

16.4 Analysis of TD(0), LSTD, and LSPE Using a Single State*

16.4.1 Recursive Least Squares and TD(0)

16.4.2 Least Squares Policy Evaluation

16.4.3 Least Squares Temporal Difference Learning

16.4.4 Discussion

16.5 Gradient-based Methods for Approximate Value Iteration*

16.5.1 Approximate Value Iteration with Linear Models**

16.5.2 A Geometric View of Linear Models*

16.6 Value Function Approximations Based on Bayesian Learning*

16.6.1 Minimizing Bias for Infinite Horizon Problems

16.6.2 Lookup Tables with Correlated Beliefs

16.6.3 Parametric Models

16.6.4 Creating the Prior

16.7 Learning Algorithms and Stepsizes

16.7.1 Least Squares Temporal Differences

16.7.2 Least Squares Policy Evaluation

16.7.3 Recursive Least Squares

16.7.4 Bounding 1/n Convergence for Approximate Value Iteration

16.7.5 Discussion

16.8 Bibliographic Notes

Exercises

Bibliography

17 Forward ADP II: Policy Optimization

17.1 Overview of Algorithmic Strategies

17.2 Approximate Value Iteration and Q-Learning Using Lookup Tables

17.2.1 Value Iteration Using a Pre-Decision State Variable

17.2.2 Q-Learning

17.2.3 Value Iteration Using a Post-Decision State Variable

17.2.4 Value Iteration Using a Backward Pass

17.3 Styles of Learning

17.3.1 Offline Learning

17.3.2 From Offline to Online

17.3.3 Evaluating Offline and Online Learning Policies

17.3.4 Lookahead Policies

17.4 Approximate Value Iteration Using Linear Models

17.5 On-policy vs. off-policy learning and the exploration–exploitation problem

17.5.1 Terminology

17.5.2 Learning with Lookup Tables

17.5.3 Learning with Generalized Belief Models

17.6 Applications

17.6.1 Pricing an American Option

17.6.2 Playing “Lose Tic-Tac-Toe”

17.6.3 Approximate Dynamic Programming for Deterministic Problems

17.7 Approximate Policy Iteration

17.7.1 Finite Horizon Problems Using Lookup Tables

17.7.2 Finite Horizon Problems Using Linear Models

17.7.3 LSTD for Infinite Horizon Problems Using Linear Models

17.8 The Actor–Critic Paradigm

17.9 Statistical Bias in the Max Operator*

17.10 The Linear Programming Method Using Linear Models*

17.11 Finite Horizon Approximations for Steady-State Applications

17.12 Bibliographic Notes

Exercises

Bibliography

18 Forward ADP III: Convex Resource Allocation Problems

18.1 Resource Allocation Problems

18.1.1 The Newsvendor Problem

18.1.2 Two-Stage Resource Allocation Problems

18.1.3 A General Multiperiod Resource Allocation Model*

18.2 Values Versus Marginal Values

18.3 Piecewise Linear Approximations for Scalar Functions

18.3.1 The Leveling Algorithm

18.3.2 The CAVE Algorithm

18.4 Regression Methods

18.5 Separable Piecewise Linear Approximations

18.6 Benders Decomposition for Nonseparable Approximations**

18.6.1 Benders’ Decomposition for Two-Stage Problems

18.6.2 Asymptotic Analysis of Benders with Regularization**

18.6.3 Benders with Regularization

18.7 Linear Approximations for High-Dimensional Applications

18.8 Resource Allocation with Exogenous Information State

18.9 Closing Notes

18.10 Bibliographic Notes

Exercises

Bibliography

19 Direct Lookahead Policies

19.1 Optimal Policies Using Lookahead Models

19.2 Creating an Approximate Lookahead Model

19.2.1 Modeling the Lookahead Model

19.2.2 Strategies for Approximating the Lookahead Model

19.3 Modified Objectives in Lookahead Models

19.3.1 Managing Risk

19.3.2 Utility Functions for Multiobjective Problems

19.3.3 Model Discounting

19.4 Evaluating DLA Policies

19.4.1 Evaluating Policies in a Simulator

19.4.2 Evaluating Risk-Adjusted Policies

19.4.3 Evaluating Policies in the Field

19.4.4 Tuning Direct Lookahead Policies

19.5 Why Use a DLA?

19.6 Deterministic Lookaheads

19.6.1 A Deterministic Lookahead: Shortest Path Problems

19.6.2 Parameterized Lookaheads

19.7 A Tour of Stochastic Lookahead Policies

19.7.1 Lookahead PFAs

19.7.2 Lookahead CFAs

19.7.3 Lookahead VFAs for the Lookahead Model

19.7.4 Lookahead DLAs for the Lookahead Model

19.7.5 Discussion

19.8 Monte Carlo Tree Search for Discrete Decisions

19.8.1 Basic Idea

19.8.2 The Steps of MCTS

19.8.3 Discussion

19.8.4 Optimistic Monte Carlo Tree Search

19.9 Two-Stage Stochastic Programming for Vector Decisions*

19.9.1 The Basic Two-Stage Stochastic Program

19.9.2 Two-Stage Approximation of a Sequential Problem

19.9.3 Discussion

19.10 Observations on DLA Policies

19.11 Bibliographic Notes

Exercises

Bibliography

Part VI – Multiagent Systems

20 Multiagent Modeling and Learning

20.1 Overview of Multiagent Systems

20.1.1 Dimensions of a Multiagent System

20.1.2 Communication

20.1.3 Modeling a Multiagent System

20.1.4 Controlling Architectures

20.2 A Learning Problem – Flu Mitigation

20.2.1 Model 1: A Static Model

20.2.2 Variations of Our Flu Model

20.2.3 Two-Agent Learning Models

20.2.4 Transition Functions for Two-Agent Model

20.2.5 Designing Policies for the Flu Problem

20.3 The POMDP Perspective*

20.4 The Two-Agent Newsvendor Problem

20.5 Multiple Independent Agents – An HVAC Controller Model

20.5.1 Model

20.5.2 Designing Policies

20.6 Cooperative Agents – A Spatially Distributed Blood Management Problem

20.7 Closing Notes

20.8 Why Does it Work?

20.8.1 Derivation of the POMDP Belief Transition Function

20.9 Bibliographic Notes

Exercises

Bibliography

Index

End User License Agreement

List of Illustrations

Chapter 1

Figure 1.1 Illustration of shipments...

Figure 1.2 A sampling of...

Figure 1.3 Deterministic vs. stochastic...

Chapter 2

Figure 2.1 Decision tree illustrating...

Figure 2.2 Canonical model for...

Figure 2.3 Finding a path...

Figure 2.4 A basic shortest...

Figure 2.5 Illustration of a...

Chapter 3

Figure 3.1 Illustration of the...

Figure 3.2 Illustration of a...

Figure 3.3 Illustration of a...

Figure 3.4 Average weight (across...

Figure 3.5 Neural networks with...

Figure 3.6 A three-layer...

Figure 3.7 Illustrative logistics function...

Figure 3.8 A neural network...

Figure 3.9 Illustration of Gaussian...

Figure 3.10 Illustration of a...

Figure 3.11 Illustration of penalty...

Figure 3.12 Fitting noisy data...

Figure 3.13 Illustration of the...

Chapter 4

Figure 4.1 Concave function, showing...

Chapter 5

Figure 5.1 Illustration of gradient...

Figure 5.2 Different estimates of...

Figure 5.3 Convergence of SPSA...

Figure 5.4 A three-layer...

Figure 5.5 (a) Illustration of...

Figure 5.6 A stochastic gradient...

Figure 5.7 A four-layer...

Chapter 6

Figure 6.1 Illustration of poor...

Figure 6.2 Illustration of the...

Figure 6.3 Stepsizes for a...

Figure 6.4 The McClain stepsize...

Figure 6.5 The search-then...

Figure 6.6 Different parameters can...

Figure 6.7 The bias-adjusted...

Figure 6.8 The BAKF stepsize...

Figure 6.9 The objective function...

Figure 6.10 Performance of stochastic...

Chapter 7

Figure 7.1 A set of...

Figure 7.2 (a) Optimizing over...

Figure 7.3 A family of...

Figure 7.4 Value of making...

Figure 7.5 The KG(*) policy...

Figure 7.6 The decision tree...

Figure 7.7 (a) Expected value...

Figure 7.8 The knowledge gradient...

Figure 7.9 Sampled set of...

Figure 7.10 Regions of z...

Figure 7.11 (a) Sampling pattern...

Figure 7.12 Comparison of correlated...

Figure 7.13 The knowledge gradient...

Figure 7.14 Policy search algorithm...

Figure 7.15 Optimal computing budget...

Figure 7.16 Use to plot...

Chapter 8

Figure 8.1 Illustration of a...

Figure 8.2 Illustration of the...

Figure 8.3 (a) The assignment...

Chapter 9

Figure 9.1 The path from...

Figure 9.2 Relationship between discrete...

Figure 9.3 Illustration of the...

Figure 9.4 Pre-decision state...

Figure 9.5 A deterministic network...

Figure 9.6 A stochastic network...

Figure 9.7 Deterministic shortest path...

Figure 9.8 A path problem...

Figure 9.9 A path problem...

Chapter 10

Figure 10.1 Generating uniformly and...

Figure 10.2 Electricity spot prices...

Figure 10.3 Illustration of a...

Figure 10.4 Actual vs. predicted...

Chapter 11

Figure 11.1 Decision tree showing...

Figure 11.2 (a) Decision to...

Figure 11.3 Illustration of the...

Chapter 12

Figure 12.1 Illustration of a...

Figure 12.2 Actively learning a...

Figure 12.3 Plot of the...

Figure 12.4 Sampled belief model...

Figure 12.5 The knowledge gradient...

Chapter 13

Figure 13.1 Illustration of a...

Figure 13.2 Illustration of rolling...

Figure 13.3 Allowable substitutions of...

Figure 13.4 Multiperiod model of...

Figure 13.5 Energy storage system...

Figure 13.6 (a) Energy load...

Figure 13.7 Objective vs...

Figure 13.8 2-d heatmaps...

Figure 13.9 Improvement in profits...

Chapter 14

Figure 14.1 Decision tree illustrating...

Figure 14.2 Collapsed version of...

Figure 14.3 A backward dynamic...

Figure 14.4 The value iteration...

Figure 14.5 The Gauss-Seidel...

Figure 14.6 Relative value iteration...

Figure 14.7 Policy iteration...

Figure 14.9 Cumulative contribution over...

Chapter 15

Figure 15.1 Illustration of transitions...

Figure 15.2 Calculation of the...

Figure 15.3 A backward dynamic...

Figure 15.4 Calculation of value...

Figure 15.5 A backward dynamic...

Figure 15.6 Illustration of the...

Figure 15.7 Video snapshot of...

Figure 15.8 Real-time energy...

Figure 15.9 Performance of the...

Chapter 16

Figure 16.1 Basic policy evaluation...

Figure 16.2 Five-state Markov...

Figure 16.3 A TD...

Figure 16.4 v¯n...

Chapter 17

Figure 17.1 Approximate dynamic programming...

Figure 17.2 A Q-learning...

Figure 17.3 Approximate value iteration...

Figure 17.4 Double-pass version...

Figure 17.5 Approximate value iteration...

Figure 17.6 Two-state dynamic...

Figure 17.7 The effect of...

Figure 17.8 Some tic-tac...

Figure 17.9 A policy iteration...

Figure 17.10 A policy iteration...

Figure 17.11 Approximate policy iteration...

Figure 17.12 Approximate policy iteration...

Figure 17.13 E[max_x U_x] − max_x E[U_x]...

Figure 17.14 Queueing network with...

Figure 17.15 Exact value function...

Chapter 18

Figure 18.1 (a) The shape...

Figure 18.2 A two-stage...

Figure 18.3 Steps of the...

Figure 18.4 Steps in estimating...

Figure 18.5 Illustration of Benders...

Figure 18.6 Illustration of Benders...

Figure 18.7 Comparison of Benders...

Figure 18.8 Illustration of cuts...

Chapter 19

Figure 19.1 Relationship between the...

Figure 19.2 Illustration of (a...

Figure 19.3 Simulation of a...

Figure 19.4 The performance elbow...

Figure 19.5 Growth in CPU...

Figure 19.6 Illustration of the...

Figure 19.7 Illustration of Monte...

Figure 19.8 Sampled MCTS algorithm...

Figure 19.9 The tree policy...

Figure 19.10 This function simulates...

Figure 19.11 Backup process which...

Figure 19.12 Sample of a...

Figure 19.13 Performance of (a...

Chapter 20

Figure 20.1 The canonical model...

Figure 20.2 The canonical model...

Figure 20.3 Information flow diagram...

Figure 20.4 (a) Blood network...

Figure 20.5 Blood management for...

List of Tables

Chapter 1

Table 1.1. A list of application...

Table 1.2. Fields that deal with...

Table 1.3. Comparison of classical problems...

Chapter 3

Table 3.1. Examples of aggregations of...

Chapter 5

Table 5.1. Illustration of six sample...

Chapter 7

Table 7.1. A sample of the...

Table 7.2. Gittins indices...

Table 7.3. History of hitting performance...

Table 7.4. The calculations behind the...

Table 7.5. Correlation matrix describing the...

Table 7.6. Optimal tuned parameters for...

Table 7.7. Priors for exercise 7.19...

Table 7.8. Three observations, for three...

Table 7.12. Priors

Table 7.13. Prior beliefs for learning exercise...

Chapter 8

Table 8.1. Allowable blood substitutions for...

Table 8.2. Contributions for different types...

Chapter 9

Table 9.1. Sample of words in...

Table 9.2. A set of sample...

Table 9.3. Comparison of formulations for...

Table 9.4. Set of demand outcomes...

Chapter 10

Table 10.1. Illustration of different types...

Table 10.2. Illustration of a set...

Chapter 11

Table 11.1. Words from the English...

Table 11.2. where the bold entries (in the diagonal) indicate the...

Table 11.3. Illustration of complicating state...

Table 11.4. Suggested starting strategies for...

Chapter 15

Table 15.1. Comparison of revenues generated...

Chapter 16

Table 16.1. Effect of stepsize on...

Chapter 17

Table 17.1. Illustration of a single...

Table 17.2. Illustration of a double...

Table 17.3. Ten sample realizations of...

Table 17.4. The payout at time...

Table 17.5. The data table for...

Table 17.6. The payout if we...

Chapter 18

Table 18.1. Sample list of resource...

Chapter 20

Table 20.1. Environmental state variables and...

Table 20.2. Table of notation for...

Guide

Cover

Title page

Copyright

Table of Contents

Preface

Acknowledgments

Begin Reading

Index

End User License Agreement

Preface

Preface to Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions

This book represents a lifetime of research into what I now call sequential decision problems, which dates to 1982 when I was introduced to the problem arising in truckload trucking (think of Uber/Lyft for trucks), where we have to choose which driver to assign to a load, and which loads to accept to move, given the high level of randomness in future customer demands, representing requests to move full truckloads of freight.

It took me 20 years to figure out a practical algorithm to solve this problem, which led to my first book (in 2007) on approximate dynamic programming, where the major breakthrough was the introduction of the post-decision state and the use of hierarchical aggregation for approximating value functions to solve these high-dimensional problems. However, I would argue today that the most important chapter in the book (and I recognized it at the time) was chapter 5 on how to model these problems, without any reference to algorithms to solve the problem. I identified five elements of a sequential decision problem, leading up to the objective function, which was written

max_π E { Σ_{t=0}^{T} C(S_t, X^π(S_t)) | S_0 }
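As a concrete illustration of these five elements, here is a minimal Python sketch for a hypothetical inventory problem; the class name, the order-up-to policy, and all of the numbers are illustrative assumptions rather than code from the book.

```python
import random

# Minimal sketch of the five elements for a hypothetical inventory problem:
# state variable, decision variable, exogenous information, transition function,
# and objective function (here, the cumulative contribution of a fixed policy).


class InventoryModel:
    def __init__(self, price=10.0, cost=6.0, seed=0):
        self.price, self.cost = price, cost
        self.rng = random.Random(seed)

    def initial_state(self):
        return {"inventory": 0}                      # initial state S_0

    def exogenous_info(self):
        return {"demand": self.rng.randint(0, 20)}   # exogenous information W_{t+1}

    def transition(self, state, x, W):               # S_{t+1} = S^M(S_t, x_t, W_{t+1})
        leftover = max(0, state["inventory"] + x["order"] - W["demand"])
        return {"inventory": leftover}

    def contribution(self, state, x, W):             # C(S_t, x_t, W_{t+1})
        sales = min(state["inventory"] + x["order"], W["demand"])
        return self.price * sales - self.cost * x["order"]


def order_up_to(state, theta=15):
    """A simple policy X^pi(S_t): order enough to bring inventory up to theta."""
    return {"order": max(0, theta - state["inventory"])}


# The objective: accumulate contributions while following the policy.
model = InventoryModel()
S, total = model.initial_state(), 0.0
for t in range(50):
    x = order_up_to(S)              # decision
    W = model.exogenous_info()      # then information
    total += model.contribution(S, x, W)
    S = model.transition(S, x, W)
print(f"Cumulative contribution: {total:.1f}")
```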

It was not until the second edition (in 2011) that I realized that approximate dynamic programming (specifically, policies that depend on value functions) was not the only way to solve these problems; rather, there were four classes of policies, and only one used value functions. The 2011 edition of the book listed three of the four classes of policies that are described in this book, but most of the book still focused on approximating value functions. It was not until a 2014 paper (“Clearing the Jungle of Stochastic Optimization”) that I identified the four classes of policies I use now. Then, in 2016 I realized that the four classes of policies could be divided between two major strategies: the policy search strategy, where we search over a family of functions to find the one that works best, and the lookahead strategy, where we make good decisions by approximating the downstream impact of a decision made now.

Finally, I combined these ideas in a 2019 paper (“A Unified Framework for Stochastic Optimization,” published in the European Journal of Operational Research) with a better appreciation of major classes of problems such as state-independent problems (the pure learning problems that include derivative-based and derivative-free stochastic search) and the more general state-dependent problems; cumulative and final reward objective functions; and the realization that any adaptive search algorithm was a sequential decision problem. The material in the 2019 paper is effectively the outline for this book.

This book builds on the 2011 edition of my approximate dynamic programming book, and includes a number of chapters (some heavily edited) from the ADP book. It would be nice to call this a third edition, but the entire framework of this book is completely different. “Approximate dynamic programming” is a term that still refers to making decisions based on the idea of approximating the downstream value of being in a state. After decades of working with this approach (which is still covered over a span of five chapters in this volume), I can now say with confidence that value function approximations, despite all the attention they have received, are a powerful methodology for a surprisingly narrow set of decision problems.

By contrast, I finally developed the confidence to claim that the four classes of policies are universal. This means that any method for making decisions will fall in one of these four classes, or a hybrid of two or more. This is a game changer, because it shifts the focus from an algorithm (the method for making decisions) to the model (specifically the optimization problem above, along with the state-transition function and the model of the exogenous information process). This means we write out the elements of a problem before we tackle the problem of designing policies to make decisions. I call this:

Model first, then solve.

The communities working on sequential decision problems are very focused on methods, just as I was with my earlier work with approximate dynamic programming. The problem is that any particular method will be inherently limited to a narrow class of problems. In this book, I demonstrate how you can take a simple inventory problem, and then tweak the data to make each of the four classes work best.

This new approach has opened up an entirely new way of approaching a problem class that, in the last year of writing the book, I started calling “sequential decision analytics,” which is any problem consisting of the sequence:

Decision, information, decision, information, …

I allow decisions to range from binary (selling an asset) to discrete choices (favored in computer science) to the high-dimensional resource allocation problems popular in operations research. This approach starts with a problem, shifts to the challenging task of modeling uncertainty, and then finishes with designing policies to make decisions to optimize some metric. The approach is practical, scalable, and universally applicable.
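The last step, searching for a policy that optimizes the chosen metric, can be as simple as simulating the decision-information sequence for each candidate policy and keeping the best. The sketch below tunes the order-up-to level of the hypothetical inventory policy from the earlier sketch; the demand model and the grid of candidate levels are assumptions made for illustration.

```python
import random

# Illustrative policy search: evaluate each candidate parameter theta by
# simulating the "decision, information, decision, information, ..." sequence
# and averaging the per-period contribution. All numbers are made up.

def average_contribution(theta, horizon=500, price=10.0, cost=6.0, seed=1):
    rng = random.Random(seed)                        # common random numbers across candidates
    inventory, total = 0, 0.0
    for _ in range(horizon):
        order = max(0, theta - inventory)            # decision x_t = X^pi(S_t | theta)
        demand = rng.randint(0, 20)                  # exogenous information W_{t+1}
        sales = min(inventory + order, demand)
        total += price * sales - cost * order        # contribution C(S_t, x_t, W_{t+1})
        inventory = inventory + order - sales        # transition to the next state
    return total / horizon

best = max(range(0, 26), key=average_contribution)   # search over the policy parameter
print("Best order-up-to level:", best,
      "with average contribution", round(average_contribution(best), 2))
```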

It is exciting to be able to create a single framework that spans 15 different communities, and which represents every possible method for solving sequential decision problems. While having a common language to model any sequential decision problem, combined with the general approach of the four classes of policies, is clearly of value, this framework has been developed by standing on the shoulders of the giants who have laid the foundational work for all of these methods. I have had to make choices regarding the best notation and modeling conventions, but my framework is completely inclusive of all the methods that have been developed to solve these problems. Rather than joining the chorus of researchers promoting specific algorithmic strategies (as I once did), my goal is to raise the visibility of all methods, so that someone looking to solve a real problem is working with the biggest possible toolbox, rather than just the tools developed within a specific community.

A word needs to be said about the title of the book. As this is being written, there is a massive surge of interest in “reinforcement learning,” which started as a form of approximate dynamic programming (I used to refer to ADP and RL as similar to American English and British English). However, as the RL community has grown and started working on harder problems, they encountered the same experience that I and everyone else working in ADP found: value function approximations are not a panacea. Not only is it the case that they often do not work, they usually do not work. As a result, the RL community branched out (just as I did) into other methods such as “policy gradient methods” (my “policy function approximations” or PFA), upper confidence bounding (a form of “cost function approximation” or CFA), the original Q-learning (which produces a policy based on “value function approximations” or VFA), and finally Monte Carlo tree search (a policy based on “direct lookahead approximations” or DLA). All of these methods are found in the second edition of Sutton and Barto’s landmark book Reinforcement Learning: An introduction, but only as specific methods rather than general classes. This book takes the next step and identifies the general classes.
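To make the contrast between the four classes concrete, the sketch below writes one toy example of each (PFA, CFA, VFA, DLA) for choosing an order quantity from a small set; the profit model, the stand-in value estimate, and the rollout rule inside the lookahead are all illustrative assumptions rather than methods from the book.

```python
import random

# Toy illustrations of the four policy classes for a single inventory decision.
# The state is S = {"inventory": ...}; the decision is an order quantity from CHOICES.
# Every number and helper here is a stand-in chosen for illustration.

CHOICES = [0, 5, 10, 15, 20]
PRICE, COST, MEAN_DEMAND = 10.0, 6.0, 10

def expected_profit(S, x):
    """Deliberately crude one-period profit model using the mean demand."""
    sales = min(S["inventory"] + x, MEAN_DEMAND)
    return PRICE * sales - COST * x

def pfa(S, theta=15):
    """Policy function approximation: an explicit rule mapping the state to a decision."""
    return max(0, theta - S["inventory"])

def cfa(S, bonus=1.0):
    """Cost function approximation: optimize a deterministic model with a tunable modification."""
    return max(CHOICES, key=lambda x: expected_profit(S, x) + bonus * x)

def vfa(S):
    """Value function approximation: one-period profit plus an estimate of downstream value."""
    def downstream(x):                      # stand-in for a learned value function V(S_{t+1})
        leftover = max(0, S["inventory"] + x - MEAN_DEMAND)
        return 0.5 * min(leftover, MEAN_DEMAND)
    return max(CHOICES, key=lambda x: expected_profit(S, x) + downstream(x))

def dla(S, horizon=3, seed=2):
    """Direct lookahead approximation: simulate a few periods ahead for each candidate."""
    def rollout(x):
        rng = random.Random(seed)           # common random numbers across candidates
        inv, total = S["inventory"], 0.0
        for _ in range(horizon):
            demand = rng.randint(0, 2 * MEAN_DEMAND)
            sales = min(inv + x, demand)
            total += PRICE * sales - COST * x
            inv = inv + x - sales
            x = max(0, 15 - inv)            # simple rule used inside the lookahead model
        return total
    return max(CHOICES, key=rollout)

S = {"inventory": 4}
print({policy.__name__: policy(S) for policy in (pfa, cfa, vfa, dla)})
```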

This evolution from one core method to all four classes of policies is being repeated among other fields that I came to call the “jungle of stochastic optimization.” Stochastic search, simulation-optimization, and bandit problems all feature methods from each of the four classes of policies. Over time, I came to realize that all these fields (including reinforcement learning) were playing catchup to the grandfather of all of this work, which is optimal control (and stochastic control). The field of optimal control was the first to introduce and seriously explore the use of value function approximations (they call these cost-to-go functions), linear decision rules (a form of PFA), and the workhorse “model predictive control” (a great name for a simple rolling horizon procedure, which is a “direct lookahead approximation” in this book). I also found that my modeling framework was closest to that used in the optimal control literature, which was the first field to introduce the concept of a transition function, a powerful modeling device that has been largely overlooked by the other communities. I make a few small tweaks such as using state St instead of xt, and decision xt (widely used in the field of math programming) instead of ut.

Then I introduce one big change, which is to maximize over all four classes of policies. Perhaps the most important innovation of this book is to break the almost automatic link between optimizing over policies, and then assuming that we are going to compute an optimal policy from either Bellman’s equation or the Hamilton-Jacobi equations. These are rarely computable for real problems, which then leads people to assume that the natural next step is to approximate these equations. This is simply false, supported by decades of research where people have developed methods that do not depend on HJB equations. I recognize this body of research developing different classes of policies by making the inclusion of all four classes of policies fundamental to the original statement of the optimization problem above.

It will take some time for people from the different communities to learn to speak this common language. More likely, there will be an adaptation of existing modeling languages to this framework. For example, the optimal control community could keep their notation, but learn to write their objective functions as I have above, recognizing that the search over policies needs to span all four classes (which, I might point out, they are already using). I would hope that the reinforcement learning community, which adopted the notation for discrete action a, might learn to use the more general x (as the bandit community has already done).

I have tried to write this book to appeal to newcomers to the field, as well as people who already have training in one or more of the subfields that deal with decisions and uncertainty; recognizing these two broad communities was easily the biggest challenge while writing this book. Not surprisingly, the book is quite long. I have tried to make it more accessible to people who are new to the field by marking many sections with an * as an indication that this section can be skipped on a first-read. I also hope that the book will appeal to people from many application domains. However, the core audience is people who are looking to solve real problems by modeling applications and implementing the work in software. The notation is designed to facilitate writing computer programs, where there should be a direct relationship between the mathematical model and the software. This is particularly important when modeling the flow of information, something that is often overlooked in mainstream reinforcement learning papers.

Warren B. Powell

Princeton, New Jersey
August 2021

Acknowledgments

The foundation of this book is a modeling framework for sequential decision problems that involves searching over four classes of policies for making decisions. The recognition that we needed all four classes of policies came from working on a wide range of problems spanning freight transportation (almost all modes), energy, health, e-commerce, finance, and even materials science (!!).

This research required a lot of computational work, which was only possible through the efforts of the many students and staff that worked in CASTLE Lab. Over my 39 years of teaching at Princeton, I benefited tremendously from the interactions with 70 graduate students and post-doctoral associates, along with nine professional staff. I am deeply indebted to the contributions of this exceptionally talented group of men and women who allowed me to participate in the challenges of getting computational methods to work on such a wide range of problems. It was precisely this diversity of problem settings that led me to appreciate the motivation for the different methods for solving problems. In the process, I met people from across the jungle, and learned to speak their language not just by reading papers, but by talking to them and, often, working on their problems.