REINFORCEMENT LEARNING AND STOCHASTIC OPTIMIZATION

Clearing the jungle of stochastic optimization

Sequential decision problems, which consist of the sequence "decision, information, decision, information," are ubiquitous, spanning virtually every human activity: business applications, health (personal and public health, and medical decision making), energy, the sciences, all fields of engineering, finance, and e-commerce. The diversity of applications has attracted the attention of at least 15 distinct fields of research, using eight distinct notational systems, which have produced a vast array of analytical tools. A byproduct is that powerful tools developed in one community may be unknown to other communities.

Reinforcement Learning and Stochastic Optimization offers a single canonical framework that can model any sequential decision problem using five core components: state variables, decision variables, exogenous information variables, the transition function, and the objective function. The book highlights twelve types of uncertainty that might enter any model, and pulls together the diverse set of methods for making decisions, known as policies, into four fundamental classes that span every method suggested in the academic literature or used in practice.

Reinforcement Learning and Stochastic Optimization is the first book to provide a balanced treatment of the different methods for modeling and solving sequential decision problems, following the style used by most books on machine learning, optimization, and simulation. The presentation is designed for readers with a course in probability and statistics and an interest in modeling and applications; linear programming is occasionally used for specific problem classes. The book is designed for readers who are new to the field, as well as those with some background in optimization under uncertainty.

Throughout this book, readers will find references to over 100 different applications, spanning pure learning problems, dynamic resource allocation problems, general state-dependent problems, and hybrid learning/resource allocation problems such as those that arose in the COVID pandemic. There are 370 exercises, organized into seven groups that range from review questions, modeling, computation, problem solving, theory, and programming exercises to a "diary problem" that a reader chooses at the beginning of the book and uses as the basis for questions throughout the rest of the book.
Warren B. Powell Princeton University Princeton, NJ
This edition first published 2022
© 2022 John Wiley & Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Warren B. Powell to be identified as the author of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
The contents of this work are intended to further general scientific research, understanding, and discussion only and are not intended and should not be relied upon as recommending or promoting scientific method, diagnosis, or treatment by physicians for any particular patient. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of medicines, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each medicine, equipment, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
A catalogue record for this book is available from the Library of Congress
Hardback ISBN: 9781119815037; ePub ISBN: 9781119815051; ePDF ISBN: 9781119815044; Obook ISBN: 9781119815068.
Cover image: © DeniseSmit/Getty Images
Cover design by Wiley
Set in 9.5/12.5pt STIXTwoText by Integra Software Services Pvt. Ltd, Pondicherry, India
Cover
Title page
Copyright
Preface
Acknowledgments
Part I – Introduction
1 Sequential Decision Problems
1.1 The Audience
1.2 The Communities of Sequential Decision Problems
1.3 Our Universal Modeling Framework
1.4 Designing Policies for Sequential Decision Problems
1.4.1 Policy Search
1.4.2 Policies Based on Lookahead Approximations
1.4.3 Mixing and Matching
1.4.4 Optimality of the Four Classes
1.4.5 Pulling it All Together
1.5 Learning
1.6 Themes
1.6.1 Blending Learning and Optimization
1.6.2 Bridging Machine Learning to Sequential Decisions
1.6.3 From Deterministic to Stochastic Optimization
1.6.4 From Single to Multiple Agents
1.7 Our Modeling Approach
1.8 How to Read this Book
1.8.1 Organization of Topics
1.8.2 How to Read Each Chapter
1.8.3 Organization of Exercises
1.9 Bibliographic Notes
Exercises
Bibliography
2 Canonical Problems and Applications
2.1 Canonical Problems
2.1.1 Stochastic Search – Derivative-based and Derivative-free
2.1.1.1 Derivative-based Stochastic Search
2.1.1.2 Derivative-free Stochastic Search
2.1.2 Decision Trees
2.1.3 Markov Decision Processes
2.1.4 Optimal Control
2.1.5 Approximate Dynamic Programming
2.1.6 Reinforcement Learning
2.1.7 Optimal Stopping
2.1.8 Stochastic Programming
2.1.9 The Multiarmed Bandit Problem
2.1.10 Simulation Optimization
2.1.11 Active Learning
2.1.12 Chance-constrained Programming
2.1.13 Model Predictive Control
2.1.14 Robust Optimization
2.2 A Universal Modeling Framework for Sequential Decision Problems
2.2.1 Our Universal Model for Sequential Decision Problems
2.2.2 A Compact Modeling Presentation
2.2.3 MDP/RL vs. Optimal Control Modeling Frameworks
2.3 Applications
2.3.1 The Newsvendor Problems
2.3.1.1 Basic Newsvendor – Final Reward
2.3.1.2 Basic Newsvendor – Cumulative Reward
2.3.1.3 Contextual Newsvendor
2.3.1.4 Multidimensional Newsvendor Problems
2.3.2 Inventory/Storage Problems
2.3.2.1 Inventory Without Lags
2.3.2.2 Inventory Planning with Forecasts
2.3.2.3 Lagged Decisions
2.3.3 Shortest Path Problems
2.3.3.1 A Deterministic Shortest Path Problem
2.3.3.2 A Stochastic Shortest Path Problem
2.3.3.3 A Dynamic Shortest Path Problem
2.3.3.4 A Robust Shortest Path Problem
2.3.4 Some Fleet Management Problems
2.3.4.1 The Nomadic Trucker
2.3.4.2 From One Driver to a Fleet
2.3.5 Pricing
2.3.6 Medical Decision Making
2.3.7 Scientific Exploration
2.3.8 Machine Learning vs. Sequential Decision Problems
2.4 Bibliographic Notes
Exercises
Bibliography
3 Online Learning
3.1 Machine Learning for Sequential Decisions
3.1.1 Observations and Data in Stochastic Optimization
3.1.2 Indexing Input $x^n$ and Response $y^{n+1}$
3.1.3 Functions We are Learning
3.1.4 Sequential Learning: From Very Little Data to … More Data
3.1.5 Approximation Strategies
3.1.6 From Data Analytics to Decision Analytics
3.1.7 Batch vs. Online Learning
3.2 Adaptive Learning Using Exponential Smoothing
3.3 Lookup Tables with Frequentist Updating
3.4 Lookup Tables with Bayesian Updating
3.4.1 The Updating Equations for Independent Beliefs
3.4.2 Updating for Correlated Beliefs
3.4.3 Gaussian Process Regression
3.5 Computing Bias and Variance*
3.6 Lookup Tables and Aggregation*
3.6.1 Hierarchical Aggregation
3.6.2 Estimates of Different Levels of Aggregation
3.6.3 Combining Multiple Levels of Aggregation
3.7 Linear Parametric Models
3.7.1 Linear Regression Review
3.7.2 Sparse Additive Models and Lasso
3.8 Recursive Least Squares for Linear Models
3.8.1 Recursive Least Squares for Stationary Data
3.8.2 Recursive Least Squares for Nonstationary Data*
3.8.3 Recursive Estimation Using Multiple Observations*
3.9 Nonlinear Parametric Models
3.9.1 Maximum Likelihood Estimation
3.9.2 Sampled Belief Models
3.9.3 Neural Networks – Parametric*
3.9.4 Limitations of Neural Networks
3.10 Nonparametric Models*
3.10.1 K-Nearest Neighbor
3.10.2 Kernel Regression
3.10.3 Local Polynomial Regression
3.10.4 Deep Neural Networks
3.10.5 Support Vector Machines
3.10.6 Indexed Functions, Tree Structures, and Clustering
3.10.7 Comments on Nonparametric Models
3.11 Nonstationary Learning*
3.11.1 Nonstationary Learning I – Martingale Truth
3.11.2 Nonstationary Learning II – Transient Truth
3.11.3 Learning Processes
3.12 The Curse of Dimensionality
3.13 Designing Approximation Architectures in Adaptive Learning
3.14 Why Does It Work?**
3.14.1 Derivation of the Recursive Estimation Equations
3.14.2 The Sherman-Morrison Updating Formula
3.14.3 Correlations in Hierarchical Estimation
3.14.4 Proof of Proposition 3.14.1
3.15 Bibliographic Notes
Exercises
Bibliography
4 Introduction to Stochastic Search
4.1 Illustrations of the Basic Stochastic Optimization Problem
4.2 Deterministic Methods
4.2.1 A “Stochastic” Shortest Path Problem
4.2.2 A Newsvendor Problem with Known Distribution
4.2.3 Chance-Constrained Optimization
4.2.4 Optimal Control
4.2.5 Discrete Markov Decision Processes
4.2.6 Remarks
4.3 Sampled Models
4.3.1 Formulating a Sampled Model
4.3.1.1 A Sampled Stochastic Linear Program
4.3.1.2 Sampled Chance-Constrained Models
4.3.1.3 Sampled Parametric Models
4.3.2 Convergence
4.3.3 Creating a Sampled Model
4.3.4 Decomposition Strategies*
4.4 Adaptive Learning Algorithms
4.4.1 Modeling Adaptive Learning Problems
4.4.2 Online vs. Offline Applications
4.4.2.1 Machine Learning
4.4.2.2 Optimization
4.4.3 Objective Functions for Learning
4.4.4 Designing Policies
4.5 Closing Remarks
4.6 Bibliographic Notes
Exercises
Bibliography
Part II – Stochastic Search
5 Derivative-Based Stochastic Search
5.1 Some Sample Applications
5.2 Modeling Uncertainty
5.2.1 Training Uncertainty $W^1, \ldots, W^N$
5.2.2 Model Uncertainty $S_0$
5.2.3 Testing Uncertainty
5.2.4 Policy Evaluation
5.2.5 Closing Notes
5.3 Stochastic Gradient Methods
5.3.1 A Stochastic Gradient Algorithm
5.3.2 Introduction to Stepsizes
5.3.3 Evaluating a Stochastic Gradient Algorithm
5.3.4 A Note on Notation
5.4 Styles of Gradients
5.4.1 Gradient Smoothing
5.4.2 Second-Order Methods
5.4.3 Finite Differences
5.4.4 SPSA
5.4.5 Constrained Problems
5.5 Parameter Optimization for Neural Networks*
5.5.1 Computing the Gradient
5.5.2 The Stochastic Gradient Algorithm
5.6 Stochastic Gradient Algorithm as a Sequential Decision Problem
5.7 Empirical Issues
5.8 Transient Problems*
5.9 Theoretical Performance*
5.10 Why Does it Work?
5.10.1 Some Probabilistic Preliminaries
5.10.2 An Older Proof*
5.10.3 A More Modern Proof**
5.11 Bibliographic Notes
Exercises
Bibliography
6 Stepsize Policies
6.1 Deterministic Stepsize Policies
6.1.1 Properties for Convergence
6.1.2 A Collection of Deterministic Policies
6.1.2.1 Constant Stepsizes
6.1.2.2 Generalized Harmonic Stepsizes
6.1.2.3 Polynomial Learning Rates
6.1.2.4 McClain’s Formula
6.1.2.5 Search-then-Converge Learning Policy
6.2 Adaptive Stepsize Policies
6.2.1 The Case for Adaptive Stepsizes
6.2.2 Convergence Conditions
6.2.3 A Collection of Stochastic Policies
6.2.3.1 Kesten’s Rule
6.2.3.2 Trigg’s Formula
6.2.3.3 Stochastic Gradient Adaptive Stepsize Rule
6.2.3.4 ADAM
6.2.3.5 AdaGrad
6.2.3.6 RMSProp
6.2.4 Experimental Notes
6.3 Optimal Stepsize Policies*
6.3.1 Optimal Stepsizes for Stationary Data
6.3.2 Optimal Stepsizes for Nonstationary Data – I
6.3.3 Optimal Stepsizes for Nonstationary Data – II
6.4 Optimal Stepsizes for Approximate Value Iteration*
6.5 Convergence
6.6 Guidelines for Choosing Stepsize Policies
6.7 Why Does it Work?*
6.7.1 Proof of BAKF Stepsize
6.8 Bibliographic Notes
Exercises
Bibliography
7 Derivative-Free Stochastic Search
7.1 Overview of Derivative-free Stochastic Search
7.1.1 Applications and Time Scales
7.1.2 The Communities of Derivative-free Stochastic Search
7.1.3 The Multiarmed Bandit Story
7.1.4 From Passive Learning to Active Learning to Bandit Problems
7.2 Modeling Derivative-free Stochastic Search
7.2.1 The Universal Model
7.2.2 Illustration: Optimizing a Manufacturing Process
7.2.3 Major Problem Classes
7.3 Designing Policies
7.4 Policy Function Approximations
7.5 Cost Function Approximations
7.6 VFA-based Policies
7.6.1 An Optimal Policy
7.6.2 Beta-Bernoulli Belief Model
7.6.3 Backward Approximate Dynamic Programming
7.6.4 Gittins Indices for Learning in Steady State
7.7 Direct Lookahead Policies
7.7.1 When do we Need Lookahead Policies?
7.7.2 Single Period Lookahead Policies
7.7.3 Restricted Multiperiod Lookahead
7.7.4 Multiperiod Deterministic Lookahead
7.7.5 Multiperiod Stochastic Lookahead Policies
7.7.6 Hybrid Direct Lookahead
7.8 The Knowledge Gradient (Continued)*
7.8.1 The Belief Model
7.8.2 The Knowledge Gradient for Maximizing Final Reward
7.8.3 The Knowledge Gradient for Maximizing Cumulative Reward
7.8.4 The Knowledge Gradient for Sampled Belief Model*
7.8.5 Knowledge Gradient for Correlated Beliefs
7.9 Learning in Batches
7.10 Simulation Optimization*
7.10.1 An Indifference Zone Algorithm
7.10.2 Optimal Computing Budget Allocation
7.11 Evaluating Policies
7.11.1 Alternative Performance Metrics*
7.11.2 Perspectives of Optimality*
7.12 Designing Policies
7.12.1 Characteristics of a Policy
7.12.2 The Effect of Scaling
7.12.3 Tuning
7.13 Extensions*
7.13.1 Learning in Nonstationary Settings
7.13.2 Strategies for Designing Time-dependent Policies
7.13.3 A Transient Learning Model
7.13.4 The Knowledge Gradient for Transient Problems
7.13.5 Learning with Large or Continuous Choice Sets
7.13.6 Learning with Exogenous State Information – the Contextual Bandit Problem
7.13.7 State-dependent vs. State-independent Problems
7.14 Bibliographic Notes
Exercises
Bibliography
Part III – State-dependent Problems
8 State-dependent Problems
8.1 Graph Problems
8.1.1 A Stochastic Shortest Path Problem
8.1.2 The Nomadic Trucker
8.1.3 The Transformer Replacement Problem
8.1.4 Asset Valuation
8.2 Inventory Problems
8.2.1 A Basic Inventory Problem
8.2.2 The Inventory Problem – II
8.2.3 The Lagged Asset Acquisition Problem
8.2.4 The Batch Replenishment Problem
8.3 Complex Resource Allocation Problems
8.3.1 The Dynamic Assignment Problem
8.3.2 The Blood Management Problem
8.4 State-dependent Learning Problems
8.4.1 Medical Decision Making
8.4.2 Laboratory Experimentation
8.4.3 Bidding for Ad-clicks
8.4.4 An Information-collecting Shortest Path Problem
8.5 A Sequence of Problem Classes
8.6 Bibliographic Notes
Exercises
Bibliography
9 Modeling Sequential Decision Problems
9.1 A Simple Modeling Illustration
9.2 Notational Style
9.3 Modeling Time
9.4 The States of Our System
9.4.1 Defining the State Variable
9.4.2 The Three States of Our System
9.4.3 Initial State $S_0$ vs. Subsequent States $S_t$, $t > 0$
9.4.4 Lagged State Variables*
9.4.5 The Post-decision State Variable*
9.4.6 A Shortest Path Illustration
9.4.7 Belief States*
9.4.8 Latent Variables*
9.4.9 Rolling Forecasts*
9.4.10 Flat vs. Factored State Representations*
9.4.11 A Programmer’s Perspective of State Variables
9.5 Modeling Decisions
9.5.1 Types of Decisions
9.5.2 Initial Decision $x_0$ vs. Subsequent Decisions $x_t$, $t > 0$
9.5.3 Strategic, Tactical, and Execution Decisions
9.5.4 Constraints
9.5.5 Introducing Policies
9.6 The Exogenous Information Process
9.6.1 Basic Notation for Information Processes
9.6.2 Outcomes and Scenarios
9.6.3 Lagged Information Processes*
9.6.4 Models of Information Processes*
9.6.5 Supervisory Processes*
9.7 The Transition Function
9.7.1 A General Model
9.7.2 Model-free Dynamic Programming
9.7.3 Exogenous Transitions
9.8 The Objective Function
9.8.1 The Performance Metric
9.8.2 Optimizing the Policy
9.8.3 Dependence of Optimal Policy on $S_0$
9.8.4 State-dependent Variations
9.8.5 Uncertainty Operators
9.9 Illustration: An Energy Storage Model
9.9.1 With a Time-series Price Model
9.9.2 With Passive Learning
9.9.3 With Active Learning
9.9.4 With Rolling Forecasts
9.10 Base Models and Lookahead Models
9.11 A Classification of Problems*
9.12 Policy Evaluation*
9.13 Advanced Probabilistic Modeling Concepts**
9.13.1 A Measure-theoretic View of Information**
9.13.2 Policies and Measurability
9.14 Looking Forward
9.15 Bibliographic Notes
Exercises
Bibliography
10 Uncertainty Modeling
10.1 Sources of Uncertainty
10.1.1 Observational Errors
10.1.2 Exogenous Uncertainty
10.1.3 Prognostic Uncertainty
10.1.4 Inferential (or Diagnostic) Uncertainty
10.1.5 Experimental Variability
10.1.6 Model Uncertainty
10.1.7 Transitional Uncertainty
10.1.8 Control/implementation Uncertainty
10.1.9 Communication Errors and Biases
10.1.10 Algorithmic Instability
10.1.11 Goal Uncertainty
10.1.12 Political/regulatory Uncertainty
10.1.13 Discussion
10.2 A Modeling Case Study: The COVID Pandemic
10.3 Stochastic Modeling
10.3.1 Sampling Exogenous Information
10.3.2 Types of Distributions
10.3.3 Modeling Sample Paths
10.3.4 State-action-dependent Processes
10.3.5 Modeling Correlations
10.4 Monte Carlo Simulation
10.4.1 Generating Uniform [0, 1] Random Variables
10.4.2 Uniform and Normal Random Variables
10.4.3 Generating Random Variables from Inverse Cumulative Distributions
10.4.4 Inverse Cumulative From Quantile Distributions
10.4.5 Distributions with Uncertain Parameters
10.5 Case Study: Modeling Electricity Prices
10.5.1 Mean Reversion
10.5.2 Jump-diffusion Models
10.5.3 Quantile Distributions
10.5.4 Regime Shifting
10.5.5 Crossing Times
10.6 Sampling vs. Sampled Models
10.6.1 Iterative Sampling: A Stochastic Gradient Algorithm
10.6.2 Static Sampling: Solving a Sampled Model
10.6.3 Sampled Representation with Bayesian Updating
10.7 Closing Notes
10.8 Bibliographic Notes
Exercises
Bibliography
11 Designing Policies
11.1 From Optimization to Machine Learning to Sequential Decision Problems
11.2 The Classes of Policies
11.3 Policy Function Approximations
11.4 Cost Function Approximations
11.5 Value Function Approximations
11.6 Direct Lookahead Approximations
11.6.1 The Basic Idea
11.6.2 Modeling the Lookahead Problem
11.6.3 The Policy-Within-a-Policy
11.7 Hybrid Strategies
11.7.1 Cost Function Approximation with Policy Function Approximations
11.7.2 Lookahead Policies with Value Function Approximations
11.7.3 Lookahead Policies with Cost Function Approximations
11.7.4 Tree Search with Rollout Heuristic and a Lookup Table Policy
11.7.5 Value Function Approximation with Policy Function Approximation
11.7.6 Fitting Value Functions Using ADP and Policy Search
11.8 Randomized Policies
11.9 Illustration: An Energy Storage Model Revisited
11.9.1 Policy Function Approximation
11.9.2 Cost Function Approximation
11.9.3 Value Function Approximation
11.9.4 Deterministic Lookahead
11.9.5 Hybrid Lookahead-Cost Function Approximation
11.9.6 Experimental Testing
11.10 Choosing the Policy Class
11.10.1 The Policy Classes
11.10.2 Policy Complexity-Computational Tradeoffs
11.10.3 Screening Questions
11.11 Policy Evaluation
11.12 Parameter Tuning
11.12.1 The Soft Issues
11.12.2 Searching Across Policy Classes
11.13 Bibliographic Notes
Exercises
Bibliography
Part IV – Policy Search
12 Policy Function Approximations and Policy Search
12.1 Policy Search as a Sequential Decision Problem
12.2 Classes of Policy Function Approximations
12.2.1 Lookup Table Policies
12.2.2 Boltzmann Policies for Discrete Actions
12.2.3 Linear Decision Rules
12.2.4 Monotone Policies
12.2.5 Nonlinear Policies
12.2.6 Nonparametric/Locally Linear Policies
12.2.7 Contextual Policies
12.3 Problem Characteristics
12.4 Flavors of Policy Search
12.5 Policy Search with Numerical Derivatives
12.6 Derivative-Free Methods for Policy Search
12.6.1 Belief Models
12.6.2 Learning Through Perturbed PFAs
12.6.3 Learning CFAs
12.6.4 DLA Using the Knowledge Gradient
12.6.5 Comments
12.7 Exact Derivatives for Continuous Sequential Problems*
12.8 Exact Derivatives for Discrete Dynamic Programs**
12.8.1 A Stochastic Policy
12.8.2 The Objective Function
12.8.3 The Policy Gradient Theorem
12.8.4 Computing the Policy Gradient
12.9 Supervised Learning
12.10 Why Does it Work?
12.10.1 Derivation of the Policy Gradient Theorem
12.11 Bibliographic Notes
Exercises
Bibliography
13 Cost Function Approximations
13.1 General Formulation for Parametric CFA
13.2 Objective-Modified CFAs
13.2.1 Linear Cost Function Correction
13.2.2 CFAs for Dynamic Assignment Problems
13.2.3 Dynamic Shortest Paths
13.2.4 Dynamic Trading Policy
13.2.5 Discussion
13.3 Constraint-Modified CFAs
13.3.1 General Formulation of Constraint-Modified CFAs
13.3.2 A Blood Management Problem
13.3.3 An Energy Storage Example with Rolling Forecasts
13.4 Bibliographic Notes
Exercises
Bibliography
Part V – Lookahead Policies
14 Exact Dynamic Programming
14.1 Discrete Dynamic Programming
14.2 The Optimality Equations
14.2.1 Bellman’s Equations
14.2.2 Computing the Transition Matrix
14.2.3 Random Contributions
14.2.4 Bellman’s Equation Using Operator Notation*
14.3 Finite Horizon Problems
14.4 Continuous Problems with Exact Solutions
14.4.1 The Gambling Problem
14.4.2 The Continuous Budgeting Problem
14.5 Infinite Horizon Problems*
14.6 Value Iteration for Infinite Horizon Problems*
14.6.1 A Gauss-Seidel Variation
14.6.2 Relative Value Iteration
14.6.3 Bounds and Rates of Convergence
14.7 Policy Iteration for Infinite Horizon Problems*
14.8 Hybrid Value-Policy Iteration*
14.9 Average Reward Dynamic Programming*
14.10 The Linear Programming Method for Dynamic Programs**
14.11 Linear Quadratic Regulation
14.12 Why Does it Work?**
14.12.1 The Optimality Equations
14.12.2 Convergence of Value Iteration
14.12.3 Monotonicity of Value Iteration
14.12.4 Bounding the Error from Value Iteration
14.12.5 Randomized Policies
14.13 Bibliographic Notes
Exercises
Bibliography
15 Backward Approximate Dynamic Programming
15.1 Backward Approximate Dynamic Programming for Finite Horizon Problems
15.1.1 Some Preliminaries
15.1.2 Backward ADP Using Lookup Tables
15.1.3 Backward ADP Algorithm with Continuous Approximations
15.2 Fitted Value Iteration for Infinite Horizon Problems
15.3 Value Function Approximation Strategies
15.3.1 Linear Models
15.3.2 Monotone Functions
15.3.3 Other Approximation Models
15.4 Computational Observations
15.4.1 Experimental Benchmarking of Backward ADP
15.4.2 Computational Notes
15.5 Bibliographic Notes
Exercises
Bibliography
16 Forward ADP I: The Value of a Policy
16.1 Sampling the Value of a Policy
16.1.1 Direct Policy Evaluation for Finite Horizon Problems
16.1.2 Policy Evaluation for Infinite Horizon Problems
16.1.3 Temporal Difference Updates
16.1.4 TD(γ)
16.1.5 TD(0) and Approximate Value Iteration
16.1.6 TD Learning for Infinite Horizon Problems
16.2 Stochastic Approximation Methods
16.3 Bellman’s Equation Using a Linear Model*
16.3.1 A Matrix-based Derivation**
16.3.2 A Simulation-based Implementation
16.3.3 Least Squares Temporal Difference Learning (LSTD)
16.3.4 Least Squares Policy Evaluation
16.4 Analysis of TD(0), LSTD, and LSPE Using a Single State*
16.4.1 Recursive Least Squares and TD(0)
16.4.2 Least Squares Policy Evaluation
16.4.3 Least Squares Temporal Difference Learning
16.4.4 Discussion
16.5 Gradient-based Methods for Approximate Value Iteration*
16.5.1 Approximate Value Iteration with Linear Models**
16.5.2 A Geometric View of Linear Models*
16.6 Value Function Approximations Based on Bayesian Learning*
16.6.1 Minimizing Bias for Infinite Horizon Problems
16.6.2 Lookup Tables with Correlated Beliefs
16.6.3 Parametric Models
16.6.4 Creating the Prior
16.7 Learning Algorithms and Stepsizes
16.7.1 Least Squares Temporal Differences
16.7.2 Least Squares Policy Evaluation
16.7.3 Recursive Least Squares
16.7.4 Bounding $1/n$ Convergence for Approximate Value Iteration
16.7.5 Discussion
16.8 Bibliographic Notes
Exercises
Bibliography
17 Forward ADP II: Policy Optimization
17.1 Overview of Algorithmic Strategies
17.2 Approximate Value Iteration and Q-Learning Using Lookup Tables
17.2.1 Value Iteration Using a Pre-Decision State Variable
17.2.2 Q-Learning
17.2.3 Value Iteration Using a Post-Decision State Variable
17.2.4 Value Iteration Using a Backward Pass
17.3 Styles of Learning
17.3.1 Offline Learning
17.3.2 From Offline to Online
17.3.3 Evaluating Offline and Online Learning Policies
17.3.4 Lookahead Policies
17.4 Approximate Value Iteration Using Linear Models
17.5 On-policy vs. Off-policy Learning and the Exploration–Exploitation Problem
17.5.1 Terminology
17.5.2 Learning with Lookup Tables
17.5.3 Learning with Generalized Belief Models
17.6 Applications
17.6.1 Pricing an American Option
17.6.2 Playing “Lose Tic-Tac-Toe”
17.6.3 Approximate Dynamic Programming for Deterministic Problems
17.7 Approximate Policy Iteration
17.7.1 Finite Horizon Problems Using Lookup Tables
17.7.2 Finite Horizon Problems Using Linear Models
17.7.3 LSTD for Infinite Horizon Problems Using Linear Models
17.8 The Actor–Critic Paradigm
17.9 Statistical Bias in the Max Operator*
17.10 The Linear Programming Method Using Linear Models*
17.11 Finite Horizon Approximations for Steady-State Applications
17.12 Bibliographic Notes
Exercises
Bibliography
18 Forward ADP III: Convex Resource Allocation Problems
18.1 Resource Allocation Problems
18.1.1 The Newsvendor Problem
18.1.2 Two-Stage Resource Allocation Problems
18.1.3 A General Multiperiod Resource Allocation Model*
18.2 Values Versus Marginal Values
18.3 Piecewise Linear Approximations for Scalar Functions
18.3.1 The Leveling Algorithm
18.3.2 The CAVE Algorithm
18.4 Regression Methods
18.5 Separable Piecewise Linear Approximations
18.6 Benders Decomposition for Nonseparable Approximations**
18.6.1 Benders’ Decomposition for Two-Stage Problems
18.6.2 Asymptotic Analysis of Benders with Regularization**
18.6.3 Benders with Regularization
18.7 Linear Approximations for High-Dimensional Applications
18.8 Resource Allocation with Exogenous Information State
18.9 Closing Notes
18.10 Bibliographic Notes
Exercises
Bibliography
19 Direct Lookahead Policies
19.1 Optimal Policies Using Lookahead Models
19.2 Creating an Approximate Lookahead Model
19.2.1 Modeling the Lookahead Model
19.2.2 Strategies for Approximating the Lookahead Model
19.3 Modified Objectives in Lookahead Models
19.3.1 Managing Risk
19.3.2 Utility Functions for Multiobjective Problems
19.3.3 Model Discounting
19.4 Evaluating DLA Policies
19.4.1 Evaluating Policies in a Simulator
19.4.2 Evaluating Risk-Adjusted Policies
19.4.3 Evaluating Policies in the Field
19.4.4 Tuning Direct Lookahead Policies
19.5 Why Use a DLA?
19.6 Deterministic Lookaheads
19.6.1 A Deterministic Lookahead: Shortest Path Problems
19.6.2 Parameterized Lookaheads
19.7 A Tour of Stochastic Lookahead Policies
19.7.1 Lookahead PFAs
19.7.2 Lookahead CFAs
19.7.3 Lookahead VFAs for the Lookahead Model
19.7.4 Lookahead DLAs for the Lookahead Model
19.7.5 Discussion
19.8 Monte Carlo Tree Search for Discrete Decisions
19.8.1 Basic Idea
19.8.2 The Steps of MCTS
19.8.3 Discussion
19.8.4 Optimistic Monte Carlo Tree Search
19.9 Two-Stage Stochastic Programming for Vector Decisions*
19.9.1 The Basic Two-Stage Stochastic Program
19.9.2 Two-Stage Approximation of a Sequential Problem
19.9.3 Discussion
19.10 Observations on DLA Policies
19.11 Bibliographic Notes
Exercises
Bibliography
Part VI – Multiagent Systems
20 Multiagent Modeling and Learning
20.1 Overview of Multiagent Systems
20.1.1 Dimensions of a Multiagent System
20.1.2 Communication
20.1.3 Modeling a Multiagent System
20.1.4 Controlling Architectures
20.2 A Learning Problem – Flu Mitigation
20.2.1 Model 1: A Static Model
20.2.2 Variations of Our Flu Model
20.2.3 Two-Agent Learning Models
20.2.4 Transition Functions for Two-Agent Model
20.2.5 Designing Policies for the Flu Problem
20.3 The POMDP Perspective*
20.4 The Two-Agent Newsvendor Problem
20.5 Multiple Independent Agents – An HVAC Controller Model
20.5.1 Model
20.5.2 Designing Policies
20.6 Cooperative Agents – A Spatially Distributed Blood Management Problem
20.7 Closing Notes
20.8 Why Does it Work?
20.8.1 Derivation of the POMDP Belief Transition Function
20.9 Bibliographic Notes
Exercises
Bibliography
Index
End User License Agreement
Chapter 1
Figure 1.1 Illustration of shipments...
Figure 1.2 A sampling of...
Figure 1.3 Deterministic vs. stochastic...
Chapter 2
Figure 2.1 Decision tree illustrating...
Figure 2.2 Canonical model for...
Figure 2.3 Finding a path...
Figure 2.4 A basic shortest...
Figure 2.5 Illustration of a...
Chapter 3
Figure 3.1 Illustration of the...
Figure 3.2 Illustration of a...
Figure 3.3 Illustration of a...
Figure 3.4 Average weight (across...
Figure 3.5 Neural networks with...
Figure 3.6 A three-layer...
Figure 3.7 Illustrative logistics function...
Figure 3.8 A neural network...
Figure 3.9 Illustration of Gaussian...
Figure 3.10 Illustration of a...
Figure 3.11 Illustration of penalty...
Figure 3.12 Fitting noisy data...
Figure 3.13 Illustration of the...
Chapter 4
Figure 4.1 Concave function, showing...
Chapter 5
Figure 5.1 Illustration of gradient...
Figure 5.2 Different estimates of...
Figure 5.3 Convergence of SPSA...
Figure 5.4 A three-layer...
Figure 5.5 (a) Illustration of...
Figure 5.6 A stochastic gradient...
Figure 5.7 A four-layer...
Chapter 6
Figure 6.1 Illustration of poor...
Figure 6.2 Illustration of the...
Figure 6.3 Stepsizes for a...
Figure 6.4 The McClain stepsize...
Figure 6.5 The search-then...
Figure 6.6 Different parameters can...
Figure 6.7 The bias-adjusted...
Figure 6.8 The BAKF stepsize...
Figure 6.9 The objective function...
Figure 6.10 Performance of stochastic...
Chapter 7
Figure 7.1 A set of...
Figure 7.2 (a) Optimizing over...
Figure 7.3 A family of...
Figure 7.4 Value of making...
Figure 7.5 The KG(*) policy...
Figure 7.6 The decision tree...
Figure 7.7 (a) Expected value...
Figure 7.8 The knowledge gradient...
Figure 7.9 Sampled set of...
Figure 7.10 Regions of z...
Figure 7.11 (a) Sampling pattern...
Figure 7.12 Comparison of correlated...
Figure 7.13 The knowledge gradient...
Figure 7.14 Policy search algorithm...
Figure 7.15 Optimal computing budget...
Figure 7.16 Use to plot...
Chapter 8
Figure 8.1 Illustration of a...
Figure 8.2 Illustration of the...
Figure 8.3 (a) The assignment...
Chapter 9
Figure 9.1 The path from...
Figure 9.2 Relationship between discrete...
Figure 9.3 Illustration of the...
Figure 9.4 Pre-decision state...
Figure 9.5 A deterministic network...
Figure 9.6 A stochastic network...
Figure 9.7 Deterministic shortest path...
Figure 9.8 A path problem...
Figure 9.9 A path problem...
Chapter 10
Figure 10.1 Generating uniformly and...
Figure 10.2 Electricity spot prices...
Figure 10.3 Illustration of a...
Figure 10.4 Actual vs. predicted...
Chapter 11
Figure 11.1 Decision tree showing...
Figure 11.2 (a) Decision to...
Figure 11.3 Illustration of the...
Chapter 12
Figure 12.1 Illustration of a...
Figure 12.2 Actively learning a...
Figure 12.3 Plot of the...
Figure 12.4 Sampled belief model...
Figure 12.5 The knowledge gradient...
Chapter 13
Figure 13.1 Illustration of a...
Figure 13.2 Illustration of rolling...
Figure 13.3 Allowable substitutions of...
Figure 13.4 Multiperiod model of...
Figure 13.5 Energy storage system...
Figure 13.6 (a) Energy load...
Figure 13.7 Objective vs...
Figure 13.8 2-d heatmaps...
Figure 13.9 Improvement in profits...
Chapter 14
Figure 14.1 Decision tree illustrating...
Figure 14.2 Collapsed version of...
Figure 14.3 A backward dynamic...
Figure 14.4 The value iteration...
Figure 14.5 The Gauss-Seidel...
Figure 14.6 Relative value iteration...
Figure 14.7 Policy iteration...
Figure 14.9 Cumulative contribution over...
Chapter 15
Figure 15.1 Illustration of transitions...
Figure 15.2 Calculation of the...
Figure 15.3 A backward dynamic...
Figure 15.4 Calculation of value...
Figure 15.5 A backward dynamic...
Figure 15.6 Illustration of the...
Figure 15.7 Video snapshot of...
Figure 15.8 Real-time energy...
Figure 15.9 Performance of the...
Chapter 16
Figure 16.1 Basic policy evaluation...
Figure 16.2 Five-state Markov...
Figure 16.3 A TD...
Figure 16.4 v¯n...
Chapter 17
Figure 17.1 Approximate dynamic programming...
Figure 17.2 A Q-learning...
Figure 17.3 Approximate value iteration...
Figure 17.4 Double-pass version...
Figure 17.5 Approximate value iteration...
Figure 17.6 Two-state dynamic...
Figure 17.7 The effect of...
Figure 17.8 Some tic-tac...
Figure 17.9 A policy iteration...
Figure 17.10 A policy iteration...
Figure 17.11 Approximate policy iteration...
Figure 17.12 Approximate policy iteration...
Figure 17.13 $\mathbb{E}\max_x U_x - \max_x \mathbb{E} U_x$...
Figure 17.14 Queueing network with...
Figure 17.15 Exact value function...
Chapter 18
Figure 18.1 (a) The shape...
Figure 18.2 A two-stage...
Figure 18.3 Steps of the...
Figure 18.4 Steps in estimating...
Figure 18.5 Illustration of Benders...
Figure 18.6 Illustration of Benders...
Figure 18.7 Comparison of Benders...
Figure 18.8 Illustration of cuts...
Chapter 19
Figure 19.1 Relationship between the...
Figure 19.2 Illustration of (a...
Figure 19.3 Simulation of a...
Figure 19.4 The performance elbow...
Figure 19.5 Growth in CPU...
Figure 19.6 Illustration of the...
Figure 19.7 Illustration of Monte...
Figure 19.8 Sampled MCTS algorithm...
Figure 19.9 The tree policy...
Figure 19.10 This function simulates...
Figure 19.11 Backup process which...
Figure 19.12 Sample of a...
Figure 19.13 Performance of (a...
Chapter 20
Figure 20.1 The canonical model...
Figure 20.2 The canonical model...
Figure 20.3 Information flow diagram...
Figure 20.4 (a) Blood network...
Figure 20.5 Blood management for...
Chapter 1
Table 1.1. A list of application...
Table 1.2. Fields that deal with...
Table 1.3. Comparison of classical problems...
Chapter 3
Table 3.1. Examples of aggregations of...
Chapter 5
Table 5.1. Illustration of six sample...
Chapter 7
Table 7.1. A sample of the...
Table 7.2. Gittins indices...
Table 7.3. History of hitting performance...
Table 7.4. The calculations behind the...
Table 7.5. Correlation matrix describing the...
Table 7.6. Optimal tuned parameters for...
Table 7.7. Priors for exercise 7.19...
Table 7.8. Three observations, for three...
Table 7.12. Priors
Table 7.13. Prior beliefs for learning exercise...
Chapter 8
Table 8.1. Allowable blood substitutions for...
Table 8.2. Contributions for different types...
Chapter 9
Table 9.1. Sample of words in...
Table 9.2. A set of sample...
Table 9.3. Comparison of formulations for...
Table 9.4. Set of demand outcomes...
Chapter 10
Table 10.1. Illustration of different types...
Table 10.2. Illustration of a set...
Chapter 11
Table 11.1. Words from the English...
Table 11.2. where the bold entries (in the diagonal) indicates the...
Table 11.3. Illustration of complicating state...
Table 11.4. Suggested starting strategies for...
Chapter 15
Table 15.1. Comparison of revenues generated...
Chapter 16
Table 16.1. Effect of stepsize on...
Chapter 17
Table 17.1. Illustration of a single...
Table 17.2. Illustration of a double...
Table 17.3. Ten sample realizations of...
Table 17.4. The payout at time...
Table 17.5. The data table for...
Table 17.6. The payout if we...
Chapter 18
Table 18.1. Sample list of resource...
Chapter 20
Table 20.1. Environmental state variables and...
Table 20.2. Table of notation for...
Preface to Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions
This book represents a lifetime of research into what I now call sequential decision problems, dating back to 1982 when I was introduced to a problem arising in truckload trucking (think of Uber/Lyft for trucks), where we have to choose which driver to assign to a load, and which loads to accept to move, given the high level of randomness in future customer demands, which represent requests to move full truckloads of freight.
It took me 20 years to figure out a practical algorithm to solve this problem, which led to my first book (in 2007) on approximate dynamic programming, where the major breakthrough was the introduction of the post-decision state and the use of hierarchical aggregation for approximating value functions to solve these high-dimensional problems. However, I would argue today that the most important chapter in the book (and I recognized this at the time) was chapter 5, on how to model these problems without any reference to algorithms for solving them. I identified five elements of a sequential decision problem, leading up to the objective function, which was written

$$\max_{\pi} \; \mathbb{E}\left\{ \sum_{t=0}^{T} C\big(S_t, X^{\pi}_t(S_t)\big) \,\middle|\, S_0 \right\}.$$
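To make the five elements concrete, here is a minimal sketch (mine, not code from the book, with hypothetical function names) of how they fit together when simulating a single sample path: the policy maps the state to a decision, a contribution is earned, exogenous information arrives, and the transition function produces the next state.

```python
def simulate_policy(policy, S0, exog_sampler, transition, contribution, T):
    """Simulate one sample path of a generic sequential decision problem.

    policy(S)            -> decision x_t (the policy X^pi)
    S0                   -> initial state S_0
    exog_sampler()       -> exogenous information W_{t+1}
    transition(S, x, W)  -> next state S_{t+1} (the transition function)
    contribution(S, x)   -> contribution C(S_t, x_t)
    """
    S, total = S0, 0.0
    for t in range(T):
        x = policy(S)                # decision made by the policy
        total += contribution(S, x)  # accumulate the objective
        W = exog_sampler()           # exogenous information arrives
        S = transition(S, x, W)      # move to the next state
    return total
```

Averaging this total over many sample paths gives a (noisy) estimate of the objective for a fixed policy; the optimization problem is then a search over policies.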
It was not until the second edition (in 2011) that I realized that approximate dynamic programming (specifically, policies that depend on value functions) was not the only way to solve these problems; rather, there were four classes of policies, and only one used value functions. The 2011 edition of the book listed three of the four classes of policies that are described in this book, but most of the book still focused on approximating value functions. It was not until a 2014 paper (“Clearing the Jungle of Stochastic Optimization”) that I identified the four classes of policies I use now. Then, in 2016 I realized that the four classes of policies could be divided between two major strategies: the policy search strategy, where we search over a family of functions to find the one that works best, and the lookahead strategy, where we make good decisions by approximating the downstream impact of a decision made now.
Finally, I combined these ideas in a 2019 paper (“A Unified Framework for Stochastic Optimization,” published in the European Journal of Operational Research) with a better appreciation of major classes of problems such as state-independent problems (the pure learning problems that include derivative-based and derivative-free stochastic search) and the more general state-dependent problems; cumulative and final reward objective functions; and the realization that any adaptive search algorithm is itself a sequential decision problem. The material in the 2019 paper is effectively the outline for this book.
This book builds on the 2011 edition of my approximate dynamic programming book, and includes a number of chapters (some heavily edited) from the ADP book. It would be nice to call this a third edition, but the entire framework of this book is completely different. “Approximate dynamic programming” is a term that still refers to making decisions based on the idea of approximating the downstream value of being in a state. After decades of working with this approach (which is still covered over a span of five chapters in this volume), I can now say with confidence that value function approximations, despite all the attention they have received, are a powerful methodology for a surprisingly narrow set of decision problems.
By contrast, I finally developed the confidence to claim that the four classes of policies are universal. This means that any method for making decisions will fall into one of these four classes, or a hybrid of two or more. This is a game changer, because it shifts the focus from the algorithm (the method for making decisions) to the model (specifically, the optimization problem above, along with the state-transition function and the model of the exogenous information process). This means we write out the elements of a problem before we tackle the problem of designing policies to make decisions. I call this:
Model first, then solve.
The communities working on sequential decision problems are very focused on methods, just as I was with my earlier work on approximate dynamic programming. The problem is that any particular method will be inherently limited to a narrow class of problems. In this book, I demonstrate how you can take a simple inventory problem and then tweak the data so that each of the four classes of policies, in turn, works best; a toy version of the policy-search style of solution is sketched below.
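The following is not the book's inventory example, but a minimal sketch of the policy-search idea under assumed data: a one-parameter order-up-to rule (a policy function approximation) whose parameter theta is tuned by simulating the cumulative reward. The demand distribution, prices, and function names are hypothetical.

```python
import random

def inventory_profit(theta, T=200, price=5.0, cost=3.0, seed=0):
    """Cumulative profit of a toy inventory problem under the order-up-to
    policy x_t = max(0, theta - R_t), a simple policy function approximation."""
    rng = random.Random(seed)       # fixed seed: common random numbers across candidate theta
    R, profit = 0, 0.0              # R = inventory on hand
    for _ in range(T):
        x = max(0, theta - R)               # decision from the PFA
        demand = rng.randint(0, 10)         # exogenous information
        sales = min(R + x, demand)
        profit += price * sales - cost * x  # contribution
        R = R + x - sales                   # transition function
    return profit

# Policy search: simulate each candidate theta and keep the best.
best_theta = max(range(21), key=inventory_profit)
```

Tuning a different policy class (say, a deterministic lookahead with a tunable horizon) would follow the same pattern: simulate the policy, then search over its parameters.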
This approach has opened up an entirely new way of thinking about a problem class that, in the last year of writing the book, I started calling “sequential decision analytics,” which covers any problem consisting of the sequence:
Decision, information, decision, information, …
I allow decisions to range from binary (selling an asset) to discrete choices (favored in computer science) to the high-dimensional resource allocation problems popular in operations research. This approach starts with a problem, shifts to the challenging task of modeling uncertainty, and then finishes with designing policies to make decisions to optimize some metric. The approach is practical, scalable, and universally applicable.
It is exciting to be able to create a single framework that spans 15 different communities, and which represents every possible method for solving sequential decision problems. While having a common language to model any sequential decision problem, combined with the general approach of the four classes of policies, is clearly of value, this framework has been developed by standing on the shoulders of the giants who have laid the foundational work for all of these methods. I have had to make choices regarding the best notation and modeling conventions, but my framework is completely inclusive of all the methods that have been developed to solve these problems. Rather than joining the chorus of researchers promoting specific algorithmic strategies (as I once did), my goal is to raise the visibility of all methods, so that someone looking to solve a real problem is working with the biggest possible toolbox, rather than just the tools developed within a specific community.
A word needs to be said about the title of the book. As this is being written, there is a massive surge of interest in “reinforcement learning,” which started as a form of approximate dynamic programming (I used to refer to ADP and RL as similar to American English and British English). However, as the RL community has grown and started working on harder problems, they encountered the same experience that I and everyone else working in ADP found: value function approximations are not a panacea. Not only is it the case that they often do not work; they usually do not work. As a result, the RL community branched out (just as I did) into other methods such as “policy gradient methods” (my “policy function approximations” or PFA), upper confidence bounding (a form of “cost function approximation” or CFA), the original Q-learning (which produces a policy based on “value function approximations” or VFA), and finally Monte Carlo tree search (a policy based on “direct lookahead approximations” or DLA). All of these methods are found in the second edition of Sutton and Barto’s landmark book Reinforcement Learning: An Introduction, but only as specific methods rather than general classes. This book takes the next step and identifies the general classes.
This evolution from one core method to all four classes of policies is being repeated among other fields that I came to call the “jungle of stochastic optimization.” Stochastic search, simulation-optimization, and bandit problems all feature methods from each of the four classes of policies. Over time, I came to realize that all these fields (including reinforcement learning) were playing catchup to the grandfather of all of this work, which is optimal control (and stochastic control). The field of optimal control was the first to introduce and seriously explore the use of value function approximations (they call these cost-to-go functions), linear decision rules (a form of PFA), and the workhorse “model predictive control” (a great name for a simple rolling horizon procedure, which is a “direct lookahead approximation” in this book). I also found that my modeling framework was closest to that used in the optimal control literature, which was the first field to introduce the concept of a transition function, a powerful modeling device that has been largely overlooked by the other communities. I make a few small tweaks such as using state $S_t$ instead of $x_t$, and decision $x_t$ (widely used in the field of math programming) instead of $u_t$.
Then I introduce one big change, which is to maximize over all four classes of policies. Perhaps the most important innovation of this book is to break the almost automatic link between optimizing over policies and the assumption that we are going to compute an optimal policy from either Bellman’s equation or the Hamilton-Jacobi equations. These are rarely computable for real problems, which leads people to assume that the natural next step is to approximate these equations. This assumption is simply false, as demonstrated by decades of research in which people have developed methods that do not depend on HJB equations. I recognize this body of research on different classes of policies by making the inclusion of all four classes fundamental to the original statement of the optimization problem above.
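In the notation used above (the typesetting here is mine, paraphrasing the book's framing), this means writing the search over policies explicitly as a search over the union of the four classes:

$$\max_{\pi \in \Pi} \; \mathbb{E}\left\{ \sum_{t=0}^{T} C\big(S_t, X^{\pi}_t(S_t)\big) \,\middle|\, S_0 \right\}, \qquad \Pi = \Pi^{\mathrm{PFA}} \cup \Pi^{\mathrm{CFA}} \cup \Pi^{\mathrm{VFA}} \cup \Pi^{\mathrm{DLA}}.$$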
It will take some time for people from the different communities to learn to speak this common language. More likely, there will be an adaptation of existing modeling languages to this framework. For example, the optimal control community could keep their notation, but learn to write their objective functions as I have above, recognizing that the search over policies needs to span all four classes (which, I might point out, they are already using). I would hope that the reinforcement learning community, which adopted the notation for discrete action $a$, might learn to use the more general $x$ (as the bandit community has already done).
I have tried to write this book to appeal to newcomers to the field, as well as people who already have training in one or more of the subfields that deal with decisions and uncertainty; recognizing these two broad communities was easily the biggest challenge while writing this book. Not surprisingly, the book is quite long. I have tried to make it more accessible to people who are new to the field by marking many sections with an * as an indication that this section can be skipped on a first-read. I also hope that the book will appeal to people from many application domains. However, the core audience is people who are looking to solve real problems by modeling applications and implementing the work in software. The notation is designed to facilitate writing computer programs, where there should be a direct relationship between the mathematical model and the software. This is particularly important when modeling the flow of information, something that is often overlooked in mainstream reinforcement learning papers.
Warren B. Powell
Princeton, New Jersey
August 2021
Acknowledgments

The foundation of this book is a modeling framework for sequential decision problems that involves searching over four classes of policies for making decisions. The recognition that we needed all four classes of policies came from working on a wide range of problems spanning freight transportation (almost all modes), energy, health, e-commerce, finance, and even materials science (!!).
This research required a lot of computational work, which was only possible through the efforts of the many students and staff that worked in CASTLE Lab. Over my 39 years of teaching at Princeton, I benefited tremendously from the interactions with 70 graduate students and post-doctoral associates, along with nine professional staff. I am deeply indebted to the contributions of this exceptionally talented group of men and women who allowed me to participate in the challenges of getting computational methods to work on such a wide range of problems. It was precisely this diversity of problem settings that led me to appreciate the motivation for the different methods for solving problems. In the process, I met people from across the jungle, and learned to speak their language not just by reading papers, but by talking to them and, often, working on their problems.
