Praise for the First Edition
"Finally, a book devoted to dynamic programming and written using the language of operations research (OR)! This beautiful book fills a gap in the libraries of OR specialists and practitioners."
—Computing Reviews
This new edition focuses on modeling and computation for complex classes of approximate dynamic programming problems.
Understanding approximate dynamic programming (ADP) is vital for developing practical, high-quality solutions to complex industrial problems, particularly when those problems involve making decisions in the presence of uncertainty. Approximate Dynamic Programming, Second Edition uniquely integrates four distinct disciplines—Markov decision processes, mathematical programming, simulation, and statistics—to demonstrate how to successfully approach, model, and solve a wide range of real-life problems using ADP.
The book continues to bridge the gap between computer science, simulation, and operations research and now adopts the notation and vocabulary of reinforcement learning as well as stochastic search and simulation optimization. The author outlines the essential algorithms that serve as a starting point in the design of practical solutions for real problems. The three curses of dimensionality that impact complex problems are introduced and detailed coverage of implementation challenges is provided. The Second Edition also features:
A new chapter describing four fundamental classes of policies for working with diverse stochastic optimization problems: myopic policies, look-ahead policies, policy function approximations, and policies based on value function approximations
A new chapter on policy search that brings together stochastic search and simulation optimization concepts and introduces a new class of optimal learning strategies
Updated coverage of the exploration–exploitation problem in ADP, now including a recently developed method for doing active learning in the presence of a physical state, using the concept of the knowledge gradient
A new sequence of chapters describing statistical methods for approximating value functions, for estimating the value of a fixed policy, and for approximating value functions while searching for optimal policies
The presented coverage of ADP emphasizes models and algorithms, focusing on related applications and computation, while also discussing the theoretical side of the topic, including proofs of convergence and rates of convergence. A related website features an ongoing discussion of the evolving fields of approximate dynamic programming and reinforcement learning, along with additional readings, software, and datasets.
Requiring only a basic understanding of statistics and probability, Approximate Dynamic Programming, Second Edition is an excellent book for industrial engineering and operations research courses at the upper-undergraduate and graduate levels. It also serves as a valuable reference for researchers and professionals who utilize dynamic programming, stochastic programming, and control theory to solve problems in their everyday work.
Page count: 1104
Year of publication: 2011
Table of Contents
Series Page
Title Page
Copyright
Preface to the Second Edition
Preface to the First Edition
Acknowledgments
Chapter 1: The Challenges of Dynamic Programming
1.1 A Dynamic Programming Example: A Shortest Path Problem
1.2 The Three Curses of Dimensionality
1.3 Some Real Applications
1.4 Problem Classes
1.5 The Many Dialects of Dynamic Programming
1.6 What is New in this Book?
1.7 Pedagogy
1.8 Bibliographic Notes
Chapter 2: Some Illustrative Models
2.1 Deterministic Problems
2.2 Stochastic Problems
2.3 Information Acquisition Problems
2.4 A Simple Modeling Framework for Dynamic Programs
2.5 Bibliographic Notes
Chapter 3: Introduction to Markov Decision Processes
3.1 The Optimality Equations
3.2 Finite Horizon Problems
3.3 Infinite Horizon Problems
3.4 Value Iteration
3.5 Policy Iteration
3.6 Hybrid Value-policy Iteration
3.7 Average Reward Dynamic Programming
3.8 The Linear Programming Method for Dynamic Programs
3.9 Monotone Policies*
3.10 Why does it Work?**
3.11 Bibliographic Notes
Chapter 4: Introduction to Approximate Dynamic Programming
4.1 The Three Curses of Dimensionality (Revisited)
4.2 The Basic Idea
4.3 Q-learning and SARSA
4.4 Real-time Dynamic Programming
4.5 Approximate Value Iteration
4.6 The Post-decision State Variable
4.7 Low-dimensional Representations of Value Functions
4.8 So just what is Approximate Dynamic Programming?
4.9 Experimental Issues
4.10 But does it Work?
4.11 Bibliographic Notes
Chapter 5: Modeling Dynamic Programs
5.1 Notational Style
5.2 Modeling Time
5.3 Modeling Resources
5.4 The States of our System
5.5 Modeling Decisions
5.6 The Exogenous Information Process
5.7 The Transition Function
5.8 The Objective Function
5.9 A Measure-theoretic View of Information**
5.10 Bibliographic Notes
Chapter 6: Policies
6.1 Myopic Policies
6.2 Lookahead Policies
6.3 Policy Function Approximations
6.4 Value Function Approximations
6.5 Hybrid Strategies
6.6 Randomized Policies
6.7 How to Choose a Policy?
6.8 Bibliographic Notes
Chapter 7: Policy Search
7.1 Background
7.2 Gradient Search
7.3 Direct Policy Search for Finite Alternatives
7.4 The Knowledge Gradient Algorithm for Discrete Alternatives
7.5 Simulation Optimization
7.6 Why does it Work?**
7.7 Bibliographic Notes
Chapter 8: Approximating Value Functions
8.1 Lookup Tables and Aggregation
8.2 Parametric Models
8.3 Regression Variations
8.4 Nonparametric Models
8.5 Approximations and the Curse of Dimensionality
8.6 Why does it Work?**
8.7 Bibliographic Notes
Chapter 9: Learning Value Function Approximations
9.1 Sampling the Value of a Policy
9.2 Stochastic Approximation Methods
9.3 Recursive Least Squares for Linear Models
9.4 Temporal Difference Learning with a Linear Model
9.5 Bellman's Equation Using a Linear Model
9.6 Analysis of TD(0), LSTD, and LSPE using a Single State
9.7 Gradient-based Methods for Approximate Value Iteration*
9.8 Least Squares Temporal Differencing with Kernel Regression*
9.9 Value Function Approximations Based on Bayesian Learning*
9.10 Why does it Work?**
9.11 Bibliographic Notes
Chapter 10: Optimizing While Learning
10.1 Overview of Algorithmic Strategies
10.2 Approximate Value Iteration and Q-learning Using Lookup Tables
10.3 Statistical Bias in the Max Operator
10.4 Approximate Value Iteration and Q-learning Using Linear Models
10.5 Approximate Policy Iteration
10.6 The Actor–critic Paradigm
10.7 Policy Gradient Methods
10.8 The Linear Programming Method Using Basis Functions
10.9 Approximate Policy Iteration Using Kernel Regression*
10.10 Finite Horizon Approximations for Steady-state Applications
10.11 Bibliographic Notes
Chapter 11: Adaptive Estimation and Stepsizes
11.1 Learning Algorithms and Stepsizes
11.2 Deterministic Stepsize Recipes
11.3 Stochastic Stepsizes
11.4 Optimal Stepsizes for Nonstationary Time Series
11.5 Optimal Stepsizes for Approximate Value Iteration
11.6 Convergence
11.7 Guidelines for Choosing Stepsize Formulas
11.8 Bibliographic Notes
Chapter 12: Exploration Versus Exploitation
12.1 A Learning Exercise: the Nomadic Trucker
12.2 An Introduction to Learning
12.3 Heuristic Learning Policies
12.4 Gittins Indices for Online Learning
12.5 The Knowledge Gradient Policy
12.6 Learning with a Physical State
12.7 Bibliographic Notes
Chapter 13: Value Function Approximations for Resource Allocation Problems
13.1 Value Functions versus Gradients
13.2 Linear Approximations
13.3 Piecewise-linear Approximations
13.4 Solving a Resource Allocation Problem using Piecewise-linear Functions
13.5 The Shape Algorithm
13.6 Regression Methods
13.7 Cutting Planes*
13.8 Why does it Work?**
13.9 Bibliographic Notes
Chapter 14: Dynamic Resource Allocation Problems
14.1 An Asset Acquisition Problem
14.2 The Blood Management Problem
14.3 A Portfolio Optimization Problem
14.4 A General Resource Allocation Problem
14.5 A Fleet Management Problem
14.6 A Driver Management Problem
14.7 Bibliographic Notes
Chapter 15: Implementation Challenges
15.1 Will ADP Work for Your Problem?
15.2 Designing an ADP Algorithm for Complex Problems
15.3 Debugging an ADP Algorithm
15.4 Practical Issues
15.5 Modeling Your Problem
15.6 Online versus Offline Models
15.7 If it Works, Patent It!
Bibliography
Index
Wiley Series
Copyright © 2010 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Powell, Warren B., 1955–
Approximate dynamic programming : solving the curses of dimensionality / Warren B. Powell.
–2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-60445-8 (cloth)
1. Dynamic programming. I. Title.
T57.83.P76 2011
519.7'03–dc22
2010047227
Preface to the Second Edition
The writing for the first edition of this book ended around 2005, followed by a year of editing before it was submitted to the publisher in 2006. As with everyone who works in this very rich field, my understanding of the models and algorithms was strongly shaped by the projects I had worked on. While I was very proud of the large industrial applications that were the basis of my success, at the time I had a very limited understanding of many other important problem classes that help to shape the algorithms that have evolved (and continue to evolve) in this field.
In the five years that passed before this second edition went to the publisher, my understanding of the field and my breadth of applications have grown dramatically. Reflecting my own personal growth, I realized that the book needed a fundamental restructuring along several dimensions. I came to appreciate that approximate dynamic programming is much more than approximating value functions. After writing an article that included a list of nine types of policies, I realized that every policy I had encountered could be broken down into four fundamental classes: myopic policies, lookahead policies, policy function approximations, and policies based on value function approximations. Many other policies can be created by combining these four fundamental classes into different types of hybrids.
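To make the distinction concrete, the following is a minimal Python sketch of the four classes for a hypothetical single-product inventory problem. The contribution function, the demand forecasts, the value estimate V_bar, and the order-up-to parameter theta are all illustrative assumptions, not notation from the book.

```python
import numpy as np

# Hypothetical scalar inventory problem: state s = inventory on hand,
# action a = quantity to order, chosen from a small discrete set.
actions = np.arange(0, 11)

def contribution(s, a, demand):
    """Illustrative one-period contribution: revenue minus order cost."""
    return 5.0 * min(s + a, demand) - 2.0 * a

# 1. Myopic policy: optimize the immediate contribution only.
def myopic_policy(s, demand_forecast):
    return max(actions, key=lambda a: contribution(s, a, demand_forecast))

# 2. Lookahead policy: evaluate the first decision by simulating forward
#    against point forecasts of future demand (a deterministic lookahead).
def lookahead_policy(s, demand_forecasts):
    def horizon_value(a):
        total, inv = 0.0, s
        for t, d in enumerate(demand_forecasts):
            order = a if t == 0 else 0          # only the first decision is optimized here
            total += contribution(inv, order, d)
            inv = max(inv + order - d, 0)
        return total
    return max(actions, key=horizon_value)

# 3. Policy function approximation: a parametric rule mapping state to action,
#    here an order-up-to rule with tunable parameter theta.
def pfa_policy(s, theta=8):
    return max(theta - s, 0)

# 4. Policy based on a value function approximation V_bar(s'):
#    maximize contribution plus the estimated value of the next state.
def vfa_policy(s, demand_forecast, V_bar, gamma=0.95):
    def q(a):
        s_next = max(s + a - demand_forecast, 0)
        return contribution(s, a, demand_forecast) + gamma * V_bar(s_next)
    return max(actions, key=q)
```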
I also realized that methods for approximating functions (whether they be policy function approximations or value function approximations) could be usefully organized into three fundamental strategies: lookup tables, parametric models, and nonparametric models. Of course, these can also be combined in different forms.
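As an illustration, the sketch below fits all three strategies to the same set of noisy sampled values of a one-dimensional function. The choice of aggregation bins, basis functions, and kernel bandwidth are arbitrary assumptions made only to show the contrast.

```python
import numpy as np

rng = np.random.default_rng(0)
s_samples = rng.uniform(0, 10, size=200)                       # sampled states
v_samples = np.sin(s_samples) + 0.1 * s_samples**2 + rng.normal(0, 0.3, 200)

# 1. Lookup table: one estimate per aggregated state (here, a simple binning),
#    maintained as a running average so it can be updated recursively.
bins = np.linspace(0, 10, 21)
table = np.zeros(len(bins) - 1)
counts = np.zeros(len(bins) - 1)
idx = np.clip(np.digitize(s_samples, bins) - 1, 0, len(table) - 1)
for i, v in zip(idx, v_samples):
    counts[i] += 1
    table[i] += (v - table[i]) / counts[i]

def v_lookup(s):
    return table[np.clip(np.digitize(s, bins) - 1, 0, len(table) - 1)]

# 2. Parametric model: linear in a fixed set of basis functions (features).
def phi(s):
    return np.column_stack([np.ones_like(s), s, s**2, np.sin(s)])

theta, *_ = np.linalg.lstsq(phi(s_samples), v_samples, rcond=None)

def v_parametric(s):
    return phi(np.atleast_1d(s)) @ theta

# 3. Nonparametric model: Nadaraya-Watson kernel regression with bandwidth h.
def v_kernel(s, h=0.5):
    s = np.atleast_1d(s)[:, None]
    w = np.exp(-0.5 * ((s - s_samples[None, :]) / h) ** 2)
    return (w @ v_samples) / w.sum(axis=1)
```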
In preparing the second edition, I came to realize that the nature of the decision variable plays a critical role in the design of an algorithm. In the first edition, one of my goals was to create a bridge between dynamic programming (which tended to focus on small action spaces) and math programming, with its appreciation of vector-valued decisions. As a result I had adopted x as my generic decision variable. In preparing the new edition, I came to realize that small action spaces cover a very important class of problems, and these are also the problems that a beginner is most likely to start with to learn the field. Also, the action “a” pervades the reinforcement learning community (along with portions of the operations research community), to the point that it is truly part of the language. As a result the second edition now uses action “a” for most of its presentation, but reverts to x specifically for problems where the decisions are continuous and/or (more frequently) vectors. The challenges of vector-valued decisions have been largely overlooked in the reinforcement learning community, while the operations research community that works on these problems has largely ignored the power of dynamic programming.
The second edition now includes a new chapter (Chapter 6) devoted purely to a discussion of different types of policies, a summary of some hybrid strategies, and a discussion of problems that are well suited to each of the different strategies. This is followed by a chapter (Chapter 7) that focuses purely on the issue of policy search. This chapter brings together fields such as stochastic search and simulation optimization. The chapter also introduces a new class of optimal learning strategies based on the concept of the knowledge gradient, an idea that was developed originally to address the exploration–exploitation problem before realizing that it had many other applications.
I also acquired a much better understanding of the different methods for approximating value functions. I found that the best way to communicate the rich set of strategies that have evolved was to divide the material into three chapters. The first of these (Chapter 8) focuses purely on different statistical procedures for approximating value functions. While this can be viewed partly as a tutorial into statistics and machine learning, the focus is on strategies that have been used in the approximate dynamic programming/reinforcement learning literature. ADP imposes special demands on statistical learning algorithms, including the importance of recursive estimation, and the need to start with a small number of observations (which works better with a low-dimensional model) and transition to a larger number of observations with models that are high-dimensional in certain regions. Next, Chapter 9 summarizes different methods for estimating the value of being in a state using sample information, with the goal of estimating the value function for a fixed policy. Since I have found that a number of papers focus on a single policy without making this apparent, this chapter makes this very explicit by indexing variables that depend on a policy with a superscript π. Finally, Chapter 10 addresses the very difficult problem of estimating the value of being in a state while simultaneously optimizing over policies.
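As a small illustration of what estimating the value of a fixed policy involves, the sketch below runs a basic TD(0) recursion on a hypothetical five-state Markov chain induced by a fixed policy. The transition matrix, rewards, and stepsize rule are made-up assumptions; the exact solution of the policy's Bellman equation is computed only for comparison.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Markov chain induced by a fixed policy.
n_states = 5
P = rng.dirichlet(np.ones(n_states), size=n_states)   # transition matrix under the policy
r = rng.uniform(0, 1, size=n_states)                   # expected one-step reward per state
gamma = 0.9

# TD(0): purely recursive, one observation at a time, lookup-table representation.
V = np.zeros(n_states)
s = 0
for n in range(1, 50_001):
    s_next = rng.choice(n_states, p=P[s])
    reward = r[s] + rng.normal(0, 0.1)                 # noisy observed reward
    alpha = 10.0 / (10.0 + n)                          # declining stepsize (assumed rule)
    V[s] += alpha * (reward + gamma * V[s_next] - V[s])  # temporal-difference update
    s = s_next

# Exact value of the fixed policy for comparison: V = (I - gamma P)^(-1) r
V_exact = np.linalg.solve(np.eye(n_states) - gamma * P, r)
print(np.round(V, 3), np.round(V_exact, 3))
```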
Chapter 11 of this book is a refined version of the old Chapter 6, which focused on stepsize rules. Chapter 11 is streamlined, with a new discussion of the implications of algorithms based on policy iteration (including least squares policy evaluation (LSPE) and least squares temporal differences) and algorithms based on approximate value iteration and Q-learning. Following some recent research, I use the setting of a single state to develop a much clearer understanding of the demands that these different algorithmic strategies place on a stepsize. A new section has been added introducing a stepsize rule that is specifically optimized for approximate value iteration.
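For readers unfamiliar with stepsize rules, the sketch below compares a few standard deterministic recipes (the 1/n sample average, a generalized harmonic rule, and a constant stepsize) on the simple problem of smoothing noisy observations of a constant. The signal and noise levels are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
true_value = 100.0
observations = true_value + rng.normal(0, 10, size=500)   # noisy observations

def smooth(obs, stepsize):
    """Stochastic-approximation smoothing: theta_n = (1 - alpha_n) theta_{n-1} + alpha_n y_n."""
    theta = 0.0
    for n, y in enumerate(obs, start=1):
        alpha = stepsize(n)
        theta = (1 - alpha) * theta + alpha * y
    return theta

rules = {
    "1/n (sample average)":      lambda n: 1.0 / n,
    "generalized harmonic a=10": lambda n: 10.0 / (10.0 + n - 1),
    "constant 0.05":             lambda n: 0.05,
}
for name, rule in rules.items():
    print(f"{name:28s} -> {smooth(observations, rule):7.2f}")
```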
Chapter 12, on the famous exploration–exploitation problem in approximate dynamic programming, has been heavily revised to reflect a much more thorough understanding of the general field that is coming to be known as optimal learning. This chapter includes a recently developed method for doing active learning in the presence of a physical state, by way of the concept of the knowledge gradient. While this method looks promising, the general area of doing active learning in the context of dynamic programs (with a physical state) is an active area of research at the time of this writing.
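For orientation only, the basic knowledge gradient for independent normal beliefs (without a physical state) can be computed in closed form. The sketch below uses made-up prior means, variances, and measurement noise, and should not be confused with the physical-state extension discussed in Chapter 12.

```python
import numpy as np
from scipy.stats import norm

# Independent normal beliefs about a few alternatives (all values assumed).
mu = np.array([1.0, 1.5, 1.2, 0.8])        # posterior means
sigma2 = np.array([4.0, 1.0, 2.0, 3.0])    # posterior variances
noise2 = 1.0                               # measurement noise variance

# Standard deviation of the predictive change in each mean after one measurement.
sigma_tilde = sigma2 / np.sqrt(sigma2 + noise2)

# Normalized distance to the best of the other alternatives.
best_other = np.array([np.max(np.delete(mu, i)) for i in range(len(mu))])
zeta = -np.abs(mu - best_other) / sigma_tilde

# Knowledge gradient: expected improvement in the maximum of the posterior means.
kg = sigma_tilde * (zeta * norm.cdf(zeta) + norm.pdf(zeta))
print("KG values:", np.round(kg, 4), "-> measure alternative", int(np.argmax(kg)))
```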
A major theme of the first edition was to bridge the gap between disciplines, primarily reinforcement learning (computer science), simulation, and math programming (operations research). This edition reinforces that theme, first by adopting more broadly the notation and vocabulary of reinforcement learning (which has made most of the contributions to this field) while retaining the bridge to math programming, and now also by drawing on stochastic search and simulation optimization (primarily in the context of policy search).
The mathematical level of the book continues to require only an understanding of statistics and probability. A goal of the first edition was that the material would be accessible to an advanced undergraduate audience. With this second edition, a more accurate description would be that the material is accessible to a highly motivated and well-prepared undergraduate, but the breadth of the material is more suitable to a graduate audience.
Warren B. Powell
Princeton, New Jersey
October 2010
Preface to the First Edition
The path to completing this book began in the early 1980s when I first started working on dynamic models arising in the management of fleets of vehicles for the truckload motor carrier industry. It is often said that necessity is the mother of invention, and as with many of my colleagues in this field, the methods that emerged evolved out of a need to solve a problem. The initially ad hoc models and algorithms I developed to solve these complex industrial problems evolved into a sophisticated set of tools supported by an elegant theory within a field that is increasingly being referred to as approximate dynamic programming.
The methods in this book reflect the original motivating applications. I started with the elegant models for which academia is so famous, but my work with industry revealed the need to handle a number of complicating factors that were beyond the scope of these models. One of these was a desire from one company to understand the effect of uncertainty on operations, requiring the ability to solve these large-scale optimization problems in the presence of various forms of randomness (most notably customer demands). This question launched what became a multiple-decade search for a modeling and algorithmic strategy that would provide practical, but high-quality, solutions.
This process of discovery took me through multiple fields, including linear and nonlinear programming, Markov decision processes, optimal control, and stochastic programming. It is somewhat ironic that the framework of Markov decision processes, which originally appeared to be limited to toy problems (three trucks moving between five cities), turned out to provide the critical theoretical framework for solving truly industrial-strength problems (thousands of drivers moving between hundreds of locations, each described by complex vectors of attributes).
The ability to solve these problems required the integration of four major disciplines: dynamic programming (Markov decision processes), math programming (linear, nonlinear, and integer programming), simulation, and statistics. My desire to bring together the fields of dynamic programming and math programming motivated some fundamental notational choices (in particular, the use of x as a decision variable). As a result, this book depends heavily on the Monte Carlo methods so widely used in simulation, but a knowledgeable reader will quickly see how much is missing. The book covers in some depth a number of important techniques from statistics, but even this presentation only scratches the surface of the tools and concepts available from fields such as nonparametric statistics, signal processing, and approximation theory.
Audience
The book is aimed primarily at an advanced undergraduate/masters audience with no prior background in dynamic programming. The presentation does expect a first course in probability and statistics. Some topics require an introductory course in linear programming. A major goal of the book is the clear and precise presentation of dynamic problems, which means there is an emphasis on modeling and notation.
The body of every chapter focuses on models and algorithms with a minimum of the mathematical formalism that so often makes presentations of dynamic programs inaccessible to a broader audience. Using numerous examples, each chapter emphasizes the presentation of algorithms that can be directly applied to a variety of applications. The book contains dozens of algorithms that are intended to serve as a starting point in the design of practical solutions for real problems. Material for more advanced graduate students (with measure-theoretic training and an interest in theory) is contained in sections marked with **.
The book can be used quite effectively in a graduate level course. Several chapters include “Why does it work” sections at the end that present proofs at an advanced level (these are all marked with **). This material can be easily integrated into the teaching of the material within the chapter.
Approximate dynamic programming is also a field that has emerged from several disciplines. I have tried to expose the reader to the many dialects of ADP, reflecting its origins in artificial intelligence, control theory, and operations research. In addition to the diversity of words and phrases that mean the same thing—but often with different connotations—I have had to make difficult notational choices.
I have found that different communities offer unique insights into different dimensions of the problem. In the main, the control theory community has the most thorough understanding of the meaning of a state variable. The artificial intelligence community has the most experience with deeply nested problems (which require numerous steps before earning a reward). The operations research community has evolved a set of tools that are well suited for high-dimensional resource allocation, contributing both math programming and a culture of careful modeling.
W. B. P.
Acknowledgments
The work in this book reflects the contributions of many. Perhaps most important are the problems that motivated the development of this material. This work would not have been possible without the corporate sponsors who posed these problems in the first place. I would like to give special recognition to Schneider National, the largest truckload carrier in the United States, Yellow Freight System, the largest less-than-truckload carrier, and Norfolk Southern Railroad, one of the four major railroads that serves the United States. These three companies not only posed difficult problems, they provided years of research funding that allowed me to work on the development of tools that became the foundation of this book. This work would never have progressed without the thousands of hours of my two senior professional staff members, Hugo Simão and Belgacem Bouzaiëne-Ayari, who have written hundreds of thousands of lines of code to solve industrial-strength problems. It is their efforts working with our corporate sponsors that brought out the richness of real applications, and therefore the capabilities that our tools needed to possess.
While industrial sponsors provided the problems, without the participation of my graduate students I would simply have a set of ad hoc procedures. It is the work of my graduate students that provided most of the fundamental insights and algorithms, and virtually all of the convergence proofs. In the order in which they joined my research program, the students are Linos Frantzeskakis, Raymond Cheung, Tassio Carvalho, Zhi-Long Chen, Greg Godfrey, Joel Shapiro, Mike Spivey, Huseyin Topaloglu, Katerina Papadaki, Arun Marar, Tony Wu, Abraham George, Juliana Nascimento, Peter Frazier, and Ilya Ryzhov, all of whom have contributed directly to the material presented in this book. My undergraduate senior thesis advisees provided many colorful applications of dynamic programming, and they contributed their experiences with their computational work.
The presentation has benefited from numerous conversations with professionals in this community. I am particularly grateful to Erhan Çinlar, who taught me the language of stochastic processes that played a fundamental role in guiding my notation in the modeling of information. I am also grateful for many conversations with Ben van Roy, Dimitri Bertsekas, Andy Barto, Mike Fu, Dan Adelman, Lei Zhao, and Diego Klabjan. I would also like to thank Paul Werbos at NSF for introducing me to the wonderful neural net community in IEEE, which contributed what for me was a fresh perspective on dynamic problems. Jennie Si, Don Wunsch, George Lendaris and Frank Lewis all helped educate me in the language and concepts of the control theory community.
For the second edition of the book, I would like to add special thanks to Peter Frazier and Ilya Ryzhov, who contributed the research on the knowledge gradient for optimal learning in ADP, and improvements in my presentation of Gittins indices. The research of Jun Ma on convergence theory for approximate policy iteration for continuous states and actions contributed to my understanding in a significant way. This edition also benefited from the contributions of Warren Scott, Lauren Hannah, and Emre Barut who have combined to improve my understanding of nonparametric statistics.
This research was first funded by the National Science Foundation, but the bulk of my research in this book was funded by the Air Force Office of Scientific Research, and I am particularly grateful to Dr. Neal Glassman for his support through the early years. The second edition has enjoyed continued support from AFOSR by Donald Hearn, and I appreciate Don's dedication to the AFOSR program.
Many people have assisted with the editing of this volume through numerous comments. Mary Fan, Tamas Papp, and Hugo Simão all read various drafts of the first edition cover to cover. I would like to express my appreciation to Boris Defourny for an exceptionally thorough proofreading of the second edition. Diego Klabjan and his dynamic programming classes at the University of Illinois provided numerous comments and corrections. Special thanks are due to the students in my own undergraduate and graduate dynamic programming classes who had to survive the very early versions of the text. The second edition of the book benefited from the many comments of my graduate students, and my ORF 569 graduate seminar on approximate dynamic programming. Based on their efforts, many hundreds of corrections have been made, though I am sure that new errors have been introduced. I appreciate the patience of the readers who understand that this is the price of putting in textbook form material that is evolving so quickly.
Of course, the preparation of this book required tremendous patience from my wife Shari and my children Elyse and Danny, who had to tolerate my ever-present laptop at home. Without their support, this project could never have been completed.
W.B.P.
