Reinforcement Learning and Approximate Dynamic Programming for Feedback Control

Description

Reinforcement learning (RL) and adaptive dynamic programming (ADP) have been among the most critical research fields in science and engineering for modern complex systems. This book describes the latest RL and ADP techniques for decision and control in human-engineered systems, covering both single-player decision and control and multiplayer games. Edited by the pioneers of RL and ADP research, the book brings together ideas and methods from many fields and provides important and timely guidance on controlling a wide variety of systems, such as robots, industrial processes, and economic decision-making.




Contents

Cover

Series Page

Title Page

Copyright

Preface

Contributors

Part I: Feedback Control Using RL And ADP

Chapter 1: Reinforcement Learning and Approximate Dynamic Programming (RLADP)—Foundations, Common Misconceptions, and the Challenges Ahead

1.1 Introduction

1.2 What is RLADP?

1.3 Some Basic Challenges in Implementing ADP

Disclaimer

References

Chapter 2: Stable Adaptive Neural Control of Partially Observable Dynamic Systems

2.1 Introduction

2.2 Background

2.3 Stability Bias

2.4 Example Application

References

Chapter 3: Optimal Control of Unknown Nonlinear Discrete-Time Systems Using the Iterative Globalized Dual Heuristic Programming Algorithm

3.1 Background Material

3.2 Neuro-Optimal Control Scheme Based on the Iterative ADP Algorithm

3.3 Generalization

3.4 Simulation Studies

3.5 Summary

References

Chapter 4: Learning and Optimization in Hierarchical Adaptive Critic Design

4.1 Introduction

4.2 Hierarchical ADP Architecture with Multiple-Goal Representation

4.3 Case Study: The Ball-and-Beam System

4.4 Conclusions and Future Work

Acknowledgments

References

Chapter 5: Single Network Adaptive Critics Networks—Development, Analysis, and Applications

5.1 Introduction

5.2 Approximate Dynamic Programming

5.3 SNAC

5.4 J-SNAC

5.5 Finite-SNAC

5.6 Conclusions

Acknowledgments

References

Chapter 6: Linearly Solvable Optimal Control

6.1 Introduction

6.2 Linearly Solvable Optimal Control Problems

6.3 Extension to Risk-Sensitive Control and Game Theory

6.4 Properties and Algorithms

6.5 Conclusions and Future Work

References

Chapter 7: Approximating Optimal Control with Value Gradient Learning

7.1 Introduction

7.2 Value Gradient Learning and BPTT Algorithms

7.3 A Convergence Proof for VGL(1) for Control with Function Approximation

7.4 Vertical Lander Experiment

7.5 Conclusions

References

Chapter 8: A Constrained Backpropagation Approach to Function Approximation and Approximate Dynamic Programming

8.1 Background

8.2 Constrained Backpropagation (CPROP) Approach

8.3 Solution of Partial Differential Equations in Nonstationary Environments

8.4 Preserving Prior Knowledge in Exploratory Adaptive Critic Designs

8.5 Summary

Algebraic ANN Control Matrices

References

Chapter 9: Toward Design of Nonlinear ADP Learning Controllers with Performance Assurance

9.1 Introduction

9.2 Direct Heuristic Dynamic Programming

9.3 A Control Theoretic View on the Direct HDP

9.4 Direct HDP Design with Improved Performance Case 1—Design Guided by a Priori LQR Information

9.5 Direct HDP Design with Improved Performance Case 2—Direct HDP for Coordinated Damping Control of Low-Frequency Oscillation

9.6 Summary

Acknowledgment

References

Chapter 10: Reinforcement Learning Control with Time-Dependent Agent Dynamics

10.1 Introduction

10.2 Q-Learning

10.3 Sampled Data Q-Learning

10.4 System Dynamics Approximation

10.5 Closing Remarks

References

Chapter 11: Online Optimal Control of Nonaffine Nonlinear Discrete-Time Systems without Using Value and Policy Iterations

11.1 Introduction

11.2 Background

11.3 Reinforcement Learning Based Control

11.4 Time-Based Adaptive Dynamic Programming-Based Optimal Control

11.5 Simulation Result

References

Chapter 12: An Actor–Critic–Identifier Architecture for Adaptive Approximate Optimal Control

12.1 Introduction

12.2 Actor–Critic–Identifier Architecture for HJB Approximation

12.3 Actor–Critic Design

12.4 Identifier Design

12.5 Convergence and Stability Analysis

12.6 Simulation

12.7 Conclusion

References

Chapter 13: Robust Adaptive Dynamic Programming

13.1 Introduction

13.2 Optimality Versus Robustness

13.3 Robust-ADP Design for Disturbance Attenuation

13.4 Robust-ADP for Partial-State Feedback Control

13.5 Applications

13.6 Summary

Acknowledgment

References

Part II: Learning and Control in Multiagent Games

Chapter 14: Hybrid Learning in Stochastic Games and Its Application in Network Security

14.1 Introduction

14.2 Two-Person Game

14.3 Learning in NZSGs

14.4 Main Results

14.5 Security Application

14.6 Conclusions and Future Works

Appendix: Assumptions for Stochastic Approximation

References

Chapter 15: Integral Reinforcement Learning for Online Computation of Nash Strategies of Nonzero-Sum Differential Games

15.1 Introduction

15.2 Two-Player Games and Integral Reinforcement Learning

15.3 Continuous-Time Value Iteration to Solve the Riccati Equation

15.4 Online Algorithm to Solve Nonzero-Sum Games

15.5 Analysis of the Online Learning Algorithm for NZS Games

15.6 Simulation Result for the Online Game Algorithm

15.7 Conclusion

References

Chapter 16: Online Learning Algorithms for Optimal Control and Dynamic Games

16.1 Introduction

16.2 Optimal Control and the Continuous Time Hamilton–Jacobi–Bellman Equation

16.3 Online Solution of Nonlinear Two-Player Zero-Sum Games and Hamilton–Jacobi–Isaacs Equation

16.4 Online Solution of Nonlinear Nonzero-Sum Games and Coupled Hamilton–Jacobi Equations

References

Part III: Foundations in MDP And RL

Chapter 17: Lambda-Policy Iteration: A Review and a New Implementation

17.1 Introduction

17.2 Lambda-Policy Iteration without Cost Function Approximation

17.3 Approximate Policy Evaluation Using Projected Equations

17.4 Lambda-Policy Iteration with Cost Function Approximation

17.5 Conclusions

Acknowledgments

References

Chapter 18: Optimal Learning and Approximate Dynamic Programming

18.1 Introduction

18.2 Modeling

18.3 The Four Classes of Policies

18.4 Basic Learning Policies for Policy Search

18.5 Optimal Learning Policies for Policy Search

18.6 Learning with a Physical State

References

Chapter 19: An Introduction to Event-Based Optimization: Theory and Applications

19.1 Introduction

19.2 Literature Review

19.3 Problem Formulation

19.4 Policy Iteration for EBO

19.5 Example: Material Handling Problem

19.6 Conclusions

Acknowledgments

References

Chapter 20: Bounds for Markov Decision Processes

20.1 Introduction

20.2 Problem Formulation

20.3 The Linear Programming Approach

20.4 The Martingale Duality Approach

20.5 The Pathwise Optimization Method

20.6 Applications

20.7 Conclusion

References

Chapter 21: Approximate Dynamic Programming and Backpropagation on Timescales

21.1 Introduction: Timescales Fundamentals

21.2 Dynamic Programming

21.3 Backpropagation

21.4 Conclusions

Acknowledgments

References

Chapter 22: A Survey of Optimistic Planning in Markov Decision Processes

22.1 Introduction

22.2 Optimistic Online Optimization

22.3 Optimistic Planning Algorithms

22.4 Related Planning Algorithms

22.5 Numerical Example

References

Chapter 23: Adaptive Feature Pursuit: Online Adaptation of Features in Reinforcement Learning

23.1 Introduction

23.2 The Framework

23.3 The Feature Adaptation Scheme

23.4 Convergence Analysis

23.5 Application to Traffic Signal Control

23.6 Conclusions

References

Chapter 24: Feature Selection for Neuro-Dynamic Programming

24.1 Introduction

24.2 Optimality Equations

24.3 Neuro-Dynamic Algorithms

24.4 Fluid Models

24.5 Diffusion Models

24.6 Mean Field Games

24.7 Conclusions

References

Chapter 25: Approximate Dynamic Programming for Optimizing Oil Production

25.1 Introduction

25.2 Petroleum Reservoir Production Optimization Problem

25.3 Review of Dynamic Programming and Approximate Dynamic Programming

25.4 Approximate Dynamic Programming Algorithm for Reservoir Production Optimization

25.5 Simulation Results

25.6 Concluding Remarks

Acknowledgments

References

Chapter 26: A Learning Strategy for Source Tracking in Unstructured Environments

26.1 Introduction

26.2 Reinforcement Learning

26.3 Light-Following Robot

26.4 Simulation Results

26.5 Experimental Results

26.6 Conclusions and Future Work

Acknowledgments

References

Index

IEEE Press Series on Computational Intelligence

IEEE Press

445 Hoes Lane

Piscataway, NJ 08854

IEEE Press Editorial Board 2012

John Anderson, Editor in Chief

Kenneth Moore, Director of IEEE Book and Information Services (BIS)

Cover Illustration: Courtesy of Frank L. Lewis and Derong Liu

Cover Design: John Wiley & Sons, Inc.

Copyright © 2013 by The Institute of Electrical and Electronics Engineers, Inc.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Reinforcement learning and approximate dynamic programming for feedback control / edited by Frank L. Lewis, Derong Liu.

p. cm.

ISBN 978-1-118-10420-0 (hardback)

1. Reinforcement learning. 2. Feedback control systems. I. Lewis, Frank L.

II. Liu, Derong, 1963-

Q325.6.R464 2012

003′.5—dc23

2012019014

Preface

Modern day society relies on the operation of complex systems including aircraft, automobiles, electric power systems, economic entities, business organizations, banking and finance systems, computer networks, manufacturing systems, and industrial processes. Decision and control are responsible for ensuring that these systems perform properly and meet prescribed performance objectives. The safe, reliable, and efficient control of these systems is essential for our society. Therefore, automatic decision and control systems are ubiquitous in human engineered systems and have had an enormous impact on our lives. As modern systems become more complex and performance requirements more stringent, improved methods of decision and control are required that deliver guaranteed performance and the satisfaction of prescribed goals.

Feedback control works on the principle of observing the actual outputs of a system, comparing them with desired trajectories, and computing a control signal from the resulting error; this signal modifies the behavior of the system so that the actual output follows the desired trajectory. The optimization of sequential decisions or controls that are repeated over time arises in many fields, including artificial intelligence, automatic control systems, power systems, economics, medicine, operations research, resource allocation, collaboration and coalitions, business and finance, and games such as chess and backgammon. Optimal control theory provides methods for computing feedback controllers that deliver optimal performance. Optimal controllers optimize user-prescribed performance functions and are normally designed offline by solving Hamilton–Jacobi–Bellman (HJB) design equations, which requires knowledge of the full system dynamics model. It is often difficult, however, to determine an accurate dynamical model of a practical system. Moreover, determining optimal control policies for nonlinear systems requires the offline solution of nonlinear HJB equations, which are often difficult or impossible to solve. Dynamic programming (DP) is an algorithmic method for finding optimal solutions in sequential decision problems. DP was developed beginning in the 1950s with the work of Bellman and Pontryagin. It is fundamentally a backwards-in-time procedure and does not offer methods for solving optimal decision problems in a forward manner in real time.
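
The preface refers to these two objects without writing them down. For reference, and in generic notation chosen for this illustration rather than taken from any particular chapter (discrete-time transition map F, continuous-time vector field f, running cost r), their standard textbook forms are the backwards-in-time DP recursion

\[
V_k^{*}(x) \;=\; \min_{u}\big[\, r(x,u) + V_{k+1}^{*}\big(F(x,u)\big) \big], \qquad k = N-1,\ldots,0,
\]

solved backwards from a prescribed terminal cost \(V_N^{*}\), and the stationary HJB equation for a continuous-time, infinite-horizon problem with dynamics \(\dot{x} = f(x,u)\),

\[
0 \;=\; \min_{u}\big[\, r(x,u) + \nabla V^{*}(x)^{\top} f(x,u) \big],
\qquad
u^{*}(x) \;=\; \arg\min_{u}\big[\, r(x,u) + \nabla V^{*}(x)^{\top} f(x,u) \big].
\]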

The real-time adaptive learning of optimal controllers for complex unknown systems has been solved in nature. Every agent or system is concerned with acting on its environment in such a way as to achieve its goals. Agents seek to learn how to collaborate to improve their chances of survival and increase. The idea that there is a cause and effect relation between actions and rewards is inherent in animal learning. Most organisms in nature act in an optimal fashion to conserve resources while achieving their goals. It is possible to study natural methods of learning and use them to develop computerized machine learning methods that solve sequential decision problems.

Reinforcement learning (RL) describes a family of machine learning systems that operate on principles observed in animals, social groups, and other naturally occurring systems. RL methods were used by Ivan Pavlov in the 1890s to train his dogs. RL refers to an actor or agent that interacts with its environment and modifies its actions, or control policies, based on the stimuli received in response to those actions. The computational intelligence community has developed RL methods that solve optimal decision problems in real time and do not require analytical system models. RL algorithms are built on the idea that successful control decisions should be remembered, by means of a reinforcement signal, so that they become more likely to be used again; likewise, successful collaborating groups should be reinforced. Although the idea originates in experimental animal learning, RL also has strong support from neurobiology, where the dopamine neurotransmitter in the basal ganglia has been observed to act as a reinforcement signal that favors learning at the level of individual neurons in the brain. RL techniques were first developed for Markov decision processes with finite state spaces and have since been extended to the control of dynamical systems with infinite state spaces.
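
A minimal sketch of this "reinforce what worked" idea is tabular Q-learning for a finite Markov decision process, shown below. The environment interface env_step, the state and action lists, and every numeric parameter are hypothetical placeholders invented for this illustration; none of them are drawn from a chapter of the book.

```python
# Minimal tabular Q-learning sketch (illustration only).
# Assumes env_step(state, action) -> (next_state, reward, done).
import random
from collections import defaultdict

def q_learning(env_step, states, actions, episodes=500, max_steps=200,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(state, action)] -> estimated return

    for _ in range(episodes):
        s = random.choice(states)               # hypothetical start-state distribution
        for _ in range(max_steps):
            # epsilon-greedy exploration: usually exploit, occasionally try something new
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])

            s_next, r, done = env_step(s, a)

            # temporal-difference update: the reinforcement signal (reward plus the
            # discounted value of the best next action) pulls Q(s, a) toward it
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

            if done:
                break
            s = s_next
    return Q
```

The same update, applied online as the system runs and combined with function approximation instead of a table, is the setting in which most of the chapters in this book operate.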

One class of RL methods is based on the actor–critic structure, in which an actor component applies an action or control policy to the environment while a critic component assesses the value of that action. Actor–critic structures are particularly well suited to solving optimal decision problems in real time through reinforcement learning techniques. Approximate dynamic programming (ADP) refers to a family of practical actor–critic methods for finding optimal solutions in real time. These techniques use computational enhancements such as function approximation to develop practical algorithms for complex systems with disturbances and uncertain dynamics. The ADP approach has now become a key direction for future research in understanding brain intelligence and building intelligent systems.
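
The sketch below shows how an actor and a critic might each be updated from a single observed transition when both use linear function approximation. The feature map phi, the Gaussian exploration policy, and the learning rates are assumptions made purely for this illustration and are not a method prescribed by the book.

```python
# One actor-critic learning step with linear function approximation (illustrative sketch).
import numpy as np

def actor_critic_step(w, theta, phi, s, a, r, s_next,
                      gamma=0.95, alpha_critic=0.01, alpha_actor=0.001, sigma=0.1):
    """Update from one observed transition (s, a, r, s_next).

    w     : critic weights; value estimate is V(s) = w @ phi(s)
    theta : actor weights; mean action is mu(s) = theta @ phi(s)
    phi   : assumed feature map taking a state to a 1-D numpy array
    """
    f, f_next = phi(s), phi(s_next)

    # Critic: the temporal-difference error measures how much better (or worse)
    # the outcome was than the critic expected
    delta = r + gamma * float(w @ f_next) - float(w @ f)
    w = w + alpha_critic * delta * f

    # Actor: policy-gradient step for a Gaussian policy with mean theta @ phi(s);
    # the critic's assessment (delta) scales how strongly the action is reinforced
    mu = float(theta @ f)
    theta = theta + alpha_actor * delta * ((a - mu) / sigma**2) * f

    return w, theta
```

In an online setting this step would be applied at every sampled transition, so the actor improves its policy while the critic simultaneously improves its value estimate.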

The purpose of this book is to give an exposition of recently developed RL and ADP techniques for decision and control in human-engineered systems. Both single-player decision and control and multiplayer games are included. RL is strongly connected, from a theoretical point of view, with both adaptive learning control and optimal control methods. There has been a great deal of interest in RL, and recent work has shown that ideas based on ADP can be used to design a family of adaptive learning algorithms that converge in real time to optimal control solutions by measuring data along the system trajectories. The study of RL and ADP requires methods from many fields, including computational intelligence, automatic control systems, Markov decision processes, stochastic games, psychology, operations research, cybernetics, neural networks, and neurobiology. This book therefore brings together ideas from many communities.

This book has three parts. Part I develops methods for feedback control of systems based on RL and ADP. Part II treats learning and control in multiagent games. Part III presents ideas of fundamental importance for understanding and implementing decision algorithms in Markov processes.

F.L. Lewis
Fort Worth, TX

Derong Liu
Chicago, IL

Contributors

Eduardo Alonso, School of Informatics, City University, London, UK

Charles W. Anderson, Department of Computer Science, Colorado State University, Fort Collins, CO, USA

Titus Appel, MARHES Lab, Department of Electrical & Computer Engineering, University of New Mexico, Albuquerque, NM, USA

Khalid Aziz, Department of Energy Resources Engineering, Stanford University, Stanford, CA, USA

Robert Babuska, Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands

S.N. Balakrishnan, Department of Mechanical and Aerospace Engineering, Missouri University of Science and Technology, Rolla, MO, USA

Tamer Başar, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL, USA

Dimitri Bertsekas, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA

Shubhendu Bhasin, Department of Electrical Engineering, Indian Institute of Technology, Delhi, India

Shalabh Bhatnagar, Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India

V.S. Borkar, Department of Electrical Engineering, Indian Institute of Technology, Powai, Mumbai, India

Lucian Busoniu, Université de Lorraine, CRAN, UMR 7039 and CNRS, CRAN, UMR 7039, Vandœuvre-lès-Nancy, France

Xi-Ren Cao, Shanghai Jiaotong University, Shanghai, China

W. Chen, Coordinated Science Laboratory, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA

Vijay Desai, Industrial Engineering and Operations Research, Columbia University, New York, NY, USA

Gianluca Di Muro, Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC, USA

Jie Ding, Department of Mechanical and Aerospace Engineering, Missouri University of Science and Technology, Rolla, MO, USA

Warren E. Dixon, Department of Mechanical and Aerospace Engineering, University of Florida, FL, USA

Louis J. Durlofsky, Department of Energy Resources Engineering, Stanford University, Stanford, CA, USA

Krishnamurthy Dvijotham, Computer Science and Engineering, University of Washington, Seattle, WA, USA

Michael Fairbank, School of Informatics, City University, London, UK

Vivek Farias, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA, USA

Silvia Ferrari, Laboratory for Intelligent Systems and Control (LISC), Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC, USA

Rafael Fierro, MARHES Lab, Department of Electrical & Computer Engineering, University of New Mexico, Albuquerque, NM, USA

Haibo He, Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI, USA

Ali Heydari, Department of Mechanical and Aerospace Engineering, Missouri University of Science and Technology, Rolla, MO, USA

Dayu Huang, Coordinated Science Laboratory, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA

S. Jagannathan, Electrical and Computer Engineering Department, Missouri University of Science and Technology, Rolla, MO, USA

Qing-Shan Jia, Department of Automation, Tsinghua University, Beijing, China

Yu Jiang, Department of Electrical and Computer Engineering, Polytechnic Institute of New York University, Brooklyn, NY, USA

Marcus Johnson, Department of Mechanical and Aerospace Engineering, University of Florida, FL, USA

Zhong-Ping Jiang, Department of Electrical and Computer Engineering, Polytechnic Institute of New York University, Brooklyn, NY, USA

Rushikesh Kamalapurkar, Department of Mechanical and Aerospace Engineering, University of Florida, FL, USA

Kenton Kirkpatrick, Department of Aerospace Engineering, Texas A&M University, College Station, TX, USA

J. Nate Knight, Numerica Corporation, Loveland, CO, USA

F.L. Lewis, UTA Research Institute, University of Texas, Arlington, TX, USA

Derong Liu, State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, P.R. China

Chao Lu, Department of Electrical Engineering, Tsinghua University, Beijing, P. R. China

Ron Lumia, Department of Mechanical Engineering, University of New Mexico, Albuquerque, NM, USA

P. Mehta, Coordinated Science Laboratory, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA

Sean Meyn, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA

Ciamac Moallemi, Graduate School of Business, Columbia University, New York, NY, USA

Remi Munos, SequeL team, INRIA Lille – Nord Europe, France

Zhen Ni, Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI, USA

Warren B. Powell, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ, USA

L.A. Prashanth, Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India

Danil Prokhorov, Toyota Research Institute North America, Toyota Technical Center, Ann Arbor, MI, USA

Armando A. Rodriguez, School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, USA

Brandon Rohrer, Sandia National Laboratories, Albuquerque, NM, USA

Keith Rudd, Laboratory for Intelligent Systems and Control (LISC), Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC, USA

I.O. Ryzhov, Department of Decision, Operations and Information Technologies, Robert H. Smith School of Business, University of Maryland, College Park, MD, USA

John Seiffertt, Department of Electrical and Computer Engineering, Missouri University of Science & Technology, Rolla, MO, USA

Jennie Si, School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, USA

A. Surana, United Technologies Research Center, East Hartford, CT, USA

Hamidou Tembine, Telecommunication Department, Supelec, Gif sur Yvette, France

Emanuel Todorov, Applied Mathematics, Computer Science and Engineering, University of Washington, Seattle, WA, USA

Kostas S. Tsakalis, School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, USA

John Valasek, Department of Aerospace Engineering, Texas A&M University, College Station, TX, USA

K. Vamvoudakis, Center for Control, Dynamical-Systems and Computation, University of California, Santa Barbara, CA, USA

Benjamin Van Roy, Department of Management Science and Engineering and Department of Electrical Engineering, Stanford University, Stanford, CA, USA

Draguna Vrabie, United Technologies Research Center, East Hartford, CT, USA

Ding Wang, State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, P.R. China

Zheng Wen, Department of Electrical Engineering, Stanford University, Stanford, CA, USA

Paul Werbos, National Science Foundation, Arlington, VA, USA

John Wood, Department of Mechanical Engineering, University of New Mexico, Albuquerque, NM, USA

Don Wunsch, Department of Electrical and Computer Engineering, Missouri University of Science & Technology, Rolla, MO, USA

Lei Yang, College of Information and Control Science and Engineering, Zhejiang University, Hangzhou, China

Qinmin Yang, State Key Laboratory of Industrial Control Technology, Department of Control Science and Engineering, Zhejiang University, Hangzhou, Zhejiang, China

Hassan Zargarzadeh, Embedded Systems and Networking Laboratory, Electrical and Computer Engineering Department, Missouri University of Science and Technology, Rolla, MO, USA

Dongbin Zhao, State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Qianchuan Zhao, Department of Automation, Tsinghua University, Beijing, China

Yanjia Zhao, Department of Automation, Tsinghua University, Beijing, China

Quanyan Zhu, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL, USA

Part I

Feedback Control Using RL And ADP

Chapter 1

Reinforcement Learning and Approximate Dynamic Programming (RLADP)—Foundations, Common Misconceptions, and the Challenges Ahead

Paul J. Werbos

National Science Foundation (NSF), Arlington, VA, USA

Abstract

Many new formulations of reinforcement learning and approximate dynamic programming (RLADP) have appeared in recent years, as it has grown in control applications, control theory, operations research, computer science, robotics, and efforts to understand brain intelligence. The chapter reviews the foundations and challenges common to all these areas, in a unified way but with reference to their variations. It highlights cases where experience in one area sheds light on obstacles or common misconceptions in another. Many common beliefs about the limits of RLADP are based on such obstacles and misconceptions, for which solutions already exist. Above all, this chapter pinpoints key opportunities for future research important to the field as a whole and to the larger benefits it offers.

1.1 Introduction

The field of reinforcement learning and approximate dynamic programming (RLADP) has undergone enormous expansion since about 1988 [1], the year of the first NSF workshop on Neural Networks for Control, which evaluated RLADP as one of several important new tools for intelligent control, with or without neural networks. Since then, RLADP has grown enormously in many disciplines of engineering, computer science, and cognitive science, especially in neural networks, control engineering, operations research, robotics, machine learning, and efforts to reverse engineer the higher intelligence of the brain. In 1988, when I began funding this area, many people viewed the area as a small and curious niche within a small niche, but by the year 2006, when the Directorate of Engineering at NSF was reorganized, many program directors said “we all do ADP now.”

Many new tools, serious applications, and stability theorems have appeared, and are still appearing, in ever greater numbers. But at the same time, a wide variety of misconceptions about RLADP have appeared, even within the field itself. The sheer variety of methods and approaches has made it ever more difficult for people to appreciate the underlying unity of the field and of the mathematics, and to take advantage of the best tools and concepts from all parts of the field. At NSF, I have often seen cases where the most advanced and accomplished researchers in the field have become stuck because of fundamental questions or assumptions that were taken care of 30 years before, in a different part of the field. The goal of this chapter is to provide a kind of unified view of the past, present, and future of this field, to address those challenges. I will review many points that, though basic, continue to be obstacles to progress. I will also focus on the larger, long-term research goal of building real-time learning systems which can cope effectively with the degree of system complexity, nonlinearity, random disturbance, computer hardware complexity, and partial observability which even a mouse brain somehow seems to be able to handle [2]. I will also try to clarify issues of notation that have become more and more of a problem as the field grows more diverse. I will try to make this chapter accessible to people across multiple disciplines, but will often make side comments for specialists in different disciplines—as in the next paragraph.

Optimal control, robust control, and adaptive control are often seen as the three main pillars of modern control theory. ADP may be seen as part of optimal control, the part that seeks computationally feasible general methods for the nonlinear stochastic case. It may be seen as a computational tool to find the most accurate possible solutions, subject to computational constraints, to the HJB equation, as required by general nonlinear robust control. It may be formulated as an extension of adaptive control which, because of the implicit “look ahead,” achieves stability under much weaker conditions than the well-known forms of direct and indirect adaptive control. The most impressive practical applications so far have involved highly nonlinear challenges, such as missile interception [3] and continuous production of carbon–carbon thermoplastic parts [4].

Continue reading in the full edition!