Model-Based Reinforcement Learning

Explore a comprehensive and practical approach to reinforcement learning

Reinforcement learning is an essential paradigm of machine learning, wherein an intelligent agent takes actions that drive a device toward optimal behavior. While this paradigm has gained tremendous success and popularity in recent years, previous scholarship has focused either on theory (optimal control and dynamic programming) or on algorithms (most of which are simulation-based). Model-Based Reinforcement Learning provides a model-based framework to bridge these two aspects, thereby creating a holistic treatment of the topic of model-based online learning control. In doing so, the authors develop a model-based framework for data-driven control that bridges system identification from data, model-based reinforcement learning, and optimal control, as well as the applications of each. This model-based treatment of classical results allows for a more efficient reinforcement learning system. At its heart, this book provides an end-to-end framework, from design to application, of a tractable model-based reinforcement learning technique.

Readers of Model-Based Reinforcement Learning will also find:

* A useful textbook for graduate courses on data-driven and learning-based control that emphasizes modeling and control of dynamical systems from data
* Detailed comparisons of the impact of different techniques, such as the basic linear quadratic controller, learning-based model predictive control, model-free reinforcement learning, and structured online learning
* Applications and case studies on ground vehicles with nonholonomic dynamics and on quadrotor helicopters
* An online, Python-based toolbox that accompanies the contents covered in the book, as well as the necessary code and data

Model-Based Reinforcement Learning is a useful reference for senior undergraduate students, graduate students, research assistants, professors, process control engineers, and roboticists.
Page count: 327
Year of publication: 2022
Cover
Title Page
Copyright
About the Authors
Preface
Acronyms
Introduction
I.1 Background and Motivation
I.2 Literature Review
Bibliography
1 Nonlinear Systems Analysis
1.1 Notation
1.2 Nonlinear Dynamical Systems
1.3 Lyapunov Analysis of Stability
1.4 Stability Analysis of Discrete Time Dynamical Systems
1.5 Summary
Bibliography
2 Optimal Control
2.1 Problem Formulation
2.2 Dynamic Programming
2.3 Linear Quadratic Regulator
2.4 Summary
Bibliography
Notes
3 Reinforcement Learning
3.1 Control‐Affine Systems with Quadratic Costs
3.2 Exact Policy Iteration
3.3 Policy Iteration with Unknown Dynamics and Function Approximations
3.4 Summary
Bibliography
Note
4 Learning of Dynamic Models
4.1 Introduction
4.2 Model Selection
4.3 Parametric Model
4.4 Parametric Learning Algorithms
4.5 Persistence of Excitation
4.6 Python Toolbox
4.7 Comparison Results
4.8 Summary
Bibliography
5 Structured Online Learning‐Based Control of Continuous‐Time Nonlinear Systems
5.1 Introduction
5.2 A Structured Approximate Optimal Control Framework
5.3 Local Stability and Optimality Analysis
5.4 SOL Algorithm
5.5 Simulation Results
5.6 Summary
Bibliography
6 A Structured Online Learning Approach to Nonlinear Tracking with Unknown Dynamics
6.1 Introduction
6.2 A Structured Online Learning for Tracking Control
6.3 Learning‐based Tracking Control Using SOL
6.4 Simulation Results
6.5 Summary
Bibliography
7 Piecewise Learning and Control with Stability Guarantees
7.1 Introduction
7.2 Problem Formulation
7.3 The Piecewise Learning and Control Framework
7.4 Analysis of Uncertainty Bounds
7.5 Stability Verification for Piecewise‐Affine Learning and Control
7.6 Numerical Results
7.7 Summary
Bibliography
8 An Application to Solar Photovoltaic Systems
8.1 Introduction
8.2 Problem Statement
8.3 Optimal Control of PV Array
8.4 Application Considerations
8.5 Simulation Results
8.6 Summary
Bibliography
9 An Application to Low‐level Control of Quadrotors
9.1 Introduction
9.2 Quadrotor Model
9.3 Structured Online Learning with RLS Identifier on Quadrotor
9.4 Numerical Results
9.5 Summary
Bibliography
10 Python Toolbox
10.1 Overview
10.2 User Inputs
10.3 SOL
10.4 Display and Outputs
10.5 Summary
Bibliography
Appendix
A.1 Supplementary Analysis of Remark 5.4
A.2 Supplementary Analysis of Remark 5.5
Index
End User License Agreement
Chapter 5
Table 5.1 The system dynamics and the corresponding value function obtained...
Chapter 8
Table 8.1 Nomenclature.
Table 8.2 Electrical data of the CS6X‐335M‐FG module [Canadian Solar, 2022]...
Chapter 9
Table 9.1 The coefficients of the simulated Crazyflie.
Chapter 4
Figure 4.1 A view of parametric and nonparametric learning methods in terms ...
Figure 4.2 We assume the exact bases for the model are not known. Hence, only...
Figure 4.3 Assuming that the exact set of bases are included in the model, w...
Figure 4.4 A comparison of the RMSE for different algorithms while learning....
Figure 4.5 We assume only some of the exact bases that are included in the b...
Figure 4.6 We assume that the exact bases are known, and they are included i...
Figure 4.7 Assuming that the exact bases are included in the set of bases c...
Figure 4.8 A comparison of runtimes for different algorithms.
Figure 4.9 The effect of increasing number of bases on the runtime of differ...
Figure 4.10 The runtime of GD and RLS are compared in terms of the increasin...
Figure 4.11 The runtime of SINDy and LS are compared in terms of the increas...
Chapter 5
Figure 5.1 Responses of the Lorenz system while learning by using the approx...
Figure 5.2 The value, components of , and prediction error corresponding to...
Figure 5.3 Responses of the cartpole system while learning by using the appr...
Figure 5.4 The value, components of , and prediction error corresponding to...
Figure 5.5 Convergence of the components of  to the LQR solution, obtained ...
Figure 5.6 For the linear system, we illustrate that the feedback gain obtai...
Figure 5.7 The state trajectories and the control signal while learning that...
Figure 5.8 (a). Controlled and uncontrolled trajectories of the suspension s...
Figure 5.9 Responses of the double‐inverted pendulum system while learning b...
Figure 5.10 The value, components of , and prediction error corresponding t...
Figure 5.11 A view of the graphical simulations of the benchmark cartpole an...
Chapter 6
Figure 6.1 The control and states of the pendulum system within a run of the...
Figure 6.2 The control and states of the pendulum system within a run of the...
Figure 6.3 The control and states of the pendulum system within a run of the...
Figure 6.4 A view of the 3D simulation done for synchronizing the chaotic Lo...
Figure 6.5 The states and the obtained control of the Lorenz system while le...
Figure 6.6 The evolutions of the value and parameters while learning the tra...
Chapter 7
Figure 7.1 A scheme of obtaining a continuous piecewise model is illustrated...
Figure 7.2 Subfigures (a)–(f) denote the sample gaps located for different n...
Figure 7.3 The scheme for obtaining the uncertainty bound according to the s...
Figure 7.4 A view of the second dynamic of pendulum system (5.23) assuming
Figure 7.5 The procedure for learning the dynamics by the PWA model is illus...
Figure 7.6 To better illustrate the learning procedure, the step‐by‐step res...
Figure 7.7 The step‐by‐step results of the sampling procedure and the sample...
Figure 7.8 (a) The obtained ROA of the closed loop PWA system is illustrated...
Figure 7.9 (a) The state and control signals are illustrated within an episo...
Figure 7.10 A comparison of the runtime results for the identification and c...
Chapter 8
Figure 8.1 DC‐DC boost converter used to interface the load to the solar arr...
Figure 8.2 The equivalent electrical model of the solar array.
Figure 8.3 Output results of the simulated solar array by changing the irrad...
Figure 8.4 Output results of the simulated solar array by changing the ambie...
Figure 8.5 Evolutions of the operating point of system on – curve before ...
Figure 8.6 Evolutions of the operating point of system on – curve before a...
Figure 8.7 A sketch of the simulated solar PV system together with the propo...
Figure 8.8 The obtained control signal (NOC) compared to SMC and second‐orde...
Figure 8.9 Comparison results of the output power under the changing irradia...
Figure 8.10 The obtained control signal (NOC) compared to SMC and second‐ord...
Figure 8.11 Comparison results of the output power under the changing ambien...
Figure 8.12 The results obtained by simulating the system with the proposed ...
Figure 8.13 ...
Figure 8.14 The learning result of the solar PV system, given by Figure 8.13...
Figure 8.15 The solar PV array illustrating the partial shading condition co...
Figure 8.16 Output voltage and current signals of the solar PV array togethe...
Chapter 9
Figure 9.1 The histogram of the runtime of the identification and the contro...
Figure 9.2 A video of the training procedure can be found at https://youtu.b...
Figure 9.3 The attitude control results in the learning procedure illustrate...
Figure 9.4 The position control results in the learning procedure illustrate...
Figure 9.5 The model coefficients, identified by RLS, are shown within a run...
Figure 9.6 The PWM inputs of the quadrotor generated by the learned control....
Figure 9.7 The parameters of the value function within a sample run together...
Chapter 10
Figure 10.1 A view of the Structured Online Learning (SOL) toolbox is shown....
Figure 10.2 The Process class is illustrated including the methods available...
Figure 10.3 The objective class that is defined based on an LQR cost.
Figure 10.4 A view of the procedure for updating the model is given. The sam...
Figure 10.5 The Database class is illustrated, where given the samples of t...
Figure 10.6 Using the samples of the system and the set of bases chosen by t...
Figure 10.7 The Control class is illustrated, where the Objective and Librar...
Figure 10.8 The class shown is used to record and visualize the simulation r...
Figure 10.9 The 3D graphical simulation is handled by the illustrated object...
IEEE Press
445 Hoes Lane
Piscataway, NJ 08854
IEEE Press Editorial Board
Sarah Spurgeon, Editor in Chief
Jón Atli Benediktsson
Anjan Bose
Adam Drobot
Peter (Yong) Lian
Andreas Molisch
Saeid Nahavandi
Diomidis Spinellis
Ahmet Murat Tekalp
Jeffrey Reed
Thomas Robertazzi
Milad Farsi and Jun Liu
University of Waterloo, Ontario, Canada
IEEE Press Series on Control Systems Theory and Applications
Maria Domenica Di Benedetto, Series Editor
Copyright © 2023 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging‐in‐Publication Data applied for:
Hardback ISBN: 9781119808572
Cover Design: Wiley
Cover Images: © Pobytov/Getty Images; Login/Shutterstock; Sazhnieva Oksana/Shutterstock
Milad Farsi received a B.S. degree in Electrical Engineering (Electronics) from the University of Tabriz in 2010 and an M.S. degree in Electrical Engineering (Control Systems) from Sahand University of Technology in 2013. He gained industrial experience as a Control Systems Engineer between 2012 and 2016. He received a Ph.D. degree in Applied Mathematics from the University of Waterloo, Canada, in 2022 and is currently a Postdoctoral Fellow at the same institution. His research interests include control systems, reinforcement learning, and their applications in robotics and power electronics.
Jun Liu received a B.S. degree in Applied Mathematics from Shanghai Jiao Tong University in 2002, an M.S. degree in Mathematics from Peking University in 2005, and a Ph.D. degree in Applied Mathematics from the University of Waterloo, Canada, in 2010. He is currently an Associate Professor of Applied Mathematics and a Canada Research Chair in Hybrid Systems and Control at the University of Waterloo, where he directs the Hybrid Systems Laboratory. From 2012 to 2015, he was a Lecturer in Control and Systems Engineering at the University of Sheffield. During 2011 and 2012, he was a Postdoctoral Scholar in Control and Dynamical Systems at the California Institute of Technology. His main research interests are in the theory and applications of hybrid systems and control, including rigorous computational methods for control design with applications in cyber-physical systems and robotics.
The subject of Reinforcement Learning (RL) is popularly associated with the psychology of animal learning through a trial‐and‐error mechanism. The underlying mathematical principle of RL techniques, however, is undeniably the theory of optimal control, as exemplified by landmark results in the late 1950s on dynamic programming by Bellman, the maximum principle by Pontryagin, and the Linear Quadratic Regulator (LQR) by Kalman. Optimal control itself has its roots in the much older subject of calculus of variations, which dates back to the late 1600s. Pontryagin's maximum principle and the Hamilton–Jacobi–Bellman (HJB) equation are the two main pillars of optimal control, the latter of which provides feedback control strategies through an optimal value function, whereas the former characterizes open‐loop control signals.
Reinforcement learning was developed by Barto and Sutton in the 1980s, inspired by animal learning and behavioral psychology. The subject has experienced a resurgence of interest in both academia and industry over the past decade, amid the new explosive wave of AI and machine learning research. A notable recent success of RL was in tackling the otherwise seemingly intractable game of Go and defeating the world champion in 2016.
Arguably, the problems originally solved by RL techniques are mostly discrete in nature: for example, navigating mazes and playing video games, where both the states and actions are discrete (finite), or simple control tasks such as pole balancing with impulsive forces, where the actions (controls) are chosen to be discrete. More recently, researchers have started to investigate RL methods for problems with both continuous state and action spaces. On the other hand, classical optimal control problems by definition have continuous state and control variables. It seems natural to simply formulate optimal control problems in a more general way and develop RL techniques to solve them. Nonetheless, there are two main challenges in solving such optimal control problems from a computational perspective. First, most techniques require exact or at least approximate model information. Second, the computation of optimal value functions and feedback controls often suffers from the curse of dimensionality. As a result, such methods are often too slow to be applied in an online fashion.
The book was motivated by this very challenge of developing computationally efficient methods for online learning of feedback controllers for continuous control problems. A main part of this book is based on the PhD thesis of the first author, which presented a Structured Online Learning (SOL) framework for computing feedback controllers by forward integration of a state‐dependent differential Riccati equation along state trajectories. In the special case of Linear Time‐Invariant (LTI) systems, this reduces to solving the well‐known LQR problem without prior knowledge of the model. The first part of the book (Chapters 1–3) provides some background materials including Lyapunov stability analysis, optimal control, and RL for continuous control problems. The remaining part (Chapters 4–9) discusses the SOL framework in detail, covering both regulation and tracking problems, their further extensions, and various case studies.
The first author would like to convey his heartfelt thanks to those who encouraged and supported him during his research. The second author is grateful to the mentors, students, colleagues, and collaborators who have supported him throughout his career. We gratefully acknowledge financial support for the research through the Natural Sciences and Engineering Research Council of Canada, the Canada Research Chairs Program, and the Ontario Early Researcher Award Program.
Waterloo, Ontario, Canada
Milad Farsi and Jun Liu
April 2022
ACCPM  analytic center cutting‐plane method
ARE  algebraic Riccati equation
DNN  deep neural network
DP  dynamic programming
DRE  differential Riccati equation
FPRE  forward‐propagating Riccati equation
GD  gradient descent
GUAS  globally uniformly asymptotically stable
GUES  globally uniformly exponentially stable
HJB  Hamilton–Jacobi–Bellman
KDE  kernel density estimation
LMS  least mean squares
LQR  linear quadratic regulator
LQT  linear quadratic tracking
LS  least squares
LTI  linear time‐invariant
MBRL  model‐based reinforcement learning
MDP  Markov decision process
MIQP  mixed‐integer quadratic program
MPC  model predictive control
MPP  maximum power point
MPPT  maximum power point tracking
NN  neural network
ODE  ordinary differential equation
PDE  partial differential equation
PE  persistence of excitation
PI  policy iteration
PV  photovoltaic
PWA  piecewise affine
PWM  pulse‐width modulation
RL  reinforcement learning
RLS  recursive least squares
RMSE  root mean square error
ROA  region of attraction
SDRE  state‐dependent Riccati equation
SINDy  sparse identification of nonlinear dynamics
SMC  sliding mode control
SOL  structured online learning
SOS  sum of squares
TD  temporal difference
UAS  uniformly asymptotically stable
UES  uniformly exponentially stable
VI  value iteration
Optimal control theory plays an important role in designing effective control systems. For linear systems, a class of optimal control problems is solved successfully under the framework of the Linear Quadratic Regulator (LQR). LQR problems are concerned with minimizing a quadratic cost for linear systems, expressed in terms of the control input and the state; solving them allows us to regulate the state and the control input of the system. In control applications, this provides an opportunity to shape the behavior of the system by adjusting the weighting coefficients used in the cost functional. When it comes to nonlinear dynamical systems, however, there is no systematic method for efficiently obtaining an optimal feedback control for general nonlinear systems. Thus, many of the techniques available in the literature on linear systems do not apply in general.
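To make the LQR setting concrete, the following minimal Python sketch (illustrative only, not taken from the book's toolbox) computes an infinite-horizon LQR gain for a double-integrator model using SciPy's continuous-time algebraic Riccati solver; the matrices A, B, Q, and R are arbitrary example choices.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative double-integrator dynamics: x_dot = A x + B u
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])

# Example quadratic cost weights (assumptions, not from the book)
Q = np.diag([1.0, 0.1])   # state weighting
R = np.array([[0.01]])    # control weighting

# Solve the continuous-time algebraic Riccati equation
# A^T P + P A - P B R^{-1} B^T P + Q = 0
P = solve_continuous_are(A, B, Q, R)

# Optimal state feedback: u = -K x with K = R^{-1} B^T P
K = np.linalg.solve(R, B.T @ P)
print("LQR gain K =", K)
```

Adjusting the entries of Q and R shifts the trade-off between how aggressively the state is regulated and how much control effort is spent, which is the weighting mechanism referred to above.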
Despite the complexity of nonlinear dynamical systems, they have attracted much attention from researchers in recent years. This is mostly because of their practical benefits in establishing a wide variety of applications in engineering, including power electronics, flight control, and robotics, among many others. Considering the control of a general nonlinear dynamical system, optimal control involves finding a control input that minimizes a cost functional that depends on the controlled state trajectory and the control input. While such a problem formulation can cover a wide range of applications, how to efficiently solve such problems remains a topic of active research.
In general, there exist two well‐known approaches to solving such optimal control problems: the maximum (or minimum) principles [Pontryagin, 1987] and the Dynamic Programming (DP) method [Bellman and Dreyfus, 1962]. To solve an optimization problem that involves dynamics, maximum principles require us to solve a two‐point boundary value problem, where the solution is not in a feedback form.
Plenty of numerical techniques have been presented in the literature to solve the optimal control problem. Such approaches generally rely on knowledge of the exact model of the system. When such a model exists, the optimal control input is obtained in open‐loop form as a time‐dependent signal. Consequently, implementing these approaches in real‐world problems often involves many complications that are well known to the control community. This is because model mismatch, noise, and disturbances greatly affect the online solution, causing it to diverge from the preplanned offline solution. Therefore, obtaining a closed‐loop solution of the optimal control problem is often preferred in such applications.
The DP approach analytically results in a feedback control for linear systems with a quadratic cost. Moreover, employing the Hamilton‐Jacobi‐Bellman (HJB) equation with a value function, one might manage to derive an optimal feedback control rule for some real‐world applications, provided that the value function can be updated in an efficient manner. This motivates us to consider conditions leading to an optimal feedback control rule that can be efficiently implemented in real‐world problems.
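For reference, with dynamics \(\dot{x} = f(x,u)\) and running cost \(l(x,u)\), the steady-state HJB equation and the associated optimal feedback take the standard form below; the notation here is generic rather than the book's exact formulation.

```latex
0 = \min_{u}\left[\, l(x,u) + \nabla V^{*}(x)^{\top} f(x,u) \,\right],
\qquad
u^{*}(x) = \operatorname*{arg\,min}_{u}\left[\, l(x,u) + \nabla V^{*}(x)^{\top} f(x,u) \,\right].
```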
Consider an optimal control problem over an infinite horizon involving a nonquadratic performance measure. Using the idea of inverse optimal control, the cost functional can then be evaluated in closed form as long as the running cost depends in a suitable way on an underlying Lyapunov function by which the asymptotic stability of the nonlinear closed‐loop system is guaranteed. It can then be shown that this Lyapunov function is indeed the solution of the steady‐state HJB equation. Although such a formulation allows an optimal feedback rule to be obtained analytically, choosing a proper performance measure may not be trivial. Moreover, from a practical point of view, the nonlinearity in the performance measure might cause unpredictable behavior.
A well‐studied method for solving an optimal control problem online is to employ a value function for a given policy. For any state, the value function then measures how good that state is by accumulating the cost incurred when starting from that state and applying the policy. If such a value function can be obtained, and the system model is known, the optimal policy is the one that takes the system in the direction along which the value decreases the most in the state space. Such Reinforcement Learning (RL) techniques, known as value‐based methods and including the Value Iteration (VI) and Policy Iteration (PI) algorithms, have been shown to be effective for finite state and control spaces. However, the computations cannot efficiently scale with the size of the state and control spaces.
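As a point of reference for how such value-based methods behave in the finite case, here is a minimal tabular value-iteration sketch; the transition tensor P and stage-cost matrix C are hypothetical inputs, and each sweep already costs on the order of the product of the state and action space sizes, which is exactly what prevents a direct extension to continuous spaces.

```python
import numpy as np

def value_iteration(P, C, gamma=0.95, tol=1e-8):
    """Tabular value iteration for a finite MDP with stage costs.

    P: transition probabilities, shape (S, A, S), P[s, a, s2] = Pr(s2 | s, a)
    C: expected stage costs, shape (S, A)
    Returns the optimal cost-to-go vector and a greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup: minimize expected cost-to-go over actions
        Q = C + gamma * np.einsum("sap,p->sa", P, V)
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmin(axis=1)
```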
One way of facilitating the value-update computations is to employ an approximation scheme. This is done by parameterizing the value function and adjusting the parameters in the training process. The optimal policy given by the value function is then also parameterized and approximated accordingly. The complexity of any value update depends directly on the number of parameters employed, and one may try to limit the number of parameters at the cost of optimality. Therefore, we are motivated to obtain a more efficient update rule for the value parameters, rather than to limit their number. We achieve this by reformulating the problem with a quadratically parameterized value function.
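For concreteness, a quadratically parameterized value function of the kind referred to here can be written with a vector of basis functions \(\Phi(x)\) and a symmetric parameter matrix \(P\); the symbols below are generic placeholders rather than the book's exact notation.

```latex
V(x) = \Phi(x)^{\top} P \,\Phi(x), \qquad P = P^{\top},
```

so that the update rule acts on the entries of the single matrix \(P\) rather than on an unstructured parameter vector.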
Moreover, the classical VI algorithm does not explicitly use the system model for evaluating the policy. This benefits applications in that full knowledge of the system dynamics is no longer required. However, online training with VI alone may take much longer to converge, since the model only participates implicitly through the future state. Therefore, the learning process can potentially be accelerated by introducing the system model. Furthermore, this creates an opportunity for running a separate identifier unit, where the model obtained can be simulated offline to complete the training or can be used for learning optimal policies for different objectives.
It can be shown that the VI algorithm for linear systems results in a Lyapunov recursion in the policy evaluation step. Such a Lyapunov equation in terms of the system matrices can be solved efficiently. However, for the general nonlinear case, no equivalent recursion is available that admits an efficient solution. Hence, we are motivated to investigate the possibility of acquiring an efficient update rule for nonlinear systems.
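As a reminder of why the linear case is tractable, for a discrete-time linear system \(x_{k+1} = A x_k + B u_k\) with stage cost \(x_k^{\top} Q x_k + u_k^{\top} R u_k\) and a fixed feedback gain \(u_k = -K x_k\), the policy-evaluation step takes the form of the Lyapunov-type recursion below (a standard identity in generic notation, not specific to this book):

```latex
P_{j+1} = Q + K^{\top} R K + (A - BK)^{\top} P_{j}\,(A - BK),
```

which involves only products of the system matrices and therefore scales polynomially with the state dimension.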
One of the most common problems in the control of dynamical systems is to track a desired reference trajectory, which arises in a variety of real‐world applications. However, designing an efficient tracking controller using conventional methods often necessitates a thorough understanding of the model, as well as computations and considerations specific to each application. RL approaches, on the other hand, offer a more flexible framework that requires less information about the system dynamics. While this may create additional problems, such as safety concerns or computational limits, there are already promising outcomes from the use of such approaches in real‐world settings. Similar to regulation problems, tracking control applications can benefit from Model‐based Reinforcement Learning (MBRL), which can handle the parameter updates more efficiently.
In the approximate optimal control technique, employing a limited number of parameters can only yield a local approximation of the model and the value function. If an approximation over a larger domain is intended, a considerably higher number of parameters may be needed. As a result, the complexity of the identification and the controller might be too high to run online in real‐world applications. This motivates us to circumvent this constraint by instead considering a set of simple local learners in a piecewise approach.
As mentioned, interesting real‐world applications of MBRL already exist. Motivated by this, in this monograph, we aim to introduce automated ways of solving optimal control problems that can replace conventional controllers. Hence, detailed applications of the proposed approaches are included and demonstrated with numerical simulations.
The main motivation for this monograph can be summarized as follows:
Optimal control is highly favored, while there is no general analytical technique applicable to all nonlinear systems.
Feedback control techniques are known to be more robust and computationally efficient compared to the numerical techniques, especially in the continuous space.
The chance of obtaining a feedback control in closed form is low, and the known techniques are limited to some special classes of systems.
Approximate DP provides a systematic way of obtaining an optimal feedback control, while the complexity grows significantly with the number of parameters.
An efficient parameterization of the optimal value may provide an opportunity for more complex real‐time applications in control regulation and tracking problems.
We summarize the main contents of the book as follows:
Chapter 1 introduces Lyapunov stability analysis of nonlinear systems, which is used in subsequent chapters for analyzing the closed‐loop performance of the feedback controllers.
Chapter 2 formulates the optimal control problem and introduces the basic concepts of using the HJB equation to characterize optimal feedback controllers, where LQR is treated as a special case. A focus is on optimal feedback controllers for asymptotic stabilization tasks with an infinite‐horizon performance criterion.
Chapter 3 discusses PI as a prominent RL technique for solving continuous optimal control problems. PI algorithms for both linear and nonlinear systems with and without any knowledge of the system model are discussed. Proofs of convergence and stability analysis are provided in a self‐contained manner.
Chapter 4 presents different techniques for learning a dynamic model for continuous control in terms of a set of basis functions, including least squares, recursive least squares, gradient descent, and sparse identification techniques for parameter updates. Comparison results are shown using numerical examples.
Chapter 5 introduces the Structured Online Learning (SOL) framework for control, including the algorithm and local analysis of stability and optimality. The focus is on regulation problems.
Chapter 6 extends the SOL framework to tracking with unknown dynamics. Simulation results are given to show the effectiveness of the SOL approach. Numerical results on comparison with alternative RL approaches are also shown.
Chapter 7 presents a piecewise learning framework as a further extension of the SOL approach, where we limit ourselves to linear bases while allowing models to be learned in a piecewise fashion. Accordingly, closed‐loop stability guarantees are provided with Lyapunov analysis facilitated by Mixed‐Integer Quadratic Program (MIQP)‐based verification.
Chapters 8 and 9 present two case studies on Photovoltaic (PV) and quadrotor systems.
Chapter 10 introduces the associated Python‐based tool for SOL.
It should be noted that some of the contents of Chapters 5–9 have been previously published in Farsi and Liu [2020, 2021], Farsi et al. [2022], and Farsi and Liu [2022b, 2019], and they are included in this book with the permission of the cited publishers.
RL is a well‐known class of machine learning methods concerned with learning to achieve a particular task through interactions with the environment. The task is often defined by some reward mechanism, and the intelligent agent has to take actions in different situations. The reward accumulated is then used as a measure to improve the agent's actions in the future, where the objective is to accumulate as much reward as possible over time. Therefore, it is expected that the agent's actions approach the optimal behavior in the long term. RL has achieved many successes in simulation environments. However, the lack of explainability [Dulac‐Arnold et al., 2019] and of data efficiency [Duan et al., 2016] makes RL less favorable as an online learning technique that can be directly employed in real‐world problems, unless there exists a way to safely transfer the experience from simulation‐based learning to the real world. The main challenges in implementing RL techniques are discussed in Dulac‐Arnold et al. [2019]. Numerous studies have been done on this subject; see, e.g. Sutton and Barto [2018], Wiering and Van Otterlo [2012], Kaelbling et al. [1996], and Arulkumaran et al. [2017] for a list of related works. RL has found a variety of interesting applications in robotics [Kober et al., 2013], multiagent systems [Zhang et al., 2021; Da Silva and Costa, 2019; Hernandez‐Leal et al., 2019], power systems [Zhang et al., 2019; Yang et al., 2020], autonomous driving [Kiran et al., 2021] and intelligent transportation [Haydari and Yilmaz, 2020], and healthcare [Yu et al., 2021], among others.
MBRL techniques, as opposed to model‐free methods, are known to be more data efficient. Direct model‐free methods usually require enormous amounts of data and hours of training even for simple applications [Duan et al., 2016], while model‐based techniques can exhibit optimal behavior within a limited number of trials. This property, in addition to the flexibility in changing learning objectives and performing further safety analysis, makes them more suitable for real‐world implementations, such as robotics [Polydoros and Nalpantidis, 2017]. In model‐based approaches, having a deterministic or probabilistic description of the transition system saves much of the effort spent by direct methods in treating every point in the state‐control space individually. Hence, the role of model‐based techniques becomes even more significant when it comes to problems with continuous controls rather than discrete actions [Sutton, 1990; Atkeson and Santamaria, 1997; Powell, 2004].
In Moerland et al. [2020], the authors provide a survey of some recent MBRL methods formulated based on Markov Decision Processes (MDPs). In general, there exist two approaches to approximating a system: parametric and nonparametric. Parametric models are usually preferred over nonparametric ones, since the number of parameters is independent of the number of samples. Therefore, they can be implemented more efficiently on complex systems, where many samples are needed. In nonparametric approaches, on the other hand, the prediction for a given sample is obtained by comparing it with a set of stored samples that represent the model. Therefore, the complexity increases with the size of the dataset. In this book, because of this advantage of parametric models, we focus on parametric techniques.
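To illustrate the parametric route in its simplest form, the sketch below fits the coefficient matrix W of a model x_dot ≈ W Φ(x, u) by ordinary least squares from sampled states, inputs, and (numerically estimated) state derivatives; the basis function and the data arrays are placeholder assumptions, not anything prescribed by the book.

```python
import numpy as np

def basis(x, u):
    """Hypothetical basis vector Phi(x, u): constant, states, inputs,
    and second-order monomials of the states."""
    return np.concatenate(
        ([1.0], x, u, np.outer(x, x)[np.triu_indices(len(x))]))

def fit_parametric_model(X, U, Xdot):
    """Least-squares fit of W in  x_dot ~= W Phi(x, u).

    X:    (N, n) sampled states
    U:    (N, m) sampled inputs
    Xdot: (N, n) sampled (or numerically estimated) state derivatives
    Returns W with shape (n, p), where p is the number of basis functions.
    """
    Phi = np.vstack([basis(x, u) for x, u in zip(X, U)])   # (N, p)
    # Solve min_W || Phi W^T - Xdot ||_F^2
    W, *_ = np.linalg.lstsq(Phi, Xdot, rcond=None)
    return W.T
```

Note that the number of parameters here is fixed by the chosen basis, not by the number of samples N, which is the property that makes the parametric approach attractive above.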
Let us now specifically consider implementations of RL on control systems. Even though RL techniques do not require the dynamical model to solve the problem, they are in fact intended to find a solution to the optimal control problem. This problem has been extensively investigated by the control community. The LQR problem has been solved satisfactorily for linear systems using Riccati equations [Kalman, 1960], which also ensure system stability for infinite‐horizon problems. In the case of nonlinear systems, however, obtaining such a solution is not trivial and requires us to solve the HJB equation, either analytically or numerically, which is a challenging task, especially when we do not have knowledge of the system model.
Model Predictive Control (MPC) [Camacho and Alba, 2013; Garcia et al., 1989; Qin and Badgwell, 2003; Grüne and Pannek, 2017; Mayne and Michalska, 1988; Morari and Lee, 1999] has been frequently used as an optimal control technique, which is inherently model‐based. Furthermore, it deals with the control problem only across a restricted prediction horizon. For this reason, and because the problem is not considered in closed‐loop form, stability analysis is hard to establish. For the same reasons, the online computational complexity is considerably high compared to a feedback control rule that can be implemented efficiently.
Forward‐Propagating Riccati Equation (FPRE) [Weiss et al., 2012; Prach et al., 2015] is one of the techniques presented for solving the LQR problem. Normally, the Differential Riccati Equation (DRE) is solved backward from a final condition. In an analogous technique, it can be solved in forward time with some initial condition instead. A comparison between these two schemes is given in Prach et al. [2015]. Employing forward‐integration methods makes it suitable for solving the problem for time‐varying systems [Weiss et al., 2012; Chen and Kao, 1997] or in the RL setting [Lewis et al., 2012], since the future dynamics are not needed, whereas the backward technique requires the knowledge of the future dynamics from the final condition. FPRE has been shown to be an efficient technique for finding a suboptimal solution for linear systems, while, for nonlinear systems, the assumption is that the system is linearized along the system's trajectories.
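A minimal sketch of the forward-propagation idea for an LTI system is given below: the differential Riccati equation is integrated forward in time with a simple explicit Euler step from an initial guess P0. The step size, horizon, and matrices are illustrative assumptions, not the implementations cited above.

```python
import numpy as np

def fpre_gain(A, B, Q, R, P0, dt=1e-3, steps=20000):
    """Forward-propagate dP/dt = A^T P + P A - P B R^{-1} B^T P + Q
    from P0 with explicit Euler, returning the feedback gain K = R^{-1} B^T P."""
    P = P0.copy()
    Rinv = np.linalg.inv(R)
    for _ in range(steps):
        Pdot = A.T @ P + P @ A - P @ B @ Rinv @ B.T @ P + Q
        P = P + dt * Pdot
    return Rinv @ B.T @ P

# Example use with the illustrative double-integrator matrices from the
# earlier LQR sketch:
# K = fpre_gain(A, B, Q, R, P0=np.zeros((2, 2)))
```

Because only the current matrices are needed at each step, the propagation can run alongside an online identifier, which is what makes this scheme attractive in the RL setting mentioned above.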
State‐Dependent Riccati Equations (SDRE) [Çimen, 2008; Erdem and Alleyne, 2004; Cloutier, 1997] constitute another technique found in the literature for solving the optimal control problem for nonlinear systems. This technique relies on the fact that any nonlinear system can be written in the form of a linear system with state‐dependent matrices. However, this conversion is not unique; hence, a suboptimal solution is expected. Similar to MPC, it does not yield an explicit feedback control rule, since the control at each state is computed by solving a Riccati equation that depends on the system's trajectory.
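Schematically, in one common SDRE formulation (generic notation, not this book's), the dynamics are factored in a linear-like form and a Riccati equation with state-dependent coefficients is solved along the trajectory:

```latex
\dot{x} = A(x)\,x + B(x)\,u, \qquad
A(x)^{\top} P(x) + P(x) A(x) - P(x) B(x) R^{-1} B(x)^{\top} P(x) + Q = 0,
\qquad
u(x) = -R^{-1} B(x)^{\top} P(x)\, x.
```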
Other model‐based approaches in the literature are mainly categorized under RL into two groups: value function methods and policy search methods. In value function‐based methods, also known as approximate/adaptive DP techniques [Wang et al., 2009; Lewis and Vrabie, 2009; Balakrishnan et al., 2008], a value function is used to construct the policy. Policy search methods, on the other hand, directly improve the policy to achieve optimality. Adaptive DP has found different applications [Prokhorov, 2008; Ferrari‐Trecate et al., 2003; Prokhorov et al., 1995; Murray et al., 2002; Yu et al., 2014; Han and Balakrishnan, 2002; Lendaris et al., 2000; Liu and Balakrishnan, 2000] in automotive control, flight control, and power control, among others. A review of recent techniques can be found in Kalyanakrishnan and Stone [2009], Busoniu et al. [2017], Recht [2019], Polydoros and Nalpantidis [2017], and Kamalapurkar et al. [2018]. The Q‐learning approach learns an action‐dependent function using Temporal Difference (TD) updates to obtain the optimal policy. This is inherently a discrete approach. There are continuous extensions of this technique, such as Millán et al. [2002], Gaskett et al. [1999], Ryu et al. [2019], and Wei et al. [2018]. However, for an efficient implementation, the state and action spaces ought to be finite, which is highly restrictive for continuous problems.
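For completeness, the tabular Q-learning update mentioned above is shown in minimal form; the state and action spaces must be finite for the table to exist, which is exactly the restriction noted in the text, and the environment interface env used here is a hypothetical stand-in.

```python
import numpy as np

def q_learning(env, num_states, num_actions,
               episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning with a temporal-difference update and
    epsilon-greedy exploration. `env` is a hypothetical interface with
    reset() -> state and step(a) -> (next_state, reward, done)."""
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = (np.random.randint(num_actions) if np.random.rand() < eps
                 else int(np.argmax(Q[s])))
            s_next, r, done = env.step(a)
            # TD update toward the one-step bootstrapped target
            td_target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```

The table Q grows with the product of the numbers of states and actions, which illustrates why a direct extension to continuous state and control spaces requires the approximation schemes discussed in this book.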
Adaptive controllers [Åström and Wittenmark, 2013