Model-Based Reinforcement Learning

Milad Farsi

Description

Model-Based Reinforcement Learning: Explore a comprehensive and practical approach to reinforcement learning.

Reinforcement learning is an essential paradigm of machine learning, wherein an intelligent agent performs actions that ensure optimal behavior from devices. While this paradigm of machine learning has gained tremendous success and popularity in recent years, previous scholarship has focused either on theory (optimal control and dynamic programming) or on algorithms (most of which are simulation-based). Model-Based Reinforcement Learning provides a model-based framework to bridge these two aspects, thereby creating a holistic treatment of the topic of model-based online learning control.

In doing so, the authors seek to develop a model-based framework for data-driven control that bridges the topics of system identification from data, model-based reinforcement learning, and optimal control, as well as the applications of each. This new treatment of classical results allows for a more efficient reinforcement learning system. At its heart, this book is focused on providing an end-to-end framework, from design to application, of a more tractable model-based reinforcement learning technique.

Model-Based Reinforcement Learning readers will also find:

* A useful textbook for graduate courses on data-driven and learning-based control that emphasizes modeling and control of dynamical systems from data

* Detailed comparisons of the impact of different techniques, such as a basic linear quadratic controller, learning-based model predictive control, model-free reinforcement learning, and structured online learning

* Applications and case studies on ground vehicles with nonholonomic dynamics and on quadrotor helicopters

* An online, Python-based toolbox that accompanies the contents covered in the book, as well as the necessary code and data

Model-Based Reinforcement Learning is a useful reference for senior undergraduate students, graduate students, research assistants, professors, process control engineers, and roboticists.




Table of Contents

Cover

Title Page

Copyright

About the Authors

Preface

Acronyms

Introduction

I.1 Background and Motivation

I.2 Literature Review

Bibliography

1 Nonlinear Systems Analysis

1.1 Notation

1.2 Nonlinear Dynamical Systems

1.3 Lyapunov Analysis of Stability

1.4 Stability Analysis of Discrete Time Dynamical Systems

1.5 Summary

Bibliography

2 Optimal Control

2.1 Problem Formulation

2.2 Dynamic Programming

2.3 Linear Quadratic Regulator

2.4 Summary

Bibliography

Notes

3 Reinforcement Learning

3.1 Control‐Affine Systems with Quadratic Costs

3.2 Exact Policy Iteration

3.3 Policy Iteration with Unknown Dynamics and Function Approximations

3.4 Summary

Bibliography

Note

4 Learning of Dynamic Models

4.1 Introduction

4.2 Model Selection

4.3 Parametric Model

4.4 Parametric Learning Algorithms

4.5 Persistence of Excitation

4.6 Python Toolbox

4.7 Comparison Results

4.8 Summary

Bibliography

5 Structured Online Learning‐Based Control of Continuous‐Time Nonlinear Systems

5.1 Introduction

5.2 A Structured Approximate Optimal Control Framework

5.3 Local Stability and Optimality Analysis

5.4 SOL Algorithm

5.5 Simulation Results

5.6 Summary

Bibliography

6 A Structured Online Learning Approach to Nonlinear Tracking with Unknown Dynamics

6.1 Introduction

6.2 A Structured Online Learning for Tracking Control

6.3 Learning‐based Tracking Control Using SOL

6.4 Simulation Results

6.5 Summary

Bibliography

7 Piecewise Learning and Control with Stability Guarantees

7.1 Introduction

7.2 Problem Formulation

7.3 The Piecewise Learning and Control Framework

7.4 Analysis of Uncertainty Bounds

7.5 Stability Verification for Piecewise‐Affine Learning and Control

7.6 Numerical Results

7.7 Summary

Bibliography

8 An Application to Solar Photovoltaic Systems

8.1 Introduction

8.2 Problem Statement

8.3 Optimal Control of PV Array

8.4 Application Considerations

8.5 Simulation Results

8.6 Summary

Bibliography

9 An Application to Low‐level Control of Quadrotors

9.1 Introduction

9.2 Quadrotor Model

9.3 Structured Online Learning with RLS Identifier on Quadrotor

9.4 Numerical Results

9.5 Summary

Bibliography

10 Python Toolbox

10.1 Overview

10.2 User Inputs

10.3 SOL

10.4 Display and Outputs

10.5 Summary

Bibliography

Appendix

A.1 Supplementary Analysis of Remark 5.4

A.2 Supplementary Analysis of Remark 5.5

Index

End User License Agreement

List of Tables

Chapter 5

Table 5.1 The system dynamics and the corresponding value function obtained...

Chapter 8

Table 8.1 Nomenclature.

Table 8.2 Electrical data of the CS6X‐335M‐FG module [Canadian Solar, 2022]...

Chapter 9

Table 9.1 The coefficients of the simulated Crazyflie.

List of Illustrations

Chapter 4

Figure 4.1 A view of parametric and nonparametric learning methods in terms ...

Figure 4.2 We assume the exact bases for the model are not known. Hence, only...

Figure 4.3 Assuming that the exact set of bases are included in the model, w...

Figure 4.4 A comparison of the RMSE for different algorithms while learning....

Figure 4.5 We assume only some of the exact bases that are included in the b...

Figure 4.6 We assume that the exact bases are known, and they are included i...

Figure 4.7 Assuming that the exact bases are included in the set of bases c...

Figure 4.8 A comparison of runtimes for different algorithms.

Figure 4.9 The effect of increasing number of bases on the runtime of differ...

Figure 4.10 The runtime of GD and RLS are compared in terms of the increasin...

Figure 4.11 The runtime of SINDy and LS are compared in terms of the increas...

Chapter 5

Figure 5.1 Responses of the Lorenz system while learning by using the approx...

Figure 5.2 The value, components of , and prediction error corresponding to...

Figure 5.3 Responses of the cartpole system while learning by using the appr...

Figure 5.4 The value, components of , and prediction error corresponding to...

Figure 5.5 Convergence of the components of to the LQR solution, obtained ...

Figure 5.6 For the linear system, we illustrate that the feedback gain obtai...

Figure 5.7 The state trajectories and the control signal while learning that...

Figure 5.8 (a). Controlled and uncontrolled trajectories of the suspension s...

Figure 5.9 Responses of the double‐inverted pendulum system while learning b...

Figure 5.10 The value, components of , and prediction error corresponding t...

Figure 5.11 A view of the graphical simulations of the benchmark cartpole an...

Chapter 6

Figure 6.1 The control and states of the pendulum system within a run of the...

Figure 6.2 The control and states of the pendulum system within a run of the...

Figure 6.3 The control and states of the pendulum system within a run of the...

Figure 6.4 A view of the 3D simulation done for synchronizing the chaotic Lo...

Figure 6.5 The states and the obtained control of the Lorenz system while le...

Figure 6.6 The evolutions of the value and parameters while learning the tra...

Chapter 7

Figure 7.1 A scheme of obtaining a continuous piecewise model is illustrated...

Figure 7.2 Subfigures (a)–(f) denote the sample gaps located for different n...

Figure 7.3 The scheme for obtaining the uncertainty bound according to the s...

Figure 7.4 A view of the second dynamic of pendulum system (5.23) assuming

Figure 7.5 The procedure for learning the dynamics by the PWA model is illus...

Figure 7.6 To better illustrate the learning procedure, the step‐by‐step res...

Figure 7.7 The step‐by‐step results of the sampling procedure and the sample...

Figure 7.8 (a) The obtained ROA of the closed loop PWA system is illustrated...

Figure 7.9 (a) The state and control signals are illustrated within an episo...

Figure 7.10 A comparison of the runtime results for the identification and c...

Chapter 8

Figure 8.1 DC‐DC boost converter used to interface the load to the solar arr...

Figure 8.2 The equivalent electrical model of the solar array.

Figure 8.3 Output results of the simulated solar array by changing the irrad...

Figure 8.4 Output results of the simulated solar array by changing the ambie...

Figure 8.5 Evolutions of the operating point of system on curve before ...

Figure 8.6 Evolutions of the operating point of system on curve before a...

Figure 8.7 A sketch of the simulated solar PV system together with the propo...

Figure 8.8 The obtained control signal (NOC) compared to SMC and second‐orde...

Figure 8.9 Comparison results of the output power under the changing irradia...

Figure 8.10 The obtained control signal (NOC) compared to SMC and second‐ord...

Figure 8.11 Comparison results of the output power under the changing ambien...

Figure 8.12 The results obtained by simulating the system with the proposed ...

Figure 8.13 ...

Figure 8.14 The learning result of the solar PV system, given by Figure 8.13...

Figure 8.15 The solar PV array illustrating the partial shading condition co...

Figure 8.16 Output voltage and current signals of the solar PV array togethe...

Chapter 9

Figure 9.1 The histogram of the runtime of the identification and the contro...

Figure 9.2 A video of the training procedure can be found at https://youtu.b...

Figure 9.3 The attitude control results in the learning procedure illustrate...

Figure 9.4 The position control results in the learning procedure illustrate...

Figure 9.5 The model coefficients, identified by RLS, are shown within a run...

Figure 9.6 The PWM inputs of the quadrotor generated by the learned control....

Figure 9.7 The parameters of the value function within a sample run together...

Chapter 10

Figure 10.1 A view of the Structured Online Learning (SOL) toolbox is shown....

Figure 10.2 The Process class is illustrated including the methods available...

Figure 10.3 The objective class that is defined based on an LQR cost.

Figure 10.4 A view of the procedure for updating the model is given. The sam...

Figure 10.5 The Database class is illustrated, where given the samples of t...

Figure 10.6 Using the samples of the system and the set of bases chosen by t...

Figure 10.7 The Control class is illustrated, where the Objective and Librar...

Figure 10.8 The class shown is used to record and visualize the simulation r...

Figure 10.9 The 3D graphical simulation is handled by the illustrated object...

Guide

Cover Page

Table of Contents

Series Page

Title Page

Copyright

About the Authors

Begin Reading

Appendix

Index

Wiley End User License Agreement


IEEE Press

445 Hoes Lane

Piscataway, NJ 08854

IEEE Press Editorial Board

Sarah Spurgeon, Editor in Chief

Jón Atli Benediktsson

Anjan Bose

Adam Drobot

Peter (Yong) Lian

Andreas Molisch

Saeid Nahavandi

Diomidis Spinellis

Ahmet Murat Tekalp

Jeffrey Reed

Thomas Robertazzi

Model‐Based Reinforcement Learning

From Data to Continuous Actions with a Python‐based Toolbox

Milad Farsi and Jun Liu

University of Waterloo, Ontario, Canada

 

 

 

 

 

 

 

 

 

 

 

 

 

 

IEEE Press Series on Control Systems Theory and Applications

Maria Domenica Di Benedetto, Series Editor

Copyright © 2023 by The Institute of Electrical and Electronics Engineers, Inc.

All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging‐in‐Publication Data applied for:

Hardback ISBN: 9781119808572

Cover Design: Wiley

Cover Images: © Pobytov/Getty Images; Login/Shutterstock; Sazhnieva Oksana/Shutterstock

About the Authors

Milad Farsi received the B.S. degree in Electrical Engineering (Electronics) from the University of Tabriz in 2010 and the M.S. degree in Electrical Engineering (Control Systems) from Sahand University of Technology in 2013. He gained industrial experience as a control system engineer between 2012 and 2016. He received the Ph.D. degree in Applied Mathematics from the University of Waterloo, Canada, in 2022, where he is currently a Postdoctoral Fellow. His research interests include control systems, reinforcement learning, and their applications in robotics and power electronics.

Jun Liu received a B.S. degree in Applied Mathematics from Shanghai Jiao‐Tong University in 2002, the M.S. degree in Mathematics from Peking University in 2005, and the Ph.D. degree in Applied Mathematics from the University of Waterloo, Canada, in 2010. He is currently an Associate Professor of Applied Mathematics and a Canada Research Chair in Hybrid Systems and Control at the University of Waterloo, where he directs the Hybrid Systems Laboratory. From 2012 to 2015, he was a Lecturer in Control and Systems Engineering at the University of Sheffield. During 2011 and 2012, he was a Postdoctoral Scholar in Control and Dynamical Systems at the California Institute of Technology. His main research interests are in the theory and applications of hybrid systems and control, including rigorous computational methods for control design with applications in cyber‐physical systems and robotics.

Preface

The subject of Reinforcement Learning (RL) is popularly associated with the psychology of animal learning through a trial‐and‐error mechanism. The underlying mathematical principle of RL techniques, however, is undeniably the theory of optimal control, as exemplified by landmark results in the late 1950s on dynamic programming by Bellman, the maximum principle by Pontryagin, and the Linear Quadratic Regulator (LQR) by Kalman. Optimal control itself has its roots in the much older subject of the calculus of variations, which dates back to the late 1600s. Pontryagin's maximum principle and the Hamilton–Jacobi–Bellman (HJB) equation are the two main pillars of optimal control; the latter provides feedback control strategies through an optimal value function, whereas the former characterizes open‐loop control signals.

Reinforcement learning was developed by Barto and Sutton in the 1980s, inspired by animal learning and behavioral psychology. The subject has experienced a resurgence of interest in both academia and industry over the past decade, amid the new explosive wave of AI and machine learning research. A notable recent success of RL was in tackling the otherwise seemingly intractable game of Go and defeating the world champion in 2016.

Arguably, the problems originally solved by RL techniques are mostly discrete in nature: examples include navigating mazes and playing video games, where both the states and actions are discrete (finite), or simple control tasks such as pole balancing with impulsive forces, where the actions (controls) are chosen to be discrete. More recently, researchers have started to investigate RL methods for problems with both continuous state and action spaces. On the other hand, classical optimal control problems by definition have continuous state and control variables. It seems natural to simply formulate optimal control problems in a more general way and develop RL techniques to solve them. Nonetheless, there are two main challenges in solving such optimal control problems from a computational perspective. First, most techniques require exact or at least approximate model information. Second, the computation of optimal value functions and feedback controls often suffers from the curse of dimensionality. As a result, such methods are often too slow to be applied in an online fashion.

The book was motivated by this very challenge of developing computationally efficient methods for online learning of feedback controllers for continuous control problems. A main part of this book was based on the PhD thesis of the first author, which presented a Structured Online Learning (SOL) framework for computing feedback controllers by forward integration of a state‐dependent differential Riccati equation along state trajectories. In the special case of Linear Time‐Invariant (LTI) systems, this reduces to solving the well‐known LQR problem without prior knowledge of the model. The first part of the book (Chapters 1–3) provides some background materials including Lyapunov stability analysis, optimal control, and RL for continuous control problems. The remaining part (Chapters 4–9) discusses the SOL framework in detail, covering both regulation and tracking problems, their further extensions, and various case studies.

The first author would like to convey his heartfelt thanks to those who encouraged and supported him during his research. The second author is grateful to the mentors, students, colleagues, and collaborators who have supported him throughout his career. We gratefully acknowledge financial support for the research through the Natural Sciences and Engineering Research Council of Canada, the Canada Research Chairs Program, and the Ontario Early Researcher Award Program.

Waterloo, Ontario, Canada

Milad Farsi and Jun Liu

April 2022

Acronyms

ACCPM

analytic center cutting‐plane method

ARE

algebraic Riccati equation

DNN

deep neural network

DP

dynamic programming

DRE

differential Riccati equation

FPRE

forward‐propagating Riccati equation

GD

gradient descent

GUAS

globally uniformly asymptotically stable

GUES

globally uniformly exponentially stable

HJB

Hamilton–Jacobi–Bellman

KDE

kernel density estimation

LMS

least mean squares

LQR

linear quadratic regulator

LQT

linear quadratic tracking

LS

least squares

LTI

linear time‐invariant

MBRL

model‐based reinforcement learning

MDP

Markov decision process

MIQP

mixed‐integer quadratic program

MPC

model predictive control

MPP

maximum power point

MPPT

maximum power point tracking

NN

neural network

ODE

ordinary differential equation

PDE

partial differential equation

PE

persistence of excitation

PI

policy iteration

PV

Photovoltaic

PWA

piecewise affine

PWM

pulse‐width modulation

RL

reinforcement learning

RLS

recursive least squares

RMSE

root mean square error

ROA

region of attraction

SDRE

state‐dependent Riccati equations

SINDy

sparse identification of nonlinear dynamics

SMC

sliding mode control

SOL

structured online learning

SOS

sum of squares

TD

temporal difference

UAS

uniformly asymptotically stable

UES

uniformly exponentially stable

VI

value iteration

Introduction

I.1 Background and Motivation

I.1.1 Lack of an Efficient General Nonlinear Optimal Control Technique

Optimal control theory plays an important role in designing effective control systems. For linear systems, a class of optimal control problems has been solved successfully under the framework of the Linear Quadratic Regulator (LQR). LQR problems are concerned with minimizing a quadratic cost in terms of the control input and the state of a linear system; solving them allows us to regulate both the state and the control input. In control systems applications, this provides an opportunity to shape the behavior of the system by adjusting the weighting coefficients used in the cost functional. When it comes to nonlinear dynamical systems, however, there is no systematic method for efficiently obtaining an optimal feedback control for general nonlinear systems, and many of the techniques available in the literature on linear systems do not apply.
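For reference, the standard continuous-time LQR problem and its well-known solution can be summarized as follows, where $A$, $B$, $Q \succeq 0$, and $R \succ 0$ are the usual system and weighting matrices, written here only to fix notation:

\[
\dot{x} = Ax + Bu, \qquad \min_{u(\cdot)} \int_{0}^{\infty} \big( x^{\top} Q x + u^{\top} R u \big)\, dt,
\]
\[
u^{*}(t) = -Kx(t), \qquad K = R^{-1} B^{\top} P, \qquad A^{\top} P + P A - P B R^{-1} B^{\top} P + Q = 0 .
\]

Adjusting $Q$ and $R$ is precisely the mechanism referred to above for shaping the closed-loop behavior.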

Despite the complexity of nonlinear dynamical systems, they have attracted much attention from researchers in recent years. This is mostly because of their practical benefits in establishing a wide variety of applications in engineering, including power electronics, flight control, and robotics, among many others. Considering the control of a general nonlinear dynamical system, optimal control involves finding a control input that minimizes a cost functional that depends on the controlled state trajectory and the control input. While such a problem formulation can cover a wide range of applications, how to efficiently solve such problems remains a topic of active research.

I.1.2 Importance of an Optimal Feedback Control

In general, there exist two well‐known approaches to solving such optimal control problems: the maximum (or minimum) principles [Pontryagin, 1987] and the Dynamic Programming (DP) method [Bellman and Dreyfus, 1962]. To solve an optimization problem that involves dynamics, maximum principles require us to solve a two‐point boundary value problem, where the solution is not in a feedback form.

Plenty of numerical techniques have been presented in the literature to solve the optimal control problem. Such approaches generally rely on knowledge of the exact model of the system. When such a model exists, the optimal control input is obtained in open‐loop form as a time‐dependent signal. Consequently, implementing these approaches in real‐world problems often involves many complications that are well known to the control community. This is because of model mismatch, noise, and disturbances that greatly affect the online solution, causing it to diverge from the preplanned offline solution. Therefore, obtaining a closed‐loop solution of the optimal control problem is often preferred in such applications.

The DP approach analytically results in a feedback control for linear systems with a quadratic cost. Moreover, employing the Hamilton‐Jacobi‐Bellman (HJB) equation with a value function, one might manage to derive an optimal feedback control rule for some real‐world applications, provided that the value function can be updated in an efficient manner. This motivates us to consider conditions leading to an optimal feedback control rule that can be efficiently implemented in real‐world problems.
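For an infinite-horizon problem with dynamics $\dot{x} = f(x,u)$ and running cost $r(x,u)$, the steady-state HJB equation and the associated feedback rule take the generic form below; the practical difficulty discussed throughout this book is computing and updating $V^{*}$ efficiently:

\[
0 = \min_{u} \Big[ r(x,u) + \nabla V^{*}(x)^{\top} f(x,u) \Big], \qquad u^{*}(x) = \arg\min_{u} \Big[ r(x,u) + \nabla V^{*}(x)^{\top} f(x,u) \Big].
\]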

I.1.3 Limits of Optimal Feedback Control Techniques

Consider an optimal control problem over an infinite horizon involving a nonquadratic performance measure. Using the idea of inverse optimal control, the cost functional can then be evaluated in closed form as long as the running cost depends in a suitable way on an underlying Lyapunov function by which the asymptotic stability of the nonlinear closed‐loop system is guaranteed. It can then be shown that the Lyapunov function is indeed the solution of the steady‐state HJB equation. Although such a formulation allows an optimal feedback rule to be obtained analytically, choosing the proper performance measure may not be trivial. Moreover, from a practical point of view, the nonlinearity in the performance measure might cause unpredictable behavior.

A well‐studied method for solving an optimal control problem online is to employ a value function for a given policy. For any state, the value function gives a measure of how good that state is by accumulating the cost incurred from that state onward while the policy is applied. If such a value function can be obtained and the system model is known, the optimal policy is the one that steers the system in the direction along which the value decreases the most in the state space. Such Reinforcement Learning (RL) techniques, known as value‐based methods and including the Value Iteration (VI) and Policy Iteration (PI) algorithms, have been shown to be effective in finite state and control spaces. However, the computations cannot efficiently scale with the size of the state and control spaces.
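To make the scaling issue concrete, the sketch below implements tabular policy iteration for a small finite MDP with stage costs; the transition tensor and cost matrix are hypothetical inputs, and both the evaluation and improvement steps sweep over every state-action pair, which is exactly what becomes intractable in continuous spaces.

```python
import numpy as np

def policy_iteration(P, C, gamma=0.95):
    """Tabular policy iteration for a finite MDP with stage costs.

    P: (nA, nS, nS) transition probabilities P[a, s, s'].
    C: (nS, nA) stage costs.
    Returns a deterministic policy (action index per state) and its value.
    """
    nA, nS, _ = P.shape
    policy = np.zeros(nS, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = C_pi exactly.
        P_pi = P[policy, np.arange(nS), :]      # (nS, nS) closed-loop transitions
        C_pi = C[np.arange(nS), policy]         # (nS,) costs under the policy
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, C_pi)
        # Policy improvement: greedy (cost-minimizing) one-step lookahead.
        Q = C + gamma * np.einsum("asn,n->sa", P, V)
        new_policy = np.argmin(Q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```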

I.1.4 Complexity of Approximate DP Algorithms

One way of facilitating the computations involved in the value updates is to employ an approximate scheme. This is done by parameterizing the value function and adjusting the parameters in the training process. The optimal policy given by the value function is then also parameterized and approximated accordingly. The complexity of any value update depends directly on the number of parameters employed, and one may try to limit the number of parameters at the cost of optimality. We are therefore motivated to obtain a more efficient update rule for the value parameters, rather than limiting their number. We achieve this by reformulating the problem with a quadratically parameterized value function.
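One way to picture such a quadratic parameterization (a sketch of the general idea in generic notation, not necessarily the exact form developed later in the book) is to write the value in terms of a vector of basis functions $\Phi(x) \in \mathbb{R}^{p}$ and a symmetric parameter matrix:

\[
V(x) \approx \Phi(x)^{\top} P\, \Phi(x), \qquad P = P^{\top} \in \mathbb{R}^{p \times p},
\]

so that an update acts on the structured matrix $P$ as a whole rather than on an unstructured list of parameters.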

Moreover, the classical VI algorithm does not explicitly use the system model for evaluating the policy. This benefits applications in that full knowledge of the system dynamics is no longer required. However, online training with VI alone may take a much longer time to converge, since the model only participates implicitly through the future state. Therefore, the learning process can potentially be accelerated by introducing the system model. Furthermore, this creates an opportunity for running a separate identifier unit, where the model obtained can be simulated offline to complete the training or can be used for learning optimal policies for different objectives.

It can be shown that the VI algorithm for linear systems results in a Lyapunov recursion in the policy evaluation step. Such a Lyapunov equation in terms of the system matrices can be solved efficiently. However, for the general nonlinear case, methods for obtaining an equivalent recursion are not amenable to efficient solutions. Hence, we are motivated to investigate the possibility of acquiring an efficient update rule for nonlinear systems.
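For a discrete-time linear system $x_{k+1} = Ax_k + Bu_k$ with quadratic stage cost $x_k^{\top} Q x_k + u_k^{\top} R u_k$ and a fixed linear policy $u_k = Kx_k$, the policy-evaluation step mentioned above amounts to solving the discrete Lyapunov equation

\[
(A + BK)^{\top} P (A + BK) - P + Q + K^{\top} R K = 0
\]

for the value matrix $P$, which is a standard linear-algebra problem; it is this kind of structure that is missing in the general nonlinear case.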

I.1.5 Importance of Learning‐based Tracking Approaches

One of the most common problems in the control of dynamical systems is to track a desired reference trajectory, which arises in a variety of real‐world applications. However, designing an efficient tracking controller using conventional methods often necessitates a thorough understanding of the model, as well as computations and considerations specific to each application. RL approaches, on the other hand, offer a more flexible framework that requires less information about the system dynamics. While this may create additional problems, such as safety concerns or computational limits, there are already effective outcomes from the use of such approaches in real‐world situations. Similar to regulation problems, tracking control applications can benefit from Model‐based Reinforcement Learning (MBRL), which can handle the parameter updates more efficiently.

I.1.6 Opportunities for Obtaining a Real‐time Control

In the approximate optimal control technique, employing a limited number of parameters can only yield a local approximation of the model and the value function. If an approximation over a larger domain is intended, a considerably higher number of parameters may be needed. As a result, the complexity of the identification and the controller might be too high for online implementation in real‐world applications. This motivates us to circumvent this constraint by instead considering a set of simple local learners in a piecewise approach.

As mentioned, there already exist interesting real‐world applications of MBRL. Motivated by this, in this monograph we aim to introduce automated ways of solving optimal control problems that can replace conventional controllers. Hence, detailed applications of the proposed approaches are included and demonstrated with numerical simulations.

I.1.7 Summary

The main motivation for this monograph can be summarized as follows:

Optimal control is highly favored, while there is no general analytical technique applicable to all nonlinear systems.

Feedback control techniques are known to be more robust and computationally efficient compared to the numerical techniques, especially in the continuous space.

The chance of obtaining a feedback control in closed form is low, and the known techniques are limited to some special classes of systems.

Approximate DP provides a systematic way of obtaining an optimal feedback control, while the complexity grows significantly with the number of parameters.

An efficient parameterization of the optimal value may provide an opportunity for more complex real‐time applications in control regulation and tracking problems.

I.1.8 Outline of the Book

We summarize the main contents of the book as follows:

Chapter 1

introduces Lyapunov stability analysis of nonlinear systems, which is used in subsequent chapters for analyzing the closed‐loop performance of the feedback controllers.

Chapter 2

formulates the optimal control problem and introduces the basic concepts of using the HJB equation to characterize optimal feedback controllers, where LQR is treated as a special case. A focus is on optimal feedback controllers for asymptotic stabilization tasks with an infinite‐horizon performance criterion.

Chapter 3

discusses PI as a prominent RL technique for solving continuous optimal control problems. PI algorithms for both linear and nonlinear systems with and without any knowledge of the system model are discussed. Proofs of convergence and stability analysis are provided in a self‐contained manner.

Chapter 4

presents different techniques for learning a dynamic model for continuous control in terms of a set of basis functions, including least squares, recursive least squares, gradient descent, and sparse identification techniques for parameter updates. Comparison results are shown using numerical examples.

Chapter 5

introduces the Structured Online Learning (SOL) framework for control, including the algorithm and local analysis of stability and optimality. The focus is on regulation problems.

Chapter 6

extends the SOL framework to tracking with unknown dynamics. Simulation results are given to show the effectiveness of the SOL approach. Numerical results on comparison with alternative RL approaches are also shown.

Chapter 7

presents a piecewise learning framework as a further extension of the SOL approach, where we limit the bases to linear functions, while allowing models to be learned in a piecewise fashion. Accordingly, closed‐loop stability guarantees are provided with Lyapunov analysis facilitated by Mixed‐Integer Quadratic Program (MIQP)‐based verification.

Chapters 8 and 9

present two case studies on Photovoltaic (PV) and quadrotor systems.

Chapter 10

introduces the associated Python‐based tool for SOL.

It should be noted that some of the contents of Chapters 5–9 have been previously published in Farsi and Liu [2020, 2021], Farsi et al. [2022], and Farsi and Liu [2022b, 2019]; they are included in this book with the permission of the cited publishers.

I.2 Literature Review

I.2.1 Reinforcement Learning

RL is a well‐known class of machine learning methods concerned with learning to achieve a particular task through interactions with the environment. The task is often defined by some reward mechanism, and the intelligent agent has to take actions in different situations. The reward accumulated is then used as a measure to improve the agent's actions in the future, where the objective is to accumulate as much reward as possible over time. Therefore, it is expected that the agent's actions approach the optimal behavior in the long term. RL has achieved many successes in simulation environments. However, the lack of explainability [Dulac‐Arnold et al., 2019] and data efficiency [Duan et al., 2016] makes these techniques less favorable as an online learning approach that can be directly employed in real‐world problems, unless there exists a way to safely transfer the experience from simulation‐based learning to the real world. The main challenges in the implementation of RL techniques are discussed in Dulac‐Arnold et al. [2019]. Numerous studies have been done on this subject; see, e.g. Sutton and Barto [2018], Wiering and Van Otterlo [2012], Kaelbling et al. [1996], and Arulkumaran et al. [2017] for a list of related works. RL has found a variety of interesting applications in robotics [Kober et al., 2013], multiagent systems [Zhang et al., 2021; Da Silva and Costa, 2019; Hernandez‐Leal et al., 2019], power systems [Zhang et al., 2019; Yang et al., 2020], autonomous driving [Kiran et al., 2021] and intelligent transportation [Haydari and Yilmaz, 2020], and healthcare [Yu et al., 2021], among others.

I.2.2 Model‐based Reinforcement Learning

MBRL techniques, as opposed to model‐free learning methods, are known to be more data efficient. Direct model‐free methods usually require enormous amounts of data and hours of training even for simple applications [Duan et al., 2016], while model‐based techniques can exhibit optimal behavior in a limited number of trials. This property, in addition to the flexibility in changing learning objectives and performing further safety analysis, makes them more suitable for real‐world implementations, such as robotics [Polydoros and Nalpantidis, 2017]. In model‐based approaches, having a deterministic or probabilistic description of the transition system saves much of the effort spent by direct methods in treating any point in the state‐control space individually. Hence, the role of model‐based techniques becomes even more significant when it comes to problems with continuous controls rather than discrete actions [Sutton, 1990; Atkeson and Santamaria, 1997; Powell, 2004].

In Moerland et al. [2020], the authors provide a survey of some recent MBRL methods formulated based on Markov Decision Processes (MDPs). In general, there exist two approaches for approximating a system: parametric and nonparametric. Parametric models are usually preferred over nonparametric ones, since the number of parameters is independent of the number of samples; they can therefore be implemented more efficiently on complex systems, where many samples are needed. In nonparametric approaches, on the other hand, the prediction for a given sample is obtained by comparing it with a set of already stored samples, which represent the model; hence, the complexity increases with the size of the dataset. Because of this advantage of parametric models, we focus on parametric techniques in this book.
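As a simple illustration of the distinction, a parametric model of the dynamics can take a linear-in-parameters form such as the one below (generic notation, not the specific model class used later in the book), where the size of the parameter matrix $W$ is fixed in advance by the choice of $p$ basis functions, independently of how many samples are collected:

\[
\hat{\dot{x}} = W^{\top} \Phi(x,u), \qquad W \in \mathbb{R}^{p \times n}, \quad \Phi(x,u) \in \mathbb{R}^{p},
\]

whereas a nonparametric (e.g. kernel-based) predictor must be evaluated against the stored samples themselves, so its cost grows with the dataset.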

I.2.3 Optimal Control

Let us specifically consider implementations of RL on control systems. Despite the fact that RL techniques do not require the dynamical model to solve the problem, they are in fact intended to find a solution to the optimal control problem. This problem has been extensively investigated by the control community. The LQR problem has been solved satisfactorily for linear systems using Riccati equations [Kalman, 1960], which also ensure system stability for infinite‐horizon problems. In the case of nonlinear systems, however, obtaining such a solution is not trivial and requires us to solve the HJB equation, either analytically or numerically, which is a challenging task, especially when we do not have knowledge of the system model.
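As a minimal numerical sketch of the linear case, the snippet below computes the LQR gain by solving the continuous-time algebraic Riccati equation with SciPy; the double-integrator matrices and weights are placeholders chosen only for illustration.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Placeholder double-integrator model: state x = [position, velocity].
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)            # state weighting in the quadratic cost
R = np.array([[1.0]])    # control weighting in the quadratic cost

# Solve A'P + PA - P B R^{-1} B' P + Q = 0 for P, then form the gain.
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)   # optimal feedback: u = -K x

print("LQR gain K =", K)
```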

Model Predictive Control (MPC) [Camacho and Alba, 2013; Garcia et al., 1989; Qin and Badgwell, 2003; Grüne and Pannek, 2017; Mayne and Michalska, 1988; Morari and Lee, 1999] has been frequently used as an optimal control technique and is inherently model‐based. Furthermore, it deals with the control problem only across a restricted prediction horizon. For this reason, and because the problem is not considered in closed‐loop form, stability analysis is hard to establish. For the same reasons, the online computational complexity is considerably high compared to a feedback control rule that can be efficiently implemented.
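For reference, a generic discrete-time MPC problem over a prediction horizon $N$ takes the following form (generic notation, not tied to any of the cited formulations); at each sampling instant only the first element of the minimizing input sequence is applied, and the problem is solved again at the next measured state:

\[
\min_{u_0,\dots,u_{N-1}} \; \sum_{k=0}^{N-1} \ell(x_k,u_k) + V_f(x_N)
\quad \text{s.t.} \quad x_{k+1} = f(x_k,u_k), \;\; x_0 = x(t), \;\; x_k \in \mathcal{X}, \;\; u_k \in \mathcal{U}.
\]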

Forward‐Propagating Riccati Equation (FPRE) [Weiss et al., 2012; Prach et al., 2015] is one of the techniques presented for solving the LQR problem. Normally, the Differential Riccati Equation (DRE) is solved backward from a final condition. In an analogous technique, it can be solved in forward time with some initial condition instead. A comparison between these two schemes is given in Prach et al. [2015]. Employing forward‐integration methods makes it suitable for solving the problem for time‐varying systems [Weiss et al., 2012; Chen and Kao, 1997] or in the RL setting [Lewis et al., 2012], since the future dynamics are not needed, whereas the backward technique requires the knowledge of the future dynamics from the final condition. FPRE has been shown to be an efficient technique for finding a suboptimal solution for linear systems, while, for nonlinear systems, the assumption is that the system is linearized along the system's trajectories.
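The distinction can be written compactly as follows (a schematic comparison in standard LQR notation, not the exact equations of the cited works): the finite-horizon DRE is integrated backward from a terminal condition, whereas the forward-propagating variant integrates an analogous equation forward in time from an initial condition:

\[
-\dot{P}(t) = A^{\top} P(t) + P(t) A - P(t) B R^{-1} B^{\top} P(t) + Q, \qquad P(t_f) = P_f \quad \text{(backward DRE)},
\]
\[
\dot{\bar{P}}(t) = A^{\top} \bar{P}(t) + \bar{P}(t) A - \bar{P}(t) B R^{-1} B^{\top} \bar{P}(t) + Q, \qquad \bar{P}(0) = \bar{P}_0 \quad \text{(forward propagation)}.
\]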

The State‐Dependent Riccati Equation (SDRE) approach [Çimen, 2008; Erdem and Alleyne, 2004; Cloutier, 1997] is another technique found in the literature for solving the optimal control problem for nonlinear systems. It relies on the fact that any nonlinear system can be written in the form of a linear system with state‐dependent matrices. However, this conversion is not unique; hence, a suboptimal solution is expected. Similar to MPC, it does not yield an explicit feedback control rule, since the control at each state is computed by solving a state‐dependent Riccati equation along the system's trajectory.
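In its basic form, the approach rewrites the dynamics with state-dependent coefficient matrices and solves a Riccati equation pointwise along the trajectory (a schematic summary in generic notation):

\[
\dot{x} = A(x)x + B(x)u, \qquad A(x)^{\top} P(x) + P(x) A(x) - P(x) B(x) R^{-1} B(x)^{\top} P(x) + Q = 0, \qquad u = -R^{-1} B(x)^{\top} P(x)\, x .
\]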

I.2.4 Dynamic Programming

Other model‐based approaches found in the literature are mainly categorized under RL into two groups: value function and policy search methods. In value function‐based methods, also known as approximate/adaptive DP techniques [Wang et al., 2009; Lewis and Vrabie, 2009; Balakrishnan et al., 2008], a value function is used to construct the policy. On the other hand, policy search methods directly improve the policy to achieve optimality. Adaptive DP has found different applications [Prokhorov, 2008; Ferrari‐Trecate et al., 2003; Prokhorov et al., 1995; Murray et al., 2002; Yu et al., 2014; Han and Balakrishnan, 2002; Lendaris et al., 2000; Liu and Balakrishnan, 2000] in automotive control, flight control, and power control, among others. A review of recent techniques can be found in Kalyanakrishnan and Stone [2009], Busoniu et al. [2017], Recht [2019], Polydoros and Nalpantidis [2017], and Kamalapurkar et al. [2018]. The Q‐learning approach learns an action‐dependent function using Temporal Difference (TD) to obtain the optimal policy. This is an inherently discrete approach. There are continuous extensions of this technique, such as [Millán et al., 2002; Gaskett et al., 1999; Ryu et al., 2019; Wei et al., 2018]. However, for an efficient implementation, the state and action spaces ought to be finite, which is highly restrictive for continuous problems.
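The tabular Q-learning update referred to here is the standard temporal-difference rule; its reliance on a table over all state-action pairs is what makes the finite-space requirement essentially unavoidable:

\[
Q(s_k,a_k) \leftarrow Q(s_k,a_k) + \alpha \Big[ r_k + \gamma \max_{a'} Q(s_{k+1},a') - Q(s_k,a_k) \Big].
\]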

Adaptive controllers [Åström and Wittenmark, 2013