Optimization for Learning and Control

A comprehensive resource providing a master's-level introduction to optimization theory and algorithms for learning and control.

Optimization for Learning and Control describes how optimization is used in these domains, giving a thorough introduction to unsupervised learning, supervised learning, and reinforcement learning, with an emphasis on optimization methods for large-scale learning and control problems. Several application areas are also discussed, including signal processing, system identification, optimal control, and machine learning.

Today, most of the material on the optimization aspects of deep learning that is accessible to students at the master's level focuses on surface-level computer programming; deeper knowledge of the optimization methods and the trade-offs behind them is not provided. The objective of this book is to make this scattered knowledge, currently available mainly in academic journal publications, accessible to master's students in a coherent way. The focus is on basic algorithmic principles and trade-offs.

Optimization for Learning and Control covers sample topics such as:

* Optimization theory and optimization methods, covering classes of optimization problems such as least-squares problems, quadratic problems, conic optimization problems, and rank optimization.
* First-order methods, second-order methods, variable metric methods, and methods for nonlinear least-squares problems.
* Stochastic optimization methods, augmented Lagrangian methods, interior-point methods, and conic optimization methods.
* Dynamic programming for solving optimal control problems and its generalization to reinforcement learning.
* How optimization theory is used to develop the theory and tools of statistics and learning, e.g. the maximum likelihood method, expectation maximization, k-means clustering, and support vector machines.
* How calculus of variations is used in optimal control and for deriving the family of exponential distributions.

Optimization for Learning and Control is an ideal resource on the subject for scientists and engineers learning which optimization methods are useful for learning and control problems; the text will also appeal to industry professionals using machine learning in practical applications.
Page count: 654
Year of publication: 2023
Cover
Title Page
Copyright
Dedication
Preface
Acknowledgments
Glossary
Acronyms
About the Companion Website
Part I: Introductory Part
1 Introduction
1.1 Optimization
1.2 Unsupervised Learning
1.3 Supervised Learning
1.4 System Identification
1.5 Control
1.6 Reinforcement Learning
1.7 Outline
2 Linear Algebra
2.1 Vectors and Matrices
2.2 Linear Maps and Subspaces
2.3 Norms
2.4 Algorithm Complexity
2.5 Matrices with Structure
2.6 Quadratic Forms and Definiteness
2.7 Spectral Decomposition
2.8 Singular Value Decomposition
2.9 Moore–Penrose Pseudoinverse
2.10 Systems of Linear Equations
2.11 Factorization Methods
2.12 Saddle‐Point Systems
2.13 Vector and Matrix Calculus
Exercises
Note
3 Probability Theory
3.1 Probability Spaces
3.2 Conditional Probability
3.3 Independence
3.4 Random Variables
3.5 Conditional Distributions
3.6 Expectations
3.7 Conditional Expectations
3.8 Convergence of Random Variables
3.9 Random Processes
3.10 Markov Processes
3.11 Hidden Markov Models
3.12 Gaussian Processes
Exercises
Notes
Part II: Optimization
4 Optimization Theory
4.1 Basic Concepts and Terminology
4.2 Convex Sets
4.3 Convex Functions
4.4 Subdifferentiability
4.5 Convex Optimization Problems
4.6 Duality
4.7 Optimality Conditions
Exercises
5 Optimization Problems
5.1 Least‐Squares Problems
5.2 Quadratic Programs
5.3 Conic Optimization
5.4 Rank Optimization
5.5 Partial Separability
5.6 Multiparametric Optimization
5.7 Stochastic Optimization
Exercises
Notes
6 Optimization Methods
6.1 Basic Principles
6.2 Gradient Descent
6.3 Newton's Method
6.4 Variable Metric Methods
6.5 Proximal Gradient Method
6.6 Sequential Convex Optimization
6.7 Methods for Nonlinear Least‐Squares
6.8 Stochastic Optimization Methods
6.9 Coordinate Descent Methods
6.10 Interior‐Point Methods
6.11 Augmented Lagrangian Methods
Exercises
Part III: Optimal Control
7 Calculus of Variations
7.1 Extremum of Functionals
7.2 The Pontryagin Maximum Principle
7.3 The Euler–Lagrange Equations
7.4 Extensions
7.5 Numerical Solutions
Exercises
Notes
8 Dynamic Programming
8.1 Finite Horizon Optimal Control
8.2 Parametric Approximations
8.3 Infinite Horizon Optimal Control
8.4 Value Iterations
8.5 Policy Iterations
8.6 Linear Programming Formulation
8.7 Model Predictive Control
8.8 Explicit MPC
8.9 Markov Decision Processes
8.10 Appendix
Exercises
Notes
Part IV: Learning
9 Unsupervised Learning
9.1 Chebyshev Bounds
9.2 Entropy
9.3 Prediction
9.4 The Viterbi Algorithm
9.5 Kalman Filter on Innovation Form
9.6 Viterbi Decoder
9.7 Graphical Models
9.8 Maximum Likelihood Estimation
9.9 Relative Entropy and Cross Entropy
9.10 The Expectation Maximization Algorithm
9.11 Mixture Models
9.12 Gibbs Sampling
9.13 Boltzmann Machine
9.14 Principal Component Analysis
9.15 Mutual Information
9.16 Cluster Analysis
Exercises
Notes
10 Supervised Learning
10.1 Linear Regression
10.2 Regression in Hilbert Spaces
10.3 Gaussian Processes
10.4 Classification
10.5 Support Vector Machines
10.6 Restricted Boltzmann Machine
10.7 Artificial Neural Networks
10.8 Implicit Regularization
Exercises
Note
11 Reinforcement Learning
11.1 Finite Horizon Value Iteration
11.2 Infinite Horizon Value Iteration
11.3 Policy Iteration
11.4 Linear Programming Formulation
11.5 Approximation in Policy Space
11.6 Appendix – Root‐Finding Algorithms
Exercises
Note
12 System Identification
12.1 Dynamical System Models
12.2 Regression Problem
12.3 Input–Output Models
12.4 Missing Data
12.5 Nuclear Norm System Identification
12.6 Gaussian Processes for Identification
12.7 Recurrent Neural Networks
12.8 Temporal Convolutional Networks
12.9 Experiment Design
Exercises
Notes
Appendix A
A.1 Notation and Basic Definitions
A.2 Software
References
Index
End User License Agreement
Chapter 2
Table 2.1 FLOP counts for basic matrix and vector operations: denotes a s...
Table 2.2 The four subspaces and the corresponding projection matrices.
Chapter 6
Table 6.1 Asymptotic behavior of step‐size sums and the resulting upper bou...
Table 6.2 Barrier functions for elementary proper convex cones.
Chapter 2
Figure 2.1 The four subspaces associated with an matrix with rank .
Figure 2.2 Parallelogram defined by two vectors and in . The area is gi...
Figure 2.3 Affine combinations of two points and .
Figure 2.4 Householder reflection in .
Figure 2.5 The black squares in the sparsity pattern (left) denote nonzero e...
Figure 2.6 Symbolic Cholesky factorizations of two different symmetric permu...
Chapter 4
Figure 4.1 The epigraph of a function and an example of a sublevel set ....
Figure 4.2 Stationary points of a continuously differentiable function .
Figure 4.3 A set is convex if the line segment between two points and is...
Figure 4.4 Convex and conic combinations of two points and . (a) Conic co...
Figure 4.5 A hyperplane and a halfspace in : (a) hyperplane and (b) halfspa...
Figure 4.6 Examples of polyhedral sets: the set to the left is the interse...
Figure 4.7 Examples of norm balls in . The ‐norm ball and the ‐norm ball ...
Figure 4.8 The nonnegative orthant in .
Figure 4.9 The second‐order cone in .
Figure 4.10 An example of a convex cone and its dual cone in . Note tha...
Figure 4.11 A function is convex if the line segment connecting the two po...
Figure 4.12 The epigraph of a function is a convex set if and only if is...
Figure 4.13 A continuously differentiable function is convex if and only i...
Figure 4.14 The pointwise maximum of convex functions is itself convex.
Figure 4.15 The conjugate function is the largest signed vertical distance...
Figure 4.16 Illustration of subderivatives and subdifferential of nonsmooth ...
Figure 4.17 The optimality condition (4.54) implies that an optimal point ...
Chapter 5
Figure 5.1 Example of a two‐dimensional localization problem. The anchors ...
Figure 5.2 The exponential cone and examples of the power cone : (a) expo...
Figure 5.3 The optimal value associated with the rank‐constrained problem (5...
Figure 5.4 The interaction and clique intersection graphs associated with th...
Figure 5.5 A computational tree associated with the optimization problem (5....
Chapter 6
Figure 6.1 Illustration of the Armijo condition. The gray regions correspond...
Figure 6.2 Illustration of the curvature conditions (6.9) and (6.10). The ha...
Figure 6.3 Examples of the gradient descent iteration with three different s...
Figure 6.4 Newton's method with three different starting points.
Figure 6.5 The relative suboptimality for the PG and the APG methods.
Figure 6.6 Construction of upper and lower bounds on , illustrated for . T...
Figure 6.7 The relative suboptimality for the stochastic gradient method usi...
Figure 6.8 Contour plot of the function , which is convex but nonsmooth. Th...
Figure 6.9 Coordinate descent with exact line search applied to convex quadr...
Figure 6.10 The scaled barrier function is a smooth approximation to the i...
Figure 6.11 The path‐following method converges to an ‐suboptimal point b...
Chapter 7
Figure 7.1 The Brachistochrone problem.
Chapter 8
Figure 8.1 Graph for the shortest path problem in Example 8.2.
Figure 8.2 Graph for the shortest path problem in Example 8.2. The shortest ...
Figure 8.3 The cost for the infinite horizon criterion as a function of the ...
Figure 8.4 The trajectories for the states (top), and the control signals (b...
Chapter 9
Figure 9.1 Graph showing the links between different web pages.
Figure 9.2 Plot of the function .
Figure 9.3 The level curves of the pdf for the Gaussian mixture in Example 9...
Figure 9.4 Figure showing the level curves of the pdf for the Gaussian mixtu...
Figure 9.5 Figure showing samples from the Gaussian mixture in Example 9.4 t...
Figure 9.6 A graph where the subset of nodes in separates the subset of no...
Figure 9.7 Observations and estimated GMM with components based on maximum...
Figure 9.8 Realizations of samples obtained via direct sampling and Gibbs ...
Figure 9.9 Plot of compressed data resulting from PCA. The difference spec...
Figure 9.10 Level curves for the objective function between 2.15 (dark gray)...
Figure 9.11 Plot of resulting clustering. The original data is marked with...
Chapter 10
Figure 10.1 Plot showing the fourth degree polynomial (solid line), and the ...
Figure 10.2 Plot showing the Gaussian regression model (solid line), and the...
Figure 10.3 Plots showing data points from two classes marked with for the...
Figure 10.4 Plot showing the data points from the two classes marked with ...
Figure 10.5 Plot showing the data points from the two classes marked with ...
Figure 10.6 Graph showing how the visible and hidden layers in an RBM are co...
Figure 10.7 A neural network with four hidden layers. The nodes represent th...
Figure 10.8 Figure illustrating the recursive definition of the ANN predicto...
Chapter 12
Figure 12.1 Plots showing the estimated values of and for 100 runs of sy...
Figure 12.2 The left plot shows (solid), and (dashed), as function of ,...
Figure 12.3 Plot showing the Bode‐diagram of the Prefilter.
Figure 12.4 Scatter‐plots showing the estimated values of and . The left ...
Appendix A
Figure A.1 Modeling tools allow the users to specify their problems using a ...
Anders Hansson, Linköping University, Linköping, Sweden
Martin Andersen, Technical University of Denmark, Kongens Lyngby, Denmark
Copyright © 2023 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging‐in‐Publication Data
Names: Hansson, Anders, author. | Andersen, Martin, author.
Title: Optimization for learning and control / Anders Hansson, Linköping University, Linköping Sweden, Martin Andersen, Technical University of Denmark.
Description: First edition. | Hoboken, NJ, USA : Wiley, [2023] | Includes index.
Identifiers: LCCN 2023002568 (print) | LCCN 2023002569 (ebook) | ISBN 9781119809135 (cloth) | ISBN 9781119809142 (adobe pdf) | ISBN 9781119809173 (epub)
Subjects: LCSH: System analysis–Mathematics. | Mathematical optimization. | Machine learning–Mathematics. | Signal processing–Mathematics. | MATLAB.
Classification: LCC T57.62 .H43 2023 (print) | LCC T57.62 (ebook) | DDC 004.2/10151–dc23/eng/20230202
LC record available at https://lccn.loc.gov/2023002568
LC ebook record available at https://lccn.loc.gov/2023002569
Cover Design: Wiley
Cover Image: © pluie_r/Shutterstock
To Erik.
To Cassie, Maxwell, and Patrick.
This is a book about optimization for learning and control. The literature and the techniques for learning are vast, but we will not cover all possible learning methods here. Instead, we discuss some of them, especially those that result in optimization problems, and we discuss which optimization methods are relevant for these optimization problems. The book is primarily intended for graduate students with a background in science or engineering who want to learn more about which optimization methods are suitable for learning problems. It is also useful for those who want to study optimal control. Very little background in optimization, control, or learning is needed. The book is accompanied by a large number of exercises, many of which involve computer tools so that students can obtain hands-on experience.
The topics in learning span a wide range, from classical statistical learning problems like regression and maximum likelihood estimation to more recent problems like deep learning using, e.g., recurrent neural networks. Regarding optimization methods, we cover everything from simple gradient methods to more advanced interior-point methods for conic optimization. Special emphasis is placed on stochastic methods applicable to the training of neural networks, and on nondifferentiable problems, for which we discuss subgradient methods and proximal methods. We cover second-order methods, variable-metric methods, and augmented Lagrangian methods. Regarding applications to system identification, we discuss identification both for input–output models and for state-space models. Recurrent neural networks and temporal convolutional networks are naturally introduced as ways of modeling nonlinear dynamical systems. We also cover calculus of variations and dynamic programming in detail, along with the generalization of the latter to reinforcement learning.
The book can be used to teach several different courses. One could be an introductory course in optimization based on Chapters 4–6. Another could be a course on optimal control covering Chapters 7 and 8, and possibly also Chapter 11. Yet another could be a course on learning covering Chapters 9 and 10, and perhaps Chapter 12. More chapters can of course be combined; a course taught at Linköping University for PhD students covers everything except the optimal control material.
Anders Hansson and Martin Andersen
Linköping and Kongens Lyngby
November 2022
We would like to thank Andrea Garulli at the University of Siena, who invited Anders Hansson to give a course on optimization for learning in the spring of 2019. The experience from teaching that course provided most valuable inspiration for writing this book. Daniel Cederberg, Markus Fritzsche, and Magnus Malmström are gratefully acknowledged for proofreading some of the chapters.
Anders Hansson and Martin Andersen
Sets
set of natural numbers
set
set of integers
set
set of nonnegative integer numbers
set of real numbers
set of nonnegative real numbers
set of positive real numbers
set of extended real numbers
set of nonnegative extended real numbers
set of positive extended real numbers
set of complex numbers
set of symmetric real‐valued matrices of order
set of positive semidefinite real‐valued matrices of order
set of positive definite real‐valued matrices of order
quadratic cone of dimension
probability simplex of dimension
empty set
Numbers, Vectors, and Matrices
Euler's number
Archimedes' constant
vector of ones
identity matrix
Elementary Functions
natural exponential function
logarithm function
natural logarithm function
logarithm function, base 2
sine function
cosine function
tangent function
sign function
indicator function for set
Operations on Sets or Functions
affine hull of set
minimum argument
maximum argument
closure of set
convex hull of set
conic hull of set
effective domain
epigraph of function
inf
infimum of set or function
interior of set
max
maximum of set or function
min
minimum of set or function
proximal operator
relative interior of set
sup
supremum of set or function
Operations on Vectors, Vector Spaces or Matrices
adjugate of matrix
block diagonal matrix from matrices
determinant of matrix
diagonal matrix from vector
dimension of vector space or convex set
nullspace of matrix
number of nonzero entries
nullity of matrix
range of matrix
rank of matrix
span of vectors
symmetric vectorization of matrix
trace of matrix
vectorization of matrix
Probability Spaces
expectation functional
normal probability density function
probability measure
variance functional
Symbols
summation
product
integral
contour integral
infinity
belongs to
does not belong to
proper subset of
subset of
proper superset of
superset of
not proper subset of
not subset of
not proper superset of
not superset of
set union
set intersection
set difference
plus
minus
plus or minus
multiplied by
Kronecker product
Hadamard product or composition of functions
is equal to
is less than
is less than or equal to
is greater than
is greater than or equal to
is not equal to
is not less than
is neither less than nor equal to
is not greater than
is neither greater than nor equal to
much smaller than
much greater than
is approximately equal to
asymptotically equivalent to
proportional to
precedes
precedes or equals
succeeds
succeeds or equals
does not precede
neither precedes nor equals
does not succeed
neither succeeds nor equals
there exists
there is no
for all
logical not
logical and
logical or
implies
is implied by
is equivalent to
to or tends toward
corresponds to
tends toward from above
maps to
is perpendicular to
such that or given
:
such that
AdaGrad
adaptive gradient method
Adam
adaptive moment estimation
ADMM
alternating direction method of multipliers
a.e.
almost everywhere
ANN
artificial neural network
ARE
algebraic Riccati equation
ARMAX
auto‐regressive‐moving‐average with exogenous terms
ARX
auto‐regressive with exogenous terms
a.s.
almost surely
BB
Barzilai–Borwein
BFGS
Broyden, Fletcher, Goldfarb, and Shanno
BM
Boltzmann machine
DAE
differential algebraic equation
df
distribution function
DFP
Davidon, Fletcher, and Powell
DC
diagonal/correlated
EM
expectation maximization
EVD
eigenvalue decomposition
FIR
finite impulse response
FLOP
floating point operation
GN
Gauss–Newton
GMM
Gaussian mixture model
i.i.d.
independent, identically distributed
HMM
hidden Markov model
IFT
iterative feedback tuning
ILC
iterative learning control
IP
interior point
IPG
incremental proximal gradient
KKT
Karush–Kuhn–Tucker
LICQ
linear independence constraint qualification
LM
Levenberg–Marquardt
LMI
linear matrix inequality
LP
linear program
LQ
linear quadratic
LS
least‐squares
LSTM
long short‐term memory
MA
moving average
MAP
maximum a posteriori
MDP
Markov decision process
ML
maximum likelihood
MPC
model predictive control
m.s.
mean square
MSE
mean square error
NP
nondeterministic polynomial time
ODE
ordinary differential equation
OE
output error
PCA
principal component analysis
pdf
probability density function
pf
probability function
PI
policy iteration
PMP
Pontryagin maximum principle
QP
quadratic program
RBM
restricted Boltzmann machine
RMSprop
root mean square propagation
RNN
recurrent neural network
ReLU
rectified linear unit
SA
stochastic approximation
SAA
stochastic average approximation
SARSA
state‐action‐reward‐state‐action
SG
stochastic gradient
SGD
stochastic gradient descent
SMW
Sherman–Morrison–Woodbury
SNR
signal‐to‐noise ratio
SR1
symmetric rank‐1
SVD
singular value decomposition
SVM
support vector machine
SVRG
stochastic variance‐reduced gradient
TCN
temporal convolutional network
TPBVP
two‐point boundary value problem
VI
value iteration
w.p. 1
with probability one
This book is accompanied by a companion website.
www.wiley.com/go/opt4lc
This website includes:
• Data files and templates for solving the exercises
• Solutions to the exercises in the book as a PDF document (Instructors only)
• MATLAB solution files (Instructors only)
This book will take you on a journey through the fascinating field of optimization, where we explore techniques for designing algorithms that can learn and adapt to complex systems. Whether you are an engineer, a scientist, or simply curious about the world of optimization, this book is for you. We start with the basics of optimization and gradually build up to advanced techniques for learning and control. By the end of this book, you will have a solid foundation in optimization theory and practical tools to apply to real-world problems. In this opening chapter, we informally introduce problems and concepts and explain their close interplay using simple formulations and examples. Chapters 2–12 explore these topics with more rigor, and we end this chapter with an outline of the remaining content of the book.
Optimization is about choosing a best option from a set of available alternatives based on a specific criterion. This concept applies to a range of disciplines, including computer science, engineering, operations research, and economics, and has a long history of conceptual and methodological development. One of the most common optimization problems is of the form

$$\underset{x}{\text{minimize}} \quad \frac{1}{2} \sum_{i=1}^{m} f_i(x)^2$$

with variable $x \in \mathbb{R}^n$. This is called a nonlinear least-squares problem, since we are minimizing the squares of the possibly nonlinear functions $f_i : \mathbb{R}^n \to \mathbb{R}$. We will see that the least-squares problem and its generalizations have many applications to learning and control. It is also the backbone of several optimization methods for solving more complicated optimization problems. In Chapter 4, we set the foundations for optimization theory; in Chapter 5, we cover different classes of optimization problems that are relevant for learning and control; and in Chapter 6, we discuss different methods for solving optimization problems numerically.
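As a minimal sketch of how such a problem can be attacked numerically, the following code runs a basic Gauss–Newton iteration (one of the methods for nonlinear least squares discussed later in the book) on a toy one-parameter exponential-fitting problem. The `gauss_newton` helper, the data, and the starting point are our own illustration, not taken from the book:

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=20):
    """Minimize (1/2) * sum_i f_i(x)^2 by Gauss-Newton iterations."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual(x)          # residual vector f(x)
        J = jacobian(x)          # Jacobian of f at x
        # Solve the linearized least-squares problem for the step
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        x = x + step
    return x

# Toy problem: fit y = exp(c * t) to data generated with c = 0.5
t = np.linspace(0.0, 2.0, 10)
y = np.exp(0.5 * t)
res = lambda c: np.exp(c[0] * t) - y          # f_i(c) = exp(c t_i) - y_i
jac = lambda c: (t * np.exp(c[0] * t)).reshape(-1, 1)
c_hat = gauss_newton(res, jac, x0=[0.1])
```

Because the data here are noise-free, the iteration recovers the generating parameter almost exactly; with noisy data it would converge to the least-squares fit instead.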
Learning is often about finding low-dimensional descriptions of data that provide insight and are easy to understand or interpret. A very simple example is the repeated measurement of the same real-valued quantity $x$. This could be the length of a piece of wood. Each measurement may be assumed to be somewhat different because of random measurement errors, i.e. we have a sequence of, say, $N$ measurements $y_1, \ldots, y_N$. Clearly, a reasonable estimate of $x$ would be the average of the measurements, i.e.

$$\hat{x} = \frac{1}{N} \sum_{i=1}^{N} y_i.$$

This is a scalar representation of the measurements, and hence of lower dimension, and it is a reasonable representation of the length of the piece of wood. Here we have estimated the length of the piece of wood by averaging the measurements. The learning we have discussed here is called unsupervised learning, and it is discussed in more detail in Chapter 9.
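The averaging step can be sketched in a few lines; the snippet also checks numerically that the sample mean is exactly the least-squares estimate for repeated measurements, i.e. no perturbed candidate achieves a smaller sum of squared residuals (the data values are illustrative, not from the book):

```python
import numpy as np

# Repeated noisy measurements of the same length (illustrative numbers)
y = np.array([2.31, 2.29, 2.35, 2.28, 2.32])

# Unsupervised "learning" of the length: the sample mean
x_hat = y.mean()

# The mean minimizes sum_i (y_i - x)^2, so it must beat any perturbed candidate
loss = lambda x: np.sum((y - x) ** 2)
assert loss(x_hat) <= loss(x_hat + 0.01) and loss(x_hat) <= loss(x_hat - 0.01)
```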
Let us consider another simple problem where we are driving a car with constant speed $v$. We start at an unknown distance $d$ from a city toward which we are driving, and we receive distance information $y_i$ to the city from road signs, corresponding to having traveled for a time $t_i$, $i = 1, \ldots, N$. The following relation should hold:

$$y_i = d - v t_i = a_i^T x,$$

where $a_i = (1, -t_i)$ and $x = (d, v)$. We assume that we do not know $x$, and we would like to learn it. It is enough to have $N = 2$, i.e. two linear equations, to solve for $x$, and this solution will be unique since the vectors $a_i$ will be linearly independent unless $t_1 = t_2$. However, in an application like this, it is unlikely that the above equations hold exactly due to measurement errors $e_i$, i.e.

$$y_i = a_i^T x + e_i$$

is a more appropriate description, and hence, it might be more suitable to have $N$ larger than 2 to average out the effect of the measurement errors. This can be accomplished by solving

$$\underset{x}{\text{minimize}} \quad \sum_{i=1}^{N} \left(y_i - a_i^T x\right)^2$$

with variable $x$. This is an example of a least-squares problem for which $f_i$ is an affine function, and hence, it is often called a linear least-squares problem. Typically, $N$ is much larger than 2, and therefore, the optimal solution $x^\star$ is of lower dimension than the measurements. If we later get a new value of $t$, we may predict the value of the corresponding measurement as $\hat{y} = d^\star - v^\star t$ without performing the measurement. For our application, this means that we can estimate the distance to the city just by checking how long we have been traveling; we do not have to wait for a new sign to appear. This is a so-called supervised learning problem, since for each $t_i$, we know the corresponding $y_i$. For learning the length of the piece of wood, the data did not come in pairs; instead, we just had one stream of data $y_1, \ldots, y_N$, and we learned the mean of the data. That is the reason for the name unsupervised learning for such a learning problem. We will discuss supervised learning in more detail in Chapter 10.
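A numerical sketch of this supervised learning problem, with hypothetical sign readings generated from the model y_i = d − v·t_i plus small errors (the data, noise values, and variable names are our own illustration):

```python
import numpy as np

# Distance readings y_i after traveling for time t_i, generated with
# start distance d = 100 and speed v = 60 plus small "sign errors"
t = np.array([0.1, 0.2, 0.3, 0.5, 0.8])
y = 100.0 - 60.0 * t + np.array([0.3, -0.2, 0.1, -0.4, 0.2])

# Stack one linear equation per sign: [1, -t_i] @ (d, v) = y_i
A = np.column_stack([np.ones_like(t), -t])
(d_hat, v_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict the remaining distance after, e.g., one hour of driving (t = 1.0)
y_pred = d_hat - v_hat * 1.0
```

With five noisy readings, the least-squares estimate averages out the sign errors, and the prediction at t = 1.0 lands close to the true remaining distance of 40.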
Many physical phenomena can be described with dynamical systems, which in mathematical terms are given by differential equations or difference equations. A simple example is the difference equation

$$y_{k+1} = a y_k + b u_k,$$

where $k$ denotes time and $y_k$ is a physical phenomenon that we can observe. It is influenced both by its previous value and by a quantity denoted by $u_k$. Depending on which field is studying dynamical systems, the pairs $(u_k, y_k)$ are called stimuli and response or input and output, respectively. Sometimes the word signal is appended to each of these four words. Often, the above equation does not hold exactly due to measurement errors $e_k$, and hence, it is more appropriate to consider

$$y_{k+1} = a y_k + b u_k + e_k.$$

When we do not know the parameters $a$ and $b$, we can use supervised learning to learn their values, assuming that we are given pairs of data $(u_k, y_k)$ for $k = 0, \ldots, N$. The following linear least-squares problem

$$\underset{a,\,b}{\text{minimize}} \quad \sum_{k=0}^{N-1} \left(y_{k+1} - a y_k - b u_k\right)^2$$

with variables $a$ and $b$ will provide an estimate of the parameters. Learning for dynamical systems is called system identification, and it will be discussed in more detail in Chapter 12.
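A minimal sketch of this identification step, assuming the first-order difference-equation model y[k+1] = a·y[k] + b·u[k] with illustrative true parameters of our own choosing (here the data are noise-free, so the regression recovers them exactly):

```python
import numpy as np

# Simulate the difference equation y[k+1] = a*y[k] + b*u[k]
# with true parameters a = 0.9, b = 0.5 (illustrative values)
a_true, b_true = 0.9, 0.5
rng = np.random.default_rng(0)
u = rng.standard_normal(50)
y = np.zeros(51)
for k in range(50):
    y[k + 1] = a_true * y[k] + b_true * u[k]

# Least-squares regression: each row [y[k], u[k]] explains y[k+1]
Phi = np.column_stack([y[:-1], u])
theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
a_hat, b_hat = theta
```

With measurement noise added to the simulation, the estimates would only approach the true parameters as the number of data pairs grows.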
Control is about making dynamical systems behave in a way we find desirable. Let us again consider the dynamical system in (1.2), where we are going to influence the behavior by manipulating the input signal $u_k$, which in the context of control is often called the control signal. We assume that the initial value $y_0$ and the parameters $a$ and $b$ are known, and our objective is to make $y_k$ small for $k \geq 1$. We can make $y_1$ equal to zero by taking $u_0 = -a y_0 / b$, and then we can make all future values of $y_k$ equal to zero by taking all future values of $u_k$ equal to zero. However, it could be that the value of $u_0$ is large, and in applications, this could be costly. Hence, we are interested in finding a trade-off between how large the values of $y_k$ are in comparison to the values of $u_k$. This can be accomplished by solving

$$\begin{array}{ll}\underset{u,\,y}{\text{minimize}} & \displaystyle\sum_{k=0}^{N-1} \left(y_{k+1}^2 + \rho u_k^2\right)\\[1ex] \text{subject to} & y_{k+1} = a y_k + b u_k, \quad k = 0, \ldots, N-1,\end{array}$$

with variables $u_0, \ldots, u_{N-1}$ and $y_1, \ldots, y_N$. This is an equality-constrained linear least-squares problem. The parameter $\rho > 0$ can be used to trade off how small $y_k$ should be as compared to $u_k$. We will cover control of dynamical systems in Chapter 7 for continuous time and in Chapter 8 for discrete time.
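One simple way to solve such an equality-constrained least-squares problem is to eliminate the dynamics: the stacked outputs are a linear function of the stacked inputs, which leaves an ordinary least-squares problem in the inputs alone. The following sketch does exactly this; all numerical values (parameters, horizon, trade-off weight) are illustrative:

```python
import numpy as np

# Finite-horizon control of y[k+1] = a*y[k] + b*u[k]:
# minimize sum_k (y[k+1]^2 + rho*u[k]^2) by eliminating the dynamics
a, b, rho, N, y0 = 0.9, 0.5, 1.0, 20, 5.0

# y[k] = a^k*y0 + sum_{j<k} a^(k-1-j)*b*u[j]  ->  stacked:  y = g + F @ u
g = np.array([a**k * y0 for k in range(1, N + 1)])
F = np.zeros((N, N))
for k in range(1, N + 1):
    for j in range(k):
        F[k - 1, j] = a**(k - 1 - j) * b

# Unconstrained least squares in u: minimize ||F u + g||^2 + rho*||u||^2
A = np.vstack([F, np.sqrt(rho) * np.eye(N)])
rhs = np.concatenate([-g, np.zeros(N)])
u_opt, *_ = np.linalg.lstsq(A, rhs, rcond=None)

# Closed-loop simulation with the computed inputs
y = [y0]
for k in range(N):
    y.append(a * y[-1] + b * u_opt[k])
```

By construction, the optimal inputs achieve a lower combined cost than applying no input at all, while avoiding the single large input that would force the output to zero in one step.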
When we discussed control above, we had to know the values of the parameters $a$ and $b$. Had they not been known, we could have used system identification to estimate them and then used the estimated parameters to solve the control problem. However, it is sometimes desirable to skip the system identification step and do control without knowing the parameter values. One way of accomplishing this for the formulation in (1.3) is called reinforcement learning. This will be discussed in more detail in Chapter 11.
The book is organized as follows: first, we give a background on linear algebra and probabilities in Chapters 2 and 3. Background on optimization is given in Chapter 4. We will cover both convex and nonconvex optimization. Chapter 5 introduces different classes of optimization problems that we will encounter in the learning chapters later on. In Chapter 6, we discuss different optimization methods that are suitable for solving learning problems. After this, we discuss calculus of variations in Chapter 7 and dynamic programming in Chapter 8. We then cover unsupervised learning in Chapter 9, supervised learning in Chapter 10, and reinforcement learning in Chapter 11. Finally, we discuss system identification in Chapter 12. For information about notation, basic definitions, and software useful for optimization and for the applications we consider, see the Appendix.
Linear algebra is the study of vector spaces and linear maps on such spaces. It constitutes a fundamental building block in optimization and is used extensively for theoretical analysis and derivations as well as numerical computations.
A procedure for solving systems of simultaneous linear equations appeared as early as in an ancient Chinese mathematical text. Systems of linear equations were introduced in Europe in the seventeenth century by René Descartes in order to represent lines and planes by linear equations and to compute their intersections. Gauss developed the method of elimination. Further important developments were made by Gottfried Wilhelm von Leibniz, Gabriel Cramer, Hermann Grassmann, and James Joseph Sylvester, the latter introducing the term “matrix.”
The purpose of this chapter is to review key concepts from linear algebra and calculus in finite‐dimensional vector spaces as well as a number of useful identities that will be used throughout the book. We also discuss some computational aspects, including a number of matrix factorizations and their application to solving systems of linear equations.
We start by introducing vectors and matrices. A vector of length is an ordered collection of numbers,
where is the th element or entry of the vector . The ‐dimensional real space, denoted , is the set of real‐valued ‐vectors, i.e. vectors of length whose elements are all real. The product of a scalar and a vector is defined as , the sum of two real‐valued ‐vectors and is the vector , and the Euclidean inner product or dot product of and is the real number
The vectors and are said to be orthogonal if .
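These elementwise operations and the inner product can be written out directly; a small sketch for plain Python lists:

```python
# Scalar-vector product, vector addition, and the Euclidean inner
# product, written out for plain Python lists.

def scale(alpha, x):
    return [alpha * xi for xi in x]

def add(x, y):
    return [xi + yi for xi, yi in zip(x, y)]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

x = [1.0, 2.0, 2.0]
y = [2.0, 0.0, -1.0]

# dot(x, y) = 1*2 + 2*0 + 2*(-1) = 0, so x and y are orthogonal.
assert dot(x, y) == 0.0
```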
A matrix of size ‐by‐, also written as , is an ordered rectangular array that consists of elements arranged in rows and columns, i.e.
where denotes the element of in its th row and th column. The set of ‐by‐ matrices with real‐valued elements is denoted by . The transpose of is the matrix defined as
i.e. the th element of is the th element of .
It is often convenient to think of a vector as a matrix with a single row or column. For example, when interpreted as a matrix of size , the vector is referred to as a row vector, and similarly, when interpreted as a matrix of size , is referred to as a column vector. In this book, we use the convention that all vectors are column vectors. Thus, a vector is always interpreted as the column vector
and hence, is interpreted as the row vector . Similarly, to refer to the columns of a matrix , we will sometimes use the notation
where . When referring to the rows of , we will define
where such that is the matrix with rows . The notation is somewhat ambiguous because may refer to the th element of a vector or the th column of either or , but the meaning will be clear from the context and follows from our convention that vectors are column vectors.
Given two vectors , the inner product can also be expressed as the product
In contrast, the outer product of two vectors and , not necessarily of the same length, is defined as the matrix
The product of a matrix , with columns , and a vector is the linear combination
Equivalently, the th element of the vector is the inner product of and the th row of .
The vector inner and outer products and matrix–vector multiplication are special cases of matrix multiplication. Two matrices and are said to be conformable for multiplication if the number of columns in is equal to the number of rows in . Given two such matrices and , the product is the matrix with elements
Note that is the inner product of the th row of and the th column of . As a result, may be expressed as
where are the rows of and are the columns of . Equivalently, by expressing in terms of its columns and in terms of its rows, may also be written as the sum of outer products
where are the columns of and are the rows of .
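The equivalence between the entrywise definition of the matrix product and the sum-of-outer-products form can be checked numerically; a small illustrative sketch, with matrices stored as lists of rows:

```python
# Matrix product C = A*B computed two ways: entrywise, via inner
# products of the rows of A with the columns of B, and as a sum of
# outer products of the columns of A with the rows of B.
# Matrices are stored as lists of rows.

def matmul(A, B):
    m, n, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

A = [[1, 2], [3, 4], [5, 6]]      # 3-by-2
B = [[7, 8, 9], [10, 11, 12]]     # 2-by-3

C1 = matmul(A, B)

# Sum of outer products: k-th column of A times k-th row of B.
C2 = [[0] * 3 for _ in range(3)]
for k in range(2):
    col_k = [row[k] for row in A]
    P = outer(col_k, B[k])
    C2 = [[c + p for c, p in zip(rc, rp)] for rc, rp in zip(C2, P)]

assert C1 == C2
```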
It is important to note that matrix multiplication is associative, but unlike scalar multiplication, it is not commutative. In other words, the associative property holds, provided that , , and are conformable for multiplication, but the identity does not hold in general.
The Frobenius inner product of two matrices is defined as
This is also called the trace inner product of and since
where the trace of a square matrix is defined as
The inner product may also be written as , where maps a matrix to a vector of length by stacking the columns of , i.e.
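The identity between the Frobenius inner product, the trace form, and the stacked-vector form can likewise be verified directly; an illustrative sketch:

```python
# The Frobenius inner product of A and B equals both tr(A^T B) and the
# inner product of the stacked column vectors vec(A) and vec(B).
# Matrices are stored as lists of rows.

def frob(A, B):
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def vec(A):
    # Stack the columns of A into a single long vector.
    m, n = len(A), len(A[0])
    return [A[i][j] for j in range(n) for i in range(m)]

def trace_AtB(A, B):
    # (A^T B)_{jj} = sum_i A_{ij} B_{ij}; the trace sums over j.
    m, n = len(A), len(A[0])
    return sum(A[i][j] * B[i][j] for j in range(n) for i in range(m))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

v = sum(a * b for a, b in zip(vec(A), vec(B)))
assert frob(A, B) == trace_AtB(A, B) == v == 70
```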
The span of vectors is the set of all linear combinations of these vectors, i.e.
The set of vectors is said to be linearly independent if
and otherwise, the set is said to be linearly dependent. Equivalently, the set is linearly independent if and only if all vectors in have a unique representation as a linear combination of . The span of a set of linearly independent vectors is a ‐dimensional subspace of , and is a basis for the subspace. In other words, the dimension of a subspace is equal to the number of linearly independent vectors that span the subspace. The vectors are said to be orthonormal if they are mutually orthogonal and of unit length, i.e. if and for . The standard basis or natural basis for is the orthonormal basis , where is the unit vector whose th element is equal to 1, and the rest are zero.
The range of a matrix , denoted , is the span of its columns. This is also referred to as the column space of , whereas is referred to as the row space of . The dimension of is called the rank of , denoted . The null space of , denoted , consists of all vectors such that , i.e.
The dimension of is called the nullity of and is denoted . The null space of is said to be trivial if in which case .
It follows from the definition of the range and nullspace of a matrix that
To see that , note that for every , we have that
or equivalently, if we let , we see that for all . This shows that and , which are both subspaces of , are orthogonal complements. Similarly, for every , it immediately follows that for all , and hence, is the orthogonal complement of .
The result (2.3) can be used to derive the so‐called rank‐nullity theorem which states that
To see this, note that the identity combined with the fact that
implies that . The rank‐nullity theorem follows from the identity
which we will now derive. First, suppose that , and let be a linearly independent set of vectors that span . It follows that the set of vectors is linearly independent since
As a result, we have that . Applying the same inequality to implies that . A direct consequence of this identity is that . We say that has full rank if , and we will use the term full row rank when and the term full column rank when . The four subspaces , , , and and their dimensions are illustrated in Figure 2.1.
Figure 2.1 The four subspaces associated with an matrix with rank .
If two matrices and are conformable for multiplication, then
This follows from the fact that and which implies that and . Furthermore, we have that when has full row rank, in which case .
If and are conformable for addition, then
This means that rank is subadditive, and it follows from the fact that where
is the Minkowski sum of and . Subadditivity implies that a rank matrix can be decomposed as the sum of or more rank 1 matrices. Thus, a rank matrix can be factorized as
where and are the columns of and , respectively, and and .
A matrix with an equal number of rows and columns is called a square matrix, and the order of such a matrix refers to its size. The square matrix with ones on the main diagonal and zeros elsewhere is an identity matrix and denoted by , i.e. if and otherwise, . Furthermore, a square matrix of order is said to be nonsingular or invertible if it has full rank, and otherwise, it is said to be singular. Equivalently, a square matrix is invertible if there exists a matrix such that . Such a matrix is unique if it exists, and it is called the inverse of and is denoted by . In the special case where , the matrix is said to be orthogonal. The columns of an orthogonal matrix are orthonormal since implies that . More generally, if two matrices and are conformable for multiplication, i.e. not necessarily square matrices, and , then is a right inverse of whereas is a left inverse of .
The determinant of a scalar matrix is the scalar itself, whereas the determinant of a square matrix of order may be defined recursively as
where denotes the minor of which is the determinant of the matrix that is obtained by removing the th row and th column of . This expression is a so‐called Laplace expansion of the determinant along the th column of , and it holds for every . As a special case, the determinant of a matrix may be expressed as
and its absolute value may be interpreted as the area of a parallelogram defined by the columns of , as illustrated in Figure 2.2. More generally, the absolute value of the determinant of an matrix is the volume of the parallelotope defined by the columns of , i.e. the set
As a result, if and only if has full rank.
Figure 2.2 Parallelogram defined by two vectors and in . The area is given by , where is the projection of onto the line with normal .
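The recursive Laplace expansion translates directly into code; a sketch that expands along the first column (exponential cost, so only sensible for small matrices):

```python
# Determinant by Laplace (cofactor) expansion along the first column.
# This mirrors the recursive definition in the text but is far too
# expensive for large matrices.

def minor(A, i, j):
    # Matrix obtained by deleting row i and column j of A.
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]

def det(A):
    n = len(A)
    if n == 1:
        return A[0][0]
    return sum((-1) ** i * A[i][0] * det(minor(A, i, 0)) for i in range(n))

# 2-by-2 case: det = ad - bc.
assert det([[1, 2], [3, 4]]) == 1 * 4 - 2 * 3

# A singular 3-by-3 matrix (third row = first row + second row).
assert det([[1, 2, 3], [4, 5, 6], [5, 7, 9]]) == 0
```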
The term is called the cofactor of . The matrix composed of all the cofactors, i.e. the matrix with elements
is called the cofactor matrix. Expressed in terms of the cofactors, the Laplace expansion (2.8) may be written as the inner product of the th column of and , i.e.
Furthermore, since the Laplace expansion is valid for any , the diagonal elements of the matrix are all equal to . In fact, it can be shown that
and if is nonsingular, this identity implies that
where , the transpose of the cofactor matrix, is known as the adjugate matrix which is denoted by .
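The adjugate formula for the inverse can be sketched as follows; this is a pedagogical illustration rather than a practical method, since the cost grows factorially with the order of the matrix:

```python
# Inverse via the adjugate: the (i, j) entry of the inverse is the
# cofactor C_{ji} divided by the determinant.

def minor(A, i, j):
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]

def det(A):
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** i * A[i][0] * det(minor(A, i, 0))
               for i in range(len(A)))

def inverse(A):
    n, d = len(A), det(A)
    # Entry (i, j) of the adjugate is the cofactor of entry (j, i).
    return [[(-1) ** (i + j) * det(minor(A, j, i)) / d for j in range(n)]
            for i in range(n)]

A = [[4.0, 7.0], [2.0, 6.0]]     # det = 10
Ainv = inverse(A)                # [[0.6, -0.7], [-0.2, 0.4]]
```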
Figure 2.3 Affine combinations of two points and .
Given two points and , the point
is a so‐called affine combination of and . The set of all such affine combinations of and , which may be expressed as
is a line in if . This is illustrated in Figure 2.3. More generally, an affine combination of points is a linear combination
where . A set is an affine set if and only if it contains all affine combinations of its points. Equivalently, is affine if for every pair of points ,
An example of an affine set is the set of solutions to the system of equations with and , i.e.
In fact, every affine subset of may be expressed in this form. The affine hull of a set , which we denote , is the smallest possible affine set that contains .
Equipped with the Euclidean inner product (2.1), the set is a Euclidean vector space of dimension with the induced norm
More generally, a norm of a vector , denoted , is a function with the following defining properties:
Subadditive: ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all vectors x and y (the triangle inequality).
Absolutely homogeneous: ‖αx‖ = |α| ‖x‖ for every scalar α and vector x.
Positive definite: ‖x‖ = 0 if and only if x = 0.
Nonnegative: ‖x‖ ≥ 0 for all x.
Such a function is not unique, so to distinguish between different norms, a subscript is typically used to denote specific norms. For example, the 2‐norm or Euclidean norm of a vector , defined in (2.10), is denoted . Other examples are the 1‐norm and the infinity norm, defined as
These are special cases of the more general vector ‐norm
which is defined for .
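A sketch of the vector p-norm, with the 1-, 2-, and infinity norms as special and limiting cases:

```python
import math

# The vector p-norm for p >= 1; the infinity norm is the limit p -> inf.
def pnorm(x, p):
    if p == math.inf:
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, -4.0]
assert pnorm(x, 1) == 7.0          # 1-norm: sum of absolute values
assert pnorm(x, 2) == 5.0          # 2-norm: Euclidean length
assert pnorm(x, math.inf) == 4.0   # infinity norm: largest magnitude
```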
A norm on is said to be orthogonally invariant if for all and all orthogonal matrices . The 2‐norm is orthogonally invariant, which follows by noting that if is an orthogonal matrix.
The following inequality, known as the Cauchy–Schwarz inequality, is used extensively in linear algebra and optimization:
The inequality can be derived by noting that
where the inequality follows from the nonnegativity property of the norm. An immediate consequence is that for all and , from which the Cauchy–Schwarz inequality (2.12) follows by rearranging the terms and taking the square root.
The Frobenius norm of is a vector norm on induced by the Frobenius inner product, i.e.
This is sometimes referred to as an entrywise norm. In contrast, given both vector norms on and , the operator norm or induced norm on may be defined as
It follows directly from this definition that for all and that . The induced norm may also be expressed as
from which the submultiplicative property follows:
The matrix ‐norm of is the norm induced by the vector ‐norm, and it is denoted . The next example considers the matrix 1‐norm, and the matrix infinity norm is treated in Exercise 2.3.
The matrix 1‐norm on is the induced norm defined as
where denote the columns of . To verify the third equality, first note that
where are the columns of the identity matrix of order . Moreover, subadditivity, i.e. the triangle inequality, implies that
which shows that .
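The maximum-absolute-column-sum characterization can be checked against the induced-norm definition; a small sketch in which the supremum is attained at a standard basis vector:

```python
# The matrix 1-norm is the maximum absolute column sum.

def one_norm(A):
    n = len(A[0])
    return max(sum(abs(row[j]) for row in A) for j in range(n))

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = [[1, -2], [3, 4]]
assert one_norm(A) == 6          # max(|1| + |3|, |-2| + |4|)

# A*e_j has 1-norm equal to the j-th column sum, so the bound is tight.
e2 = [0, 1]
assert sum(abs(v) for v in matvec(A, e2)) == one_norm(A)
```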
The number of arithmetic operations on floating‐point numbers required by an algorithm is often used as a rough measure of its complexity. One floating‐point operation or “FLOP” refers to a single floating‐point addition, subtraction, multiplication, or division, and an algorithm's FLOP count is the total number of FLOPs. When expressed in terms of the problem size, the FLOP count often provides a useful characterization of the asymptotic growth of the running time. However, FLOP counts generally cannot be used to accurately predict computation time on modern computers. Table 2.1 shows the FLOP counts for some basic operations involving vectors and/or matrices. Since we are mostly interested in the asymptotic growth rate as one or several problem dimensions increase, it is customary to use the so‐called “Big O” notation. A function is said to be “big O” of , which is expressed mathematically as , if there exist scalars , , and such that
The function provides an upper bound on the growth rate of as the parameters and/or increase. For example, the FLOP count for the matrix–vector product is since for all , . Similarly, the function satisfies since for , .
Table 2.1 FLOP counts for basic matrix and vector operations: α denotes a scalar, x and y are n-vectors, and A and B are matrices of size m-by-n and n-by-p, respectively.

Operation                                  FLOPs
Scaling (αx)                               n
Vector addition/subtraction (x + y)        n
Inner product (x^T y)                      2n - 1
Matrix–vector multiplication (Ax)          2mn - m
Matrix–matrix multiplication (AB)          2mnp - mp
Recall that matrix multiplication is associative, i.e. given three matrices , , and that are conformable for multiplication, we have that . This implies that the product of three or more matrices can be evaluated in several ways, each of which may have a different FLOP count. For example, the product with , , and may be evaluated left‐to‐right by first computing and then , or right‐to‐left by computing and then . The first approach requires FLOPs, whereas the second approach requires FLOPs. A special case is the product with , , and which requires FLOPs when evaluated left‐to‐right, but only FLOPs when evaluated right‐to‐left. More generally, the product of matrices requires matrix–matrix multiplications which can be carried out in any order since matrix multiplication is associative. The problem of finding the order that yields the lowest FLOP count is a combinatorial optimization problem known as the matrix chain ordering problem; see Exercise 5.7. This problem can be solved by means of dynamic programming, which we introduce in Section 5.5 and discuss in more detail in Chapter 8.
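The cost difference between evaluation orders can be made concrete with the approximate count of 2mnp FLOPs per (m-by-n)-by-(n-by-p) product; the factor 2 and the concrete sizes below are illustrative assumptions:

```python
# Approximate FLOP count for multiplying an m-by-n matrix by an
# n-by-p matrix.

def flops(m, n, p):
    return 2 * m * n * p

m, n, p = 100, 100, 100   # A: m-by-n, B: n-by-p, c: p-by-1

left_to_right = flops(m, n, p) + flops(m, p, 1)   # (A*B)*c
right_to_left = flops(n, p, 1) + flops(m, n, 1)   # A*(B*c)

# Evaluating right-to-left avoids the expensive matrix-matrix product.
assert right_to_left < left_to_right
```

Here the left-to-right order costs about 2,020,000 FLOPs, while the right-to-left order costs only about 40,000, since the latter never forms a matrix-matrix product.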
Matrices are sometimes classified according to their mathematical properties and structure, which are often of interest from a computational point of view. This section provides a brief review of some of the most frequently encountered kinds of structure in this book.
A diagonal matrix is a square matrix in which all off‐diagonal entries are equal to zero. Such a matrix may be stored as a vector of length rather than as a general square matrix with entries, which amounts to significant saving in terms of storage when is large. We use the notation
when referring to the diagonal matrix with the elements of on its diagonal. Diagonal matrices are also attractive from a computational point of view. For example, a matrix–vector product of the form involves only FLOPs, whereas a matrix–vector product with a general square matrix requires FLOPs. Similarly, the inverse of a nonsingular diagonal matrix is also diagonal, i.e. .
Recall that a square matrix of order is said to be orthogonal if , or equivalently, . The product of two orthogonal matrices and of order is itself orthogonal, which follows by noting that .
A basic example of an orthogonal matrix of order 2 is a rotation matrix of the form
The action of such a matrix corresponds to the counterclockwise rotation by an angle about the origin in . More generally, a Givens rotation in is a rotation in a two‐dimensional plane defined by two coordinate axes, say, and . Such a transformation can be represented by an orthogonal matrix of order
where and such that the principal submatrix defined by and is of the form (2.16). Givens rotations are used frequently in linear algebra as a means to introduce zeros in a vector or a matrix. For example, a nonzero vector can be transformed into the vector with by means of a rotation. Indeed, it is easy to verify that
if we let and .
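A sketch of this construction follows; note that sign and transpose conventions for Givens rotations vary, and this sketch assumes the common convention in which the matrix [[c, s], [-s, c]] is applied from the left:

```python
import math

# Rotation that maps (x1, x2) to (r, 0) with r = sqrt(x1^2 + x2^2).
def givens(x1, x2):
    r = math.hypot(x1, x2)
    return x1 / r, x2 / r, r

x1, x2 = 3.0, 4.0
c, s, r = givens(x1, x2)

y1 = c * x1 + s * x2     # equals r
y2 = -s * x1 + c * x2    # equals 0, the introduced zero

assert abs(y1 - r) < 1e-12 and abs(y2) < 1e-12
```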
A Householder transformation or Householder reflection in is a linear transformation that corresponds to the reflection about a hyperplane that contains the origin. Such a transformation can be represented in terms of an orthogonal matrix of the form
where is normal to the reflection plane. This is illustrated in Figure 2.4. Matrices of the form (2.18) are referred to as Householder matrices, and they are typically stored implicitly. In other words, only the vector , which uniquely determines , is stored rather than storing directly. This requires storage rather than . Moreover, matrix–vector products with a Householder matrix can be computed in FLOPs by noting that the operation amounts to a vector operation, i.e. with . A Householder transformation may be used to simultaneously introduce up to zeros in a vector of length , see Exercise 2.4.
Figure 2.4 Householder reflection in .
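The O(n) application of a Householder matrix without forming it explicitly can be sketched as follows; the choice of normal vector used below to zero all but the first component of a vector is a standard one:

```python
import math

# Apply H = I - 2*v*v^T/(v^T v) to x in O(n) operations: two inner
# products and one vector update, without ever forming H.
def house_apply(v, x):
    beta = 2.0 / sum(vi * vi for vi in v)
    alpha = beta * sum(vi * xi for vi, xi in zip(v, x))
    return [xi - alpha * vi for vi, xi in zip(v, x)]

# Reflecting x about the hyperplane with normal v = x - ||x||*e1
# maps x to (||x||, 0, ..., 0), introducing zeros below the first entry.
x = [3.0, 4.0, 0.0]
nrm = math.sqrt(sum(xi * xi for xi in x))   # 5.0
v = [x[0] - nrm] + x[1:]
y = house_apply(v, x)

assert all(abs(yi - ti) < 1e-12 for yi, ti in zip(y, [nrm, 0.0, 0.0]))
```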
Permutation matrices are yet another family of orthogonal matrices. A permutation of the elements of a vector is a reordering of those elements. This can be expressed as a map of the form
where the function defines the permutation and satisfies . This map can be expressed as a matrix–vector product , where is the permutation matrix defined as
It is easy to verify that since whenever , which implies that is orthogonal. It follows that the map (2.19) is invertible, and is the permutation matrix that corresponds to the inverse permutation . The special case where corresponds to reversing the order of elements and leads to the permutation matrix
which is the so‐called exchange matrix of order .
A square matrix in which the elements above or below the main diagonal are zero is called a triangular matrix. A matrix is said to be lower triangular if , whereas is said to be upper triangular if . Furthermore, a triangular matrix with ones on its diagonal is said to be unit triangular. The determinant of a triangular matrix of order may be expressed as
which follows from the Laplace expansion (2.8). A direct consequence is that is nonsingular if and only if all of the diagonal entries of are nonzero. The inverse of a nonsingular lower (upper) triangular matrix is another lower (upper) triangular matrix. This follows from the identity (2.9) by noting that the cofactor matrix associated with is itself triangular. To see this, first recall that the minor of is the determinant of the matrix obtained by deleting the th row and the th column of . This implies that if, say, is lower triangular and , then the effect of deleting the th row and the th column of is a triangular matrix of order . This matrix will have at least one diagonal entry that is equal to zero, and hence, it is singular, which implies that the minor of is zero. We note that the product of two lower (upper) triangular matrices is another lower (upper) triangular matrix.
Given a nonsingular triangular matrix , it is possible to compute matrix–vector products of the form and without computing first. For example, if is lower triangular, then can be computed by means of forward substitution, i.e.
This requires approximately FLOPs. Similarly, if is an upper triangular matrix, then can be computed using backward substitution, i.e.
which also requires approximately FLOPs.
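Forward and backward substitution translate directly into code; a sketch for triangular matrices stored as lists of rows:

```python
# Solve L*x = b (L lower triangular) by forward substitution and
# U*x = b (U upper triangular) by backward substitution; both take
# roughly n^2 FLOPs.

def forward_sub(L, b):
    x = []
    for i in range(len(b)):
        s = sum(L[i][j] * x[j] for j in range(i))
        x.append((b[i] - s) / L[i][i])
    return x

def backward_sub(U, b):
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / U[i][i]
    return x

L = [[2.0, 0.0], [1.0, 3.0]]
assert forward_sub(L, [4.0, 5.0]) == [2.0, 1.0]

U = [[2.0, 1.0], [0.0, 3.0]]
assert backward_sub(U, [5.0, 3.0]) == [2.0, 1.0]
```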
A square matrix is said to be symmetric if , and
is the set of symmetric matrices of order . Similarly, a square matrix is skew‐symmetric if , which implies that the diagonal elements of a skew‐symmetric matrix are equal to zero. A square matrix can always be decomposed into the sum of a symmetric and a skew‐symmetric matrix, i.e. can be expressed as , where is symmetric and is skew‐symmetric. We note that this is an orthogonal decomposition of since the Frobenius inner product of a symmetric and a skew‐symmetric matrix is always zero.
Equipped with the Frobenius inner product, which is defined in (2.2), the set is a Euclidean vector space of dimension . It is sometimes convenient to work with a vector representation of a symmetric matrix, and to this end, we introduce the bijective linear map defined as
i.e. stacks the lower triangular part of the columns of and scales the off‐diagonal entries by . It is straightforward to verify that this definition implies that for all ,
which means that the transformation preserves inner products.
A Toeplitz matrix of order n is a square matrix of the form
where the first row and column of define the subdiagonal, diagonal, and superdiagonal elements. Such a matrix is sometimes referred to as a diagonal‐constant matrix, and it is uniquely determined by numbers in contrast to a general matrix, which has degrees of freedom. A special case of (2.23) is the so‐called lower‐shift matrix of order , which has ones on the first subdiagonal and zeros elsewhere, i.e.
Similarly, is sometimes referred to as the upper‐shift matrix of order . It is easy to verify that with is the matrix with ones on the th subdiagonal and zeros elsewhere. Using the convention that , we may express any Toeplitz matrix of the form (2.23) as
A lower triangular Toeplitz matrix whose first column is given by can be expressed as
It then follows from the fact that that is nonsingular if and only if . Moreover, a matrix of the form (2.26) commutes with , which follows from the fact that commutes with , i.e. . It can also be shown that the inverse of a nonsingular lower triangular Toeplitz matrix is another lower triangular Toeplitz matrix, see Exercise 2.6.
Toeplitz matrices are closely related to another class of matrices called Hankel matrices. Like a Toeplitz matrix, a Hankel matrix is a square matrix that is uniquely determined by its first row and column. However, unlike a Toeplitz matrix, a Hankel matrix has constant entries along the anti‐diagonal, the anti‐subdiagonals, and anti‐superdiagonals. As a consequence, the exchange matrix maps a Toeplitz matrix to a Hankel matrix and vice versa, i.e. if is a Toeplitz matrix, then and are both Hankel matrices. We note that Hankel matrices are symmetric, whereas Toeplitz matrices are persymmetric, i.e. symmetric about the anti‐diagonal. Finally, we note that the notion of Toeplitz (Hankel) matrices can be extended to include nonsquare matrices with diagonal‐constant (anti‐diagonal‐constant) structure, and such matrices may be viewed as submatrices of square Toeplitz (Hankel) matrices.
A matrix is said to be sparse or sparsely populated if a large number of its entries are equal to zero. In contrast, a matrix with few or no zero entries is said to be dense. We will use to denote the number of nonzero entries in a matrix . Sparse matrices can be stored efficiently by storing only the nonzero entries. For example, storing a sparse matrix as a list of triplets of the form requires rather than storage. Similarly, a matrix–vector product requires only FLOPs if sparsity is exploited as opposed to FLOPs if is a general dense matrix.
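A sketch of a sparse matrix–vector product using the triplet representation described above:

```python
# Matrix-vector product for a sparse matrix stored as (i, j, value)
# triplets: the cost is proportional to the number of nonzeros rather
# than to the full m*n size of the matrix.

def sparse_matvec(triplets, x, m):
    y = [0.0] * m
    for i, j, v in triplets:
        y[i] += v * x[j]
    return y

# A 3-by-3 matrix with only 3 nonzero entries.
triplets = [(0, 0, 2.0), (1, 2, -1.0), (2, 1, 4.0)]
x = [1.0, 2.0, 3.0]
y = sparse_matvec(triplets, x, 3)

assert y == [2.0, -3.0, 8.0]
```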
A sparse matrix is said to be banded if all its nonzero entries are located within a band about the main diagonal. The bandwidth of is the smallest integer such that if . In other words, all entries below the th subdiagonal or above the superdiagonal are equal to zero. The lower bandwidth of is a nonnegative integer such that if , and similarly, the upper bandwidth of is a nonnegative integer such that if . A bandwidth of 0 corresponds to a diagonal matrix, whereas a bandwidth of 1 corresponds to a tridiagonal matrix or, if or , an upper or a lower bidiagonal matrix.
A real ‐ary quadratic form is a homogeneous quadratic polynomial in variables, or equivalently, a function from to of the form
for some . It is straightforward to verify that
which implies that only the symmetric part of contributes to the value of . We therefore limit our attention to the case where .
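This is easy to verify numerically; a small sketch showing that a quadratic form is unchanged when the matrix is replaced by its symmetric part:

```python
# Only the symmetric part (A + A^T)/2 contributes to a quadratic form,
# since x^T M x = 0 whenever M is skew-symmetric.

def quad(A, x):
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

A = [[1.0, 5.0], [-3.0, 2.0]]                # not symmetric
S = [[(A[i][j] + A[j][i]) / 2 for j in range(2)] for i in range(2)]
x = [2.0, -1.0]

assert quad(A, x) == quad(S, x)
```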
A matrix is positive semidefinite if and only if for all , and it is positive definite if and only if for all . Similarly, is negative (semi)definite if is positive (semi)definite, and it is indefinite