Optimization for Learning and Control - Anders Hansson - E-Book


Anders Hansson

Description

Optimization for Learning and Control is a comprehensive resource providing a master's-level introduction to optimization theory and algorithms for learning and control. It describes how optimization is used in these domains, giving a thorough introduction to unsupervised learning, supervised learning, and reinforcement learning, with an emphasis on optimization methods for large-scale learning and control problems. Several application areas are also discussed, including signal processing, system identification, optimal control, and machine learning.

Today, most of the material on the optimization aspects of deep learning that is accessible to students at the master's level focuses on surface-level computer programming; deeper knowledge about the optimization methods and the trade-offs behind them is not provided. The objective of this book is to make this scattered knowledge, currently available mainly in academic journal publications, accessible to master's students in a coherent way. The focus is on basic algorithmic principles and trade-offs.

Optimization for Learning and Control covers sample topics such as:

* Optimization theory and optimization methods, covering classes of optimization problems such as least-squares problems, quadratic problems, conic optimization problems, and rank optimization.
* First-order methods, second-order methods, variable metric methods, and methods for nonlinear least-squares problems.
* Stochastic optimization methods, augmented Lagrangian methods, interior-point methods, and conic optimization methods.
* Dynamic programming for solving optimal control problems and its generalization to reinforcement learning.
* How optimization theory is used to develop theory and tools of statistics and learning, e.g. the maximum likelihood method, expectation maximization, k-means clustering, and support vector machines.
* How calculus of variations is used in optimal control and for deriving the family of exponential distributions.

Optimization for Learning and Control is an ideal resource on the subject for scientists and engineers learning about which optimization methods are useful for learning and control problems; the text will also appeal to industry professionals using machine learning for different practical applications.


Pages: 654

Publication year: 2023




Table of Contents

Cover

Title Page

Copyright

Dedication

Preface

Acknowledgments

Glossary

Acronyms

About the Companion Website

Part I: Introductory Part

1 Introduction

1.1 Optimization

1.2 Unsupervised Learning

1.3 Supervised Learning

1.4 System Identification

1.5 Control

1.6 Reinforcement Learning

1.7 Outline

2 Linear Algebra

2.1 Vectors and Matrices

2.2 Linear Maps and Subspaces

2.3 Norms

2.4 Algorithm Complexity

2.5 Matrices with Structure

2.6 Quadratic Forms and Definiteness

2.7 Spectral Decomposition

2.8 Singular Value Decomposition

2.9 Moore–Penrose Pseudoinverse

2.10 Systems of Linear Equations

2.11 Factorization Methods

2.12 Saddle‐Point Systems

2.13 Vector and Matrix Calculus

Exercises

Note

3 Probability Theory

3.1 Probability Spaces

3.2 Conditional Probability

3.3 Independence

3.4 Random Variables

3.5 Conditional Distributions

3.6 Expectations

3.7 Conditional Expectations

3.8 Convergence of Random Variables

3.9 Random Processes

3.10 Markov Processes

3.11 Hidden Markov Models

3.12 Gaussian Processes

Exercises

Notes

Part II: Optimization

4 Optimization Theory

4.1 Basic Concepts and Terminology

4.2 Convex Sets

4.3 Convex Functions

4.4 Subdifferentiability

4.5 Convex Optimization Problems

4.6 Duality

4.7 Optimality Conditions

Exercises

5 Optimization Problems

5.1 Least‐Squares Problems

5.2 Quadratic Programs

5.3 Conic Optimization

5.4 Rank Optimization

5.5 Partial Separability

5.6 Multiparametric Optimization

5.7 Stochastic Optimization

Exercises

Notes

6 Optimization Methods

6.1 Basic Principles

6.2 Gradient Descent

6.3 Newton's Method

6.4 Variable Metric Methods

6.5 Proximal Gradient Method

6.6 Sequential Convex Optimization

6.7 Methods for Nonlinear Least‐Squares

6.8 Stochastic Optimization Methods

6.9 Coordinate Descent Methods

6.10 Interior‐Point Methods

6.11 Augmented Lagrangian Methods

Exercises

Part III: Optimal Control

7 Calculus of Variations

7.1 Extremum of Functionals

7.2 The Pontryagin Maximum Principle

7.3 The Euler–Lagrange Equations

7.4 Extensions

7.5 Numerical Solutions

Exercises

Notes

8 Dynamic Programming

8.1 Finite Horizon Optimal Control

8.2 Parametric Approximations

8.3 Infinite Horizon Optimal Control

8.4 Value Iterations

8.5 Policy Iterations

8.6 Linear Programming Formulation

8.7 Model Predictive Control

8.8 Explicit MPC

8.9 Markov Decision Processes

8.10 Appendix

Exercises

Notes

Part IV: Learning

9 Unsupervised Learning

9.1 Chebyshev Bounds

9.2 Entropy

9.3 Prediction

9.4 The Viterbi Algorithm

9.5 Kalman Filter on Innovation Form

9.6 Viterbi Decoder

9.7 Graphical Models

9.8 Maximum Likelihood Estimation

9.9 Relative Entropy and Cross Entropy

9.10 The Expectation Maximization Algorithm

9.11 Mixture Models

9.12 Gibbs Sampling

9.13 Boltzmann Machine

9.14 Principal Component Analysis

9.15 Mutual Information

9.16 Cluster Analysis

Exercises

Notes

10 Supervised Learning

10.1 Linear Regression

10.2 Regression in Hilbert Spaces

10.3 Gaussian Processes

10.4 Classification

10.5 Support Vector Machines

10.6 Restricted Boltzmann Machine

10.7 Artificial Neural Networks

10.8 Implicit Regularization

Exercises

Note

11 Reinforcement Learning

11.1 Finite Horizon Value Iteration

11.2 Infinite Horizon Value Iteration

11.3 Policy Iteration

11.4 Linear Programming Formulation

11.5 Approximation in Policy Space

11.6 Appendix – Root‐Finding Algorithms

Exercises

Note

12 System Identification

12.1 Dynamical System Models

12.2 Regression Problem

12.3 Input–Output Models

12.4 Missing Data

12.5 Nuclear Norm System Identification

12.6 Gaussian Processes for Identification

12.7 Recurrent Neural Networks

12.8 Temporal Convolutional Networks

12.9 Experiment Design

Exercises

Notes

Appendix A

A.1 Notation and Basic Definitions

A.2 Software

References

Index

End User License Agreement

List of Tables

Chapter 2

Table 2.1 FLOP counts for basic matrix and vector operations: denotes a s...

Table 2.2 The four subspaces and the corresponding projection matrices.

Chapter 6

Table 6.1 Asymptotic behavior of step‐size sums and the resulting upper bou...

Table 6.2 Barrier functions for elementary proper convex cones.

List of Illustrations

Chapter 2

Figure 2.1 The four subspaces associated with an matrix with rank .

Figure 2.2 Parallelogram defined by two vectors and in . The area is gi...

Figure 2.3 Affine combinations of two points and .

Figure 2.4 Householder reflection in .

Figure 2.5 The black squares in the sparsity pattern (left) denote nonzero e...

Figure 2.6 Symbolic Cholesky factorizations of two different symmetric permu...

Chapter 4

Figure 4.1 The epigraph of a function and an example of a sublevel set ....

Figure 4.2 Stationary points of a continuously differentiable function .

Figure 4.3 A set is convex if the line segment between two points and is...

Figure 4.4 Convex and conic combinations of two points and . (a) Conic co...

Figure 4.5 A hyperplane and a halfspace in : (a) hyperplane and (b) halfspa...

Figure 4.6 Examples of polyhedral sets: the set to the left is the interse...

Figure 4.7 Examples of norm balls in . The ‐norm ball and the ‐norm ball ...

Figure 4.8 The nonnegative orthant in .

Figure 4.9 The second‐order cone in .

Figure 4.10 An example of a convex cone and its dual cone in . Note tha...

Figure 4.11 A function is convex if the line segment connecting the two po...

Figure 4.12 The epigraph of a function is a convex set if and only if is...

Figure 4.13 A continuously differentiable function is convex if and only i...

Figure 4.14 The pointwise maximum of convex functions is itself convex.

Figure 4.15 The conjugate function is the largest signed vertical distance...

Figure 4.16 Illustration of subderivatives and subdifferential of nonsmooth ...

Figure 4.17 The optimality condition (4.54) implies that an optimal point ...

Chapter 5

Figure 5.1 Example of a two‐dimensional localization problem. The anchors ...

Figure 5.2 The exponential cone and examples of the power cone : (a) expo...

Figure 5.3 The optimal value associated with the rank‐constrained problem (5...

Figure 5.4 The interaction and clique intersection graphs associated with th...

Figure 5.5 A computational tree associated with the optimization problem (5....

Chapter 6

Figure 6.1 Illustration of the Armijo condition. The gray regions correspond...

Figure 6.2 Illustration of the curvature conditions (6.9) and (6.10). The ha...

Figure 6.3 Examples of the gradient descent iteration with three different s...

Figure 6.4 Newton's method with three different starting points.

Figure 6.5 The relative suboptimality for the PG and the APG methods.

Figure 6.6 Construction of upper and lower bounds on , illustrated for . T...

Figure 6.7 The relative suboptimality for the stochastic gradient method usi...

Figure 6.8 Contour plot of the function , which is convex but nonsmooth. Th...

Figure 6.9 Coordinate descent with exact line search applied to convex quadr...

Figure 6.10 The scaled barrier function is a smooth approximation to the i...

Figure 6.11 The path‐following method converges to an ‐suboptimal point b...

Chapter 7

Figure 7.1 The Brachistochrone problem.

Chapter 8

Figure 8.1 Graph for the shortest path problem in Example 8.2.

Figure 8.2 Graph for the shortest path problem in Example 8.2. The shortest ...

Figure 8.3 The cost for the infinite horizon criterion as a function of the ...

Figure 8.4 The trajectories for the states (top), and the control signals (b...

Chapter 9

Figure 9.1 Graph showing the links between different web pages.

Figure 9.2 Plot of the function .

Figure 9.3 The level curves of the pdf for the Gaussian mixture in Example 9...

Figure 9.4 Figure showing the level curves of the pdf for the Gaussian mixtu...

Figure 9.5 Figure showing samples from the Gaussian mixture in Example 9.4 t...

Figure 9.6 A graph where the subset of nodes in separates the subset of no...

Figure 9.7 Observations and estimated GMM with components based on maximum...

Figure 9.8 Realizations of samples obtained via direct sampling and Gibbs ...

Figure 9.9 Plot of compressed data resulting from PCA. The difference spec...

Figure 9.10 Level curves for the objective function between 2.15 (dark gray)...

Figure 9.11 Plot of resulting clustering. The original data is marked with...

Chapter 10

Figure 10.1 Plot showing the fourth degree polynomial (solid line), and the ...

Figure 10.2 Plot showing the Gaussian regression model (solid line), and the...

Figure 10.3 Plots showing data points from two classes marked with for the...

Figure 10.4 Plot showing the data points from the two classes marked with ...

Figure 10.5 Plot showing the data points from the two classes marked with ...

Figure 10.6 Graph showing how the visible and hidden layers in an RBM are co...

Figure 10.7 A neural network with four hidden layers. The nodes represent th...

Figure 10.8 Figure illustrating the recursive definition of the ANN predicto...

Chapter 12

Figure 12.1 Plots showing the estimated values of and for 100 runs of sy...

Figure 12.2 The left plot shows (solid), and (dashed), as function of ,...

Figure 12.3 Plot showing the Bode‐diagram of the Prefilter.

Figure 12.4 Scatter‐plots showing the estimated values of and . The left ...

Appendix A

Figure A.1 Modeling tools allow the users to specify their problems using a ...

Guide

Cover

Table of Contents

Title Page

Copyright

Dedication

Preface

Acknowledgments

Glossary

Acronyms

About the Companion Website

Begin Reading

Appendix A

References

Index

End User License Agreement


Optimization for Learning and Control

 

Anders Hansson, Linköping University, Linköping, Sweden

Martin Andersen, Technical University of Denmark, Kongens Lyngby, Denmark

 

 

 

 

Copyright © 2023 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging‐in‐Publication Data

Names: Hansson, Anders, author. | Andersen, Martin, author.
Title: Optimization for learning and control / Anders Hansson, Linköping University, Linköping, Sweden, Martin Andersen, Technical University of Denmark.
Description: First edition. | Hoboken, NJ, USA : Wiley, [2023] | Includes index.
Identifiers: LCCN 2023002568 (print) | LCCN 2023002569 (ebook) | ISBN 9781119809135 (cloth) | ISBN 9781119809142 (adobe pdf) | ISBN 9781119809173 (epub)
Subjects: LCSH: System analysis–Mathematics. | Mathematical optimization. | Machine learning–Mathematics. | Signal processing–Mathematics. | MATLAB.
Classification: LCC T57.62 .H43 2023 (print) | LCC T57.62 (ebook) | DDC 004.2/10151–dc23/eng/20230202
LC record available at https://lccn.loc.gov/2023002568
LC ebook record available at https://lccn.loc.gov/2023002569

Cover Design: Wiley
Cover Image: © pluie_r/Shutterstock

 

 

 

To Erik.
To Cassie, Maxwell, and Patrick.

Preface

This is a book about optimization for learning and control. The literature and the techniques for learning are vast, and we will not try to cover all possible learning methods here. Instead, we discuss some of them, especially the ones that result in optimization problems. We also discuss which optimization methods are relevant for these optimization problems. The book is primarily intended for graduate students with a background in science or engineering who want to learn more about which optimization methods are suitable for learning problems. It is also useful for those who want to study optimal control. Very limited background knowledge of optimization, control, or learning is needed. The book is accompanied by a large number of exercises, many of which involve computer tools so that students can obtain hands-on experience.

The topics in learning span a wide range, from classical statistical learning problems like regression and maximum likelihood estimation to more recent problems like deep learning using, e.g. recurrent neural networks. Regarding optimization methods, we cover everything from simple gradient methods to more advanced interior-point methods for conic optimization. A special emphasis is on stochastic methods applicable to the training of neural networks. We also emphasize nondifferentiable problems, for which we discuss subgradient methods and proximal methods. We cover second-order methods, variable-metric methods, and augmented Lagrangian methods. Regarding applications to system identification, we discuss identification both for input–output models and for state-space models. Recurrent neural networks and temporal convolutional networks are naturally introduced as ways of modeling nonlinear dynamical systems. We also cover calculus of variations and dynamic programming in detail, as well as the generalization of the latter to reinforcement learning.

The book can be used to teach several different courses. One could be an introductory course in optimization based on Chapters 4–6. Another could be a course on optimal control covering Chapters 7 and 8, and possibly also Chapter 11. Yet another could be a course on learning covering Chapters 9 and 10, and perhaps Chapter 12. It is of course also possible to combine more chapters; a course that has been taught at Linköping University for PhD students covers all of the material except that on optimal control.

 

Anders Hansson and Martin Andersen

Linköping and Kongens Lyngby
November 2022

Acknowledgments

We would like to thank Andrea Garulli at the University of Siena, who invited Anders Hansson to give a course on optimization for learning in the spring of 2019. The experience from teaching that course provided invaluable inspiration for writing this book. Daniel Cederberg, Markus Fritzsche, and Magnus Malmström are gratefully acknowledged for proofreading some of the chapters.

 

Anders Hansson and Martin Andersen

Glossary

Sets

set of natural numbers

set

set of integers

set

set of nonnegative integer numbers

set of real numbers

set of nonnegative real numbers

set of positive real numbers

set of extended real numbers

set of nonnegative extended real numbers

set of positive extended real numbers

set of complex numbers

set of symmetric real‐valued matrices of order

set of positive semidefinite real‐valued matrices of order

set of positive definite real‐valued matrices of order

quadratic cone of dimension

probability simplex of dimension

empty set

Numbers, Vectors, and Matrices

Euler's number

Archimedes' constant

vector of ones

identity matrix

Elementary Functions

natural exponential function

logarithm function

natural logarithm function

logarithm function, base 2

sine function

cosine function

tangent function

sign function

indicator function for set

Operations on Sets or Functions

affine hull of set

minimum argument

maximum argument

closure of set

convex hull of set

conic hull of set

effective domain

epigraph of function

inf

infimum of set or function

interior of set

max

maximum of set or function

min

minimum of set or function

proximal operator

relative interior of set

sup

supremum of set or function

Operations on Vectors, Vector Spaces or Matrices

adjugate of matrix

block diagonal matrix from matrices

determinant of matrix

diagonal matrix from vector

dimension of vector space or convex set

nullspace of matrix

number of nonzero entries

nullity of matrix

range of matrix

rank of matrix

span of vectors

symmetric vectorization of matrix

trace of matrix

vectorization of matrix

Probability Spaces

expectation functional

normal probability density function

probability measure

variance functional

Symbols

summation

product

integral

contour integral

infinity

belongs to

does not belong to

proper subset of

subset of

proper superset of

superset of

not proper subset of

not subset of

not proper superset of

not superset of

set union

set intersection

set difference

plus

minus

plus or minus

multiplied by

Kronecker product

Hadamard product or composition of functions

is equal to

is less than

is less than or equal to

is greater than

is greater than or equal to

is not equal to

is not less than

is neither less than nor equal to

is not greater than

is neither greater than nor equal to

much smaller than

much greater than

is approximately equal to

asymptotically equivalent to

proportional to

precedes

precedes or equals

succeeds

succeeds or equals

does not precede

neither precedes nor equals

does not succeed

neither succeeds nor equals

there exists

there is no

for all

logical not

logical and

logical or

implies

is implied by

is equivalent to

to or tends toward

corresponds to

tends toward from above

maps to

is perpendicular to

such that or given

:

such that

Acronyms

AdaGrad

adaptive gradient method

Adam

adaptive moment estimation

ADMM

alternating direction method of multipliers

a.e.

almost everywhere

ANN

artificial neural network

ARE

algebraic Riccati equation

ARMAX

auto‐regressive‐moving‐average with exogenous terms

ARX

auto‐regressive with exogenous terms

a.s.

almost surely

BB

Barzilai–Borwein

BFGS

Broyden, Fletcher, Goldfarb, and Shanno

BM

Boltzmann machine

DAE

differential algebraic equation

df

distribution function

DFP

Davidon, Fletcher, and Powell

DC

diagonal/correlated

EM

expectation maximization

EVD

eigenvalue decomposition

FIR

finite impulse response

FLOP

floating point operation

GN

Gauss–Newton

GMM

Gaussian mixture model

i.i.d.

independent, identically distributed

HMM

hidden Markov model

IFT

iterative feedback tuning

ILC

iterative learning control

IP

interior point

IPG

incremental proximal gradient

KKT

Karush–Kuhn–Tucker

LICQ

linear independence constraint qualification

LM

Levenberg–Marquardt

LMI

linear matrix inequality

LP

linear program

LQ

linear quadratic

LS

least‐squares

LSTM

long short‐term memory

MA

moving average

MAP

maximum a posteriori

MDP

Markov decision process

ML

maximum likelihood

MPC

model predictive control

m.s.

mean square

MSE

mean square error

NP

nondeterministic polynomial time

ODE

ordinary differential equation

OE

output error

PCA

principal component analysis

pdf

probability density function

pf

probability function

PI

policy iteration

PMP

Pontryagin maximum principle

QP

quadratic program

RBM

restricted Boltzmann machine

RMSprop

root mean square propagation

RNN

recurrent neural network

ReLU

rectified linear unit

SA

stochastic approximation

SAA

stochastic average approximation

SARSA

state‐action‐reward‐state‐action

SG

stochastic gradient

SGD

stochastic gradient descent

SMW

Sherman–Morrison–Woodbury

SNR

signal‐to‐noise ratio

SR1

symmetric rank‐1

SVD

singular value decomposition

SVM

support vector machine

SVRG

stochastic variance‐reduced gradient

TCN

temporal convolutional network

TPBVP

two‐point boundary value problem

VI

value iteration

w.p. 1

with probability one

About the Companion Website

This book is accompanied by a companion website.

www.wiley.com/go/opt4lc 

This website includes:

• Data files and templates for solving the exercises

• Solutions to the exercises in the book as a PDF document (Instructors only)

• MATLAB solution files (Instructors only)

Part I: Introductory Part

 

1 Introduction

This book will take you on a journey through the fascinating field of optimization, where we explore techniques for designing algorithms that can learn and adapt to complex systems. Whether you are an engineer, a scientist, or simply curious about the world of optimization, this book is for you. We will start with the basics of optimization and gradually build up to advanced techniques for learning and control. By the end of this book, you will have a solid foundation in optimization theory and practical tools to apply to real-world problems. In this opening chapter, we informally introduce problems and concepts and explain their close interplay using simple formulations and examples. Chapters 2–12 explore these topics with more rigor, and we end this chapter with an outline of the remaining content of the book.

1.1 Optimization

Optimization is about choosing a best option from a set of available alternatives based on a specific criterion. This concept applies to a range of disciplines, including computer science, engineering, operations research, and economics, and has a long history of conceptual and methodological development. One of the most common optimization problems is of the form

(1.1)   minimize   f_1(x)^2 + f_2(x)^2 + ⋯ + f_m(x)^2

with variable x ∈ ℝ^n, where f_i: ℝ^n → ℝ. This is called a nonlinear least-squares problem, since we are minimizing the sum of squares of the possibly nonlinear functions f_i. We will see that the least-squares problem and its generalizations have many applications to learning and control. It is also the backbone of several optimization methods for solving more complicated optimization problems. In Chapter 4, we set the foundations for optimization theory; in Chapter 5, we cover different classes of optimization problems that are relevant for learning and control; and in Chapter 6, we discuss different methods for solving optimization problems numerically.
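To make this concrete, here is a small self-contained Python sketch (ours, not from the book; the book's companion material uses MATLAB) that solves one classic instance of such a problem, with residuals f_1(x) = 10(x_2 − x_1^2) and f_2(x) = 1 − x_1 (the Rosenbrock problem), using a basic Gauss–Newton iteration of the kind discussed in Chapter 6:

```python
# Gauss-Newton for a tiny nonlinear least-squares problem of the form (1.1):
# minimize f1(x)^2 + f2(x)^2 with residuals
#   f1(x) = 10*(x2 - x1^2),  f2(x) = 1 - x1   (the Rosenbrock problem).

def residuals(x1, x2):
    return [10.0 * (x2 - x1 * x1), 1.0 - x1]

def jacobian(x1, x2):
    # Row i holds the partial derivatives of f_i with respect to (x1, x2).
    return [[-20.0 * x1, 10.0], [-1.0, 0.0]]

x1, x2 = -1.2, 1.0  # arbitrary starting point
for _ in range(50):
    f = residuals(x1, x2)
    J = jacobian(x1, x2)
    # Form and solve the 2x2 Gauss-Newton system (J^T J) dx = -J^T f by hand.
    a = J[0][0] * J[0][0] + J[1][0] * J[1][0]
    b = J[0][0] * J[0][1] + J[1][0] * J[1][1]
    c = J[0][1] * J[0][1] + J[1][1] * J[1][1]
    g1 = J[0][0] * f[0] + J[1][0] * f[1]
    g2 = J[0][1] * f[0] + J[1][1] * f[1]
    det = a * c - b * b
    dx1 = (-c * g1 + b * g2) / det
    dx2 = (b * g1 - a * g2) / det
    x1, x2 = x1 + dx1, x2 + dx2

print(x1, x2)  # ≈ (1, 1)
```

Because the residuals can be driven exactly to zero here, the iteration converges to the minimizer (1, 1) in a few steps.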

1.2 Unsupervised Learning

Learning is often about finding low-dimensional descriptions of data that provide insight and are easy to understand or interpret. A very simple example is the repeated measurement of the same real-valued quantity x. This could be the length of a piece of wood. Each measurement may be assumed to be somewhat different because of random measurement errors, i.e. we have a sequence of, say, N measurements y_1, …, y_N. Clearly, a reasonable estimate of x would be the average of the measurements, i.e.

x̂ = (y_1 + ⋯ + y_N)/N.

This is a scalar representation of the measurements, and hence of lower dimension, which is a reasonable representation of the length of the piece of wood. Here we have estimated the length by averaging the measurements. The learning we have discussed here is called unsupervised learning, and it is discussed in more detail in Chapter 9.
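As a minimal illustration in Python (the measurement values below are made up for the example), note that the sample mean is also the minimizer of the sum of squared deviations from the data, so even this simplest learning step is an optimization problem:

```python
# Five noisy measurements (in cm, invented for illustration) of the same length.
measurements = [100.2, 99.8, 100.1, 99.9, 100.0]

# The learned "model" is a single scalar: the sample mean, which is also
# the minimizer of the sum of squared deviations from the data.
x_hat = sum(measurements) / len(measurements)
print(x_hat)  # ≈ 100.0
```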

1.3 Supervised Learning

Let us consider another simple problem where we are driving a car with constant speed v. We start at an unknown distance d_0 from a city toward which we are driving, and we receive distance information y_i to the city from road signs, corresponding to having traveled for times t_i, i = 1, …, N. The following relation should hold:

y_i = d_0 − v t_i,   i = 1, …, N,

where the unknowns are collected in x = (d_0, v). We assume that we do not know x, and we would like to learn it. It is enough to have N = 2, i.e. two linear equations, to solve for x, and this solution will be unique, since the two equations are linearly independent unless t_1 = t_2. However, in an application like this, it is unlikely that the above equations hold exactly, due to measurement errors e_i, i.e.

y_i = d_0 − v t_i + e_i,   i = 1, …, N,

is a more appropriate description, and hence, it might be more suitable to have N larger than 2 to average out the effect of the measurement errors. This can be accomplished by solving

minimize   ∑_{i=1}^{N} (y_i − d_0 + v t_i)^2

with variable x = (d_0, v). This is an example of a least-squares problem in which the residuals are affine functions of x. Hence, this is often called a linear least-squares problem. Typically, N is much larger than 2, and therefore, the optimal solution is of lower dimension than the measurements. If we later get a new value t of the travel time, we may predict the value of the corresponding measurement as ŷ = d̂_0 − v̂ t without performing the measurement. For our application, this means that we can estimate the distance to the city by just checking how long we have been traveling; we do not have to wait for a new sign to appear. This is a so-called supervised learning problem, since for each t_i, we know the corresponding y_i. For learning the length of the piece of wood, the data did not come in pairs; instead, we just had one stream of data y_1, …, y_N, and we learned the mean of the data. That is the reason for the name unsupervised learning for such a problem. We will discuss supervised learning in more detail in Chapter 10.
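A Python sketch of this idea (our own numbers, not the book's; we parameterize the model as y ≈ a + b t, so a estimates the initial distance and −b estimates the speed) solves the 2×2 normal equations in closed form:

```python
def fit_line(t, y):
    """Least-squares fit of y ≈ a + b*t via the 2x2 normal equations."""
    n = len(t)
    St, Sy = sum(t), sum(y)
    Stt = sum(ti * ti for ti in t)
    Sty = sum(ti * yi for ti, yi in zip(t, y))
    det = n * Stt - St * St
    a = (Stt * Sy - St * Sty) / det
    b = (n * Sty - St * Sy) / det
    return a, b

# Synthetic data: start 100 km from the city at a speed of 80 km/h,
# with small zero-mean "sign reading" errors added to the distances.
t = [0.1, 0.3, 0.5, 0.8, 1.0]                      # hours traveled
errors = [0.2, -0.1, 0.0, 0.1, -0.2]               # km
y = [100 - 80 * ti + e for ti, e in zip(t, errors)]

a, b = fit_line(t, y)
# a ≈ 100 (initial distance in km), b ≈ -80 (negative of the speed in km/h)
```

With these small zero-mean errors, the estimates land close to the true values a ≈ 100 and b ≈ −80.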

1.4 System Identification

Many physical phenomena can be described with dynamical systems, which in mathematical terms are given by differential equations or difference equations. A simple example is the difference equation

(1.2)   y(t+1) = a y(t) + b u(t),

where denotes time, is a physical phenomenon that we can observe. It is influenced by both its previous value and by a quantity denoted by . Depending on which field is studying dynamical systems, the pairs are called stimuli and response or input and output, respectively. Sometimes the word signal is added at the end for each of the four words. Often, the above equation does not hold exactly due to measurement errors , and hence, it is more appropriate to consider

When we do not know the parameters, we can use supervised learning to learn the values assuming that we are given pairs of data for . The following linear least‐squares problem

with variables will provide an estimate of the parameters. Learning for dynamical systems is called system identification, and it will be discussed in more detail in Chapter 12.
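A minimal sketch of this identification step, assuming a hypothetical first-order model $y_{t+1} = a y_t + b u_t + e_t$ with invented true parameters, is the following; the regressor layout and all values are illustrative only:

```python
import numpy as np

# Simulate a hypothetical first-order system y_{t+1} = a*y_t + b*u_t + e_t.
rng = np.random.default_rng(1)
a_true, b_true = 0.9, 0.5
N = 200
u = rng.normal(size=N)            # input signal
y = np.zeros(N + 1)
for t in range(N):
    y[t + 1] = a_true * y[t] + b_true * u[t] + 0.01 * rng.normal()

# Each row of the regressor matrix is (y_t, u_t); the target is y_{t+1}.
Phi = np.column_stack([y[:-1], u])
theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
a_hat, b_hat = theta
```

Given input-output pairs, the unknown parameters drop out of an ordinary linear least-squares solve.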

1.5 Control

Control is about making dynamical systems behave in a way we find desirable. Let us again consider the dynamical system in (1.2), where we are going to influence the behavior by manipulating the input signal . In the context of control, we also often call it the control signal. We assume that the initial value and the parameters are known, and our objective is to make for small. We can make equal to zero by taking , and then we can make all future values of equal to zero by taking all future values of equal to zero. However, it could be that the value of is large, and in applications, this could be costly. Hence, we are interested in finding a trade‐off between how large the values of are in comparison to the values of . This can be accomplished by solving

(1.3)

with variables . This is an equality constrained linear least‐squares problem. The parameter can be used to trade‐off how small should be as compared to . We will cover control of dynamical systems in Chapter 7 for continuous time, and in Chapter 8 for discrete time.

1.6 Reinforcement Learning

When we discussed control above, we had to know the values of the parameters . Had they not been known, we could have used system identification to estimate them and then used the estimated parameters to solve the control problem. However, sometimes it is desirable to skip the system identification step and do control without knowing the parameter values. One way to accomplish this for the formulation in (1.3) is called reinforcement learning. This will be discussed in more detail in Chapter 11.

1.7 Outline

The book is organized as follows: first, we give a background on linear algebra and probabilities in Chapters 2 and 3. Background on optimization is given in Chapter 4. We will cover both convex and nonconvex optimization. Chapter 5 introduces different classes of optimization problems that we will encounter in the learning chapters later on. In Chapter 6, we discuss different optimization methods that are suitable for solving learning problems. After this, we discuss calculus of variations in Chapter 7 and dynamic programming in Chapter 8. We then cover unsupervised learning in Chapter 9, supervised learning in Chapter 10, and reinforcement learning in Chapter 11. Finally, we discuss system identification in Chapter 12. For information about notation, basic definitions, and software useful for optimization and for the applications we consider, see the Appendix.

2 Linear Algebra

Linear algebra is the study of vector spaces and linear maps on such spaces. It constitutes a fundamental building block in optimization and is used extensively for theoretical analysis and derivations as well as numerical computations.

A procedure for solving systems of simultaneous linear equations already appeared in an ancient Chinese mathematical text. Systems of linear equations were introduced in Europe in the seventeenth century by René Descartes in order to represent lines and planes by linear equations and to compute their intersections. Gauss developed the method of elimination. Further important contributions were made by Gottfried Wilhelm von Leibniz, Gabriel Cramer, Hermann Grassmann, and James Joseph Sylvester, the latter introducing the term “matrix.”

The purpose of this chapter is to review key concepts from linear algebra and calculus in finite‐dimensional vector spaces as well as a number of useful identities that will be used throughout the book. We also discuss some computational aspects, including a number of matrix factorizations and their application to solving systems of linear equations.

2.1 Vectors and Matrices

We start by introducing vectors and matrices. A vector of length is an ordered collection of numbers,

where is the th element or entry of the vector . The ‐dimensional real space, denoted , is the set of real‐valued ‐vectors, i.e. vectors of length whose elements are all real. The product of a scalar and a vector is defined as , the sum of two real‐valued ‐vectors and is the vector , and the Euclidean inner product or dot product of and is the real number

(2.1)

The vectors and are said to be orthogonal if .

A matrix of size ‐by‐, also written as , is an ordered rectangular array that consists of elements arranged in rows and columns, i.e.

where denotes the element of in its th row and th column. The set of ‐by‐ matrices with real‐valued elements is denoted by . The transpose of is the matrix defined as

i.e. the th element of is the th element of .

It is often convenient to think of a vector as a matrix with a single row or column. For example, when interpreted as a matrix of size , the vector is referred to as a row vector, and similarly, when interpreted as a matrix of size , is referred to as a column vector. In this book, we use the convention that all vectors are column vectors. Thus, a vector is always interpreted as the column vector

and hence, is interpreted as the row vector . Similarly, to refer to the columns of a matrix , we will sometimes use the notation

where . When referring to the rows of , we will define

where such that is the matrix with rows . The notation is somewhat ambiguous because may refer to the th element of a vector or the th column of either or , but the meaning will be clear from the context and follows from our convention that vectors are column vectors.

Given two vectors , the inner product can also be expressed as the product

In contrast, the outer product of two vectors and , not necessarily of the same length, is defined as the matrix

The product of a matrix , with columns , and a vector is the linear combination

Equivalently, the th element of the vector is the inner product of and the th row of .

The vector inner and outer products and matrix–vector multiplication are special cases of matrix multiplication. Two matrices and are said to be conformable for multiplication if the number of columns in is equal to the number of rows in . Given two such matrices and , the product is the matrix with elements

Note that is the inner product of the th row of and the th column of . As a result, may be expressed as

where are the rows of and are the columns of . Equivalently, by expressing in terms of its columns and in terms of its rows, may also be written as the sum of outer products

where are the columns of and are the rows of .
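The two views of matrix multiplication described above can be checked numerically; this is a sketch with arbitrary random matrices, not code from the book:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 4))   # columns a_1, ..., a_4
B = rng.normal(size=(4, 5))   # rows b_1^T, ..., b_4^T

# AB as a sum of outer products of A's columns with B's rows.
C_outer = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

# Element (i, j) of AB as the inner product of A's i-th row and B's j-th column.
C_inner = np.array([[A[i, :] @ B[:, j] for j in range(B.shape[1])]
                    for i in range(A.shape[0])])

assert np.allclose(C_outer, A @ B)
assert np.allclose(C_inner, A @ B)
```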

It is important to note that matrix multiplication is associative, but unlike scalar multiplication, it is not commutative. In other words, the associative property holds, provided that , , and are conformable for multiplication, but the identity does not hold in general.

The Frobenius inner product of two matrices is defined as

(2.2)

This is also called the trace inner product of and since

where the trace of a square matrix is defined as

The inner product may also be written as , where maps a matrix to a vector of length by stacking the columns of , i.e.
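The equivalence between the trace form, the elementwise sum, and the vectorized form of the Frobenius inner product can be verified as follows; note that NumPy's Fortran-order flatten plays the role of column stacking here:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))

# Frobenius inner product written as trace(A^T B) ...
ip_trace = np.trace(A.T @ B)

# ... as the elementwise sum ...
ip_sum = np.sum(A * B)

# ... and as vec(A)^T vec(B), with vec stacking the columns.
ip_vec = A.flatten(order="F") @ B.flatten(order="F")

assert np.isclose(ip_trace, ip_sum) and np.isclose(ip_trace, ip_vec)
```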

2.2 Linear Maps and Subspaces

The span of vectors is the set of all linear combinations of these vectors, i.e.

The set of vectors is said to be linearly independent if

and otherwise, the set is said to be linearly dependent. Equivalently, the set is linearly independent if and only if all vectors in have a unique representation as a linear combination of . The span of a set of linearly independent vectors is a ‐dimensional subspace of , and is a basis for the subspace. In other words, the dimension of a subspace is equal to the number of linearly independent vectors that span the subspace. The vectors are said to be orthonormal if they are mutually orthogonal and of unit length, i.e. if and for . The standard basis or natural basis for is the orthonormal basis , where is the unit vector whose th element is equal to 1, and the rest are zero.

The range of a matrix , denoted , is the span of its columns. This is also referred to as the column space of , whereas is referred to as the row space of . The dimension of is called the rank of , denoted . The null space of , denoted , consists of all vectors such that , i.e.

The dimension of is called the nullity of and is denoted . The null space of is said to be trivial if in which case .

2.2.1 Four Fundamental Subspaces

It follows from the definition of the range and nullspace of a matrix that

(2.3)

To see that , note that for every , we have that

or equivalently, if we let , we see that for all . This shows that and , which are both subspaces of , are orthogonal complements. Similarly, for every , it immediately follows that for all , and hence, is the orthogonal complement of .

The result (2.3) can be used to derive the so‐called rank‐nullity theorem which states that

(2.4)

To see this, note that the identity combined with the fact that

implies that . The rank‐nullity theorem follows from the identity

(2.5)

which we will now derive. First, suppose that , and let be a linearly independent set of vectors that span . It follows that the set of vectors is linearly independent since

As a result, we have that . Applying the same inequality to implies that . A direct consequence of this identity is that . We say that has full rank if , and we will use the term full row rank when and the term full column rank when . The four subspaces , , , and and their dimensions are illustrated in Figure 2.1.
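The rank-nullity theorem and the identity rank(A) = rank(A^T) can be checked numerically on a small, deliberately rank-deficient matrix; the matrix below is a hypothetical example whose third row is the sum of the first two:

```python
import numpy as np
from scipy.linalg import null_space

# Rank-deficient 3x3 matrix: row 3 = row 1 + row 2, so the rank is 2.
A = np.array([[1.0, 2.0, 2.0],
              [0.0, 1.0, 0.0],
              [1.0, 3.0, 2.0]])

r = np.linalg.matrix_rank(A)        # dim R(A)
nullity = null_space(A).shape[1]    # dim N(A)
n = A.shape[1]

assert r + nullity == n             # rank-nullity theorem
assert r == np.linalg.matrix_rank(A.T)  # rank(A) = rank(A^T)
```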

Figure 2.1 The four subspaces associated with an matrix with rank .

If two matrices and are conformable for multiplication, then

(2.6)

This follows from the fact that and which implies that and . Furthermore, we have that when has full row rank, in which case .

If and are conformable for addition, then

(2.7)

This means that rank is subadditive, and it follows from the fact that where

is the Minkowski sum of and . Subadditivity implies that a rank matrix can be decomposed as the sum of or more rank 1 matrices. Thus, a rank matrix can be factorized as

where and are the columns of and , respectively, and and .

2.2.2 Square Matrices

A matrix with an equal number of rows and columns is called a square matrix, and the order of such a matrix refers to its size. The square matrix with ones on the main diagonal and zeros elsewhere is an identity matrix and denoted by , i.e. if and otherwise, . Furthermore, a square matrix of order is said to be nonsingular or invertible if it has full rank, and otherwise, it is said to be singular. Equivalently, a square matrix is invertible if there exists a matrix such that . Such a matrix is unique if it exists, and it is called the inverse of and is denoted by . In the special case where , the matrix is said to be orthogonal. The columns of an orthogonal matrix are orthonormal since implies that . More generally, if two matrices and are conformable for multiplication, i.e. not necessarily square matrices, and , then is a right inverse of whereas is a left inverse of .

The determinant of a scalar, viewed as a matrix of order 1, is the scalar itself, whereas the determinant of a square matrix of order may be defined recursively as

(2.8)

where denotes the minor of , which is the determinant of the matrix that is obtained by removing the th row and th column of . This expression is a so‐called Laplace expansion of the determinant along the th column of , and it holds for every . As a special case, the determinant of a 2‐by‐2 matrix may be expressed as

and its absolute value may be interpreted as the area of a parallelogram defined by the columns of , as illustrated in Figure 2.2. More generally, the absolute value of the determinant of an matrix is the volume of the parallelotope defined by the columns of , i.e. the set

As a result, if and only if has full rank.

Figure 2.2 Parallelogram defined by two vectors and in . The area is given by , where is the projection of onto the line with normal .

The term is called the cofactor of . The matrix composed of all the cofactors, i.e. the matrix with elements

is called the cofactor matrix. Expressed in terms of the cofactors, the Laplace expansion (2.8) may be written as the inner product of the th column of and , i.e.

Furthermore, since the Laplace expansion is valid for any , the diagonal elements of the matrix are all equal to . In fact, it can be shown that

and if is nonsingular, this identity implies that

(2.9)

where , the transpose of the cofactor matrix, is known as the adjugate matrix which is denoted by .

Figure 2.3 Affine combinations of two points and .

2.2.3 Affine Sets

Given two points and , the point

is a so‐called affine combination of and . The set of all such affine combinations of and , which may be expressed as

is a line in if . This is illustrated in Figure 2.3. More generally, an affine combination of points is a linear combination

where . A set is an affine set if and only if it contains all affine combinations of its points. Equivalently, is affine if for every pair of points ,

An example of an affine set is the set of solutions to the system of equations with and , i.e.

In fact, every affine subset of may be expressed in this form. The affine hull of a set , which we denote , is the smallest possible affine set that contains .

2.3 Norms

Equipped with the Euclidean inner product (2.1), the set is a Euclidean vector space of dimension with the induced norm

(2.10)

More generally, a norm of a vector , denoted , is a function with the following defining properties:

Subadditive (triangle inequality): $\|x + y\| \leq \|x\| + \|y\|$ for all $x$ and $y$.

Absolutely homogeneous: $\|\alpha x\| = |\alpha| \, \|x\|$ for all scalars $\alpha$.

Positive definite: $\|x\| = 0$ implies $x = 0$.

Nonnegative: $\|x\| \geq 0$ for all $x$.

Such a function is not unique, so to distinguish between different norms, a subscript is typically used to denote specific norms. For example, the 2‐norm or Euclidean norm of a vector , defined in (2.10), is denoted . Other examples are the 1‐norm and the infinity norm, defined as

These are special cases of the more general vector ‐norm

(2.11)

which is defined for .
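The common vector norms, and the orthogonal invariance of the 2-norm, can be illustrated with a short sketch; the rotation angle below is arbitrary:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

norm1 = np.linalg.norm(x, 1)          # 1-norm: |3| + |-4| + |0| = 7
norm2 = np.linalg.norm(x, 2)          # 2-norm: sqrt(9 + 16) = 5
norm_inf = np.linalg.norm(x, np.inf)  # infinity norm: max |x_i| = 4

# The 2-norm is orthogonally invariant: applying an orthogonal Q to x
# leaves the norm unchanged.
t = 0.7
Q = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0,        0.0,       1.0]])
assert np.isclose(np.linalg.norm(Q @ x), norm2)
```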

A norm on is said to be orthogonally invariant if for all and all orthogonal matrices . The 2‐norm is orthogonally invariant, which follows by noting that if is an orthogonal matrix.

The following inequality, known as the Cauchy–Schwarz inequality, is used extensively in linear algebra and optimization:

(2.12)

The inequality can be derived by noting that

where the inequality follows from the nonnegativity property of the norm. An immediate consequence is that for all and , from which the Cauchy–Schwarz inequality (2.12) follows by rearranging the terms and taking the square root.

The Frobenius norm of is a vector norm on induced by the Frobenius inner product, i.e.

(2.13)

This is sometimes referred to as an entrywise norm. In contrast, given both vector norms on and , the operator norm or induced norm on may be defined as

It follows directly from this definition that for all and that . The induced norm may also be expressed as

(2.14)

from which the submultiplicative property follows:

The matrix ‐norm of is the norm induced by the vector ‐norm, and it is denoted . The next example considers the matrix 1‐norm, and the matrix infinity norm is treated in Exercise 2.3.

Example 2.1

The matrix 1‐norm on is the induced norm defined as

where denote the columns of . To verify the third equality, first note that

where are the columns of the identity matrix of order . Moreover, subadditivity, i.e. the triangle inequality, implies that

which shows that .
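The conclusion of Example 2.1, that the induced matrix 1-norm equals the maximum absolute column sum, is easy to verify numerically (together with the row-sum characterization of the infinity norm treated in Exercise 2.3):

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

# Induced 1-norm: maximum absolute column sum.
col_sums = np.abs(A).sum(axis=0)   # [4, 6]
assert np.linalg.norm(A, 1) == col_sums.max() == 6.0

# Induced infinity norm: maximum absolute row sum.
row_sums = np.abs(A).sum(axis=1)   # [3, 7]
assert np.linalg.norm(A, np.inf) == row_sums.max() == 7.0
```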

2.4 Algorithm Complexity

The number of arithmetic operations on floating‐point numbers required by an algorithm is often used as a rough measure of its complexity. One floating‐point operation or “FLOP” refers to a single floating‐point addition, subtraction, multiplication, or division, and an algorithm's FLOP count is the total number of FLOPs it performs. When expressed in terms of the problem size, the FLOP count often provides a useful characterization of the asymptotic growth of the running time as a function of the problem size. However, FLOP counts generally cannot be used to accurately predict computation time on modern computers. Table 2.1 shows the FLOP counts for some basic operations involving vectors and/or matrices. Since we are mostly interested in the asymptotic growth rate as one or several problem dimensions increase, it is customary to use the so‐called “Big O” notation. A function is said to be “big O” of , which is expressed mathematically as , if there exist scalars , , and such that

(2.15)

The function provides an upper bound on the growth rate of as the parameters and/or increase. For example, the FLOP count for the matrix–vector product is since for all , . Similarly, the function satisfies since for , .

Table 2.1 FLOP counts for basic matrix and vector operations: denotes a scalar, and are ‐vectors, and and are matrices.

Operation: FLOPs

Scaling $\alpha x$: $n$

Vector addition/subtraction $x + y$: $n$

Inner product $x^{T} y$: $2n - 1$

Matrix–vector multiplication $Ax$: $m(2n - 1)$

Matrix–matrix multiplication $AB$: $mp(2n - 1)$

Example 2.2

Recall that matrix multiplication is associative, i.e. given three matrices , , and that are conformable for multiplication, we have that . This implies that the product of three or more matrices can be evaluated in several ways, each of which may have a different FLOP count. For example, the product with , , and may be evaluated left‐to‐right by first computing and then , or right‐to‐left by computing and then . The first approach requires FLOPs, whereas the second approach requires FLOPs. A special case is the product with , , and which requires FLOPs when evaluated left‐to‐right, but only FLOPs when evaluated right‐to‐left. More generally, the product of matrices requires matrix–matrix multiplications which can be carried out in any order since matrix multiplication is associative. The problem of finding the order that yields the lowest FLOP count is a combinatorial optimization problem, which is known as the matrix chain ordering problem, see Exercise 5.7. This problem can be solved by means of dynamic programming, which we introduce in Section 5.5 and discuss in more detail in Chapter 8.
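The FLOP comparison in Example 2.2 can be sketched with a small helper; the sizes below are hypothetical and chosen to make the gap between the two evaluation orders dramatic:

```python
def matmul_flops(m, n, p):
    # Approximate FLOP count 2*m*n*p for an (m x n) times (n x p) product.
    return 2 * m * n * p

# Hypothetical chain: A is 100x2, B is 2x100, C is 100x1.
m, n, p, q = 100, 2, 100, 1

left_to_right = matmul_flops(m, n, p) + matmul_flops(m, p, q)  # (AB)C
right_to_left = matmul_flops(n, p, q) + matmul_flops(m, n, q)  # A(BC)

assert left_to_right == 60000   # 40000 + 20000
assert right_to_left == 800     # 400 + 400
```

Evaluating right-to-left is cheaper here by a factor of 75, even though both orders produce exactly the same product.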

2.5 Matrices with Structure

Matrices are sometimes classified according to their mathematical properties and structure, which are often of interest from a computational point of view. This section provides a brief review of some of the most frequently encountered kinds of structure in this book.

2.5.1 Diagonal Matrices

A diagonal matrix is a square matrix in which all off‐diagonal entries are equal to zero. Such a matrix may be stored as a vector of length rather than as a general square matrix with entries, which amounts to significant saving in terms of storage when is large. We use the notation

when referring to the diagonal matrix with the elements of on its diagonal. Diagonal matrices are also attractive from a computational point of view. For example, a matrix–vector product of the form involves only FLOPs, whereas a matrix–vector product with a general square matrix requires FLOPs. Similarly, the inverse of a nonsingular diagonal matrix is also diagonal, i.e. .

2.5.2 Orthogonal Matrices

Recall that a square matrix of order is said to be orthogonal if , or equivalently, . The product of two orthogonal matrices and of order is itself orthogonal, which follows by noting that .

A basic example of an orthogonal matrix of order 2 is a rotation matrix of the form

(2.16)

The action of such a matrix corresponds to the counterclockwise rotation by an angle about the origin in . More generally, a Givens rotation in is a rotation in a two‐dimensional plane defined by two coordinate axes, say, and . Such a transformation can be represented by an orthogonal matrix of order

(2.17)

where and such that the principal submatrix defined by and is of the form (2.16). Givens rotations are used frequently in linear algebra as a means to introduce zeros in a vector or a matrix. For example, a nonzero vector can be transformed into the vector with by means of a rotation. Indeed, it is easy to verify that

if we let and .
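A sketch of using a Givens rotation to zero the second entry of a vector; the sign convention for c and s below is one common choice and may differ from the book's:

```python
import numpy as np

# Rotate x = (x1, x2) onto the first axis: choose c = x1/r, s = -x2/r
# with r = sqrt(x1^2 + x2^2), so that G @ x = (r, 0).
x = np.array([3.0, 4.0])
r = np.hypot(x[0], x[1])
c, s = x[0] / r, -x[1] / r
G = np.array([[c, -s],
              [s,  c]])

y = G @ x
assert np.allclose(y, [5.0, 0.0])        # (r, 0) with r = 5
assert np.allclose(G @ G.T, np.eye(2))   # G is orthogonal
```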

A Householder transformation or Householder reflection in is a linear transformation that corresponds to the reflection about a hyperplane that contains the origin. Such a transformation can be represented in terms of an orthogonal matrix of the form

(2.18)

where is normal to the reflection plane. This is illustrated in Figure 2.4. Matrices of the form (2.18) are referred to as Householder matrices, and they are typically stored implicitly. In other words, only the vector , which uniquely determines , is stored rather than storing directly. This requires storage rather than . Moreover, matrix–vector products with a Householder matrix can be computed in FLOPs by noting that the operation amounts to a vector operation, i.e. with . A Householder transformation may be used to simultaneously introduce up to zeros in a vector of length , see Exercise 2.4.
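The implicit-storage idea can be sketched as follows: the Householder matrix is never formed, and applying it costs O(n) FLOPs. The choice of reflection vector below, which maps x onto a multiple of the first unit vector, is one standard construction:

```python
import numpy as np

def householder_apply(v, x):
    # Apply H = I - 2 v v^T / (v^T v) to x in O(n) FLOPs without forming H.
    return x - (2.0 * (v @ x) / (v @ v)) * v

# Reflect x onto ||x|| * e_1 using v = x - ||x|| e_1 (a common choice).
x = np.array([3.0, 4.0, 0.0])
e1 = np.array([1.0, 0.0, 0.0])
v = x - np.linalg.norm(x) * e1

y = householder_apply(v, x)
assert np.allclose(y, [5.0, 0.0, 0.0])   # two entries zeroed at once

# Forming H explicitly confirms that it is orthogonal (and symmetric).
H = np.eye(3) - 2.0 * np.outer(v, v) / (v @ v)
assert np.allclose(H @ H.T, np.eye(3))
```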

Figure 2.4 Householder reflection in .

Yet another example of a family of orthogonal matrices are permutation matrices. A permutation of the elements of a vector is a reordering of the elements. This can be expressed as a map of the form

(2.19)

where the function defines the permutation and satisfies . This map can be expressed as a matrix–vector product , where is the permutation matrix defined as

(2.20)

It is easy to verify that since whenever , which implies that is orthogonal. It follows that the map (2.19) is invertible, and is the permutation matrix that corresponds to the inverse permutation . The special case where corresponds to reversing the order of elements and leads to the permutation matrix

which is the so‐called exchange matrix of order .

2.5.3 Triangular Matrices

A square matrix in which the elements above or below the main diagonal are zero is called a triangular matrix. A matrix is said to be lower triangular if , whereas is said to be upper triangular if . Furthermore, a triangular matrix with ones on its diagonal is said to be unit triangular. The determinant of a triangular matrix of order may be expressed as

which follows from the Laplace expansion (2.8). A direct consequence is that is nonsingular if and only if all of the diagonal entries of are nonzero. The inverse of a nonsingular lower (upper) triangular matrix is another lower (upper) triangular matrix. This follows from the identity (2.9) by noting that the cofactor matrix associated with is itself triangular. To see this, first recall that the minor of is the determinant of the matrix obtained by deleting the th row and the th column of . This implies that if, say, is lower triangular and , then deleting the th row and the th column of yields a triangular matrix of order . This matrix has at least one diagonal entry equal to zero, and hence, it is singular, which implies that the minor of is zero. We note that the product of two lower (upper) triangular matrices is another lower (upper) triangular matrix.

Given a nonsingular triangular matrix , it is possible to compute matrix–vector products of the form and without computing first. For example, if is lower triangular, then can be computed by means of forward substitution, i.e.

This requires approximately FLOPs. Similarly, if is an upper triangular matrix, then can be computed using backward substitution, i.e.

which also requires approximately FLOPs.
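Forward substitution can be sketched directly from its definition; the example system below is made up for illustration:

```python
import numpy as np

def forward_substitution(L, b):
    # Solve L y = b for a nonsingular lower triangular L in ~n^2 FLOPs:
    # each y_i uses the previously computed y_0, ..., y_{i-1}.
    n = L.shape[0]
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

L = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [4.0, 5.0, 6.0]])
b = np.array([2.0, 5.0, 32.0])

y = forward_substitution(L, b)
assert np.allclose(L @ y, b)
assert np.allclose(y, np.linalg.solve(L, b))
```

Backward substitution for an upper triangular system is the same idea run from the last row upward.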

2.5.4 Symmetric and Skew‐Symmetric Matrices

A square matrix is said to be symmetric if , and

(2.21)

is the set of symmetric matrices of order . Similarly, a square matrix is skew‐symmetric if , which implies that the diagonal elements of a skew‐symmetric matrix are equal to zero. A square matrix can always be decomposed into the sum of a symmetric and a skew‐symmetric matrix, i.e. can be expressed as , where is symmetric and is skew‐symmetric. We note that this is an orthogonal decomposition of since the Frobenius inner product of a symmetric and a skew‐symmetric matrix is always zero.

Equipped with the Frobenius inner product, which is defined in (2.2), the set is a Euclidean vector space of dimension . It is sometimes convenient to work with a vector representation of a symmetric matrix, and to this end, we introduce the bijective linear map defined as

(2.22)

i.e. stacks the lower triangular part of the columns of and scales the off‐diagonal entries by . It is straightforward to verify that this definition implies that for all ,

which means that the transformation preserves inner products.

2.5.5 Toeplitz and Hankel Matrices

A Toeplitz matrix of order n is a square matrix of the form

(2.23)

where the first row and column of define the subdiagonal, diagonal, and superdiagonal elements. Such a matrix is sometimes referred to as a diagonal‐constant matrix, and it is uniquely determined by numbers in contrast to a general matrix, which has degrees of freedom. A special case of (2.23) is the so‐called lower‐shift matrix of order , which has ones on the first subdiagonal and zeros elsewhere, i.e.

(2.24)

Similarly, is sometimes referred to as the upper‐shift matrix of order . It is easy to verify that with is the matrix with ones on the th subdiagonal and zeros elsewhere. Using the convention that , we may express any Toeplitz matrix of the form (2.23) as

(2.25)

A lower triangular Toeplitz matrix whose first column is given by can be expressed as

(2.26)

It then follows from the fact that that is nonsingular if and only if . Moreover, a matrix of the form (2.26) commutes with , which follows from the fact that commutes with , i.e. . It can also be shown that the inverse of a nonsingular lower triangular Toeplitz matrix is another lower triangular Toeplitz matrix, see Exercise 2.6.

Toeplitz matrices are closely related to another class of matrices called Hankel matrices. Like a Toeplitz matrix, a Hankel matrix is a square matrix that is uniquely determined by its first row and column. However, unlike a Toeplitz matrix, a Hankel matrix has constant entries along the anti‐diagonal, the anti‐subdiagonals, and anti‐superdiagonals. As a consequence, the exchange matrix maps a Toeplitz matrix to a Hankel matrix and vice versa, i.e. if is a Toeplitz matrix, then and are both Hankel matrices. We note that Hankel matrices are symmetric, whereas Toeplitz matrices are persymmetric, i.e. symmetric about the anti‐diagonal. Finally, we note that the notion of Toeplitz (Hankel) matrices can be extended to include nonsquare matrices with diagonal‐constant (anti‐diagonal‐constant) structure, and such matrices may be viewed as submatrices of square Toeplitz (Hankel) matrices.
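The Toeplitz-Hankel relationship can be sketched with SciPy's constructors; the entries below are arbitrary:

```python
import numpy as np
from scipy.linalg import toeplitz

# A Toeplitz matrix is determined by its first column and first row.
c = [1.0, 2.0, 3.0]   # first column
r = [1.0, 4.0, 5.0]   # first row (its first entry must agree with c[0])
T = toeplitz(c, r)

# Entries are constant along each diagonal: T[i, j] depends only on i - j.
assert T[0, 0] == T[1, 1] == T[2, 2] == 1.0
assert T[1, 0] == T[2, 1] == 2.0

# The exchange matrix J (reversed identity) maps Toeplitz to Hankel:
# J @ T is constant along its anti-diagonals, hence symmetric.
J = np.eye(3)[::-1]
H = J @ T
assert np.allclose(H, H.T)
```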

2.5.6 Sparse Matrices

A matrix is said to be sparse or sparsely populated if a large number of its entries are equal to zero. In contrast, a matrix with few or no zero entries is said to be dense. We will use to denote the number of nonzero entries in a matrix . Sparse matrices can be stored efficiently by storing only the nonzero entries. For example, storing a sparse matrix as a list of triplets of the form requires rather than storage. Similarly, a matrix–vector product requires only FLOPs if sparsity is exploited as opposed to FLOPs if is a general dense matrix.
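A sketch of the storage and FLOP savings using SciPy's sparse formats; the tridiagonal matrix below is a hypothetical example:

```python
import numpy as np
from scipy import sparse

# A 1000x1000 tridiagonal matrix stored sparsely: only ~3n nonzeros are
# kept, rather than n^2 entries for a dense representation.
n = 1000
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sparse.diags([off, main, off], offsets=[-1, 0, 1], format="csr")

x = np.ones(n)
y = A @ x   # costs O(nnz(A)) FLOPs, not O(n^2)

assert A.nnz == 3 * n - 2
assert np.isclose(y[0], 1.0)   # boundary row: 2 - 1
assert np.isclose(y[1], 0.0)   # interior row: -1 + 2 - 1
```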

2.5.7 Band Matrices

A sparse matrix is said to be banded if all its nonzero entries are located within a band about the main diagonal. The bandwidth of is the smallest integer such that if . In other words, all entries below the th subdiagonal or above the th superdiagonal are equal to zero. The lower bandwidth of is a nonnegative integer such that if , and similarly, the upper bandwidth of is a nonnegative integer such that if . A bandwidth of 0 corresponds to a diagonal matrix, whereas a bandwidth of 1 corresponds to a tridiagonal matrix or, if or , an upper or a lower bidiagonal matrix.

2.6 Quadratic Forms and Definiteness

A real ‐ary quadratic form is a homogeneous quadratic polynomial in variables, or equivalently, a function from to of the form

for some . It is straightforward to verify that

which implies that only the symmetric part of contributes to the value of . We therefore limit our attention to the case where .
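The claim that only the symmetric part contributes to a quadratic form can be checked directly, using the symmetric/skew-symmetric decomposition from Section 2.5.4; the matrix and vector below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))   # generally nonsymmetric
x = rng.normal(size=4)

# Decompose A into symmetric and skew-symmetric parts.
A_sym = 0.5 * (A + A.T)
A_skew = 0.5 * (A - A.T)

# x^T A x = x^T A_sym x, since x^T S x = 0 for any skew-symmetric S.
assert np.isclose(x @ A @ x, x @ A_sym @ x)
assert np.isclose(x @ A_skew @ x, 0.0)
```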

A matrix is positive semidefinite if and only if for all , and it is positive definite if and only if for all . Similarly, is negative (semi)definite if is positive (semi)definite, and it is indefinite