Face Analysis Under Uncontrolled Conditions

Romain Belmonte
Description

Face analysis is essential for a large number of applications such as human-computer interaction or multimedia (e.g. content indexing and retrieval). Although many approaches are under investigation, performance under uncontrolled conditions is still not satisfactory. The variations that impact facial appearance (e.g. pose, expression, illumination, occlusion, motion blur) make it a difficult problem to solve. This book describes the progress towards this goal, from a core building block - landmark detection - to the higher level of micro and macro expression recognition. Specifically, the book addresses the modeling of temporal information to coincide with the dynamic nature of the face. It also includes a benchmark of recent solutions along with details about the acquisition of a dataset for such tasks.




Contents

Cover

Title Page

Copyright

Preface

Part 1. Facial Landmark Detection

Introduction to Part 1

Chapter 1. Facial Landmark Detection

1.1. Facial landmark detection in still images

1.2. Extending facial landmark detection to videos

1.3. Discussion

1.4. References

Chapter 2. Effectiveness of Facial Landmark Detection

2.1. Overview

2.2. Datasets and evaluation metrics

2.3. Image and video benchmarks

2.4. Cross-dataset benchmark

2.5. Discussion

2.6. References

Chapter 3. Facial Landmark Detection with Spatio-temporal Modeling

3.1. Overview

3.2. Spatio-temporal modeling review

3.3. Architecture design

3.4. Experiments

3.5. Design investigations

3.6. Discussion

3.7. References

Conclusion to Part 1

Part 2. Facial Expression Analysis

Introduction to Part 2

Chapter 4. Extraction of Facial Features

4.1. Introduction

4.2. Face detection

4.3. Face normalization

4.4. Extraction of visual features

4.5. Learning methods

4.6. Conclusion

4.7. References

Chapter 5. Facial Expression Modeling

5.1. Introduction

5.2. Modeling of the affective state

5.3. The challenges of facial expression recognition

5.4. The learning databases

5.5. Invariance to facial expression intensities

5.6. Invariance to facial movements

5.7. Conclusion

5.8. References

Chapter 6. Facial Motion Characteristics

6.1. Introduction

6.2. Characteristics of the facial movement

6.3. LMP

6.4. Conclusion

6.5. References

Chapter 7. Micro- and Macro-Expression Analysis

7.1. Introduction

7.2. Definition of a facial segmentation model

7.3. Feature vector construction

7.4. Recognition process

7.5. Evaluation on micro- and macro-expressions

7.6. Same expression with different intensities

7.7. Conclusion

7.8. References

Chapter 8. Towards Adaptation to Head Pose Variations

8.1. Introduction

8.2. Learning database challenges

8.3. Innovative acquisition system (SNaP-2DFe)

8.4. Evaluation of face normalization methods

8.5. Conclusion

8.6. References

Conclusion to Part 2

List of Authors

Index

List of Tables

Chapter 2

Table 2.1. Datasets captured under unconstrained conditions. The 68-landmark scheme is the most widely used. There...

Table 2.2. Performances of recent image-based methods on 300W. NRMSE is reported for categories A and B. NRMSE/AUC...

Table 2.3. Performances of recent image-based and video-based methods on 300VW. NRMSE is reported for categories 1,...

Table 2.4. The different datasets used to train the selected approaches. Most approaches are trained mainly on 300W,...

Table 2.5. AUC/ FR by head pose variation and facial expression on SNaP-2DFe. AUC is lower and FR is higher in the...

Table 2.6. AUC/ FR by activation pattern on SNaP-2DFe in the presence of facial expressions and head pose variati...

Table 2.7. Δ AUC/Δ FR between a static neutral face and the different head pose variations on SNaP-2DFe....

Table 2.8. Δ AUC/Δ FR between a static neutral face and the different facial expressions on SNaP-2DFe....

Chapter 3

Table 3.1. Comparison of the different architectures presented in section 3.3.1 on SNaP-2DFe (subjects 9 to 12)....

Table 3.2. Comparison of the different architectures presented in section 3.3.1 on SNaP-2DFe (subjects 9 to 12;...

Table 3.3. Comparison of the different architectures presented in section 3.3.1 on SNaP-2DFe (subjects 9 to 12;...

Table 3.4. Comparison of the different architectures presented in section 3.3.1 on the three categories of the...

Table 3.5. Comparison of the different architectures presented in section 3.3.2 on the three categories of the...

Table 3.6. Comparison of the best proposed models, coord-3DRNN and heat-3DFCRNN, with existing models on the...

Table 3.7. Comparison of the proposed architectures regarding their numbers of parameters, model sizes and sp...

Table 3.8. Comparison of the different architectures presented in section 3.5.1 on the three categories of the...

Table 3.9. Comparison of the different architectures presented in section 3.5.2 on the three categories of the...

Table 3.10. Comparison of the different architectures presented in section 3.5.3 on the three categories of the...

Table 3.11. Comparison of the different architectures presented in sections 3.5.1–3.5.3 on the full 300VW...

Chapter 5

Table 5.1. Learning databases for facial expression characterization

Table 5.2. Synthesis of recent macro- and micro-expression characterization systems (* augmented data/deep learning)

Chapter 7

Table 7.1. Performance comparison on CASME II (* deep learning)

Table 7.2. Performance comparison on SMIC (* deep learning)

Table 7.3. Performance comparison on CK+ (* deep learning)

Table 7.4. Performance comparison on Oulu-CASIA (* deep learning)

Table 7.5. Performance comparison on MMI (* deep learning)

Table 7.6. Synthesis of the performances on the different micro- and macro-expression learning databases (* deep learning)

Table 7.7. Facial expression recognition rate under different intensity levels

Chapter 8

Table 8.1. SSIM similarity index applied to different normalization methods in the presence of pose variations and large displacements

Table 8.2. Recognition rate of facial expressions extracted from several descriptors according to normalization methods

Table 8.3. Facial expression recognition rate by head movement class


SCIENCES

Image, Field Director – Laure Blanc-Féraud

Information Seeking in Images and Videos, Subject Head – Hichem Sahbi

Face Analysis Under Uncontrolled Conditions

From Face Detection to Expression Recognition

Coordinated by Romain Belmonte and Benjamin Allaert

First published 2022 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd

27-37 St George’s Road

London SW19 4EU

UK

www.iste.co.uk

John Wiley & Sons, Inc.

111 River Street

Hoboken, NJ 07030

USA

www.wiley.com

© ISTE Ltd 2022

The rights of Romain Belmonte and Benjamin Allaert to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s), contributor(s) or editor(s) and do not necessarily reflect the views of ISTE Group.

Library of Congress Control Number: 2022941463

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISBN 978-1-78945-111-5

ERC code:

PE6 Computer Science and Informatics

PE6_9 Human computer interaction and interface, visualisation and natural language processing

PE6_10 Web and information systems, database systems, information retrieval and digital libraries, data fusion

PE6_11 Machine learning, statistical data processing and applications using signal processing (e.g. speech, image, video)

Preface

Romain BELMONTE1 and Benjamin ALLAERT2

1University of Lille, France

2IMT Nord Europe, Lille, France

Face analysis is essential for a large number of applications such as human–computer interaction or multimedia (e.g. content indexing and retrieval). Although many approaches have been proposed, performance under uncontrolled conditions is still not satisfactory. The variations that may impact facial appearance (e.g. pose, expression, illumination, occlusion, motion blur) make it a difficult problem to solve.

This book is composed of two parts based on the recent PhD work of Belmonte in Part 1 (P1) and Allaert in Part 2 (P2). The focus is on an updated review of the literature. Some experiments and benchmarks are also included.

Part 1 focuses on one of the core building blocks of facial analysis: facial landmark detection. After an introduction that provides the background on the problem, Chapter 1 provides an overview of the literature (methods, data, metrics); supervised 2D detection in images and image sequences is mainly studied. Chapter 2 presents an analysis of the performance of current approaches and highlights the most difficult poses and expressions to handle. It also illustrates the importance of suitable temporal modeling to benefit from the dynamic nature of the face. Finally, Chapter 3 addresses the modeling of temporal information. A benchmark of various architectures is proposed to determine the best design for video-based facial landmark detection. It also helps us to understand the complementarity between spatial and temporal information, as well as between local and global motion. The conclusion discusses possible research perspectives.

Part 2 focuses on a higher-level task within facial analysis: expression recognition. After an introduction explaining the problem and its challenges, Chapter 4 describes how a face can be characterized according to the literature. Chapter 5 explains how to model expressions, considering both micro- and macro-expressions. Chapter 6 presents a way to characterize facial motion that takes into account the mechanical deformation properties of facial skin, and a detailed evaluation is provided in Chapter 7. Finally, Chapter 8 covers the acquisition of a dataset, called SNaP-2DFe, for analyzing facial expressions in the presence of head pose variations. Research perspectives are discussed in the conclusion.

Although landmark detection is used for expression recognition, the two parts of this book are independent and can be read in either order. Each chapter is also relatively self-contained. To go further, we invite interested readers to look at other facial analysis tasks, to move towards more practical content, to implement the algorithms mentioned in this book and, why not, to transpose them into useful applications.

We would like to thank Hichem Sahbi for inviting us to contribute to the SCIENCES project. We would like to thank our PhD supervisor Chaabane Djeraba, who trusted us and always remained very available. We also thank our co-supervisors Ioan Marius Bilasco (P1 and P2), Pierre Tirilly (P1) and Nacim Ihaddadene (P1). We are sincerely grateful for the time they spent with us, their patience and the extensive advice they gave us. We would also like to thank our PhD reviewers (P1: Jean-Luc Dugelay and Hichem Sahbi; P2: Monique Noirhomme-Fraiture and Jenny Benois-Pineau) and examiners (P1: Karine Zeitouni and Nicu Sebe; P2: Moncef Gabbouj), who agreed to assess this work and greatly contributed to its improvement before its publication as a book. Many thanks to all our colleagues at the University of Lille and ISEN-Lille, who have always been very kind and supportive. Finally, we would like to thank our family and friends for their constant support.

June 2022

PART 1Facial Landmark Detection

1Facial Landmark Detection

Romain BELMONTE1, Pierre TIRILLY1, Ioan Marius BILASCO1, Nacim IHADDADENE2 and Chaabane DJERABA1

1University of Lille, France

2Junia ISEN, Lille, France

As stated by Jin and Tan (2017), the face is a deformable object that can vary in terms of shape and appearance. The first attempts at facial landmark detection can be traced back to the 1990s. Active shape models (ASMs) (Cootes et al. 1995) represent one of the seminal works on the subject. Such a generative approach consists of a parametric model that can be fitted to a given face by optimizing its parameters. This process of applying the model to new data is called inference. Rapid progress has been made through the development of active appearance models (AAMs) (Cootes et al. 2001), constrained local models (CLMs) (Cristinacce and Cootes 2006) and other extensions (Baltrusaitis et al. 2013; Belhumeur et al. 2013; Antonakos et al. 2015; Tzimiropoulos 2015), to the point where the problem is now considered well addressed for constrained faces (Jin and Tan 2017). Research has therefore shifted to unconstrained faces, with multiple and complex challenges, for example, occlusion and variations in pose, illumination and expression. Faster and more robust discriminative methods such as cascaded shape regression (CSR) (Dollár et al. 2010) have been proposed to address these challenges. They differ from generative approaches in that they directly learn a mapping function between images and facial shapes, with better generalization ability. Driven in particular by the massive growth of available data and the increase in the computational capacity of machines, a subset of these approaches, deep learning (DL), has stood out. These approaches need larger datasets than traditional machine learning algorithms but do not require feature engineering. Above all, they currently outperform all previous approaches, and most research now focuses on them.

Consequently, two main categories of approaches can be defined, generative and discriminative, not to mention the complementary solutions that can be applied to most of these approaches, for example, multi-task learning, to explicitly address selected challenges. The proposed nomenclature is close to those of Jin and Tan (2017) and Wu and Ji (2018). Considering its wide range of applications and the persistent difficulties encountered under uncontrolled conditions, facial landmark localization remains a very active area of research. Recently, there has been a trend towards video-based solutions (Shen et al. 2015). One of the main reasons for adding this dimension to the problem is that temporal consistency provides a useful cue for achieving robust detection under uncontrolled conditions. Different strategies have been proposed, from the most traditional tracking strategies, such as tracking by detection, to more complex strategies that jointly extract features and perform tracking. Current work is mostly tracking-oriented and focuses on global head movements.

All these items are discussed in detail in the following sections. In section 1.1, the groundbreaking work and advances in the two main categories of approaches are reviewed, with an emphasis on DL approaches. This allows us to properly describe, in section 1.1.4, the complementary solutions that can be used to handle major challenges. In contrast to existing literature reviews (Jin and Tan 2017; Wu and Ji 2018), the different strategies proposed in the literature to extend facial landmark localization to video are extensively discussed in section 1.2. Finally, section 1.3 concludes with a positioning of the work presented in this part of the book.

1.1. Facial landmark detection in still images

Efforts to tackle the problem of facial landmark detection have long focused on still images, and much work has been published. In sections 1.1.1 and 1.1.2, the two main categories of approaches, generative and discriminative, are detailed along with their developments. Among the discriminative approaches, DL approaches have become very popular. Therefore, the latter is given close attention in section 1.1.3. Finally, complementary solutions to handle the difficulties encountered under uncontrolled conditions are reviewed in section 1.1.4.

1.1.1. Generative approaches

Generative methods typically build a statistical model of both shape and appearance. They are referred to as generative approaches since they model the joint probability distribution of shape and appearance. These models are parametric and therefore have degrees of freedom that are constrained during training, so they are expected to match only plausible facial configurations. By optimizing the model parameters, for example, by minimizing the reconstruction error for a given image, the best possible instance of the model for that image can be generated. Initially, only shape variations were modeled (Cootes et al. 1995). Following the same line of thought, texture variations were later added to jointly constrain variations in shape and texture (Cootes et al. 2001). Facial appearance can be represented in different ways. The entire face can be considered, which constitutes a holistic representation. The face can also be decomposed into different parts, for example, patches centered at each landmark, which is known as a part-based representation. Generative approaches provide good fitting accuracy with little training data, but are sensitive to shape initialization and do not generalize well to new images. The optimization can be difficult under uncontrolled conditions due to the high dimensionality of the appearance space, and these methods tend to get trapped in local minima. Note also that they are mainly based on principal component analysis (PCA) (Pearson 1901), so the distribution is assumed to be Gaussian, which may not fit the true distribution.

The AAMs (Cootes et al. 2001) are probably the most representative and most widely studied method in this category. In the following, AAM modeling and fitting are presented, as well as some of their improvements. The modeling can be split into three parts: a shape model, an appearance model and a motion model. The shape model, also called the point distribution model (PDM) (Cootes et al. 1995), is common among deformable models (Cootes et al. 1995, 2001; Cristinacce and Cootes 2006; Baltrusaitis et al. 2013; Tzimiropoulos 2015). It is built from the N training facial shapes. These shapes are first normalized (aligned) using generalized Procrustes analysis (GPA) (Gower 1975) to remove affine transformations (i.e. rotation, scale and translation variations) and keep only local, non-rigid shape deformations. Next, PCA is applied to obtain a set of orthogonal bases that capture the maximal variance. Only the n eigenvectors corresponding to the largest eigenvalues are kept, as they summarize the data well. This reduces the dimensionality while ensuring that the key information is maintained. The model can then be expressed as:

$s = \bar{s} + U_s p$ [1.1]

where $\bar{s}$, $U_s$ and $p$ are, respectively, the mean shape, the shape eigenvectors and the vector of shape parameters. Four eigenvectors corresponding to the similarity transforms (scaling, in-plane rotation and translation) are added to $U_s$ (with re-orthonormalization) to be able to fit the model to any image (Matthews and Baker 2004).
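As an illustration, here is a minimal sketch of PDM construction with NumPy, assuming the training shapes have already been aligned with GPA; the `aligned_shapes` array, the variance threshold and the helper names are illustrative choices, not from the original text:

```python
import numpy as np

def build_pdm(aligned_shapes: np.ndarray, variance_kept: float = 0.98):
    """Builds a point distribution model from N aligned training shapes,
    each flattened to a 2L-dimensional vector (GPA alignment assumed done)."""
    mean_shape = aligned_shapes.mean(axis=0)
    centered = aligned_shapes - mean_shape
    # PCA via SVD; the rows of vt are the orthonormal shape bases.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = singular_values ** 2 / (len(aligned_shapes) - 1)
    # Keep the n leading eigenvectors that explain the desired variance.
    explained = np.cumsum(eigenvalues) / eigenvalues.sum()
    n = int(np.searchsorted(explained, variance_kept)) + 1
    U_s = vt[:n].T  # (2L, n) matrix of shape eigenvectors
    return mean_shape, U_s

def generate_shape(mean_shape, U_s, p):
    """Instantiates a shape from the parameters p, as in equation [1.1]."""
    return mean_shape + U_s @ p
```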

To build the appearance model, features are first extracted from the N training images using a function F. Initially, a holistic pixel-based representation was used, which is sensitive to lighting and occlusions (Cootes et al. 2001). To improve robustness, feature-based representations such as histograms of oriented gradients (HOG) (Dalal and Triggs 2005) or the scale-invariant feature transform (SIFT) (Lowe 2004) can also be used. In this way, relevant facial features are extracted, with a better ability to generalize to new faces. These features are then warped into the reference shape using the motion model. PCA is finally applied to these vectorized, warped feature images. The model can be expressed as:

$a = \bar{a} + U_a q$ [1.2]

where $\bar{a}$, $U_a$ and $q$ are, respectively, the mean appearance vector, the appearance eigenvectors and the vector of appearance parameters.
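The appearance side can be sketched in the same way, assuming scikit-image's `hog` as the feature function F and reusing `build_pdm` from above for the PCA step; `warp_to_mean_shape` is a hypothetical helper standing in for the motion model introduced below:

```python
import numpy as np
from skimage.feature import hog

def build_appearance_model(images, shapes, mean_shape, variance_kept=0.98):
    """Builds the appearance model: warp each face to the mean shape via the
    motion model, extract HOG features (the function F), then apply PCA
    exactly as in build_pdm above."""
    features = []
    for img, shape in zip(images, shapes):
        # Hypothetical helper implementing the motion model W.
        ref = warp_to_mean_shape(img, shape, mean_shape)
        features.append(hog(ref, orientations=9, pixels_per_cell=(8, 8)))
    return build_pdm(np.stack(features), variance_kept)
```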

The motion model typically refers to a warp function W, such as a piecewise affine warp or a thin-plate spline (Matthews and Baker 2004; Tzimiropoulos and Pantic 2013). Given a shape generated from the parameters p, this function defines how to warp the texture into a reference shape, for example, the mean shape $\bar{s}$. Figure 1.1 shows an example of AAM instantiation.

Figure 1.1. Example of AAM instantiation. The appearance computed from the parameters q is warped into the shape computed from the parameters p (Matthews and Baker 2004).
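A piecewise affine warp of this kind can be sketched with scikit-image, which applies one affine map per triangle of the landmark mesh; this is an illustrative stand-in for the warps described by Matthews and Baker (2004), and `warp_to_reference` is our own naming:

```python
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def warp_to_reference(image, shape, reference_shape):
    """Warps the face texture from the landmark configuration `shape` onto
    `reference_shape` (e.g. the mean shape), one affine map per triangle."""
    t = PiecewiseAffineTransform()
    # skimage's warp() expects the inverse map (output -> input coordinates),
    # so the transform is estimated from the reference shape to the shape.
    t.estimate(reference_shape, shape)  # both are (L, 2) landmark arrays
    return warp(image, t, output_shape=image.shape[:2])
```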

Finally, the AAMs can be fitted to a previously unseen image by optimizing the following cost function, which minimizes the appearance reconstruction error with respect to the shape and appearance parameters:

$\arg\min_{p,q} \left\| F\big(I(W(p))\big) - (\bar{a} + U_a q) \right\|^2$ [1.3]

The optimization is an iterative process by which the parameters p and q are found so that the best instance of the model is fitted to the given image. This can be solved in two main ways: analytically or by learning. The first typically uses gradient descent optimization algorithms (Matthews and Baker 2004; Gross et al. 2005; Papandreou and Maragos 2008). As standard gradient descent is inefficient for this problem, other ways to update the parameters have been proposed. In the project-out inverse compositional algorithm (Matthews and Baker 2004), shape and appearance are decoupled by projecting out the appearance variations. This results in faster fitting but also in convergence issues that decrease robustness. With the simultaneous inverse compositional algorithm (Gross et al. 2005), shape and appearance parameters are optimized simultaneously, which provides a more robust fitting but at a higher computational cost. It is, however, possible to speed it up using the alternating inverse compositional algorithm (Papandreou and Maragos 2008), which optimizes the shape and appearance parameters in an alternating manner instead of simultaneously. The second optimization approach typically uses cascaded regression to learn a mapping function between the facial appearance and the shape parameters (Cootes et al. 2001; Saragih and Goecke 2007). The original AAMs (Cootes et al. 2001) employ linear regression; nonlinear regression has also been proposed (Saragih and Goecke 2007). This approach may be efficient but makes the invalid assumption that there is a constant linear relationship between the image features and the parameter updates (Matthews and Baker 2004). Despite these various efforts on optimization strategies, AAMs remain difficult to optimize under uncontrolled conditions due to the high dimensionality of the appearance space and the propensity of the optimizer to converge to local minima.
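To make the iterative process concrete, here is a deliberately simple sketch that descends the cost of equation [1.3] with a finite-difference gradient; `warp_image`, `extract_features` and the `model` fields are hypothetical helpers, and this is not the project-out or simultaneous inverse compositional algorithm:

```python
import numpy as np

def fit_aam(image, model, p, q, lr=1e-3, iters=100, eps=1e-4):
    """Illustrative AAM fitting: plain gradient descent on the reconstruction
    error of equation [1.3]. warp_image and extract_features are assumed
    helpers implementing the motion model W and the feature function F."""
    def cost(p, q):
        warped = warp_image(image, model.mean_shape + model.U_s @ p)
        residual = extract_features(warped) - (model.mean_app + model.U_a @ q)
        return residual @ residual  # squared reconstruction error

    theta = np.concatenate([p, q])
    k = len(p)
    for _ in range(iters):
        grad = np.zeros_like(theta)
        # Finite-difference gradient over the stacked parameters; the real
        # inverse compositional algorithms use analytic warp Jacobians instead.
        for i in range(len(theta)):
            d = np.zeros_like(theta)
            d[i] = eps
            grad[i] = (cost(*np.split(theta + d, [k])) -
                       cost(*np.split(theta - d, [k]))) / (2 * eps)
        theta -= lr * grad
    return np.split(theta, [k])  # fitted p and q
```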

To overcome these problems, more robust part-based representations can be used. In part-based generative models, the local appearance around each landmark is extracted and combined to build a model of the whole face (Tzimiropoulos and Pantic 2014; Tzimiropoulos 2015), which reduces the size of the appearance space. Active pictorial structures (Antonakos et al. 2015) go further by taking advantage of the tree structure used in pictorial structures (PSs) (Felzenszwalb and Huttenlocher 2005) to replace PCA. PSs are extended to model the appearance of the face using multiple graph-based pairwise distributions between the facial parts, and prove to be more accurate than PCA. Note that ASMs are also considered a part-based generative model, with the difference that an appearance model is used for each facial part rather than a single model for all parts. Their evolution towards models such as CLMs (Cristinacce and Cootes 2006; Saragih et al. 2011), considered discriminative, is discussed in the next section. Although holistic and part-based models appear to have the same representational power, part-based models are easier to optimize and more robust to poor initialization, lighting and occlusions, since local features are usually not as sensitive as global features.

1.1.2. Discriminative approaches

Unlike generative approaches, which rely on statistical parametric models of appearance and shape, discriminative approaches learn a mapping from the image to the facial shape. In this section, two categories of approaches are distinguished. The first category refers to hybrid approaches such as CLMs (Cristinacce and Cootes 2006; Saragih et al. 2011). Such approaches are based on independent discriminative models of appearance, one for each facial part, constrained by a statistical parametric shape model, for example, a PDM. This means that there is still an optimization step during inference for this category. Besides, the ambiguity between the local appearance models of different landmarks can strongly impact performance under uncontrolled conditions. To address these difficulties, a second category of approaches has emerged. These approaches, the most notable of which is CSR (Dollár et al. 2010; Xiong and De la Torre 2013), infer the whole facial shape by directly learning one or more regression functions that implicitly encode the shape constraint. This provides more freedom than a parametric shape model would, as no explicit assumption on the distribution of the data is made. Nor is there any optimization during inference: the facial shape is directly estimated from the image features. Because of these distinctions, discriminative approaches are faster and more robust than generative ones. They generalize better to new, unseen images, as they can benefit from the large datasets that are omnipresent nowadays. However, it remains challenging to directly map the appearance to the facial shape when confronted with severe difficulties such as extreme poses or expressions. A third category could be added for DL approaches, for example, convolutional neural networks (CNNs). Most of the work today is based on deep neural networks, leaving traditional methods behind. These approaches are capable of learning highly discriminative, task-specific features and no longer need feature engineering. Given the breakthrough of DL in computer vision, a separate section (section 1.1.3) is dedicated to these approaches.
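To make the direct-mapping idea concrete, here is a rough sketch of a cascade of linear regressors in the spirit of CSR; the feature extractor, the number of stages and the least-squares stage regressors are illustrative assumptions, not the exact method of Dollár et al. (2010):

```python
import numpy as np

class CascadedShapeRegressor:
    """Illustrative cascaded shape regression: each stage regresses a shape
    update from shape-indexed features, implicitly encoding shape constraints."""
    def __init__(self, n_stages=5):
        self.n_stages = n_stages
        self.stages = []  # one linear regressor per stage

    def fit(self, images, true_shapes, init_shapes, extract_features):
        shapes = init_shapes.copy()
        for _ in range(self.n_stages):
            # Shape-indexed features, e.g. pixels sampled around current landmarks.
            X = np.stack([extract_features(img, s) for img, s in zip(images, shapes)])
            A = np.c_[X, np.ones(len(X))]  # append a bias column
            Y = true_shapes - shapes       # residual shape updates to learn
            W, *_ = np.linalg.lstsq(A, Y, rcond=None)
            self.stages.append(W)
            shapes = shapes + A @ W        # apply the learned update
        return self

    def predict(self, image, init_shape, extract_features):
        shape = init_shape.copy()
        for W in self.stages:
            x = np.append(extract_features(image, shape), 1.0)
            shape = shape + x @ W
        return shape
```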

1.1.2.1. Hybrid approaches

CLMs are an extension of ASMs, the main difference being that the appearance models are discriminative rather than generative. They form a part-based hybrid approach composed of two elements: discriminative local detectors or regressors to represent the appearance, and a generative shape model to regularize the deformations and ensure a valid face. In the following, the modeling, fitting and improvements related to this approach are described. Local detectors or regressors are used to compute response maps giving the probability that a given landmark is located at a specific position. To build them, ASMs initially used, for each landmark, a statistical model of the gray-level structure with the Mahalanobis distance as the response (Cootes et al. 1995). However, more powerful discriminative approaches, for example, binary classifiers such as logistic regression (Saragih et al. 2011) or support vector machines (SVMs) (Lucey et al. 2009), have since been used. Consider a linear SVM that determines whether or not a given patch matches the description of the region of a landmark. From the training data, positive and negative examples are extracted for each landmark to train different SVMs, also called patch experts. The response can be expressed as follows:

$R(I, c + \delta c) = \sum_{v=1}^{V} \alpha_v \lambda_v(c + \delta c)$ [1.4]

where I is the given image, c is the initial coordinates, δc is a displacement constrained by the PDM, the $\lambda_v$ are the responses of the V support vectors and the $\alpha_v$ are the corresponding support weights. The output corresponds to the inverted classifier score, which is 1 if positive and −1 otherwise. By fitting a logistic regression function to the output (Platt scaling (Platt 1999)), an approximate probabilistic output can be obtained. To refine the detection and maintain a valid shape, a shape constraint is then imposed using a statistical shape model such as the PDM, already presented in section 1.1.1.
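For intuition, here is a small illustrative sketch of computing a response map with a linear SVM patch expert; the patch size, search window, normalization and logistic calibration are assumptions, not values from Saragih et al. (2011):

```python
import numpy as np

def response_map(image, center, svm_w, svm_b, patch=11, search=15):
    """Slides a linear SVM patch expert over a search region around `center`
    and returns an approximately probabilistic response map via a logistic
    function, in the spirit of Platt scaling."""
    half_p, half_s = patch // 2, search // 2
    cx, cy = center
    scores = np.zeros((search, search))
    for dy in range(-half_s, half_s + 1):
        for dx in range(-half_s, half_s + 1):
            y, x = cy + dy, cx + dx
            p = image[y - half_p:y + half_p + 1, x - half_p:x + half_p + 1]
            # Linear SVM decision value on the normalized, vectorized patch.
            v = p.ravel().astype(float)
            v = (v - v.mean()) / (v.std() + 1e-8)
            scores[dy + half_s, dx + half_s] = svm_w @ v + svm_b
    # Map raw scores to pseudo-probabilities (illustrative calibration).
    return 1.0 / (1.0 + np.exp(-scores))
```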

Figure 1.2. Overview of CLM fitting: an ELS is generally performed to get a response map for each landmark. To refine the detection and maintain a valid shape, a shape constraint is then imposed using a statistical shape model (Saragih et al. 2011).

Hence, two steps (illustrated in Figure 1.2) can be distinguished when fitting a CLM. An exhaustive local search (ELS) (Saragih et al. 2011) is generally performed first to obtain a response map for each landmark. This step is followed by a shape refinement over these response maps to find the shape parameters that maximize the probability that the landmarks are accurately detected given the appearance features. The cost function can be expressed as:

$\arg\max_{p} \prod_{l=1}^{L} R\big(I, \bar{s}_l + U_l p\big)$ [1.5]

where I is the given image, L is the number of landmarks, $\bar{s}_l$ is the coordinate of the l-th landmark of the mean shape, $U_l$ is the corresponding block of shape eigenvectors, p is the vector of shape parameters and R is the response. The Gauss–Newton algorithm, although it suffers from local minima, is generally employed to solve this problem (Saragih et al. 2011). Also, the true response maps are not directly used, due to performance issues; they are replaced by approximations. There are several approximation techniques, most of them parametric, yet a non-parametric representation known as regularized landmark mean-shift (RLMS) has proved to be a good balance between representational power and computational complexity (Saragih et al. 2011). It takes the form of a Gaussian kernel density estimate (Silverman 2018):

$p(l_i = 1 \mid x_i, I) = \sum_{y_i \in \Psi_i} \pi_{y_i} \, \mathcal{N}(x_i; y_i, \rho \mathrm{I})$ [1.6]

where $y_i$ is a candidate location among the $\Psi_i$ locations within a given region, $\pi_{y_i}$ is the likelihood that the i-th landmark is aligned at location $y_i$, $x_i$ is the location of the i-th landmark generated by the shape model, ρ is the variance of the noise on landmark locations and $\mathrm{I}$ is the identity matrix. A regression-based fitting approach can also be used to learn mapping functions from the response maps to the shape parameter updates; this is known as discriminative response map fitting (Asthana et al. 2013). Given its discriminative nature and its ability to benefit from large amounts of data, this approach achieves a performance gain over RLMS fitting.
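To illustrate equation [1.6], here is a small sketch of one mean-shift style update for a single landmark over its response map; the grid spacing, variance and plain weighted averaging are illustrative simplifications, as the full RLMS derivation of Saragih et al. (2011) also folds in the PDM prior:

```python
import numpy as np

def mean_shift_update(x_i, candidates, pi, rho=2.0):
    """One mean-shift step for landmark i: Gaussian-weighted average of
    candidate locations y_i, weighted by the response likelihoods pi,
    following the kernel density estimate of equation [1.6]."""
    diffs = candidates - x_i  # (K, 2) offsets y_i - x_i
    # Gaussian kernel N(x_i; y_i, rho * I), up to a constant factor.
    kernel = np.exp(-0.5 * (diffs ** 2).sum(axis=1) / rho)
    weights = pi * kernel
    weights /= weights.sum() + 1e-12
    return weights @ candidates  # new estimate of x_i

# Illustrative usage: candidates on a 15x15 grid around the current estimate,
# with `responses` coming from a patch expert such as the one sketched above.
# grid = np.stack(np.meshgrid(np.arange(15), np.arange(15)), -1).reshape(-1, 2)
# x_new = mean_shift_update(np.array([7.0, 7.0]), grid.astype(float), responses.ravel())
```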