3D Face Modeling, Analysis and Recognition presents methodologies for analyzing shapes of facial surfaces, develops computational tools for analyzing 3D face data, and illustrates them using state-of-the-art applications. The methodologies chosen are based on efficient representations, metrics, comparisons, and classifications of features that are especially relevant in the context of 3D measurements of human faces. These frameworks have a long-term utility in face analysis, taking into account the anticipated improvements in data collection, data storage, processing speeds, and application scenarios expected as the discipline develops further.
The book covers face acquisition through 3D scanners and 3D face pre-processing, before examining the three main approaches for 3D facial surface analysis and recognition: facial curves; facial surface features; and 3D morphable models. Whilst the focus of these chapters is fundamentals and methodologies, the algorithms provided are tested on facial biometric data, thereby continually showing how the methods can be applied.
Key features:
• Explores the underlying mathematics and applies these mathematical techniques to 3D face analysis and recognition
• Provides coverage of a wide range of applications including biometrics, forensic applications, facial expression analysis, and model fitting to 2D images
• Contains numerous exercises and algorithms throughout the book
Contents
Cover
Title Page
Copyright
Preface
Introduction
Scope of the book
List of Contributors
Chapter 1: 3D Face Modeling
1.1 Challenges and Taxonomy of Techniques
1.2 Background
1.3 Static 3D Face Modeling
1.4 Dynamic 3D Face Reconstruction
1.5 Summary and Conclusions
Exercises
References
Chapter 2: 3D Face Surface Analysis and Recognition Based on Facial Surface Features
2.1 Geometry of 3D Facial Surface
2.2 Curvatures Extraction from 3D Face Surface
2.3 3D Face Segmentation
2.4 3D Face Surface Feature Extraction and Matching
2.5 Deformation Modeling of 3D Face Surface
Exercises
References
Chapter 3: 3D Face Surface Analysis and Recognition Based on Facial Curves
3.1 Introduction
3.2 Facial Surface Modeling
3.3 Parametric Representation of Curves
3.4 Facial Shape Representation Using Radial Curves
3.5 Shape Space of Open Curves
3.6 The Dense Scalar Field (DSF)
3.7 Statistical Shape Analysis
3.8 Applications of Statistical Shape Analysis
3.9 The Iso-geodesic Stripes
Exercises
Glossary
References
Chapter 4: 3D Morphable Models for Face Surface Analysis and Recognition
4.1 Introduction
4.2 Data Sets
4.3 Face Model Fitting
4.4 Dynamic Model Expansion
4.5 Face Matching
4.6 Concluding Remarks
Exercises
References
Chapter 5: Applications
5.1 Introduction
5.2 3D Face Databases
5.3 3D Face Recognition
5.4 Facial Expression Analysis
5.5 4D Facial Expression Recognition
Exercises
Glossary
References
Index
This edition first published 2013 © 2013, John Wiley & Sons Ltd
Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Daoudi, Mohamed, 1964–
3D face modeling, analysis, and recognition / Mohamed Daoudi, Anuj Srivastava, Remco Veltkamp.
pages cm
Includes bibliographical references and index.
ISBN 978-0-470-66641-8 (cloth)
1. Three-dimensional imaging. 2. Human face recognition (Computer science) 3. Face--Computer simulation. I. Srivastava, Anuj, 1968– II. Veltkamp, Remco C., 1963– III. Title. IV. Title: Three dimensional face modeling, analysis, and recognition.
TA1637.D365 2013
006.6′93–dc23
2013005799
A catalogue record for this book is available from the British Library
ISBN: 9780470666418
Preface
Introduction
The human face has long been an object of fascination, investigation, and analysis. It is so familiar to our visual cognition system that we can recognize a person’s face in difficult visual environments, that is, under arbitrary lighting conditions and pose variations. A common question among researchers is whether a computer vision system can process and analyze 3D faces as the human vision system does. In addition to understanding human cognition, there is also increasing interest in analyzing shapes of facial surfaces for developing applications such as biometrics, human–computer interaction (HCI), facial surgery, video communications, and 3D animation.
Because facial biometrics is natural, contact free, nonintrusive, and of psychological interest, it has emerged as a popular modality in the biometrics community. Unfortunately, the technology for 2D image-based face recognition still faces difficult challenges. Face recognition is made difficult by data variability caused by pose variations, lighting conditions, occlusions, and facial expressions. Because of the robustness of 3D observations to lighting conditions and pose variations, face recognition using shapes of facial surfaces has become a major research area in the last few years. Many of the state-of-the-art methods have focused on the variability caused by facial deformations, for example, those caused by face expressions, and have proposed methods that are robust to such shape variations.
Another important use of 3D face analysis is in the area of computer interaction. As machines become more and more involved in everyday human life and take on increasing roles in both their living and work spaces, they need to become more intelligent in terms of understanding human moods and emotions. Embedding these machines with a system capable of recognizing human emotions and mental states is precisely what the HCI research community is focused on. Facial expression recognition is a challenging task that has seen a growing interest within the research community, impacting important applications in fields related to HCI. Toward building human-like emotionally intelligent HCI devices, scientists are trying to include identifiers of the human emotional state in such systems. Recent developments in 3D acquisition sensors have made 3D data more readily available. Such data help alleviate problems inherent in 2D data such as illumination, pose, and scale variations as well as low resolution.
The interest in 3D facial shape analysis is fueled by the recent advent of cheaper and lighter scanners that can provide high resolution measurements of both geometry and texture of human facial surfaces. One general goal here is to develop computational tools for analyzing 3D face data. In particular, there is interest in quantifiably comparing the shapes of facial surfaces. This can be used to recognize human beings according to their facial shapes, to measure changes in a facial shape following a surgery, or to study/capture the variations in facial shapes during conversations and expressions of emotions. Accordingly, the main theme of this book is to develop computational frameworks for analyzing shapes of facial surfaces. In this book, we use some basic and some advanced tools from differential geometry, Riemannian geometry, algebra, statistics, and computer science to develop the desired algorithms.
Scope of the book
This book, which focuses on 3D face modeling, processing, and applications, is divided into five chapters.
Chapter 1 provides a brief overview of successful ideas in the literature, starting with some background material and important basic ideas. In particular, the principles of depth from triangulation and shape from shading are explained first. Then, an original 3D face (static or dynamic) modeling-guided taxonomy is proposed. Next, a survey of successful approaches that have led to commercial systems is given in accordance with the proposed taxonomy. Finally, a general review of these approaches is provided according to intrinsic factors (spatial and temporal resolutions, depth accuracy, sensor cost, etc.) and extrinsic factors (motion speed, illumination changes, face details, intrusiveness, need for user cooperation, etc.).
Chapter 2 discusses the state of the art in 3D surface features for the recognition of the human face. Particular emphasis is laid on the most prominent and recent contributions. The features extracted from 3D facial surfaces serve as means for dimensionality reduction of surface data and for facilitating the task of face recognition. The complexity of extraction, descriptiveness, and robustness of features directly affect the overall accuracy, performance, and robustness of the 3D recognition system.
Chapter 3 presents a novel geometric framework for analyzing 3D faces, with the specific goals of comparing, matching, and averaging their shapes. In this framework, facial surfaces are represented by radial curves emanating from the nose tip. These curves, in turn, are compared using elastic shape analysis to develop a Riemannian framework for full facial surfaces. This representation, along with the elastic Riemannian metric, seems natural for measuring facial deformations and is robust to data issues such as large facial expressions. One difficulty in extracting facial curves from the surface of 3D face scans is the presence of noise. A possible way to smooth out the effect of the noise without losing the effectiveness of the representation is to consider aggregates of facial curves, called iso-geodesic stripes, as opposed to individual curves.
Chapter 4 presents an automatic and efficient method to fit a statistical deformation model of the human face to 3D scan data. In a global-to-local fitting scheme, the shape parameters of this model are optimized such that the produced instance of the model accurately fits the 3D scan data of the input face. To increase the expressiveness of the model and to produce a tighter fit of the model, the method fits a set of predefined face components and blends these components afterwards. In the case that a face cannot be modeled, the automatically acquired model coefficients are unreliable, which hinders the automatic recognition. Therefore, we present a bootstrapping algorithm to automatically enhance a 3D morphable face model with new face data. The accurately generated face instances are manifold meshes without noise and holes, and can be effectively used for 3D face recognition. The results show that model coefficient based face matching outperforms contour curve and landmark based face matching, and is more time efficient than contour curve matching.
Although there have been many research efforts in the area of 3D face analysis in the last few years, the development of potential applications and exploitation of face recognition tools is still in its infancy. Chapter 5 summarizes recent trends in 3D face analysis with particular emphasis on the application techniques introduced and discussed in the previous chapters. The chapter discusses how 3D face analysis has been used to improve face recognition in the presence of facial expressions and missing parts, and how 3D techniques are now being extended to process dynamic sequences of 3D face scans for the purpose of facial expression recognition.
We hope that this will serve as a good reference book for researchers and students interested in this field.
Mohamed Daoudi, TELECOM Lille 1/LIFL, France
Anuj Srivastava, Florida State University, USA
Remco Veltkamp, Utrecht University, The Netherlands
List of Contributors
Faisal Radhi M. Al-Osaimi, Department of Computer Engineering, College of Computer & Information Systems, Umm Al-Qura University, Saudi Arabia
Mohsen Ardabilian, Ecole Centrale de Lyon, Département Mathématiques – Informatique, France
Boulbaba Ben Amor, TELECOM Lille1, France
Mohammed Bennamoun, School of Computer Science & Software Engineering, The University of Western Australia, Australia
Stefano Berretti, Dipartimento di Sistemi e Informatica Università degli Studi di Firenze, Italy
Alberto del Bimbo, Dipartimento di Sistemi e Informatica, Università degli Studi di Firenze, Italy
Liming Chen, Ecole Centrale de Lyon, Département Mathématiques Informatique, France
Mohamed Daoudi, TELECOM Lille1, France
Hassen Drira, TELECOM Lille1, France
Frank B. ter Haar, TNO, Intelligent Imaging, The Netherlands
Pietro Pala, Dipartimento di Sistemi e Informatica, Università di Firenze, Italy
Anuj Srivastava, Department of Statistics, Florida State University, USA
Remco Veltkamp, Department of Information and Computing Sciences, Universiteit Utrecht, The Netherlands
1
3D Face Modeling
Boulbaba Ben Amor,1 Mohsen Ardabilian,2 and Liming Chen2
1Institut Mines-Télécom/Télécom Lille 1, France
2Ecole Centrale de Lyon, France
Acquiring, modeling, and synthesizing realistic 3D human faces and their dynamics have emerged as an active research topic on the border between the computer vision and computer graphics fields. This has resulted in a plethora of different acquisition systems and processing pipelines that share many fundamental concepts as well as specific implementation details. The research community has investigated the possibility of targeting either end-to-end consumer-level or professional-level applications, such as facial geometry acquisition for 3D-based biometrics and the capture of its dynamics for expression cloning or performance capture and, more recently, for 4D expression analysis and recognition. Despite the rich literature, reproducing realistic human faces remains a distant goal because the challenges facing 3D face modeling remain open. These challenges include the motion speed of the face when conveying expressions and the variability in lighting conditions and pose. In addition, human beings are very sensitive to facial appearance and quickly sense any anomaly in the 3D geometry or dynamics of faces. The techniques developed in this field attempt to recover facial 3D shapes from camera(s) and reproduce their actions. Consequently, they seek to answer the following questions:
• How can one recover facial shapes under pose and illumination variations?
• How can one synthesize realistic dynamics from the obtained 3D shape sequences?
This chapter provides a brief overview of the most successful existing methods in the literature by first introducing the basics and background material essential to understand them. To this end, instead of the classical passive/active taxonomy of 3D reconstruction techniques, we propose here to categorize approaches according to whether they are able to acquire faces in action or can only capture them in a static state. Thus, this chapter is preliminary to the following chapters, which use static or dynamic facial data for face analysis, recognition, and expression recognition.
1.1 Challenges and Taxonomy of Techniques
Capturing and processing human geometry is at the core of several applications. To work on 3D faces, one must first be able to recover their shapes. In the literature, several acquisition techniques exist, either dedicated to specific objects or general purpose. Usually accompanied by geometric modeling tools and post-processing of 3D entities (3D point clouds, 3D meshes, volumes, etc.), these techniques provide complete solutions for full 3D object reconstruction. The acquisition quality is mainly linked to the accuracy of recovering the z-coordinate (the depth information). It is characterized by reconstruction fidelity, in other words, by data quality, the density of the 3D face models, detail preservation (regions showing changes in shape), etc. Other important criteria are the acquisition time, the ease of use, and the sensor’s cost. In what follows, we report the main extrinsic and intrinsic factors that can influence the modeling process.
Extrinsic factors. These are related to the environmental conditions of the acquisition and to the face itself. Human faces are globally similar in terms of the position of the main features (eyes, mouth, nose, etc.), but can vary considerably in detail owing to (i) facial deformations (caused by expressions and mouth opening), subject aging (wrinkles), etc., and (ii) person-specific details such as skin color, scar tissue, face asymmetry, etc. The environmental factors refer to lighting conditions (controlled or ambient) and changes in head pose.
Intrinsic factors. These include the sensor’s cost, its intrusiveness, the manner of its use (cooperative or not), spatial and/or temporal resolutions, measurement accuracy, and the acquisition time, which determines whether one can capture moving faces or only faces in a static state.
These challenges arise when acquiring static faces as well as when dealing with faces in action. Different applications have different requirements. For instance, in the computer graphics community, the results of performance capture should exhibit a great deal of spatial fidelity and temporal accuracy to be an authentic reproduction of a real actor’s performance. Facial recognition systems, on the other hand, require the accurate capture of person-specific details. The movie industry, for instance, may afford a 3D modeling pipeline with special-purpose hardware and highly specialized sensors that require manual calibration. When deploying a 3D acquisition system for facial recognition at airports and in train stations, however, cost, intrusiveness, and the need for user cooperation, among others, are important factors to consider. In ambient intelligence applications where a user-specific interface is required, facial expression recognition from 3D sequences is emerging as a research trend, in place of 2D-based techniques that are sensitive to illumination changes and pose variations. Here, too, sensor cost and the capability to capture facial dynamics are important issues.
Figure 1.1 shows a new 3D face modeling-guided taxonomy of existing reconstruction approaches. This taxonomy proposes two categories: the first targets static 3D face modeling, while the approaches belonging to the second try to capture facial shapes in action (i.e., in the 3D+t domain). At the level below, one finds different approaches based on the concepts presented in section 1.2. In the static face category, multi-view stereo reconstruction uses the optical triangulation principle to recover the depth information of a scene from two or more projections (images). The same mechanism is used unconsciously by our brain to work out how far away an object is. The correspondence problem in multi-view approaches is solved by looking for pixels that have the same appearance in the set of images; this is known as the stereo-matching problem. Laser scanners use the optical triangulation principle, this time called active, by replacing one camera with a laser source that emits a stripe in the direction of the object to scan; a second camera at a different viewpoint captures the projected pattern. In addition to one or several cameras, time-coded structured-light techniques use a light source to project onto the scene a set of light patterns that serve as codes for finding correspondences between stereo images. Thus, they are also based on the optical triangulation principle.
Figure 1.1 Taxonomy of 3D face modeling techniques
The moving-face modeling category, unlike the first, needs fast processing for 3D shape recovery and thus must tolerate scene motion. Structured-light techniques using a single complex pattern are one solution. In the same direction, the work called Spacetime Faces shows remarkable results in dynamic 3D shape modeling by projecting random colored light on the face to solve the stereo-matching problem. Time-of-flight-based techniques can be used to recover the dynamics of human body parts such as faces, but with modest shape accuracy. Recently, photometric stereo has been used to acquire 3D faces because it can recover a dense normal field of a surface. In the following sections, this chapter first gives the basic principles shared by the techniques mentioned earlier, and then addresses the details of each method.
1.2 Background
In the projective pinhole camera model, a point P in 3D space is imaged into a point p on the image plane. p is related to P by the following formula:

$$p \simeq MP = K\,[\,I \mid 0\,]\begin{bmatrix} R & t \\ 0^{\top} & 1 \end{bmatrix} P \qquad (1.1)$$

where p and P are represented in homogeneous coordinates, M is a projection matrix, and I is the identity matrix. M can be decomposed into two components: the intrinsic parameters and the extrinsic parameters. Intrinsic parameters relate to the internal parameters of the camera, such as the image coordinates of the principal point, the focal length, the pixel shape (its aspect ratio), and the skew. They are represented by the upper triangular matrix K. Extrinsic (or external) parameters relate to the pose of the camera, defined by the rotation matrix R and its position t with respect to a global coordinate system. Camera calibration is the process of estimating the intrinsic and extrinsic parameters of the cameras.
3D reconstruction can be roughly defined as the inverse of the imaging process: given a pixel p in one image, 3D reconstruction seeks to find the 3D coordinates of the point P that is imaged onto p. This is an ill-posed problem because, in the inverse imaging process, a pixel p maps into a ray v that starts from the camera center and passes through p. The ray direction can be computed from the camera pose R and the intrinsic parameters K as follows:

$$v = R^{-1} K^{-1} p \qquad (1.2)$$
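For concreteness, here is a minimal numpy sketch of the projection in Equation (1.1) and the back-projected ray of Equation (1.2); the focal length, principal point, and pose values are arbitrary placeholders rather than values from the text.

```python
import numpy as np

# Intrinsic matrix K (upper triangular): focal length, principal point, zero skew.
# These values are illustrative placeholders.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Extrinsic parameters: rotation R and translation t (camera pose).
R = np.eye(3)
t = np.array([0.0, 0.0, 0.0])

def project(P_world):
    """Project a 3D point onto the image plane (Equation 1.1)."""
    P_cam = R @ P_world + t          # world -> camera coordinates
    p_hom = K @ P_cam                # camera -> homogeneous pixel coordinates
    return p_hom[:2] / p_hom[2]      # perspective division

def backproject_ray(pixel):
    """Direction of the viewing ray through a pixel (Equation 1.2)."""
    p_hom = np.array([pixel[0], pixel[1], 1.0])
    v = np.linalg.inv(R) @ np.linalg.inv(K) @ p_hom
    return v / np.linalg.norm(v)     # unit direction; the depth along it is unknown

# A single pixel constrains the 3D point only up to an unknown depth along the ray.
p = project(np.array([0.1, -0.2, 2.0]))
v = backproject_ray(p)
```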
1.2.1 Depth from Triangulation
If q is the image of the same 3D point P taken by another camera from a different viewing angle, then the 3D coordinates of P can be recovered by estimating the intersection of the two rays, v1 and v2, that start from the camera centers and pass, respectively, through p and q. This is known as the optical triangulation principle. p and q are called corresponding or matching pixels because they are the images of the same 3D point P.
A 3D point P is the intersection of n (n > 1) rays $v_i$ passing through the optical centers $c_i$ of the cameras, i = 1, …, n. This is also referred to as passive optical triangulation. As illustrated in Figure 1.2, all points on $v_i$ project to $p_i$; given a set of corresponding pixels $p_i$ captured by the cameras $c_i$, and their corresponding rays $v_i$, the 3D location of P can be found by intersecting the rays $v_i$. In practice, however, these rays will often not intersect. Instead, we look for the optimal value of P that lies closest to the rays $v_i$. Mathematically, if $K_i$, $R_i$, $t_i$ are the parameters of the camera $c_i$, where $K_i$ is the matrix that contains the intrinsic parameters of the camera and $R_i$ and $t_i$ give the pose of the i-th camera with respect to the world coordinate system, the ray $v_i$ originating at $c_i$ and passing through $p_i$ is in the direction of $R_i^{-1}K_i^{-1}p_i$. The optimal value of P is the one that lies closest to all the rays, that is, the P that minimizes the distance:

$$P^{*} = \arg\min_{P} \sum_{i=1}^{n} \left\| (P - c_i) - \left[ (P - c_i)^{\top}\hat{v}_i \right]\hat{v}_i \right\|^{2}, \qquad \hat{v}_i = \frac{v_i}{\|v_i\|} \qquad (1.3)$$
Figure 1.2 Multiview stereo determines the position of a point P in space by finding the intersection of the rays vi passing through the center of projection ci of the i-th camera and the projection of the point P in each image, pi
Methods based on optical triangulation need to solve two problems: (i) the correspondence problem, and (ii) the reconstruction problem. The correspondence problem consists of finding matching points across the different cameras. Given the corresponding points, the reconstruction problem consists of computing a 3D disparity map of the scene, which is equivalent to the depth map (z-coordinate at each pixel). Consequently, the quality of the reconstruction depends crucially on the solution to the correspondence problem. For further reading on stereo vision (camera calibration, stereo matching algorithms, reconstruction, etc.), we refer the reader to Richard Szeliski's Computer Vision: Algorithms and Applications, available as a PDF at http://szeliski.org.
Existing optical triangulation-based 3D reconstruction techniques, such as multi-view stereo, structured-light techniques, and laser-based scanners, differ in the way the correspondence problem is solved. Multiview stereo reconstruction uses the triangulation principle to recover the depth map of a scene from two or more projections. The same mechanism is unconsciously used by our brain to work out how far an object is. The correspondence problem in stereo vision is solved by looking for pixels that have the same appearance in the set of images. This is known as stereo matching. Structured-light techniques use, in addition to camera(s), a light source to project on the scene a set of light patterns that are used as codes for finding correspondences between stereo images. Laser scanners use the triangulation principle by replacing one camera with a laser source that emits a laser ray in the direction of the object to scan. A camera from a different viewpoint captures the projected pattern.
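To make the triangulation of Equation (1.3) concrete, the following is a minimal numpy sketch of the least-squares point closest to a set of rays; the camera centers and ray directions are assumed to come from calibration and stereo matching.

```python
import numpy as np

def triangulate(centers, directions):
    """Least-squares 3D point closest to a set of rays (camera center + direction)."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, v in zip(centers, directions):
        v = v / np.linalg.norm(v)
        M = np.eye(3) - np.outer(v, v)   # projector onto the plane orthogonal to the ray
        A += M
        b += M @ c
    return np.linalg.solve(A, b)         # minimizes the summed squared point-to-ray distances

# Two rays that (nearly) intersect at (0, 0, 2); in practice they come from Equation (1.2).
centers = [np.array([0.0, 0.0, 0.0]), np.array([0.5, 0.0, 0.0])]
directions = [np.array([0.0, 0.0, 1.0]), np.array([-0.25, 0.0, 1.0])]
P = triangulate(centers, directions)
```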
1.2.2 Shape from Shading
Artists have long reproduced, in paintings, the illusion of depth using lighting and shading. Shape From Shading (SFS) addresses the shape recovery problem from a gradual variation of shading in the image. Image formation is a key ingredient in solving the SFS problem. In the early 1970s, Horn was the first to formulate the SFS problem as that of finding the solution of a nonlinear first-order Partial Differential Equation (PDE), also called the brightness equation. In the 1980s, researchers addressed the computational part of the problem, directly computing numerical solutions. Bruss and Brooks asked questions about the existence and uniqueness of solutions. According to the Lambertian model of image formation, the gray level at an image pixel depends on the light source direction and the surface normal. Thus, the aim is to recover the illumination source and the surface shape at each pixel. According to Horn’s formulation of the SFS problem, the brightness equation is:

$$I(x, y) = R\big(\mathbf{n}(x, y)\big) \qquad (1.4)$$
where (x, y) are the coordinates of a pixel, R is the reflectance map, and I is the brightness image. SFS approaches, particularly those dedicated to face shape recovery, usually adopt the Lambertian property of the surface, in which case the reflectance map is the cosine of the angle θ between the light vector L and the normal vector n to the surface:

$$R = \cos\theta = \mathbf{L} \cdot \mathbf{n} \qquad (1.5)$$
where the normal n, and hence R, depends on (x, y). Since the first SFS technique developed by Horn, many different approaches have emerged; active SFS, which uses calibration to simplify finding the solution, has achieved impressive results.
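To illustrate the Lambertian image-formation model underlying SFS (Equations (1.4)-(1.5)), here is a minimal numpy sketch that renders the shading of a surface from its normal field and a light direction; the synthetic hemisphere normals and the light direction are illustrative assumptions, not data from the book.

```python
import numpy as np

def lambertian_shading(normals, light):
    """Brightness image I = max(0, L . n) for a light vector L and per-pixel normals n."""
    light = light / np.linalg.norm(light)
    shading = np.tensordot(normals, light, axes=([2], [0]))  # dot product per pixel
    return np.clip(shading, 0.0, 1.0)                        # no negative brightness

# Synthetic normal field of a hemisphere (a stand-in for a face surface).
h, w = 128, 128
y, x = np.mgrid[-1:1:h*1j, -1:1:w*1j]
r2 = np.clip(1.0 - x**2 - y**2, 0.0, None)
normals = np.dstack([x, y, np.sqrt(r2)])
normals /= np.linalg.norm(normals, axis=2, keepdims=True) + 1e-9

I = lambertian_shading(normals, light=np.array([0.3, 0.3, 1.0]))
# SFS solves the inverse problem: recover the normals (and depth) given I and the light.
```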
1.2.3 Depth from Time of Flight (ToF)
Time of flight (ToF) provides a direct way to acquire 3D surface information of objects or scenes, outputting 2.5D (depth) images with real-time capability. The main idea is to estimate the time taken for the light projected by an illumination source to return from the scene or object surface. This approach usually requires nanosecond timing to resolve surface measurements to millimeter accuracy. The object or scene is actively illuminated with a light source whose spectrum is usually nonvisible infrared, e.g., 780 nm. The intensity of the active signal is modulated by a cosine-shaped signal of frequency f. The light signal is assumed to have a constant speed c and is reflected by the scene or object surface. The distance d is estimated from the phase shift φ, in radians, between the emitted and the reflected signals:

$$d = \frac{c}{4\pi f}\,\varphi \qquad (1.6)$$
While conventional imaging sensors consist of multiple photodiodes arranged within a matrix to provide an image of, e.g., color or gray values, a ToF sensor, for instance a photonic mixer device (PMD) sensor, simultaneously acquires a distance value for each pixel in addition to the common intensity (gray) value. Compared with conventional imaging sensors, a PMD sensor is a standard CMOS sensor extended with this per-pixel distance-measuring functionality: the chip includes all the intelligence, which means that the distance is computed per pixel. In addition, some ToF cameras are equipped with a special pixel-integrated circuit that guarantees independence from sunlight through suppression of background illumination (SBI).
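A minimal sketch of the phase-to-distance conversion in Equation (1.6); the modulation frequency and phase values are arbitrary examples.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def tof_distance(phase_shift_rad, mod_freq_hz):
    """Distance from the phase shift of a modulated ToF signal (Equation 1.6)."""
    return C * phase_shift_rad / (4.0 * np.pi * mod_freq_hz)

# Example: a 20 MHz modulation; the unambiguous range is c / (2 f) = 7.5 m,
# reached when the phase shift wraps around at 2*pi.
f = 20e6
print(tof_distance(np.pi / 2, f))   # ~1.87 m
print(tof_distance(2 * np.pi, f))   # ~7.49 m (ambiguity limit)
```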
1.3 Static 3D Face Modeling
1.3.1 Laser-stripe Scanning
Laser-stripe triangulation uses the well-known optical triangulation described in section 1.2. A laser line is swept across the object while a CCD array camera captures the reflected light; the shape of the imaged stripe gives the depth information. More formally, as illustrated in Figure 1.3, a slit laser beam, generated by a light-projecting optical system, is projected onto the object to be measured, and its reflected light is received by a CCD camera for triangulation. Then, 3D distance data for one line of slit light are obtained. By scanning the slit light with a galvanic mirror, 3D data for the entire object to be measured are obtained. By measuring the angle θ formed by the baseline d (the distance between the light-receiving optical system and the light-projecting optical system) and the projected laser beam, one can determine the z-coordinate by triangulation. The angle θ is determined by an instruction value of the galvanic mirror. Absolute coordinates for the laser beam position on the surface of the object, denoted by P, are obtained from congruence conditions of triangles, by
(1.7)
This gives the z-coordinate, by
(1.8)
Solve question 1 in section 5.5.3 for the proof.
Figure 1.3 Optical triangulation geometry for a laser-stripe based scanner
The Charge-Coupled Device (CCD) is the most widely used light-receiving optical system to digitize the laser spot image. CCD-based sensors avoid beam spot reflection and stray light effects and provide more accuracy because of the single-pixel resolution. Another factor that affects the measurement accuracy is the difference between the surface characteristics of the measured object and those of the calibration surface; calibration should usually be performed on similar surfaces to ensure measurement accuracy. Using a laser as the light source, this method has proven able to provide measurements over a much larger depth range than passive systems, with good discrimination of noise factors. However, this line-by-line measurement technique is relatively slow. Laser-based techniques can give very accurate 3D information for a rigid body, even with a large depth range, but the method is time consuming because it acquires the 3D geometry one line at a time. Area-scanning methods such as time-coded structured light (see section 1.3.2) are certainly faster.
An example of a face acquired using this technique is given in Figure 1.4. It illustrates the good quality of the reconstruction under office-environment acquisition conditions, with the subject at a distance of 1 m from the sensor and remaining still for a few seconds.
Figure 1.4 One example of 3D face acquisition based on laser stripe scanning (using Minolta VIVID 910). Different representations are given, from the left: texture image, depth image, cloud of 3D points, 3D mesh, and textured shape
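As an illustration of the laser-stripe triangulation geometry of Figure 1.3, here is a minimal sketch under a simplified setup (camera at the origin looking along z, laser source offset by the baseline d along the x-axis, beam at angle theta to the baseline); the variable names and values are assumptions for illustration, not the book's notation.

```python
import numpy as np

def stripe_depth(x_img, f, d, theta):
    """Depth z of a laser-stripe point by optical triangulation (simplified geometry).

    Assumed setup, for illustration only: camera at the origin looking along +z with
    focal length f (in pixel units), laser source at (d, 0, 0), beam in the x-z plane
    at angle theta to the baseline, x_img the image x-coordinate of the stripe
    relative to the principal point.
    """
    # Intersect the camera ray x = z * x_img / f with the laser line z = tan(theta) * (d - x).
    return d * f * np.tan(theta) / (f + x_img * np.tan(theta))

# Sweeping the galvanic mirror changes theta; each stripe position yields one depth profile.
z = stripe_depth(x_img=50.0, f=800.0, d=0.3, theta=np.deg2rad(60))
```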
1.3.2 Time-coded Structured Light
The most widely used acquisition systems for faces are based on structured light, by virtue of its reliability in recovering complex surfaces and its accuracy. It consists in projecting a light pattern and imaging the illuminated object, a face for instance, from one or more points of view. Correspondences between image points and points of the projected pattern can then be easily found. Finally, the decoded points are triangulated and depth is recovered. The patterns are designed so that code words are assigned to sets of pixels.
A code word is assigned to each coded pixel to ensure a direct mapping from the code words to the corresponding coordinates of the pixel in the pattern. The code words are numbers, and they are mapped into the pattern using gray levels, colors, or geometrical representations. Pattern projection techniques can be classified according to their coding strategy: time-multiplexing, neighborhood codification, and direct codification. Time-multiplexing consists in projecting code words as a sequence of patterns over time, so the structure of every pattern can be very simple. In spite of increased complexity, neighborhood codification represents the code words in a single pattern. Finally, direct codification defines a code word for every pixel, equal to the pixel's gray level or color.
One of the most commonly exploited strategies is based on temporal coding. In this case, a set of patterns is successively projected onto the measured surface. The code word for a given pixel is usually formed by the sequence of illumination values for that pixel across the projected patterns. Thus, the codification is called temporal because the bits of the code words are multiplexed in time. This kind of pattern can achieve high accuracy in the measurements, for two reasons: first, because multiple patterns are projected, the code word basis tends to be small (usually binary) and hence a small set of primitives is used, which are easily distinguishable from one another; second, a coarse-to-fine paradigm is followed, because the position of a pixel is encoded more precisely as the patterns are successively projected.
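A minimal sketch of temporal (binary) codification: each projected pattern contributes one bit to every pixel's code word, and decoding the captured sequence recovers the stripe index used for triangulation. The thresholding step is simplified and the pattern count is an arbitrary example.

```python
import numpy as np

def binary_patterns(width, n_bits):
    """Stack of n_bits vertical stripe patterns; pattern k encodes bit k of the column index."""
    cols = np.arange(width)
    return np.stack([((cols >> (n_bits - 1 - k)) & 1).astype(np.uint8) * 255
                     for k in range(n_bits)])

def decode_codewords(captured, threshold=128):
    """Rebuild the per-pixel code word from the captured image sequence."""
    bits = (np.asarray(captured) > threshold).astype(np.uint32)
    code = np.zeros(bits.shape[1:], dtype=np.uint32)
    for b in bits:                 # most significant bit projected first
        code = (code << 1) | b
    return code                    # stripe index per pixel, matched against the projector

patterns = binary_patterns(width=1024, n_bits=10)   # 2**10 = 1024 stripes
# 'captured' would be the camera images of the illuminated face; here the patterns
# themselves serve as a stand-in.
stripe_index = decode_codewords(patterns[:, None, :])
```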
Figure 1.5 (a) Binary-coded patterns projection for 3D acquisition, (b) n-ary-coded patterns projection for 3D acquisition
During the last three decades, several techniques based on time-multiplexing have appeared. These techniques can be classified into three categories: binary codes (Figure 1.5a), n-ary codes (Figure 1.5b), and phase-shifting techniques.
Binary codes. In binary coding, only two illumination levels are used, coded as 0 and 1. Each pixel of the pattern has its code word formed by the sequence of 0s and 1s corresponding to its value in every projected pattern. A code word is obtained once the sequence is completed. In practice, the illumination source and the camera are assumed to be strongly calibrated, and hence only one of the two pattern axes is encoded. Consequently, black and white stripes are used to compose the patterns, black corresponding to 0 and white to 1; m patterns encode 2^m stripes. The maximum number of patterns that can be projected is the resolution in pixels of the projector device; however, because the camera cannot always perceive such narrow stripes, reaching this value is not recommended. It should be noticed that all pixels belonging to the same stripe in the highest-frequency pattern share the same code word. Therefore, before triangulating, it is necessary to compute either the center of every stripe or the edge between two consecutive stripes; the latter has been shown to be the better choice.
N-ary codes. The main drawback of binary codes is the large number of patterns to be projected. However, the fact that only two intensities are projected eases the segmentation of the imaged patterns. The number of patterns can be reduced by increasing the number of intensity levels used to encode the stripes. A first means is to use a multilevel Gray code based on color. This extension of the Gray code is based on an alphabet of n symbols, each symbol associated with a certain RGB color. This extended alphabet makes it possible to reduce the number of patterns: with a binary Gray code, m patterns are necessary to encode 2^m stripes, whereas with an n-ary code, n^m stripes can be coded using the same number of patterns.
Phase shifting. Phase shifting is a well-known principle in the pattern-projection approach to 3D surface acquisition. Here, a set of sinusoidal patterns is used. The intensity of a pixel p(x, y) in each of the three patterns is given by:

$$I_k(x, y) = I_0(x, y) + I_{mod}(x, y)\,\cos\!\big(\varphi(x, y) + (k - 2)\,\alpha\big), \quad k = 1, 2, 3 \qquad (1.9)$$
where I_0(x, y) is the background or texture information, I_{mod}(x, y) is the signal modulation amplitude, and I_1(x, y), I_2(x, y), and I_3(x, y) are the intensities of the three patterns; φ(x, y) is the phase value and α is a constant phase offset. The three images of the object are used to estimate a wrapped phase value by:

$$\varphi(x, y) = \arctan\!\left[\tan\!\left(\frac{\alpha}{2}\right)\frac{I_1(x, y) - I_3(x, y)}{2 I_2(x, y) - I_1(x, y) - I_3(x, y)}\right] \qquad (1.10)$$
The wrapped phase is periodic and needs to be unwrapped to obtain an absolute phase value Φ(x, y) = φ(x, y) + 2kπ, where k is an integer representing the period, or number of the fringe. Finally, the 3D information is recovered based on the projector-camera system configuration. Other configurations of these patterns have been proposed. For instance, Zhang and Yau proposed a real-time 3D shape measurement based on a modified three-step phase-shifting technique (Zhang et al., 2007) (Figure 1.6), which they call the 2+1 phase-shifting approach. According to this approach, the patterns and phase estimation are given by
$$I_1(x, y) = I_0(x, y) + I_{mod}(x, y)\,\sin\varphi(x, y), \quad I_2(x, y) = I_0(x, y) + I_{mod}(x, y)\,\cos\varphi(x, y), \quad I_3(x, y) = I_0(x, y) \qquad (1.11)$$

$$\varphi(x, y) = \arctan\!\left[\frac{I_1(x, y) - I_3(x, y)}{I_2(x, y) - I_3(x, y)}\right] \qquad (1.12)$$
A robust phase unwrapping approach called “multilevel quality-guided phase unwrapping algorithm” is also proposed in Zhang et al. (2007).
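A minimal numpy sketch of the three-step wrapped-phase computation (Equation (1.10)) with a phase shift of 2π/3, a common choice; the synthetic fringe images are illustrative stand-ins for camera captures, and unwrapping is reduced to numpy's 1D routine rather than the quality-guided algorithm cited above.

```python
import numpy as np

def wrapped_phase(I1, I2, I3, alpha=2 * np.pi / 3):
    """Wrapped phase from three phase-shifted fringe images (Equation 1.10)."""
    return np.arctan2(np.tan(alpha / 2) * (I1 - I3), 2 * I2 - I1 - I3)

# Synthetic fringe images of a flat surface (stand-ins for camera captures).
w = 640
phi_true = np.linspace(0, 6 * np.pi, w)          # three fringe periods across the image
alpha = 2 * np.pi / 3
I = [0.5 + 0.4 * np.cos(phi_true + k * alpha) for k in (-1, 0, 1)]

phi_wrapped = wrapped_phase(*I)
phi_unwrapped = np.unwrap(phi_wrapped)           # 1D unwrapping; real systems need 2D schemes
```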
Ouji et al. (2011) proposed a cost-effective 3D video acquisition solution with a 3D super-resolution scheme, using three calibrated cameras coupled with a non-calibrated projector device, which is particularly suited to 3D face scanning, that is, rapid, easily movable, and robust to ambient lighting conditions. Their solution is a hybrid stereovision and phase-shifting approach that not only takes advantage of the assets of stereovision and structured light but also overcomes their weaknesses. First, a 3D sparse model is estimated from stereo matching with a fringe-based resolution and a sub-pixel precision. Then projector parameters are automatically estimated through an inline stage. A dense 3D model is recovered by the intrafringe phase estimation, from the two sinusoidal fringe images and a texture image, independently from the left, middle, and right cameras. Finally, the left, middle, and right 3D dense models are fused to produce the final 3D model, which constitutes a spatial super-resolution. In contrast with previous methods, camera-projector calibration and phase-unwrapping stages are avoided.
Figure 1.6 The high-resolution and real-time 3D shape measurement system proposed by Zhang and Yau (2007) is based on the modified 2 + 1 phase-shifting algorithm and particularly adapted for face acquisition. The data acquisition speed is as high as 60 frames per second while the image resolution is 640 × 480 pixels per frame. Here a photograph captured during the experiment is illustrated. The left side of the image shows the subject, whereas the right side shows the real-time reconstructed geometry
1.3.3 Multiview Static Reconstruction
The aim of multiview stereo (MVS) reconstruction is twofold. First, it reinforces the constraints on stereo matching, discards false matches, and increases the precision of good matches. Second, the spatial arrangement of the cameras allows covering the entire face. To reduce complexity as well as achieve high-quality reconstruction, multiview reconstruction approaches usually proceed in a coarse-to-fine sequence. Finally, multiview approaches involve high-resolution images captured in real time, whereas the processing stage requires tens of minutes. MVS scene and object reconstruction approaches can be organized into four categories. The first category operates by first estimating a cost function on a 3D volume and then extracting a surface from this volume; a simple example of this approach is the voxel-coloring algorithm and its variants (Seitz and Dyer, 1997; Treuille et al., 2004). The second category of approaches, based on voxels, level sets, or surface meshes, works by iteratively evolving a surface to decrease or minimize a cost function. For example, starting from an initial volume, space carving progressively removes inconsistent voxels. Other approaches represent the object as an evolving mesh (Hernandez and Schmitt, 2004; Yu et al., 2006) moving as a function of internal and external forces. In the third category are image-space methods that estimate a set of depth maps. To ensure a single consistent 3D object interpretation, they enforce consistency constraints between depth maps (Kolmogorov and Zabih, 2002; Gargallo and Sturm, 2005) or merge the set of depth maps into a 3D object as a post-process (Narayanan et al., 1998). The final category groups approaches that first extract and match a set of feature points; a surface is then fitted to the reconstructed features (Morris and Kanade, 2000; Taylor, 2003). Seitz et al. (2006) propose an excellent overview and categorization of MVS. 3D face reconstruction approaches use a combination of methods from these categories.
Furukawa and Ponce (2009) proposed an MVS algorithm that outputs accurate models with fine surface detail. It implements multiview stereopsis as a simple match, expand, and filter procedure. In the matching step, a set of features localized by the Harris operator and difference-of-Gaussians algorithms are matched across multiple views, giving a sparse set of patches associated with salient image regions. From these initial matches, the next two steps are repeated n times (n = 3 in the experiments). In the expansion step, initial matches are spread to nearby pixels to obtain a dense set of patches. Finally, in the filtering step, visibility constraints are used to discard incorrect matches lying either in front of or behind the observed surface. The MVS approach proposed by Bradley et al. (2010) is based on an iterative binocular stereo method that reconstructs seven surface patches independently and merges them into a single high-resolution mesh. At this stage, face details and surface texture help guide the stereo algorithm. First, depth maps are created from pairs of adjacent rectified viewpoints. Then the most prominent distortions between the views are compensated by a scaled-window matching technique. The resulting depth images are converted to 3D points and fused into a single dense point cloud. A triangular mesh is then reconstructed from the initial point cloud over three steps: down-sampling, outlier removal, and triangle meshing. Sample reconstruction results of this approach are shown in Figure 1.7.
Figure 1.7 Sample results of the 3D modeling algorithm for calibrated multiview stereopsis proposed by Furukawa and Ponce (2010), which outputs a quasi-dense set of rectangular patches covering the surfaces visible in the input images. In each case, one of the input images is shown on the left, along with two views of texture-mapped reconstructed patches and shaded polygonal surfaces. Copyright © 2007, IEEE
The 3D face acquisition approach proposed by Beeler et al. (2010) takes inspiration from Furukawa and Ponce (2010). The main difference lies in the refinement formulation. The starting point is the established approach of refining recovered 3D data on the basis of a data-driven photo-consistency term and a surface-smoothing term, which has long been a research topic. The approach differs in its use of a second-order anisotropic formulation of the smoothing term, which the authors argue is particularly suited to faces. Camera calibration is achieved in a pre-processing stage.
The run-time system starts with a pyramidal pairwise stereo matching; results from lower resolutions guide the matching at higher resolutions. The face is first segmented on the basis of background subtraction and skin-color cues. Images from each camera pair are rectified. An image pyramid is then generated by factor-of-two downsampling using Gaussian convolution, stopping at approximately pixels for the lowest layer. Then a dense matching is established between pairwise neighboring cameras, and each layer of the pyramid is processed as follows. Matches are computed for all pixels on the basis of normalized cross-correlation (NCC) over a square window. The disparity is computed to sub-pixel accuracy and used to constrain the search area in the following layer. For each pixel, smoothness, uniqueness, and ordering constraints are checked, and pixels that do not fulfill these criteria are recovered using the disparity estimated at neighboring pixels. The limited search area ensures the smoothness and ordering constraints, but the uniqueness constraint is enforced again by disparity-map refinement. The refinement is defined as a linear combination of a photometric consistency term, dp, and a surface consistency term, ds, balanced by a user-specified smoothness parameter, ws, and a data-driven parameter, wp, that ensures the photometric term has the greatest weight in regions with good feature localization. dp favors solutions with high NCC, whereas ds favors smooth solutions. The refinement is performed on the disparity map and later on the surface; both are implemented as iterative processes.
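For illustration, a minimal sketch of the normalized cross-correlation (NCC) score used to compare square windows in the pairwise matching step; the window extraction and the candidate search around it are simplified assumptions, not the authors' implementation.

```python
import numpy as np

def ncc(window_a, window_b, eps=1e-9):
    """Normalized cross-correlation between two equally sized image windows (in [-1, 1])."""
    a = window_a.astype(np.float64).ravel()
    b = window_b.astype(np.float64).ravel()
    a = (a - a.mean()) / (a.std() + eps)
    b = (b - b.mean()) / (b.std() + eps)
    return float(np.mean(a * b))

def best_disparity(left, right, row, col, half, search_range):
    """Pick the disparity whose right-image window maximizes NCC with the left-image window."""
    ref = left[row - half:row + half + 1, col - half:col + half + 1]
    scores = []
    for d in range(search_range):
        cand = right[row - half:row + half + 1, col - d - half:col - d + half + 1]
        scores.append(ncc(ref, cand))
    return int(np.argmax(scores)), max(scores)
```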
The refinement results in surface geometry that is smooth across skin pores and fine wrinkles, because the disparity change across such a feature is too small to detect. The result is flatness and a lack of realism in synthesized views of the face. On the other hand, visual inspection shows the obvious presence of pores and fine wrinkles in the images. This is due to the fact that the light reflected by a diffuse surface is related to the integral of the incoming light: in small concavities such as pores, part of the incoming light is blocked, and the point thus appears darker. This has been exploited by various authors (e.g., Glencross et al., 2008) to infer local geometry variation. The method embeds this observation into the surface-refinement framework. It should be noticed that this refinement is qualitative, and the recovered geometry is not metrically correct. However, augmenting the macroscopic geometry with fine-scale features does produce a significant improvement in the perceived quality of the reconstructed face geometry.
For the mesoscopic augmentation, only features that are too small to be recovered by the stereo algorithm are of interest. Therefore, high-pass-filtered values are first computed for all points X using the projection of a Gaussian:
(1.13)
where C denotes the set of visible cameras, Σ_c the covariance matrix of the projection of the Gaussian into camera c, and the weighting term w_c is the cosine of the foreshortening angle observed at camera c. The variance of the Gaussian is chosen such that high spatial frequencies are attenuated. It can be defined either directly on the surface, using the known maximum size of the features, or in dependence on the matching window M. The next steps are based on the assumption that variation in mesoscopic intensity is linked to variation in the geometry, which is mostly the case for human skin. Spatially larger skin features tend to be smooth and are thus filtered out. The idea is thus to adapt the local high-frequency geometry of the mesh to this mesoscopic field: the geometry should locally form a concavity wherever the mesoscopic value decreases and a convexity where it increases.
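A much-simplified, single-camera sketch of the idea: the mesoscopic value of a point is the difference between its observed intensity and a Gaussian-smoothed version of the image, so that only pore- and wrinkle-scale detail survives. The Gaussian width and the use of a single view are assumptions for illustration; the actual method aggregates over all visible cameras with foreshortening weights.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mesoscopic_values(image, sigma=3.0):
    """High-pass field: intensity minus its Gaussian-blurred version (single-view sketch)."""
    image = image.astype(np.float64)
    low_pass = gaussian_filter(image, sigma=sigma)   # removes pore/wrinkle-scale detail
    return image - low_pass                          # negative in concavities (darker pores)

# The surface refinement would then push vertices inward where the field decreases
# (concavities) and outward where it increases (convexities).
```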
1.4 Dynamic 3D Face Reconstruction
The objective now is to create dynamic models that accurately recover the facial shape and capture the time-varying behavior of a real person’s face. Modeling facial dynamics is essential for several applications such as avatar animation, facial action analysis, and recognition. Compared with a static or quasi-static object (or scene), this is more difficult to achieve because of the required fast processing; this is the main limitation of the techniques described in Section 1.3. In particular, laser-based scanners and time-coded structured-light capture techniques do not operate effectively on fast-moving scenes because of the time required for scanning an object that is moving or deforming. In this section, we present techniques designed for moving/deforming face acquisition and the post-processing pipelines for performance capture or expression transfer.
1.4.1 Multiview Dynamic Reconstruction
Passive facial reconstruction has received particular attention because of its potential applications in facial animation. Recent research effort has focused on passive multi-view stereo (PMVS) for animated face capture without markers, makeup, active technology, or expensive hardware. A key step toward effective performance capture is to model the structure and motion of the face, which is a highly deformable surface. Furukawa and Ponce (2009) proposed a motion-capture approach from video streams that specifically aims at this challenge. Assuming that the instantaneous geometry of the face is represented by a polyhedral mesh with fixed topology, an initial mesh is constructed in the first frame using the PMVS software for MVS (Furukawa and Ponce, 2010) and the Poisson surface reconstruction software (Kazhdan et al., 2006) for meshing. Its deformation is then captured by tracking its vertices over time. The goal of the algorithm is to estimate, in each frame f, the position v_i^f of each vertex v_i (from now on, v_i^f will be used to denote both the vertex and its position). Each vertex may or may not be tracked at a given frame, including the first one, allowing the system to handle occlusion, fast motion, and parts of the surface that are not initially visible. The three steps of the tracking algorithm are local motion parameter estimation, global surface deformation, and filtering.
First, at each frame, approximating the local surface region around each vertex by its tangent plane gives the corresponding local 3D rigid motion with six degrees of freedom: three parameters encode normal information, while the remaining three contain tangential motion information. Then, on the basis of the estimated local motion parameters, the whole mesh is deformed by minimizing the sum of three energy terms.
(1.14)
The first, data term measures the squared distance between the vertex position v_i^f and the position estimated by the local estimation process. The second uses the discrete Laplacian operator of a local parameterization of the surface at v_i to enforce smoothness [the same weight values are used in all experiments (Furukawa and Ponce, 2009)]; this term is very similar to the Laplacian regularizer used in many other algorithms (Ponce, 2008). The third term is also used for regularization, and it enforces local tangential rigidity with no stretch, shrink, or shear. The total energy is minimized with respect to the 3D positions of all the vertices by a conjugate gradient method. In the case of deformable surfaces such as human faces, a nonstatic target edge length is computed on the basis of the non-rigid tangential deformation from the reference frame to the current one at each vertex. The estimation of the tangential deformation is performed at each frame before starting the motion estimation, and the parameters are fixed within a frame. Thus, the tangential rigidity term E_r(v_i^f) for a vertex v_i^f in the global mesh deformation is given by
(1.15)
which is the sum of squared differences between the actual edge lengths and those predicted from the reference frame to the current frame. A threshold term is used to make the penalty zero when the deviation is small, so that this regularization is enforced only when the data term is unreliable and the error is large; in all experiments, the threshold is set to 0.2 times the average edge length of the mesh at the first frame. Figure 1.8 shows some results of the motion-capture approach proposed in Furukawa and Ponce (2009).
Figure 1.8 Results of the motion-capture approach proposed by Furukawa and Ponce (2009), from multiple synchronized video streams, based on regularization adapted to nonrigid tangential deformation. From left to right: a sample input image, the reconstructed mesh model, the estimated motion, and a texture-mapped model for one frame with interesting structure/motion for each of datasets 1, 2, and 3. The right two columns show the results for another interesting frame. Copyright © 2009, IEEE
Finally, after surface deformation, the residuals of the data and tangential-rigidity terms are used to filter out erroneous motion estimates. Concretely, these values are first smoothed, and a smoothed local motion estimate is deemed an outlier if at least one of the two residuals exceeds a given threshold. These three steps are iterated a couple of times to complete the tracking in each frame, the local motion estimation step being applied only to vertices whose parameters have not already been estimated or filtered out.
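A small sketch of the kind of edge-length rigidity penalty described above, with the dead-zone threshold that keeps the penalty at zero for small deviations; the exact functional form and symbols are assumptions for illustration, not the formula from Furukawa and Ponce (2009).

```python
import numpy as np

def tangential_rigidity(vertices, edges, target_lengths, eta):
    """Sum of squared edge-length deviations, ignored while below the threshold eta (a sketch)."""
    penalty = 0.0
    for (i, j), l_target in zip(edges, target_lengths):
        l_actual = np.linalg.norm(vertices[i] - vertices[j])
        deviation = max(0.0, abs(l_actual - l_target) - eta)  # dead zone for small deviations
        penalty += deviation ** 2
    return penalty

# eta would be set to 0.2 times the average edge length of the first-frame mesh.
```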
The face-capture framework proposed by Bradley et al. (2010) operates without the use of markers and consists of three main components: acquisition, multiview reconstruction, and geometry and texture tracking. The acquisition stage uses 14 high-definition video cameras arranged in seven binocular stereo pairs. At the multiview reconstruction stage, each pair captures a highly detailed small patch of the face surface under bright ambient light. This stage uses an iterative binocular stereo method to reconstruct seven surface patches independently, which are merged into a single high-resolution mesh; the stereo algorithm is guided by face details, producing meshes of roughly one million polygons. First, depth maps are created from pairs of adjacent rectified viewpoints. Observing that the difference in projection between the views causes distortions of the comparison windows, the most prominent distortions of this kind are compensated by a scaled-window matching technique. The resulting depth images are converted to 3D points and fused into a single dense point cloud. Then, a triangular mesh is reconstructed from the initial point cloud through three steps: the original point cloud is downsampled using hierarchical vertex clustering (Schaefer and Warren, 2003); outliers and small-scale high-frequency noise are removed on the basis of the plane-fit criterion proposed by Weyrich et al. (2004) and a point-normal filtering inspired by Amenta and Kil (2004), respectively; and a triangle mesh is generated without introducing excessive smoothing, using lower-dimensional triangulation methods (Gopi et al., 2000).
In the last stage, in order to consistently track geometry and texture over time, a single reference mesh from the sequence is chosen and a sequence of compatible meshes without holes is explicitly computed. Given the initial per-frame reconstructions Gt, a set of compatible meshes Mt is generated that has the same connectivity as well as explicit vertex correspondence. To create high-quality renderings, per-frame texture maps Tt that capture appearance changes, such as wrinkles and sweating of the face, are required. Starting with a single reference mesh M0, generated by manually cleaning up the first frame G0, dense optical flow on the video images is computed and used in combination with the initial geometric reconstructions Gt to automatically propagate M0 through time. At each time step, a high-quality 2D face texture Tt is computed from the video images. Drift caused by inevitable optical-flow error is detected in the per-frame texture maps and corrected in the geometry. The mapping is also guided by an edge-based mouth-tracking process to account for the high-speed motion of the mouth while talking.
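A highly simplified sketch of propagating a reference mesh through time with per-frame optical flow: each vertex is projected into the previous frame, displaced by the flow vector at that pixel, and re-projected onto the new frame's reconstructed depth. The camera model, flow array layout, and depth lookup are assumptions for illustration only, not the authors' pipeline.

```python
import numpy as np

def propagate_vertices(vertices_prev, flow, depth_new, K):
    """Move mesh vertices to the next frame using 2D optical flow and the new depth map (a sketch).

    vertices_prev: (N, 3) vertex positions in camera coordinates at frame t-1.
    flow:          (H, W, 2) optical flow from frame t-1 to frame t (pixels).
    depth_new:     (H, W) depth map reconstructed at frame t.
    K:             (3, 3) camera intrinsics.
    """
    propagated = []
    for v in vertices_prev:
        p = K @ v
        u, w = int(p[0] / p[2]), int(p[1] / p[2])        # pixel of the vertex at t-1
        du, dv = flow[w, u]                              # flow displacement at that pixel
        u2, w2 = int(round(u + du)), int(round(w + dv))  # corresponding pixel at t
        z = depth_new[w2, u2]                            # depth of the surface at t
        ray = np.linalg.inv(K) @ np.array([u2, w2, 1.0])
        propagated.append(ray / ray[2] * z)              # back-project to 3D at the new depth
    return np.array(propagated)
```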
