Emerging Technologies for 3D Video

Frederic Dufaux

Description

With the expectation of greatly enhanced user experience, 3D video is widely perceived as the next major advancement in video technology. To fulfil this expectation, 3D video calls for new technologies addressing efficient content creation, representation/coding, transmission and display.

Emerging Technologies for 3D Video will deal with all aspects involved in 3D video systems and services, including content acquisition and creation, data representation and coding, transmission, view synthesis, rendering, display technologies, human perception of depth and quality assessment.

Key features:

  • Offers an overview of key existing technologies for 3D video
  • Provides a discussion of advanced research topics and future technologies
  • Reviews relevant standardization efforts
  • Addresses applications and implementation issues
  • Includes contributions from leading researchers

The book is a comprehensive guide to 3D video systems and services suitable for all those involved in this field, including engineers, practitioners, researchers as well as professors, graduate and undergraduate students, and managers making technological decisions about 3D video.




Contents

Cover

Title Page

Copyright

Preface

List of Contributors

Acknowledgements

Part One: Content Creation

Chapter 1: Consumer Depth Cameras and Applications

1.1 Introduction

1.2 Time-of-Flight Depth Camera

1.3 Structured Light Depth Camera

1.4 Specular and Transparent Depth

1.5 Depth Camera Applications

References

Chapter 2: SFTI: Space-from-Time Imaging

2.1 Introduction

2.2 Background and Related Work

2.3 Sampled Response of One Source–Sensor Pair

2.4 Diffuse Imaging: SFTI for Estimating Scene Reflectance

2.5 Compressive Depth Acquisition: SFTI for Estimating Scene Structure

2.6 Discussion and Future Work

Acknowledgments

References

Chapter 3: 2D-to-3D Video Conversion: Overview and Perspectives

3.1 Introduction

3.2 The 2D-to-3D Conversion Problem

3.3 Definition of Depth Structure of the Scene

3.4 Generation of the Second Video Stream

3.5 Quality of Experience of 2D-to-3D Conversion

3.6 Conclusions

References

Chapter 4: Spatial Plasticity: Dual-Camera Configurations and Variable Interaxial

4.1 Stereoscopic Capture

4.2 Dual-Camera Arrangements in the 1950s

4.3 Classic “Beam-Splitter” Technology

4.4 The Dual-Camera Form Factor and Camera Mobility

4.5 Reduced 3D Form Factor of the Digital CCD Sensor

4.6 Handheld Shooting with Variable Interaxial

4.7 Single-Body Camera Solutions for Stereoscopic Cinematography

4.8 A Modular 3D Rig

4.9 Human Factors of Variable Interaxial

References

Part Two: Representation, Coding and Transmission

Chapter 5: Disparity Estimation Techniques

5.1 Introduction

5.2 Geometrical Models for Stereoscopic Imaging

5.3 Stereo Matching Process

5.4 Overview of Disparity Estimation Methods

5.5 Conclusion

References

Chapter 6: 3D Video Representation and Formats

6.1 Introduction

6.2 Three-Dimensional Video Representation

6.3 Three-Dimensional Video Formats

6.4 Perspectives

Acknowledgments

References

Chapter 7: Depth Video Coding Technologies

7.1 Introduction

7.2 Depth Map Analysis and Characteristics

7.3 Depth Map Coding Tools

7.4 Application Example: Depth Map Coding Using “Don't Care” Regions

7.5 Concluding Remarks

Acknowledgments

References

Chapter 8: Depth-Based 3D Video Formats and Coding Technology

8.1 Introduction

8.2 Depth Representation and Rendering

8.3 Coding Architectures

8.4 Compression Technology

8.5 Experimental Evaluation

8.6 Concluding Remarks

References

Chapter 9: Coding for Interactive Navigation in High-Dimensional Media Data

9.1 Introduction

9.2 Challenges and Approaches of Interactive Media Streaming

9.3 Example Solutions

9.4 Interactive Multiview Video Streaming

9.5 Conclusion

References

Chapter 10: Adaptive Streaming of Multiview Video Over P2P Networks

10.1 Introduction

10.2 P2P Overlay Networks

10.3 Monocular Video Streaming Over P2P Networks

10.4 Stereoscopic Video Streaming over P2P Networks

10.5 MVV Streaming over P2P Networks

References

Part Three: Rendering and Synthesis

Chapter 11: Image Domain Warping for Stereoscopic 3D Applications

11.1 Introduction

11.2 Background

11.3 Image Domain Warping

11.4 Stereo Mapping

11.5 Warp-Based Disparity Mapping

11.6 Automatic Stereo to Multiview Conversion

11.7 IDW for User-Driven 2D–3D Conversion

11.8 Multi-Perspective Stereoscopy from Light Fields

11.9 Conclusions and Outlook

Acknowledgments

References

Chapter 12: Image-Based Rendering and the Sampling of the Plenoptic Function

12.1 Introduction

12.2 Parameterization of the Plenoptic Function

12.3 Uniform Sampling in a Fourier Framework

12.4 Adaptive Plenoptic Sampling

12.5 Summary

References

Chapter 13: A Framework for Image-Based Stereoscopic View Synthesis from Asynchronous Multiview Data

13.1 The Virtual Video Camera

13.2 Estimating Dense Image Correspondences

13.3 High-Quality Correspondence Edit

13.4 Extending to the Third Dimension

References

Part Four: Display Technologies

Chapter 14: Signal Processing for 3D Displays

14.1 Introduction

14.2 3D Content Generation

14.3 Dealing with 3D Display Hardware

14.4 Conclusions

Acknowledgments

References

Chapter 15: 3D Display Technologies

15.1 Introduction

15.2 Three-Dimensional Display Technologies in Cinemas

15.3 Large 3D Display Technologies in the Home

15.4 Mobile 3D Display Technologies

15.5 Long-Term Perspectives

15.6 Conclusion

References

Chapter 16: Integral Imaging

16.1 Introduction

16.2 Integral Photography

16.3 Real-Time System

16.4 Properties of the Reconstructed Image

16.5 Research and Development Trends

16.6 Conclusion

References

Chapter 17: 3D Light-Field Display Technologies

17.1 Introduction

17.2 Fundamentals of 3D Displaying

17.3 The HoloVizio Light-Field Display System

17.4 HoloVizio Displays and Applications

17.5 The Perfect 3D Display

17.6 Conclusions

References

Part Five: Human Visual System and Quality Assessment

Chapter 18: 3D Media and the Human Visual System

18.1 Overview

18.2 Natural Viewing and S3D Viewing

18.3 Perceiving 3D Structure

18.4 ‘Technical’ Issues in S3D Viewing

18.5 Fundamental Issues in S3D Viewing

18.6 Motion Artefacts from Field-Sequential Stereoscopic Presentation

18.7 Viewing Stereoscopic Images from the ‘Wrong’ Place

18.8 Fixating and Focusing on Stereoscopic Images

18.9 Concluding Remarks

Acknowledgments

References

Chapter 19: 3D Video Quality Assessment

19.1 Introduction

19.2 Stereoscopic Artifacts

19.3 Subjective Quality Assessment

19.4 Objective Quality Assessment

References

Part Six: Applications and Implementation

Chapter 20: Interactive Omnidirectional Indoor Tour

20.1 Introduction

20.2 Related Work

20.3 System Overview

20.4 Acquisition and Preprocessing

20.5 SfM Using the Ladybug Camera

20.6 Loop and Junction Detection

20.7 Interactive Alignment to Floor Plan

20.8 Visualization and Navigation

20.9 Vertical Rectification

20.10 Experiments

20.11 Conclusions

Acknowledgments

References

Chapter 21: View Selection

21.1 Introduction

21.2 Content Analysis

21.3 Content Ranking

21.4 View Selection

21.5 Comparative Summary and Outlook

References

Chapter 22: 3D Video on Mobile Devices

22.1 Mobile Ecosystem, Architecture, and Requirements

22.2 Stereoscopic Applications on Mobile Devices

22.3 Stereoscopic Capture on Mobile Devices

22.4 Display Rendering on Mobile Devices

22.5 Depth and Disparity

22.6 Conclusions

Acknowledgments

References

Chapter 23: Graphics Composition for Multiview Displays

23.1 An Interactive Composition System for 3D Displays

23.2 Multimedia for Multiview Displays

23.3 GPU Graphics Synthesis for Multiview Displays

23.4 DIBR Graphics Synthesis for Multiview Displays

23.5 Conclusion

Acknowledgments

References

Chapter 24: Real-Time Disparity Estimation Engine for High-Definition 3DTV Applications

24.1 Introduction

24.2 Review of Disparity Estimation Algorithms and Implementations

24.3 Proposed Hardware-Efficient Algorithm

24.4 Proposed Architecture

24.5 Experimental Results

24.6 Conclusion

References

Index

This edition first published 2013

© 2013 John Wiley & Sons, Ltd.

Registered office

John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data

Emerging technologies for 3D video : creation, coding, transmission, and rendering / Frederic Dufaux, Beatrice Pesquet-Popescu, Marco Cagnazzo.

pages cm

Includes bibliographical references and index.

ISBN 978-1-118-35511-4 (cloth)

1. 3-D video–Standards. 2. Digital video–Standards. I. Dufaux, Frederic, 1967- editor of compilation. II. Pesquet-Popescu, Beatrice, editor of compilation. III. Cagnazzo, Marco, editor of compilation. IV. Title: Emerging technologies for three dimensional video.

TK6680.8.A15E44 2013

006.6′96–dc23

2012047740

A catalogue record for this book is available from the British Library.

ISBN: 9781118355114

Preface

The underlying principles of stereopsis have been known for a long time. Stereoscopes for viewing photographs in 3D appeared and became popular in the nineteenth century. The first demonstrations of 3D movies took place in the first half of the twentieth century, initially using anaglyph glasses, and then with polarization-based projection. Hollywood experienced a first short-lived golden era of 3D movies in the 1950s. In the last 10 years, 3D has regained significant interest and 3D movies are becoming ubiquitous. Numerous major productions are now released in 3D, culminating with Avatar, the highest grossing film of all time.

In parallel with the recent growth of 3D movies, 3DTV is attracting significant interest from manufacturers and service providers, as is evident from the multiplication of new 3D product announcements and services. Beyond entertainment, 3D imaging technology is also seen as instrumental in other application areas such as video games, immersive video conferences, medicine, video surveillance, and engineering.

With this growing interest, 3D video is often considered one of the major upcoming innovations in video technology, with the expectation of greatly enhanced user experience.

This book intends to provide an overview of key technologies for 3D video applications. More specifically, it covers the state of the art and explores new research directions, with the objective of tackling all aspects involved in 3D video systems and services. Topics addressed include content acquisition and creation, data representation and coding, transmission, view synthesis, rendering, display technologies, human perception of depth, and quality assessment. Relevant standardization efforts are reviewed. Finally, applications and implementation issues are also described.

More specifically, the book is composed of six parts. Part One addresses different aspects of 3D content acquisition and creation. In Chapter 1, Lee presents depth cameras and related applications. The principle of active depth sensing is reviewed, along with depth image processing methods such as noise modelling, upsampling, and removing motion blur. In Chapter 2, Kirmani, Colaço, and Goyal introduce the space-from-time imaging framework, which achieves spatial resolution, in two and three dimensions, by measuring temporal variations of light intensity in response to temporally or spatiotemporally varying illumination. Chapter 3, by Vazquez, Zhang, Speranza, Plath, and Knorr, provides an overview of the process of generating a stereoscopic video (S3D) from a monoscopic video source (2D), generally known as 2D-to-3D video conversion, with a focus on selected recent techniques. Finally, in Chapter 4, Zone1 provides an overview of numerous contemporary strategies for shooting with narrow and variable interaxial baselines for stereoscopic cinematography. Artistic implications are also discussed.

Part Two addresses a key issue in 3D video: data representation, compression, and transmission. In Chapter 5, Kaaniche, Gaetano, Cagnazzo, and Pesquet-Popescu address the problem of disparity estimation. The geometrical relationship between the 3D scene and the generated stereo images is analyzed and the most important techniques for disparity estimation are reviewed. Cagnazzo, Pesquet-Popescu, and Dufaux give an overview of existing data representation and coding formats for 3D video content in Chapter 6. In turn, in Chapter 7, Mora, Valenzise, Jung, Pesquet-Popescu, Cagnazzo, and Dufaux consider the problem of depth map coding and present an overview of different coding tools. In Chapter 8, Vetro and Müller provide an overview of the current status of research and standardization activity towards defining a new set of depth-based formats that facilitate the generation of intermediate views with a compact binary representation. In Chapter 9, Cheung and Cheung consider interactive media streaming, where the server continuously and reactively sends appropriate subsets of media data in response to a client's periodic requests. Different associated coding strategies and solutions are reviewed. Finally, Gürler and Tekalp propose an adaptive P2P video streaming solution for streaming multiview video over P2P overlays in Chapter 10.

Next, Part Three of the book discusses view synthesis and rendering. In Chapter 11, Wang, Lang, Stefanoski, Sorkine-Hornung, Sorkine-Hornung, Smolic, and Gross present image-domain warping as an alternative to depth-image-based rendering techniques. This technique utilizes simpler, image-based deformations as a means for realizing various stereoscopic post-processing operators. Gilliam, Brookes, and Dragotti, in Chapter 12, examine the state of the art in plenoptic sampling theory. In particular, the chapter presents theoretical results for uniform sampling based on spectral analysis of the plenoptic function and algorithms for adaptive plenoptic sampling. Finally, in Chapter 13, Klose, Lipski, and Magnor present a complete end-to-end framework for stereoscopic free viewpoint video creation, allowing one to viewpoint-navigate through space and time of complex real-world, dynamic scenes.

As a very important component of a 3D video system, Part Four focuses on 3D display technologies. In Chapter 14, Konrad addresses digital signal processing methods for 3D data generation, both stereoscopic and multiview, and for compensation of the deficiencies of today's 3D displays. Numerous experimental results are presented to demonstrate the usefulness of such methods. Borel and Doyen, in Chapter 15, present in detail the main 3D display technologies available for cinemas, for large-display TV sets, and for mobile terminals. A perspective of evolution for the near and long term is also proposed. In Chapter 16, Arai focuses on integral imaging, a 3D photography technique that is based on integral photography, in which information on 3D space is acquired and represented. This chapter describes the technology for displaying 3D space as a spatial image by integral imaging. Finally, in Chapter 17, Kovács and Balogh present light-field displays, an advanced technique for implementing glasses-free 3D displays.

In most targeted applications, humans are the end-users of 3D video systems. Part Five considers human perception of depth and perceptual quality assessment. More specifically, in Chapter 18, Watt and MacKenzie focus on how the human visual system interacts with stereoscopic 3D media, in view of optimizing effectiveness and viewing comfort. Three main issues are addressed: incorrect spatiotemporal stimuli introduced by field-sequential stereo presentation, inappropriate binocular viewing geometry, and the unnatural relationship between where the eyes fixate and focus in stereoscopic 3D viewing. In turn, in Chapter 19, Hanhart, De Simone, Rerabek, and Ebrahimi consider mechanisms of 3D vision in humans, and their underlying perceptual models, in conjunction with the types of distortions that today's and tomorrow's 3D video processing systems produce. This complex puzzle is examined with a focus on how to measure 3D visual quality, as an essential factor in the success of 3D technologies, products, and services.

In order to complete the book, Part Six describes target applications for 3D video, as well as implementation issues. In Chapter 20, Bazin, Saurer, Fraundorfer, and Pollefeys present a semi-automatic method to generate interactive virtual tours from omnidirectional video. It allows a user to virtually navigate through buildings and indoor scenes. Such a system can be applied in various contexts, such as virtual tourism, tele-immersion, tele-presence, and e-heritage. Daniyal and Cavallaro address the question of how to automatically identify which view is more useful when observing a dynamic scene with multiple cameras in Chapter 21. This problem concerns several applications ranging from video production to video surveillance. In particular, an overview of existing approaches for view selection and automated video production is presented. In Chapter 22, Bourge and Bellon present the hardware architecture of a typical mobile platform, and describe major stereoscopic 3D applications. Indeed, smartphones bring new opportunities to stereoscopic 3D, but also specific constraints. Chapter 23, by Le Feuvre and Mathieu, presents an integrated system for displaying interactive applications on multiview screens. Both a simple GPU-based prototype and a low-cost hardware design implemented on a field-programmable gate array are presented. Finally, in Chapter 24, Tseng and Chang propose an optimized disparity estimation algorithm for high-definition 3DTV applications with reduced computational and memory requirements.

By covering general and advanced topics, providing at the same time a broad and deep analysis, the book has the ambition to become a reference for those involved or interested in 3D video systems and services. Assuming fundamental knowledge in image/video processing, as well as a basic understanding in mathematics, this book should be of interest to a broad readership with different backgrounds and expectations, including professors, graduate and undergraduate students, researchers, engineers, practitioners, and managers making technological decisions about 3D video.

Frédéric Dufaux, Béatrice Pesquet-Popescu, Marco Cagnazzo

Note

1. It is with great sadness that we learned that Ray Zone passed away on November 13, 2012.

List of Contributors

Jun Arai, NHK (Japan Broadcasting Corporation), Japan

Tibor Balogh, Holografika, Hungary

Jean-Charles Bazin, Computer Vision and Geometry Group, ETH Zürich, Switzerland

Alain Bellon, STMicroelectronics, France

Thierry Borel, Technicolor, France

Arnaud Bourge, STMicroelectronics, France

Mike Brookes, Department of Electrical and Electronic Engineering, Imperial College London, UK

Marco Cagnazzo, Département Traitement du Signal et des Images, Télécom ParisTech, France

Andrea Cavallaro, Queen Mary University of London, UK

Tian-Sheuan Chang, Department of Electronics Engineering, National Chiao Tung University, Taiwan

Gene Cheung, Digital Content and Media Sciences Research Division, National Institute of Informatics, Japan

Ngai-Man Cheung, Information Systems Technology and Design Pillar, Singapore University of Technology and Design, Singapore

Andrea Colaço, Media Lab, Massachusetts Institute of Technology, USA

Fahad Daniyal, Queen Mary University of London, UK

Francesca De Simone, Multimedia Signal Processing Group (MMSPG), Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland

Didier Doyen, Technicolor, France

Pier Luigi Dragotti, Department of Electrical and Electronic Engineering, Imperial College London, UK

Frédéric Dufaux, Département Traitement du Signal et des Images, Télécom ParisTech, France

Touradj Ebrahimi, Multimedia Signal Processing Group (MMSPG), Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland

Friedrich Fraundorfer, Computer Vision and Geometry Group, ETH Zürich, Switzerland

Raffaele Gaetano, Département Traitement du Signal et des Images, Télécom ParisTech, France

Christopher Gilliam, Department of Electrical and Electronic Engineering, Imperial College London, UK

Vivek K. Goyal, Research Laboratory of Electronics, Massachusetts Institute of Technology, USA

Markus Gross, Disney Research Zurich, Switzerland

C. Göktuğ Gürler, College of Engineering, Koç University, Turkey

Philippe Hanhart, Multimedia Signal Processing Group (MMSPG), Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland

Alexander Sorkine-Hornung, Disney Research Zurich, Switzerland

Joël Jung, Orange Labs, France

Mounir Kaaniche, Département Traitement du Signal et des Images, Télécom ParisTech, France

Ahmed Kirmani, Research Laboratory of Electronics, Massachusetts Institute of Technology, USA

Felix Klose, Institut für Computergraphik, TU Braunschweig, Germany

Sebastian Knorr, imcube labs GmbH, Technische Universität Berlin, Germany

Janusz Konrad, Department of Electrical and Computer Engineering, Boston University, USA

Péter Tamás Kovács, Holografika, Hungary

Manuel Lang, Disney Research Zurich, Switzerland

Seungkyu Lee, Samsung Advanced Institute of Technology, South Korea

Jean Le Feuvre, Département Traitement du Signal et des Images, Telecom ParisTech, France

Christian Lipski, Institut für Computergraphik, TU Braunschweig, Germany

Kevin J. MacKenzie, Wolfson Centre for Cognitive Neuroscience, School of Psychology, Bangor University, UK

Marcus Magnor, Institut für Computergraphik, TU Braunschweig, Germany

Yves Mathieu, Telecom ParisTech, France

Elie Gabriel Mora, Orange Labs, France; Département Traitement du Signal et des Images, Télécom ParisTech, France

Karsten Müller, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, Germany

Béatrice Pesquet-Popescu, Département Traitement du Signal et des Images, Télécom ParisTech, France

Nils Plath, imcube labs GmbH, Technische Universität Berlin, Germany

Marc Pollefeys, Computer Vision and Geometry Group, ETH Zürich, Switzerland

Martin Rerabek, Multimedia Signal Processing Group (MMSPG), Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland

Olivier Saurer, Computer Vision and Geometry Group, ETH Zürich, Switzerland

Aljoscha Smolic, Disney Research Zurich, Switzerland

Olga Sorkine-Hornung, ETH Zurich, Switzerland

Filippo Speranza, Communications Research Centre Canada (CRC), Canada

Nikolce Stefanoski, Disney Research Zurich, Switzerland

A. Murat Tekalp, College of Engineering, Koç University, Turkey

Yu-Cheng Tseng, Department of Electronics Engineering, National Chiao Tung University, Taiwan

Giuseppe Valenzise, Département Traitement du Signal et des Images, Télécom ParisTech, France

Carlos Vazquez, Communications Research Centre Canada (CRC), Canada

Anthony Vetro, Mitsubishi Electric Research Labs (MERL), USA

Simon J. Watt, Wolfson Centre for Cognitive Neuroscience, School of Psychology, Bangor University, UK

Oliver Wang, Disney Research Zurich, Switzerland

Liang Zhang, Communications Research Centre Canada (CRC), Canada

Ray Zone, The 3-D Zone, USA

Acknowledgements

We would like to express our deepest appreciation to all the authors for their invaluable contributions. Without their commitment and efforts, this book would not have been possible.

Moreover, we would like to gratefully acknowledge the John Wiley & Sons Ltd. staff, Alex King, Liz Wingett, Richard Davies, and Genna Manaog, for their relentless support throughout this endeavour.

Frédéric Dufaux, Béatrice Pesquet-Popescu, Marco Cagnazzo

Part One

Content Creation

1

Consumer Depth Cameras and Applications

Seungkyu Lee

Samsung Advanced Institute of Technology, South Korea

1.1 Introduction

Color imaging technology has advanced to increase spatial resolution and color quality. However, its sensing principle limits the acquisition of three-dimensional (3D) geometry and photometry information. Many computer vision and robotics researchers have tried to reconstruct 3D scenes from a set of two-dimensional (2D) color images. They have built calibrated color camera networks and employed visual hull detection from silhouettes or key point matching to figure out the 3D relation between the set of 2D color images. These techniques, however, assume that objects seen from different views have identical color and intensity. This photometric consistency assumption is valid only for Lambertian surfaces, which are not common in the real world. As a result, 3D geometry capture using color cameras shows limited performance, and only under restricted lighting conditions and object types.

3D sensing technologies such as digital holography, interferometry, and integral photography have been studied. However, they show limited performance in 3D geometry and photometry acquisition. Recently, several consumer depth-sensing cameras using near-infrared light have been introduced in the market. They have relatively low spatial resolution compared with color sensors and show limited sensing range and accuracy. Thanks to their affordable prices and the advantage of direct 3D geometry acquisition, many researchers from graphics, computer vision, image processing, and robotics have employed this new modality of data for many applications. In this chapter, we introduce two major depth-sensing principles using active IR signals, together with state-of-the-art applications.

1.2 Time-of-Flight Depth Camera

In active light sensing technology, if we can measure the flight time of a fixed-wavelength signal emitted from a sensor and reflected from an object surface, we can calculate the distance of the object from the sensor based on the speed of light. This is the principle of a time-of-flight (ToF) sensor (Figure 1.1). However, it is not simple to measure the flight time directly at each pixel of any existing image sensor. Instead, if we can measure the phase delay of the reflected signal compared with the original emitted signal, we can calculate the distance indirectly. Recent ToF depth cameras in the market measure the phase delay of the emitted infrared (IR) signal at each pixel and calculate the distance from the camera.

Figure 1.1 Time-of-flight depth sensor

1.2.1 Principle

In this section, the principle of ToF depth sensing is explained in more detail with simplified examples. Let us assume that we use a sinusoidal IR wave as an active light source. In general, consumer depth cameras use multiple light-emitting diodes (LEDs) to generate a fixed-wavelength IR signal. What we can observe using an existing image sensor is the amount of electrons induced by collected photons during a certain time duration. For color sensors, it is enough to count the amount of induced electrons to capture the luminance or chrominance of the expected bandwidth. However, a single shot of photon collection is not enough for phase delay measurement. Instead, we collect photons multiple times at different time locations, as illustrated in Figure 1.2.

Figure 1.2 Phase delay measurement

Q1 through Q4 in Figure 1.2 are the amounts of electrons measured at each corresponding time. The reflected IR shows a phase delay proportional to the distance from the camera. Since we have the reference emitted IR and its phase information, the electron amounts at multiple time locations (Q1 through Q4 have 90° phase differences from each other) can tell us the amount of delay as follows:

(1.1)

where α is the amplitude of the IR signal and ϕ1 through ϕ4 are the normalized amounts of electrons.
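For reference, a standard four-phase formulation (notation assumed here; the indexing in the chapter's equations (1.1) and (1.2) may differ) relates the phase delay and the IR amplitude to the charges measured at control phases 0°, 90°, 180°, and 270°:

\varphi \;=\; \arctan\!\left( \frac{q_{90^{\circ}} - q_{270^{\circ}}}{q_{0^{\circ}} - q_{180^{\circ}}} \right),
\qquad
\alpha \;=\; \tfrac{1}{2}\sqrt{ \left( q_{0^{\circ}} - q_{180^{\circ}} \right)^{2} + \left( q_{90^{\circ}} - q_{270^{\circ}} \right)^{2} }.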

In real sensing situations, a perfect sine wave is not possible to produce using the cheap LEDs of consumer depth cameras. Any distortion of the sine wave causes miscalculation of the phase delay. Furthermore, the amount of electrons induced by the reflected IR signal at a certain moment is very noisy owing to the limited LED power. In order to increase the signal-to-noise ratio, sensors therefore collect electrons from multiple cycles of the reflected IR signal over a dedicated integration time.

For a better understanding of the principle, let us assume that the emitted IR is a square wave instead of sinusoidal and that we have four switches at each sensor pixel to collect Q1 through Q4. Each pixel of the depth sensor consists of several transistors and capacitors to collect the generated electrons. The four switches alternate their on and off states with 90° phase differences relative to the emitted reference IR signal, as illustrated in Figure 1.3. When a switch is turned on and the reflected IR goes high, electrons are charged as indicated by the shaded regions.

Figure 1.3 Four-phase depth sensing

In order to increase the signal-to-noise ratio, we repeatedly charge electrons through multiple cycles of the IR signal to measure Q1 through Q4 during a fixed integration time for a single frame of depth image acquisition. Once Q1 through Q4 are measured, the distance can be calculated as follows:

(1.2)

where c is the speed of light (3 × 10⁸ m/s) and t(d) is the flight time. Note that q1 through q4 are normalized electric charges and α is the amplitude of the reflected IR, which does not affect the distance calculation. In other words, depth can be calculated correctly regardless of the IR amplitude. The emitted IR should be modulated with a high enough frequency to estimate the flight time.
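As a concrete illustration, the following sketch (in Python, using the standard four-phase formulation given above rather than the chapter's exact equations) computes the per-pixel phase delay, reflected amplitude, and distance from the four charge images; the modulation frequency f_mod is an assumed parameter:

import numpy as np

C = 3.0e8  # speed of light in m/s

def tof_depth(q0, q90, q180, q270, f_mod=20e6):
    """Per-pixel distance from four-phase charge images.

    q0..q270: arrays of (normalized) charges at control phases 0/90/180/270 deg.
    f_mod:    IR modulation frequency in Hz (assumed example value).
    Assumes the standard correlation model; real sensors add calibration steps.
    """
    # Phase delay of the reflected IR, wrapped into [0, 2*pi)
    phi = np.arctan2(q90 - q270, q0 - q180) % (2.0 * np.pi)
    # Amplitude of the reflected IR (often used as a confidence measure)
    amp = 0.5 * np.sqrt((q0 - q180) ** 2 + (q90 - q270) ** 2)
    # Radial distance; the unambiguous range is c / (2 * f_mod)
    dist = C * phi / (4.0 * np.pi * f_mod)
    return dist, amp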

As indicated in Figure 1.4, what we have calculated is the distance R from the camera to an object surface along the reflected IR signal. This is not necessarily the distance along the z-direction of the 3D sensor coordinate. Based on the location of each pixel and field of view information, Z in Figure 1.4 can be calculated from R to obtain an undistorted 3D geometry. Most consumer depth cameras give calculated Z distance instead of R for user convenience.

Figure 1.4 Relation between R and Z
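A minimal sketch of this R-to-Z conversion, assuming a simple pinhole model with example intrinsics (real cameras ship with their own calibration data), is:

import numpy as np

def radial_to_z(R, fx, fy, cx, cy):
    """Convert per-pixel radial distance R (along each pixel's viewing ray)
    into Z along the optical axis, given assumed pinhole intrinsics."""
    h, w = R.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Each pixel's ray direction is ((u-cx)/fx, (v-cy)/fy, 1); R = Z * its norm
    ray_norm = np.sqrt(((u - cx) / fx) ** 2 + ((v - cy) / fy) ** 2 + 1.0)
    return R / ray_norm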

1.2.2 Quality of the Measured Distance

Even though the ToF principle allows distance imaging within the sensing range determined by the IR modulation frequency, the quality of the measured depth suffers from various systematic and nonsystematic sensor noises (Huhle et al., 2008; Edeler et al., 2010; Foix et al., 2011; Matyunin et al., 2011). Owing to the limited power of the IR light, the reflected IR arriving at each image sensor pixel induces only a limited number of electrons for the depth calculation. In principle, ToF depth sensing calculates the correct depth regardless of the power of the IR light and the amplitude of the reflected IR. However, a lower absolute number of electrons suffers more from electronic noise such as shot noise. To resolve this problem, we increase the integration time to collect a sufficient number of electrons for a more accurate depth calculation; however, this limits the frame rate of the sensor. Increasing the modulation frequency also increases sensor accuracy for an identical integration time, because it allows more cycles of the modulated IR wave for a single depth frame; however, this also limits the maximum sensing range of the depth sensor. The left image in Figure 1.5 shows a 3D point cloud collected by a ToF depth sensor. The viewpoint is shifted to the right of the camera, showing regions occluded by the foreground chairs. Note that the 3D data obtained from a depth sensor are not complete volumetric data: only the 3D locations of the 2D surface seen from the camera's viewpoint are given. The right images in Figure 1.5 are depth and IR intensity images.

Figure 1.5 Measured depth and IR intensity images

In active light sensors, the signal-to-noise ratio of the incoming light is still relatively low compared with passive light sensors such as color cameras, owing to the limited IR emission power. To increase the signal-to-noise ratio further, depth sensors merge multiple neighboring sensor pixels to measure a single depth value, which decreases the depth image resolution. This is called pixel binning. Most consumer depth cameras perform pixel binning and sacrifice image resolution to guarantee a certain depth accuracy. Therefore, depending on their applications, many researchers perform depth image super-resolution (Schuon et al., 2008; Park et al., 2011; Yeo et al., 2011) before using the raw depth images, as illustrated in Figure 1.6.

Figure 1.6 Depth image 2D super-resolution

The left image in Figure 1.6 is a raw depth image that is upsampled on the right. Simple bilinear interpolation is used in this example. Most interpolation methods, however, consider the depth image as a 2D image and increase the depth image resolution in the 2D domain. On the other hand, if the depth is going to be used for 3D reconstruction, upsampling only in two axes is not enough. The left image in Figure 1.7 shows an example of 2D depth image super-resolution. The upper part of the chair is almost fronto-parallel to the depth camera and shows a dense super-resolution result. The lower part of the chair, in contrast, is nearly perpendicular to the depth camera and shows many local holes, because the point gaps between depth pixels in the z-direction are not sufficiently filled. The right image in Figure 1.7 is an example of 3D super-resolution, in which the number of points to be inserted is adaptively decided by the maximum distance between depth pixels.
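A minimal sketch of the 2D case described above (plain bilinear upsampling of the depth map, which deliberately ignores the z-direction gaps discussed here) could look as follows:

import numpy as np
from scipy.ndimage import zoom

def upsample_depth_2d(depth, factor=4):
    """Bilinear (order=1) 2D upsampling of a depth map.
    The depth map is treated as an ordinary image, so gaps along the
    z-direction on slanted surfaces are not filled (the limitation noted above)."""
    return zoom(depth.astype(np.float32), factor, order=1)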

Figure 1.7 Depth image 2D versus 3D super-resolution

Figure 1.7 is an example of 3D super-resolution where the original depth point cloud is upsampled around 25 times and the IR intensity value is projected onto each point. A sensor pixel observing the boundary region of the foreground chair gives a depth between the foreground and the background. Once the super-resolution is performed, each such miscalculated depth point produces additional boundary noise points, as can be seen in the right image of Figure 1.7, where odd patterns can be observed around the foreground chair. Figure 1.8 shows this artifact more clearly.

Figure 1.8 Depth point cloud 3D super-resolution

Figure 1.8 shows an upsampled depth point cloud where the aligned color value is projected onto each point. The left image in Figure 1.8 shows many depth points between the foreground chair and the background. The colors projected onto these points come from either the foreground or the background of the aligned 2D color image. This is a serious artifact, especially for 3D reconstruction applications, where free viewpoint navigation exposes this noise even more clearly, as shown in Figure 1.8.

The right image in Figure 1.9 is an example of boundary noise point elimination. Depth points that lie away from both the foreground and the background point clouds can be eliminated by outlier removal methods.
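One simple way to implement such outlier elimination (a plausible heuristic with assumed thresholds, not the chapter's specific method) is to flag "flying" points whose depth differs strongly from most of their spatial neighbors:

import numpy as np

def remove_flying_pixels(depth, window=3, ratio_thresh=0.5, delta=0.1):
    """Suppress boundary noise points whose depth differs by more than delta
    (here in meters, an assumed threshold) from more than ratio_thresh of the
    neighbors in a window x window patch. Flagged pixels are set to 0 (invalid)."""
    h, w = depth.shape
    r = window // 2
    out = depth.copy()
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = depth[y - r:y + r + 1, x - r:x + r + 1]
            far = np.abs(patch - depth[y, x]) > delta
            if far.mean() > ratio_thresh:
                out[y, x] = 0.0  # likely a flying pixel between foreground and background
    return out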

Figure 1.9 Boundary noise elimination

When we capture a moving object in real time, another artifact, motion blur, appears in depth images (Lindner and Kolb, 2009; Hussmann et al., 2011; Lee et al., 2012). Motion blur is a long-standing issue for imaging devices because it leads to wrong information about, and a wrong understanding of, real-world objects. For distance sensors in particular, motion blur causes distortions in the reconstructed 3D geometry or entirely wrong distance information. Various techniques have been developed to mitigate motion blur in conventional sensors. Current technology, however, either requires a very short integration time to avoid motion blur or adopts computationally expensive post-processing methods to improve the blurred images.

Unlike passive photometric imaging devices, which simply accumulate the photons collected at each sensor pixel, active geometry imaging devices such as ToF sensors examine the relation between the separately charged photon counts to determine the phase difference of an emitted light source of fixed wavelength, as explained earlier in this chapter. These sensors use the flight time of the emitted and reflected light to calculate the distance. The phase difference of the reflected light in these principles represents the difference in distance from the camera. The distance image, as a result, is an integration of the phase variation over a pixel grid. When there is any movement of an object or of the camera itself, a phase shift will be observed at the corresponding location, reshaping the infrared wavefront. A phase shift observed by a sensor pixel means that multiple reflected IR waves with different phases are captured within the integration time, which yields a wrong distance value. As a result, a phase shift within a photon integration time produces motion blur, which is not desirable for robust distance sensing. The motion blur region is the result of a series of phase shifts within the photon integration time. Reducing the integration time is not always a preferable solution because it reduces the number of photons collected to calculate a distance, decreasing the signal-to-noise ratio. On the other hand, post-processing after the distance calculation shows limited performance and is time consuming.

When we produce a distance image that includes motion blur, we can detect the phase shift within the integration time by investigating the relation between the amounts of photons separately collected by the multiple control signals. Figure 1.10 shows what happens if there is a phase shift within the integration time in a four-phase ToF sensing mechanism. By definition, the four electric charges Q1–Q4 are averages of the total accumulated electric charges over multiple "on" phases. Without any phase shift, the distance is calculated by Equation 1.2 from the phase difference between the emitted and reflected IR waves. When there is a single phase shift within the integration time, the distance is instead calculated by the following equation:

Figure 1.10 Depth motion blur

(1.3)

where α1 and α2 are the amplitudes of the two IR signals and q1X and q2X are the normalized electron values before and after the phase shift. Unlike in the original equation, the reflected IR amplitudes α1 and α2 cannot be eliminated and therefore affect the distance calculation.

Figure 1.10 shows what happens in a depth image with a single phase shift. During the integration time, the original (indicated in black) and phase-shifted (indicated in grey) reflected IR come in sequentially and are averaged to calculate a single depth value. Motion blur around moving objects will be observed, showing quite different characteristics from that of conventional color images, as shown in Figure 1.11. Note that the motion blur regions (indicated by dashed ellipses) have nearer or farther depth values than both the foreground and background neighboring depth values. In general, with multiple or continuous phase shifts, the miscalculated distance is as follows:

(1.4)

Figure 1.11 Depth motion blur examples

Each control signal has a fixed phase delay from the others, which imposes a dedicated relation on the collected electric charges. A four-phase ToF sensor uses a 90° phase delay between control signals, giving the following relations: the charges collected by complementary control signals (those 180° apart) sum to the same total, so that each complementary pair adds up to Qsum, the total amount of electric charge delivered by the reflected IR. In principle, every regular pixel has to satisfy these conditions if there is no significant noise (Figure 1.12). With an appropriate sensor noise model and thresholds, these relations can be used to check whether a pixel has regular status. A phase shift causes a very significant distance error exceeding the common sensor noise level and is effectively detected by testing whether either of the relations is violated.
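A sketch of such a detector (one plausible implementation of the test described above, with an assumed relative noise threshold; the chapter's own detector may differ in detail) is:

import numpy as np

def detect_phase_shift(q0, q90, q180, q270, tau=0.05):
    """Flag pixels where the complementary-charge relation is violated.

    For a static pixel, the 0/180 deg pair and the 90/270 deg pair should each
    deliver the same total charge Qsum (up to sensor noise). A phase shift
    during the integration time (motion blur) breaks this balance.
    tau is an assumed noise threshold, relative to the total charge.
    """
    sum_a = q0 + q180
    sum_b = q90 + q270
    qsum = 0.5 * (sum_a + sum_b)
    # Relative imbalance between the two complementary pairs
    imbalance = np.abs(sum_a - sum_b) / np.maximum(qsum, 1e-9)
    return imbalance > tau  # True where motion blur is likely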

Figure 1.12 Boundary noise elimination

There are several other noise sources (Foix et al., 2011). The emitted IR signal amplitude attenuates while traveling, in proportion to the reciprocal of the square of the distance. Even though this attenuation should not affect the depth calculation of a ToF sensor in principle, the decrease in signal-to-noise ratio degrades the repeatability of the depth calculation. The assumption of uniform emitted IR over the target object also causes spatial distortion of the calculated depth: each sensor pixel will collect reflected IR of a different amplitude even when it is reflected from surfaces at identical distances. Daylight interference is another critical issue for the practicality of the sensor, since daylight contains energy at whatever IR frequency the sensor emits and therefore acts as noise with respect to the correct depth calculation. Scattering (Mure-Dubois and Hugli, 2007) within the lens and sensor architecture is a further major problem for depth sensors owing to their low signal-to-noise ratio.

1.3 Structured Light Depth Camera

Kinect, a well-known consumer depth camera, is a structured IR light depth sensor, based on a well-established 3D geometry acquisition technology. It is composed of an IR emitter, an IR sensor, and a color sensor, providing an IR amplitude image, a depth map, and a color image. Basically, this technology builds on conventional image sensor technology with relatively higher resolution. Owing to the sensing range limits of the structured light principle, the operating range of this depth sensor is around 1–4 m.

1.3.1 Principle

In this type of sensor, a predetermined IR pattern is emitted onto the target objects (Figure 1.13). The pattern can be a rectangular grid or a set of random dots. A calibrated IR sensor reads the pattern reflected from the object surfaces. Compared with the original IR pattern, the IR sensor observes a distorted pattern owing to the geometric variation of the target objects. This provides the pixel-wise correspondence between the IR emitter and the IR sensor. Triangulation of each point between the projected and observed IR patterns enables calculation of the distance from the sensor.

Figure 1.13 Structured IR light depth camera
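For a rectified emitter–sensor pair this triangulation reduces to a disparity computation; a minimal sketch with assumed example values for the focal length and baseline (not the calibration of any particular device) is:

def structured_light_depth(disparity_px, focal_px=580.0, baseline_m=0.075):
    """Depth from the horizontal shift (disparity, in pixels) between a
    projected pattern element and its observed position on a rectified
    emitter-sensor pair. focal_px and baseline_m are assumed example values."""
    if disparity_px <= 0:
        return float('nan')  # no valid correspondence
    return focal_px * baseline_m / disparity_px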

1.4 Specular and Transparent Depth

Figure 1.14 illustrates how a light ray arrives at a Lambertian surface and travels back to the sensor. Lambertian objects distribute incoming light equally in all directions. Hence, sensors can receive reflected light regardless of their position and orientation. Most existing 3D sensing technologies assume Lambertian objects as their targets. Both the ToF sensor on the left in Figure 1.14 and the structured light sensor on the right can see the emitted light source.

Figure 1.14 Lambertian surface

However, most real-world objects have non-Lambertian surfaces, including transparency and specularity. Figure 1.15 shows what happens when active light depth sensors see a specular surface and why specular objects are challenging to handle. Unlike Lambertian materials, specular objects reflect the incoming light into a limited range of directions. Let us assume that the convex surfaces in Figure 1.15 are mirrors, which reflect incoming light rays toward a very narrow range of directions. If the sensor is not located in the exact direction of the ray reflected from the mirror surface, the sensor does not receive any reflected IR and it is impossible to calculate any depth information. On the other hand, if the sensor is located exactly in the mirror reflection direction, it will receive an excessive amount of concentrated IR light, causing saturation in the measurement. Consequently, the sensors fail to receive the reflected light in a sensible range. Such a phenomenon results in missing measurements for both types of sensors.

Figure 1.15 Depth of a specular surface

Figure 1.16 shows samples of specular objects taken by a ToF depth sensor. The first object is a mirror whose flat area is an entirely specular surface. The second object shows specularity in a small region. In the first case the sensor is not in the mirror reflection direction and no saturation is observed; however, the depth within the mirror region is not correct. In the second case the sensor is in the mirror reflection direction and saturation is observed in the intensity image, which leads to a wrong depth calculation.

Figure 1.16 Specular object examples of ToF sensor

Figure 1.17 shows samples of specular objects taken by a structured light depth sensor. The mirror region of the first case shows the depth of the reflected surface. The second case also shows a specular region and leads to miscalculation of depth.

Figure 1.17 Specular object examples of structured IR sensor

In Figure 1.18 we demonstrate how a transparent object affects the sensor measurement. For a transparent object in front of a background, depth sensors receive reflected light from both the foreground and the background (Figure 1.19). The mixture of reflected light from the foreground and background misleads the depth measurement. Depending on the sensor type, however, the characteristics of the errors vary. Since a ToF sensor computes the depth of a transparent object from the mixture of IR reflected by the foreground and the background, the depth measurement includes a bias toward the background. For the structured light sensor, the active light patterns are used to establish the correspondences between the sensor and the projector; with transparent objects, the measurement errors cause mismatches in the correspondences and yield data loss.

Figure 1.18 Depth of transparent object

Figure 1.19 Transparent object examples

In general, a multipath problem (Fuchs, 2010) similar to the transparent case occurs with concave objects, as illustrated in Figure 1.20. In the left image in Figure 1.20, two different IR paths with different flight times from the IR LEDs to the sensor can arrive at the same sensor pixel. The ray reflected twice on the concave surface is a spurious IR signal and disturbs the correct depth calculation of the point, which should be based only on the ray reflected once. In principle, a structured light sensor suffers from a similar problem: the sensor can observe an unexpected overlapped pattern caused by multiple light paths when the original pattern is emitted onto a concave surface, as shown on the right in Figure 1.20.

Figure 1.20 Multipath problem

1.5 Depth Camera Applications

The recent introduction of cheap depth cameras affects many computer vision and graphics applications. Direct 3D geometry acquisition makes 3D reconstruction easier, producing a denser 3D point cloud than multiple color cameras combined with a stereo algorithm. Thus, we do not have to rely on sparse feature matching under the photometric consistency (Lambertian surface) assumption, which is fragile for the reflective or transparent materials of real-world objects.

1.5.1 Interaction

One of the most attractive applications of early depth cameras is body motion tracking, gesture recognition, and interaction (Shotton et al., 2011). Interactive games or human–machine interfaces can employ a depth camera, supported by computer vision algorithms, as a motion sensor. In this case, high pixel-wise accuracy and depth precision are not expected; instead, a faster frame rate and higher spatial resolution are more important for interactive applications such as games controlled by body motion. The system can be trained on the given input depth data, including its noise. Furthermore, these are mostly indoor applications that are free from the range ambiguity (Choi et al., 2010; Droeschel et al., 2010) and daylight interference problems of depth cameras.

1.5.2 Three-Dimensional Reconstruction

Another important application of depth cameras is 3D imaging (Henry et al., 2010). This application places stricter requirements on the depth image: not only a higher frame rate and spatial resolution, but every pixel also has to capture a highly accurate depth for precise 3D scene acquisition (Figure 1.21). Furthermore, we have to consider outdoor usage, where a much longer sensing range has to be covered under daylight conditions; the range ambiguity therefore matters in this application. Depth motion blur also has to be removed before 3D scene reconstruction in order to avoid 3D distortion around edges. A reconstructed 3D scene can generate as many view images as required for future 3D TV, such as glasses-free multiview displays or integral imaging displays (Dolson et al., 2008; Shim et al., 2012). Digital holography displays accept the 3D model and the corresponding texture. Mixed reality combining the virtual and real worlds is another example of a 3D reconstruction application (Ryden et al., 2010; Newcombe et al., 2011).

Figure 1.21 A 3D reconstruction example from multiple cameras

References

Choi, O., H. Lim, B. Kang, et al. (2010) Range unfolding for time-of-flight depth cameras, in 2010 17th IEEE International Conference on Image Processing (ICIP 2010), IEEE, pp. 4189–4192.

Dolson, J., J. Baek, C. Plagemann, and S. F Thrun (2008) Fusion of time-of-flight depth and stereo for high accuracy depth maps, in IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, IEEE.

Droeschel, D., D. Holz, and S. Behnke (2010) Multi-frequency phase unwrapping for time-of-flight cameras, in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 1463–1469.

Edeler, T., K. Ohliger, S. Hussmann, and A. Mertins (2010) Time-of-flight depth image denoising using prior noise information, in 2010 IEEE 10th International Conference on Signal Processing (ICSP), IEEE, pp. 119–122.

Foix, S., G. Alenya, and C. Torras (2011) Lock-in time-of-flight (ToF) cameras: a survey. IEEE Sens. J., 11 (9): 1917–1926.

Fuchs, S. (2010) Multipath interference compensation in time-of-flight camera images, in 2010 20th International Conference on Pattern Recognition (ICPR), IEEE, pp. 3583–3586.

Henry, P., M. Krainin, E. Herbst et al. (2010) RGB-D mapping: using depth cameras for dense 3D modeling of indoor environments. RGB-D: Advanced Reasoning with Depth Cameras Workshop in conjunction with RSS.

Huhle, B., T. Schairer, P. Jenke, and W. Strasser (2008) Robust non-local denoising of colored depth data, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2008. CVPR Workshops 2008, IEEE, pp. 1–7.

Hussmann, S., A. Hermanski, and T. Edeler (2011) Real-time motion artifact suppression in ToF camera systems. IEEE Trans. Instrum. Meas., 60 (5): 1682–1690.

Lee, S., B. Kang, J. Kim, and C. Kim (2012) Motion blur-free time-of-flight range sensor, in Sensors, Cameras, and Systems for Industrial and Scientific Applications XIII, (eds R. Widenhorn, V. Nguyen, and A. Dupret), Proceedings of the SPIE, Vol. 8298, SPIE, Bellingham, WA.

Lindner, M. and A. Kolb (2009) Compensation of motion artifacts for time-of-flight cameras, in Dynamic 3D Imaging (eds A. Kolb and R. Koch), Lecture Notes in Computer Science, Vol. 5742, Springer, pp. 16–27.

Matyunin, S., D. Vatolin, Y. Berdnikov, and M. Smirnov (2011) Temporal filtering for depth maps generated by Kinect depth camera, in 3DTV Conference: The True Vision – Capture, Transmission and Display of 3D Video (3DTV-CON), 2011, IEEE.

Mure-Dubois, J. and H. Hugli (2007) Real-time scattering compensation for time-of-flight camera, in Proceedings of the ICVS Workshop on Camera Calibration Methods for Computer Vision Systems – CCMVS2007, Applied Computer Science Group, Bielefeld University, Germany.

Newcombe, R. A., S. Izadi, O. Hilliges et al. (2011) KinectFusion: real-time dense surface mapping and tracking, in Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, IEEE Computer Society, Washington, DC.

Park, J., H. Kim, Y.-W. Tai et al. (2011) High quality depth map upsampling for 3D-ToF cameras, in 2011 IEEE International Conference on Computer Vision (ICCV), IEEE.

Ryden, F., H. Chizeck, S. N. Kosari et al. (2010) Using Kinect and a haptic interface for implementation of real-time virtual mixtures. RGB-D: Advanced Reasoning with Depth Cameras Workshop in conjunction with RSS.

Schuon, S., C. Theobalt, J. Davis, and S. Thrun (2008) High-quality scanning using time-of-flight depth superresolution, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2008. CVPR Workshops 2008, IEEE.

Shim, H., R. Adels, J. Kim et al. (2012) Time-of-flight sensor and color camera calibration for multi-view acquisition. Vis. Comput., 28 (12), 1139–1151.

Shotton, J., A. Fitzgibbon, M. Cook, and A. Blake (2011) Real-time human pose recognition in parts from single depth images, in Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Washington, DC.

Yeo, D., E. ul Haq, J. Kim et al. (2011) Adaptive bilateral filtering for noise removal in depth upsampling, in 2010 International SoC Design Conference (ISOCC), IEEE, pp. 36–39.

2

SFTI: Space-from-Time Imaging

Ahmed Kirmani, Andrea Colaço, and Vivek K. Goyal

Massachusetts Institute of Technology, USA

2.1 Introduction

For centuries, the primary technical meaning of image has been a visual representation or counterpart, formed through the interaction of light with mirrors and lenses, and recorded through a photochemical process. In digital photography, the photochemical process has been replaced by a sensor array, but the use of optical elements is unchanged. Thus, the spatial resolution in this traditional imaging is limited by the quality of the optics and the number of sensors in the array; any finer resolution comes from multiple frames or modeling that will not apply to a generic scene (Milanfar, 2011). A dual configuration is also possible in which the sensing is omnidirectional and the light source is directed, with optical focusing (Sen et al., 2005). The spatial resolution is then limited by the illumination optics, specifically the spot size of the illumination.

Our space-from-time imaging (SFTI) framework provides methods to achieve spatial resolution, in two and three dimensions, rooted in the temporal variations of light intensity in response to temporally- or spatiotemporally-varying illumination. SFTI is based on the recognition that parameters of interest in a scene, such as bidirectional reflectance distribution functions at various wavelengths and distances from the imaging device, are embedded in the impulse response (or transfer function) from a light source to a light sensor. Thus, depending on the temporal resolution, any plurality of source–sensor pairs can be used to generate an image; the spatial resolution can be finer than both the spot size of the illumination and the number of sensors.

The use of temporal resolution in SFTI is a radical departure from traditional imaging, in which time is associated only with fixing the period over which light must be collected to achieve the desired contrast. Very short exposures or flash illuminations are used to effectively “stop time” (Edgerton and Killian, 1939), but these methods could still be called atemporal because even a microsecond integration time is enough to combine light from a large range of transport paths involving various reflections. No temporal variations are present at this time scale (nor does one attempt to capture them), so no interesting inferences can be drawn.
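A rough calculation (ours, not the chapter's) makes the scale concrete. During a one-microsecond integration time, light travels

\[
  c\,\Delta t \;\approx\; (3\times 10^{8}\ \mathrm{m/s}) \times (10^{-6}\ \mathrm{s}) \;=\; 300\ \mathrm{m},
\]

so even such a short exposure sums reflections whose optical path lengths differ by far more than the extent of a typical scene, erasing the temporal structure that SFTI exploits.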

SFTI introduces inverse problems and is thus a collection of computational imaging problems and methods. Since light transfer is linear, the inverse problems are linear in the intensity or reflectance parameters; however, propagation delays and radial falloff of light intensity cause the inverse problems to be nonlinear in geometric parameters. Formulating specific inverse problem parameterizations that have numerically stable solutions is at the heart of SFTI. This chapter summarizes two representative examples: one for reflectance estimation first reported by Kirmani et al. (2011b, 2012) and the other for depth acquisition first reported by Kirmani et al. (2011a). We first review basic terminology and related work and then demonstrate how parameters of a static scene are embedded into the impulse response of the scene.
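To see this distinction concretely, extend the single-point sketch above to a collection of Lambertian scene points (again in our own notation): up to constants, the noiseless time-resolved measurement is

\[
  y(t) \;=\; \sum_i a_i\, k_i\, \delta(t - \tau_i),
  \qquad
  k_i = \frac{1}{r_{1,i}^2\, r_{2,i}^2},
  \qquad
  \tau_i = \frac{r_{1,i} + r_{2,i}}{c},
\]

which is linear in the reflectances $\{a_i\}$ but depends on the scene geometry nonlinearly, through the falloff factors $k_i$ and the propagation delays $\tau_i$.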

2.2 Background and Related Work

2.2.1 Light Fields, Reflectance Distribution Functions, and Optical Image Formation

Michael Faraday was the first to propose that light should be interpreted as a field, much like a magnetic field. The phrase light field was coined by Alexander Gershun (Gershun, 1939) to describe the amount of light traveling in every direction, through every point in space, at any wavelength and any time. It is now synonymous with the plenoptic function, which is a function of three spatial dimensions, two angular dimensions, wavelength, and time. The time parameter is usually ignored since all measurements are made in steady state. Moreover, light travels in straight lines in any constant-index medium. The macroscopic behavior at a surface is described by the bidirectional reflectance distribution function (BRDF) (Nicodemus, 1965). This function of wavelength, time, and four geometric dimensions takes an incoming light direction and an outgoing (viewing) direction, each defined with respect to the surface normal, and returns the scalar ratio of the reflected radiance exiting along the viewing direction to the irradiance arriving along the incoming ray's direction. The BRDF determines how a scene will be perceived from a particular viewpoint under a particular illumination. Thus, the scene geometry, BRDF, and scene illumination determine the plenoptic function.
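In the standard radiometric notation (which we adopt here for concreteness; the chapter's own symbols may differ), the BRDF at wavelength $\lambda$ is

\[
  f_r(\theta_i, \phi_i; \theta_r, \phi_r; \lambda)
  \;=\;
  \frac{\mathrm{d}L_r(\theta_r, \phi_r; \lambda)}{\mathrm{d}E_i(\theta_i, \phi_i; \lambda)},
\]

the ratio of the differential reflected radiance $L_r$ along the viewing direction $(\theta_r, \phi_r)$ to the differential irradiance $E_i$ arriving from the direction $(\theta_i, \phi_i)$, both directions measured relative to the surface normal.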

To explain the light field and BRDF concepts, we consider image formation in flatland as shown in Figure 2.1a. A broadband light source (shown in light gray) transmits pulses of light radially, uniformly in all directions, with a given pulse intensity modulation function. The scene plane is completely characterized by the locations of its end points and the BRDF of its surface. Assuming that our scene plane is opaque, the BRDF is a scalar function of geometric and material parameters. To explain the BRDF, consider a unit-intensity light ray incident on a scene point at some incidence angle. The scene point scatters the different wavelengths composing the light ray differently in all outgoing directions. The intensity of attenuated light observed at a particular wavelength and at a particular observation angle is equal to the corresponding BRDF value. The light from the illumination source reaches different points on the scene plane at different incidence angles as well as at different time instances because of the propagation delay of light. The light field is then the function describing the intensity of the light rays originating from the scene points under a given illumination setting.

Figure 2.1 Light fields and reflectance distribution functions. (a) The image formation process in flatland. (b) The Lambertian assumption for surface reflectance distribution
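To make the flatland setup of Figure 2.1a concrete, the following sketch (ours; the falloff model, parameter values, and function names are illustrative assumptions, and cosine factors are omitted) computes the time-resolved signal that an unfocused point sensor would record from a Lambertian segment under impulsive omnidirectional illumination:

import numpy as np

C = 3.0e8  # speed of light in m/s

def flatland_response(src, det, endpoints, reflectance, n_pts=2000,
                      t_max=40e-9, n_bins=400):
    # Sample points along the scene segment.
    p0 = np.asarray(endpoints[0], dtype=float)
    p1 = np.asarray(endpoints[1], dtype=float)
    u = np.linspace(0.0, 1.0, n_pts)
    pts = p0[None, :] + u[:, None] * (p1 - p0)[None, :]
    refl = reflectance(u)                       # reflectance profile along the segment
    r1 = np.linalg.norm(pts - np.asarray(src, dtype=float), axis=1)  # source -> point
    r2 = np.linalg.norm(pts - np.asarray(det, dtype=float), axis=1)  # point -> sensor
    delays = (r1 + r2) / C                      # propagation delay of each contribution
    amps = refl / (r1 * r2)                     # crude radial falloff; no cosine terms
    bins = np.linspace(0.0, t_max, n_bins + 1)
    h, _ = np.histogram(delays, bins=bins, weights=amps)
    return 0.5 * (bins[:-1] + bins[1:]), h      # bin centers and binned response

# Example: a 2 m wide segment 1.5 m from the source, with reflectance
# increasing linearly from one end point to the other.
t, h = flatland_response(src=(0.0, 0.0), det=(0.2, 0.0),
                         endpoints=((-1.0, 1.5), (1.0, 1.5)),
                         reflectance=lambda u: 0.2 + 0.8 * u)

Scene points with equal source-to-point-to-sensor path length fall into the same time bin, so the binned signal mixes their reflectances; this grouping by total path length is precisely what the later sections undo.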

In traditional optical imaging, light reflected from the scene plane is focused on a sensor pixel using focusing optics such as lenses. This focusing is equivalent to integration of the light field along the angular dimension. Furthermore, a gray-scale CMOS sensor forms an image of the scene by integrating along wavelength and time. Thus, traditional image formation loses a great deal of valuable scene information contained in the light field. For example, it is impossible to recover scene geometry from a single such image.

Generalizing to a three-dimensional scene, an ordinary digital camera captures a two-dimensional (2D) color image by marginalizing the angular and time dimensions of the plenoptic function (integrating over the lens aperture and over an exposure time) and sampling the wavelength. This projection discards the angular and time dimensions and trades spatial resolution for sampling multiple wavelengths; for example, with the Bayer pattern. Recent work in the area of computational photography has extensively explored the angular sampling of light fields and its applications (Ng et al., 2005; Georgeiv et al., 2006), but the possibilities arising from temporal sampling of light fields remain largely unexplored.
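Written out schematically, and in notation of our own choosing, the color image formed by such a camera is

\[
  I_k(x, y) \;=\; \int_0^{T}\!\!\int_{\Omega}\!\!\int c_k(\lambda)\,
  L(x, y, \theta, \phi, \lambda, t)\; \mathrm{d}\lambda\, \mathrm{d}\omega\, \mathrm{d}t,
\]

where $T$ is the exposure time, $\Omega$ is the cone of directions gathered by the lens for the pixel at $(x, y)$, $L$ is the plenoptic function, and $c_k(\lambda)$ is the spectral sensitivity of color channel $k$ (for instance, one of the Bayer R, G, or B filters).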

In SFTI, our goals are to reconstruct both scene geometry and reflectance from a single viewpoint, with limited use of focusing optics, from time samples of the incident light field. A practical and widely accepted simplifying assumption about the scene BRDF, called the Lambertian assumption, is shown in Figure 2.1b. According to this assumption, the BRDF is independent of the viewing angle and we observe the same intensity of light from all directions at the same radial distance. For simplicity and clarity, we will employ the Lambertian model; incorporating some other known BRDF is generally not difficult.
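In the normalization common in graphics and vision (again, our notation), the Lambertian assumption reduces the BRDF to a constant,

\[
  f_r(\theta_i, \phi_i; \theta_r, \phi_r; \lambda) \;=\; \frac{\rho(\lambda)}{\pi},
  \qquad 0 \le \rho(\lambda) \le 1,
\]

so the observed radiance is the same from every viewing direction, and the surface's appearance depends on the illumination only through the albedo $\rho$ and the cosine (foreshortening) factor of the incident irradiance.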

2.2.2 Time-of-Flight Methods for Estimating Scene Structure

The speed of light enables distances (longitudinal resolution) to be inferred from the time difference between a transmitted pulse and the arrival of a reflection from a scene or, similarly, from the phase offset between transmitted and received signals when these are periodic. This is a well-established technology for ranging, which we refer to as depth map acquisition when spatial (transverse) resolution is also acquired. A light detection and ranging (LIDAR) or laser radar system uses raster scanning of directed illumination of the scene to obtain spatial resolution (Schwarz, 2010). A time-of-flight (TOF) camera obtains spatial resolution with an array of sensors (Gokturk et al., 2004; Foix et al., 2011). These TOF-based techniques have better range resolution and robustness to noise than using stereo disparity (Forsyth and Ponce, 2002; Seitz et al., 2006; Hussmann et al., 2008) or other computer vision techniques – including structured-light scanning, depth-from-focus, depth-from-shape, and depth-from-motion (Forsyth and Ponce, 2002; Scharstein and Szeliski, 2002; Stoykova et al., 2007). While companies such as Canesta, MESA Imaging, 3DV, and PMD offer commercial TOF cameras, these systems are expensive and have low spatial resolution when compared with standard 2D imaging cameras.
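For reference, the standard range relations underlying such systems (not specific to this chapter) are

\[
  d \;=\; \frac{c\,\tau}{2}
  \quad\text{(pulsed TOF, round-trip delay $\tau$)},
  \qquad
  d \;=\; \frac{c\,\Delta\varphi}{4\pi f_{\mathrm{mod}}}
  \quad\text{(phase-based TOF)},
\]

where $f_{\mathrm{mod}}$ is the modulation frequency of the periodic illumination and $\Delta\varphi$ is the measured phase offset; the phase-based scheme is unambiguous only up to the range $c/(2 f_{\mathrm{mod}})$.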

All previous techniques using TOF have addressed only the estimation of structure (e.g. three-dimensional (3D) geometry of scenes and shapes of biological samples). SFTI provides methods for estimation of both reflectance and structure. Furthermore, unlike previous methods – including notably the method of Kirmani et al. (2009) using indirect illumination and sensing – SFTI is based on intensity variation as a function of time rather than collections of distance measurements.

2.2.3 Synthetic Aperture Radar for Estimating Scene Reflectance

The central contribution of SFTI is to use temporal information, in conjunction with appropriate post-measurement signal processing, to form images whose spatial resolution exceeds the spatial resolution of the illumination and light collection optics. In this connection it is germane to compare SFTI with synthetic aperture radar (SAR), which is a well-known microwave approach for using time-domain information plus post-measurement signal processing to form images with high spatial resolution (Kovaly, 1976; Munson et al., 1985; Cutrona, 1990). In stripmap mode, an airborne radar transmits a sequence of high-bandwidth pulses at a fixed slant angle toward the ground. Pulse-compression reception of individual pulses provides across-track spatial resolution superior to that of the radar's antenna pattern as the range response of the compressed pulse sweeps across the ground plane. Coherent integration over many pulses provides along-track spatial resolution by forming a synthetic aperture whose diffraction limit is much smaller than that of the radar's antenna pattern.
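The textbook first-order resolution figures for stripmap SAR (quoted here for comparison, not taken from this chapter) are

\[
  \delta_{\mathrm{range}} \;\approx\; \frac{c}{2B},
  \qquad
  \delta_{\mathrm{azimuth}} \;\approx\; \frac{D}{2},
\]

where $B$ is the transmitted pulse bandwidth and $D$ is the along-track length of the physical antenna; notably, the along-track resolution of the synthesized aperture is independent of range.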

SAR differs from SFTI in two general ways. First, SAR requires the radar to be in motion, whereas SFTI does not require sensor motion. Second, SAR is primarily a microwave technique, and most real-world objects have specular BRDFs at microwave wavelengths. With specular reflections, an object is directly visible only when the angle of illumination and angle of observation satisfy the law of reflection, and multiple reflections – which are not accounted for in first-order SAR models – are strong. On the other hand, most objects are Lambertian at optical wavelengths, so optical SFTI avoids these sources of difficulty.

2.3 Sampled Response of One Source–Sensor Pair

Broadly stated, SFTI is any computational imaging in which spatial resolution is derived from time-resolved sensing of the response to time-varying illumination. The key observation that enables SFTI is that scene information of interest can be embedded in the impulse response (or transfer function) from a light source to a light sensor. Here, we develop an example of an impulse response model and sampling of this response to demonstrate the embedding that enables imaging. The following sections address the specific inverse problems of inferring scene reflectance and structure.

2.3.1 Scene, Illumination, and Sensor Abstractions

Consider a static 3D scene to be imaged in the scenario depicted in Figure 2.2. We assume that the scene is contained in a cube of finite side length. Further assume that the scene surfaces are all Lambertian, so that their perceived brightnesses are invariant to the angle of observation (Oren and Nayar, 1995); incorporation of any known BRDF would not add insight to our model. Under these assumptions, the scene at one wavelength or collection of wavelengths can be completely represented as a 3D function that assigns a radiometric response to every position in the cube: a value of zero implies that there is no scene point present at that position, and a positive value implies that there is a scene point present with that value as its reflectance. We assume that all surface points have nonzero reflectance to avoid ambiguity. To incorporate dependence on wavelength, the codomain of this function could be made multidimensional.
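In symbols (ours, introduced only to make the abstraction concrete), the scene can be written as a map

\[
  f : [0, L]^3 \to [0, \infty), \qquad
  f(\mathbf{x}) = 0 \;\Leftrightarrow\; \text{no scene surface at } \mathbf{x}, \qquad
  f(\mathbf{x}) > 0 \;\Rightarrow\; \text{reflectance } f(\mathbf{x}) \text{ at } \mathbf{x},
\]

for a cube of side length $L$; a vector-valued codomain would capture wavelength dependence.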

Figure 2.2 Space-from-time imaging setup. An impulsive illumination source transmits a spherical wavefront towards a scene plane with nonconstant reflectance. The light reflected from the plane is time sampled at a sensor, which has no focusing optics and, therefore, receives contributions from all the scene points. The scene impulse response at any time, shown as the gray waveform, is determined by the elliptical Radon transform (ERT) defined in (2.3). For a scene plane with constant reflectance, the scene impulse response (shown in black) is a parametric signal; it is a piecewise linear function as described in (2.14)

The single illumination source is monochromatic and omnidirectional with time-varying intensity denoted by