A unified view of the use of computer vision technology for different types of vehicles
Computer Vision in Vehicle Technology focuses on computer vision as an on-board technology, bringing together fields of research where computer vision is progressively penetrating: the automotive sector, unmanned aerial vehicles, and underwater vehicles. It also serves as a reference for researchers on current developments and challenges in applications of computer vision involving vehicles, such as advanced driver assistance (pedestrian detection, lane departure warning, traffic sign recognition), autonomous driving and robot navigation (with visual simultaneous localization and mapping), and unmanned aerial vehicles (obstacle avoidance, landscape classification and mapping, fire risk assessment).
The overall role of computer vision for the navigation of different vehicles, as well as technology to address on-board applications, is analysed.
This is essential reading for computer vision researchers, as well as engineers working in vehicle technologies, and students of computer vision.
Cover
Title Page
Copyright
List of Contributors
Preface
Abbreviations and Acronyms
Chapter 1: Computer Vision in Vehicles
1.1 Adaptive Computer Vision for Vehicles
1.2 Notation and Basic Definitions
1.3 Visual Tasks
1.4 Concluding Remarks
Acknowledgments
Chapter 2: Autonomous Driving
2.1 Introduction
2.2 Autonomous Driving in Cities
2.3 Challenges
2.4 Summary
Acknowledgments
Chapter 3: Computer Vision for MAVs
3.1 Introduction
3.2 System and Sensors
3.3 Ego-Motion Estimation
3.4 3D Mapping
3.5 Autonomous Navigation
3.6 Scene Interpretation
3.7 Concluding Remarks
Chapter 4: Exploring the Seafloor with Underwater Robots
4.1 Introduction
4.2 Challenges of Underwater Imaging
4.3 Online Computer Vision Techniques
4.4 Acoustic Imaging Techniques
4.5 Concluding Remarks
Acknowledgments
Chapter 5: Vision-Based Advanced Driver Assistance Systems
5.1 Introduction
5.2 Forward Assistance
5.3 Lateral Assistance
5.4 Inside Assistance
5.5 Conclusions and Future Challenges
Acknowledgments
Chapter 6: Application Challenges from a Bird's-Eye View
6.1 Introduction to Micro Aerial Vehicles (MAVs)
6.2 GPS-Denied Navigation
6.3 Applications and Challenges
6.4 Conclusions
Chapter 7: Application Challenges of Underwater Vision
7.1 Introduction
7.2 Offline Computer Vision Techniques for Underwater Mapping and Inspection
7.3 Acoustic Mapping Techniques
7.4 Concluding Remarks
Chapter 8: Closing Notes
References
Index
End User License Agreement
Chapter 1: Computer Vision in Vehicles
Figure 1.1 (a) Quadcopter. (b) Corners detected from a flying quadcopter using a modified FAST feature detector.
Figure 1.2 The 10 leading causes of death in the world. Chart provided online by the World Health Organization (WHO). Road injury ranked number 9 in 2011
Figure 1.3 Two screenshots for real-view navigation.
Figure 1.4 Examples of benchmark data available for a comparative analysis of computer vision algorithms for motion and distance calculations. (a) Image from a synthetic sequence provided on EISATS with accurate ground truth. (b) Image of a real-world sequence provided on KITTI with approximate ground truth
Figure 1.5 Laplacians of smoothed copies of the same image using cv::GaussianBlur and cv::Laplacian in OpenCV, with values 0.5, 1, 2, and 4 for the smoothing parameter. Linear scaling is used for better visibility of the resulting Laplacians.
Figure 1.6 (a) Image of a stereo pair (from a test sequence available on EISATS). (b) Visualization of a depth map using the color key shown at the top for assigning distances in meters to particular colors. A pixel is shown in gray if there was low confidence for the calculated disparity value at this pixel.
Figure 1.7 Resulting disparity maps for stereo data when using only one scanline for DPSM with the SGM smoothness constraint and an MCEN data-cost function. From top to bottom and left to right: left-to-right horizontal scanline, lower-left to upper-right diagonal scanline, top-to-bottom vertical scanline, and upper-left to lower-right diagonal scanline. Pink pixels are for low-confidence locations (here identified by inhomogeneous disparity locations).
Figure 1.8 Normalized cross-correlation results when applying the third-eye technology for stereo matchers iSGM and linBPM for four real-world trinocular sequences of Set 9 of EISATS.
Figure 1.9 (a) Reconstructed cloud of points. (b) Reconstructed surface based on a single run of the ego-vehicle.
Figure 1.10 Visualization of optical flow using the color key shown around the border of the image for assigning a direction to particular colors; the length of the flow vector is represented by saturation, where value “white” (i.e., undefined saturation) corresponds to “no motion.” (a) Calculated optical flow using the original Horn–Schunck algorithm. (b) Ground truth for the image shown in Figure 1.4a.
Figure 1.11 Face detection, eye detection, and face tracking results under challenging lighting conditions. Typical Haar-like features, as introduced in Viola and Jones (2001b), are shown in the upper right. The illustrated results for challenging lighting conditions require additional efforts.
Figure 1.12 Two examples for Set 7 of EISATS illustrated by preprocessed depth maps following the described method (Steps 1 and 2). Ground truth for segments is provided by Barth et al. (2010) and shown on top in both cases. Resulting segments using the described method are shown below in both cases.
Chapter 2: Autonomous Driving
Figure 2.1 The way people think of usage and design of Autonomous Cars has not changed much over the last 60 years: (a) the well-known advert from the 1950s, (b) a design study published in 2014
Figure 2.2 (a) CMU's first demonstrator vehicle Navlab 1. The van had five racks of computer hardware, including three Sun workstations, video hardware, a GPS receiver, and a Warp supercomputer. The vehicle achieved a top speed of 32 km/h in the late 1980s. (b) Mercedes-Benz's demonstrator vehicle VITA, built in cooperation with Dickmanns from the University of the Federal Armed Forces in Munich. Equipped with a bifocal vision system and a small transputer system with 10 processors, it was used for Autonomous Driving on highways around Stuttgart in the early 1990s, reaching speeds up to 100 km/h
Figure 2.3 (a) Junior, the Stanford Racing Team's entry and runner-up of the Urban Challenge 2007. (b) A Google car prototype presented in 2014 that features neither a steering wheel nor gas or brake pedals. Both cars base their environment perception on a high-end laser scanner
Figure 2.4 (a) Experimental car BRAiVE built by Broggi's team at the University of Parma. Equipped with only stereo cameras, this car drove 17 km along roads around Parma in 2013. (b) Mercedes S500 Intelligent Drive demonstrator named “Bertha.” In August 2013, it drove autonomously about 100 km from Mannheim to Pforzheim, following the historic route driven by Bertha Benz 125 years earlier. Close-to-market radar sensors and cameras were used for environment perception
Figure 2.5 The Bertha Benz Memorial Route from Mannheim to Pforzheim (103 km). The route comprises rural roads, urban areas (e.g., downtown Heidelberg), and small villages and contains a large variety of different traffic situations such as intersections with and without traffic lights, roundabouts, narrow passages with oncoming vehicles, pedestrian crossings, cars parked on the road, and so on
Figure 2.6 System overview of the Bertha Benz experimental vehicle
Figure 2.7 Landmarks that are successfully associated between the mapping image (a) and online image (b) are shown.
Figure 2.8 Given a precise map (shown later), the expected markings (blue), stop lines (red), and curbs (yellow) are projected onto the current image. Local correspondence analysis yields the residuals that are fed to a Kalman filter in order to estimate the vehicle's pose relative to the map.
Figure 2.9 Visual outline of a modern stereo processing pipeline. Dense disparity images are computed from sequences of stereo image pairs. Red pixels are measured close to the ego-vehicle, while green pixels are far away. From these data, the Stixel World is computed. This medium-level representation achieves a reduction of the input data from hundreds of thousands of single depth measurements to a few hundred Stixels only. Stixels are tracked over time in order to estimate the motion of other objects. The arrows show the motion vectors of the tracked objects, pointing 0.5 seconds in advance. This information is used to extract both static infrastructure and moving objects for subsequent processing tasks. The free space is shown in gray
Figure 2.10 A cyclist taking a left turn in front of our vehicle: (a) shows the result when using 6D-Vision point features and (b) shows the corresponding Stixel result
Figure 2.11 Results of the Stixel computation, the Kalman filter-based motion estimation, and the motion segmentation step. The left side shows the arrows on the base points of the Stixels denoting the estimated motion state. The right side shows the corresponding labeling result obtained by graph-cut optimization. Furthermore, the color scheme encodes the different motion classes (right headed, left headed, with us, and oncoming). Uncolored regions are classified as static background
Figure 2.12 ROIs overlaid on the gray-scale image. In the monocular case (upper row left), about 50,000 hypotheses have to be tested by a classifier, in the stereo case (upper row right) this number reduces to about 5000. If each Stixel is assumed to be the center of a vehicle at the distance given by the Stixel World (lower row left), only 500 ROIs have to be checked, as shown on the right
Figure 2.13 Intensity and depth images with corresponding gradient magnitude for pedestrian (top) and nonpedestrian (bottom) samples. Note the distinct features that are unique to each modality, for example, the high-contrast pedestrian texture due to clothing in the gray-level image compared to the rather uniform disparity in the same region. The additional exploitation of depth can reduce the false-positive rate significantly. In Enzweiler et al. (2010), an improvement by a factor of five was achieved
Figure 2.14 ROC curve illustrating the performance of a pedestrian classifier using intensity only (red) versus a classifier additionally exploiting depth (blue). The depth cue reduces the false-positive rate by a factor of five
Figure 2.15 Full-range (0–200m) vehicle detection and tracking example in an urban scenario. Green bars indicate the detector confidence level
Figure 2.16 Examples of hard to recognize traffic lights. Note that these examples do not even represent the worst visibility conditions
Figure 2.17 Two consecutive frames of a stereo image sequence (left). The disparity result obtained from a single image pair is shown in the second column from the right. It shows strong disparity errors due to the wiper blocking parts of one image. The result from temporal stereo is visually free of errors (right) (see Gehrig et al. (2014))
Figure 2.18 Scene labeling pipeline: input image (a), SGM stereo result (b), Stixel representation (d), and the scene labeling result (c)
Figure 2.19 Will the pedestrian cross? Head and body orientation of a pedestrian can be estimated from the onboard cameras of a moving vehicle; in the example shown, the body orientation indicates motion to the left while the head is oriented toward the camera
Chapter 3: Computer Vision for MAVs
Figure 3.1 A micro aerial vehicle (MAV) equipped with digital cameras for control and environment mapping. The depicted MAV has been developed within the SFLY project (see Scaramuzza et al. 2014)
Figure 3.2 The system diagram of the autonomous Pixhawk MAV using a stereo system and an optical flow camera as main sensors
Figure 3.3 The state estimation work flow for a loosely coupled visual-inertial fusion scheme
Figure 3.4 A depiction of the involved coordinate systems for the visual-inertial state estimation
Figure 3.5 Illustration of monocular pose estimation. The new camera pose is computed from 3D points triangulated from at least two subsequent images
Figure 3.6 Illustration of stereo pose estimation. At each time index, 3D points can be computed from the left and right images of the stereo pair. The new camera pose can be computed directly from the 3D points triangulated from the previous stereo pair
Figure 3.7 Concept of the optical flow sensor depicting the geometric relations used to compute metric optical flow
Figure 3.8 The PX4Flow sensor to compute MAV movements using the optical flow principle. It consists of a digital camera, gyroscopes, a range sensor, and an embedded processor for image processing
Figure 3.9 The different steps of a typical structure from motion (SfM) pipeline to compute 3D data from image data. The arrows on the right depict the additional sensor data provided from a MAV platform and highlight for which steps in the pipeline it can be used
Figure 3.10 A 3D map generated from image data of three individual MAVs using MAVMAP. (a) 3D point cloud including MAVs' trajectories (camera poses are shown in red). (b) Detailed view of a part of the 3D map from a viewpoint originally not observed from the MAVs
Figure 3.11 Environment represented as a 3D occupancy grid suitable for path planning and MAV navigation. Blue blocks are the occupied parts of the environment
Figure 3.12 Live view from a MAV with basic scene interpretation capabilities. The MAV detects faces and pre-trained objects (e.g., the exit sign) and marks them in the live view
Chapter 4: Exploring the Seafloor with Underwater Robots
Figure 4.1 (a) Example of backscattering due to the reflection of rays from the light source on particles in suspension, hindering the identification of the seafloor texture. (b) Image depicting the effects produced by light attenuation of the water, resulting in an evident loss of luminance in the regions farthest from the focus of the artificial lighting. (c) Example of an image acquired in shallow waters showing sunflickering patterns. (d) Image showing a generalized blurred appearance due to the small-angle forward-scattering phenomenon
Figure 4.2 Refracted sunlight creates illumination patterns on the seafloor, which vary in space and time following the dynamics of surface waves
Figure 4.3 Scheme of underwater image formation with natural light as main illumination source. The signal reaching the camera is composed of two main components: attenuated direct light coming from the observed object and water-scattered natural illumination along this propagation path. Attenuation is due to both scattering and absorption
Figure 4.4 Absorption and scattering coefficients of pure seawater. Absorption (solid line (a)) and scattering (dotted line (b)) coefficients for pure seawater, as determined and given by Smith and Baker (1981)
Figure 4.5 Image dehazing. Example of underwater image restoration in low to extreme low visibility conditions
Figure 4.6 Loop-closure detection. As the camera moves, there is an increasing uncertainty related to both the camera pose and the environment map. When the camera revisits a region of the scene observed at an earlier instant, and the visual observations from the two instants can be associated, the resulting information not only can be used to reduce the pose and map uncertainties at the current instant but also can be propagated to reduce the uncertainties at prior instants
Figure 4.7 BoW image representation. Images are represented by histograms of generalized visual features
Figure 4.8 Flowchart of OVV and image indexing. At fixed frame intervals, the vocabulary is updated with new visual features extracted from the most recent frames. The complete set of features in the vocabulary is then merged until convergence. The obtained vocabulary is used to index the latest images. Also, the previously indexed frames are re-indexed to reflect the changes in the vocabulary
Figure 4.9 Sample 2D FLS image of a chain in turbid waters
Figure 4.10 FLS operation. The sonar emits an acoustic wave spanning its beam width in the azimuth and elevation directions. Returned sound energy is sampled as a function of range and azimuth and can be interpreted as the mapping of 3D points onto the zero-elevation plane (shown in red)
Figure 4.11 Sonar projection geometry. A 3D point is mapped onto a point on the image plane along the arc defined by the elevation angle. Under an orthographic approximation, the point is mapped onto the zero-elevation plane, which is equivalent to considering that all scene points rest on that plane (in red)
Figure 4.12 Overall Fourier-based registration pipeline
Figure 4.13 Example of the denoising effect obtained by intensity averaging. (a) Single frame gathered with a DIDSON sonar (Sou 2015) operating at its lower frequency (1.1 MHz). (b) Fifty registered frames from the same sequence blended by averaging the overlapping intensities. See how the SNR increases and small details pop out.
Chapter 5: Vision-Based Advanced Driver Assistance Systems
Figure 5.1 Typical coverage of cameras. For the sake of clarity of the illustrations, the actual cone-shaped volumes that the sensors see are shown as triangles
Figure 5.2 Forward assistance
Figure 5.3 Traffic sign recognition
Figure 5.4 The main steps of pedestrian detection together with the main processes carried out in each module
Figure 5.5 Different approaches in Intelligent Headlamp Control (Lopez et al. (2008a)). On the top, traditional low beams that cover only a short distance ahead. In the middle, the beams are dynamically adjusted to avoid glaring the oncoming vehicle. On the bottom, the beams are optimized to maximize visibility while avoiding glare through the use of LED arrays
Figure 5.6 Enhanced night vision. Thanks to infrared sensors the system is capable of distinguishing hot objects (e.g., car engines, pedestrians) from the cold road or surrounding natural environment
Figure 5.7 Intelligent active suspension.
Figure 5.8 Lane Departure Warning (LDW) and Lane Keeping System (LKS)
Figure 5.9 Parking Assistance. Sensors' coverages are shown as 2D shapes to improve visualization
Figure 5.10 Drowsiness detection based on PERCLOS and an NIR camera
Figure 5.11 Summary of the relevance of several technologies in each ADAS, rated in increasing order of relevance as null, low, useful, and high
Chapter 6: Application Challenges from a Bird's-Eye View
Figure 6.1 A few examples of MAVs. From left to right: the senseFly eBee, the DJI Phantom, the hybrid XPlusOne, and the FESTO BioniCopter
Figure 6.2 (a) Autonomous MAV exploration of an unknown, indoor environment using RGB-D sensor (image courtesy of Shen et al. (2012)). (b) Autonomous MAV exploration of an unknown, indoor environment using a single onboard camera (image courtesy of Faessler et al. (2015b))
Figure 6.3 Probabilistic depth estimate in SVO. Very little motion is required by the MAV (marked in black at the top) for the uncertainty of the depth filters (shown as magenta lines) to converge.
Figure 6.4 Autonomous recovery after the quadrotor is thrown by hand: (a) the quadrotor detects free fall and (b) starts to control its attitude to be horizontal. Once it is horizontal, (c) it first controls its vertical velocity and then (d) its vertical position. The quadrotor uses its horizontal motion to initialize its visual-inertial state estimation and uses it (e) to first brake its horizontal velocity and then (f) lock to the current position.
Figure 6.5 (a) A quadrotor is flying over a destroyed building. (b) The reconstructed elevation map. (c) A quadrotor flying in an indoor environment. (d) The quadrotor executing autonomous landing. The detected landing spot is marked with a green cube. The blue line is the trajectory that the MAV flies to approach the landing spot. Note that the elevation map is local and of fixed size; its center lies always below the quadrotor's current position.
Chapter 7: Application Challenges of Underwater Vision
Figure 7.1 Underwater mosaicing pipeline scheme. The Topology Estimation, Image Registration, and Global Alignment steps can be performed iteratively until no new overlapping images are detected
Figure 7.2 Topology estimation scheme. (a) Final trajectory obtained by the scheme proposed in Elibol et al. (2010). The first image frame is chosen as a global frame, and all images are then translated in order to have positive values on the axes. The axes are in pixels, and the scale is approximately 150 pixels per meter. The plot is expressed in pixels instead of meters since the uncertainty of the sensor used to determine the scale (an acoustic altimeter) is not known. The red lines join the time-consecutive images, while the black ones connect non-time-consecutive overlapping image pairs. The total number of overlapping pairs is 5412. (b) Uncertainty in the final trajectory. Uncertainty of the image centers is computed from the covariance matrix of the trajectory (Ferrer et al. 2007). The uncertainty ellipses are drawn with a 95% confidence level. (c) Mosaic built from the estimated trajectory
Figure 7.3 Geometric registration of two different views (a and b) of the same underwater scene by means of a planar transformation, rendering the first image on top (c) and the second image on top (d)
Figure 7.4 Main steps involved in the pairwise registration process. The feature extraction step can be performed in both images of the pair, or only in one. In this last case, the features are identified in the second image after an optional image warping based on a transformation estimation
Figure 7.5 Example of error accumulation from registration of sequential images. The same benthic structures appear in different locations of the mosaic due to error accumulation (trajectory drift)
Figure 7.6 Photomosaic built from six images of two megapixels. The mosaic shows noticeable seams in (a), where the images have only been geometrically transformed and sequentially rendered on the final mosaic canvas, the last image on top of the previous one. After applying a blending algorithm, the artifacts (image edges) disappear from the resulting mosaic (b).
Figure 7.7 2.5D map of a Mid-Atlantic Ridge area, resulting from the combination of a bathymetry and a blended photomosaic of the generated high-resolution images. The obtained scene representation provides scientists with a global view of the interest area as well as with detailed optical information acquired at a close distance to the seafloor. Data courtesy of Javier Escartin (CNRS/IPGP, France)
Figure 7.8 (a) Trajectory used for mapping an underwater chimney at a depth of about 1700 m in the Mid-Atlantic Ridge (pose frames shown in red/green/blue, corresponding to the coordinate axes). We can see the camera pointing always toward the object in a forward-looking configuration. The shape of the object shown was recovered using our approach presented in Campos et al. (2015). Note the difference in the level of detail when compared with a 2.5D representation of the same area obtained using a multibeam sensor in (b). The trajectory followed in (b) was downward-looking, hovering over the object, but for the sake of comparison we show the same trajectory as in (a). Finally, (c) shows the original point cloud, retrieved through optical-based techniques, that was used to generate the surface in (a). Note the large levels of both noise and outliers that this data set contains.
Figure 7.9 A sample of surface processing techniques that can be applied to the reconstructed surface. (a) Original; (b) remeshed; (c) simplified
Figure 7.10 Texture mapping process, where the texture filling a triangle in the 3D model is extracted from the original images. Data courtesy of Javier Escartin (CNRS/IPGP, France)
Figure 7.11 Seafloor classification example on a mosaic image of a reef patch in the Red Sea, near Eilat, covering approximately 3 × 6 m. (a) Original mosaic. (b) Classification image using five classes: Brain Coral (green), Favid Coral (purple), Branching Coral (yellow), Sea Urchin (pink), and Sand (gray).
Figure 7.12 Ship hull inspection mosaic. Data gathered with HAUV using DIDSON FLS.
Figure 7.13 Harbor inspection mosaic. Data gathered from an Autonomous Surface Craft with BlueView P900-130 FLS.
Figure 7.14 Cap de Vol shipwreck mosaic: (a) acoustic mosaic and (b) optical mosaic
Edited by
Antonio M. López
Computer Vision Center (CVC) and Universitat Autònoma de Barcelona, Spain
Atsushi Imiya
Chiba University, Japan
Tomas Pajdla
Czech Technical University, Czech Republic
Jose M. Álvarez
National Information Communications Technology Australia (NICTA), Canberra Research Laboratory, Australia
This edition first published 2017
© 2017 John Wiley & Sons Ltd
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought
Library of Congress Cataloging-in-Publication Data
Names: López, Antonio M., 1969- editor. | Imiya, Atsushi, editor. | Pajdla, Tomas, editor. | Álvarez, J. M. (Jose M.), editor.
Title: Computer vision in vehicle technology : land, sea and air / Editors Antonio M. López, Atsushi Imiya, Tomas Pajdla, Jose M. Álvarez.
Description: Chichester, West Sussex, United Kingdom : John Wiley & Sons, Inc., [2017] | Includes bibliographical references and index.
Identifiers: LCCN 2016022206 (print) | LCCN 2016035367 (ebook) | ISBN 9781118868072 (cloth) | ISBN 9781118868041 (pdf) | ISBN 9781118868058 (epub)
Subjects: LCSH: Computer vision. | Automotive telematics. | Autonomous vehicles-Equipment and supplies. | Drone aircraft-Equipment and supplies. | Nautical instruments.
Classification: LCC TL272.53 .L67 2017 (print) | LCC TL272.53 (ebook) | DDC 629.040285/637-dc23
LC record available at https://lccn.loc.gov/2016022206
A catalogue record for this book is available from the British Library.
Cover image: jamesbenet/gettyimages; groveb/gettyimages; robertmandel/gettyimages
ISBN: 9781118868072
Ricard Campos
, Computer Vision and Robotics Institute, University of Girona, Spain
Arturo de la Escalera
, Laboratorio de Sistemas Inteligentes, Universidad Carlos III de Madrid, Spain
Armagan Elibol
, Department of Mathematical Engineering, Yildiz Technical University, Istanbul, Turkey
Javier Escartin
, Institute of Physics of Paris Globe, The National Centre for Scientific Research, Paris, France
Uwe Franke
, Image Understanding Group, Daimler AG, Sindelfingen, Germany
Friedrich Fraundorfer
, Institute for Computer Graphics and Vision, Graz University of Technology, Austria
Rafael Garcia
, Computer Vision and Robotics Institute, University of Girona, Spain
David Gerónimo
, ADAS Group, Computer Vision Center, Universitat Autònoma de Barcelona, Spain
Nuno Gracias
, Computer Vision and Robotics Institute, University of Girona, Spain
Ramon Hegedus
, Max Planck Institute for Informatics, Saarbruecken, Germany
Natalia Hurtos
, Computer Vision and Robotics Institute, University of Girona, Spain
Reinhard Klette
, School of Engineering, Computer and Mathematical Sciences, Auckland University of Technology, New Zealand
Antonio M. López
, ADAS Group, Computer Vision Center (CVC) and Computer Science Department, Universitat Autònoma de Barcelona (UAB), Spain
Laszlo Neumann
, Computer Vision and Robotics Institute, University of Girona, Spain
Tudor Nicosevici
, Computer Vision and Robotics Institute, University of Girona, Spain
Ricard Prados
, Computer Vision and Robotics Institute, University of Girona, Spain
Davide Scaramuzza
, Robotics and Perception Group, University of Zurich, Switzerland
ASM Shihavuddin
, École Normale Supérieure, Paris, France
David Vázquez
, ADAS Group, Computer Vision Center, Universitat Autònoma de Barcelona, Spain
This book was born following the spirit of the Computer Vision in Vehicular Technology (CVVT) Workshop. At the moment of finishing this book, the 7th CVVT Workshop is being held at CVPR'2016 in Las Vegas. Previous CVVT Workshops were held at CVPR'2015 in Boston (http://adas.cvc.uab.es/CVVT2015/), ECCV'2014 in Zurich (http://adas.cvc.uab.es/CVVT2014/), ICCV'2013 in Sydney (http://adas.cvc.uab.es/CVVT2013/), ECCV'2012 in Firenze (http://adas.cvc.uab.es/CVVT2012/), ICCV'2011 in Barcelona (http://adas.cvc.uab.es/CVVT2011/), and ACCV'2010 in Queenstown (http://www.media.imit.chiba-u.jp/CVVT2010/). Throughout these years, many invited speakers, co-organizers, contributing authors, and sponsors have helped to keep CVVT alive and exciting. We are enormously grateful to all of them! Of course, we also want to give special thanks to the authors of this book, who kindly accepted the challenge of writing their respective chapters.
Antonio M. López would also like to thank the past and current members of the Advanced Driver Assistance Systems (ADAS) group of the Computer Vision Center at the Universitat Autònoma de Barcelona. He also acknowledges his current public funding, in particular the Spanish MEC project TRA2014-57088-C2-1-R, the Spanish DGT project SPIP2014-01352, and the Generalitat de Catalunya project 2014-SGR-1506. Finally, he would like to thank NVIDIA Corporation for the generous donation of graphical processing hardware units, and especially for its kind support of the ADAS group activities.
Tomas Pajdla has been supported by EU H2020 Grant No. 688652 UP-Drive and Institutional Resources for Research of the Czech Technical University in Prague.
Atsushi Imiya was supported by IMIT Project Pattern Recognition for Large Data Sets from 2010 to 2015 at Chiba University, Japan.
Jose M. Álvarez was supported by the Australian Research Council through its Special Research Initiative in Bionic Vision Science and Technology grant to Bionic Vision Australia. The National Information Communications Technology Australia is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.
The book is organized into seven self-contained chapters related to CVVT topics, and a final short chapter with overall concluding remarks. Briefly, Chapter 1 gives a quick overview of the main ideas that link computer vision with vehicles. Chapters 2–7 are more specialized and divided into two blocks. Chapters 2–4 focus on the use of computer vision for the self-navigation of vehicles: Chapter 2 focuses on land (autonomous cars), Chapter 3 on air (micro aerial vehicles), and Chapter 4 on sea (underwater robotics). Analogously, Chapters 5–7 focus on the use of computer vision as a technology to solve specific applications beyond self-navigation: Chapter 5 focuses on land (ADAS), and Chapters 6 and 7 on air and sea, respectively. Finally, Chapter 8 concludes and points out new research trends.
Antonio M. López
Computer Vision Center (CVC) and Universitat Autònoma de Barcelona, Spain
ACC
adaptive cruise control
ADAS
advanced driver assistance system
AUV
autonomous underwater vehicle
BA
bundle adjustment
BCM
brightness constancy model
BoW
bag of words
CAN
controller area network
CLAHE
contrast limited adaptive histogram equalization
COTS
crown of thorns starfish
DCT
discrete cosine transforms
DOF
degree of freedom
DVL
Doppler velocity log
EKF
extended Kalman filter
ESC
electronic stability control
FCA
forward collision avoidance
FEM
finite element method
FFT
fast Fourier transform
FIR
far infrared
FLS
forward-looking sonar
GA
global alignment
GDIM
generalized dynamic image model
GLCM
gray level co-occurrence matrix
GPS
global positioning system
GPU
graphical processing unit
HDR
high dynamic range
HOG
histogram of oriented gradients
HOV
human operated vehicle
HSV
hue saturation value
IR
infrared
KPCA
kernel principal component analysis
LBL
long baseline
LBP
local binary patterns
LCA
lane change assistance
LDA
linear discriminant analysis
LDW
lane departure warning
LHC
local homogeneity coefficient
LKS
lane keeping system
LMedS
least median of squares
MEX
MATLAB executable
MLS
moving least squares
MR
maximum response
MST
minimum spanning tree
NCC
normalized chromaticity coordinates
NDT
normal distribution transform
NIR
near infrared
OVV
online visual vocabularies
PCA
principal component analysis
PDWMD
probability density weighted mean distance
PNN
probabilistic neural network
RANSAC
random sample consensus
RBF
radial basis function
ROD
region of difference
ROI
region of interest
ROV
remotely operated vehicle
SDF
signed distance function
SEF
seam-eliminating function
SIFT
scale invariant feature transform
SLAM
simultaneous localization and mapping
SNR
signal-to-noise ratio
SSD
sum of squared differences
SURF
speeded up robust features
SVM
support vector machine
TJA
traffic jam assist
TSR
traffic sign recognition
TV
total variation
UDF
unsigned distance function
USBL
ultra short base line
UUV
unmanned underwater vehicle
UV
underwater vehicle
Reinhard Klette
School of Engineering, Computer and Mathematical Sciences, Auckland University of Technology, Auckland, New Zealand
This chapter is a brief introduction to academic aspects of computer vision in vehicles. It summarizes basic notation and definitions used in computer vision and discusses a few visual tasks of relevance for vehicle control and environment understanding.
Computer vision designs solutions for understanding the real world by using cameras. See Rosenfeld (1969), Horn (1986), Hartley and Zisserman (2003), or Klette (2014) for examples of monographs or textbooks on computer vision.
Computer vision operates today in vehicles including cars, trucks, airplanes, unmanned aerial vehicles (UAVs) such as multi-copters (see Figure 1.1 for a quadcopter), satellites, and even autonomously driving rovers on the Moon or Mars.
Figure 1.1 (a) Quadcopter. (b) Corners detected from a flying quadcopter using a modified FAST feature detector.
Courtesy of Konstantin Schauwecker
In our context, the ego-vehicle is the vehicle in which the computer vision system operates; ego-motion describes the ego-vehicle's motion in the real world.
Computer vision solutions are today in use in manned vehicles for improved safety or comfort, in autonomous vehicles (e.g., robots) for supporting motion or action control, and also, as a misuse of UAVs, for killing people remotely. UAV technology also has good potential for helping to save lives, for creating three-dimensional (3D) models of the environment, and so forth. Underwater robots and unmanned sea-surface vehicles are further important applications of vision-augmented vehicles.
Traffic safety is a dominant application area for computer vision in vehicles. Currently, about 1.24 million people die annually worldwide due to traffic accidents (WHO 2013); that is, on average, 2.4 people die per minute in traffic accidents. How does this compare to the numbers Western politicians use for obtaining support for their “war on terrorism?” Computer vision can play a major role in solving the true real-world problems (see Figure 1.2). Traffic-accident fatalities can be reduced by controlling traffic flow (e.g., by triggering automated warning signals at pedestrian crossings or at intersections with bicycle lanes) using stationary cameras, or by having cameras installed in vehicles (e.g., for detecting safe distances and adjusting speed accordingly, or for detecting obstacles and constraining trajectories).
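As a quick plausibility check of the per-minute figure (a back-of-the-envelope calculation, not a value quoted from the WHO report):

$$\frac{1.24 \times 10^{6}\ \text{fatalities per year}}{365 \times 24 \times 60\ \text{minutes per year}} \approx 2.36 \approx 2.4\ \text{fatalities per minute}.$$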
Figure 1.2 The 10 leading causes of death in the world. Chart provided online by the World Health Organization (WHO). Road injury ranked number 9 in 2011
Computer vision is also introduced into modern cars for improving driving comfort. Surveillance of blind spots, automated distance control, or compensation of unevenness of the road are just three examples for a wide spectrum of opportunities provided by computer vision for enhancing driving comfort.
Computer vision is an important component of intelligent systems for vehicle control (e.g., in modern cars, or in robots). The Mars rovers “Curiosity” and “Opportunity” operate based on computer vision; “Opportunity” has already operated on Mars for more than ten years. The visual system of human beings provides a proof of existence that vision alone can deliver nearly all of the information required for steering a vehicle. Computer vision aims at creating comparable automated solutions for vehicles, enabling them to navigate safely in the real world. Additionally, computer vision can also work constantly “at the same level of attention,” applying the same rules or programs; a human is not able to do so due to becoming tired or distracted.
A human applies accumulated knowledge and experience (e.g., supporting intuition), and it is a challenging task to embed a computer vision solution into a system able to have, for example, intuition. Computer vision offers many more opportunities for future developments in a vehicle context.
There are generic visual tasks such as calculating distance or motion, measuring brightness, or detecting corners in an image (see Figure 1.1b). In contrast, there are specific visual tasks such as detecting a pedestrian, understanding ego-motion, or calculating the free space a vehicle may move in safely in the next few seconds. The borderline between generic and specific tasks is not well defined.
Solutions for generic tasks typically aim at creating one self-contained module for potential integration into a complex computer vision system. But there is no general-purpose corner detector and also no general-purpose stereo matcher. Adaptation to given circumstances appears to be the general way for an optimized use of given modules for generic tasks.
Solutions for specific tasks are typically structured into multiple modules that interact in a complex system.
Shin et al. (2014) review visual lane analysis for driver-assistance systems or autonomous driving. In this context, the authors discuss specific tasks such as “the combination of visual lane analysis with driver monitoring..., with ego-motion analysis..., with location analysis..., with vehicle detection..., or with navigation....” They illustrate the latter example by an application shown in Figure 1.3: lane detection and road sign reading, the analysis of GPS data and electronic maps (e-maps), and two-dimensional (2D) visualization are combined into a real-view navigation system (Choi et al. 2010).
Figure 1.3 Two screenshots for real-view navigation.
Courtesy of the authors of Choi et al. (2010)
Designing a multi-module solution for a given task does not need to be more difficult than designing a single-module solution. In fact, finding solutions for some single modules (e.g., for motion analysis) can be very challenging. Designing a multi-module solution requires:
1. that modular solutions are available and known,
2. tools for evaluating those solutions in dependency of a given situation (or scenario; see Klette et al. (2011) for a discussion of scenarios), for being able to select (or adapt) solutions,
3. conceptual thinking for designing and controlling an appropriate multi-module system,
4. a system optimization including more extensive testing on various scenarios than for a single module (due to the increase in combinatorial complexity of multi-module interactions), and
5. control of the multiple modules (e.g., when many designers separately insert processors for controlling various operations in a vehicle, no control engineer should be surprised if the vehicle becomes unstable).
Solutions can be characterized as being accurate, precise, or robust. Accuracy means a systematic closeness to the true values for a given scenario. Precision also considers the occurrence of random errors; a precise solution should lead to about the same results under comparable conditions. Robustness means approximate correctness for a set of scenarios that includes particularly challenging ones: in such cases, it would be appropriate to specify the defining scenarios accurately, for example, by using video descriptors (Briassouli and Kompatsiaris 2010) or data measures (Suaste et al. 2013). Ideally, robustness should address any possible scenario in the real world for a given task.
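These notions can be quantified in a simple way; the symbols below are illustrative and not taken from the chapter. Assume repeated estimates $v_1, \ldots, v_n$ of a quantity with ground-truth value $v^{*}$ for one scenario, and let $\bar{v} = \frac{1}{n}\sum_{i=1}^{n} v_i$. Then

$$\text{bias} = \left| \bar{v} - v^{*} \right|, \qquad \text{spread} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (v_i - \bar{v})^{2}}.$$

A small bias corresponds to an accurate solution and a small spread to a precise one; robustness can then be assessed, for example, by the worst such error over the whole set of defining scenarios.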
An efficient way for a comparative performance analysis of solutions for one task is to have different authors test their own programs on identical benchmark data. But we not only need to evaluate the programs, we also need to evaluate the benchmark data used (Haeusler and Klette 2010, 2012) for identifying their challenges or relevance.
Benchmarks need to come with measures for quantifying performance such that we can compare accuracy on individual data or robustness across a diversity of different input data.
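As an illustration of such a measure, the following sketch scores a disparity map against ground truth using the mean absolute error and the percentage of “bad” pixels whose error exceeds a threshold; the 3-pixel threshold, the array layout, and the function name are assumptions made for this example only.

```python
import numpy as np

def disparity_scores(estimated, ground_truth, valid_mask, bad_threshold=3.0):
    """Compare an estimated disparity map against ground truth.

    estimated, ground_truth: 2D float arrays of equal shape (disparities in pixels).
    valid_mask: boolean array marking pixels with known ground truth.
    Returns (mean absolute error, percentage of pixels with error > bad_threshold).
    """
    err = np.abs(estimated[valid_mask] - ground_truth[valid_mask])
    mae = float(err.mean())
    bad_percentage = float((err > bad_threshold).mean()) * 100.0
    return mae, bad_percentage

if __name__ == "__main__":
    # Toy data standing in for one benchmark frame.
    rng = np.random.default_rng(0)
    gt = rng.uniform(0.0, 64.0, size=(240, 320))     # ground-truth disparities
    est = gt + rng.normal(0.0, 1.5, size=gt.shape)   # estimate with random error
    mask = rng.uniform(size=gt.shape) > 0.1          # 90% of pixels have ground truth
    mae, bad = disparity_scores(est, gt, mask)
    print(f"mean absolute error: {mae:.2f} px, bad pixels (>3 px): {bad:.1f}%")
```

Averaging such scores over many frames supports comparisons of accuracy, while their variation across different scenarios gives a first impression of robustness.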
Figure 1.4 illustrates two possible ways for generating benchmarks, one by using computer graphics for rendering sequences with accurately known ground truth,1 and the other one by using high-end sensors (in the illustrated case, ground truth is provided by the use of a laser range-finder).2
Figure 1.4 Examples of benchmark data available for a comparative analysis of computer vision algorithms for motion and distance calculations. (a) Image from a synthetic sequence provided on EISATS with accurate ground truth. (b) Image of a real-world sequence provided on KITTI with approximate ground truth
But those evaluations need to be considered with care since not everything is comparable. Evaluations depend on the benchmark data used; a few summarizing numbers may not really be of relevance for particular scenarios possibly occurring in the real world. For some input data we simply cannot answer how a solution performs; for example, in the middle of a large road intersection, we cannot answer which lane-border detection algorithm performs best for this scenario.
We are not so naive as to expect an all-time “winner” when comparatively evaluating computer vision solutions. Vehicles operate in the real world (whether on Earth, the Moon, or Mars), which is so diverse that not all possible event occurrences can be modeled in the constraints underlying a designed program. Particular solutions perform differently for different scenarios, and a winning program for one scenario may fail for another. We can only evaluate how particular solutions perform for particular scenarios. In the end, this might support an optimization strategy based on adaptation to the scenario that a vehicle experiences at a given time.
The following basic notations and definitions (Klette 2014) are provided.
An image $I$ is defined on a set
$$\Omega = \{(x, y) : 1 \le x \le N_{\mathrm{cols}} \ \wedge\ 1 \le y \le N_{\mathrm{rows}}\}$$
of pairs of integers (pixel locations), called the image carrier, where $N_{\mathrm{cols}}$ and $N_{\mathrm{rows}}$ define the number of columns and rows, respectively. We assume a left-hand coordinate system with the coordinate origin in the upper-left corner of the image, the $x$-axis to the right, and the $y$-axis downward. A pixel of an image $I$ combines a location $(x, y)$ in the carrier with the value of $I$ at this location.
A scalar image takes values in a set of scalars, typically nonnegative integers or real numbers. A vector-valued image has scalar values in a finite number of channels or bands. A video or image sequence consists of frames $I_t$, for $t = 1, 2, \ldots$, all being images on the same carrier $\Omega$.
In case of an RGB color image $I$, we have pixels with three-channel values $I(x, y) = (R(x, y), G(x, y), B(x, y))$.
A geometrically rectified gray-level stereo image or frame consists of two channels $L$ and $R$, usually called left and right images; this is implemented in the multi-picture object (mpo) format for images (CIPA 2009).
For a sequence of gray-level stereo images, we have a pixel in frame $t$ which is the combined representation of the pixels in $L_t$ and $R_t$, respectively, at pixel location $(x, y)$ and time $t$.
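As a minimal sketch of how this notation can be mapped onto array data (NumPy is used here purely for illustration; the shapes and variable names are assumptions, not definitions from the chapter):

```python
import numpy as np

# Image carrier: N_cols x N_rows, left-hand coordinates (origin upper-left,
# x to the right, y downward); NumPy indexes arrays as [row, column] = [y, x].
n_cols, n_rows = 640, 480

gray = np.zeros((n_rows, n_cols), dtype=np.uint8)     # scalar image
rgb = np.zeros((n_rows, n_cols, 3), dtype=np.uint8)   # vector-valued image, 3 channels

x, y = 100, 50
pixel_value = rgb[y, x]                               # a pixel combines location and value

# A rectified gray-level stereo frame: left and right channels on the same carrier.
left = np.zeros((n_rows, n_cols), dtype=np.uint8)
right = np.zeros((n_rows, n_cols), dtype=np.uint8)

# A stereo sequence: frames indexed by time t.
T = 10
sequence = np.zeros((T, 2, n_rows, n_cols), dtype=np.uint8)
L_t, R_t = sequence[3, 0], sequence[3, 1]             # both channels of frame t = 3
```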
The zero-mean Gauss function is defined as follows:
$$G_\sigma(x, y) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$
A convolution of an image $I$ with the Gauss function produces smoothed images
$$L_\sigma = I \ast G_\sigma,$$
also known as Gaussians, for $\sigma > 0$. (We stay with symbol $L$ here, as introduced by Lindeberg (1994) for “layer”; a given context will prevent confusion with the left image of a stereo pair.)
Step-edges in images are detected based on first- or second-order derivatives, such as values of the gradient or the Laplacian given by
$$\nabla I = \left(\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right) \qquad \text{and} \qquad \Delta I = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}.$$
Local maxima of $L_1$- or $L_2$-magnitudes of $\nabla I$, or zero-crossings of values of $\Delta I$, indicate the locations of step-edges.
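The operations referenced above, and in Figure 1.5, can be reproduced with the OpenCV Python bindings roughly as follows; the input file name and the chosen sigma values are assumptions for this sketch.

```python
import cv2
import numpy as np

# Load a gray-level image (hypothetical file; replace with an actual test image).
img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
assert img is not None, "provide an existing gray-level test image"
img = img.astype(np.float32)

laplacians = []
for sigma in (0.5, 1.0, 2.0, 4.0):
    # Smoothed copy (a "Gaussian" layer L_sigma); ksize=(0, 0) lets OpenCV
    # derive the kernel size from sigma.
    smoothed = cv2.GaussianBlur(img, (0, 0), sigma)
    # Second-order derivatives: Laplacian of the smoothed image.
    laplacians.append(cv2.Laplacian(smoothed, cv2.CV_32F))

# First-order derivatives: gradient magnitude via Sobel approximations.
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0)
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1)
grad_magnitude = np.sqrt(gx * gx + gy * gy)

# Linear scaling for better visibility, as in Figure 1.5.
vis = cv2.normalize(laplacians[0], None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
cv2.imwrite("laplacian_sigma_0.5.png", vis)
```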
