151,99 €
As state-of-the-art imaging technologies have become more and more advanced, yielding scientific data at unprecedented detail and volume, the need to process and interpret these data has made image processing and computer vision increasingly important. Sources of data that today's applications routinely have to deal with include video transmission, wireless communication, automatic fingerprint processing, massive databanks, tireless and accurate automatic airport screening, and robust night vision, to name just a few. Multidisciplinary input from disciplines such as physics, computational neuroscience, cognitive science, mathematics, and biology will have a fundamental impact on the progress of imaging and vision sciences. One of the advantages of studying biological organisms is the possibility of devising very different types of computational paradigms, for example by implementing neural networks with a high degree of local connectivity.
This book is a comprehensive and rigorous reference in the area of biologically motivated vision sensors. The study of biological visual systems can be considered a two-way avenue: on the one hand, biological organisms can provide a source of inspiration for new, computationally efficient, and robust vision models; on the other hand, machine vision approaches can provide new insights for understanding biological visual systems. Across its chapters, the book covers a wide range of topics, from fundamentals to more specialized subjects, including visual analysis at the computational level, hardware implementation, and the design of new, more advanced vision sensors. The last two parts of the book provide an overview of representative applications and the current state of the art of research in this area. This makes it a valuable book for graduate (Master's and PhD) students as well as researchers in the field.
You can read this e-book in the Legimi apps on:
Number of pages: 968
Year of publication: 2015
Cover
Related Titles
Title Page
Copyright
List of Contributors
Foreword
Part I: Fundamentals
Chapter 1: Introduction
1.1 Why Should We Be Inspired by Biology?
1.2 Organization of Chapters in the Book
1.3 Conclusion
Acknowledgments
References
Chapter 2: Bioinspired Vision Sensing
2.1 Introduction
2.2 Fundamentals and Motivation: Bioinspired Artificial Vision
2.3 From Biological Models to Practical Vision Devices
2.4 Conclusions and Outlook
References
Chapter 3: Retinal Processing: From Biology to Models and Applications
3.1 Introduction
3.2 Anatomy and Physiology of the Retina
3.3 Models of Vision
3.4 Application to Digital Photography
3.5 Conclusion
References
Chapter 4: Modeling Natural Image Statistics
4.1 Introduction
4.2 Why Model Natural Images?
4.3 Natural Image Models
4.4 Computer Vision Applications
4.5 Biological Adaptations to Natural Images
4.6 Conclusions
References
Chapter 5: Perceptual Psychophysics
5.1 Introduction
5.2 Laboratory Methods
5.3 Psychophysical Threshold Measurement
5.4 Classic Psychophysics: Theory and Methods
5.5 Signal Detection Theory
5.6 Psychophysical Scaling Methods
5.7 Conclusions
References
Part II: Sensing
Chapter 6: Bioinspired Optical Imaging
6.1 Visual Perception
6.2 Polarization Vision - Object Differentiation/Recognition
6.3 High-Speed Motion Detection
6.4 Conclusion
References
Chapter 7: Biomimetic Vision Systems
7.1 Introduction
7.2 Scaling Laws in Optics
7.3 The Evolution of Vision Systems
7.4 Manufacturing of Optics for Miniaturized Vision Systems
7.5 Examples for Biomimetic Compound Vision Systems
References
Chapter 8: Plenoptic Cameras
8.1 Introduction
8.2 Light Field Representation of the Plenoptic Function
8.3 The Plenoptic Camera
8.4 Applications of the Plenoptic Camera
8.5 Generalizations of the Plenoptic Camera
8.6 High-Performance Computing with Plenoptic Cameras
8.7 Conclusions
References
Part III: Modelling
Chapter 9: Probabilistic Inference and Bayesian Priors in Visual Perception
9.1 Introduction
9.2 Perception as Bayesian Inference
9.3 Perceptual Priors
9.4 Outstanding Questions
References
Chapter 10: From Neuronal Models to Neuronal Dynamics and Image Processing
10.1 Introduction
10.2 The Membrane Equation as a Neuron Model
10.3 Application 1: A Dynamical Retinal Model
10.4 Application 2: Texture Segregation
10.5 Application 3: Detection of Collision Threats
10.6 Conclusions
Acknowledgments
References
Chapter 11: Computational Models of Visual Attention and Applications
11.1 Introduction
11.2 Models of Visual Attention
11.3 A Closer Look at Cognitive Models
11.4 Applications
11.5 Conclusion
References
Chapter 12: Visual Motion Processing and Human Tracking Behavior
12.1 Introduction
12.2 Pursuit Initiation: Facing Uncertainties
12.3 Predicting Future and On-Going Target Motion
12.4 Dynamic Integration of Retinal and Extra-Retinal Motion Information: Computational Models
12.5 Reacting, Inferring, Predicting: A Neural Workspace
12.6 Conclusion
Acknowledgments
References
Chapter 13: Cortical Networks of Visual Recognition
13.1 Introduction
13.2 Global Organization of the Visual Cortex
13.3 Local Operations: Receptive Fields
13.4 Local Operations in V1
13.5 Multilayer Models
13.6 A Basic Introductory Model
13.7 Idealized Mathematical Model of V1: Fiber Bundle
13.8 Horizontal Connections and the Association Field
13.9 Feedback and Attentional Mechanisms
13.10 Temporal Considerations, Transformations and Invariance
13.11 Conclusion
References
Chapter 14: Sparse Models for Computer Vision
14.1 Motivation
14.2 What Is Sparseness? Application to Image Patches
14.3 SparseLets: A Multiscale, Sparse, Biologically Inspired Representation of Natural Images
14.4 SparseEdges: Introducing Prior Information
14.5 Conclusion
Acknowledgments
References
Chapter 15: Biologically Inspired Keypoints
15.1 Introduction
15.2 Definitions
15.3 What Does the Front-End of the Visual System Tell Us?
15.4 Bioplausible Keypoint Extraction
15.5 Biologically Inspired Keypoint Representation
15.6 Qualitative Analysis: Visualizing Keypoint Information
15.7 Conclusions
References
Part IV: Applications
Chapter 16: Nightvision Based on a Biological Model
16.1 Introduction
16.2 Why Is Vision Difficult in Dim Light?
16.3 Why Is Digital Imaging Difficult in Dim Light?
16.4 Solving the Problem of Imaging in Dim Light
16.5 Implementation and Evaluation of the Night-Vision Algorithm
16.6 Conclusions
Acknowledgment
References
Chapter 17: Bioinspired Motion Detection Based on an FPGA Platform
17.1 Introduction
17.2 A Motion Detection Module for Robotics and Biology
17.3 Insect Motion Detection Models
17.4 Overview of Robotic Implementations of Bioinspired Motion Detection
17.5 An FPGA-Based Implementation
17.6 Experimental Results
17.7 Discussion
17.8 Conclusion
Acknowledgments
References
Chapter 18: Visual Navigation in a Cluttered World
18.1 Introduction
18.2 Cues from Optic Flow: Visually Guided Navigation
18.3 Estimation of Self-Motion: Knowing Where You Are Going
18.4 Object Detection: Understanding What Is in Your Way
18.5 Estimation of TTC: Time Constraints from the Expansion Rate
18.6 Steering Control: The Importance of Representation
18.7 Conclusions
Acknowledgments
References
Index
End User License Agreement
Chapter 1: Introduction
Figure 1.1 The Erbenochile erbeni eyes are distributed in a vertical half-cylinder surface with 18 or 19 lenses per vertical row. Note the striking similarity of this natural eye with the CurvACE sensor presented in Figure 22 of Chapter 7. Image credit: Wikipedia
Figure 1.2 Mindmap of the book contents. Cross-links between chapters have been indicated as thin lines.
Figure 1.3 Tag cloud of the abstracts and table of contents of the book. Credit: www.wordle.net
Chapter 2: Bioinspired Vision Sensing
Figure 2.1 Schematic of the retina network cells and layers. Photoreceptors initially receive light stimuli and transduce them into electrical signals. A feedforward pathway is formed from the photoreceptors via the bipolar cell layer to the ganglion cells, which form the output layer of the retina. Horizontal and amacrine cell layers provide additional processing with lateral inhibition and feedback. Finally, the visual information is encoded into spike patterns at the ganglion cell level. Thus encoded, the visual information is transmitted along the ganglion cell axons, which form the optic nerve, to the visual cortex in the brain. The schematic greatly simplifies the actual circuitry, which in reality includes various subtypes of each of the neuron types with different specific connection patterns. Also, numerous additional electrical couplings within the network are omitted for clarity.
Figure 2.2 Modeling the retina in silicon – from biology to a bioinspired camera: ATIS “silicon retina” bioinspired vision sensor [20], showing the pixel cell CMOS layout (bottom left), microscope photograph of part of the pixel array and the whole sensor (bottom middle), and miniature bioinspired ATIS camera.
Figure 2.3 (a) Simplified three-layer retina model and (b) corresponding silicon retina pixel circuitry; in (c), typical signal waveforms of the pixel circuit are shown. The upper trace represents an arbitrary voltage waveform at the node Vp tracking the photocurrent through the photoreceptor. The bipolar cell circuit responds with spike events of different polarity to positive and negative gradients of the photocurrent while being monitored by the ganglion cell circuit that also transports the spikes to the next processing stage; the rate of change is encoded in interevent intervals; panel (d) shows the response of an array of pixels to a natural scene (person moving in the field of view of the sensor). Events have been collected for some tens of milliseconds and are displayed as an image with ON (going brighter) and OFF (going darker) events drawn as white and black dots.
Figure 2.4 (a) Functional diagram of an ATIS pixel. (b) Arbitrary light stimulus and pixel response: two types of asynchronous “spike” events, encoding temporal change and sustained gray scale information, are generated and transmitted individually by each pixel in the imaging array. (c) Change events coded black (OFF) and white (ON) (top) and gray-level measurements at the respective pixel positions triggered by the change events (bottom).
Figure 2.5 Instance of a traffic scene observed by an ATIS. (a) ON/OFF changes (shown white/black) leading to instantaneous pixel-individual sampling. (b) Associated gray scale data recorded by the currently active pixels displayed on black background – all black pixels do not sample/send information at that moment. (c) Same data with full background acquired earlier, showing a low-data-rate, high-temporal-resolution video stream. The average video compression factor is around 100 for this example scene; that is, only 1% of the data (with respect to a standard 30 frames per s image sensor) are acquired and transmitted, yet allowing for a near lossless recording of a video stream. The temporal resolution of the video data is about 1000 frames per s equivalent.
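The event-driven sampling described in this caption can be made concrete with a small sketch. The snippet below is a minimal illustration that assumes a hypothetical (timestamp, x, y, polarity) event list rather than any vendor's actual ATIS file format; it simply accumulates ON/OFF change events over a few tens of milliseconds into a signed count image of the kind rendered in the figure.

```python
import numpy as np

# Hypothetical event format: (timestamp_us, x, y, polarity), polarity = +1 (ON) / -1 (OFF).
# Real ATIS/DVS data use vendor-specific formats; this is only an illustration.
rng = np.random.default_rng(0)
H, W = 120, 160
n_events = 5000
events = np.column_stack([
    np.sort(rng.integers(0, 50_000, n_events)),   # timestamps in microseconds
    rng.integers(0, W, n_events),                 # x coordinate
    rng.integers(0, H, n_events),                 # y coordinate
    rng.choice([-1, 1], n_events),                # ON/OFF polarity
])

def accumulate(events, t0, t1, shape):
    """Collect events with t0 <= t < t1 into a signed count image."""
    frame = np.zeros(shape, dtype=np.int32)
    mask = (events[:, 0] >= t0) & (events[:, 0] < t1)
    for _, x, y, p in events[mask]:
        frame[y, x] += p
    return frame

# Render a few tens of milliseconds of activity, as in the figure.
frame = accumulate(events, 0, 30_000, (H, W))
print("ON pixels:", int((frame > 0).sum()), "OFF pixels:", int((frame < 0).sum()))
```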
Chapter 3: Retinal Processing: From Biology to Models and Applications
Figure 3.1 Schematic representation of the organization into columns of retinal cells. Light is transmitted through the retina from the ganglion cells to the photoreceptors. The neural signal is transmitted in the opposite direction, from the photoreceptors, whose axons are located in the outer plexiform layer (OPL), to the ganglion cells, whose dendritic fields are located in the inner plexiform layer (IPL), passing through the bipolar cells (adapted from Ref. [4]).
Figure 3.2 Number of photoreceptors (rods and cones) per square millimeter of retinal surface as a function of eccentricity (distance from the fovea center, located at 0). The blind spot corresponds to the optic nerve (an area without photoreceptors). Redrawn from Ref. [6].
Figure 3.3 Simulation of the complexity of the spatiotemporal response of the retina in Hérault's model. (a) Temporal evolution of the modulus of the spatial transfer function (only in one dimension ) for a square input . When light is switched on, the spatial transfer function is low-pass at the beginning and becomes band-pass (black lines). Blue and red lines show the temporal profile for respectively low spatial frequency (magnocellular pathway) and high spatial frequency (parvocellular pathway). When the light is switched off, there is a rebound for low spatial frequency. (b) The corresponding temporal profiles for high spatial frequencies (HF-X, parvo) and low spatial frequencies (LF-X magno). The profile for Y cells is also shown (adapted with permission from Ref. [4]).
Figure 3.4 Simulation of Hérault's model on an image that is considered steady. Each image shows respectively the response of cones, horizontal cells, bipolar cells ON and OFF, midget cells (Parvo), parasol cells (Magno X), and cat's Y cells (Magno Y) in response to the original image.
Figure 3.5 (a) Bayer color filter array (CFA). (b) Simulation of the arrangement of the L-cones, M-cones, and S-cones in the human retina (redrawn from Ref. [7]).
Figure 3.6 (a) Color image acquired through the Bayer CFA. (b) The corresponding modulus of the spatial frequency spectrum , showing spectrum of luminance (R+2G+B) information in the center and spectrum of chromatic information (R-B, R-2G+B) in the borders.
Chapter 4: Modeling Natural Image Statistics
Figure 4.1 Natural images are highly structured. Here we show an analysis of a single image (a). The pairwise correlations in intensity between neighboring pixels are illustrated by the scatterplot ( = 0.95) in (b). Pairwise correlations extend well beyond neighboring pixels as shown by the autocorrelation function of the image in (c). We also show a whitened version of the image (d) and a phase scrambled version (e). Whitening removes pairwise correlations and preserves higher-order regularities, whereas Fourier phase scrambling has the opposite effect. Comparing the whitened and phase-scrambled images reveals that the higher-order regularities carry much of the perceptually meaningful structure. Second-order correlations can be modeled by a Gaussian distribution. The probabilistic models we will discuss are aimed at describing the higher-order statistical regularities and can be thought of as generalizations of the Gaussian distribution.
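As a rough illustration of the statistics mentioned in this caption, the following sketch (using a synthetic 1/f image as a stand-in for a real photograph) measures the correlation between neighboring pixels and then removes it by flattening the Fourier amplitude spectrum while keeping the phases, i.e., whitening.

```python
import numpy as np

# Synthesize a spatially correlated "image" by shaping white noise with a ~1/f
# amplitude spectrum; a real analysis would use photographs of natural scenes.
rng = np.random.default_rng(1)
noise = rng.standard_normal((256, 256))
fx = np.fft.fftfreq(256)[:, None]
fy = np.fft.fftfreq(256)[None, :]
radial = np.sqrt(fx**2 + fy**2) + 1e-4
img = np.real(np.fft.ifft2(np.fft.fft2(noise) / radial))

# Pairwise correlation between horizontally neighboring pixels (cf. panel (b)).
a, b = img[:, :-1].ravel(), img[:, 1:].ravel()
print(f"neighboring-pixel correlation: {np.corrcoef(a, b)[0, 1]:.2f}")

# Whitening: flatten the amplitude spectrum, keep the phases (cf. panel (d)).
F = np.fft.fft2(img)
whitened = np.real(np.fft.ifft2(F / (np.abs(F) + 1e-12)))
aw, bw = whitened[:, :-1].ravel(), whitened[:, 1:].ravel()
print(f"after whitening: {np.corrcoef(aw, bw)[0, 1]:.2f}")
```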
Figure 4.2 To illustrate how multi-information is measured as a proxy for how redundant a distribution is, we show an example joint distribution, visualized in (a), and its factorial form, visualized in (b), that is, where the two variables are independent of each other. Multi-information is the difference between the joint distribution's entropy (corresponding to (a)) and the factorial form's entropy (corresponding to (b)). Intuitively speaking, multi-information measures the difference of the spread of the two distributions. The illustrated joint distribution therefore has a relatively high degree of multi-information, meaning that it is highly redundant.
Figure 4.3 In this chapter, we review several important natural image models which we have organized into three branches of approaches, each extending the Gaussian distribution by certain higher-order regularities. Arrows with solid lines indicate a generalization of the model class.
Figure 4.4 ICA filter responses are not independent for natural images. In panel a, we show a complete set of ICA basis functions () trained on images pixels in size (“complete” meaning that there are as many basis functions, 100, as there are dimensions in the data). Panel b visualizes the corresponding set of filters (). The filters are oriented, bandpass, and localized – prominent features shared by the receptive fields of simple cells in the primary visual cortex. In panels c and d, we examine the filter responses or “sources” for natural images. (c) shows the joint histogram of two filter responses, , where . The joint distribution exhibits a diamond-shaped symmetry, which is well captured by an -spherical symmetry with close to 1. We also show the two marginal distributions, which are heavy-tailed, sparse distributions with a high probability of zero and an elevated probability of larger nonzero values, relative to the Gaussian distribution (i.e., filter responses are typically either very small or very large). (d) The higher-order dependence of the filter responses is shown by plotting the conditional distribution for each value of . The “bow-tie” shape of this plot reveals that the variance of depends on the value of .
Figure 4.5 The MCGSM [36] models images by learning the distribution of one pixel, , given a causal neighborhood of pixels (a). A graphical model representation of the MCGSM is shown in (b), where for visualization purposes the neighborhood consists of only four pixels. The parents of a pixel are constrained to pixels which are above it or in the same row and left of it, which allows for efficient maximum likelihood learning and sampling.
Figure 4.6 Redundancy reduction capabilities of various methods and models quantified in terms of estimated multi-information. PCA [12, 13] only takes second-order correlations into account and here serves as the baseline. ICA [12, 13] corresponds to the best linear transformation for removing second- and higher-order correlations. Overcomplete ICA (OICA) [23], ISA [25], and hierarchical ICA (HICA) [42] represent various extensions of ICA. Also included are estimates for deep belief networks (DBN), -elliptical models [19], MoG (32 components), PoT, mixtures of GSMs, and MCGSMs.
Figure 4.7 Image patches generated by various image models. To enhance perceptual visibility of the difference between the samples from the different models, all models were trained on natural images with a small amount of additive Gaussian noise.
Figure 4.8 After removing 70% of the pixels from the image (a), the missing pixels from (b) were estimated by maximizing the density defined by an MCGSM (c).
Figure 4.9 Computer vision applications of image models include the synthesis, discrimination, and computation of representations of textures. Here we illustrate the performance of one model in capturing the statistics of a variety of textures. The left image of each pair shows a random pixel crop of a texture [47]. The right image of each pair shows a histogram-matched sample from the MCGSM trained on the texture. The samples provide a visual illustration of the kind of correlations the model can capture when applied to various visual textures.
Figure 4.10 Illustration of a psychophysical stimulus pitting natural images against model samples following [43]. Observers viewed two sets of tightly tiled images. They were told that one set included patches from photographs of natural scenes, whereas the other contained impostors, and their task was to identify the true set ((a) here). The “imposters” here are samples from the -spherically symmetric model. The model samples were always matched in joint probability to the natural images (under the particular model being tested, i.e., under Lp here). In this example, the patches are pixels in size. As shown, the model samples fail to capture several prominent features of natural images, particularly their detailed geometric content.
Figure 4.11 (a) A crop of an example image from a natural image dataset collected by Tkačik et al. [54]. (b) A corresponding sample of an MCGSM trained on the dataset, illustrating the kind of correlations captured by a current state-of-the-art probabilistic model of natural images. As with textures (Figure 4.9), the nature of the samples depends highly on the training images. When trained on an urban scene (a crop of the scene is shown in (c); photograph by Matt Wiebe [55]), the samples express much more vertical and horizontal structure (d).
Chapter 5: Perceptual Psychophysics
Figure 5.1 Gaussian distribution of perceived thresholds after a number of measures. This variability is the result of neural noise present in biological systems, experimental noise, and, in certain cases, quantal fluctuations of the stimulus. As a result, every time we look at the stimulus, our "internal" threshold is different and we respond accordingly. These internal thresholds follow a normal distribution with its peak at the most probable value, as shown in (a). For any given stimulus intensity, the ratio of positive to negative answers to the question "do you see the stimulus?" will be determined by these internal threshold fluctuations and corresponds to the area under the Gaussian shown in (b).
Figure 5.2 The psychometric function of a typical set of results using the method of constant stimuli. y-Axis values show the proportion of "yes" answers to the question "do you see the stimulus?" Each stimulus was presented an arbitrary number of times n (typically n = 10), and each gray circle is the average result of these presentations.
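A psychometric function of this kind is typically summarized by fitting a cumulative Gaussian to the proportion of "yes" responses. The sketch below uses illustrative, made-up data and SciPy's curve_fit; it is not the analysis pipeline described in the chapter.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Hypothetical constant-stimuli data: stimulus intensities and the proportion
# of "yes" answers out of n = 10 presentations each (illustrative numbers only).
intensity = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
p_yes     = np.array([0.0, 0.1, 0.2, 0.4, 0.7, 0.9, 1.0, 1.0])

def psychometric(x, mu, sigma):
    """Cumulative Gaussian: probability of a 'yes' response at intensity x."""
    return norm.cdf(x, loc=mu, scale=sigma)

(mu, sigma), _ = curve_fit(psychometric, intensity, p_yes, p0=[4.0, 1.0])
print(f"estimated threshold (50% point): {mu:.2f}, slope parameter sigma: {sigma:.2f}")
```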
Figure 5.3 Fictional set of results for a typical method of limits experiment. There are six trials consisting of staircases (three ascending and three descending) of stimulus intensities. Observers answer "yes" (Y) or "no" (N) to indicate whether or not they have detected the stimulus. Individual trials start from one extreme (either from perfectly detectable or undetectable stimuli) and end as soon as observers alter their answers. In the example, no provision has been taken to counter habituation and expectation errors.
Figure 5.4 Example of a forced-choice version of the method of limits. The staircase consisted of a sequence of morphed pictures (a bull that morphs into a car in small steps). The experimental protocol searched for the point in the sequence where observers can just tell apart a modified picture from the original picture of the bull (discrimination threshold). To decide this, each step in the staircase originated a series of trials where observers had to select the corresponding test image (a morphed bull) from two references (the normal bull) presented in randomized order. By repeating each step a number of times, experimenters measured the morph change necessary for observers to tell the difference, without the inconvenience of habituation and expectation biases [9].
Figure 5.5 The hypothetical probability distribution of noise (N, in broken lines) and signal + noise (S + N, in solid lines) as a function of the observer's neural response. As expected, the S + N curve generates more neural activity (e.g., spikes per second) on average than the N curve; however, there are sections where both curves overlap in x. The vertical line represents the observer's decision criterion, which could be anywhere along the x-axis. Depending on its position, four distinctive regions are determined: hit, miss, correct rejection, and false alarm.
Figure 5.6 Receiver operating characteristic (ROC) curves describing the relationship between hits and false alarms for several decision criteria and signal strengths. As the observer lowers the decision criterion, both areas labeled "false alarm" and "hit" in Figure 5.5 increase in different proportions, depending on the proximity between the curves. The ROC family of curves describes all the options for our hypothetical radar controller regarding the separation between the S and S + N curves. When both curves are clearly separated, it is possible to obtain a nearly perfect hit rate without false alarms.
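Under the equal-variance Gaussian model sketched in Figures 5.5 and 5.6, sensitivity and decision criterion can be recovered from a single pair of hit and false-alarm rates. A minimal sketch with illustrative rates:

```python
from scipy.stats import norm

def dprime_criterion(hit_rate, fa_rate):
    """Equal-variance Gaussian SDT: sensitivity d' and criterion c."""
    z_hit, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = z_hit - z_fa
    criterion = -0.5 * (z_hit + z_fa)
    return d_prime, criterion

# Illustrative rates for a moderately sensitive, slightly conservative observer.
d, c = dprime_criterion(hit_rate=0.85, fa_rate=0.20)
print(f"d' = {d:.2f}, criterion c = {c:.2f}")
```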
Figure 5.7 Hypothetical example of how to obtain a psychophysical sensation scale over a large stimulus range from measures of sensation discrimination at threshold (jnd). Panel (a) shows that to keep producing the same increment in sensation (equal steps in the y-axis), the stimulus intensity needs to be increased (unequal steps in the x-axis). Panel (b) represents the psychophysical scale obtained from the results in (a).
Figure 5.8 Hypothetical nonlinear relationship between a physical stimulus and its corresponding psychological sensation. The arrows show how a perceptual scale with uniform measurement units can be produced by partitioning the scale into smaller intervals.
Figure 5.9 (a) Probability distribution of judgments for two different stimuli A and B in an arbitrary perceptual continuum. Their average perceptual distance is determined by μB − μA, the distance between the two means. (b) The probability distribution of the differences in perceptual judgments for two different stimuli, derived from (a). The mean μB − μA of this new Gaussian is equal to the average perceptual distance between A and B, and its variance is also related to the variances of the distributions in (a).
Chapter 6: Bioinspired Optical Imaging
Figure 6.1 Light spectrum.
Figure 6.2 (a) Human eye, (b) Compound eye [6].
Figure 6.3 Ommatidia of a compound eye.
Figure 6.4 Alignment of visual molecules.
Figure 6.5 Polarization of light by reflection and transmission.
Figure 6.6 Transmission of a polarized beam of light through a linear polarizer.
Figure 6.7 Transmitted light intensity.
Figure 6.8 Experimental setup.
Figure 6.9 Variation of intensity with the angle of the linear polarizer.
Figure 6.10 Variation in the degree of polarization with angle of incidence.
Figure 6.11 Variation of PFR with angle of incidence.
Figure 6.12 Variation of the Stokes degree of polarization with varying angle of incidence.
Figure 6.13 Original test image.
Figure 6.14 Image formed by degree of polarization matrix.
Figure 6.15 Classified transparent object after applying image-processing functions on the polarization matrix image.
Figure 6.16 Temporal differentiation schemes (DI – difference image). (a) Differential imaging and (b) visual tracking or saccadic imaging.
Figure 6.17 Horizontal motion detection using spatially integrated binary optical flow.
Figure 6.18 Histogram equalization using temporal differentiation. (a) and (b) Temporal images with the first frame captured at 10 ms and the second frame 20 ms, respectively; (c) and (d) Histogram equalization of the temporal images.
Chapter 7: Biomimetic Vision Systems
Figure 7.1 (a) Scheme of a lens with lens pupil diameter D, focal length ƒ, and angle u for the maximum cone of rays contributing to the focal spot δx (Airy disk); (b) intensity distribution of the Airy disk free of aberrations; and (c) 2 × 2 pixels of an image sensor with a Bayer color filter covering the size of the Airy disk.
Figure 7.2 (a) Photographic lens (Nikon) with focal length ƒ = 24 mm and ƒ-number of ƒ/2.8, often written as 1 : 2.8; (b) schematic view of the optical design of the photographic lens shown in (a); and (c) a pair of miniaturized NanEye cameras (Awaiba) with ƒ/2.7, a focal length ƒ = 0.66 mm, and a lens diameter of D = 0.244 mm.
Figure 7.3 Schematic illustration of the scaling law of optics. Lenses with an equal ƒ-number have an equal best-focused diffraction-limited spot of size δx (Airy pattern). Thus, the smallest resolved feature in the image is independent of the lens scaling for different lenses with an equal ƒ-number. However, the number of resolvable pixels in the image plane, also referred to as the space–bandwidth product (SW), is drastically reduced for a small lens, as the size of the image scales with the lens size.
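For reference, the scaling argument can be written down with the standard diffraction-limit relation; the expressions below are stated from common optics textbooks as an illustration, not quoted from the chapter.

```latex
% Diffraction-limited spot (Airy disk diameter) for a lens of focal length f and
% pupil diameter D, i.e. f-number N = f/D, at wavelength \lambda:
\delta x \;\approx\; 2.44\,\lambda\,\frac{f}{D} \;=\; 2.44\,\lambda N .
% The spot size depends only on N, not on the absolute lens size, whereas the
% number of resolvable pixels (space-bandwidth product) scales with the image area:
\mathrm{SW} \;\propto\; \left(\frac{D_{\mathrm{image}}}{\delta x}\right)^{2}.
```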
Figure 7.4 Eye evolution from a simple one-cell photoreceptor to a high-resolution focused camera-type eye. It is assumed that the whole process of eye evolution was accomplished in less than 400 000 generations.
Figure 7.5 Different types of natural eye sensors and their artificial counterpart. Artificial eye sensors are based on planar components due to manufacturing restrictions.
Figure 7.6 Compound eyes of dragonflies (Anisoptera).
Figure 7.7 Very large telescope (VLT) Array, 4 × 8.2 m Unit Telescopes, Cerro Paranal, Chile (ESO).
Figure 7.8 (a) Forward-directed eyes of a cat allowing stereoscopic view and direct distance measurement; (b) prey animals like the mouse have their eyes on the side, allowing all-round vision; and (c) a male jumping spider (Pseudeuophrys lanigera). These jumping spiders live inside houses in Europe and have the excellent vision required to sight, stalk, and jump on prey, typically smaller insects such as booklice.
Figure 7.9 (a) Scheme of the visual fields of the four eye pairs of a jumping spider and (b) a male jumping spider (Pseudeuophrys lanigera) sitting on a fingertip.
Figure 7.10 Schematic drawing (a) top view and (b) enlarged cross section of a cylindrical microlens array on a photographic layer for integral photography as proposed by Walter Hess in 1912 [13].
Figure 7.11 (a) Front-illuminated image sensor and (b) backside-illuminated image sensor bonded to a stack of a 3D integrated chips with through-silicon via (TSV) connections.
Figure 7.12 (a) Trilobite eye – microlens arrays made of calcite, (b) compound apposition eyes of a fly, and (c) microlens array manufactured by wafer-based technology in quartz glass (fused silica) [14].
Figure 7.13 Melting photoresist technique for manufacturing refractive microlenses on wafer level.
Figure 7.14 Transfer of resist microlenses into the wafer bulk material by reactive ion etching (RIE).
Figure 7.15 (a) Multilevel diffractive optical elements (DOE) manufactured with a planar process using subsequent photolithography and dry etching [15]. (b) Eight-level diffractive optical element in fused silica glass material.
Figure 7.16 UV imprint lithography and wafer-level packaging (WLP): (a) A PDMS stamp is embossed on a wafer with a drop of a photosensitive polymer. The resist is formed by the PDMS pattern and exposed by UV light for curing. After hardening of the resist, the PDMS stamp and lens wafer are separated; (b) lens and aperture wafers are mounted in a mask aligner to obtain a stack of wafer-level cameras (WLC).
Figure 7.17 (a) Schemes of a microlens projection lithography (MPL) system for 1 : 1 projection of a photomask onto a wafer in a mask aligner and (b) photograph of an early prototype system, a stack of three microlens wafers.
Figure 7.18 (a) Wafer-level camera (WLC) built within the European Research Project WALORI in 2005 [23]. Backside illumination (BSI) through a thinned CMOS image sensor, 5.6 µm pixel VGA; (b) optical design of the WLC shown in (a).
Figure 7.19 (a) Microlens array camera imaging individual subimages from an object to an image plane and (b) recorded subimages for ƒ/2.4 array camera.
Figure 7.20 (a) Ultraflat camera system consisting of three layers of microlens arrays and (b) superposition of individual subimages obtained from a 1 : 1 multichannel imaging system as shown in Figure 7.16, imaging the same object also shown in Figure 7.18.
Figure 7.21 (a) Electronic cluster eye (eCLEY) with a FOV of 52° × 44° and VGA resolution, an apposition compound design with electronic superposition and (b) an ultrathin array microscope with integrated illumination, a superposition compound vision design. Both devices are built at Fraunhofer IOF in Germany. Both photos compare the biomimetic vision system (right in the photos) to the standard optics, a bulky objective, and a standard microscope [24, 25].
Figure 7.22 (a) PiCam module consisting of 4 × 4 individual cameras with 0.75 megapixel resolution and (b) lens and sensor array with RGB color filters (Pelican Imaging).
Figure 7.23 (A) Scheme of the manufacturing of a curved artificial compound eye from the research project CurvACE [29]. Polymer-on-glass lenses, apertures, photodetectors (CMOS), and interconnections (PCB) were manufactured on a flexible layer, then diced, and mounted on a rigid semicylindrical substrate; (B) image of the CurvACE prototype with a dragonfly.
Figure 7.24 Different possible arrangements of camera modules for integration in smartphones or tablets.
Figure 7.25 Portrait of the author performing a self-portrait using a FLIR ONE™, the first thermal imager designed for mobile phones and presented by FLIR Systems in 2014 [30].
Chapter 8: Plenoptic Cameras
Figure 8.1 Light field parameterizations. (a) Two-plane parameterization, (b) spherical–Cartesian parameterization, and (c) sphere parameterization.
Figure 8.2 Pinhole camera. (a) Optical model and (b) light field representation.
Figure 8.3 Thin lens camera. (a) Optical model and (b) light field representation. D is the lens aperture and Δp is the pixel size.
Figure 8.4 (a) A compact camera array. (b) Raw images from the array. Note the need for geometric and color calibration.
Figure 8.5 Simulated images from a light field.
Figure 8.6 Fixing the spatial variables (x, y) and varying the directional variables (u, v) gives the directional distribution of the light field. In (a), we see the 4D light field as a 2D multiplexed image. The image in (b) is a detail from (a). Each small square in (b) is the (u, v) slice for a fixed (x, y).
Figure 8.7 Fixing the directional variables (u, v) and varying the spatial variables (x, y) can be interpreted as taking images from a camera array. In (a), we see the 4D light field as another 2D multiplexed image. Image (b) is a detail from (a). Each image in (b) is the (x, y) slice for a fixed (u, v).
Figure 8.8 Fixing one directional variable (u or v) and one spatial variable (x or y) generates another 2D multiplexed image. This is the epipolar image, which gives information about the depth of the elements in the scene. In (b), we see a detail from (a). Note the different orientations corresponding to different depths in the scene.
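The three kinds of 2D slices described in Figures 8.6-8.8 correspond to different index patterns on a 4D array. A toy sketch with random data and an assumed L[x, y, u, v] ordering (real plenoptic data would be decoded from the sensor):

```python
import numpy as np

# Toy 4D light field L[x, y, u, v]; random data is used only to show the indexing.
X, Y, U, V = 32, 32, 8, 8
L = np.random.rand(X, Y, U, V)

directional_slice = L[10, 12, :, :]   # fix (x, y): directional distribution (Figure 8.6)
spatial_slice     = L[:, :, 3, 4]     # fix (u, v): one view of a virtual camera array (Figure 8.7)
epipolar_image    = L[:, 15, :, 5]    # fix y and v: x-u epipolar image (Figure 8.8)
print(directional_slice.shape, spatial_slice.shape, epipolar_image.shape)
```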
Figure 8.9 Optical configuration of the plenoptic camera (not shown at scale).
Figure 8.10 Ray transform in the plenoptic camera.
Figure 8.11 Sampling geometry of the plenoptic camera in L1. Each vertical column corresponds to a microlens. Each square corresponds to a pixel of the image sensor.
Figure 8.12 (a) Light field inside the camera. Refocusing is equivalent to integration along lines with a given x–u slope. (b) Equivalent light field with the x–u slope after virtually moving the focusing plane.
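In flatland notation, the refocusing operation illustrated here is commonly written as a shift-and-add integral over the aperture coordinate; the symbols below are assumed for this sketch and need not match the chapter's own equations.

```latex
% With L(x,u) the light field between the lens plane (coordinate u) and the
% sensor plane (coordinate x), the image refocused on a virtual plane at
% relative depth \alpha is obtained by integrating along sheared lines:
E_{\alpha}(x) \;\propto\; \int L\!\left(u + \frac{x - u}{\alpha},\, u\right)\, \mathrm{d}u .
```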
Figure 8.13 Fourier slice theorem. (a) Spatial integration. (b) Fourier slicing.
Figure 8.14 Image refocusing at several distances using the discrete focal stack transform [39].
Figure 8.15 Geometry of depth estimation with a plenoptic camera.
Figure 8.16 Depth estimation and extended depth of field images.
Figure 8.17 (a) Superresolution photographs from a plenoptic camera. (b) Comparison of superresolution (left) against bilinear interpolation (right).
Figure 8.18 Geometry of a generalized plenoptic camera. Microlenses are now placed at distance βb from the sensor and focused on a plane at distance Bα from the main lens. Standard plenoptic camera corresponds to β = 1, α = 0.
Figure 8.19 Sampling geometry of the generalized plenoptic camera. (a) Sampling in L1. Each vertical column corresponds to a microlens. Each sheared square corresponds to a pixel of the image sensor. Pixel slope in the x–u plane is βb/((1 − β)B). (b) Sampling in Lr where the microlens plane is at distance Bα from the main lens and α verifies Eq. (8.23).
Figure 8.20 Refocusing from a generalized plenoptic camera.
Chapter 9: Probabilistic Inference and Bayesian Priors in Visual Perception
Figure 9.1 An elliptical pattern forming on the retina can be either an ellipse viewed upright or a circle viewed at an angle. Image adapted from Ref. [1] with permission.
Figure 9.2 Rubin's vase. The black-and-white image (b) can be seen as a white vase or as two opposing black faces. The image (a) provides the context that primes the visual system to choose one of the interpretations for the ambiguous image: the vase, before it switches to the faces.
Figure 9.3 (a) Expectation that light comes from above. This image is interpreted as one dimple in the middle of bumps. This is consistent with assuming that light comes from the top of the image. Turning the page upside down would lead to the opposite percept. Image adapted from Ref. [9] with permission. (b) Convexity expectation for figure–ground separation. Black regions are most often seen as convex objects in a white background instead of white regions being seen as concave objects in a black background. Image adapted from Ref. [10] with permission. Interplay between contextual and structural expectations. The black object in (c) is typically perceived as a hair dryer because although it has a pistol-like shape (structural expectation), it appears to be in a bathroom (contextual expectation) and we know that hair dryers are typically found in bathrooms (structural expectation). The identical-looking black object in (d) is perceived as a drill because the context implies that the scene is a workshop. Image adapted from Ref. [11] with permission.
Figure 9.4 Influence of image contrast on the estimation of speed. (a) At high contrast, the measurements of the stimulus are precise, and thus lead to a sharp likelihood. Multiplication of the likelihood by the prior distribution leads to a posterior distribution that is similar to the likelihood, only slightly shifted toward the prior distribution. (b) At low contrast, on the other hand, the measurements are noisy and lead to a broad likelihood. Multiplication of the prior by the likelihood thus leads to a greater shift of the posterior distribution toward the prior distribution. This will result in an underestimation of speed at low contrast. Reproduced from Ref. [17] with permission.
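The contrast effect in this figure can be reproduced numerically with Bayes' rule on a grid of candidate speeds. The sketch below assumes Gaussian shapes and illustrative widths for the prior and likelihood; it is not the model fitted in Ref. [17].

```python
import numpy as np

# Minimal numerical sketch of the contrast effect in Figure 9.4: a prior that
# favors slow speeds combined with a likelihood whose width grows as contrast drops.
v = np.linspace(0.0, 20.0, 2001)           # candidate speeds (deg/s)
true_speed = 10.0

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

prior = gauss(v, 0.0, 4.0)                  # "slow speed" prior centered at zero

for contrast, sigma_like in [("high", 1.0), ("low", 4.0)]:
    likelihood = gauss(v, true_speed, sigma_like)
    posterior = prior * likelihood
    posterior /= posterior.sum()
    estimate = (v * posterior).sum()        # posterior mean as the percept
    print(f"{contrast} contrast: perceived speed ~ {estimate:.1f} deg/s")
```

At high contrast the narrow likelihood dominates and the estimate stays near the true speed; at low contrast the broad likelihood lets the slow-speed prior pull the estimate down, reproducing the underestimation described in the caption.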
Figure 9.5 Influence of contrast on the perceived direction of a horizontally moving rhombus. (a). With a high-contrast rhombus the signal-to-noise ratio of the two local measurements (only two are shown for clarity) is high and thus the likelihood for each measurement in velocity space is sharp, tightly concentrated around a straight line, and dominates the prior, which is broader. The resulting posterior is thus mostly dictated by the combination of the likelihoods and favors the veridical direction (because the point where the two likelihood lines intersect has ). (b) With a low-contrast rhombus, the constraint line of the likelihood is fuzzy, that is, the likelihood is broad and the prior exerts greater influence on the posterior, resulting in a posterior that favors an oblique direction. Image adapted from Ref. [15], with permission.
Figure 9.6 Expectations about visual orientations. (a) Example of a natural scene, with strongly oriented locations marked in red. (b) Distribution of orientations in natural scene photographs. (c) Participants viewed arrays of oriented Gabor functions and had to indicate whether the right stimulus was oriented counterclockwise or clockwise relative to the left stimulus. (d). The recovered priors, extracted from participants' performances, here shown for subject S1 and mean subject, are found to match the statistics of orientation in the natural scenes. Reproduced from Ref. [23] with permission.
Chapter 10: From Neuronal Models to Neuronal Dynamics and Image Processing
Figure 10.1 Spikes and postsynaptic potentials. (a) The Figure shows three methods for converting a spike (value one at , that is ) into a postsynaptic potential (PSP). The -function with (Eq. (10.8)) is represented by the gray curve. The result of lowpass filtering the spike once (via Eq. (10.10) with ) is shown by the dashed line: The curve has a sudden rise and a gradual decay. Finally, applying two times the lowpass filter (each with ) to the spike results in the black curve. Thus, a 2-pass lowpass filter can approximate the -function reasonably well. (b) The Figure shows PSPs and output of the model neuron Eq. (10.6) endowed with a spike mechanism. Excitatory (, pale green curve) and inhibitory (, pale red curve) PSPs cause corresponding fluctuations in the membrane potential (black curve). As soon as the membrane potential crosses the threshold (dashed horizontal line), a spike is added to , after which the membrane potential is reset to . The half-wave rectified membrane potential represents the neuron's output. The input to the model neuron were random spikes which were generated according to a Poisson process (with rates and spikes per second, respectively). The random spikes were converted into PSPs via simple lowpass filtering (Eq. (10.10)) with filter memories and , respectively, and weight . The integration method was Crank (not Jack)–Nicolson with step size ms. The rest of the parameters of Eq. (10.6) were , , , and .
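For readers who want to experiment, a generic leaky integrate-and-fire neuron with low-pass-filtered synaptic input can be simulated in a few lines. The parameters below are illustrative and are not those of the chapter's Eq. (10.6).

```python
import numpy as np

# Generic leaky integrate-and-fire neuron with exponentially decaying PSPs,
# integrated with a simple forward-Euler scheme (illustrative parameters only).
rng = np.random.default_rng(2)
dt, T = 0.1e-3, 0.5                       # time step (s), total duration (s)
steps = int(T / dt)
tau_m, tau_syn = 20e-3, 5e-3              # membrane and synaptic time constants
v_rest, v_reset, v_thresh = 0.0, 0.0, 1.0
w_exc, rate_exc = 0.3, 800.0              # synaptic weight and Poisson input rate (Hz)

v, g = v_rest, 0.0
spike_times = []
for i in range(steps):
    # Poisson input spikes are low-pass filtered into a synaptic PSP variable g.
    if rng.random() < rate_exc * dt:
        g += w_exc
    g -= dt * g / tau_syn
    # Membrane equation: leak toward rest plus synaptic drive.
    v += dt * (-(v - v_rest) + g) / tau_m
    if v >= v_thresh:                      # threshold crossing emits a spike
        spike_times.append(i * dt)
        v = v_reset                        # reset after the spike
print(f"{len(spike_times)} spikes in {T:.1f} s")
```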
Figure 10.2 Test images. All images have rows and columns. (a) The upper and the lower grating ("inducers") are separated by a small stripe, which is called the test stripe. Although the test stripe has the same luminance throughout, humans perceive a wavelike pattern with brightness opposite to the inducers, that is, where the inducers are white, the test stripe appears darker and vice versa. (b) When the inducer gratings have an opposite phase (i.e., white stands vis-à-vis black), then the illusory luminance variation across the test stripe is weak or absent. (c) A real-world image or photograph ("camera"). (d) A luminance staircase, which is used to illustrate afterimages in Figure 10.4(c) and (d).
Figure 10.3 Simulation of grating induction. Does the dynamic retina (Eqs. (10.11) and (10.12)) predict the illusory luminance variation across the test stripe (the small stripe which separates the two gratings in Figure 10.2(a) and (b))? (a) Here, the image of Figure 10.2(a) was assigned to . The plot shows the temporal evolution of the horizontal line centered at the test stripe, that is, it shows all columns of for the fixed row number at different instances in time . Time increases toward the background. If values (ON-responses) are interpreted as brightness, and values (OFF-responses) as darkness, then the wave pattern adequately predicts the grating induction effect. (b) If the image of Figure 10.2(b) is assigned to (where human observers usually do not perceive grating induction), then the wavelike pattern will have twice the frequency of the inducer gratings, and moreover a strongly reduced amplitude. Thus, the dynamic retina correctly predicts a greatly reduced brightness (and darkness) modulation across the test stripe.
Figure 10.4 Snapshots of the dynamic retina. (a) This is the ON-response (Eq. (10.11)) after iterations of the dynamic retina (Eqs. (10.11) and (10.12)), where the image of Figure 10.2(c) was assigned to . Darker gray levels indicate higher values of . (b) Here the corresponding OFF-responses () are shown, where darker gray levels indicate higher OFF-responses. (c) Until , the luminance staircase (Figure 10.2(d)) was assigned to . Then the image was replaced by the image of the camera man. This simulates a retinal saccade. As a consequence, a ghost image of the luminance staircase is visible in both ON- and OFF-responses (approximately until ). From then on, the ON- and OFF-responses are indistinguishable from (a) and (b). Here, brighter gray levels indicate higher values of . (d) Corresponding OFF-responses to (c). Again, brighter gray levels indicate higher values of . All simulations were performed with filter memory constants , , and diffusion coefficient .
Figure 10.5 Texture segregation. Illustration of processing a grayscale image with the texture system. (a) Input image “Lena” with pixels and superimposed numbers. (b) The output of the retina (Eq. (10.13)). ON-activity is white, while OFF is black. (c) The analysis of the retinal image proceeds along four orientation channels. The image shows an intermediate result after summing across the four orientations (texture brightness in white, texture darkness in black). After this, a local WTA-competition suppresses residual features that are not desired, leading to the texture representation. (d) This is the texture representation of the input image and represents the final output of the texture system. As before, texture brightness is white and texture darkness is black.
Figure 10.6 Video sequences showing object approaches. Two video sequences which served as input to Eq. (10.14). (a) Representative frames of a video where a car drives toward a stationary observer. Except for some camera shake, there is no background motion present in this video. The car does not actually collide with the observer. (b) The video frames show a car (representing the observer) driving against a static obstacle. This sequence implies background motion. Here the observer actually collides with the balloon car, which flies through the air after the impact.
Figure 10.7 Simulated LGMD responses. Both panels show the rectified LGMD activities (gray curves; label "ON-LGMD") and (black curves; label "OFF-LGMD") as computed by Eq. (10.17). LGMD activities are one-dimensional signals that vary with time (the abscissa shows the frame number instead of time). (a) Responses to the video shown in Figure 10.6(a). The observer does not move and no background motion is generated. Both LGMD responses peak before collision would occur. (b) Responses to the video shown in Figure 10.6(b). Here the observer moves, and the resulting background motion causes spurious LGMD activity with small amplitude before collision. The collision time is indicated by the dashed vertical line. ON-LGMD activity peaks a few frames before collision. OFF-LGMD activity after collision is generated by the balloon car swirling through the air while it is moving away from the observer.
Chapter 11: Computational Models of Visual Attention and Applications
Figure 11.1 The brain as a computer, an unrealistic but convenient hypothesis.
Figure 11.2 Taxonomy of a computational model of visual attention. Courtesy of Borji and Itti [2].
Figure 11.3 (a) Original pictures. (b)–(f) Predicted saliency maps. AIM: attention based on information maximization [14]; AWS: adaptive whitening saliency model [22]; GBVS: graph-based visual saliency [23]; RARE2012: model based on the rarity concept [13].
Figure 11.4 Architecture of Itti et al.'s model. The input picture is decomposed into three independent channels (colors, intensity, and orientation) representing early visual features. A Gaussian pyramid is computed on each channel. Center–surround differences and across-scale combinations are applied on the pyramid's scales to infer the saliency map.
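The center-surround, across-scale mechanism can be sketched in a few lines for a single intensity channel. This is a deliberately stripped-down illustration, not Itti et al.'s full model, which also includes color and orientation channels and a normalization operator.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

# Intensity-only sketch of center-surround saliency: build a Gaussian pyramid,
# take across-scale differences, and sum them into a crude saliency map.
rng = np.random.default_rng(3)
image = gaussian_filter(rng.random((256, 256)), 3)     # stand-in for an intensity image

# Gaussian pyramid: blur then downsample by two at each level.
pyramid = [image]
for _ in range(5):
    pyramid.append(gaussian_filter(pyramid[-1], 1.0)[::2, ::2])

saliency = np.zeros_like(image)
for c in (1, 2):                # "center" scales
    for delta in (2, 3):        # "surround" scale = center + delta
        s = c + delta
        center = zoom(pyramid[c], image.shape[0] / pyramid[c].shape[0], order=1)
        surround = zoom(pyramid[s], image.shape[0] / pyramid[s].shape[0], order=1)
        saliency += np.abs(center - surround)           # center-surround difference

saliency /= saliency.max()
print("saliency map:", saliency.shape, "max =", saliency.max())
```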
Figure 11.5 Example of feature maps and saliency map computed by Itti's model. (a) Input image; (b)–(d) represent the color, intensity, and orientation feature maps, respectively; (e) saliency map; (f) and (g) represent the visual scanpath with two and five fixations, respectively.
Figure 11.6 Architecture of Le Meur et al.'s model. The input picture is decomposed into one achromatic (A) and two chromatic components. The Fourier transform is then used to encode all of the spatial frequencies present in these three components. Several filters are eventually applied on the magnitude spectrum to get the saliency map.
Figure 11.7 Projection of the input color image into an opponent-color space. (a) Input image; (b) achromatic component; (c) channel (reddish-greenish); (d) channel (bluish-yellowish).
Figure 11.8 Normalization and perceptual subband decomposition in the Fourier domain: (a) anisotropic CSF proposed by Ref. [26]. This CSF is applied on the achromatic component. The inset represents the CSF weights in the Fourier domain (the center of the image represents the lowest radial frequencies, whereas the four corners indicate the highest radial frequencies). (b) The amplitude spectrum of the achromatic component is decomposed into 17 subbands [27].
Figure 11.9 Achromatic subbands of the achromatic component of image illustrated on Figure 11.7. Only the subbands corresponding to the first crown (a) and the second crown (b) are shown.
Figure 11.10 Examples of saliency maps predicted by the proposed model. (a) Original images. (b) Predicted saliency maps.
Figure 11.11 (a) Original pictures; (b) fixation map (a green circle represents the first fixation of observers); (c) saliency map; and (d) heat map. From top to bottom, the memorability score is , , , and , respectively (from a low to high memorability).
Chapter 12: Visual Motion Processing and Human Tracking Behavior
Figure 12.1 Smooth pursuit's account for the dynamic solution of motion ambiguity and motion prediction. (a) A tilted bar translating horizontally in time (left panel) carries both ambiguous 1D motion cues (middle panel), and nonambiguous 2D motion cues (rightmost panel). (b) Example of average horizontal () and vertical () smooth pursuit eye velocity while tracking a vertical (left) or a tilted bar (right) translating horizontally, either to the right (red curves) or to the left (green curves). Velocity curves are aligned on smooth pursuit onset. (c) Schematic description of a trial in the experiment on anticipatory smooth pursuit: after a fixation display, a fixed duration blank precedes the motion onset of a tilted line moving rightward (with probability ) or leftward (with probability ). (d) Example of average horizontal () and vertical () smooth pursuit eye velocity in the anticipation experiment for two predictability conditions, (unpredictable, black curves) and (completely predictable, gray curves).
Figure 12.2 Examples of human smooth pursuit traces (one different participant on each column, a naive one on the left and a non-naive one on the right side) during horizontal motion of a tilted bar which is transiently blanked during steady-state pursuit. (a) and (b): Average horizontal () and vertical () eye velocity. Different blanking conditions are depicted by different colors, as from the Figure legend. The vertical dashed line indicates the blank onset; vertical full colored lines indicate the end of the blanking epoch for each blanking duration represented. (c) and (e) Zoom on the aperture-induced bias of vertical eye velocity at target motion onset, for all blanking conditions. (d) and (f) Zoom on the aperture-induced bias of vertical eye velocity at target reappearance after blank (time is shifted so that 0 corresponds to blank offset), for all blanking conditions.
Figure 12.3 A Bayesian recurrent module for the aperture problem and its dynamic solution. (a) the prior and the two independent 1D (b) and 2D (c) likelihood functions (for a tilted line moving rightward at 5/s) in the velocity space are multiplied to obtain the posterior velocity distribution (d). The inferred image motion is estimated as the velocity corresponding to the posterior maximum (MAP). Probability density functions are color-coded, such that dark red corresponds to the highest probability and dark blue to the lowest one.
Figure 12.4 Two-stage hierarchical Bayesian model for human smooth pursuit in the blanking experiment. The retinal recurrent loop (a) is the same as in Figure 12.3, with the additional inclusion of physiological delays. The posterior from the retinal recurrent loop and prior from the extra-retinal Bayesian network (b) are combined to form the postsensory output (). The maximum a posteriori of the probability () of target velocity in space serves as an input to both the positive feedback system and the oculomotor plant (c). The output of the oculomotor plant is subtracted from the target velocity to form the image's retinal velocity (physical feedback loop shown as broken line). During the transient blank when there is no target on the retina, the physical feedback loop is not functional so that the retinal recurrent block does not decode any motion. The output of the positive feedback system (shown by the broken line) is added to the postsensory output () only when the physical feedback loop is functional. The probability distribution of target velocity in space () is provided as an input to the extra-retinal recurrent Bayesian network where it is combined with a prior to obtain a posterior which is used to update the prior.
Figure 12.5 This Figure reports the simulation of smooth pursuit when the target motion is hemi-sinusoidal, as would happen for a pendulum that would be stopped at each half cycle left of the vertical (broken black lines in panel (d). We report the horizontal excursions of oculomotor angle in retinal space (a), (b) and the angular position of the target in an intrinsic frame of reference (visual space), (c), (d). Panel (d) shows the true value of the displacement in visual space (broken black lines) and the action (blue line) which is responsible for oculomotor displacements. Panel (a) shows in retinal space the predicted sensory input (colored lines) and sensory prediction errors (dotted red lines) along with the true values (broken black lines). The latter is effectively the distance of the target from the center of gaze and reports the spatial lag of the target that is being followed (solid red line). One can see clearly the initial displacement of the target that is suppressed after a few hundred milliseconds. The sensory predictions are based upon the conditional expectations of hidden oculomotor (blue line) and target (red line) angular displacements shown in panel (b). The gray regions correspond to 90% Bayesian confidence intervals and the broken lines show the true values of these hidden states. The generative model used here has been equipped with a second hierarchical level that contains hidden states, modeling latent periodic behavior of the (hidden) causes of target motion (states not shown here). The hidden cause of these displacements is shown with its conditional expectation in panel (c). The true cause and action are shown in panel (d). The action (blue line) is responsible for oculomotor displacements and is driven by the proprioceptive prediction errors.
Figure 12.6 A lateral view of the macaque cortex. The neural network corresponding to our hierarchical Bayesian model of smooth pursuit is made of three main corticocortical loops. The first loop between primary visual cortex (V1) and the mediotemporal (MT) area computes image motion and infers the optimal low-level solution for object motion direction and speed. Its main output is the medio-superior temporal (MST) area that acts as a gear between the sensory loop and the object motion computation loop. Retinal and extra-retinal signals are integrated in both MST and FEF areas. Such dynamical integration computes the perceived trajectory of the moving object and implements an online prediction that can be used on the course of a tracking eye movement to compensate for target perturbation such as transient blanking. FEF and MST area signals are sent to the supplementary eye field (SEF) and the interconnected prefrontal cortical areas. This third loop can elaborate a motion memory of the target trajectory and is interfaced with higher cognitive processes such as cue instruction or reinforcement learning. It also implements off-line predictions that can be used across trials, in particular to drive anticipatory responses to highly predictable targets.
Chapter 13: Cortical Networks of Visual Recognition
Figure 13.1 Basic organization of the visual cortex. Arrows illustrate the connections between various specialized subpopulations of neurons. Green and blue arrows indicate feedback and feedforward connections, respectively. The lateral geniculate nucleus (LGN), an extracortical area, is represented in gray.
Figure 13.2 Receptive field organization. Neurons along the visual pathway from V1 to IT have receptive fields which vary in size and complexity.
Figure 13.3 Primary visual cortex organization. (a) Hypercolumn structure showing the orientation and ocular dominance axes (image reproduced from Ref. [40]). (b) Idealized crystal pinwheel organization of hypercolumns in visual area V1 (image taken from Ref. [14]). (c) Brain imaging of visual area V1 of the tree shrew showing the pinwheel organization of orientation columns (image adapted from Ref. [39]).
Figure 13.4 One-dimensional simple cell profiles. The curves shown are the graphs of local operators modeling the sensitivity profile of neurons in the primary visual cortex. The Gabor filter and the Gaussian derivative are local approximations of each other and become asymptotically equivalent for high orders of differentiation.
Figure 13.5 Two-dimensional simple cell profiles. The Figure illustrates two-dimensional Gabor filters (a) and first-order Gaussian derivatives (b) at various scales and orientations. The colors can be related to those in Figure 13.3.
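As an illustration of these two receptive-field models, the sketch below builds a two-dimensional Gabor filter and a first-order Gaussian derivative at a given scale and orientation and compares them numerically; all parameter values (size, sigma, wavelength) are arbitrary choices for illustration, not the chapter's.

import numpy as np

def grid(size):
    r = np.arange(size) - size // 2
    return np.meshgrid(r, r)

def gabor(size=31, sigma=4.0, wavelength=8.0, theta=0.0, phase=0.0):
    x, y = grid(size)
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + yr**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / wavelength + phase)
    return envelope * carrier

def gaussian_derivative(size=31, sigma=4.0, theta=0.0):
    x, y = grid(size)
    xr = x * np.cos(theta) + y * np.sin(theta)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return -xr / sigma**2 * g                    # first derivative along xr

# Gaussian derivatives locally approximate Gabor filters (the match improves
# with the order of differentiation); compare a first-order derivative with
# an odd-symmetric Gabor:
g1, g2 = gabor(phase=np.pi / 2), gaussian_derivative()
g1, g2 = g1 / np.linalg.norm(g1), g2 / np.linalg.norm(g2)
print("correlation:", float(np.sum(g1 * g2)))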
Figure 13.6 Complex cells. (a) The simple cell responds selectively to a local spatial derivative (blue arrow). (b) The complex cell responds selectively to the same spatial pattern, but displays invariance to its exact position. Complex cells are believed to gain their invariance by integrating (summing) over simple cells of the same selectivity.
Figure 13.7 Multilayer architecture. The cortical visual pathway is usually modeled by multilayer networks. Synaptic connections in the early stages, such as V1 (first layer), are usually sensitive to local changes in spatial orientation and frequency. The nature of the connections in the later stages is an active research topic.
Figure 13.8 Sublayers in HMAX. The Figure illustrates cross sections of the orientation hypercolumns in V1. Each cross section selects one simple cell per position, thereby defining a vector field. The first layer of HMAX is composed of sublayers defining translation fields (that is, the same orientation at all positions). By pooling over translation fields at various orientations, the HMAX model displays tolerance to local translations, which results in degrees of invariance to shape distortions. For illustration, the Figure shows the image of the word "Invariance" with its components locally translated. By pooling the local maximum values of the translation fields mapped onto the image, the HMAX model tolerates these local translations and produces a representation that is invariant to them.
Figure 13.9 General HMAX network. The network alternates layers of feature mapping (convolution) and layers of feature pooling (MAX function). The convolution layers generate specific feature information, whereas the pooling layers result in degrees of invariance by relaxing the configuration of these features.
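The alternation described here can be sketched in a few lines: an S ("simple") stage convolves the image with a bank of oriented filters and a C ("complex") stage pools the maximum response over a local neighbourhood, trading exact position for invariance. The filter bank and pooling size below are illustrative assumptions, not the published HMAX parameters.

import numpy as np
from scipy.signal import convolve2d

def oriented_bank(size=9, sigma=2.0, n_orient=4):
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    bank = []
    for theta in np.pi * np.arange(n_orient) / n_orient:
        xr = x * np.cos(theta) + y * np.sin(theta)
        g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
        bank.append(-xr / sigma**2 * g)          # oriented derivative filter
    return bank

def s_layer(image, bank):
    # Simple cells: linear filtering followed by rectification.
    return [np.abs(convolve2d(image, k, mode="same")) for k in bank]

def c_layer(maps, pool=8):
    # Complex cells: MAX pooling over non-overlapping pool x pool blocks.
    out = []
    for m in maps:
        h, w = (m.shape[0] // pool) * pool, (m.shape[1] // pool) * pool
        blocks = m[:h, :w].reshape(h // pool, pool, w // pool, pool)
        out.append(blocks.max(axis=(1, 3)))
    return out

image = np.random.rand(64, 64)                   # placeholder input
c1 = c_layer(s_layer(image, oriented_bank()))
print([m.shape for m in c1])                     # 4 orientation maps, 8x8 each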
Figure 13.10 Lateral inhibition filter. An inhibitory surround and an excitatory center.
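A common way to realize such a center-surround profile is a difference-of-Gaussians kernel: a narrow excitatory center minus a broader inhibitory surround. The sketch below is illustrative (the sigmas and kernel size are assumptions, not values from the chapter).

import numpy as np

def dog_kernel(size=15, sigma_center=1.0, sigma_surround=3.0):
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    d2 = x**2 + y**2
    center = np.exp(-d2 / (2 * sigma_center**2)) / (2 * np.pi * sigma_center**2)
    surround = np.exp(-d2 / (2 * sigma_surround**2)) / (2 * np.pi * sigma_surround**2)
    k = center - surround
    return k - k.mean()        # zero-mean: uniform regions give no response

kernel = dog_kernel()
print(kernel[7, 7] > 0, kernel[0, 7] < 0)   # excitatory center, inhibitory edge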
Figure 13.11 V4-type cells. On layer 3, the cells represent configurations of complex cells, of various orientations and scales, sampled during training. The ellipses represent the receptive fields of simple cells on the first layer.
Figure 13.12 Multiresolution pooling. The spatial position and scale of each filter are stored during training. For each new image, concentric pooling regions are centered on this coordinate. The maximum value is pooled from each pooling radius and across scales. This ensures that some spatial and scale information is kept in the final representation.
Figure 13.13 Contour completion. Understanding the principles by which the brain is able to spontaneously generate or complete visual contours sheds light on the importance of top-down and lateral processes in shape recognition (image reproduced from Ref. [66]).
Figure 13.14 A fiber bundle abstraction of V1. Contour elements (dotted line) on the retinal plane are lifted into the cortical fiber bundle by making contact with simple cells (figure inspired by Refs [16, 40]).
Figure 13.15 Family of integral curves in V1. The Figure shows admissible V1 curves projected (in blue) on the retinal plane over one hypercolumn (image adapted from Ref. [40]).
Figure 13.16 Horizontal connections. The fan of integral curves of the vector fields defined by Eq. (13.10) gives a connection between individual hypercolumns. It connects local tangent vectors at each hypercolumn into global curves (image adapted from Ref. [40]).
Figure 13.17 Association field. Long-range horizontal connections are the basis for the association field in Ref. [65]. The center of the Figure represents a simple cell in V1. The curve displayed represents the visual path defined by the horizontal connections to other simple cells. The possible contours are shown to correspond to the family of integral curves as defined in Section 13.7, where the center cell gives the initial condition.
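The "fan of integral curves" idea behind Figures 13.15-13.17 can be sketched with the standard contact-structure model of V1, in which a point (x, y, theta) advances along its preferred orientation while theta drifts at a constant rate k. This follows the common Citti-Sarti/Petitot formulation and is not necessarily identical to Eq. (13.10) in the chapter; all numerical values are illustrative.

import numpy as np

def integral_curve(x0=0.0, y0=0.0, theta0=0.0, k=0.2, steps=200, dt=0.05):
    xs, ys, thetas = [x0], [y0], [theta0]
    for _ in range(steps):
        x, y, theta = xs[-1], ys[-1], thetas[-1]
        xs.append(x + dt * np.cos(theta))   # advance along the orientation
        ys.append(y + dt * np.sin(theta))
        thetas.append(theta + dt * k)       # constant angular drift
    return np.array(xs), np.array(ys)

# The association field is then a fan of such curves sharing the same initial
# condition (the center cell) but different curvatures k.
fan = {k: integral_curve(k=k) for k in np.linspace(-0.6, 0.6, 7)}
print({round(float(k), 2): (round(float(xs[-1]), 2), round(float(ys[-1]), 2))
       for k, (xs, ys) in fan.items()})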
Figure 13.18 Temporal association. By keeping a temporal trace of recent activation, the trace learning rule enables neurons to correlate their activation with a transformation sequence passing through their receptive fields. This generates a neural response that is stable, or invariant, to transformations of objects inside their receptive fields.
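A minimal sketch of a trace rule in this spirit is given below: each neuron keeps a decaying trace of its recent activity and strengthens weights toward inputs that arrive while the trace is high, so that temporally contiguous views of an object come to drive the same neuron. The constants eta and alpha and the random "transformation sequence" are illustrative assumptions.

import numpy as np

def trace_learning(inputs, w, eta=0.2, alpha=0.05):
    # inputs: (T, D) sequence of input vectors; w: (D,) weight vector.
    trace = 0.0
    for x in inputs:
        y = float(w @ x)                     # postsynaptic activation
        trace = (1 - eta) * y + eta * trace  # temporal trace of activity
        w = w + alpha * trace * x            # Hebbian update gated by the trace
    return w

rng = np.random.default_rng(0)
sequence = rng.random((50, 16))              # stand-in for a transformation sequence
w = trace_learning(sequence, rng.random(16) * 0.1)
print(w.shape, float(w.mean()))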
Chapter 14: Sparse Models for Computer Vision
Figure 14.1 Learning a sparse code using sparse Hebbian learning. (a) We show the results at convergence (20,000 learning steps) of a sparse model with an unsupervised learning algorithm which progressively optimizes the relative number of active (nonzero) coefficients (the L0 pseudo-norm) [29]. Filters of the same size as the image patches are presented in a matrix (separated by a black border). Note that their position in the matrix is arbitrary, as in ICA. These results show that sparseness induces the emergence of edge-like receptive fields similar to those observed in the primary visual area of primates. (b) We show the probability distribution function of sparse coefficients obtained by our method compared with Ref. [21], first with random dictionaries (respectively, "ssc-init" and "cg-init") and second with the dictionaries obtained after convergence of the respective learning schemes (respectively, "ssc" and "cg"). At convergence, sparse coefficients are more sparsely distributed than initially, with more kurtotic probability distribution functions for "ssc" in both cases, as can be seen in the longer tails of the distributions. (c) We evaluate the coding efficiency of both methods, with or without cooperative homeostasis, by plotting the average residual error (L2 norm) as a function of the L0 pseudo-norm. This provides a measure of the coding efficiency of each dictionary over the set of image patches (error bars represent one standard deviation). The best results are those providing a lower error for a given sparseness (better compression) or a lower sparseness for the same error.
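The following condensed sketch illustrates the flavour of sparse Hebbian learning described here: patches are encoded greedily with matching pursuit (keeping only a few nonzero coefficients, i.e., a low L0 pseudo-norm), and the selected atoms are then nudged toward the residual (a Hebbian update). Patch size, number of atoms, sparsity, and learning rate are illustrative and the random patches stand in for natural image patches; this is not the chapter's exact algorithm.

import numpy as np

def matching_pursuit(x, D, n_active=5):
    coeffs, residual = np.zeros(D.shape[1]), x.copy()
    for _ in range(n_active):
        corr = D.T @ residual
        i = np.argmax(np.abs(corr))          # most correlated atom
        coeffs[i] += corr[i]
        residual -= corr[i] * D[:, i]
    return coeffs, residual

def learn(patches, n_atoms=64, n_active=5, lr=0.02, n_steps=2000):
    rng = np.random.default_rng(0)
    D = rng.standard_normal((patches.shape[1], n_atoms))
    D /= np.linalg.norm(D, axis=0)           # unit-norm dictionary atoms
    for _ in range(n_steps):
        x = patches[rng.integers(len(patches))]
        coeffs, residual = matching_pursuit(x, D, n_active)
        for i in np.flatnonzero(coeffs):     # Hebbian update of the active atoms
            D[:, i] += lr * coeffs[i] * residual
            D[:, i] /= np.linalg.norm(D[:, i])
    return D

patches = np.random.default_rng(1).standard_normal((500, 144))  # stand-in data
D = learn(patches)
print(D.shape)   # with natural image patches, atoms become edge-like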
Figure 14.2 The log-Gabor pyramid. (a) A set of log-Gabor filters, with different orientations in different rows and different phases in different columns. Only one scale is shown here. Note the similarity with Gabor filters. (b) Using this set of filters, one can define a linear representation that is rotation-, scaling-, and translation-invariant. Here we show a tiling of the different scales according to a golden pyramid [43]. The hue gives the orientation, while the value gives the absolute value of the coefficient (white denotes a low coefficient). Note the redundancy of the linear representation, for instance, at different scales.
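A single log-Gabor filter is conveniently defined in the Fourier domain as a Gaussian on the logarithm of the radial frequency multiplied by a Gaussian on orientation; the sketch below follows this standard construction (bandwidth parameters and image size are illustrative assumptions, not the pyramid's actual settings).

import numpy as np

def log_gabor(size=64, f0=0.25, theta0=0.0, sigma_f=0.4, sigma_theta=np.pi / 8):
    fy, fx = np.meshgrid(np.fft.fftfreq(size), np.fft.fftfreq(size), indexing="ij")
    radius = np.sqrt(fx**2 + fy**2)
    radius[0, 0] = 1.0                                  # avoid log(0) at DC
    radial = np.exp(-(np.log(radius / f0) ** 2) / (2 * sigma_f**2))
    radial[0, 0] = 0.0                                  # no DC response
    angle = np.arctan2(fy, fx)
    dtheta = np.angle(np.exp(1j * (angle - theta0)))    # wrapped angular distance
    angular = np.exp(-(dtheta**2) / (2 * sigma_theta**2))
    return radial * angular

# Filtering an image then amounts to a product in the Fourier domain:
image = np.random.rand(64, 64)
G = log_gabor()
response = np.fft.ifft2(np.fft.fft2(image) * G)
print(response.shape, float(np.abs(response).mean()))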
Figure 14.3 SparseLets framework. (a) An example reconstructed image with the list of extracted edges overlaid. As in Ref. [68], edges outside the circle are discarded to avoid artifacts. The parameters of each edge are its position, orientation, scale (length of the bar), and scalar amplitude (transparency), along with its phase (hue). We controlled the quality of the reconstruction from the edge information so that the residual energy falls below a fixed threshold over the whole set of images, a criterion met on average with a fixed budget of edges per image (i.e., a small relative fraction of activated coefficients). (b) Efficiency for different image sizes, as measured by the decrease of the residual energy as a function of the coding cost (relative L0 pseudo-norm). (b, inset) As the size of the images increases, sparseness increases, quantitatively validating our intuition about the sparse positioning of objects in natural images; note that the improvement is not significant beyond a certain image size. The SparseLets framework thus shows that sparse models can be extended to full-scale natural images and that increasing the image size further improves their efficiency (compare the smallest and largest sizes shown).
Figure 14.4 Effect of the filters' parameters on the efficiency of the SparseLets framework. As we tested different parameters for the filters, we measured the gain in efficiency of the algorithm as the ratio of the code length needed to achieve 85% of energy extraction relative to that for the default parameters (white bar). The average is computed on the same database of natural images, and error bars denote the standard deviation of the gain over the database. First, we studied the effect of the bandwidth of the filters in (a) the spatial frequency and (b) the orientation spaces, respectively. The minimum is reached for the default parameters: this shows that the default parameters provide an optimal compromise between the precision of the filters in the frequency and position domains for this database. We may also compare pyramids with different numbers of filters. Indeed, from Eq. (14.4), efficiency (in bits) is equal to the number of selected filters times the coding cost for the address of each edge in the pyramid. We plot here the average gain in efficiency, which shows an optimal compromise for, respectively, (c) the number of orientations and (d) the number of spatial frequencies (scales). Note first that with more than 12 directions, the gain remains stable. Note also that a dyadic scale ratio (that is, a ratio of 2) is efficient, but that other choices, such as using the golden section, prove to be significantly more efficient, although the average gain remains relatively small (below 5%).
Figure 14.5 Histogram equalization. From the edges extracted in the images from the natural scenes database, we computed sequentially (clockwise, from the bottom left): (a) the histogram and (b) the cumulative histogram of edge orientations. This shows that, as reported previously (see, for instance, Ref. [74]), the cardinal axes are overrepresented. This represents a relative inefficiency, as the SparseLets framework a priori represents orientations in a uniform manner. A neuromorphic solution is to use histogram equalization, as first demonstrated in the fly's compound eye by Laughlin [76]. (c) We draw a uniform set of scores on the