Visual attention is a relatively new area of study combining a number of disciplines: artificial neural networks, artificial intelligence, vision science and psychology. The aim is to build computational models similar to human vision in order to solve tough problems for many potential applications, including object recognition, unmanned vehicle navigation, and image and video coding and processing. In this book, the authors provide an up-to-date and highly applied introduction to the topic of visual attention, aiding researchers in creating powerful computer vision systems. Areas covered include the significance of vision research to psychology and computer vision, existing computational visual attention models, the authors' own contributions to visual attention modelling, and applications in various image and video processing tasks.
This book is geared towards graduate students and researchers in neural networks, image processing, machine learning, computer vision, and other areas of biologically inspired model building and applications. The book can also be used by practising engineers looking for techniques involving the application of image coding, video processing, machine vision and brain-like robots to real-world systems. Other students and researchers with interdisciplinary interests will also find this book appealing.
Contents
Cover
Title Page
Copyright
Preface
Part I: Basic Concepts and Theory
Chapter 1: Introduction to Visual Attention
1.1 The Concept of Visual Attention
1.2 Types of Selective Visual Attention
1.3 Change Blindness and Inhibition of Return
1.4 Visual Attention Model Development
1.5 Scope of This Book
References
Chapter 2: Background of Visual Attention – Theory and Experiments
2.1 Human Visual System (HVS)
2.2 Feature Integration Theory (FIT) of Visual Attention
2.3 Guided Search Theory
2.4 Binding Theory Based on Oscillatory Synchrony
2.5 Competition, Normalization and Whitening
2.6 Statistical Signal Processing
References
Part II: Computational Attention Models
Chapter 3: Computational Models in the Spatial Domain
3.1 Baseline Saliency Model for Images
3.2 Modelling for Videos
3.3 Variations and More Details of BS Model
3.4 Graph-based Visual Saliency
3.5 Attention Modelling Based on Information Maximizing
3.6 Discriminant Saliency Based on Centre–Surround
3.7 Saliency Using More Comprehensive Statistics
3.8 Saliency Based on Bayesian Surprise
3.9 Summary
References
Chapter 4: Fast Bottom-up Computational Models in the Spectral Domain
4.1 Frequency Spectrum of Images
4.2 Spectral Residual Approach
4.3 Phase Fourier Transform Approach
4.4 Phase Spectrum of the Quaternion Fourier Transform Approach
4.5 Pulsed Discrete Cosine Transform Approach
4.6 Divisive Normalization Model in the Frequency Domain
4.7 Amplitude Spectrum of Quaternion Fourier Transform (AQFT) Approach
4.8 Modelling from a Bit-stream
4.9 Further Discussions of Frequency Domain Approach
References
Chapter 5: Computational Models for Top-down Visual Attention
5.1 Attention of Population-based Inference
5.2 Hierarchical Object Search with Top-down Instructions
5.3 Computational Model under Top-down Influence
5.4 Attention with Memory of Learning and Amnesic Function
5.5 Top-down Computation in the Visual Attention System: VOCUS
5.6 Hybrid Model of Bottom-up Saliency with Top-down Attention Process
5.7 Top-down Modelling in the Bayesian Framework
5.8 Summary
References
Chapter 6: Validation and Evaluation for Visual Attention Models
6.1 Simple Man-made Visual Patterns
6.2 Human-labelled Images
6.3 Eye-tracking Data
6.4 Quantitative Evaluation
6.5 Quantifying the Performance of a Saliency Model to Human Eye Movement in Static and Dynamic Scenes
6.6 Spearman's Rank Order Correlation with Visual Conspicuity
References
Part III: Applications of Attention Selection Models
Chapter 7: Applications in Computer Vision, Image Retrieval and Robotics
7.1 Object Detection and Recognition in Computer Vision
7.2 Attention Based Object Detection and Recognition in a Natural Scene
7.3 Object Detection and Recognition in Satellite Imagery
7.4 Image Retrieval via Visual Attention
7.5 Applications of Visual Attention in Robots
7.6 Summary
References
Chapter 8: Application of Attention Models in Image Processing
8.1 Attention-modulated Just Noticeable Difference
8.2 Use of Visual Attention in Quality Assessment
8.3 Applications in Image/Video Coding
8.4 Visual Attention for Image Retargeting
8.5 Application in Compressive Sampling
8.6 Summary
References
Part IV: Summary
Chapter 9: Summary, Further Discussions and Conclusions
9.1 Summary
9.2 Further Discussions
9.3 Conclusions
References
Index
This edition first published 2013
© 2013 John Wiley & Sons Singapore Pte. Ltd.
Registered office
John Wiley & Sons Singapore Pte. Ltd., 1 Fusionopolis Walk, #07-01 Solaris South Tower, Singapore 138628
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as expressly permitted by law, without either the prior written permission of the Publisher, or authorization through payment of the appropriate photocopy fee to the Copyright Clearance Center. Requests for permission should be addressed to the Publisher, John Wiley & Sons Singapore Pte. Ltd., 1 Fusionopolis Walk, #07-01 Solaris South Tower, Singapore 138628, tel: 65-66438000, fax: 65-66438008, email: [email protected].
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The Publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book's use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.
Library of Congress Cataloging-in-Publication Data
Zhang, Liming, 1943-
Selective visual attention : computational models and applications / Liming Zhang, Weisi Lin.
pages cm
Includes bibliographical references and index.
ISBN 978-0-470-82812-0 (cloth)
1. Computer vision. 2. Selectivity (Psychology)–Computer simulation. I. Lin, Weisi. II. Title.
TA1634.Z45 2013
006.3'7–dc23
2012042377
ISBN: 978-0-470-82812-0
Preface
Humans perceive the outside world with the information obtained from five sensing organs (ears, eyes, nose, tongue and skin), and human behaviour results from the information processed in the brain. The human brain is the product of evolution in the long process of natural selection and survival, in the course of which the brain, through interaction with the external world and with other species, has evolved into a comprehensive information processing system. The human brain is the most complex and ingenious system that we know of, and no artificial system can compare with it in terms of information processing; the study of the human brain is therefore extremely challenging. Of all the information processing subsystems in the brain, the visual processing system plays the most important role, because more than 70% of outside information comes through the visual sense. Thus, the human visual system (HVS) has been researched biologically more than any other information processing system in the brain, and this has resulted in an independent branch of research. However, before the middle of the twentieth century, most of the research on the HVS was based on qualitative observations and experiments, rather than theoretical or quantitative studies.
On the other hand, researchers in physics, mathematics and information science have long hoped to build a machine to simulate the functions of complex visual processing that the brain has. They are interested in the achievements of brain study on the biological side of the HVS, and have tried to understand how information processing works in the HVS in order to create a brain-like system for engineering applications. The development of artificial intelligence and artificial neural networks for visual pattern recognition is a typical example of simulating brain functions. Researchers with physics and mathematics backgrounds have pushed qualitative biological studies towards quantitative and theoretical ones. Computational neuroscience, biophysics and biomathematics have been developed to simulate brain function at the neuronal level and to describe the brain's function using mathematical equations; this aims at building computational models to fit the recorded data of brain cells. One influential work on visual computational theory was the book Vision, published by Marr in the 1980s, which applied mathematics and physics to visual processing. Although some of the contents and points of view in that book now appear incorrect, its influence in both biological and engineering areas continues to this day. Since then, a good number of models of quantitative visual computing have been suggested and developed.
Selective visual attention is a common human and animal behaviour while objects are being observed in the visual field, and it has long attracted much research in physiology, psychology, neuroscience, biophysics, biomathematics, information science and computer science. Biologists have explored the mechanism of visual attention by observing and analysing experimental data, asking, for example, which part of the brain works for visual attention, how the different visual areas are connected when visual attention happens, and so on. Computational neuroscientists have built computational models to simulate the structure and processing of the HVS that can fit the experimental data of psychology and physiology; these computational models can validate the mechanism of visual attention qualitatively. Also, engineers and information scientists have explored computational means to simulate human vision and tackle tough issues in computer vision and image processing. These experts have also contributed by building computational models that incorporate engineering theories and methodologies. In other words, some mechanisms that are unclear to those studying the brain have been replaced by information processing methods through the use of engineering models. These applications may in turn inspire and help biologists to explore and understand the functions of the brain further.
As mentioned above, visual attention of the HVS is related to multiple disciplines, so research relying on a single discipline alone is difficult. What is more, research on visual attention covers a large span: from pure biology based on observations and qualitative experiments, through theoretical models and quantitative methods, to practical models that combine other methods for more immediate engineering applications. Thus, visual attention modelling is an interdisciplinary field that needs cooperation from experts working in different areas. Obviously, this is not easy, since there are large knowledge gaps among the different disciplines. For example, a biologist cannot express a problem in the nomenclature used by an expert in information science, and vice versa. Furthermore, the investigation strategies and backgrounds of different disciplines differ, which makes it difficult to interpret the findings of different disciplines and for the disciplines to be complementary. More importantly, knowledge in some disciplines (such as biology, physiology and psychology) often concerns a single stimulus (or a few stimuli), while a practical model usually needs to deal simultaneously with a huge number of stimuli.
This book is mainly targeted at researchers, engineers and students of physics, mathematics, information science and computer science who are interested in visual attention and the HVS. It is not easy for colleagues in these disciplines to learn the biological nomenclature, research strategies and implications of findings when reading the books and papers scattered in the literature for the relevant parts of biology and psychology. The purpose of this book therefore is to provide a communication bridge.
The development of visual attention studies has had three phases: biological studies, computational models and then their applications. This book therefore follows these three phases as its three major parts, followed by a summary chapter.
Part I includes two chapters that give the fundamental concepts, experimental facts and theories of biology and psychology, as well as some principles of information science.
To be more specific, the first two chapters of this book introduce the related background including the biological concepts of visual attention, the experimental results of physiology and psychology, the anatomical structure of the HVS and then some important theories in visual attention. In addition, the relevant theories of statistical signal processing are briefly presented.
In Part II, some typical visual attention computational models, related to the concepts and theories presented in Part I, are introduced in Chapters 3, 4 and 5. A large number of computational models have been built in the past few decades, and there are two extreme categories: (1) purely biological models, which simulate the anatomical structure and fit recorded cell data at the neuronal level; (2) pure computer vision models, which are not based on psychological experiments and do not follow biological rules. Biological models are too complex to be used in applications and, more crucially, they do not capture higher-level perception well (obviously perception is not only about cells at the neuronal level), so they cannot tackle practical problems effectively. On the other hand, pure computer vision models lack biological or psychological grounding – though this is not our main emphasis here – and we have already seen that visual attention is closely related to biology and psychology. Therefore, these two extreme categories of models are not considered as the core of this book; instead, we mainly concern ourselves with computational models that have a biological basis and are effective for applications. Chapters 3 and 4 present bottom-up computational models in the spatial and frequency domains, respectively, and Chapter 5 introduces top-down computational models. Chapter 6 presents databases and methods for benchmark testing of different computational models; the performance estimation and benchmarking discussed there provide the means for testing new models, comparing different models and selecting appropriate models in practice.
In this book several typical saliency-map computational models for both bottom-up and top-down processing are presented. Each model either has a biological basis or produces computational results that coincide (at least partly) with biological facts. Bottom-up models in the frequency domain are presented in more detail as a separate chapter (Chapter 4), since they usually have higher computing speed and more easily meet the real-time requirements of engineering applications.
Chapters 7 and 8, in Part III, demonstrate several application examples in two important areas: computer vision and image processing. Overall, this book provides many case studies on how to solve various problems based on both scientific principles and practical requirements.
The summary in Chapter 9 provides the connection between chapters and sections, several controversial issues in visual attention, suggestions for possible future work and some final overall conclusions.
Readers who are interested in visual attention, the HVS and the building of new computational models should read Parts I and II, in order to learn how to build computational models corresponding to biological/psychological facts and how to test and compare one model against others. We suggest that readers who want to use computational visual attention models in their applications read Parts II and III, since several different types of computational models – with some computer code as references – can be found in Part II, while the way to apply visual attention models in different projects is explained in Part III. Readers who hope to do further research on visual attention and its modelling might also read Chapter 9, where some controversial issues in both biology and information science are discussed for further exploration. Of course, readers can select particular chapters or sections for more careful reading according to their requirements, and we suggest that they read the summary in Chapter 9, especially Figures 9.1 and 9.2, for an overview of all the techniques presented in this book.
Finally, we wish to express our gratitude to the many people who, in one way or another, have helped in the process of writing this book. Firstly, we are grateful to the many visual attention researchers in different disciplines, because their original contributions form the foundation of visual attention and its modelling, and therefore make this book possible. We are grateful to John K. Tsotsos, Minhoo Lee, Delian Wang, Nevrez Imamoglu and Manish Narwaria, who provided suggestions or checked some sections or chapters of the book. We appreciate the help of Anmin Liu and Yuming Fang for the inclusion of their research work and for proofreading the related chapters; Anmin Liu also assisted us by obtaining permission to use some figures in the book from the original authors or publishers. We would like to thank the students and staff members of the research groups in the School of Information Science and Technology, Fudan University, China, and the School of Computer Engineering, Nanyang Technological University, Singapore, for their research work on modelling and applications of visual attention and for drawing some figures in this book. The related research has been supported by the National Science Foundation of China (Grant 61071134) and the Singapore Ministry of Education Academic Research Fund (AcRF) Tier 2 (Grant T208B1218).
We are particularly grateful to the Editors, James Murphy who helped to initiate this project, and Clarissa Lim and Shelley Chow who looked through our manuscript and provided useful comments and feedback; Shelley Chow was always patient and supportive, notwithstanding our underestimates of the time and effort required to complete this book.
Liming Zhang
Weisi Lin
Part I
Basic Concepts and Theory
1
Introduction to Visual Attention
1.1 The Concept of Visual Attention
In about 300 BC, the famous philosopher and scientist Aristotle mentioned the concept of visual selective attention, stating that humans cannot perceive two objects simultaneously in one sensory act [1]. Although people generally believe that they take in much information from their rich and colourful world and become conscious of environmental changes, many experiments have shown that human visual ability is overvalued. When people look at a scene as observers, they have the feeling of being able to see all its details. However, when a blank field is inserted between two successive natural scenes with some differences, most observers fail to spot the changes. The reason for this phenomenon is that observers' eyes can only focus attention on a small area of the visual field at a given moment; consequently, only this small area can be observed in detail. When the eyes saccade over the surroundings, they linger at a few places longer, or return to them more often, than at others; the eyes jump from one fixated location to another in a scene by saccades. Some animals, such as quadrumanes, also have this ability of selective visual attention. The areas that the eyes of humans and quadrumanes often gaze at are referred to as fixated regions, and the ignored regions as non-fixated regions.
What is selective attention? A definition given by Corbetta [2] is: ‘The mental ability to select stimuli, responses, memory or thought that are behaviourally relevant among the many others that are behaviourally irrelevant’. Simply put, selective visual attention is one of many properties of human and animal vision that allows them to extract important information from abundant visual inputs.
The phenomena of selective visual attention exist everywhere. We demonstrate some intuitive real-life examples in Figure 1.1 (black and white versions of some colour images): imagine that you visit an extraordinary place for the first time, such as Dream World on the Gold Coast, Australia, or the Louvre in Paris, France. In Dream World, shown in Figure 1.1(a), your eyes will first of all involuntarily gaze at a few persons wearing fancy dress and acting as rabbits with long ears, then shift to other fairy-tale characters near the ‘rabbits’, and continue to the girls dancing on the street, as marked with white circles in Figure 1.1(a). In the Louvre, you will pay attention to the exquisite sculptures, moving from one to another as you pass each showroom (Figure 1.1(b)). But you do not need to visit these special sites to experience selective visual attention, because it is a concomitant of daily life. If a black spider crawls on a white ceiling just above your bed while you are lying down, you will notice it right away. You may first pay attention to red flowers among green leaves and grass (Figure 1.1(c) is a black and white version), or immediately stop walking when a car sweeps rapidly in front of you, since outstanding colour targets (such as red flowers) and moving objects (such as a car) attract your attention. When you are enjoying captivating scenery or an artwork, you do not notice your friend or other things in the area around you.
Figure 1.1 Examples of selective visual attention
Fixation regions also depend on subjective consciousness. Under cues or guidance, selective visual attention becomes more intentional; for example, the intention of identifying an old classmate at the airport among the passenger crowd drives your eyes to shift only to the faces of passengers and to search for a familiar face in your memory, regardless of other factors such as colourful dresses or fancy hairstyles, which might draw more attention in the free case (i.e., without task guidance).
From the above examples we can summarize the following facts.
Firstly, let us consider the case without cues or the guidance of prior knowledge, such as a baby with normal vision looking at the natural world. Early research [3–5] showed that features such as intensity, colour, edge orientation and motion play important roles in visual selective behaviour. Locations with high intensity contrast, fresh colour, object edges and motion always attract more attention than other locations in a scene. This is easy to observe in infants. When a baby opens its eyes to see the world, there are no cues or guidance and no prior knowledge in the brain; the light near the cradle, or the swinging bauble or fancy toy hanging above the baby's head, makes its eyes peer at one of these targets. When we change the position of these targets, the baby's eyes shift accordingly. That means that basic visual features decide eye fixation.
The first ground for deciding which areas of a scene attract the human gaze comes from physiology and anatomy: the primary visual processing in the early visual areas of the brain is carried out by the retina, the lateral geniculate nucleus (LGN) and the V1 area of the visual cortex. A simple cell in the primary visual cortex only responds to stimuli in a restricted region of the visual field called its receptive field (RF). There are many kinds of simple cells in the primary visual cortex, including orientation-tuned cells, chromatic antagonism cells (red–green or yellow–blue), motion-direction detecting cells and so on, which extract various features in the RF and discard useless information. So only significant features of objects in the input scene, such as colour, edges and motion, are extracted and submitted for further processing in the higher-level brain. Research on this issue has been published in the relevant biological and physiological literature [6–8]. In Chapter 2 we explain the visual pathways in physiology and anatomy in greater detail.
Another ground for visual selective behaviour comes from information theory [9–11]. Smooth and well-regulated areas are frequently neglected by our eyes, and positions with maximum information or greater novelty are observed first. A very familiar environment (scene or image) that you occupy every day, such as your home or office, does not interest you, since it is an old, repeated surrounding that is easily predicted. If, someday, a bunch of fresh flowers appears on the desk in your office, the flowers, representing a novel change, attract your attention. Therefore, locations with novelty in an image or video are the eye-fixation areas, because novelty maximizes the information at those locations, or causes surprise. Several statistical criteria that measure information or novelty have been proposed [9–11] to distinguish fixated from non-fixated regions, such as high variance, high self-information or entropy, a large distance between the posterior and prior probability distributions, distinctive higher-order statistics and so on.
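As a toy illustration of these information-based criteria, the following minimal sketch (in Python; it assumes a greyscale image with values in [0, 1] and uses a simple intensity histogram as the feature, so the function and its parameters are illustrative rather than any specific model from [9–11]) scores each pixel by the self-information of its intensity, making rare values salient:

```python
import numpy as np

def self_information_saliency(image, bins=64):
    """Score each pixel by -log p(intensity): rare values are salient."""
    # Estimate the probability of each intensity bin over the whole image.
    hist, edges = np.histogram(image, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    # Map each pixel to the probability of its own intensity bin.
    idx = np.clip(np.digitize(image, edges[1:-1]), 0, bins - 1)
    saliency = -np.log(p[idx] + 1e-12)  # self-information per pixel
    return saliency / saliency.max()

# A bright novel patch on a smooth dark background pops out.
img = np.zeros((64, 64))
img[30:34, 30:34] = 1.0
sal = self_information_saliency(img)  # highest values at the patch
```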
The contrast between centre and surround at a location also influences the attention focus [12, 13]. In the visual field, the prominent or salient part attracts interest first. White sheep on a tract of meadow, or a black spider on a white background, are examples in which the target (sheep or spider) differs from its surroundings, so it stands out. If the sheep stand against a white background, or a black spider crawls across a black background, the target will not be obvious because of the context. Contrast of the centre against its surround, and statistics computed over both centre and surround, have been proposed as measures for attention [12–14].
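The same idea can be sketched for centre–surround contrast. The fragment below assumes SciPy is available and uses arbitrary illustrative window sizes rather than parameters from [12–14]; a location scores highly when the mean of a small centre window differs strongly from the mean of a larger surround window:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def centre_surround_saliency(image, centre=3, surround=15):
    """Contrast of a small centre window against its larger surround."""
    c = uniform_filter(image, size=centre)    # local (centre) mean
    s = uniform_filter(image, size=surround)  # contextual (surround) mean
    return np.abs(c - s)  # large centre-surround difference = salient

# A dark 'spider' on a white 'ceiling' stands out strongly.
ceiling = np.ones((64, 64))
ceiling[20:24, 40:44] = 0.0
sal = centre_surround_saliency(ceiling)
```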
For the cases with task-orientated cues or guidance, the attended areas depend not only on the features, information, context and so on of the observed region, as mentioned above, but also on the subject's intention. In such cases, fixation regions will differ from those without cues or guidance. In addition, prior knowledge also affects attention: an artist pays more attention to artwork, while a gardener viewing the same scene mainly focuses attention on strange flowers, because their respective fields of interest and background knowledge differ.
During recent decades there has been substantial exploration, and many hypotheses and criteria have attempted to predict which areas of a scene may become the fixated regions of human eyes. However, this remains an open issue, because of the complexity of the biological visual system and the diversity of human knowledge and intentions.
Every day our visual system receives a huge amount of information from the surrounding world: data estimated at the order of tens of megabytes falls on the retinas every second [15, 16]. However, the retina does not pick up this information evenly. The centre of the retina, called the fovea, has higher perceptual resolution than regions far from it. In general, people move their eyes to a region of interest in a scene precisely to ensure that the prominent object is projected onto the fovea for detailed examination. Objects projected onto areas other than the fovea are perceived at lower resolution and largely ignored in processing. In fact, 50% of the primary visual cortex is devoted to processing inputs from the centre (fovea) of the visual field [17, 18].
Also, the data processing capacity of the visual pathways in the brain is estimated to be only 40 bits per second [19, 20]. Input data on the order of tens of megabytes are reduced through the retinal fovea and the feature extraction of the LGN and primary cortex V1 in the low-level cortex, and then pass through cortical areas V2–V4 and V5 (middle temporal, or MT) to the high-level cortex. Only very little data (i.e., very little target information) per second can reach memory and be processed in the high-level cortex. Reduction of information redundancy occurs not only in the parallel feature extraction of a scene in the primary visual cortex, but also in serial target cognition [21] along all the visual pathways, including the high-level cortex. Hence, the large amount of input data is effectively decreased. As far back as the age of Aristotle it was found that the high-level cortex cannot simultaneously recognize more than one target located at different positions when a scene or an image is viewed; that is, the limited resources in our brain restrict information processing. The eyes have to shift from one prominent target to another according to the order of attention selection. Even if only a single portrait exists in a scene, the areas of the portrait's eyes and mouth are fixated more times, or for longer intervals. A female portrait and the track of an observer's eye saccades over it in a cue-free case are shown in Figure 1.2. The eyes and mouth, which have complex structure, are frequently scanned, while the cheek areas, without significant information, are not.
Figure 1.2 The track of eye saccades when observing the face of a lady, with no instruction, for 1 minute, from Yarbus (1967, Figure 115) [24]. With kind permission from Springer Science + Business Media: Eye Movements and Vision, © 1967, Yarbus
Selective visual attention resolves the bottleneck of limited resources in the human visual system (HVS) [20, 22, 23]. Only a selected subset of visual inputs is allowed to reach high-level cortical processing. Strictly speaking, then, selective visual attention is the ability to allocate processing resources in the brain so as to focus on the important information in a scene. Owing to selective visual attention, people can effectively deal with a large number of images as visual inputs, without encountering information overflow, while systematically handling many tasks. Selective visual attention plays an important role in biological signal processing. In the literature (as well as in this book), the terms ‘visual attention’ and ‘selective attention’ are sometimes used to refer to selective visual attention.
Studies of visual attention in physiology and psychology have developed over several decades. Biologists have been trying to understand the mechanisms that process perceptual signals in the visual pathways, and thereby to further understand the brains of humans and quadrumanes. Visual attention helps people deal with a mass of input data easily, even though the input amounts to tens of megabytes per second, whereas in computer or robot vision, enormous input images often cause memory overflow. Hence, many scientists and engineers working in computer science, artificial intelligence, image processing and so on have recently engaged in visual attention research that aims to construct computational models simulating selective visual attention for engineering applications. Although the principles of selective attention are not yet biologically clear, and many open issues remain to be explored, these computational models have found good applications in many engineering tasks.
1.2 Types of Selective Visual Attention
In Section 1.1 we showed that visual attention is an ability of humans and quadrumanes, and that it exists universally. Over the past several decades, many researchers, especially physiologists, psychologists and computational neuroscientists, have tried to understand the mechanisms of visual attention. Different types of selective visual attention have been explored and described in the literature from different viewpoints and with different emphases, such as pre-attention and attention [25, 26], bottom-up and top-down attention, voluntary and passive attention, parallel and serial processing of attention in the brain, overt and covert attention and so on. Although these types are related, overlapping or similar, it is beneficial to discuss them for the purpose of understanding visual attention studies and the related theory, since the different types (to be introduced in the rest of this section) in fact reflect different aspects of selective visual attention and are often complementary to each other.
From the signal processing point of view, visual attention is divided into two stages, pre-attention and attention, as proposed by Neisser and Hoffman [25, 26]. The pre-attention stage provides the necessary information for attention processing. For instance, a single feature such as orientation, colour or motion must be detected before the stimulus can be selected for further processing. In this stage, features of both the background and the objects are extracted; however, only those of the objects may attract human attention in the attention stage. From the anatomical structure of the visual pathway, we can see that when an input scene appears, many simple cells of the primary visual cortex extract these simple features from their respective receptive fields by applying different filters. Pre-attentive processing is supported by local processing and is independent of attention. It is an automatic, involuntary process with very high speed, and it works in parallel over multiple features across the visual field.
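As a crude analogue of such feature-extracting cells, the sketch below builds a small bank of orientation-tuned Gabor kernels, a common stand-in for V1 simple-cell receptive fields (the kernel size, bandwidth and wavelength are illustrative assumptions). Convolving an image with each kernel yields one orientation channel, and all channels can be computed in parallel, mirroring the pre-attentive stage:

```python
import numpy as np

def gabor_kernel(theta, size=15, sigma=3.0, wavelength=6.0):
    """An orientation-tuned filter, a crude analogue of a V1 simple cell."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

# One kernel per orientation channel; applied across the whole image,
# this mimics the parallel, pre-attentive extraction of orientation.
bank = [gabor_kernel(t) for t in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]
```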
The attention stage occurs after the pre-attention stage. The region with important information in the input scene is fixated longer and observed in detail. In the attention stage, only one target is processed at a time. This stage may require the integration of many features, and sometimes needs guidance from human experience, intention and knowledge. In most cases, pre-attention yields all the salient information in the visual field as a result of parallel processing, and in the attention stage a selected object is observed first. There is a special case in which the focus is the same in both stages: if a target can already be discriminated in the pre-attention stage – for example, a spotlighted target in a dark room attracts attention rapidly – then that target is always dealt with first in the attention stage.
In summary, pre-attention is an operation based on a single feature such as colour, orientation, motion, curvature, size, depth cues, lustre or aspects of shape. In the pre-attentive stage there is no capacity limitation; that is, all the information is processed across the entire visual field. Once the field of view has been analysed and the features processed, attention is focused. Features are only analysed, not integrated, in the pre-attentive stage.
Attention is an operation of feature integration. In the attentive stage, features may be bound together or the dominant feature may be selected, and a target with several features can then be brought into focus.
Pre-attention and attention are also called vision before attention and vision with attention, respectively, in [5, 27]. Another stage, proposed in [28], is vision after attention, or the post-attentive stage, in which a subject performs further searches among objects of the same group. Search efficiency improves in this stage because the HVS has already attended to the presented objects and is now familiar with them.
Many experimental results favour a two-component framework for the control of attentional deployment [27–30]. This framework suggests that a subject's attention to an input scene arises from both stimulus-driven factors, referred to as bottom-up attention, and task-driven factors, referred to as top-down attention.
Bottom-up attention is based on salient features of the input image such as orientation, colour, intensity and motion. Bottom-up attention in the pre-attention stage (introduced in the previous section) is the outcome of simple feature extraction across the whole visual field and the inhibition between centre and surrounding neurons. Therefore, a highly salient region of the input stimuli can capture the focus of human attention. For example, flashing points of light on a dark night, the sudden motion of an object in a static environment, or red flowers on a green background (the luminance version was shown in Figure 1.1(c)) can involuntarily and automatically attract attention. Bottom-up attention derives from the conspicuousness of areas in the input visual field; it is influenced by exogenous factors, regardless of any task or intention, and is therefore sometimes called stimulus-driven attention. Stimulus-driven attention is believed to be controlled mainly by the early visual areas of the brain. Since the cells in the early visual areas operate on input data in parallel, the response time of bottom-up attention is very fast, of the order of 25–50 ms per item, excluding eye shift time [23].
Top-down attention refers to the set of processes that bias visual perception based on a task or intention. This mechanism is driven by the mental state of the observer or by cues they have received. In the famous top-down attention experiment by Yarbus [24], observers were asked several questions about the scene of a family room, shown in Figure 1.3(a). The tracked positions of the eye saccades vary between the question-free case and the cases with questions. The attention focus of observers differs between the question-free case and a case with a question, and it also differs when a different question is asked about the same scene. Figures 1.3(b)–(d) show the eye-movement results in the question-free case, in the case with a question about the ages of the persons, and in the case with a cue to remember the positions of objects, respectively. The selected regions congregate around the faces of the people when the observers were asked about the persons' ages (Figure 1.3(c)), while the focal regions were around the locations of objects and people when observers were required to remember the positions of objects. Since the saccade diversity depends on tasks or cues, top-down attention is also referred to as task-driven attention. Note that the tasks or cues concern the objects in a scene, and the final selected regions of top-down attention are probably related to the observer's prior knowledge, experience and current goal, which are mostly controlled by the high-level cortex. Therefore, information from higher areas is fed back to influence attention behaviour.
Figure 1.3 The tracks of saccades and fixations by Yarbus (1967) [24]. Each record lasted 3 minutes for (b)–(d): (a) the scene of a family room (source: courtesy of www.liyarepin.org); (b) the saccade track in the question-free case, without any cues; (c) the saccade track with a cue: to answer the ages of family members in the scene; (d) the saccade track with a cue: to remember object and person positions. With kind permission from Springer Science + Business Media: Eye Movements and Vision, © 1967, Yarbus
Bottom-up attention only pops out candidate regions where targets are likely to appear, while top-down attention can pinpoint the exact position of the target. Sometimes top-down attention is not related to bottom-up saliency at all. A tiger in the forest can rapidly capture small animals hidden in brushwood, notwithstanding that there is no prominent bottom-up sign in the area where the animals hide; under experiential guidance from itself and its mother, the tiger can still find its prey. It is thus obvious that top-down attention is more powerful in object search and recognition. Nevertheless, whether forced or voluntary, such attention comes at a price: in general, task-driven attention costs more than 200 ms [23] for a young and inexperienced subject. Learning and knowledge accumulation help to reduce the reaction time of top-down attention.
Commonly, the bottom-up and top-down attention mechanisms operate simultaneously, and it is difficult to distinguish which attended region results from bottom-up processing and which part is influenced by top-down processing. In most situations, the final attentive foci in an image or scene come from both mechanisms. For instance, when freely viewing a scene, different subjects may gaze at different salient regions: knowledge and experience – and even emotion – embedded in the subject's higher brain areas are partly involved in the attention processing. To study the two kinds of attention separately, many psychologists and scientists in computational neuroscience and cognitive science have designed various psychophysical patterns for subjects, to test the reaction time of searching for targets. Some carefully designed image patterns can roughly distinguish between stimulus-driven and task-driven processes [4, 5, 29, 30].
Since the structure and principles of the early visual regions of the brain have been revealed by physiologists [31–34], and the analysis of input stimuli is easier than that of mental states in the higher areas of the brain, a large number of computational models simulating bottom-up attention have been developed. A two-dimensional topographical map that represents the conspicuity of the input stimulus at every location in the visual scene has been proposed in bottom-up attention models [5, 30, 35]. The resultant map for attention is called the ‘activation map’ in Wolfe's model [30] and the ‘saliency map’ in Koch's model [35]. The level in the saliency map reflects the extent of attention: a location with a higher value attracts attention more easily than one with a lower value.
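A minimal sketch of this idea, assuming the individual feature maps have already been computed and ignoring the multi-scale centre–surround machinery of the actual models [30, 35], is to normalize each feature map to a common range and sum them; the maximum of the fused map then marks the first attended location:

```python
import numpy as np

def combine_feature_maps(feature_maps):
    """Fuse feature maps (e.g., intensity, colour, orientation) into one
    saliency map; each map is first normalized to a common [0, 1] range."""
    saliency = np.zeros_like(feature_maps[0], dtype=float)
    for fmap in feature_maps:
        rng = fmap.max() - fmap.min()
        if rng > 0:
            fmap = (fmap - fmap.min()) / rng
        saliency += fmap
    return saliency / len(feature_maps)

# The location with the highest value is attended first:
# y, x = np.unravel_index(np.argmax(saliency_map), saliency_map.shape)
```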
By contrast, only a few computational models of top-down processing have been investigated so far, and usually these models are based on knowledge about the object to be found. Other top-down factors, such as expectations and emotions, are very difficult to control and analyse. Therefore, this book introduces mostly bottom-up computational models and only a few top-down models [36, 37], in which the aspects of expectations and emotions are not investigated.
The essential points of the bottom-up and top-down mechanisms can be summarized as follows. Bottom-up attention is stimulus-driven, involuntary and fast (of the order of 25–50 ms per item); it operates in parallel across the whole visual field and is controlled mainly by the early visual areas. Top-down attention is task-driven, voluntary and slower (more than 200 ms); it depends on the observer's knowledge, experience and intention, and is controlled mainly by the high-level cortex.
It is known that the neurons in our brain are interconnected and work in a massive, collective fashion. Many physiological experiments have revealed that input stimuli projected onto the retina are processed in parallel; in addition, the cells in the primary visual areas work in parallel too, as mentioned for the pre-attention stage. On the other hand, as deduced from the phenomena in Figures 1.2 and 1.3, the focus of our eyes often shifts from one place to another, so search by eye movement is serial. Since parallel processing is faster than serial processing, the reaction time of an object search can be used to test the processing type. Psychological patterns have been proposed to test which searches are parallel and which are serial, according to the reaction times of observers viewing these patterns [4, 30]. Figure 1.4 shows a simple example: the unique object in each of the patterns in Figures 1.4(a)–(d) is located at the centre, to test the reaction time of observers. In the early 1980s, Treisman suggested that the search for a target with a simple feature that stands out or pops up relative to its neighbours (many distractors) must be conducted in parallel, since the search is little affected by variations in the number of distractors [4]. Figures 1.4(a) and (b) demonstrate this simple-feature case: the object (a vertical bar) in the midst of horizontal bars (distractors) pops out very easily and quickly under varying numbers and distributions of distractors.
Figure 1.4 Examples of simple psychological patterns to test reaction time of an observer: (a) and (b) are the cases with a single feature involving parallel processing; (c) and (d) are the cases with conjunction of multiple features involving serial processing
In contrast, the search for a target defined by a combination of more than one feature requires a serial scan over varying numbers of distractors [4]. The examples in Figures 1.4(c) and (d) illustrate this situation. Here the unique object is a combination of two simple features: a cross made up of a horizontal line segment and one tilted at 45°. The surrounding distractors are crosses that each include one line segment different from the object's (vertical, or tilted at 135°) and one segment the same as the object's (horizontal, or tilted at 45°). The detection time for the search object increases with the number of distractors; that is, the pattern of Figure 1.4(d) is more difficult than that of Figure 1.4(c). This means that single features can be detected in parallel, but a combination of two (or more) different features results in a serial scan. Therefore, in a complex scene, eye search is serial by nature.
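The logic of these reaction-time tests can be stated in idealized form. The numbers below are illustrative only, loosely consistent with the 25–50 ms-per-item figure quoted earlier, and are not measured data: parallel pop-out search is flat in the set size, while serial conjunction search grows roughly linearly with it:

```python
import numpy as np

# Idealized reaction-time curves versus set size (number of items).
# Parallel (single-feature) search: flat, the target pops out.
# Serial (conjunction) search: roughly linear growth per item scanned.
set_sizes = np.array([4, 8, 16, 32])
base_rt = 400                               # ms, illustrative base latency
parallel_rt = np.full_like(set_sizes, base_rt)
serial_rt = base_rt + 35 * set_sizes        # ~35 ms per item, illustrative
```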
Figures 1.2 and 1.3 illustrate that the HVS has the ability to select information within a scene, and that the attended location shifts from one place to another. After an interesting region is selected by the attention mechanism, its saliency decreases as its novelty weakens, because an inhibitive signal returns from higher areas, and the next salient location then replaces the current attention focus. This ability of the HVS is called attention shift [35, 38]. Attention shift involves eye movement to positions in the visual field, as shown in Figures 1.2 and 1.3; eye movement typically occurs about 3–5 times per second. Some attention shifts do not depend on eye movement, as is usually observed when viewers attend to an object or event out of the corner of their eye, intentionally using the visual periphery. For example, a scatterbrained student sitting in the classroom uses the periphery of their vision to attend to a bird outside the window while their eyes still face the blackboard. Such attention shifts in the absence of eye movement are frequent. If two objects in a scene need to be attended to at the same time, a viewer has to employ attention without eye movement to track the second object, since the eyes cannot fixate two different locations simultaneously.
We call visual attention associated with eye movement overt attention, and attention shift independent of eye movement covert attention. With covert attention it is not necessary to move the eyes or the head to concentrate on interesting regions, so it is quicker than overt attention. Covert and overt attention can be studied separately, but in most cases the two work together. As eye fixation is easy to observe with measurement equipment, most studies of visual attention so far have been concerned with overt attention.
1.3 Change Blindness and Inhibition of Return
There are two phenomena related to visual attention that people experience every day. One is that a person fails to notice some change in their environment; this phenomenon is called change blindness. The other is that the attention focus never stays at one location for long, and there exists a mechanism that inhibits fixation from returning to the original location. In the following sections we explain these two phenomena.
Change blindness (CB) is defined as the induced failure of observers to detect a change in a visual display [39]. This invisible change often happens in alternating images separated by a blank field (of about 80 ms or more). When two nearly identical scenes with a certain background change appear one after the other, the change can be noticed easily, since the changed location pops out between alternations. However, when a transient blank frame (more than 80 ms) is inserted between the two nearly identical scenes, the change is often not noticed by observers. In another setting, two nearly identical pictures with some changes are displayed side by side, as shown in Figure 1.5. Pointing out the difference between them at a glance is difficult; once the large change in Figure 1.5 is detected, most people are amazed at having failed to notice it. One cause of change blindness is human selective visual attention: changes in unattended regions are ignored, as shown in Figure 1.3, in which many regions of the family picture are never reached by the eye saccade. Figure 1.5 is a similar case, since the change occurs in the unattended background. The other cause is that only one feature or object can be focused on by observers at a transient time. If the alternating images that accompany the change cannot provide location information in a short time, as mentioned above – for example, when a full blank field is inserted between two successive frames with a small difference – then the change cannot be picked up by the HVS.
Figure 1.5 An example of change blindness, by glancing at the two nearly identical pictures. Reproduced with permission from Christopher G. Healey, ‘Perception in Visualization,’ North Carolina State University, http://www.csc.ncsu.edu/faculty/healey/PP/index.html (accessed October 1, 2012)
Why can change blindness be so easily induced? The main reason is that focused attention operates on only one item at a time. In the real world there are many items that can attract observers; if the change between two images is not within the focus of attention, the change information is often swamped. In general, for a complex scene, eye scanning by serial processing needs considerable time to find the change.
Attention shift (both overt and covert) is a phenomenon of the visual system, as mentioned in Section 1.2. After people view a scene, the saliency of the currently selected location can be inhibited, so the fixated location moves to a peripheral location. Why can the attention focus leave the most salient location in the visual field and not come back to it immediately? Weakening of the information novelty over a long period of staring is one reason. Is there another physiological cause for the phenomenon? The first physiological experiment was described in 1984 by Posner and Cohen [40], who measured an inhibitory after-effect at the original location: responses to subsequent stimuli there were delayed [41]. The phenomenon was later called inhibition of return (IoR). Some studies reported that the inhibitory signal may come from the superior colliculus in the brain [41, 42]. Whatever the principle of IoR, the effect of discouraging attention from returning to the originally attended location is very useful for object search in visual fields, and most computational models proposed later have used the IoR mechanism.
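A minimal sketch of how a computational model can combine winner-takes-all selection with IoR (the fixation count and suppression radius below are illustrative assumptions, not parameters from the cited models): repeatedly pick the maximum of the saliency map as the next fixation, then suppress a neighbourhood around it so that attention moves on:

```python
import numpy as np

def scanpath(saliency, n_fix=5, radius=8):
    """Successive fixations via winner-takes-all plus inhibition of return."""
    sal = saliency.astype(float).copy()
    h, w = sal.shape
    yy, xx = np.mgrid[0:h, 0:w]
    fixations = []
    for _ in range(n_fix):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)  # WTA winner
        fixations.append((int(y), int(x)))
        # IoR: suppress a disc around the winner so attention moves on.
        sal[(yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2] = 0.0
    return fixations
```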
1.4 Visual Attention Model Development
The history of research and development of selective visual attention can be divided into three phases. The first phase began in the time of William James, over a century ago; we refer to it as the biological study phase, in which many neurophysiologists and psychologists discovered valuable truths and developed theories of visual attention.
Although Aristotle noticed the attention phenomenon in humans and animals in the ancient era (300 BC), real research on visual attention began with James's book, The Principles of Psychology [21]. The concepts of pre-attention and attention, two-stage models, competition and normalization in the neuronal system, the feature integration theory of attention and so on were proposed and discussed in this first phase [4, 5, 25, 26], together with many psychological and physiological experiments. The theories and methodologies devised in this phase became the basis for building computational attention models later on.
The next phase started in the 1980s. A two-dimensional map (i.e., the saliency map), which encodes the conspicuity of visual stimuli, was put forward by Koch and Ullman in 1985 [35]. Over the past 30 years, various computational models that automatically generate the saliency map, including spatial domain and frequency domain approaches, have been suggested to simulate bottom-up or top-down visual attention [9–13, 43–45]. The phenomena of visual attention found in physiological and psychological experiments were simulated in these computational models. Many scientists and engineers in the areas of computer vision, artificial intelligence and computer engineering participated in the studies of this second phase, so methods for performance measurement and comparison among different models have appeared.
The third phase began at the end of the 1990s, after many computational models had been built. Many applications of visual attention to object detection, image and video coding, image segmentation, quality assessment of image and video and so on have been proposed. It is now known that visual attention plays a central role not only in studies of biological perception but also in computer vision and other engineering areas. It should be noted that although the three phases started at different times in the past, they are all concurrent at this moment, because work on all aspects is still ongoing.
The studies in the first phase of visual attention were based on relevant evidence from psychology and physiology, whose contributions alternated. As mentioned above, this phase started with the book The Principles of Psychology [21], published in 1890 by W. James, who was the first to publish a number of facts related to brain functions and activities. Visual attention was discussed in a chapter of that book: two-component attention and covert attention without eye movement were mentioned there, although they were not yet named and defined at that time.
Over half a century later, in the 1960s, the physiologists Hubel and Wiesel recorded the activities of single cells in the primary visual cortex of cats, and reported that some cells responded preferentially to input stimuli with particular spatial orientations in their receptive fields [6, 31]. Many electrophysiological experiments then showed that some basic neurons in the early visual cortex respond to features in their receptive fields other than orientation, such as colour contrast, motion direction, spatial frequency and so on [7, 32, 33]. This physiological evidence suggests that the visual scene is analysed and selected in the early visual cortex, and that these features are then mapped onto different regions of the brain [34]. In the same decade, in 1967, the Russian biophysicist Yarbus developed a novel set of devices and a related method to accurately record eye movement tracks while observers watched scenes with or without cue guidance [24]. His studies of eye movement have had a significant influence on visual attention research, especially on overt attention. In the same year, other contributions came from psychology; for example, Neisser [25] suggested two-stage attention: pre-attention, a parallel process over the whole visual field at one time, and attention, a limited-capacity process restricted to a smaller area related to the object or event of interest in the visual field at one time. Hoffman then proposed a two-stage processing model (pre-attention and attention stages) in 1975 [26]. As an extension of the two-stage attention concept, 21 years later, Wolfe [28] proposed in his book a post-attention stage to supplement the system after the attention stage.
In the 1980s, the psychologists Treisman and Gelade proposed the feature integration theory of visual attention [4], based on the physiological evidence of single cells' parallel feature extraction from the visual field. How can these separate features of an object in the visual field be combined? Their theory suggests that the features arising from parallel perception need focal attention to be bound into a single object. Several testing paradigms have confirmed the feature integration hypothesis of Treisman and Gelade [4]. They found that searching for a target defined by a single feature is very easy, since it does not need to consider the relations among different features; by contrast, searching for a conjunction of more than one feature is slower, due to the serial search involved. The feature integration theory has become the foundation of subsequent attention studies.
An implementation of the feature integration theory, the guided search model, was proposed by Wolfe et al. in 1989, 1994 and 1996 [5, 28, 30]. In the original guided search model, a parallel process over several separable features is used to guide attention during the search for an object defined by a conjunction of multiple features [5]. In the revised version, the guided search 2.0 model [30], each feature consists of several channels (e.g., red, yellow, green and blue channels for the colour feature), and three features – colour, orientation and others (size or shape) – are synthesized from their respective channels as a bottom-up process. Features are extracted from the visual field in parallel to form three feature maps; information from the top-down process guides the active locations of the feature maps, and the three feature maps are then integrated into a two-dimensional topographic activation map. The attention focus is located at the higher values of the activation map. We will see later that the famous computational model proposed by Koch and Ullman [35] and by Itti et al. [43] is very close to the guided search 2.0 model. Since the simulation results and conclusions of the guided search model are based on psychology, in this book we label it as a model from psychological studies.
Lateral inhibitory interaction among the cells of the early visual system was discovered by physiologists in the 1950s [46]. In the 1960s and 70s, physiologists found that the receptive field of a retinal ganglion cell in cats exhibits central enhancement with surround inhibition (on/off), or the reverse (off/on) [47, 48]. Later experiments in the extrastriate cortex of the macaque found a competitive mechanism in other areas of the visual cortex [49–51]: if a local region of the cortex receives input from two stimuli, the neuronal response in that region is generated via competitive interaction, representing a mutually suppressive effect [52, 53]. The cell with the strongest response can suppress the responses of its surrounding cells, which leads to the winner-takes-all (WTA) strategy. The phenomenon of competition had been discovered earlier in a psychological experiment: when two objects were presented in the visual field, the subject only focused on one object at a time because of the WTA strategy, so competition and attention are consequential to each other. In the 1990s, Desimone discussed the relation between attention and competition: neurons representing different stimulus components compete with each other, and attention operates by biasing the competition in favour of the neurons that encode the attended stimulus [53]. Owing to the competitive nature of visual selection, most attention models are based on WTA networks, such as the one Lee et al. proposed in 1999 [54]: through the neurons' computations, a WTA network selects the location of the winning neuron as the fixated focus [54].
In the same period, a normalization model of attention was proposed, based on the non-linear responses of simple cells in the primary visual cortex [55]. This idea rests on physiological investigations of simple cells in cats [56, 57], which differ from the longstanding view of linear responses. Carandini and Heeger in 1994, and Reynolds and Heeger in 2009, proposed that the non-linear response of a simple cell can be represented as the cell's linear response divided by the pooled activity of all the cells; this is called the normalization model [58, 59]. Normalization can also be explained as a suppressive phenomenon resulting from the inhibition of neighbouring cells. In the late 1980s, stimulus-related neuronal oscillations were discovered in the primary visual cortex of cats and monkeys [60, 61]. These findings supported the hypothesis that neuronal pulse synchronization might be a mechanism for linking local visual features into a coherent global percept. Based on these pulse-synchronizing oscillations, many spiking neural networks were proposed [62–64]. Since spiking neural networks consider the connected context and the pulse transfer between neurons, they can simulate visual attention phenomena well [65, 66].
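The normalization model mentioned above is commonly written in a divisive form along the following lines (notation varies across [58, 59]; the exponent n and semi-saturation constant sigma here follow the usual convention of a driven response divided by a pooled normalization signal):

```latex
% Divisive normalization (after Carandini & Heeger [58] and
% Reynolds & Heeger [59]): the output R_i of cell i is its linear
% (driven) response L_i divided by the pooled activity of all cells
% in the normalization pool; sigma sets the semi-saturation point.
R_i \;=\; \frac{L_i^{\,n}}{\sigma^{\,n} + \sum_{j \in \mathrm{pool}} L_j^{\,n}}
```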
Many physiological and psychological experiments have shown that top-down attention plays a critical role in determining object search [67–69]. In Wolfe's guided search, bottom-up activations are modulated by top-down gains that specify the contribution of a particular feature map to the current task [30]. In the review article by Itti and Koch [23], top-down processing is represented as a hierarchical decision tree that can learn object knowledge, with signals from the decision tree controlling the salient locations in the input visual field. Some studies using single-cell recording have suggested that top-down control signals from the working memory of an object representation can modulate neural responses, so top-down models in which working memory biases selection in favour of the object were proposed [37, 70, 71].
It should be noted that, in this first phase of research and development, the studies mostly aimed to reveal attention phenomena and find the related principles, so most of the models proposed in this phase were formulated in principle, to verify physiological and psychological experiments. Nevertheless, these research results from physiology and psychology are the foundation for building the computational models of visual attention of the next phase. We should also note that this phase is not over yet, since the related research still continues: every new finding on visual attention from physiology and psychology will promote the development of computational models and applications in computer engineering. The reader is encouraged to keep track of the latest progress in the related scientific and technical literature.
