This title concerns the use of a particle filter framework to track objects defined in high-dimensional state-spaces using high-dimensional observation spaces. Current tracking applications require us to consider complex models for objects (articulated objects, multiple objects, multiple fragments, etc.) as well as multiple kinds of information (multiple cameras, multiple modalities, etc.). This book presents some recent research that considers the main bottleneck of particle filtering frameworks (high dimensional state spaces) for tracking in such difficult conditions.
Number of pages: 252
Year of publication: 2015
Contents
Notations
Introduction
1. Visual Tracking by Particle Filtering
1.1. Introduction
1.2. Theoretical models
1.3. Limits and challenges
1.4. Scientific position
1.5. Managing large sizes in particle filtering
1.6. Conclusion
2. Data Representation Models
2.1. Introduction
2.2. Computation of the likelihood function
2.3. Representation of complex information
2.4. Conclusion
3. Tracking Models That Focus on the State Space
3.1. Introduction
3.2. Data association methods for multi-object tracking
3.3. Introducing fuzzy information into the particle filter
3.4. Conjoint estimation of dynamic and static parameters
3.5. Conclusion
4. Models of Tracking by Decomposition of the State Space
4.1. Introduction
4.2. Ranked partitioned sampling
4.3. Weighted partitioning with permutation of sub-particles
4.4. Combinatorial resampling
4.5. Conclusion
5. Research Perspectives in Tracking and Managing Large Spaces
5.1. Tracking for behavioral analysis: toward finer tracking of the “future” and the “now”
5.2. Tracking for event detection: toward a top-down model
5.3. Tracking to measure social interactions
Bibliography
Index
First published 2015 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd
27-37 St George's Road
London SW19 4EU
UK
www.iste.co.uk
John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com
© ISTE Ltd 2015
The rights of Séverine Dubuisson to be identified as the author of this work have been asserted by her in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2014955871
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-84821-603-7
With progress in electronics and microelectronics, acquiring video sequences has become a trivial task. Hence, in computer vision, algorithms working with video sequences have undergone considerable development over the past few years [SZE 10]. Skimming through a book dedicated to computer vision written 30 years ago [BAL 82], we note that the notion of movement was discussed only in passing: the issue then was detecting movement, rather than analyzing it. In particular, analysis of the optical flow [BAR 94], very popular at the time, only allowed temporal changes within the sequence to be characterized. Little by little, with the rapid improvement of sensor quality and therefore of the resolution of the images they provide, as well as of computer processing power and memory, it became possible, and indeed essential, to analyze movement in addition to detecting it: where does it come from? What behavior does it reflect? Hence, new algorithms made their appearance [SHI 94], whose purpose is to detect and follow entities in a video sequence. These are grouped under the name of tracking algorithms. Today, tracking single and multiple objects in video sequences is one of the major themes of computer vision. There are in fact many practical applications, notably in human–machine interaction, augmented reality, traffic control, surveillance, medical or biomedical imagery and even interactive games. The diversity of the problems to solve, as well as the computational challenges raised by object tracking in video sequences, motivates an increasing amount of research every year.
Thus, in the year 2012 alone, in three major computer vision conferences (IEEE Conference on Computer Vision and Pattern Recognition (CVPR), European Conference on Computer Vision (ECCV) and British Machine Vision Conference (BMVC)), three workshops and two tutorials were dedicated to tracking; we can even mention the PETS workshop (International Workshop on Performance Evaluation of Tracking and Surveillance), which organizes a tracking competition every two years. Furthermore, databases are increasingly available to allow researchers to compare their results [SIG 10a, WU 13]1. The intense research activity around object tracking in video sequences is explained by the many challenges it involves. Indeed, it requires efficiently extracting from the images the information related to the object or objects to track, modeling it to obtain a representation that is both precise and compact, and striking a compromise between tracking quality and efficiency. It is then necessary to create temporal links between the object instances at each time step, while managing the occasional appearances and disappearances of objects on the scene. Finally, it is sometimes necessary to extract meta-data to respond to the needs of a specific application (behavioral analysis, detection of an event, etc.). To these difficulties are added those induced by the state of the object (appearance and deformation), variations in the illumination of the scene, noise present in the images, object occlusion, etc. Hence, object tracking reveals itself as a very complex process, especially given the ever-growing requirements in terms of tracking quality and processing speed in practical applications.
Over the last few years, sequential Monte-Carlo methods [DOU 01, GOR 93, ISA 98a], better known as particle filters, have become the visual tracking algorithm par excellence. Their aim is to estimate the filtering density that links the states of the tracked objects to previous and current observations, by approximating it using a weighted sample. Beyond the simplicity of their implementation, these approaches are capable of maintaining multiple hypotheses over time, which makes them robust to the challenges of visual tracking. Additionally, given their probabilistic nature, their very generic formalism makes it possible to consider complex modeling for the objects and observations available, whose densities may be non-parametric and/or multimodal. Nevertheless, their use requires staying within their mathematical framework, which is rigorous in spite of its simplicity. Moreover, we need to make sure that algorithmic costs remain reasonable (by incorporating, for example, independence hypotheses when they are justifiable). We positioned ourselves naturally in this methodological context, particularly by noting that some of the primary advantages offered by particle filtering cannot, at the current time, be exploited without making a certain number of often simplifying hypotheses. Specifically, while maintaining multiple hypotheses over time is a real advantage of particle filtering, the minimal number of hypotheses needed to maintain a good approximation of the filtering density grows as the chosen data model leads to a higher-volume representation. This results in serious problems once we attempt to refine the representation by integrating, for example, all the richness of information supplied by the sensors.
The goal of this book is to present various contributions related to managing large state and observation representation spaces, which we consider to be one of the major challenges of particle filtering today. We distinguish three primary axes that guided this research and that are the subject of Chapters 2 through 4.
The first axis concerns the choice of the data model, in order to lighten the representation as well as accelerate its extraction. Work on this axis is essential to simplifying the calculations involved in estimation by particle filtering. Indeed, in order to be solved in a robust and targeted manner, current tracking problems require exploiting a multitude of available information/observations, whose quality is constantly improving. This in turn requires increasingly fine descriptions of the dynamic scene being processed, which tend to weigh considerably on the calculations. Although reliable and efficient data extraction techniques exist today that allow better exploitation of image information, they are not necessarily appropriate to particle filtering and its multi-hypothesis representation, as they lead to repeated calculations, which may be disastrous for the efficiency of the filter. Histograms, for instance, are a widely used representation model, but their extraction can quickly become a bottleneck for the filter's response time, so it is necessary to find methods suited to their extraction. The size of the spaces we work in has a significant influence on response times, and being able to combine a set of characteristics, observations and information into a model described in a smaller space is an equally essential task, undertaken by numerous researchers. As we will see later, efficient combinations can be made either during the correction process, which can be considered a posteriori fusion, or by being integrated directly into the tracking process (propagation and correction).
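As one concrete illustration of speeding up histogram extraction (a classical technique given here for intuition, not necessarily the specific optimization developed later in this book), an integral histogram precomputes cumulative per-bin counts so that the histogram of any rectangular region costs time proportional to the number of bins rather than the number of pixels — a decisive saving when hundreds of particles each require a region histogram:

```python
def integral_histogram(image, nbins, vmax=256):
    """Cumulative per-bin counts: H[r][c][k] = count of bin k in image[0:r, 0:c]."""
    rows, cols = len(image), len(image[0])
    H = [[[0] * nbins for _ in range(cols + 1)] for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            b = image[r][c] * nbins // vmax   # bin index of this pixel
            for k in range(nbins):
                H[r + 1][c + 1][k] = (H[r][c + 1][k] + H[r + 1][c][k]
                                      - H[r][c][k] + (1 if k == b else 0))
    return H

def region_histogram(H, top, left, bottom, right):
    """Histogram of image[top:bottom, left:right] by inclusion-exclusion
    on four corner entries, in O(nbins) time."""
    return [H[bottom][right][k] - H[top][right][k]
            - H[bottom][left][k] + H[top][left][k]
            for k in range(len(H[0][0]))]

# A 4x4 image with values in [0, 256); 4 bins of width 64.
img = [[10, 70, 130, 200],
       [10, 70, 130, 200],
       [10, 70, 130, 200],
       [10, 70, 130, 200]]
H = integral_histogram(img, nbins=4)
hist = region_histogram(H, 0, 0, 4, 2)   # left half: only bins 0 and 1 occupied
```

The one-time table construction is amortized over all particle evaluations in the frame.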
Several suggested theoretical and/or algorithmic solutions made it possible to obtain either a significant decrease in the processing time of likelihood functions, notably with optimized histogram extraction, or original models of classical tracking problems, such as the deformation of the object over time, its multiple possible representations (appearances, modalities and fragments) and even the detection of objects (new or known) between two images.
The second axis concerns the exploration of the state space. Indeed, the particle filter features a hypothesis (particle) propagation phase, which aims to diffuse hypotheses toward areas of high likelihood, where they will be attributed a high weight during the correction stage. When the state space is large, detecting areas of high likelihood turns out to be difficult, as exhaustive exploration of the space requires multiplying the number of hypotheses, which is impossible within reasonable processing times. This problem can be solved in two different ways. The first solution constitutes a primary research axis and consists of "choosing" the areas of the space to explore, which we call "focusing". Hence, we can maintain a reasonable number of hypotheses, while smartly exploring the areas in which they will be propagated, areas of assumed high likelihood. This can be done either through detection before propagation (detect-before-track), or by defining dedicated proposal functions. The second solution, which we also develop in this book, consists of decomposing the state space into subspaces of smaller size, which can be processed with "few" hypotheses.
We suggested several approaches that allow better focusing within the state space and, therefore, accelerate tracking by particle filtering. Two types of contributions dedicated to multi-object tracking allowed us to avoid processing all the possible combinations associating measurements with objects. The first type aims, first of all, to model a proposal function that propagates particles only in areas where movement was detected in advance, these detections being seen as measurements, and then to classify these particles by associating them with the objects on the scene. The second type takes into account the past dynamics of the objects and suggests a data association model that is very unconstrained, as it depends on few parameters. These models allow the association probabilities between measurements and objects to be calculated simply. We will also show that fuzzy spatial information can be introduced into the particle filter. This allowed the modeling of a new proposal function that takes into account not only the current observation, but also the history of fuzzy spatial relations that characterized the past trajectories of the objects. The result is considerably more flexible tracking, better adapted to sudden changes in trajectory or shape. The appeal of this type of modeling is shown through various applications: object tracking, by managing erratic movements; multi-object tracking, by taking into account object occlusion; and finally multi-shape tracking, by taking into account deformable objects.
Like the previous axis, the third one aims to make the exploration of large state spaces possible in practice. Here, however, we no longer seek to reduce the hyper-volume of the space to explore. Rather, we suggest decomposing it into subspaces of smaller size, in which calculations can be made within reasonable times, as they allow estimating distributions with fewer parameters than those required in the whole space; the latter is then defined as a joint space. In this book, we are interested in "non-approximated" decomposition methods, that is, methods that guarantee asymptotically that the particles correctly sample the filtering distribution over the whole space. Thus, these methods do not make any simplifying hypotheses and only exploit the independences existing in the tracking problem. Among these techniques, partitioned sampling is widely used today, although it has several limits. Indeed, this technique sequentially covers every subspace in order to construct, progressively, the particles on the joint (whole) space, which can create problems when the order in which the subspaces are processed is completely arbitrary: if the sub-hypothesis made in the first subspace is incorrect, it will contribute to diminishing the global score of the hypothesis as a whole, even if the other sub-hypotheses are correct. The quality of the tracking will thus be low.
Several contributions allowed us to address this problem. The first is to include the order in which objects are processed in the estimation process itself. This order is estimated sequentially, at the same time as the object states, so that the least reliable objects are tracked last. We can also exploit the conditional independences intrinsic to the tracking problem (without making unjustified hypotheses). This naturally leads to using dynamic Bayesian networks, rather than Markov chains, to model the filtering process. Exploiting the independence properties of these networks allowed us to develop a new method that permutes certain subsamples of sub-particles, yielding better estimates of the filtering density while guaranteeing that the estimated density remains unchanged. This method reduces not only tracking errors, but also calculation times. The same permutation idea is exploited to suggest a new resampling method, which also significantly improves tracking.
The structure of this book is as follows. In Chapter 1, we present the theoretical elements necessary to understand particle filtering. We then explain how this methodology is used in the context of visual tracking, particularly the fundamental points to consider. This eventually allows us to describe several current limits and challenges of tracking by particle filtering and, thus, to justify our scientific position. Chapter 2 presents contributions related to modeling and extracting the data to process, as well as the choice of its representation, so as to simplify and thereby accelerate the calculations. In Chapter 3, we describe several contributions that allow the state space to be explored by focusing on certain specific areas, considered more interesting than others. Chapter 4 shows, through selected works, how to decompose the state space into subspaces in which calculations are possible. Finally, in Chapter 5, we offer a conclusion and perspectives on the future of tracking, particularly by particle filtering.
1 A non-exhaustive list is available at http://clickdamage.com/sourcecode/cv_datasets.php.
The aim of this introductory chapter is to give a brief overview of the progress made over the last 20 years in visual tracking by particle filtering. To begin (section 1.2), we present the theoretical elements necessary for understanding particle filtering. Thus, we first introduce recursive Bayesian filtering, before giving the outline of particle filtering. For more details, in particular proofs of theorems and convergence studies, we invite the reader to refer to more advanced studies [CHE 03b, DOU 00b, GOR 93]. We then explain how particle filtering is used for visual tracking in video sequences, although the literature on this subject is so abundant and fast-evolving that a complete overview is impossible. Next, section 1.3 presents certain limits of particle filtering. We then specify our scientific position in section 1.4, together with the methodological axes that allow some of these problems to be solved. Finally, section 1.5 gives the current state of the main families of approaches concerned with managing large-sized state and/or observation spaces in particle filtering.
xt = ft(xt–1, ut),    yt = gt(xt, vt)     [1.1]
The first equation is the state equation, with the state transition function ft between the instants t – 1 and t, and the second is the observation equation, giving the measurement of the state through an observation function gt. ut and vt are independent white noises.
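As an illustration (ours, not the book's), this generic state-space model can be simulated directly; the transition ft, the observation function gt and the noise levels below are arbitrary choices:

```python
import random

def simulate(T, f, g, sigma_u, sigma_v, x0=0.0, seed=0):
    """Simulate x_t = f(x_{t-1}) + u_t and y_t = g(x_t) + v_t,
    with u_t and v_t independent white Gaussian noises."""
    rng = random.Random(seed)
    xs, ys = [], []
    x = x0
    for _ in range(T):
        x = f(x) + rng.gauss(0.0, sigma_u)   # state equation
        y = g(x) + rng.gauss(0.0, sigma_v)   # observation equation
        xs.append(x)
        ys.append(y)
    return xs, ys

# Example: near-constant-position dynamics, direct noisy measurement.
states, obs = simulate(T=50, f=lambda x: 0.95 * x + 0.5,
                       g=lambda x: x, sigma_u=0.1, sigma_v=0.3)
```

Only the observations ys are available to a tracker; the hidden states xs are what filtering aims to recover.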
p(xt|y1:t–1) = ∫ p(xt|xt–1) p(xt–1|y1:t–1) dxt–1     [1.2]
p(x0:t|y1:t) ∝ p(yt|xt) p(xt|xt–1) p(x0:t–1|y1:t–1)     [1.3]
The state transition equation is represented by the density p(xt|xt–1) and is linked to the function ft. This density is also called the transition function and gives the probability of the state xt at the instant t, given its previous state xt–1. The observation equation is represented by p(yt|xt) and is linked to the function gt. This density is also called the likelihood function and gives the probability of making the observation yt given the state xt. We can see that equation [1.3] is recursive and that it decomposes into the two primary stages that we detail below.
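As an illustration (ours, not the book's), the prediction/correction recursion becomes concrete on a discrete state space, where the integrals turn into sums; the two-state transition matrix and likelihood values below are arbitrary:

```python
def bayes_step(prior, transition, likelihood):
    """One prediction/correction step of recursive Bayesian filtering
    on a discrete state space.
    prior[i]: p(x_{t-1} = i | y_{1:t-1})
    transition[i][j]: p(x_t = j | x_{t-1} = i)
    likelihood[j]: p(y_t | x_t = j) for the current observation y_t."""
    n = len(prior)
    # Prediction: p(x_t | y_{1:t-1}) = sum_i p(x_t | x_{t-1}=i) p(x_{t-1}=i | y_{1:t-1})
    predicted = [sum(prior[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    # Correction: multiply by the likelihood of y_t, then renormalize.
    unnorm = [predicted[j] * likelihood[j] for j in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Two-state example: the object stays in its state with probability 0.8,
# and the current observation strongly favors state 0.
post = bayes_step(prior=[0.5, 0.5],
                  transition=[[0.8, 0.2], [0.2, 0.8]],
                  likelihood=[0.9, 0.1])
```

In vision, the state space is continuous and high-dimensional, so these sums become intractable integrals — the motivation for the Monte-Carlo approximations that follow.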
In order to obtain calculable estimators of x0:t, we can use, for example, the conditional mean, given by:
E[φ(x0:t)|y1:t] = ∫ φ(x0:t) p(x0:t|y1:t) dx0:t     [1.4]
where φ is some bounded function. If the densities are Gaussian, then there exists an analytical solution (an explicit expression of the Gaussian parameters) given by the Kalman filter [KAL 60]. Otherwise, equation [1.4] is not directly calculable. Under special conditions, we can invoke the solutions given by other families of approximation methods.
Most of the time, in vision, such solutions are not suitable, as the integrals are not directly calculable. For the general case (non-parametric and multimodal densities), it is necessary to make use of numerical approximations, such as those provided by sequential Monte-Carlo methods, which we present in the following section and which are the methodological heart of this work.
Sequential Monte-Carlo methods, also known under the name of particle filters (PFs), were studied by many researchers at the beginning of the 1990s [GOR 93, MOR 95] and combine Monte-Carlo simulation and recursive Bayesian filtering. Today, they are widely used in the computer vision community. Before detailing the principle of particle filtering, we need to introduce importance sampling.
Once the a posteriori density defined by equation [1.3] has been approximated, we can evaluate the estimator given in equation [1.4]. The Monte-Carlo method allows us to approximate this integral using realizations of a random variable distributed according to the a posteriori density. Unfortunately, we are almost never able to sample from this law; to solve this problem, we introduce a proposal function (or importance function) q(x0:t|y1:t), whose support contains that of p(x0:t|y1:t) and from which we can sample. The conditional mean is then given by:
E[φ(x0:t)|y1:t] = ∫ φ(x0:t) [p(x0:t|y1:t) / q(x0:t|y1:t)] q(x0:t|y1:t) dx0:t     [1.5]
wt(i) ∝ p(x0:t(i)|y1:t) / q(x0:t(i)|y1:t),   with x0:t(i) ~ q(x0:t|y1:t), i = 1, …, N     [1.6]
E[φ(x0:t)|y1:t] ≈ Σi=1…N wt(i) φ(x0:t(i)),   with Σi=1…N wt(i) = 1     [1.7]
This estimator converges almost surely when N tends to infinity. It is then sufficient to make the importance sampling recursive to obtain the particle filtering algorithm described below.
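A minimal self-normalized importance sampling sketch may make this concrete (our illustration; the target, proposal and sample size are arbitrary choices, and the target density is known only up to a constant, which self-normalization tolerates):

```python
import math
import random

def importance_estimate(phi, log_p, log_q, sample_q, n, seed=0):
    """Self-normalized importance sampling estimate of E_p[phi(x)],
    drawing from a proposal q whose support contains that of p."""
    rng = random.Random(seed)
    xs = [sample_q(rng) for _ in range(n)]
    # Unnormalized importance weights w(x) = p(x) / q(x), in log domain.
    logw = [log_p(x) - log_q(x) for x in xs]
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]   # shift by max for stability
    z = sum(w)
    return sum(wi * phi(xi) for wi, xi in zip(w, xs)) / z

# Target p = N(2, 1) (unnormalized log-density); proposal q = N(0, 3).
log_p = lambda x: -0.5 * (x - 2.0) ** 2
log_q = lambda x: -0.5 * (x / 3.0) ** 2 - math.log(3.0)
est = importance_estimate(lambda x: x, log_p, log_q,
                          lambda rng: rng.gauss(0.0, 3.0), n=20000)
# est should be close to the target mean, 2.0
```

The weights correct for the mismatch between where q puts its samples and where p has its mass, exactly as in the estimator above.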
p(x0:t|y1:t) ≈ Σi=1…N wt(i) δ(x0:t – x0:t(i))     [1.8]
where the individuals x0:t(i), also called particles, are realizations of the random variable x0:t (the state of the object) in the state space (δ being the Dirac function). Every particle is therefore a possible solution of the state to approximate, and its associated weight wt(i) represents its quality according to the available observations. Hence, the sample {x0:t(i), wt(i)}i=1,…,N at the instant t is calculated from the previous sample {x0:t–1(i), wt–1(i)}i=1,…,N, so as to obtain an approximation (via sampling) of the filtering density p(x0:t|y1:t) at the current instant. For this, three stages are necessary: i) a state exploration stage, during which we propagate the particles via the proposal function; ii) a stage for the evaluation (or correction) of particle quality, which aims to calculate their new weights; and finally iii) an optional stage for particle selection (resampling). The generic particle filtering scheme (SIR filter – sequential importance resampling), between the instants t – 1 and t, is summarized in the algorithm below.
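The three stages above can be sketched as follows (a toy illustration of the generic SIR scheme with a 1D state, scalar observations and multinomial resampling; all models and parameters are our own choices, not the book's):

```python
import math
import random

def sir_step(particles, weights, propose, log_lik, rng):
    """One SIR iteration: (i) propagate each particle with the proposal,
    (ii) reweight with the likelihood of the current observation,
    (iii) resample (multinomial) to fight weight degeneracy."""
    # i) Exploration: propagate particles through the proposal function.
    particles = [propose(x, rng) for x in particles]
    # ii) Correction: weight by the likelihood p(y_t | x_t), in log domain.
    logw = [math.log(w) + log_lik(x) for x, w in zip(particles, weights)]
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    z = sum(w)
    w = [wi / z for wi in w]
    # iii) Selection: multinomial resampling, then uniform weights.
    particles = rng.choices(particles, weights=w, k=len(particles))
    return particles, [1.0 / len(particles)] * len(particles)

# Toy run: track a constant hidden state 2.0 from observations y_t ~ N(2, 0.5).
rng = random.Random(1)
N = 500
parts = [rng.gauss(0.0, 2.0) for _ in range(N)]
wts = [1.0 / N] * N
for _ in range(30):
    y = rng.gauss(2.0, 0.5)
    parts, wts = sir_step(parts, wts,
                          propose=lambda x, r: x + r.gauss(0.0, 0.3),
                          log_lik=lambda x: -0.5 * ((y - x) / 0.5) ** 2,
                          rng=rng)
estimate = sum(p * w for p, w in zip(parts, wts))
```

A random-walk proposal is the simplest choice; the book's later chapters concern precisely how to do better when the state space is large.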
The equations above allow us to approximate the trajectory of the objects, but they can also be used to approximate only their state at the instant t, by simply integrating over x0:t–1. In practice, this amounts to replacing x0:t and x0:t–1, respectively, by xt and xt–1 in the algorithms. In the rest of this work, depending on the application, one or the other possibility will be studied.
Now that the theoretical framework has been defined, we discuss the problem of visual tracking by particle filtering in the next section.
The PF has been used in numerous disciplines, such as communications, networks, biology, economics, geoscience, social sciences, etc. In image processing, it has been used in many domains (medical imagery, video analysis, meteorological imagery, robotics, etc.), for various applications such as segmentation or tracking in video sequences, the latter being the primary subject of our research.
Visual tracking poses many problems, among which changes in appearance or illumination, occlusion, the appearance and disappearance of objects, environmental noise and erratic movements are just a few examples. Particle filtering allows us to represent arbitrary densities, to focus on specific regions of the state space and to manage multiple models. It is easy to implement and robust to noise and to occlusions, although this requires taking a certain number of precautions.
We will later give several solutions suggested in the literature for each of these points.
The choice of a model xt for the state depends on the available knowledge and on the characteristics of the object that we would like to track. In this part, we describe how to model the state xt of an object.
The most common method to represent an object is to use its geometric characteristics, in particular its position in the image (this is the case of the illustration in Figure 1.3). The 2D shape can be given by a set of arbitrary points [ARN 05a, ARN 07, VER 05b] or specific points, such as edges [DOR 10, DU 05], contour points [CAR 10, CHE 01, LAK 08, MOR 08, XIA 08] or reference points [TAM 06]. Classical shapes are also used, such as rectangles [BRA 07a, HAN 05b, HU 08, LEI 06, LEI 08, PÉR 02, WAN 09] or ellipses [ANG 08, MAG 09, NUM 03a], as well as shapes interpolated by splines [LAM 09, LI 04a, LI 03]. We can also use level-sets [AVE 09, RAT 07a] or active contours [RAT 05, RAT 07b, SHE 06]. Finally, more evolved models integrating the relations between sets of pixels [HOE 10, HOE 06] are sometimes used. Among 3D shapes, we use simple volumes (parallelepipeds, spheres) [GOY 10, MIN 10, MUÑ 10, ROU 10], fine 3D meshes of the face [DAI 04, DOR 05], the human body [GAL 06] or the hand [BRA 07c, CHA 08], as well as contours [PEC 06].
Recently, numerous studies have been conducted on the tracking of articulated objects, in which an object is modeled by a set of 2D or 3D shapes linked together by articulations [BER 06, BRU 07, QU 07, SIG 04, YU 09]. Appearance models are also used, which require learning color [MAR 11, WAN 07], thumbnails [BHA 09], illumination [BAR 09, SMA 07], pose [WAN 05] or multiple shapes [BRA 05, GIE 02]. We also find more exotic appearance models using blur [SMA 08] or laser [GID 08] information. Finally, the state can be described by movement information, given by affine transformations [GEL 04, KWO 08, MEI 09], velocity and/or acceleration [BLA 99a, CUI 07, VER 05a, DAR 08b] (we sometimes speak of auto-regressive models) or the trajectory [BLA 98a].
Naturally, these models are often combined to improve the description of the object, which increases the size of the state space, often making calculation times unacceptable. We then need to make a compromise between the quality of the description and the computation time. Figure 1.1 gives several examples of state models used in tracking by PF.
Here again, the choice of the observation model yt depends on the available information. In visual tracking, this information is extracted from the images, which are generated by different types of sensors, whose number can vary. Many approaches work directly on pixels, which are often filtered during a simple pre-processing stage [BHA 09, GEL 04, GON 07, KAZ 09, KHA 06, SCH 07], or simply on the pixels of the extracted foreground area [CHE 03a]. The difference between these approaches depends on the form of acquisition, which can supply, for example, fluorescence [LAM 09], 2D [SHE 06, SMA 08] or 3D [CHE 08] microscopy, infrared [PÉT 09] or even ultrasound [SOF 10] imagery. Note that for color, we primarily use RGB [CZY 07, HU 08, MAG 05a, MAG 07, MAR 11, NUM 03a] and HSV [LIU 09, MUN 08b, PÉR 02, PER 08, SNO 09] representations (the latter being generally better adapted to vision problems, as it is less sensitive to changes in illumination). Other types of sensors are sometimes used, providing information such as distance and depth maps [ARN 05a, BER 06, LAN 06, MUN 08b, ZHU 10], movement maps [SCH 06], laser data [CUI 07, GID 08, GOY 10], projective images [ERC 07], occupancy maps [MUÑ 09] or sound maps [CHE 03a, PÉR 04]. Figure 1.2 gives several examples of these.
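For instance, a classical color-based likelihood, in the spirit of the HSV histogram approach of [PÉR 02], compares a normalized candidate histogram to a reference via the Bhattacharyya coefficient (the bin counts, the constant λ and the hue range below are illustrative choices of ours):

```python
import math

def normalized_hist(values, nbins, vmax):
    """Normalized histogram of a list of (e.g. hue) values in [0, vmax)."""
    h = [0.0] * nbins
    for v in values:
        h[v * nbins // vmax] += 1.0
    n = len(values)
    return [x / n for x in h]

def likelihood(ref, cand, lam=20.0):
    """Likelihood p(y|x) proportional to exp(-lam * d2), where
    d2 = 1 - sum_k sqrt(ref_k * cand_k) is the squared Bhattacharyya
    distance between reference and candidate histograms."""
    bc = sum(math.sqrt(r * c) for r, c in zip(ref, cand))
    return math.exp(-lam * (1.0 - bc))

# Hue values in [0, 180), as in OpenCV's HSV convention; 8 bins.
ref = normalized_hist([10, 12, 100, 102], nbins=8, vmax=180)
same = likelihood(ref, ref)                                      # distance 0
far = likelihood(ref, normalized_hist([170, 171, 172, 173], 8, 180))
```

Each particle's candidate region yields such a histogram, and the resulting likelihood becomes its correction weight.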
Figure 1.1. Some examples of state models used to represent the object to track. From left to right, top to bottom: a model integrating illumination [BAR 09], an articulated model [SIG 10a], a trajectory [BLA 98a], a 3D facial mesh [DOR 05], level sets [AVE 09], a sphere [ROU 10], a set of points-of-interest [ARN 07], areas and their relations [HOE 10], a rectangle [BRA 07a], edges [DOR 10], an ellipse [MAG 09] and appearance models [MAR 11]. For a color version of the figure, see www.iste.co.uk/dubuisson/tracking.zip
The importance function, or the proposal function,
