
Monocular 3D Tracking of Multiple Interacting Targets

Tatsuya Osawa, Kyoko Sudo, Hiroyuki Arai and Hideki Koike
NTT Cyber Space Laboratories, NTT Corporation
1-1 Hikarinooka, Yokosuka-shi, Kanagawa, Japan
[email protected]

Abstract

In this paper, we present a new approach based on Markov chain Monte Carlo (MCMC) for the stable monocular tracking of a variable number of interacting targets in 3D space. The crucial problem in the monocular tracking of multiple targets is that mutual occlusions in the 2D image cause target conflicts (identity switches, merged targets, and so on). We focus on the fact that multiple targets cannot occupy the same position in 3D space and propose to track multiple interacting targets using the relative positions of the targets in 3D space. Experiments show that our system can stably track multiple humans that are interacting with each other.

1 Introduction

There is much interest in visual surveillance systems to ease growing public fears [2]. Human tracking is one of the most critical steps in visual surveillance, since the movements of humans correlate well with their behavior. Compared to the 2D approach, the 3D approach is more effective at accurately estimating position in space and at handling occlusions. In [4], the 3D positions of humans in very cluttered environments are tracked using multiple cameras located far from each other. [3] proposed a method that tracks humans under severe occlusions by directly predicting the 3D positions of humans in a 3D environment model and evaluating the predictions against 2D images from multiple cameras. The above methods are designed on the assumption of a multi-camera system in which the cameras are synchronized and their fields of view overlap. The installation cost of such systems is commercially unrealistic for most applications; the market strongly needs a practical monocular 3D tracking system.


The crucial problem in the monocular tracking of multiple targets is that mutual occlusions in the 2D image cause target conflicts (identity switches, merged targets, and so on). The work of [5] realized the tracking of multiple targets in the framework of the MCMC particle filter. However, it tends to merge targets under severe mutual occlusions, because it uses 3D information only to create the perspective effect of the camera projection. We focus on the fact that multiple targets cannot occupy the same position in 3D space and propose to track multiple interacting targets by using the relative positions of the targets in 3D space. We formulate the state of multiple targets as a joint state space over the individual targets, and recursively estimate the multi-body configuration and the position of each target in 3D space within the framework of trans-dimensional MCMC.

2 3D Tracking from a Monocular Camera

2.1 Human State Model

Tracking means the sequential estimation of the state of multiple humans $S_t$, which includes the position and shape parameters of the humans at time $t$. The state of human $k$ is represented as a 5-dimensional vector $M_k = (X_k, b_k)$, where $X_k = (u_k, v_k)$ is the head position in the 2D image and $b_k = (h_k, r_{1k}, r_{2k})$ holds the height and radii of the ellipsoid that models the body (this allows us to handle shape differences between individuals); see Figure 1. Note that the 3D position of the human model can be computed from the ellipsoid's height parameter and the head position in the 2D image once the camera is calibrated. The state of multiple humans is defined as the union state space of the individual humans. Consider a system tracking $K$ people in the $t$-th image frame. $S_t$ is then represented as the $5K$-dimensional vector $S_t = (M_1, M_2, \cdots, M_K)$.

Figure 1. 3D human model.
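To make the representation concrete, here is a minimal Python sketch of one possible encoding of the per-human model and the joint state. The class and field names are our own illustration, not code from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HumanModel:
    """The 5-dimensional state M_k = (X_k, b_k) of one human."""
    u: float   # head position u_k in the 2D image (pixels)
    v: float   # head position v_k in the 2D image (pixels)
    h: float   # ellipsoid height h_k
    r1: float  # ellipsoid radius r_1k
    r2: float  # ellipsoid radius r_2k

# The joint state S_t is the union state space of all tracked humans:
# a 5K-dimensional vector when K humans are being tracked.
State = List[HumanModel]
```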

2.2 MCMC-based Tracking Algorithm

The number of dimensions of the state space varies with the number of humans being tracked, so we employ an estimation algorithm based on trans-dimensional MCMC [1]. First, in each time step, we compute the initial state of the Markov chain at time $t$ from the previous state $S_{t-1}$ and the motion model. After initialization, we generate $B + P$ samples by modifying the current state according to a randomly selected move type (the MCMC sampling step) and keep only the last $P$ of them, because the first $B$ burn-in samples are assumed to vary widely. We adopt four move types: adding a target, removing a target, updating a target's position, and updating a target's shape. We decide whether to accept or reject each new sample by computing its likelihood. After $B + P$ iterations, we compute the state $S_t$ as the MAP state over the last $P$ samples. The flow of the MCMC-based tracking algorithm proposed here is given below; a code sketch of the loop follows the list.

1. Initialize the Markov chain
2. MCMC sampling step ($B + P$ times):
   (a) Select a move type and generate a new sample
   (b) Compute the likelihood of the new sample
   (c) Accept or reject by computing the acceptance ratio
3. Estimate the MAP state
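As a rough illustration of this loop, the following Python skeleton runs the $B + P$ Metropolis-Hastings iterations and keeps only the last $P$ samples. The callables propose, likelihood, and estimate_map stand in for the move proposals, the likelihood, and the MAP estimate defined in the subsections below; they and the default values of B and P are assumptions of this sketch, not the authors' code.

```python
import copy
import random

def track_frame(S_init, propose, likelihood, estimate_map, B=100, P=50):
    """One time step of the trans-dimensional MCMC tracker (sketch).

    S_init is the motion-model prediction of Eq. (1). Assumes
    likelihood(S) > 0 for the states visited by the chain.
    """
    S = S_init
    samples = []
    for _ in range(B + P):
        # (a) Select a move type and generate candidate S'; q_ratio is
        #     the proposal density ratio Q(S|S') / Q(S'|S).
        S_new, q_ratio = propose(S)
        # (b)-(c) Likelihood and acceptance ratio (Eqs. 9 and 10).
        a = min(1.0, (likelihood(S_new) / likelihood(S)) * q_ratio)
        if random.random() < a:
            S = S_new
        samples.append(copy.deepcopy(S))
    # 3. Estimate the MAP state from the last P samples only (Eq. 11).
    return estimate_map(samples[-P:])
```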

2.2.1 Initialization of the Markov Chain

We initialize the MCMC sampler at time $t$ from the state of the previous time step, $S_{t-1}$, according to a simple linear motion model. The initial state of the Markov chain at time $t$, $\hat{S}_{t,0}$, is computed by:

$$\hat{S}_{t,0} = S_{t-1} + V_{t-1} \qquad (1)$$

where $V_{t-1}$ is the previous velocity of the state. If the system tracks $K$ humans at time $t-1$, the vector $V_{t-1}$ has $5K$ dimensions.
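As a one-line sketch, treating the state and velocity as flat $5K$-dimensional NumPy vectors (the paper does not specify how $V_{t-1}$ is maintained; a finite difference of consecutive estimated states is one natural choice):

```python
import numpy as np

def apply_motion_model(S_prev: np.ndarray, V_prev: np.ndarray) -> np.ndarray:
    """Eq. (1): S_hat_{t,0} = S_{t-1} + V_{t-1}, both 5K-dimensional vectors."""
    return S_prev + V_prev

# Possible velocity update between frames (our assumption, not the paper's):
# V_t = S_t - S_{t-1}, using the MAP estimates of consecutive frames.
```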

2.2.2 Move Types

We use the following move types to traverse the union state space:

1. Target Addition (a new target enters)
2. Target Removal (a target departs)
3. Update-Position (update the position of a target)
4. Update-Shape (update the shape of a target)

In each iteration, we select one of the above move types with probability $p_{add}$, $p_{del}$, $p_{pos}$, or $p_{sh}$, respectively, and generate a candidate sample $\hat{S}'$ from the current sample $\hat{S}_{t,i}$ according to the proposal distribution $Q(\hat{S}'|\hat{S}_{t,i})$. In [5], "merge" and "split" moves are used to resolve target conflicts in the 2D image plane. Instead, we use 3D information (the relative distances of the targets). A code sketch of these proposals appears after the four move descriptions below.

1. Target Addition: Sample a new human model $M_n$ from $q_a(M)$ and add it to $\hat{S}_{t,i}$:

$$Q(\hat{S}'|\hat{S}_{t,i}) = p_{add} \cdot q_a(M_n) \qquad (2)$$

where $q_a(M_n) = q_b(b_n) q_x(X_n|b_n)$. We first sample the shape parameter $b_n$ from $q_b(b) \sim N_3(\mu_b, \Sigma_b)$, where $\mu_b = (\mu_h, \mu_{r_1}, \mu_{r_2})$ represents the average human shape and $\Sigma_b = \mathrm{diag}(\sigma_h, \sigma_{r_1}, \sigma_{r_2})$ is the covariance matrix. We then sample the position parameter $X_n$ from $q_x(X_n|b_n)$, which we compute by projection analysis of the background-subtraction image.

2. Target Removal: Randomly select an existing human model $M_k$ and remove it from $\hat{S}_{t,i}$:

$$Q(\hat{S}'|\hat{S}_{t,i}) = p_{del} \cdot q_d(M_k) = p_{del} \cdot (1/K) \qquad (3)$$

where $K$ is the current number of existing human models.

3. Update-Position: Randomly select an existing human model $M_k$ from $\hat{S}_{t,i}$ and update its position parameter $X_k$ to $X'_k$:

$$Q(\hat{S}'|\hat{S}_{t,i}) = p_{pos} \cdot (1/K) \, q_x(X'_k|X_k) \qquad (4)$$

where $K$ is the current number of existing human models and $q_x(X'_k|X_k) \sim N_2(X_k, \Sigma_X)$ is the proposal distribution for updating the position parameter $X$.

4. Update-Shape: Randomly select an existing human model $M_k$ from $\hat{S}_{t,i}$ and update its shape parameter $b_k$ to $b'_k$:

$$Q(\hat{S}'|\hat{S}_{t,i}) = p_{sh} \cdot (1/K) \, q_b(b'_k|b_k) \qquad (5)$$

where $K$ is the current number of existing human models and $q_b(b'_k|b_k) \sim N_3(b_k, \Sigma_b)$ is the proposal distribution for updating the shape parameter $b$.
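The following Python sketch illustrates the four proposal moves. The move probabilities, Gaussian parameters, and the sample_position callable (standing in for the projection analysis of the background-subtraction image) are illustrative assumptions; the proposal densities of Eqs. (2)-(5), which the acceptance ratio needs, are omitted for brevity.

```python
import copy
import random
import numpy as np

# Illustrative move probabilities and proposal parameters (not from the paper).
P_ADD, P_DEL, P_POS, P_SH = 0.1, 0.1, 0.4, 0.4
MU_B = np.array([170.0, 25.0, 20.0])      # average shape (h, r1, r2), e.g., in cm
SIGMA_B = np.diag([100.0, 9.0, 9.0])      # shape covariance Sigma_b
SIGMA_X = np.diag([25.0, 25.0])           # position covariance Sigma_X (pixels^2)

def propose(S, sample_position):
    """Pick a move type and return the candidate state (sketch).

    S is a list of (X, b) tuples per human; sample_position(b) stands in
    for the projection analysis used by q_x(X_n | b_n).
    """
    S_new = copy.deepcopy(S)
    move = random.choices(["add", "del", "pos", "sh"],
                          weights=[P_ADD, P_DEL, P_POS, P_SH])[0]
    K = len(S_new)
    if move == "add":                 # Target Addition, Eq. (2)
        b = np.random.multivariate_normal(MU_B, SIGMA_B)
        S_new.append((sample_position(b), b))
    elif move == "del" and K > 0:     # Target Removal, Eq. (3)
        S_new.pop(random.randrange(K))
    elif move == "pos" and K > 0:     # Update-Position, Eq. (4)
        k = random.randrange(K)
        X, b = S_new[k]
        S_new[k] = (np.random.multivariate_normal(X, SIGMA_X), b)
    elif move == "sh" and K > 0:      # Update-Shape, Eq. (5)
        k = random.randrange(K)
        X, b = S_new[k]
        S_new[k] = (X, np.random.multivariate_normal(b, SIGMA_B))
    return S_new, move
```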

2.2.3 Likelihood of the State

State $S = (M_1, M_2, \cdots, M_K)$ is simulated using 3D models of humans. We capture this scene with a virtual camera that has the same camera parameters as the real camera. The likelihood of the state is computed by comparing the camera image to the corresponding simulation image. Figure 2 shows a real camera image, the background-subtracted image, and the simulation image. We compare the background-subtracted image to the simulation image using:

$$V(S) = \frac{\sum_{u,v} BgI_t(u,v) \cap Sm_S(u,v)}{\sum_{u,v} BgI_t(u,v) \cup Sm_S(u,v)} \qquad (6)$$

where $BgI_t(u,v)$ and $Sm_S(u,v)$ are the $(u,v)$ pixels of the background-subtracted image and the simulation image, respectively, at time $t$.

Penalty based on relative distance in 3D space. Since multiple humans cannot occupy the same position in 3D space simultaneously, we define a penalty function $R(S)$ based on the relative distances among the targets to overcome target conflicts in the 2D image:

$$R(S) = \prod_{k,l} \psi(M_k, M_l) \qquad (7)$$

$$\psi(M_k, M_l) = 1 - \exp(-\lambda D_{k,l}) \qquad (8)$$

where $D_{k,l} = \sqrt{(x_k - x_l)^2 + (y_k - y_l)^2}$ is the distance in 3D space between target $k$ and target $l$, and $(x_k, y_k)$ is the position on the ground plane.

Finally, the likelihood $L$ is computed by:

$$L(S) = R(S) \times V(S) \qquad (9)$$

This likelihood realizes efficient estimation by preventing target conflicts in 3D space.

Figure 2. Left: camera image; middle: background-subtracted image; right: simulation image.
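Eqs. (6)-(9) translate almost directly into Python; the sketch below assumes binary foreground masks as NumPy arrays and an illustrative value of $\lambda$, which the paper does not report.

```python
from itertools import combinations
import numpy as np

LAMBDA = 0.05  # penalty decay rate lambda; illustrative value

def overlap_score(bg_mask: np.ndarray, sim_mask: np.ndarray) -> float:
    """V(S), Eq. (6): intersection over union of the two binary masks."""
    inter = np.logical_and(bg_mask, sim_mask).sum()
    union = np.logical_or(bg_mask, sim_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def distance_penalty(ground_positions) -> float:
    """R(S), Eqs. (7)-(8): pairwise penalties 1 - exp(-lambda * D_kl)."""
    R = 1.0
    for (xk, yk), (xl, yl) in combinations(ground_positions, 2):
        D_kl = np.hypot(xk - xl, yk - yl)  # ground-plane distance
        R *= 1.0 - np.exp(-LAMBDA * D_kl)
    return R

def likelihood(bg_mask, sim_mask, ground_positions) -> float:
    """L(S), Eq. (9): R(S) * V(S)."""
    return distance_penalty(ground_positions) * overlap_score(bg_mask, sim_mask)
```

Because each pairwise term vanishes as two targets approach the same ground-plane position, any state that stacks two models in one place receives a likelihood near zero and is almost never accepted.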

2.2.4 Acceptance Ratio

We decide whether to accept or reject the candidate state $\hat{S}'$ using the acceptance ratio $a$:

$$a = \min\left(1, \; \frac{L(\hat{S}')}{L(\hat{S}_{t,i})} \cdot \frac{Q(\hat{S}_{t,i}|\hat{S}')}{Q(\hat{S}'|\hat{S}_{t,i})}\right) \qquad (10)$$

If $a \geq 1$, we accept the new sample; otherwise we accept it with probability $a$. If we reject the new sample, we keep the current sample as the new sample.
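A small helper for this decision, taking precomputed likelihood and proposal-density values so nothing is evaluated twice (the signature is our own; the paper only gives Eq. (10)):

```python
import random

def accept(L_cur: float, L_new: float, q_fwd: float, q_rev: float) -> bool:
    """Eq. (10): a = min(1, (L_new / L_cur) * (q_rev / q_fwd)),
    where q_fwd = Q(S'|S_t,i) and q_rev = Q(S_t,i|S')."""
    a = min(1.0, (L_new / L_cur) * (q_rev / q_fwd))
    return random.random() < a  # always true when a == 1
```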

2.2.5 MAP Estimation

After repeating the sampling $B + P$ times, we compute the state $S_t$ by:

$$S_t = \frac{1}{P} \sum_{i=B}^{B+P} \hat{S}_{t,i} \qquad (11)$$

We use only the last $P$ samples to compute the state $S_t$, because the first $B$ samples are assumed to vary widely and to include different target configurations. Models that do not appear in all $P$ samples are deleted.
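A sketch of Eq. (11), assuming each human model carries a persistent ID across samples so that its parameters can be averaged and models absent from any of the last $P$ samples can be dropped; this ID bookkeeping is our assumption, as the paper does not detail it.

```python
from collections import defaultdict
import numpy as np

def estimate_map(samples):
    """Average the last P samples per target (Eq. 11, sketch).

    Each sample is a dict {target_id: 5-dim parameter vector}.
    """
    P = len(samples)
    sums = defaultdict(lambda: np.zeros(5))
    counts = defaultdict(int)
    for sample in samples:
        for tid, params in sample.items():
            sums[tid] += np.asarray(params, dtype=float)
            counts[tid] += 1
    # Keep only models that appear in all P samples, then average.
    return {tid: sums[tid] / P for tid in sums if counts[tid] == P}
```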

3 Experiment

Our system consisted of a PC and a color CCD camera. Each captured image had a resolution of 360 × 288 pixels. In this experiment, the number of iterations $B + P$ was set to 150, and the last $P = 50$ samples were used for MAP estimation. To evaluate the basic tracking performance of our system, we used an image sequence in which humans entered and left the monitored area at different times. Figure 5 shows images selected from the monitoring results. Our system correctly captured the movements of the humans. From the 832nd frame to the 912th frame, two humans completely overlap in the 2D image (see Figure 3). Even under this severe mutual occlusion, the tracking error was not significant, owing to our use of the relative distances of the targets. Figure 4 shows the trajectories of these two humans on the ground plane. Trajectory tracking was stable even under the severe mutual occlusion, demonstrating the occlusion robustness of our system.

Figure 3. Left: camera image; middle: background-subtracted image; right: simulation image.

Figure 4. Motion trajectories of target 1 and target 2 on the ground plane (x and y in cm).

Figure 5. Estimation results for sequence #1 at frames 832, 878, 912, 1246, 1344, 1366, 1628, 1662, and 1722 (top row: camera images; middle row: estimated simulation images; bottom row: bird's-eye views of the estimated human positions).

4 Conclusion

This paper proposed a method capable of tracking a variable number of interacting targets under the severe occlusions created by target movement in 3D space. We introduced a tracking algorithm based on trans-dimensional MCMC together with information about the relative distances of targets in 3D space. It provides accurate and stable predictions of object movements and handles the occlusions created by moving targets well.

References

[1] P. J. Green. Trans-dimensional Markov chain Monte Carlo. In Highly Structured Stochastic Systems. Oxford Univ. Press, 2003.
[2] W. Hu, T. Tan, L. Wang, and S. Maybank. A survey on visual surveillance of object motion and behaviors. IEEE Trans. Systems, Man, and Cybernetics, Part C: Applications and Reviews, 34(3):334–352, 2004.
[3] T. Osawa, X. Wu, K. Wakabayashi, and T. Yasuno. Human tracking by particle filtering using full 3D model of both target and environment. In ICPR, pages II:25–28, 2006.
[4] K. Otsuka and N. Mukawa. Multiview occlusion analysis for tracking densely populated objects based on 2-D visual angles. In CVPR (1), pages 90–97, 2004.
[5] T. Zhao and R. Nevatia. Tracking multiple humans in crowded environment. In CVPR (2), pages 406–413, 2004.
