Object Detection Using a Cascade of 3D Models

Hon-Keat Pong and Tat-Jen Cham
School of Computer Engineering, Nanyang Technological University, Singapore
[email protected], [email protected]

Abstract. We present an alignment framework for object detection using a hierarchy of 3D polygonal models. One difficulty with alignment methods is that the high-dimensional transformation space makes finding potential candidate states time-consuming. This is an important consideration in our approach, as an exhaustive search is applied over a densely sampled state space in order to avoid local minima and to extract all possible candidates. In our framework, a level-of-detail (LOD) hierarchy of 3D geometric models is generated for the target object. Each of these models acts as a classifier that determines which of the discrete states are potential candidates. Classification is done by estimating pixel- and edge-based mutual information between the 3D model and the image, where the classification speed depends significantly on the LOD and the resolution of the image. By combining these models of various LOD into a cascade, we show that search time can be reduced significantly while accuracy is maintained.

1 Introduction

In this paper we address the problem of alignment-based object detection, in which a 3D geometric model is transformed to align with the target object in an image. Typically, in finding the set of transformation parameters that best aligns the model with its image, features from the 3D model are matched to image features by measuring their similarity with an evaluation function. The function values associated with each possible transform form an energy landscape in the parameter space. Most existing alignment methods use directed search techniques to find the optimal transformation, but these usually require initialization near the final solution. In contrast, our proposed algorithm attempts to find a globally optimal transformation through exhaustive search, carried out in a computationally efficient manner. We present a search method that uses a 3D model hierarchy. These models are decimated versions of a polygonal model of the target object and form a level-of-detail (LOD) hierarchy. The 3D models are loaded on-the-fly at run time and their images rendered using a graphics library, bypassing the need to store multiple 2D view profiles of these models. Figure 1 illustrates an example LOD hierarchy. We note that models with lower LOD can be evaluated much faster due to the reduced number of data points, although with lower accuracy.

P.J. Narayanan et al. (Eds.): ACCV 2006, LNCS 3852, pp. 284–293, 2006. © Springer-Verlag Berlin Heidelberg 2006


Fig. 1. 3D model level-of-detail hierarchy. The leftmost model has the fewest polygons. White points on the models are data samples. The models are shaded according to surface normal profiles on the polygonal surfaces.

A densely sampled set of states in parameter space is evaluated with these models; the bulk of very unlikely states is quickly discarded, while the remainder are subsequently evaluated with the higher-LOD models. By combining the 3D models of increasing LOD into a hierarchy, we form a detection cascade that can be globally optimized with respect to running time and overall detection performance. Our method does not rely on local search techniques and will not be trapped in local minima. The decision to use 3D models directly is motivated by the fact that a complete description of an object may not always be available. Existing works build a large database of 2D shape templates generated from a 3D model, with each template corresponding to a certain viewpoint of the object. In appearance-based methods, it is assumed that objects possess known surface properties that allow associations with learned feature descriptors. However, object reflectance or emission information may not always be available or constant: in non-visible-spectrum imagery, for instance, an object may have a different appearance depending on its thermal profile, and objects may also appear very different under varied lighting conditions. Current generative models [1, 2] do not handle significant lighting changes, and assume that features can be reliably detected by interest operators. One major limitation of generative models is that the learned classifiers cater only to single viewpoints (i.e., one set of model parameters for each viewpoint of an object, even for mirror images of the object). We solve the detection problem in an alignment framework. As in Viola's work [3], given the vertices and their connections in a model, we derive surface normals and match the distribution of surface normals to the observed intensities using mutual information [4].
While Viola highlighted that their technique is purely intensity-based, we found that, for reduced ambiguity, mutual information between the projection contour of the 3D model and image edge maps can be included to help increase detection performance. This is a crucial enhancement when mutual information is to be applied to real-world scenes, beyond the medical image registration domain where mutual information has enjoyed a great deal of success. We note that mutual information is chosen as the matching metric because it can measure similarity between multi-modal data, allowing the framework to be applied to multispectral imagery. Matching polyhedral models of objects to images in order to recover pose parameters is a problem that has been tackled by many authors. The contributions of this work include:


– An alignment framework based on maximization of mutual information, with enhanced detection performance obtained by including contour information.
– Speeding up the search in the 6D pose transformation space by using a cascade of 3D models of increasing levels-of-detail.

Section 2 reviews related work. Section 3 discusses how non-uniform sampling is introduced to reduce the number of candidate hypotheses. In section 4, we present the cascaded detection strategy using an LOD hierarchy. Some experimental results and future work end the paper.

2 Previous Work

Campbell and Flynn provided a comprehensive survey of 3D object recognition techniques using 3D geometric models [5]. Here we discuss some previous alignment-based work. Kollnig and Nagel [6] described a vehicle tracking system that fits discontinuities between surface facets of a simple polyhedral model to image gradients. A gradient image obtained from the discontinuities is matched to the gray-value gradients of the input image. The difference between the synthetic gradient image and the gray-value gradient of the image is used to update the model pose. Tan et al. [7] described a vehicle detection system using simple polyhedral car models. Target objects are assumed to lie on a known ground plane. This assumption reduces the problem of localization and recognition from 6 degrees-of-freedom to 3 degrees-of-freedom. The ground plane constraint allows pose to be estimated by matching 2D image and 3D model lines using the Hough transform. Before line correspondences are established, the ground plane has to be recovered from the input image. Suveg and Gosselman [8] aligned simple polyhedral models to aerial views of buildings using mutual information as the matching metric. Mutual information between the gradient magnitude along the model contour and the image data is computed. If more images are available, mutual information between texture information of multiple images is included as additional information. Our work is based on Viola's alignment approach [3]: surface normals of the object are model instances and are matched to intensity values by maximizing their mutual information with respect to a set of transformation parameters. Leventon et al. [9] extended the alignment framework to use multiple views of the object when a single image does not provide enough information. The notion of cascading has been applied to object detection [10].
In this work, the cascading of 3D LOD models for object detection in 2D images is a new idea that aims to detect target objects and discard unlikely hypotheses rapidly.

3 Parameter Space Sampling

In this work, the state space comprises the six parameters of a 3D rigid-body transformation (three for translation and three for rotation). Existing work estimates pose parameters by optimizing an evaluation function. Such optimization-based methods share the problems of being sensitive to the initial pose, and may experience slow convergence or be trapped in local minima. Viola [3] derived an


approximation of the derivative of mutual information with respect to the transformation parameters, and used a stochastic gradient descent algorithm to seek the local maximum. Although stochastic gradient search is relatively fast compared to techniques that do not require function derivatives (such as Powell's method), it still faces the problem of local extrema. To escape from the aforementioned problems, our framework falls back on exhaustive search. An exhaustive search in a discretized state space of the pose parameters will find the global maximum if the state space is sampled with sufficient resolution. Such exhaustive search is not trapped at local maxima and does not depend on initial states, but can be enormously expensive for high-dimensional spaces.

3.1 Appearance-Dependent Sampling of State Space

Attempting to uniformly sample the full transformation state space is inefficient, as many combinations of parameter values do not lead to significantly visible changes in the projected image space. In this section, we describe how the range and sampling intervals of the parameters are manually determined in order to limit the sampling to pose variations of interest. The range for each parameter is set depending on the visibility of the projections. For instance, the X-axis translation of the polyhedral model is placed in the range [−1.5, 1.5] units with respect to a virtual camera of known focal length and viewing screen size; the object is either totally clipped or unrecognizable if the X parameter exceeds this range. The Y-axis translation has a range of between −1 and 1, while the range for the Z-axis translation is [−2, 2]. The Y-axis rotation has the largest range as it involves greater variation in object appearance, i.e. the number of visible views corresponding to Y-axis rotation is larger than for X- and Z-axis rotations. Both rotations about the X- and Z-axes have a range of between −10 and 10 degrees. After determining the range for each parameter, we can further improve efficiency by setting a different sampling scale for each parameter. One way of defining these step sizes is by looking at how different the models appear in image space when each of the parameters is changed: for instance, when the model is translated by one unit along the X-axis in object space, by how many pixels does the model appear to have been translated in the X-axis direction in image space? Through such observations for all the parameters, we can define a step size ∆Sp for each parameter p, where each ∆Sp accounts for a cluster of parameter values with very similar appearance on the viewing screen. In the next section, we describe detection using a cascade of increasing levels-of-detail of the object model.
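As a concrete illustration, the stratification above can be sketched as an explicit enumeration of the 6D state grid. The ranges follow the text, while the step sizes are hypothetical placeholders: the paper determines each ∆Sp empirically from image-space appearance.

```python
import itertools

# Per-parameter (min, max, step): ranges follow the text; the step sizes
# (the delta_S_p values) are hypothetical placeholders.
param_grid = {
    "tx": (-1.5, 1.5, 0.5),  # X-axis translation (units)
    "ty": (-1.0, 1.0, 0.5),  # Y-axis translation
    "tz": (-2.0, 2.0, 0.5),  # Z-axis translation
    "rx": (-10, 10, 5),      # X-axis rotation (degrees)
    "ry": (0, 180, 5),       # Y-axis rotation: the largest range
    "rz": (-10, 10, 5),      # Z-axis rotation
}

def sample_axis(lo, hi, step):
    """Inclusively sample one parameter axis at the given step size."""
    vals, v = [], lo
    while v <= hi + 1e-9:          # tolerance for float accumulation
        vals.append(round(v, 6))
        v += step
    return vals

axes = [sample_axis(*spec) for spec in param_grid.values()]
states = list(itertools.product(*axes))  # densely sampled 6D state vectors
```

Even with these coarse steps the grid already contains hundreds of thousands of states, which is why the cascade in section 4 is needed to evaluate them cheaply.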

4 Cascaded Detection Using a Level-of-Detail Model Hierarchy

We construct 3D models at different levels-of-detail (LOD) using model simplification software [11], which reduces the number of polygons while maintaining


a high-quality approximation to the original polygonal surfaces. As models with lower LOD take a much shorter time to render, these models are first used to evaluate the densely sampled states of the parameter space in order to quickly discard the very unlikely states. However, as the accuracy of these lower-LOD models is poorer, higher-LOD models are required to further evaluate the more likely states. By combining the 3D models of increasing LOD into a hierarchy, we form a detection cascade. Recent improvements in methods for the acquisition of 3D models allow high-quality 3D models to be obtained more easily. Additionally, we use 3D models that are freely available from the Internet. Figure 1 illustrates an example LOD hierarchy, with white dots on the models marking the locations where surface normals are sampled. Surface normals are collected from normal maps (the images in figure 1 are normal maps) rendered using OpenGL, where the (x, y, z) components of a normal correspond to the (r, g, b) values of a point on the normal map. For a set of pose parameters P, the model has normal samples N and corresponding intensity values I. The mutual information MI between N and I is [4]:

MI(N, I) = H(N) + H(I) − H(N, I)    (1)

where H(A) is the entropy of a random variable A:

H(A) = − Σ_a p(a) log p(a)    (2)

and H(A, B) is the joint entropy of random variables A and B, defined as:

H(A, B) = − Σ_a Σ_b p(a, b) log p(a, b)    (3)

As the lower-LOD models are coarse shape approximations to the object, their MIs have lower values than the MI for the model with the highest LOD. In addition, models of lower LOD are weak models, as they may correspond to multiple objects (i.e. including non-target objects) in the image. In the initial levels, we use these weak models to discard unlikely states using a lower threshold value. State vectors that meet the threshold get passed to the next level, which has a higher threshold value. As the weak models have lower rendering cost, detection in a cascaded manner results in a speed-up. Figure 2 shows the cascade architecture for a car model. For a cascade C = {m_1, m_2, ..., m_n}, the MIs between model m_i at level i and the image are evaluated at the discrete 6D state vectors defined by the stratification, and the maximum MI value (computed using (1), in the same manner as Viola's algorithm [3]) is recorded as t_i. Given r training images, we run the same evaluation using model m_i for each training image and record its maximum MI values in

T_{m_i} = {t_{m_i,1}, t_{m_i,2}, ..., t_{m_i,r}}

The average of T_{m_i} becomes the MI threshold value for level i.
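A minimal sketch of the MI computation in (1)–(3), estimated from a joint histogram. The scalar quantization of normals into bin codes is an implementation assumption for illustration, not something the paper specifies.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (zero bins ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(n_codes, intensities, bins=16):
    """MI(N, I) = H(N) + H(I) - H(N, I), estimated from a joint histogram.
    n_codes: quantized surface-normal bin codes at the sampled model points
    (the scalar quantization is an illustrative assumption);
    intensities: corresponding image intensity values."""
    joint, _, _ = np.histogram2d(n_codes, intensities, bins=bins)
    joint = joint / joint.sum()
    p_n = joint.sum(axis=1)  # marginal distribution of normal codes
    p_i = joint.sum(axis=0)  # marginal distribution of intensities
    return entropy(p_n) + entropy(p_i) - entropy(joint.ravel())

# Sanity check: identical samples give high MI, independent samples near zero.
rng = np.random.default_rng(0)
n = rng.integers(0, 16, 10000).astype(float)
mi_dependent = mutual_information(n, n)
mi_independent = mutual_information(n, rng.integers(0, 16, 10000).astype(float))
```

The histogram-based estimator is the standard choice when, as here, the metric must work across modalities without assuming any functional relation between the two signals.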


Fig. 2. A cascade of 3D models with increasing LOD, with each model acting as a classifier
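The cascade in figure 2 can be sketched as a simple filtering loop. The names (`mi_fn`, `thresholds`) are hypothetical, and a real MI score would require rendering each model at each candidate state.

```python
def cascaded_detect(states, models, thresholds, mi_fn):
    """Pass candidate pose states through an LOD cascade.
    models: 3D models ordered from lowest to highest LOD;
    thresholds: per-level MI thresholds (averaged training maxima in the
    paper); mi_fn(model, state): MI score of one state under one model.
    All names here are hypothetical sketches, not the authors' API."""
    survivors = states
    for model, tau in zip(models, thresholds):
        survivors = [s for s in survivors if mi_fn(model, s) >= tau]
        if not survivors:        # everything rejected: early exit
            break
    return survivors

# Toy illustration with scalar "states" and a fake MI that grows with LOD.
lod_models = [1, 2, 3]              # stand-ins for 3D models
level_thresholds = [0.2, 0.4, 0.6]  # increasing per-level thresholds
fake_mi = lambda m, s: m * s / 3.0
detections = cascaded_detect([0.1, 0.5, 0.9], lod_models, level_thresholds, fake_mi)
```

The speed-up comes from the loop structure itself: the cheap low-LOD models see every state, while the expensive high-LOD models see only the few survivors.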

5 Experiments

5.1 Normal Maps Generation

Surface normals are collected from the visible surface patches for each hypothetical pose. While determining front-facing polygons is a simple task, it is non-trivial to determine visible polygons, as occlusion has to be taken into account. We adopt the normal map generation method from computer graphics: normals are collected from normal maps rendered using the methods described in [12]. Leventon et al. [9] also generated normal maps for MI computations. The RGB channels of the normal maps correspond to the (x, y, z) coordinates of the surface normals (figure 1).
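The (x, y, z) to (r, g, b) correspondence described above amounts to the standard normal-map encoding, which might be sketched as:

```python
import numpy as np

def normal_to_rgb(normals):
    """Encode unit surface normals with (x, y, z) in [-1, 1] as RGB values
    in [0, 255] -- the standard normal-map convention the text describes."""
    n = np.asarray(normals, dtype=float)
    n = n / np.linalg.norm(n, axis=-1, keepdims=True)  # enforce unit length
    return np.round((n * 0.5 + 0.5) * 255).astype(np.uint8)

def rgb_to_normal(rgb):
    """Decode a normal-map pixel back to an approximate unit normal."""
    n = np.asarray(rgb, dtype=float) / 255.0 * 2.0 - 1.0
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# A normal pointing straight at the camera encodes as (128, 128, 255).
rgb = normal_to_rgb([[0.0, 0.0, 1.0]])
decoded = rgb_to_normal(rgb)
```

Rendering this encoding with OpenGL (as the paper does) makes the graphics pipeline resolve occlusion for free via the depth buffer, so only visible polygons contribute samples.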

5.2 Edge Information for Reduced Ambiguity

Our evaluation function is the mutual information between object and image data as expressed in (1). Using intensity information alone in a single image may not be sufficient, as shown by [9]: the observed data may not provide enough information due to occlusion, background clutter or variation in illumination conditions. While Viola highlighted that their method is purely intensity-based, we found that to apply mutual information to real-world scenes, we have to include other information so that the matching metric is more discriminative. To illustrate the ambiguity issue, a model is rotated around the Y-axis and mutual information measures are recorded at uniform steps of five degrees from 0 to 180 degrees (figure 3). At one of the angles, the model is correctly aligned with the image. The graph shows that the maximum mutual information does not occur at the ground truth (the shaded marker) but at a nearby pose (65 degrees). To resolve this ambiguity, the edge orientations of the projected contours of the model (figure 4) are added into the mutual information between model and image data. For each hypothetical pose, the contours of the projected model are detected using an edge detector, and the edge orientations of the model contours, EOm, are computed. We then detect edges around the model contours in the image. Image edge pixels within a window (we used 10 pixels) around each edge pixel of the model contours are included in the calculation of the edge orientations for the image edges, EOi. The mutual information between model and image edge orientations, MI(EOm, EOi), is then added to MI(N, I) defined in (1):

MI = MI(EOm, EOi) + MI(N, I)    (4)
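A sketch of the combined score in (4). The histogram-based MI estimator and the gradient-based orientation computation are assumptions about implementation details the paper does not spell out.

```python
import numpy as np

def mi(a, b, bins=16):
    """Histogram-based mutual information between two 1D sample arrays."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    joint = joint / joint.sum()
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return h(joint.sum(axis=1)) + h(joint.sum(axis=0)) - h(joint.ravel())

def edge_orientations(img):
    """Per-pixel gradient orientation (radians): a simple stand-in for the
    contour-orientation computation the text leaves unspecified."""
    gy, gx = np.gradient(img.astype(float))
    return np.arctan2(gy, gx)

def combined_mi(normals, intensities, model_eo, image_eo):
    """Equation (4): MI = MI(EO_m, EO_i) + MI(N, I)."""
    return mi(model_eo, image_eo) + mi(normals, intensities)

# A horizontal intensity ramp has gradient orientation 0 everywhere.
ramp = np.tile(np.arange(8.0), (8, 1))
orientations = edge_orientations(ramp)

# Dependent inputs drive both MI terms up, so the combined score is large.
rng = np.random.default_rng(1)
samples = rng.random(2000)
score = combined_mi(samples, samples, samples, samples)
```

In practice the two terms need not be weighted equally; the paper simply sums them, which this sketch mirrors.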



Fig. 3. (a) Intensity information alone is insufficient for matching using mutual information. (b) After adding edge information into the objective function (1), the maximum mutual information is achieved at the ground truth (i.e. the shaded marker).

Fig. 4. (a) Edge orientation is included as additional information. (b) Some of the test images. (c) One of the infra-red images used in experiments.

Figure 3 shows that after adding in the edge information, the maximum mutual information occurs at the ground truth.

5.3 Error Analysis

To ensure that the maximum mutual information (MI) appears at state vectors (which are defined by the stratification) near the ground truth, we examine the energy surface over the state vectors. First, a plane d that passes through both the point with the maximum mutual information and the ground truth point is chosen. We then take the points near plane d (points G) and project them onto plane d. The MI values versus the 2D coordinates of the points G projected on plane d are then plotted. A visualization of the energy surface is shown in figure 5. We found that with edge information added, the estimated pose is always at or close to the ground truth pose.
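The projection step can be sketched as follows, illustrated in 3D for clarity (the paper's state vectors are 6D, where an analogous orthonormal-basis construction applies); the choice of in-plane basis is an assumption, as any rotation of it is equally valid.

```python
import numpy as np

def project_onto_plane(points, origin, normal):
    """Project 3D points onto the plane through `origin` with normal
    `normal`, returning 2D coordinates in an in-plane orthonormal basis."""
    normal = normal / np.linalg.norm(normal)
    helper = np.array([1.0, 0.0, 0.0])
    if abs(normal @ helper) > 0.9:   # avoid a helper parallel to the normal
        helper = np.array([0.0, 1.0, 0.0])
    u = np.cross(normal, helper)
    u /= np.linalg.norm(u)
    v = np.cross(normal, u)
    d = points - origin
    d = d - np.outer(d @ normal, normal)  # remove out-of-plane component
    return np.stack([d @ u, d @ v], axis=1)

pts = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 5.0]])
uv = project_onto_plane(pts, origin=np.zeros(3),
                        normal=np.array([0.0, 0.0, 1.0]))
```

Plotting MI against these 2D coordinates gives the energy-surface visualization of figure 5.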

5.4 Object Detection

We applied our framework to vehicle detection. Software for constructing 3D models at different LOD is readily available on the Internet, such as the popular model simplification tool by Garland and Heckbert [11]. We used the MultiRes


Fig. 5. Visualization of energy surface for pose parameters. Maximum mutual information (point with red diamond) appears near to the ground truth point (point with red circle).

Fig. 6. 3D car models that form the LOD hierarchy in our experiments

Fig. 7. (a) ROC curves for mutual information with and without edge information. (b) ROC curves for the LOD models, where LOD1 is the highest LOD model.

modifier in 3D Studio Max to generate the LOD models. The cascaded detection method was tested on both real and infra-red images (figure 4). To evaluate the performance of the detection algorithm, we first manually align the highest-LOD model to the images, and these ground truth poses (which are a few 6D


state vectors) are then recorded as true positives. We then run through the cascade and record the number of hits and false positives. Figure 7 shows the receiver operating characteristic (ROC) curves for mutual information with and without edge information on one of the test images. The ROC curves show that including edge information improves detection performance significantly. While an exhaustive search in the stratified parameter space using the highest-LOD model (i.e. a single layer) takes nearly thirty minutes to complete, the cascaded detection takes about eight minutes using a hierarchy of five models. The car models have 13, 26, 78, 366 and 3317 polygons respectively (figure 6). ROC curves for the five LOD models are shown in figure 7. We note that there is still room for improvement in speed, as currently the models are handled independently without considering their individual detection performance. This is a design issue of the cascade: choosing which models to include in the hierarchy, and how to set the threshold value for each model by analyzing their ROC curves.

6 Conclusion

We have presented an alignment-based detection framework using a hierarchy of 3D models of increasing levels-of-detail. The designed cascade speeds up the search for the optimal pose parameters in a densely sampled parameter space. As the method does not face the issues of local optima and convergence failures, it is more reliable and practical than methods that rely on directed search techniques. We have demonstrated that by adding edge information into the calculation of mutual information, the discriminative power of the matching metric is increased significantly for real scenes. We are working on an optimization framework for improving the design of the cascade such that an optimal trade-off between performance and running time can be achieved; choosing models at the optimal levels-of-detail to include is part of this cascade design issue. We would also like to test the framework more extensively on other data sets.

References

1. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised scale-invariant learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Volume 2., Madison, WI (2003) 264–271
2. Weber, M., Welling, W., Perona, P.: Unsupervised learning of models for recognition. In: Proceedings of the European Conference on Computer Vision. Volume 1., Dublin, Ireland (2000) 18–32
3. Viola, P.: Alignment by Maximization of Mutual Information. PhD thesis, Massachusetts Institute of Technology (1995)
4. Cover, T., Thomas, J.: Elements of Information Theory. John Wiley (1991)
5. Campbell, R., Flynn, P.: A survey of free-form object representation and recognition techniques. Computer Vision and Image Understanding 81 (2001) 166–210


6. Kollnig, H., Nagel, N.N.: 3D pose estimation by directly matching polyhedral models to gray value gradients. International Journal of Computer Vision 23 (1997) 283–302
7. Tan, T., Sullivan, G., Baker, K.: Model-based localization and recognition of road vehicles. International Journal of Computer Vision 27 (1998) 5–25
8. Suveg, I., Gosselman, G.: Mutual information based evaluation of 3D building models. In: Proceedings of the International Conference on Pattern Recognition. Volume 3., Quebec City, Canada (2002) 188–197
9. Leventon, M., Wells III, W., Grimson, W.: Multiple view 2D-3D mutual information registration. In: DARPA Image Understanding Workshop. (1997) 625–630
10. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Volume 1., Kauai, HI (2001) 511
11. Garland, M., Heckbert, P.: Surface simplification using quadric error metrics. In: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques. (1997) 209–216
12. Decaudin, P.: Cartoon-looking rendering of 3D scenes. Technical Report 2919, INRIA (1996)