Alignment of 3D Models to Images Using Region-Based Mutual Information and Neighborhood Extended Gaussian Images

Hon-Keat Pong and Tat-Jen Cham
School of Computer Engineering, Nanyang Technological University, Singapore
[email protected], [email protected]

Abstract. Mutual information has been used for matching and registering 3D models to 2D images. However, in Viola's original framework [1], surface albedo variance is assumed to be minimal when measuring the similarity between 3D models and 2D image data using mutual information. In reality, most objects have textured surfaces with different albedo values across their surfaces, and direct application of this method in such circumstances will fail. To solve this problem, we propose to include spatial information in the original formulation by using histogram-based features of local regions that are robust to local but significant albedo variation. Neighborhood Extended Gaussian Images (NEGI) are used as descriptors to represent local surface regions on the 3D model, while pixel intensity data are considered within corresponding region windows on the image. Experiments on aligning 3D car models in cluttered scenes using this new framework demonstrate substantial improvement compared to the original pixel-wise mutual information approach.
1 Introduction
One of the difficult problems in computer vision is the registration of a 3D model to an image. 2D-3D alignment techniques are applied in the medical imaging domain to register 3D volumetric data with 2D images, with mutual information as one of the most popular similarity measures [2]. 3D geometric models are used for detecting faces and objects in 2D images by finding pose estimates through alignment. Representations of object models have been studied extensively for varied detection techniques and application purposes. Some existing 2D-3D alignment-based detection methods use edges of the 3D geometric models as a matching cue, through finding invariant descriptors from 2D projection profiles or by defining shape signatures [3, 4, 5, 6]. Viola [1] and Maes et al. [7] proposed an alignment approach using a similarity measure derived from information theory [8]. An interesting application in [1] is that we can take surface normal samples (N) from a geometric model, collect the corresponding intensity values (I) in the image and then compute the mutual information (MI) between N and I. Object pose in the image is estimated by maximization of mutual information.
As a similarity measure, mutual information does not assume a known functional relationship between the model and the image. Rather, it only assumes that a consistent relationship exists. The consistency principle states that similar model data will map to similar image data, and it is observed that a correct alignment will generally lead to a consistent relationship (figure 1). This makes mutual information a more robust similarity measure for matching multi-modal data [2].

A likely situation for object detection is that the functional relationship between the 3D model and the 2D image is difficult to model or hard to establish, due to complexities such as illumination changes and shadows, the 3D model being a weak descriptor (for instance, the available model is only a rough approximation of the object shape with a low polygon count), or the object having an unusual appearance because the image was captured with a thermal camera. Mutual information has been shown to be a promising matching metric in such situations, but little has been studied in the case of 2D-3D alignment beyond the initial framework in [1] and the medical image registration domain.

There is an important limitation to mutual information applied to the alignment of 3D geometric models to image data: it fails on surfaces that have significant albedo variation. The reason for the failure is that, as mutual information takes into account only the relationship between one-dimensional samples (i.e. a single model normal and the intensity of a single pixel), the consistency principle breaks down when similar surface normals map to different intensity values (figure 2). In reality, many objects have textured surfaces with varying albedo values across their surfaces. To make mutual information more applicable to real-world scenarios, it is important to handle the issue of varied albedo across the object surface.

In this paper, a method to solve the aforementioned problem is presented. We propose to include spatial information in the original formulation by incorporating a neighborhood set of points in a novel manner that makes it robust to albedo variation. To accommodate the extension to alignment of 3D models to 2D image data, we define the Neighborhood Extended Gaussian Images to represent shape within local surface regions on the model, and consider intensity data within region windows on the image. The method makes mutual information more practical for 2D-3D alignment of non-medical images.

The paper is organized as follows: section 2 reviews related work. Section 3 discusses mutual information as a matching metric and an extension of its original formulation. Section 4 presents the Neighborhood Extended Gaussian Images. Section 5 presents our experimental results, and finally section 6 presents some conclusions and future research directions.
2 Previous Work
Instead of comparing images using singleton pixels, Russakoff et al. [9] extended mutual information to include spatial information by using more pixels in a neighborhood when computing the mutual information – this is applied to 2D medical image registration. The framework exploits the spatial relationship of
pixels in a simple manner to provide greater regularization for the optimization problem, but does not deal with significant albedo variation. Our method not only extends the problem to 3D alignment, but is specifically designed to handle substantial (albeit local) albedo variation.

Campbell and Flynn provide a comprehensive survey of 3D object recognition techniques using 3D geometric models [3]. Two related works that use 3D vehicular models are those of Kollnig and Nagel [5] and Tan et al. [6]. Kollnig and Nagel made use of intensity discontinuities along projection contours to update the object pose, while Tan et al. estimated the model pose by matching 2D image and 3D model lines using the Hough Transform. Recently, Suveg and Vosselman [4] aligned simple polyhedral block models to aerial views of buildings using mutual information as the matching metric; the mutual information between the gradient magnitude along the model contour and the image data is computed. Their framework is still subject to the consistency breakdown issue, as no spatial information is included in the formulation.

In Viola's alignment approach [1], surface normals of the object are matched to intensity values by maximizing their mutual information with respect to a set of transformation parameters. Leventon and Grimson [10] extended the alignment framework to use multiple views of the object when a single image does not provide enough information.
3 Mutual Information as Similarity Measure
Mutual information is a statistical measure assessing the dependency between two random variables, without requiring that functional forms of the random variables be known [8]. It can be thought of as a measure of how well one random variable explains the other, i.e. how much information about one random variable is contained in the other. If random variable A explains random variable B well, their joint entropy is reduced. Defined in terms of entropies, the mutual information between two random variables A and B is

$$I(A, B) = H(A) + H(B) - H(A, B)$$

where $H(A)$ and $H(B)$ are the marginal entropies derived from the probability distribution functions corresponding to A and B:

$$H(A) = -\sum_a p(a) \log p(a), \qquad H(B) = -\sum_b p(b) \log p(b)$$

and $H(A, B)$ is the joint entropy of the two random variables, defined as

$$H(A, B) = -\sum_a \sum_b p(a, b) \log p(a, b)$$
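Neither this section nor the original framework fixes a particular density estimator (Viola's work [1] used Parzen window estimates), so the following minimal Python sketch estimates the entropies from a simple joint histogram; the function name and bin count are illustrative assumptions.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Estimate I(A, B) = H(A) + H(B) - H(A, B) from paired samples.

    a, b: 1-D arrays of paired observations (e.g. a surface-normal
    component and the corresponding pixel intensity). The histogram
    estimator and bin count are assumptions, not the paper's choice.
    """
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()      # joint distribution p(a, b)
    p_a = p_ab.sum(axis=1)          # marginal p(a)
    p_b = p_ab.sum(axis=0)          # marginal p(b)

    def entropy(p):
        p = p[p > 0]                # drop empty bins to avoid log(0)
        return -np.sum(p * np.log(p))

    return entropy(p_a) + entropy(p_b) - entropy(p_ab)
```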
Fig. 1. Consistent relationship between model and image. The figure shows a scatter plot of the variation of intensity values on the white scan line of the teapot image versus variation of the corresponding x components of the surface normals for a correct alignment.
Mutual information is assumed to be maximal when the model is correctly aligned with the image for a set of transformation parameters (we consider a six-parameter rigid transformation of the object model: three rotations and three translations). For our 2D-3D alignment framework using polygonal models, mutual information is computed from the joint and marginal entropies of the surface normals (using the x and y components of the normals) and the image intensities.

However, the original formulation is only feasible for surfaces with minimal albedo variance. As mutual information does not contain information about the spatial distributions of intensities and surface normals, ambiguity arises when the maximum mutual information does not occur at the correct object pose due to varying albedo across surface points. Longer-range interaction between point samples is ignored when they are considered independently in the mutual information formulation.

Russakoff et al. [9] extended the original formulation of mutual information (MI) to include spatial information by using higher-dimensional points consisting of pixels in a neighborhood – the regional mutual information (RMI). For a sample point S, spatial information is brought into MI by grouping neighboring pixels within a chosen radius to form a higher-dimensional vector. When applying Russakoff et al.'s formulation to our case, the normals in a neighborhood (here, a 2×2 window of four samples) are grouped into a higher-dimensional vector N, and the corresponding pixels in the image form the vector I:
Fig. 2. Consistency breakdown: a scatter plot of the intensity values on the same scan line of the teapot image versus variation of the corresponding x components of the surface normals when the teapot is textured
$$N = \{x_1, x_2, x_3, x_4, y_1, y_2, y_3, y_4\}, \qquad I = \{i_1, i_2, i_3, i_4\}$$

where $x_i$ and $y_i$ are the x and y components of the normals, and $i_i$ is the intensity value of the corresponding image pixel. To deal with the curse of dimensionality, the dimensions are assumed to be independent of each other, which decouples the entropy calculation from one involving a single d-dimensional distribution into one involving d one-dimensional distributions. Shannon's entropy formulation for a set of points distributed in $\mathbb{R}^d$ with covariance matrix $\Sigma_d$ is then used to calculate the entropy of the high-dimensional points [11]:

$$H_g(\Sigma_d) = \log\left((2\pi e)^{d/2} \det(\Sigma_d)^{1/2}\right)$$
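Under this Gaussian assumption, the entropy reduces to a covariance computation. A minimal sketch, assuming the samples arrive as an (n, d) array; the small jitter term is an implementation assumption added for numerical robustness and is not part of the formulation above.

```python
import numpy as np

def gaussian_entropy(points):
    """Hg(Sigma_d) = log((2*pi*e)^(d/2) * det(Sigma_d)^(1/2)).

    points: (n, d) array of high-dimensional samples.
    """
    n, d = points.shape
    p0 = points - points.mean(axis=0)        # zero-mean the samples
    sigma = (p0.T @ p0) / n                  # d x d covariance matrix
    # slogdet is numerically safer than det; the jitter guards against
    # rank-deficient covariances (an implementation assumption)
    _, logdet = np.linalg.slogdet(sigma + 1e-9 * np.eye(d))
    return 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * logdet
```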
4 Region Mutual Information Using the Neighborhood Extended Gaussian Images
The straightforward concatenation of neighborhood pixel data into a high-dimensional state vector does not automatically induce invariance to non-constant albedo. A different representation is therefore required. In our framework, the assumption is that while albedo may be substantially different from one point on the object to the next and uncorrelated with the
geometry of the object, the statistics of the albedo within a larger semi-local region on the object surface are much more strongly correlated with the geometry. This is based on the observation that, at least for the classes of objects we are interested in, the portion of albedo variation that is independent of the object geometry is often only of higher spatial frequencies. These high-frequency variations are substantially reduced by considering histogram-based features of larger regions on the object and image. On the other hand, the portion of albedo variation that directly depends on the object geometry can be preserved and used in the computation of mutual information.

The Extended Gaussian Image (EGI) [12] is a 3D shape descriptor obtained by having each polygon vote for the bin corresponding to its normal direction, with a weight equal to the area of the polygon. It is a global representation of the model, as the normals of all polygons are mapped spherically to the histogram (figure 3). The Neighborhood EGI (NEGI) describes the local shape of surface regions by grouping neighborhood surface normals according to their spherical coordinates (i.e. latitude and longitude) $\{\theta, \phi\}$. When building the EGI, one has to tessellate
Fig. 3. High-dimensional point to include spatial information. The mannequin model is shown with normals on the triangles. For a 2D window on the projection screen of the 3D model, normals that fall within the window are collected. The spherical coordinates of each normal are computed to form the vector N. The corresponding intensity values in the image are collected to form the vector I. Both vectors are then combined to form a high-dimensional point for estimation of the joint entropy term in the mutual information formulation.
the Gaussian sphere into cells. These cells should have the same area and similar shape. As the NEGI represents local normals within a small region window, we can assume that the surface patch on the Gaussian sphere corresponding to the region window is finely subdivided.

Normals are sampled from normal maps generated using OpenGL. All pixels on the object in the normal map have RGB values corresponding to the (x, y, z) components of the surface normals at the pixel locations. An example normal map is shown in figure 4. Note that instead of one normal per polygon of area A on the geometric model, normals are continuous on the normal map. Therefore, when only the normals within a small 2D window on the normal map are considered, we can assume that the model has very high resolution and associate unit weight with each normal sample.
Fig. 4. A sample normal map for a car model. We uniformly sample pixel locations on the normal map. RGB values of each pixel correspond to (x, y, z) components of surface normal at the pixel location.
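Decoding normal-map pixels into the spherical coordinates used by the NEGI is mechanical. The exact RGB encoding produced by the authors' OpenGL renderer is not specified, so the sketch below assumes the common mapping of [0, 255] channel values to components in [-1, 1]; the function name is illustrative.

```python
import numpy as np

def normals_to_spherical(rgb):
    """Decode normal-map pixels and convert to (latitude, longitude).

    rgb: (n, 3) array of pixel values in [0, 255]. The channel
    encoding assumed here is [0, 255] -> [-1, 1].
    """
    n = rgb / 255.0 * 2.0 - 1.0                      # decode to [-1, 1]
    n /= np.linalg.norm(n, axis=1, keepdims=True)    # renormalize to unit length
    theta = np.arcsin(np.clip(n[:, 2], -1.0, 1.0))   # latitude from the z component
    phi = np.arctan2(n[:, 1], n[:, 0])               # longitude in the xy-plane
    return theta, phi
```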
As shown in figure 3, for a normal sample n on the normal map, the neighboring normals within a 2D region window w (in this case, a 2×2 window) are collected to form a high-dimensional vector N:

$$N = \{\theta_1, \theta_2, \theta_3, \theta_4, \phi_1, \phi_2, \phi_3, \phi_4\}$$

and the corresponding image intensity values are collected to form the high-dimensional vector I. N and I are then concatenated to form a high-dimensional point p:

$$p = \{\theta_1, \theta_2, \theta_3, \theta_4, \phi_1, \phi_2, \phi_3, \phi_4, i_1, i_2, i_3, i_4\}$$

4.1 Algorithm
The algorithm proceeds as follows (a Python sketch of these steps is given after the list):

– Given an object A and image B, render the normal map of object A at the current pose and generate sample locations on the normal map.
– For each sample location on the normal map, collect N and I. Concatenate N and I to form p.
– For n sample locations, we have n high-dimensional points $p_i$, stacked as $P = [p_1, p_2, ..., p_n]$.
– Calculate the covariance of the points [9], $C = (1/n) P_0 P_0^T$, where $P_0$ is the zero-mean version of P.
– Calculate the joint entropy using $H_g(C)$ and the marginal entropies using the method described in [9].
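These steps map directly onto a short computation. A sketch, assuming the per-window samples have already been collected into arrays with one row per sample location; the marginal entropies are obtained by applying the Gaussian entropy of Section 3 to the N and I sub-blocks of the points, which is equivalent to reading them off the corresponding sub-matrices of the joint covariance as in [9].

```python
import numpy as np

def regional_mutual_information(thetas, phis, intensities):
    """RMI between NEGI features and image intensities.

    thetas, phis: (n, k) arrays of spherical normal coordinates,
    one row per sample location, k normals per region window.
    intensities: (n, k) array of corresponding image intensities.
    """
    N = np.hstack([thetas, phis])    # model half of each point p_i
    I = intensities                  # image half of each point p_i
    P = np.hstack([N, I])            # points p_i stacked as rows

    def h(points):                   # Gaussian entropy, as in Section 3
        n, d = points.shape
        p0 = points - points.mean(axis=0)
        sigma = (p0.T @ p0) / n
        _, logdet = np.linalg.slogdet(sigma + 1e-9 * np.eye(d))
        return 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * logdet

    # I(N, I) = H(N) + H(I) - H(N, I)
    return h(N) + h(I) - h(P)
```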
5 Experimental Results

5.1 Alignment Using the NEGI
In the first experiment, we looked at misalignment with respect to rotational offsets along the y-axis. Region mutual information is plotted with varying neighborhood sizes (r = 2, r = 3, r = 4, r = 5) (figure 5). As we consider region mutual information with larger neighborhoods, more spatial information is included and we get a stronger peak at the global optimum. This distinctiveness of the response at the ground truth point will help to reduce ambiguity, as shown in the following detection experiments.
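A sweep of this kind can be scripted directly. The sketch below assumes a hypothetical helper sample_regions() that wraps the normal-map rendering and window sampling of Section 4, and reuses regional_mutual_information() from the previous sketch; model and image stand in for the loaded car model and test image.

```python
import numpy as np

# sample_regions() is an assumed helper wrapping the OpenGL normal-map
# rendering and region-window sampling; model and image are placeholders.
angles = np.arange(0, 360, 5)
scores = []
for angle in angles:
    thetas, phis, intensities = sample_regions(model, image, y_rot=angle, r=4)
    scores.append(regional_mutual_information(thetas, phis, intensities))

best_angle = angles[int(np.argmax(scores))]  # pose estimate = RMI maximum
```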
Fig. 5. Combined plots of RMI as a function of rotational misalignment about the y-axis for neighborhoods of varying sizes (r = 2, 3, 4, 5), together with the original MI. A stronger peak is obtained at the global optimum when more spatial information is brought into the metric, reducing ambiguity when comparing model to image data.
5.2 Detection
For comparing detection performance, we performed a naive search of the pose space. This allows us to obtain the globally optimal pose parameters without having to worry about local optima and convergence failures. We manually aligned a detailed 3D car model to each test image (some of the test images are shown in figure 6), and the ground truth poses for the model were recorded. When plotting the receiver operating characteristic (ROC) curves, these ground truth poses are the true positives. The average of the mutual information values at the
Fig. 6. Test images
Fig. 7. ROC curves for RMI and MI
ground truth poses is used as the detection threshold. ROC curves for the original MI and for RMI with varying neighborhood sizes are shown in figure 7. The plots show that when spatial information is included, there is a gain in detection performance.
6 Conclusion
This paper presents a method to align a 3D geometric model to an image using mutual information (MI) as the similarity measure. While MI has enjoyed a great deal of success in the medical image registration domain, its application to general object detection has been limited, one major reason being its failure to capture longer-range information when comparing model to image data. To address this issue, we propose a region-based method in which ambiguity due to albedo variance is reduced by including spatial information. We defined the Neighborhood Extended Gaussian Images for the case of 3D model to 2D image alignment. Experiments showed that the method works better than the original formulation.

In the future, we plan to include regional edge information in the mutual information calculation, which we believe would make the metric more discriminative. Additionally, we have made some progress in designing an approach that allows this framework to run much more quickly [13]. We would also like to validate the method more extensively on other data sets.
References

1. Viola, P.: Alignment by Maximization of Mutual Information. PhD thesis, Massachusetts Institute of Technology (1995)
2. Pluim, J., Maintz, J., Viergever, M.: Mutual information based registration of medical images: A survey. IEEE Transactions on Medical Imaging 22 (2003) 986–1004
3. Campbell, R., Flynn, P.: A survey of free-form object representation and recognition techniques. Computer Vision and Image Understanding 81 (2001) 166–210
4. Suveg, I., Vosselman, G.: Mutual information based evaluation of 3D building models. In: Proceedings of the International Conference on Pattern Recognition. Volume 3, Quebec City, Canada (2002) 188–197
5. Kollnig, H., Nagel, H.-H.: 3D pose estimation by directly matching polyhedral models to gray value gradients. International Journal of Computer Vision 23 (1997) 283–302
6. Tan, T., Sullivan, G., Baker, K.: Model-based localization and recognition of road vehicles. International Journal of Computer Vision 27 (1998) 5–25
7. Maes, F., Collignon, A., Vandermeulen, D., Marchal, G., Suetens, P.: Multimodality image registration by maximization of mutual information. IEEE Transactions on Medical Imaging 16 (1997) 187–198
8. Cover, T., Thomas, J.: Elements of Information Theory. John Wiley (1991)
9. Russakoff, D., Tomasi, C., Rohlfing, T., Maurer, C.: Image similarity using mutual information of regions. In: Proceedings of the European Conference on Computer Vision, Prague, Czech Republic (2004) 596–607
10. Leventon, M., Wells III, W., Grimson, W.: Multiple view 2D-3D mutual information registration. In: DARPA Image Understanding Workshop (1997) 625–630
11. Shannon, C.: A mathematical theory of communication. The Bell System Technical Journal 27 (1948) 379–423
12. Horn, B.: Extended Gaussian images. Proceedings of the IEEE 72 (1984) 1656–1678
13. Pong, H., Cham, T.: Object detection using a cascade of 3D models. In: Proceedings of the Asian Conference on Computer Vision, Hyderabad, India (2006)