Alignment of 3D Models to Images Using Region-Based Mutual Information and Neighborhood Extended Gaussian Images

Hon-Keat Pong and Tat-Jen Cham
School of Computer Engineering, Nanyang Technological University, Singapore
[email protected], [email protected]

Abstract. Mutual information has been used for matching and registering 3D models to 2D images. However, in Viola's original framework [1], surface albedo variance is assumed to be minimal when measuring the similarity between 3D models and 2D image data using mutual information. In reality, most objects have textured surfaces with different albedo values across their surfaces, and direct application of this method in such circumstances will fail. To solve this problem, we propose to include spatial information in the original formulation by using histogram-based features of local regions that are robust to local but significant albedo variation. Neighborhood Extended Gaussian Images (NEGI) are used as descriptors to represent local surface regions on the 3D model, while pixel intensity data are considered within corresponding region windows on the image. Experiments on aligning 3D car models in cluttered scenes using this new framework demonstrate substantial improvement compared to the original pixel-wise mutual information approach.
1 Introduction
One of the difficult problems in computer vision is the registration of a 3D model to an image. 2D-3D alignment techniques are applied in the medical imaging domain to register 3D volumetric data with 2D images, with mutual information as one of the most popular similarity measures [2]. 3D geometric models are used for detecting faces and objects in 2D images by finding pose estimates through alignment. Representations of object models have been studied extensively for varied detection techniques and application purposes. Some existing 2D-3D alignment-based detection methods use edges of the 3D geometric models as a matching cue, through finding invariant descriptors from 2D projection profiles or by defining shape signatures [3, 4, 5, 6]. Viola [1] and Maes et al. [7] proposed an alignment approach using a similarity measure derived from information theory [8]. An interesting application in [1] is that we can take surface normal samples (N) from a geometric model, collect the corresponding intensity values (I) in the image and then compute the mutual information (MI) between N and I. Object pose in the image is estimated by maximization of mutual information.
As a similarity measure, mutual information does not assume a known functional relationship between the model and the image. Rather, it only assumes that a consistent relationship exists. The consistency principle states that similar model data will map to similar image data, and it is observed that a correct alignment will generally lead to a consistent relationship (figure 1). This makes mutual information a more robust similarity measure for matching multi-modal data [2].

A likely situation for object detection is that the functional relationship between the 3D model and the 2D image is difficult to model or hard to establish, due to complexities such as illumination changes and shadows, the 3D model being a weak descriptor (for instance, the available model is only a rough approximation of the object shape with a low polygon count), or the object having an unusual appearance because the image was captured with a thermal camera. Mutual information has been shown to be a promising matching metric in such situations, but little has been studied in the case of 2D-3D alignment beyond the initial framework in [1] and the medical image registration domain.

There is an important limitation to mutual information applied to the alignment of 3D geometric models to image data: it fails on surfaces that have significant albedo variation. The reason for the failure is that, as mutual information takes into account only the relationship between one-dimensional samples (i.e. a single model normal and the intensity of a single pixel), the consistency principle breaks down when similar surface normals map to different intensity values (figure 2). In reality, many objects have textured surfaces with varying albedo values across their surfaces. To make mutual information more applicable to real-world scenarios, it is important to handle the issue of varied albedo across the object surface.

In this paper, a method to solve the aforementioned problem is presented. We propose to include spatial information in the original formulation by incorporating a neighborhood set of points in a novel manner that makes it robust to albedo variation. To accommodate the extension to alignment of 3D models to 2D image data, we define the Neighborhood Extended Gaussian Images to represent shape within local surface regions on the model, and consider intensity data within region windows on the image. The method makes mutual information more practical for 2D-3D alignment of non-medical images.

The paper is organized as follows: section 2 reviews related work. Section 3 discusses mutual information as a matching metric and an extension of its original formulation. Section 4 presents the Neighborhood Extended Gaussian Images. Section 5 presents our experimental results, and finally section 6 presents some conclusions and future research directions.
2 Previous Work
Instead of comparing images using singleton pixels, Russakoff et al. [9] extended mutual information to include spatial information by using more pixels in a neighborhood when computing the mutual information – this is applied to 2D medical image registration. The framework exploits the spatial relationship of
pixels in a simple manner to provide greater regularization for the optimization problem, but does not deal with significant albedo variation. Our method not only extends the problem to 3D alignment, but is specifically designed to handle substantial (albeit local) albedo variation.

Campbell and Flynn provide a comprehensive survey of 3D object recognition techniques using 3D geometric models [3]. Two related works that use 3D vehicular models are those of Kollnig and Nagel [5] and Tan et al. [6]. Kollnig and Nagel made use of intensity discontinuities along projection contours to update the object pose, while Tan et al. estimated the model pose by matching 2D image and 3D model lines using the Hough Transform. Recently, Suveg and Vosselman [4] aligned simple polyhedral block models to aerial views of buildings using mutual information as the matching metric; the mutual information between the gradient magnitude along the model contour and the image data is computed. Their framework is still subject to the consistency breakdown issue, as no spatial information is included in the formulation.

In Viola's alignment approach [1], surface normals of the object are matched to intensity values by maximizing their mutual information with respect to a set of transformation parameters. Leventon and Grimson [10] extended the alignment framework to use multiple views of the object when a single image does not provide enough information.
3 Mutual Information as Similarity Measure
Mutual information is a statistical measure assessing the dependency between two random variables, without requiring that functional forms of the random variables be known [8]. It can be thought of as a measure of how well one random variable explains the other, i.e. how much information about one random variable is contained in the other. If random variable A explains random variable B well, their joint entropy is reduced. Defined in terms of entropies, the mutual information between two random variables A and B is

$$I(A, B) = H(A) + H(B) - H(A, B)$$

where $H(A)$ and $H(B)$ are the marginal entropies derived from the probability distribution functions corresponding to A and B:

$$H(A) = -\sum_a p(a) \log p(a), \qquad H(B) = -\sum_b p(b) \log p(b)$$

and $H(A, B)$ is the joint entropy of the two random variables, defined as

$$H(A, B) = -\sum_a \sum_b p(a, b) \log p(a, b)$$
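Neither this section nor the original framework fixes a particular density estimator (Viola's work [1] used Parzen window estimates), so the following minimal Python sketch estimates the entropies from a simple joint histogram; the function name and bin count are illustrative assumptions.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Estimate I(A, B) = H(A) + H(B) - H(A, B) from paired samples.

    a, b: 1-D arrays of paired observations (e.g. a surface-normal
    component and the corresponding pixel intensity). The histogram
    estimator and bin count are assumptions, not the paper's choice.
    """
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()      # joint distribution p(a, b)
    p_a = p_ab.sum(axis=1)          # marginal p(a)
    p_b = p_ab.sum(axis=0)          # marginal p(b)

    def entropy(p):
        p = p[p > 0]                # drop empty bins to avoid log(0)
        return -np.sum(p * np.log(p))

    return entropy(p_a) + entropy(p_b) - entropy(p_ab)
```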
Fig. 1. Consistent relationship between model and image. The figure shows a scatter plot of the variation of intensity values on the white scan line of the teapot image versus variation of the corresponding x components of the surface normals for a correct alignment.
Mutual information is assumed to be maximal when the model is correctly aligned with the image for a set of transformation parameters (we consider a six-parameter rigid transformation of the object model: three rotations and three translations). For our 2D-3D alignment framework using polygonal models, mutual information is computed from the joint and marginal entropies of the surface normals (using the x and y components of the normals) and the image intensities.

However, the original formulation is only feasible for surfaces with minimal albedo variance. As mutual information does not contain information about the spatial distributions of intensities and surface normals, ambiguity arises when the maximum mutual information does not occur at the correct object pose due to varying albedo across surface points. Longer-range interaction between point samples is ignored when they are considered independently in the mutual information formulation.

Russakoff et al. [9] extended the original formulation of mutual information (MI) to include spatial information by using higher-dimensional points consisting of pixels in a neighborhood – the regional mutual information (RMI). For a sample point S, spatial information is brought into MI by grouping neighboring pixels within a chosen radius to form a higher-dimensional vector. When applying Russakoff et al.'s formulation to our case, the normals in a neighborhood (here, a 2×2 window of four samples) are grouped into a higher-dimensional vector N, and the corresponding pixels in the image form the vector I:
Fig. 2. Consistency breakdown: a scatter plot of the intensity values on the same scan line of the teapot image versus variation of the corresponding x components of the surface normals when the teapot is textured
$$N = \{x_1, x_2, x_3, x_4, y_1, y_2, y_3, y_4\}, \qquad I = \{i_1, i_2, i_3, i_4\}$$

where $x_i$ and $y_i$ are the x and y components of the normals, and $i_i$ is the intensity value of the corresponding image pixel. To deal with the curse of dimensionality, the dimensions are assumed to be independent of each other, which decouples the entropy calculation from one involving a single d-dimensional distribution into one involving d one-dimensional distributions. Shannon's entropy formulation for a set of points distributed in $\mathbb{R}^d$ with covariance matrix $\Sigma_d$ is then used to calculate the entropy of the high-dimensional points [11]:

$$H_g(\Sigma_d) = \log\left((2\pi e)^{d/2} \det(\Sigma_d)^{1/2}\right)$$
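Under this Gaussian assumption, the entropy reduces to a covariance computation. A minimal sketch, assuming the samples arrive as an (n, d) array; the small jitter term is an implementation assumption added for numerical robustness and is not part of the formulation above.

```python
import numpy as np

def gaussian_entropy(points):
    """Hg(Sigma_d) = log((2*pi*e)^(d/2) * det(Sigma_d)^(1/2)).

    points: (n, d) array of high-dimensional samples.
    """
    n, d = points.shape
    p0 = points - points.mean(axis=0)        # zero-mean the samples
    sigma = (p0.T @ p0) / n                  # d x d covariance matrix
    # slogdet is numerically safer than det; the jitter guards against
    # rank-deficient covariances (an implementation assumption)
    _, logdet = np.linalg.slogdet(sigma + 1e-9 * np.eye(d))
    return 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * logdet
```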
4 Region Mutual Information Using the Neighborhood Extended Gaussian Images
The straightforward concatenation of neighborhood pixel data into a high-dimensional state vector does not automatically induce invariance to non-constant albedo. A different representation is therefore required. In our framework, the assumption is that while albedo may be substantially different from one point on the object to the next and uncorrelated with the
geometry of the object, the statistics of the albedo within a larger semi-local region on the object surface are much more strongly correlated with the geometry. This is based on the observation that, at least for the classes of objects we are interested in, the portion of albedo variation that is independent of the object geometry is often only of higher spatial frequencies. These high-frequency variations are substantially reduced by considering histogram-based features of larger regions on the object and image. On the other hand, the portion of albedo variation that directly depends on the object geometry can be preserved and used in the computation of mutual information.

The Extended Gaussian Image (EGI) [12] is a 3D shape descriptor obtained by having each polygon vote for the bin corresponding to its normal direction, with a weight equal to the area of the polygon. It is a global representation of the model, as the normals of all polygons are mapped spherically to the histogram (figure 3). The Neighborhood EGI (NEGI) describes the local shape of surface regions by grouping neighborhood surface normals according to their spherical coordinates (i.e. latitude and longitude) $\{\theta, \phi\}$. When building the EGI, one has to tessellate
Fig. 3. High-dimensional point to include spatial information. The mannequin model is shown with normals on the triangles. For a 2D window on the projection screen of the 3D model, normals that fall within the window are collected. The spherical coordinates of each normal are computed to form the vector N. The corresponding intensity values in the image are collected to form the vector I. Both vectors are then combined to form a high-dimensional point for estimation of the joint entropy term in the mutual information formulation.
the Gaussian sphere into cells. These cells should have the same area and similar shape. As the NEGI represents local normals within a small region window, we can assume that the surface patch on the Gaussian sphere corresponding to the region window is finely subdivided.

Normals are sampled from normal maps generated using OpenGL. All pixels on the object in the normal map have RGB values corresponding to the (x, y, z) components of the surface normals at the pixel locations. An example normal map is shown in figure 4. Note that instead of one normal per polygon of area A on the geometric model, normals are continuous on the normal map. Therefore, when only the normals within a small 2D window on the normal map are considered, we can assume that the model has very high resolution and associate unit weight with each normal sample.
Fig. 4. A sample normal map for a car model. We uniformly sample pixel locations on the normal map. RGB values of each pixel correspond to (x, y, z) components of surface normal at the pixel location.
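Decoding normal-map pixels into the spherical coordinates used by the NEGI is mechanical. The exact RGB encoding produced by the authors' OpenGL renderer is not specified, so the sketch below assumes the common mapping of [0, 255] channel values to components in [-1, 1]; the function name is illustrative.

```python
import numpy as np

def normals_to_spherical(rgb):
    """Decode normal-map pixels and convert to (latitude, longitude).

    rgb: (n, 3) array of pixel values in [0, 255]. The channel
    encoding assumed here is [0, 255] -> [-1, 1].
    """
    n = rgb / 255.0 * 2.0 - 1.0                      # decode to [-1, 1]
    n /= np.linalg.norm(n, axis=1, keepdims=True)    # renormalize to unit length
    theta = np.arcsin(np.clip(n[:, 2], -1.0, 1.0))   # latitude from the z component
    phi = np.arctan2(n[:, 1], n[:, 0])               # longitude in the xy-plane
    return theta, phi
```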
As shown in figure 3, for a normal sample n on the normal map, the neighboring normals within a 2D region window w (in this case, a 2×2 window) are collected to form a high-dimensional vector N:

$$N = \{\theta_1, \theta_2, \theta_3, \theta_4, \phi_1, \phi_2, \phi_3, \phi_4\}$$

and the corresponding image intensity values are collected to form the high-dimensional vector I. N and I are then concatenated to form a high-dimensional point p:

$$p = \{\theta_1, \theta_2, \theta_3, \theta_4, \phi_1, \phi_2, \phi_3, \phi_4, i_1, i_2, i_3, i_4\}$$

4.1 Algorithm
The algorithm proceeds as follows (a Python sketch of these steps is given after the list):

– Given an object A and image B, render the normal map of object A at the current pose and generate sample locations on the normal map.
– For each sample location on the normal map, collect N and I. Concatenate N and I to form p.
– For n sample locations, we have n high-dimensional points $p_i$, stacked as $P = [p_1, p_2, ..., p_n]$.
– Calculate the covariance of the points [9], $C = (1/n) P_0 P_0^T$, where $P_0$ is the zero-mean version of P.
– Calculate the joint entropy using $H_g(C)$ and the marginal entropies using the method described in [9].
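These steps map directly onto a short computation. A sketch, assuming the per-window samples have already been collected into arrays with one row per sample location; the marginal entropies are obtained by applying the Gaussian entropy of Section 3 to the N and I sub-blocks of the points, which is equivalent to reading them off the corresponding sub-matrices of the joint covariance as in [9].

```python
import numpy as np

def regional_mutual_information(thetas, phis, intensities):
    """RMI between NEGI features and image intensities.

    thetas, phis: (n, k) arrays of spherical normal coordinates,
    one row per sample location, k normals per region window.
    intensities: (n, k) array of corresponding image intensities.
    """
    N = np.hstack([thetas, phis])    # model half of each point p_i
    I = intensities                  # image half of each point p_i
    P = np.hstack([N, I])            # points p_i stacked as rows

    def h(points):                   # Gaussian entropy, as in Section 3
        n, d = points.shape
        p0 = points - points.mean(axis=0)
        sigma = (p0.T @ p0) / n
        _, logdet = np.linalg.slogdet(sigma + 1e-9 * np.eye(d))
        return 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * logdet

    # I(N, I) = H(N) + H(I) - H(N, I)
    return h(N) + h(I) - h(P)
```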
5 Experimental Results

5.1 Alignment Using the NEGI
In the first experiment, we looked at misalignment with respect to rotational offsets along the y-axis. Region mutual information is plotted with varying neighborhood sizes (r = 2, r = 3, r = 4, r = 5) (figure 5). As we consider region mutual information with larger neighborhoods, more spatial information is included and we get a stronger peak at the global optimum. This distinctiveness of the response at the ground truth point will help to reduce ambiguity, as shown in the following detection experiments.
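A sweep of this kind can be scripted directly. The sketch below assumes a hypothetical helper sample_regions() that wraps the normal-map rendering and window sampling of Section 4, and reuses regional_mutual_information() from the previous sketch; model and image stand in for the loaded car model and test image.

```python
import numpy as np

# sample_regions() is an assumed helper wrapping the OpenGL normal-map
# rendering and region-window sampling; model and image are placeholders.
angles = np.arange(0, 360, 5)
scores = []
for angle in angles:
    thetas, phis, intensities = sample_regions(model, image, y_rot=angle, r=4)
    scores.append(regional_mutual_information(thetas, phis, intensities))

best_angle = angles[int(np.argmax(scores))]  # pose estimate = RMI maximum
```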
Fig. 5. Combined plots of RMI as a function of rotational misalignment about the y-axis for neighborhoods of varying sizes (r = 2, 3, 4, 5), together with the original MI. A stronger peak is obtained at the global optimum when more spatial information is brought into the metric, reducing ambiguity when comparing model to image data.
5.2 Detection
For comparing detection performance, we performed a naive search of the pose space. This allows us to obtain the globally optimal pose parameters without having to worry about local optima and convergence failures. We manually aligned a detailed 3D car model to each test image (some of the test images are shown in figure 6), and the ground truth poses for the model were recorded. When plotting the receiver operating characteristic (ROC) curves, these ground truth poses are the true positives. The average of the mutual information values at the
Fig. 6. Test images
Fig. 7. ROC curves for RMI and MI
ground truth poses is used as the detection threshold. ROC curves for the original MI and for RMI with varying neighborhood sizes are shown in figure 7. The plots show that when spatial information is included, there is a gain in detection performance.
6 Conclusion
This paper presents a method to align a 3D geometric model to an image using mutual information (MI) as the similarity measure. While MI has enjoyed a great deal of success in the medical image registration domain, its application to general object detection has been limited, one major reason being its failure to capture longer-range information when comparing model to image data. To address this issue, we propose a region-based method in which ambiguity due to albedo variance is reduced by including spatial information. We defined the Neighborhood Extended Gaussian Images for the case of 3D model to 2D image alignment. Experiments showed that the method works better than the original formulation.

In the future, we plan to include regional edge information in the mutual information calculation, which we believe would make the metric more discriminative. Additionally, we have made some progress in designing an approach that allows this framework to run much more quickly [13]. We would also like to validate the method more extensively on other data sets.
References

1. Viola, P.: Alignment by Maximization of Mutual Information. PhD thesis, Massachusetts Institute of Technology (1995)
2. Pluim, J., Maintz, J., Viergever, M.: Mutual information based registration of medical images: A survey. IEEE Transactions on Medical Imaging 22 (2003) 986–1004
3. Campbell, R., Flynn, P.: A survey of free-form object representation and recognition techniques. Computer Vision and Image Understanding 81 (2001) 166–210
4. Suveg, I., Vosselman, G.: Mutual information based evaluation of 3D building models. In: Proceedings of the International Conference on Pattern Recognition. Volume 3, Quebec City, Canada (2002) 188–197
5. Kollnig, H., Nagel, H.-H.: 3D pose estimation by directly matching polyhedral models to gray value gradients. International Journal of Computer Vision 23 (1997) 283–302
6. Tan, T., Sullivan, G., Baker, K.: Model-based localization and recognition of road vehicles. International Journal of Computer Vision 27 (1998) 5–25
7. Maes, F., Collignon, A., Vandermeulen, D., Marchal, G., Suetens, P.: Multimodality image registration by maximization of mutual information. IEEE Transactions on Medical Imaging 16 (1997) 187–198
8. Cover, T., Thomas, J.: Elements of Information Theory. John Wiley (1991)
9. Russakoff, D., Tomasi, C., Rohlfing, T., Maurer, C.: Image similarity using mutual information of regions. In: Proceedings of the European Conference on Computer Vision, Prague, Czech Republic (2004) 596–607
10. Leventon, M., Wells III, W., Grimson, W.: Multiple view 2D-3D mutual information registration. In: DARPA Image Understanding Workshop (1997) 625–630
11. Shannon, C.: A mathematical theory of communication. The Bell System Technical Journal 27 (1948) 379–423
12. Horn, B.: Extended Gaussian images. Proceedings of the IEEE 72 (1984) 1656–1678
13. Pong, H., Cham, T.: Object detection using a cascade of 3D models. In: Proceedings of the Asian Conference on Computer Vision, Hyderabad, India (2006)