An Adaptive Resolution Voxelization Framework for 3D Ear Recognition
Steven Cadavid[a], Sherin Fathy[b], Jindan Zhou[a], Mohamed Abdel-Mottaleb[a]
[a] Department of Electrical and Computer Engineering, University of Miami
[b] Institute of Graduate Studies and Research, Alexandria University
Abstract
We present a novel voxelization framework for holistic Three-Dimensional (3D) object representation that accounts for distinct surface features. A voxelization of an object is performed by encoding an attribute or set of attributes of the surface region contained within each voxel occupying the space in which the object resides. To our knowledge, the voxel structures employed in previous methods consist of uniformly-sized voxels. The proposed framework, in contrast, generates structures consisting of variable-sized voxels that are adaptively distributed in higher concentration near distinct surface features. The primary advantage of the proposed method over its fixed resolution counterparts is that it yields a significantly more concise feature representation that is demonstrated to achieve superior recognition performance. An evaluation of the method is conducted on a 3D ear recognition task. The ear provides a challenging case study because of its high degree of inter-subject similarity.

1. Introduction
The ear is emerging as an important class of biometrics that offers a number of inherent benefits over conventional biometric mainstays such as the face and fingerprint. These advantages include a stable structure that is relatively unchanged by age and facial expression. Researchers have recently considered the use of voxelized surface representations for holistic surface matching [7]. This approach has been proposed as an alternative to the matching method employed in the Iterative-Closest-Point (ICP) algorithm, which requires the explicit establishment of a dense set of point correspondences between the two surfaces prior to the computation of a similarity score. A voxelized surface is represented by a structured array of volume elements (voxels) in a 3D space, analogous to pixels in a bitmap. Several surface characteristics can be encoded within this representation, such as presence (i.e., whether a point on the surface is contained within a voxel) and density (i.e., the number of points contained within a voxel).

A voxel representation capturing the former of these attributes, presence, is achieved by a process referred to as binary voxelization: a value of '1' is assigned to voxels through which the surface passes, and '0' otherwise. The binary voxelization of two surfaces using a common voxel structure yields two normalized representations of equal dimension that can be efficiently compared using established binary vector similarity measures [6]. To our knowledge, the voxelization techniques reported in the biometrics literature employ a structure consisting of uniformly-sized voxels. Although simple, these approaches fail to account for the discriminative features of the surface during the voxelization process. This results in large, sparse binary representations used for matching (sparse because the majority of the voxels are empty, the 3D model being thin-shelled [3]). To this end, we propose a novel framework that recursively subdivides the space occupied by an object based on proximity to distinct surface features. This approach, in effect, places greater emphasis on distinct surface regions by designating a higher quantity of features to describe these regions in the feature vector. Unlike its fixed resolution counterpart, the resulting structure additionally captures a hierarchical relationship between a child voxel and its larger-sized parent, reminiscent of an octree. The remainder of this paper is organized as follows: Sections 2-6 describe the system approach in detail. Experimental results are reported and compared against the holistic component of the State-Of-the-Art (SOA) presented in [8] in Section 7. Lastly, concluding remarks are provided in Section 8.
2. Constructing the Proximity Function
The proposed voxelization process is driven by a proximity function that captures the nearness of every surface point to a set of distinct local features (keypoints). These distinct local features can be extracted using keypoint detectors such as the Histogram of Oriented Gradients (HOG), the Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), and the Surface Patch Histogram of Indexed Shapes (SPHIS). The features employed in this work
are extracted using the SPHIS method [8] because of its proven performance in 3D ear representation: its repeatability across multiple images of the same subject, its selectivity of distinct features, and its small number of false positive detections (i.e., detections of features in non-ear regions). We use the geodesic distance to capture the proximity of a surface point to a keypoint. Advantages of the geodesic distance over conventional distance metrics such as the Euclidean distance include its invariance to isometric deformations of the surface. Additionally, the geodesic distance captures information about the topology of the surface, given that it measures the distance between two surface points traversing the surface. We commence by constructing, for each keypoint, $p_i$, $i = 1, \ldots, n_p$, a discrete geodesic distance map, $w_i[v]$, defined over a discrete surface (i.e., point cloud), $S = \{v_i\}_{i=1}^{n_v}$, comprised of 3D points, $v_i$, $i = 1, \ldots, n_v$. Our next objective is to fuse the map set, $\{w_i[v]\}_{i=1}^{n_p}$, into a single map, $W[v]$, such that the map is smooth and returns small values in the Region-of-Interest (ROI) and large values in surrounding non-ear regions. To this end, we investigate the efficacy of three popular data fusion techniques, namely the min, max, and sum rules. An experiment is conducted to determine which of these schemes produces the maps most similar to the 'ideal' maps over the 415-subject gallery set described in Section 7. The ideal map is unknown and subject-dependent; however, a reasonable and realizable estimate of the ideal map is a binary map: $W[v] = 0$ for all $v$ contained within the ear region, and $W[v] = 1$ for all $v$ contained within surrounding non-ear regions. These maps are generated by manually segmenting the ear region in each image. The optimal fusion scheme is the one that results in the smallest Mean-Squared-Error (MSE) between these 'ideal' maps and the maps produced by the fusion scheme. The results, provided in Table 1, are reported in terms of the average (and standard deviation in parentheses) of the MSE over the gallery set. We chose to pursue the sum rule, given in (1), because it resulted in the minimum average MSE and produced the smoothest maps:

$$W[v_j] = \sum_{i=1}^{n_p} w_i[v_j] \quad (1)$$

Table 1. MSE between the estimated maps produced by the fusion schemes and the 'ideal' maps.
Min Rule: 0.6299 (0.0476)
Max Rule: 0.4381 (0.0259)
Sum Rule: 0.4351 (0.0310)
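For concreteness, the fusion comparison above can be prototyped in a few lines. The sketch below is a minimal illustration, not the paper's implementation: it assumes the per-keypoint geodesic maps are stacked as rows of a NumPy array and that the manually segmented 'ideal' map is a binary vector; the function and variable names are ours.

```python
import numpy as np

def fusion_mse(geodesic_maps: np.ndarray, ideal_map: np.ndarray) -> dict:
    """MSE of each fused map against the binary 'ideal' map (a sketch)."""
    rules = {
        "min": geodesic_maps.min(axis=0),
        "max": geodesic_maps.max(axis=0),
        "sum": geodesic_maps.sum(axis=0),  # the sum rule of Eq. (1)
    }
    mse = {}
    for name, fused in rules.items():
        # Rescale to [0, 1] so fused distances are comparable to the
        # binary ideal map (0 inside the ear region, 1 outside).
        span = fused.max() - fused.min()
        fused = (fused - fused.min()) / (span if span > 0 else 1.0)
        mse[name] = float(np.mean((fused - ideal_map) ** 2))
    return mse
```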
We then compute the complement of $W$ and normalize it such that the distance map is converted into a proximity map (i.e., one that contains larger values near a keypoint) and $\sum_{i=1}^{n_v} W[v_i] = 1$. Specifically, this proximity map returns larger values over surface regions containing higher densities of keypoints; these values gradually decrease and eventually fall to zero in surface regions not comprising the ROI. An illustration of two sample maps of the map set and the final map, $W$, is provided in Figure 1.

Figure 1. Proximity Map. (left and middle images) The geodesic distance maps derived from two sample keypoints (depicted as white points). (rightmost image) The proximity map (the black points represent the keypoint locations).

We then utilize $W[v]$ to generate a piecewise continuous proximity function, $W(v)$, by triangulating $S$ (in our implementation, the Delaunay triangulation method was used) to form a triangle mesh, $S_T = \{T_i\}_{i=1}^{n_T}$, and defining a linear interpolant over each triangle, $T_i$. Given the three vertices of $T_i$, $v_i = [x_i, y_i, z_i]$, $i = 1, \ldots, 3$, the interpolated proximity value, $W(v^*)$, of a point, $v^*$, lying on the plane of $T_i$ is computed by solving the following linear system for the parameter vector $b = [\alpha, \beta, \gamma]$:

$$b v_i^T = W[v_i], \quad i = 1, \ldots, 3; \qquad W(v^*) = b {v^*}^T \quad (2)$$
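Solving (2) amounts to a small linear solve per triangle. A minimal sketch, assuming the three vertices are stacked as rows of a 3x3 array and the query point lies on the triangle's plane (names are illustrative, not from the paper):

```python
import numpy as np

def interpolate_proximity(tri_vertices: np.ndarray,  # (3, 3): rows are v_1..v_3
                          w_values: np.ndarray,      # (3,): W[v_i] at the vertices
                          query: np.ndarray) -> float:  # (3,): point on the plane
    # Eq. (2) stacks to V b^T = W[v], with the rows of V being the
    # vertices, so b is recovered with one linear solve.
    b = np.linalg.solve(tri_vertices, w_values)
    # W(v*) = b v*^T, i.e., the dot product of the query with b.
    return float(query @ b)
```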
3. Generating the Voxel Structure
We propose an adaptive resolution voxelization framework, reminiscent of an octree, to partition the 3D space enclosing the ear surface. An octree is a tree data structure in which each internal node has exactly eight children. It is often used in the computer graphics domain to perform back-face culling in order to render 3D objects more efficiently. An octree is initiated with a root voxel that is the Minimum Bounding Cube (MBC) of $S$, and the eight children voxels that result from the subdivision of the root voxel. Given a discrete surface, $S$, the voxels of the octree are recursively subdivided until all voxels contain at most one surface point. The octree voxelization scheme presents clear advantages over its fixed resolution counterpart: 1) it enables the use of variable-sized voxels, where smaller voxels are concentrated within the space occupied by $S$, and 2) it captures a hierarchical relationship between smaller and larger voxels, producing a coarse-to-fine representation of $S$.
Figure 2. Sample iterations of the voxelization process.
However, in terms of deployment in a recognition application, the traditional octree framework is limited in two respects: 1) the partitioning scheme is driven simply by the density of the point cloud and does not account for the discriminative features of $S$, and 2) the smallest achievable voxel size is limited by the resolution of the point cloud; when, for instance, a higher resolution point cloud is voxelized to encode presence using a voxel structure derived from a lower resolution point cloud, surface information about the higher resolution point cloud is lost. To this end, we propose a framework that retains the adaptive voxel resolution capabilities and the hierarchical relationship between voxels provided by the octree while introducing a partitioning process that is driven by the discriminative features of the surface and is not limited by the sampling resolution of $S$. The voxel structure, $\Upsilon = \{V_i\}_{i=1}^{n_V}$, generated by the proposed framework is comprised of $n_V$ voxels of varying size, where $n_V$ satisfies $n_V = 8k + 1$ ($k \geq 1$ is predefined by the user). As in an octree, the first nine voxels of $\Upsilon$ are the MBC of $S$ (the root voxel), which is axis-aligned and centered on the centroid of $S$, and the eight children voxels of the root voxel. For each of these voxels, we compute $\Psi[V]$, the area integral of the continuous proximity function, $W(v)$, evaluated over the surface region contained within the given voxel, $V$. The detection of this surface region is explained in the following section. All children voxels with corresponding area integrals greater than zero are assigned to a temporary list. The voxel in the list with the largest corresponding area integral is removed from the list and subdivided into eight additional children voxels. These children voxels are assigned to $\Upsilon$, and those with corresponding area integrals greater than zero are also assigned to the list. This process is repeated until $n_V$ voxels have been assigned to $\Upsilon$. For the sake of clarity, an outline of the proposed partitioning scheme is given in Algorithm 1, and an illustration of the process is provided in Figure 2.
Algorithm 1 Generating the voxel structure
1: Designate the MBC of S as the root voxel and compute Ψ[V]. The MBC should be axis-aligned and centered on the centroid of S.
2: Store the root voxel and its corresponding Ψ[V] in Υ and in a list, Q.
3: while Υ contains less than n_V voxels do
4:   Find the voxel, V*, in Q with the largest corresponding Ψ[V].
5:   Remove V* from Q, subdivide V* into eight children voxels, and compute Ψ[V] for each.
6:   Store the children voxels in Υ.
7:   Store the children voxels with corresponding Ψ[V] > 0 in Q.
8: end while
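A compact prototype of Algorithm 1 can be built around a max-heap keyed on Ψ[V]. The sketch below assumes a hypothetical Voxel type exposing a subdivide() method that returns its eight children, and a callable psi evaluating the area integral; neither interface is defined in the paper.

```python
import heapq

def generate_voxel_structure(root, psi, n_v):
    """Sketch of Algorithm 1: best-first subdivision driven by Ψ[V]."""
    structure = [root]
    # heapq is a min-heap, so Ψ is negated to pop the largest integral
    # first; the counter breaks ties without comparing voxel objects.
    queue = [(-psi(root), 0, root)]
    counter = 1
    while queue and len(structure) < n_v:   # n_v = 8k + 1 by construction
        _, _, voxel = heapq.heappop(queue)
        for child in voxel.subdivide():     # eight children per subdivision
            structure.append(child)
            p = psi(child)
            if p > 0:                       # only occupied voxels are refined
                heapq.heappush(queue, (-p, counter, child))
                counter += 1
    return structure
```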
Note that the aforementioned framework is generalizable to any proximity function defined over $S$; it produces a higher concentration of voxels in surface regions where the energy of $W(v)$ is large, and, because $W(v)$ is continuous, the smallest achievable voxel size is not restricted by the resolution of $S$. Since the proximity function employed in this work returns larger values in surface regions containing higher densities of distinctive features, the voxelization framework assigns a higher concentration of voxels to these surface regions.
4. Triangle-Voxel Intersection
To determine the surface region contained within a voxel, $V$, we first apply the method in [1] to obtain the subset of triangles, $\{T_i\}_{i=1}^{n_L} \subset S_T$, that intersect $V$, where $n_L$ denotes the number of intersecting triangles. Second, we employ a proposed extension of the Sutherland-Hodgman clipping algorithm [4] to the 3D domain to determine the geometric primitive (i.e., convex polygon, line segment, or point) resulting from the intersection between $T_i$ and $V$. For the sake of conciseness, we describe here only the differences between the method presented in [4] and the proposed 3D extension. In Algorithm 2, we adopt the terminology of the Wikipedia article on the Sutherland-Hodgman algorithm (http://en.wikipedia.org/wiki/Sutherland-Hodgman_algorithm) for ease of algorithmic comparison. The proposed extension is based on clipping a 2D projection of $T_i$ to the corresponding 2D projection of $V$. Both $T_i$ and $V$ are iteratively projected onto the orthogonal subspaces $X$-$Y$, $X$-$Z$, and $Y$-$Z$, resulting in either a 2D line segment or a 2D triangle representing $T_i$ and a square representing $V$. After clipping a line segment of the 2D projection of $T_i$ to the clipping boundary, the third coordinate of the clipped endpoint, $i$, lost in the projection is recovered using the appropriate linear equation from the following set via the function computeCoordinate:

$$i_x = f_x + (p_x - f_x)(i_y - f_y)/(p_y - f_y)$$
$$i_y = f_y + (p_y - f_y)(i_x - f_x)/(p_x - f_x) \quad (3)$$
$$i_z = f_z + (p_z - f_z)(i_x - f_x)/(p_x - f_x)$$
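The recovery in (3) is a linear interpolation along the original 3D edge from f to p. A minimal sketch for the X-Y projection case, where the z coordinate is the one lost in projection (the function name and tuple layout are ours, not the paper's):

```python
def compute_coordinate_xy(i2d, f, p):
    """Recover the z coordinate of a clipped endpoint (X-Y projection case).

    i2d: (ix, iy) 2D intersection point; f, p: (x, y, z) edge endpoints,
    named as in Algorithm 2.
    """
    ix, iy = i2d
    fx, fy, fz = f
    px, py, pz = p
    # Parameterize the edge by its x extent, falling back to y when the
    # projected edge is vertical in x.
    if abs(px - fx) > 1e-12:
        t = (ix - fx) / (px - fx)
    else:
        t = (iy - fy) / (py - fy)
    iz = fz + (pz - fz) * t   # the third equation of (3)
    return (ix, iy, iz)
```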
Algorithm 2 A 3D Extension of the Sutherland-Hodgman Clipping Algorithm
1: List3D outputList = subjectPolygon;
2: for Polygon clipPolygon in voxelProjections do
3:   for Edge clipEdge in clipPolygon do
4:     List3D inputList = outputList;
5:     outputList.clear();
6:     Point3D f = inputList.last;
7:     for Point3D p in inputList do
8:       Point2D s = Project2D(f);
9:       Point2D e = Project2D(p);
10:      if e inside clipEdge OR e on clipEdge then
11:        if s outside clipEdge then
12:          Point2D i = Intersection(s, e, clipEdge);
13:          outputList.add(computeCoordinate(i, f, p));
14:        end if
15:        outputList.add(p);
16:      else if s inside clipEdge OR s on clipEdge then
17:        Point2D i = Intersection(s, e, clipEdge);
18:        outputList.add(computeCoordinate(i, f, p));
19:      end if
20:      f = p;
21:    end for
22:  end for
23: end for
5. Computing the Area Integral of the Proximity Function Evaluated over the Surface Region Contained within a Voxel
To compute $\Psi[V]$, the area integral of $W(v)$ evaluated over each geometric primitive, $L_i$, resulting from the intersection of triangle $T_i$ and voxel $V$, must first be determined. First, in the point case, the integral of $W(v)$ at a point $v_1$ is simply taken to be $W(v_1)$ and is denoted by $A_1^i$. Second, a line segment can only result when a triangle intersects a voxel at an edge, and $W(v)$ is linear over an edge of a triangle; therefore, in the line segment case, we use the vertex quadrature rule [2] extended to line segments (the trapezoidal rule), which states that the definite integral of a linear function evaluated over the line segment $\overline{v_1^i v_2^i}$ is $A_1^i = \|v_1^i - v_2^i\| \frac{1}{2} \sum_{k=1}^{2} W(v_k^i)$. Third, in the convex polygon case, we commence by decomposing each polygon, $L_i$, into a set of sub-triangles, $\{F_j^i\}_{j=1}^{n_F}$. To compute the area integral of $W(v)$ evaluated over $F_j^i$, we utilize the vertex quadrature rule extended to triangles, which is given by $A_j^i = |F_j^i| \frac{1}{3} \sum_{k=1}^{3} W(v_k^i)$. Here, $|F_j^i|$ denotes the area of $F_j^i$, which is computed using Heron's formula ($a$, $b$, and $c$ are the lengths of the three sides of $F_j^i$): $|F_j^i| = \frac{1}{4}\sqrt{(a^2 + b^2 + c^2)^2 - 2(a^4 + b^4 + c^4)}$. The area integral of $W(v)$ evaluated over the convex polygon $L_i$ is then simply the sum of all its sub-triangle contributions, $\sum_j A_j^i$. Moreover, $\Psi[V]$ is the sum of all $L_i$ contributions, $\Psi[V] = \sum_{i,j} A_j^i$. A sample case of a triangle mesh intersecting a voxel is illustrated in Figure 3.
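The three quadrature cases can be gathered into a single routine that scores one clipped primitive given its 3D vertices and the interpolated proximity values at those vertices. The fan decomposition into sub-triangles and all names below are our own simplifications, not the paper's implementation:

```python
import numpy as np

def heron_area(a, b, c):
    # Heron's formula in the symmetric form used above:
    # area = 1/4 * sqrt((a^2 + b^2 + c^2)^2 - 2(a^4 + b^4 + c^4))
    s2 = a * a + b * b + c * c
    s4 = a**4 + b**4 + c**4
    return 0.25 * np.sqrt(max(s2 * s2 - 2.0 * s4, 0.0))

def primitive_integral(verts, w):
    """Integral of the linear proximity function over one primitive.

    verts: (n, 3) vertices of the clipped primitive; w: (n,) values W(v_k).
    """
    n = len(verts)
    if n == 1:                       # point case: the value itself
        return w[0]
    if n == 2:                       # line segment case: trapezoidal rule
        length = np.linalg.norm(verts[0] - verts[1])
        return length * 0.5 * (w[0] + w[1])
    total = 0.0                      # convex polygon: fan into sub-triangles
    for j in range(1, n - 1):
        t0, t1, t2 = verts[0], verts[j], verts[j + 1]
        area = heron_area(np.linalg.norm(t0 - t1),
                          np.linalg.norm(t1 - t2),
                          np.linalg.norm(t2 - t0))
        total += area * (w[0] + w[j] + w[j + 1]) / 3.0  # vertex quadrature
    return total
```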
Figure 3. A sample triangle mesh intersecting a voxel.
6. Recognition Process
A voxel structure, $\Upsilon_G$, is generated for each model, $M_G$, enrolled in the gallery set. $M_G$ is also voxelized using its corresponding $\Upsilon_G$. When comparing a probe model, $M_P$, to $M_G$ for recognition, $M_P$ is first aligned to $M_G$ using the method presented in [8], and is subsequently voxelized using $\Upsilon_G$. Voxelizing both $M_P$ and $M_G$ with a common voxel structure yields representations of equal dimension. To compute a match score, the similarity between these representations can be efficiently calculated using dot products. The following subsections describe the voxelization procedure and the metric used to compute a match score between the models.
6.1. Binary Voxelization
A binary voxelization of a model, $B[V]$, encodes the presence of the model's surface within a voxel. A voxel is assigned a value of '1' if a portion of the surface traverses the voxel, and '0' otherwise. As demonstrated in Algorithm 3, we exploit the tree structure of $\Upsilon$ through a recursive procedure to efficiently determine the occupancy of a voxel. The voxelization of $M_G$ is performed on the piecewise continuous version of the discrete surface, $S_T$. A voxel is considered occupied if a collision between a triangle belonging to $S_T$ and the voxel is detected using the method described in Section 4. The collision detection, however, is computationally expensive and is therefore only applied to $M_G$, which is voxelized off-line. As a result, the voxelization of $M_P$ is performed on the discrete surface, $S$. The voxelization of $S$ is efficient because it only amounts to determining whether a point lies within the boundaries of a voxel.

Algorithm 3 Binary Voxelization of a Discrete Surface
1: {Initialize: B[V] = 0}
2: FUNCTION determineOccupancy(S, V)
3: Let s be the subset of points in S contained within V
4: if s is not empty then
5:   B[V] = 1
6:   for each child c of V do
7:     determineOccupancy(s, c)
8:   end for
9: end if
10: return B[V]

It is important to emphasize that by voxelizing two aligned surfaces using a common voxel structure, equal-sized and anatomically-corresponding binary representations are achieved. This framework enables the comparison of two surfaces without the explicit establishment of a dense set of correspondences between them, as is done in ICP.
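A direct transcription of Algorithm 3 might look as follows, assuming a voxel object that exposes its children and a point-containment test; both interfaces are illustrative stand-ins, not defined in the paper:

```python
def determine_occupancy(points, voxel, occupancy):
    """Sketch of Algorithm 3. occupancy is a dict standing in for B[V];
    voxel is assumed to expose .children (empty for leaves) and
    .contains(point) -> bool."""
    subset = [p for p in points if voxel.contains(p)]
    if subset:
        occupancy[voxel] = 1
        # Recurse only into the occupied subtree, and only over the
        # points already known to lie inside the parent voxel.
        for child in voxel.children:
            determine_occupancy(subset, child, occupancy)
    return occupancy
```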
6.2. Matching Procedure
We investigate the use of eight well-known binary vector distance measures to compute a match score between the binary representations of $M_P$ and $M_G$. These measures, described in detail in [6], are the Jaccard-Needham (JN), Dice, Correlation (CORR), Yule, Russell-Rao (RR), Sokal-Michener (SM), Rogers-Tanimoto (RT), and Kulzinsky (KUL) measures.
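All of these measures can be expressed in terms of the four agreement counts s11, s10, s01, and s00, each of which reduces to a dot product between the binary vectors or their complements, which is why the comparison is efficient. A sketch for two of the eight measures, following the standard definitions surveyed in [6] (guards against degenerate zero denominators are omitted):

```python
import numpy as np

def match_scores(x: np.ndarray, y: np.ndarray) -> dict:
    """Match scores between two binary voxel vectors via dot products."""
    x = x.astype(float)
    y = y.astype(float)
    s11 = x @ y                # both '1'
    s10 = x @ (1 - y)          # '1' in x only
    s01 = (1 - x) @ y          # '1' in y only
    s00 = (1 - x) @ (1 - y)    # both '0'
    jaccard = s11 / (s11 + s10 + s01)
    corr = (s11 * s00 - s10 * s01) / np.sqrt(
        (s11 + s10) * (s01 + s00) * (s11 + s01) * (s10 + s00))
    return {"JN": jaccard, "CORR": corr}
```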
7. Experimental Results
We evaluate the efficacy of the proposed approach on the publicly-available University of Notre Dame (UND) 3D ear biometric dataset, collection G [5]. The dataset is
comprised of 1801 profile facial range images belonging to 415 subjects. For experimentation, we employ a probe set and a gallery set, each randomly populated with one image per subject. The ear region is then segmented using the method presented in [7]. The SOA identification and verification rates on this dataset were achieved by the method presented in [8]. The aim of this section is to demonstrate that the proposed scheme produces a feature representation that outperforms the representation generated by the holistic component of the SOA (herein referred to as the SOA) in identification and verification tasks, at considerably lower dimensionality.
7.1. Identification Scenario
For the identification scenario, we present our results in terms of the rank-one recognition rate versus the number of voxels comprising the voxel structure. The results of the SOA and the proposed method are presented independently in Figures 4(a) and 4(b), respectively. Firstly, for the SOA, only 3 of the 8 measures surpass a recognition rate of 93%, while all of the measures surpass 93% for the proposed method. This indicates that the results of the proposed method are more stable and robust than those of the SOA. Secondly, the peak performance achieved by the proposed method (97.6%, achieved with a voxel structure consisting of 4,801 voxels using the CORR measure) is greater than that of the SOA (96.6%, achieved with a structure comprised of voxels of dimension 1.6mm, corresponding to an average structure of more than 64,000 voxels, using the CORR measure). Note that there is more than an order of magnitude difference between the numbers of voxels corresponding to the peak performances of the two compared methodologies. Lastly, Figure 4(c) provides a direct performance comparison between the two methods based on the top performing distance measure, CORR. Since, at a given resolution of the SOA, the number of voxels produced depends on the gallery-probe model pair, we computed the average voxel count to generate this figure (and Figure 4(f)). It is evident from Figure 4(c) that the performance of the proposed method is more stable than, and exceeds, that of the SOA, particularly at lower voxel resolutions.
7.2. Verification Scenario
For the verification scenario, we present our results in terms of the Equal Error Rate (EER) versus the number of voxels comprising the voxel structure. The results of the SOA and the proposed method are presented independently in Figures 4(d) and 4(e), respectively. Firstly, for the SOA, only 4 of the 8 measures achieve an EER of less than 4% at any given resolution, while all of the measures achieve an EER of less than 4% for the proposed method. As in the identification scenario, this demonstrates that the results of the proposed method are more stable and robust than those of the SOA.
Figure 4. Identification (top row) and verification (bottom row) performance comparisons between the proposed method and the SOA. (a) Fixed resolution voxelization: rank-one recognition rate vs. voxel size (mm). (b) Adaptive resolution voxelization: rank-one recognition rate vs. number of voxels. (c) Fixed vs. adaptive resolution voxelization identification performance (CORR measure). (d) Fixed resolution voxelization: EER vs. voxel size (mm). (e) Adaptive resolution voxelization: EER vs. number of voxels. (f) Fixed vs. adaptive resolution voxelization verification performance (CORR measure). Legends in (a), (b), (d), and (e) cover the JN, Dice, CORR, Yule, RR, SM, RT, and KUL measures.
Secondly, the peak performance achieved by the proposed method (2.1%, achieved with a voxel structure consisting of 12,001 voxels using the CORR measure) outperforms that of the SOA (2.4%, achieved with structures consisting of voxels of dimensions 1.2mm and 1.4mm, corresponding to average structures comprised of more than 167,000 and 103,000 voxels, respectively, using the CORR measure). Lastly, Figure 4(f) provides a direct performance comparison between the two methods based on the top performing distance measure, which is again the CORR measure. As in the identification scenario, this figure demonstrates that the performance of the proposed method is more stable than, and exceeds, that of the SOA, particularly at lower voxel resolutions.
8. Conclusion
We have presented an adaptive resolution voxelization framework that produces a concise and discriminative feature vector to represent an object by adaptively assigning a higher concentration of voxels near distinct surface features. Additional strengths of the proposed methodology are that 1) the voxelization procedure is driven by a proximity function that can be generalized to any function defined over the surface, and 2) the resolution of the resulting voxel structure is not limited by the inherent resolution of the input data. Lastly, the experimental results have demonstrated a clear performance advantage, as well as a significantly reduced feature dimensionality, over the SOA.
References
[1] T. Akenine-Möller. Fast 3D triangle-box overlap testing. Journal of Graphics, GPU, and Game Tools, 6(1):29-33, 2001.
[2] A. Donev. Numerical Methods I: Numerical Integration. Lecture notes, December 2010.
[3] F. S. Nooruddin and G. Turk. Simplification and repair of polygonal models using volumetric techniques. IEEE Transactions on Visualization and Computer Graphics, 9:191-205, 2003.
[4] I. E. Sutherland and G. W. Hodgman. Reentrant polygon clipping. Communications of the ACM, 17:32-42, January 1974.
[5] P. Yan and K. W. Bowyer. Biometric recognition using 3D ear shape. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29:1297-1308, 2007.
[6] B. Zhang and S. N. Srihari. Binary vector dissimilarity measures for handwriting identification. In Proceedings of SPIE, volume 5010, pages 28-38, 2003.
[7] J. Zhou, S. Cadavid, and M. Abdel-Mottaleb. Histograms of categorized shapes for 3D ear detection. In IEEE International Conference on Biometrics: Theory, Applications and Systems, September 2010.
[8] J. Zhou, S. Cadavid, and M. Abdel-Mottaleb. A computationally efficient approach to 3D object recognition employing local and holistic features. In IEEE Conference on Computer Vision and Pattern Recognition, Workshop on Biometrics, June 2011.