Proceedings of 2010 IEEE 17th International Conference on Image Processing

September 26-29, 2010, Hong Kong

ROBUST FEATURE DETECTION BASED ON LOCAL VARIATION FOR IMAGE RETRIEVAL

Shao-Hu Peng, Khairul Muzzammil, and Deok-Hwan Kim

Dept. of Electronic Engineering, Inha University, Incheon, Korea

Email: {pengshaohu, mellore}@iesl.inha.ac.kr, [email protected]

ABSTRACT

This paper proposes an interest point detector based on the wavelet transform, together with a descriptor based on image variation and the log-polar coordinate. Taking advantage of the wavelet properties, the proposed method detects a small number of interest points that are distinctive and robust to illumination changes, scale changes and affine transforms. A new descriptor based on image variation and the log-polar coordinate is proposed to represent the local shape feature of the image without edge detection. Since the proposed descriptor groups the image variation into several levels and separates the local image region into grids based on the log-polar coordinate, it overcomes the problem of textured scenes and ill-defined edges. Experimental results show that the proposed method achieves better matching accuracy and faster matching speed than SIFT, PCA-SIFT and GLOH while using fewer interest points.

Index Terms— image retrieval, interest point, wavelet transform, detector, descriptor

1. INTRODUCTION

Methods using local image features have been demonstrated to be useful for content-based image retrieval (CBIR) [1] [2] in recent years. One of the most popular methods is the scale invariant feature transform (SIFT) [3], which detects keypoints in the image and computes a descriptor for each keypoint to represent the characteristics of the local image feature. To detect the keypoints, the blob detector DoG [3] is adopted. Descriptors based on gradient magnitude and orientation, such as the SIFT descriptor [3], the PCA-SIFT descriptor [4] and the GLOH descriptor [5], are then computed in the local image region surrounding the keypoint. However, the large number of keypoints detected by the DoG detector, ranging from hundreds to thousands, poses a limitation for real-time applications. Furthermore, the meaningful parts of an image are not always blob-like regions. The SIFT descriptor is effective for image matching, but its high dimensionality (128 dimensions) degrades its efficiency. The PCA-SIFT descriptor [4] was proposed to reduce the dimensionality and improve the matching accuracy. However, experiments showed that its matching accuracy was low in some situations [5]. Recently, wavelet-based salient point detectors [6] [7] have been used as tools for multi-resolution analysis. These detectors are based on the wavelet transform and capture global variations as well as local ones. The salient point detection is based on the summation of the absolute wavelet coefficients, so it can overcome the problem of focusing only on blob regions. However, these methods fail to calculate the scale of the salient point.

This paper proposes a novel approach based on the wavelet transform to detect a small number of interest points representing the regions with the locally maximum variation. Moreover, a new descriptor based on image variation levels is proposed to represent the local image feature. To detect the interest points, the wavelet transform is used to decompose the image into sub-images at multiple scales. A high frequency image is generated from the high frequency sub-images for each scale. Average box filters of various sizes are then applied to the high frequency image to evaluate the local variation. Finally, the interest points are detected by finding the regions whose variation is the highest among their neighboring regions. To achieve robustness to image rotation, an orientation is assigned to each interest point based on the local maximum variation.

To capture the shape information of the local region and overcome the problem of textured scenes or ill-defined edges, a new descriptor based on variation levels and the log-polar coordinate is proposed in this paper. The variation of the image is first grouped into several levels. For each level, a feature histogram is generated with respect to the pixel position and variation. Finally, the histograms of all levels are concatenated to form the new descriptor with 120 dimensions.

978-1-4244-7993-1/10/$26.00 ©2010 IEEE
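The decomposition and fusion steps described above can be sketched as follows: a single-level Haar transform, followed by a point-by-point fusion of the three high frequency sub-images. This is a minimal sketch in Python/NumPy under our own assumptions (even-sized grayscale array, absolute coefficients summed, normalization to [0, 255]); the function names are ours, not the paper's.

```python
import numpy as np

def haar_decompose(img):
    """One level of the 2-D Haar wavelet transform.

    Returns the approximation (LL) and the three high frequency
    sub-images (LH, HL, HH) at half resolution. Assumes the input
    has even height and width.
    """
    a = img[0::2, 0::2].astype(float)   # top-left of each 2x2 block
    b = img[0::2, 1::2].astype(float)   # top-right
    c = img[1::2, 0::2].astype(float)   # bottom-left
    d = img[1::2, 1::2].astype(float)   # bottom-right
    ll = (a + b + c + d) / 4.0          # approximation
    lh = (a + b - c - d) / 4.0          # horizontal variation
    hl = (a - b + c - d) / 4.0          # vertical variation
    hh = (a - b - c + d) / 4.0          # diagonal variation
    return ll, lh, hl, hh

def high_frequency_image(img):
    """Fuse LH, HL and HH point by point and normalize to [0, 255].

    Summing absolute coefficients is our choice of fusion; the paper
    only states that the pixel values are summed point by point.
    """
    _, lh, hl, hh = haar_decompose(img)
    fused = np.abs(lh) + np.abs(hl) + np.abs(hh)
    return 255.0 * fused / (fused.max() + 1e-12)
```

Repeating `haar_decompose` on the returned LL sub-image yields the image pyramid used for multi-resolution analysis.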
The contributions of this paper are as follows:
- A smaller number of interest points with more distinctive information are detected by the proposed detector.
- An effective descriptor with lower dimensionality is formed to achieve higher image matching accuracy and better efficiency than traditional descriptors.


2. THE PROPOSED DETECTOR


The proposed detector aims to detect meaningful points that are distinctive and invariant to image scale, illumination changes and affine transforms. Therefore, wavelet coefficients, which denote the variation of the image, are employed. Fig.1 shows the main process of detecting the interest points in the proposed approach.

The original input image is first decomposed into four sub-images by the Haar wavelet transform. The sub-image LL denotes the approximation of the image, and the sub-images LH, HL, and HH denote the variation in the horizontal, vertical and diagonal directions, respectively. The pixels of the LH, HL, and HH images are high frequency coefficients which describe the local variation of adjacent pixels of the original image. Therefore, the three images are fused to form a high frequency image; the fusion is a simple process of summing up the pixel values of the images point by point. The high frequency image is then normalized.

Using the integral image technique [8] and the average box filter, the high frequency image is transformed into an integral image for fast computation. To evaluate the variation of the local regions of the original image at multiple scales, the average box filter with increasing sizes is applied to the integral image to construct an octave (Fig.1). Applying the box filter to the integral image yields a smoothed image, in which the value of each point represents the variation information of the region surrounding the point, the size of the region being the same as that of the filter. By repeating the Haar wavelet transform on the LL image, an image pyramid can be constructed for multi-resolution analysis. Finally, each point of the smoothed image whose value is greater than a threshold is compared with its neighbors in the same octave; if it is the maximum among its neighbors, it is determined to be an interest point.
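The smoothing and local-maximum steps above can be sketched as follows. This is our own minimal Python/NumPy version of the integral image, the average box filter, and the neighbor comparison; the octave and image-pyramid bookkeeping of the full detector is omitted, and all helper names are ours.

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[y, x] = sum of img[0..y, 0..x]."""
    return np.cumsum(np.cumsum(np.asarray(img, dtype=float), axis=0), axis=1)

def box_filter(img, size):
    """Average box filter computed from the integral image, so the
    cost per output pixel is constant regardless of `size`."""
    pad = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    pad[1:, 1:] = integral_image(img)
    s = size
    # Window sums via four corner lookups, then divide by the area.
    return (pad[s:, s:] - pad[:-s, s:] - pad[s:, :-s] + pad[:-s, :-s]) / (s * s)

def local_maxima(smoothed, threshold):
    """Interior points whose value exceeds `threshold` and is strictly
    greater than all 8 neighbours (the comparison within the octave
    is reduced here to a single smoothed image for illustration)."""
    pts = []
    for y in range(1, smoothed.shape[0] - 1):
        for x in range(1, smoothed.shape[1] - 1):
            v = smoothed[y, x]
            nb = smoothed[y-1:y+2, x-1:x+2].copy()
            nb[1, 1] = -np.inf          # exclude the point itself
            if v > threshold and v > nb.max():
                pts.append((y, x))
    return pts
```

Because the integral image is computed once, each doubling of the box filter size in an octave costs no extra work per pixel, which is what makes the multi-scale evaluation cheap.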

Figure 1: The process of detecting the interest points

3. ORIENTATION ASSIGNMENT

To achieve invariance to image rotation, an orientation is assigned to each interest point. A circular region with radius R is centered at the interest point in the high frequency image. The circular region is then separated into 12 sectors based on the log-polar coordinate, yielding 12 directions. For each direction, the sum of the variation in the sector is calculated. Finally, the direction with the maximum variation is taken as the orientation of the interest point. Fig.2 shows an example of the orientation.

Figure 2: An example of the orientation

4. THE PROPOSED DESCRIPTOR

Calculating an effective descriptor to represent the distinctive information of a local region is very important not only for matching accuracy but also for matching efficiency. As described earlier, the high frequency image represents the variation of the original image: the higher the variation, the stronger the transition between neighboring pixels. As a result, the high frequency image also conveys the shape information of the image. Therefore, this paper proposes a novel descriptor based on the image variation levels to extract the shape information from the local region around the interest point. The variation is grouped into several levels and the shape information is then extracted based on the variation levels and the log-polar coordinate. Fig.3 illustrates the distribution of the image variation, calculated from 27,387 regions around the interest points in 200 different images. Note that low variations (less than 21) were discarded in this distribution.

Figure 3: Distribution of the image variation

As shown in Fig.3, the low variations have more points than the high variations. To make the descriptor distinctive, the variation is grouped into several levels such that the number of points in each level is uniform. Let F(x) be a function corresponding to the distribution shown in Fig.3. Suppose the variation is grouped into L levels; the average number of points for each level can then be calculated as follows:

A = \left[ \int_{21}^{255} F(x)\, dx \right] / L \qquad (1)

Let the range of level i be [G_i, G_{i+1}); we can obtain the range of each level by solving the following equation:

\int_{G_i}^{G_{i+1}} F(x)\, dx = A, \quad (G_i \ge 21,\ G_{i+1} \le 255,\ G_i < G_{i+1},\ 0 \le i < L) \qquad (2)
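In practice, boundaries G_i satisfying Eq. (2) can be obtained as equal-mass quantiles of the empirical variation distribution, since each level must hold the same number of points. The sketch below is our own formulation of that step; `variations` stands for the pooled per-pixel variation samples, and the function names are hypothetical.

```python
import numpy as np

def level_boundaries(variations, num_levels=7, low=21, high=255):
    """Boundaries G_0..G_L such that each level [G_i, G_{i+1}) holds
    roughly the same number of variation samples, as required by Eq. (2).

    Samples below `low` are discarded, matching the distribution in
    Fig.3 where variations less than 21 were dropped.
    """
    v = np.asarray(variations, dtype=float)
    v = v[(v >= low) & (v <= high)]
    qs = np.linspace(0.0, 1.0, num_levels + 1)
    bounds = np.quantile(v, qs)           # equal-mass split points
    bounds[0], bounds[-1] = low, high     # pin the outer limits
    return bounds

def level_of(value, bounds):
    """Index i with bounds[i] <= value < bounds[i+1]."""
    return int(np.clip(np.searchsorted(bounds, value, side='right') - 1,
                       0, len(bounds) - 2))
```

With L = 7 levels this yields the 8 boundaries that partition [21, 255] into the seven bands shown in Fig.4.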

Fig.4 shows a local region around the interest point when the variation is grouped into 7 levels.

Figure 4: Variation levels in a local region

After grouping the variation into L levels, a circular region centered at the interest point is used to calculate the descriptor. The circular region is separated into grids based on the log-polar coordinate. As shown in Fig.5 (a), the circular region is separated into 3 bins in the radial direction and 8 bins in the angular direction. Similar to GLOH, the central bin is not divided in the angular direction. The values of the radii are set adaptively so that each grid in the region has equal area. The circular region is therefore separated into 17 grids. For each variation level, a histogram based on the variation and point position is generated. As shown in Fig.5 (b), the y-axis denotes the accumulated variation and the x-axis denotes the position of the point. Note that the position of the point is rotated to the orientation of the interest point to achieve robustness to image rotation. For each level, we construct a histogram according to the variations and positions of the points, and the histograms of all levels are concatenated to form a feature vector. Finally, the variation of the interest point itself is appended to the feature vector. Seven levels result in a total dimensionality of 120 (7*17+1=120) for the proposed descriptor.
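Under our reading of Fig.5, the descriptor computation can be sketched as follows. The helper names are ours, and the equal-area radii r1 = R/√17 and r2 = 3R/√17 are our derivation from the requirement that all 17 grids (1 central + 2 rings of 8 sectors) have equal area.

```python
import numpy as np

def grid_index(dx, dy, r_max, orientation):
    """Log-polar grid index (0..16): one undivided central bin plus
    two rings of 8 angular sectors. Equal grid areas give the radius
    ratios r1^2 : r2^2 : r_max^2 = 1 : 9 : 17 (our derivation)."""
    r = np.hypot(dx, dy)
    if r >= r_max:
        return None                         # outside the region
    r1 = r_max / np.sqrt(17.0)
    r2 = 3.0 * r_max / np.sqrt(17.0)
    if r < r1:
        return 0                            # central bin
    # Rotate to the interest point orientation for rotation invariance.
    ang = (np.arctan2(dy, dx) - orientation) % (2 * np.pi)
    sector = int(ang / (2 * np.pi / 8)) % 8
    ring = 0 if r < r2 else 1
    return 1 + ring * 8 + sector

def describe(hf, cy, cx, r_max, orientation, bounds, point_var):
    """120-D descriptor: per variation level, accumulate variation
    into the 17 log-polar grids; append the point's own variation."""
    num_levels = len(bounds) - 1
    hist = np.zeros((num_levels, 17))
    for y in range(max(0, cy - r_max), min(hf.shape[0], cy + r_max + 1)):
        for x in range(max(0, cx - r_max), min(hf.shape[1], cx + r_max + 1)):
            v = hf[y, x]
            if v < bounds[0]:
                continue                    # low variations discarded
            g = grid_index(x - cx, y - cy, r_max, orientation)
            if g is None:
                continue
            lvl = min(int(np.searchsorted(bounds, v, side='right')) - 1,
                      num_levels - 1)
            hist[lvl, g] += v               # accumulate variation
    return np.concatenate([hist.ravel(), [point_var]])  # 7*17+1 = 120
```

Concatenating the seven 17-bin level histograms and appending the interest point variation reproduces the 120-dimensional layout described above.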

Figure 5: Calculation of the proposed descriptor

5. EXPERIMENTAL RESULTS

The performance of the proposed approach was evaluated by comparing its matching accuracy, number of interest points, and matching time with those of SIFT, PCA-SIFT and GLOH. All the experiments were run on a dual-core 2.4 GHz machine with 2 GB of main memory running Windows XP.

5.1. Image datasets

The ALOI dataset [9] was selected to evaluate the performance of the proposed approach. Eighty objects (three images per object) from the stereo images were randomly selected to form the first image set. Forty-eight objects (five images per object) from the illumination-changed images were randomly selected to form the second image set. Forty-eight objects (five images per object) from the view-angle-changed images were randomly selected to form the third image set. Finally, twenty objects (five images per object) from the 3D generic object categorization dataset were randomly selected to form the fourth image set. These images were captured with different viewing angles, viewing heights and viewing distances. Fig.6 shows some image samples.

Figure 6: Image samples of the datasets

5.2. Matching accuracy

For the proposed detector, the initial size of the average box filter was set to 3 and the size was increased by a factor of 2 in each octave. The number of images in each octave was set to 5. The size of the local region used to calculate the descriptor was determined as follows:

R = 2 \times S \times 4 \qquad (3)

where R is the maximum radius of the local region and S is the size of the box filter that the interest point belongs to. Note that all of these parameters were determined by experiments that gave the best matching accuracy.

To obtain the matching accuracy, the distance between two points was calculated as the Euclidean distance [10]. For a single point P_i in image I, we calculated the distances between P_i and all the points in the other image J. Note that if the variation difference between a point in image J and P_i is greater than the fixed threshold 10, we consider the two points unmatched. If the closest distance was smaller than the distance threshold (182 for the proposed method, 210 for SIFT, 131 for PCA-SIFT, and 204 for GLOH), we determined that the image pair I and J was matched. The distance thresholds were chosen to give the best matching results. One of the images was selected and used to match against the others. Finally, we calculated the matching accuracy as S. H. Peng et al. did [10].
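The matching rule described above might be sketched as follows. This reflects our interpretation (a pair is declared matched as soon as any query point's nearest, variation-compatible neighbor falls under the distance threshold); the function and parameter names are ours.

```python
import numpy as np

def match_images(desc_i, var_i, desc_j, var_j,
                 dist_thresh=182.0, var_thresh=10.0):
    """Decide whether images I and J match.

    For each descriptor in I, find the nearest descriptor in J by
    Euclidean distance, skipping candidates whose interest point
    variation differs from the query's by more than `var_thresh`
    (the fixed threshold 10 above). The pair is declared matched if
    any closest distance falls below `dist_thresh` (182 is the tuned
    value reported for the proposed method).
    """
    for d_i, v_i in zip(desc_i, var_i):
        best = np.inf
        for d_j, v_j in zip(desc_j, var_j):
            if abs(v_i - v_j) > var_thresh:
                continue                      # variation pre-filter
            best = min(best, float(np.linalg.norm(d_i - d_j)))
        if best < dist_thresh:
            return True
    return False
```

The variation pre-filter is what lets the low-dimensional descriptor skip most candidate comparisons, which is consistent with the faster matching times reported below.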

Figs. 7, 8, and 9 show the experimental results in terms of the average number of points, the matching accuracy and the average matching time.

Figure 7: The number of average points

Figure 8: The matching accuracy

Figure 9: The average matching time

As shown in Fig.7, the average numbers of points for the proposed detector were reduced to 31%, 32%, 12% and 35% of those of the DoG detector on the stereo, illumination, view angle and 3D object datasets, respectively. Notwithstanding, the matching accuracies of the proposed method on the four datasets were higher than those of SIFT, PCA-SIFT and GLOH. Its average matching accuracies for the four datasets were about 8.34%, 8.22%, and 6.95% higher than those of SIFT, PCA-SIFT and GLOH, respectively. Since the proposed method reduced the number of interest points, its average matching time was about 55% of those of SIFT and GLOH. Because PCA-SIFT reduces the descriptor dimension to 64, its average matching time was similar to that of the proposed method.

6. CONCLUSION

A new detector based on the wavelet transform and a novel descriptor based on image variation and the log-polar coordinate are proposed in this paper. Taking advantage of the wavelet properties, the proposed detector detects a small number of distinctive interest points by finding the regions with the locally maximum variation. Owing to the use of the image variation, the proposed descriptor extracts local shape features of the image with lower dimensionality than that of SIFT. Therefore, the proposed method improves the matching accuracy and matching speed using a smaller number of interest points and a low dimensional descriptor.

7. ACKNOWLEDGEMENT

This work was sponsored by the ETRI System Semiconductor Industry Promotion Center, Human Resource Development Project for SoC Convergence.


8. REFERENCES

[1] D. H. Kim, J. W. Song, J. H. Lee, B. G. Choi, "Support Vector Machine Learning for Region-Based Image Retrieval with Relevance Feedback", ETRI Journal, Vol.29, pp.700-702, 2007.
[2] W. T. Wong, F. Y. Shih, J. Liu, "Shape-based Image Retrieval Using Support Vector Machines, Fourier Descriptors and Self-organizing Maps", Information Sciences, pp.1878-1891, 2007.
[3] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision, pp.91-110, 2004.
[4] Y. Ke, R. Sukthankar, "PCA-SIFT: A More Distinctive Representation for Local Image Descriptors", IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.506-513, 2004.
[5] K. Mikolajczyk, C. Schmid, "A Performance Evaluation of Local Descriptors", IEEE Transactions on Pattern Analysis and Machine Intelligence, pp.1615-1630, 2005.
[6] E. Loupias, N. Sebe, S. Bres, J. M. Jolion, "Wavelet-based Salient Points for Image Retrieval", International Conference on Image Processing, pp.518-521, 1999.
[7] D. W. Lin, S. H. Yang, "Wavelet-Based Salient Region Extraction", Advances in Multimedia Information Processing, pp.389-392, 2007.
[8] H. Bay, A. Ess, T. Tuytelaars, L. V. Gool, "Speeded-Up Robust Features", Computer Vision and Image Understanding, Vol.110, pp.346-359, 2008.
[9] J. M. Geusebroek, G. J. Burghouts, A. W. M. Smeulders, "The Amsterdam Library of Object Images", International Journal of Computer Vision, 61(1), pp.103-112, 2005.
[10] S. H. Peng, D. H. Kim, S. L. Lee, C. W. Chung, "Pruning and Weighting of Keypoints Using the HSI Color Space for Image Recognition", Third International Conference on Convergence and Hybrid Information Technology, Vol.1, pp.346-351, 2008.