RELATIVE ORIENTATION AND SCALE FOR IMPROVED FEATURE MATCHING

Steven Mills
Department of Computer Science, University of Otago, Dunedin, New Zealand

ABSTRACT

Despite recent attention paid to feature detection and description methods, the basic criterion for feature matching has remained largely unchanged. Current techniques typically rely on the feature description vector extracted by SIFT or similar descriptors. Many feature detectors, however, also estimate the orientation and scale of features, which are valuable guides to correspondence. This paper outlines a technique for exploiting relative orientation and scale to improve feature matching performance. It is shown that these cues can significantly improve matching performance, as measured by the percentage of inliers across a range of different image transforms, and by object recognition rates.

Index Terms— Image matching, feature correspondence, object recognition

1. INTRODUCTION

Finding reliable feature correspondences is a core problem in many vision applications. A truly reliable matching of feature points between images would greatly simplify tasks such as 3D scene reconstruction [1], object recognition [2], and robot navigation [3]. In many of these cases even a few incorrect matches (or outliers) can lead to significant errors in later processing.

Much of the recent research in this area aims to create descriptors that lead to more reliable feature correspondences. The canonical example of this class of descriptors is SIFT [4], and a significant amount of research has been devoted to similar approaches [5, 6, 7, 8, 9, 10, 11]. Most of these approaches are based on the use of scale and rotation ‘invariant’ feature detectors and descriptors. These methods are often not truly invariant to rotation or scale; rather, they estimate the orientation and scale of a feature. A descriptor is then computed from the image data in a local region, accounting for the estimated orientation and scale. These descriptors are then compared in order to determine candidate matches between a pair of images.

While there has been some work on the matching step, such as the use of k-d trees to accelerate the matching computation [12], it has received relatively little attention. In particular, the descriptor vector remains the main criterion for feature matching. In many cases, however, the scale and orientation information can be used to guide this process.

In most applications, the filtered matches are used in a RANSAC process [13] to remove outliers, followed by least-squares optimisation. In both cases it is desirable to remove as many outliers as possible, even if this means that some inliers are excluded. As a result, the percentage of inlier matches is used as a performance metric. Matching performance is evaluated on the data sets (which include ground truth homographies) from Mikolajczyk et al.’s evaluations of feature detectors [14] and descriptors [6]. Two images of the ‘Graffiti’ set (Fig. 1) are used to illustrate the method.

Fig. 1. Two images from the ‘Graffiti’ sequence

SIFT features are used in the experiments presented here. The method, however, does not depend on the descriptor chosen but on the characteristics of the detector. Any detector which estimates the scale and/or orientation of the image features can benefit from the proposed approach.

2. FEATURE MATCHING

Matching is typically based on the Euclidean distance between feature descriptions, but similar-looking features in a scene often lead to ambiguous matches. While it is possible to reason about this ambiguity [15], Lowe [4] suggests that, in order to be accepted, the distance to the best match must be less than 80% of the distance to the second-best match. This heuristic acts as a filter which removes ambiguous matches and so reduces the proportion of outliers in the final set of correspondences. A minimal sketch of this standard detect-describe-match pipeline is given below.
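The sketch assumes OpenCV’s Python bindings; the file names are placeholders, and the FLANN parameters are typical values rather than those used in the experiments reported here.

    import cv2

    # Load the pair of images to be matched (file names are placeholders).
    img1 = cv2.imread('graffiti1.png', cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread('graffiti2.png', cv2.IMREAD_GRAYSCALE)

    # Each detected keypoint carries an estimated scale (kp.size) and
    # orientation in degrees (kp.angle) alongside its SIFT descriptor.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Approximate nearest-neighbour search with a k-d tree (FLANN),
    # requesting the two best candidates for each query descriptor.
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    knn_pairs = flann.knnMatch(des1, des2, k=2)

    # Lowe's heuristic: accept a match only if the best distance is
    # below 80% of the second-best distance.
    matches = [p[0] for p in knn_pairs
               if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]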

The methods presented here to improve feature matching take a similar approach: an initial set of matches is estimated and then filtered according to some heuristic in order to remove potential outliers. Two new heuristics are proposed, one based on relative orientation and one based on relative scale.

Similar ideas have been explored in the context of aerial imaging and remote sensing. Yi et al. [16], for example, reject SIFT matches between multi-spectral images where the absolute scale differs by some threshold based on a histogram analysis, and a similar approach is applied to SURF features by Teke and Temizel [17]. Li et al. [18] propose a method which uses both rotation and scale information. They identify the dominant relative orientation and scale using histogram analysis and then compute a modified distance function

    d′ = (1 + ε_s)(1 + ε_r) d,        (1)

where d is the Euclidean distance between two SIFT descriptors, ε_s is a penalty for deviation from the dominant scale, and ε_r penalises deviations from the dominant relative orientation.
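As a literal reading of Equation 1, the penalised distance might be computed as follows; this is a hypothetical sketch, with the penalty terms left as inputs.

    def modified_distance(d, eps_s, eps_r):
        # Li et al.'s modified distance (Equation 1): the descriptor
        # distance d is inflated by the scale penalty eps_s and the
        # orientation penalty eps_r before matching decisions are made.
        return (1.0 + eps_s) * (1.0 + eps_r) * d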

In this paper more general image transformations are considered, and both rotation and scale are investigated in concert with Lowe’s second-match heuristic. Unlike Li et al., the proposed approach maintains a separation between the different sources of information (relative orientation, scale, and the SIFT descriptor distance).

Using OpenCV’s implementation of SIFT feature detection and description, followed by k-d tree based approximate nearest neighbour matching, 2749 feature matches are found between the images shown in Fig. 1. Using the ground truth homography between the images, 889 (32.4%) of these are inliers at a threshold of 2 pixels. Lowe’s heuristic, rejecting matches where the best match is more than 80% of the value of the second-best match, reduces this to 1133 matches with 842 inliers (74.3%). Li et al.’s modification of Lowe’s heuristic retains 849 matches with 661 inliers (77.9%).

2.1. Rotation Aware Matching

Rotation aware matching is based on the observation that in most cases the image-plane rotation of the features is reasonably constant across the image. This information is exploited in the matching process by estimating the orientation difference between each of the candidate matches. The dominant difference is found, and matches which differ significantly from this estimate are discarded. This can be implemented using a histogram of orientation changes, with each bin representing a 5° increment. The dominant difference is identified as the largest bin, and this bin, along with its two neighbours, is retained; a sketch of this filter is given below.

The histograms for the first two images of the ‘Graffiti’ data set are shown in Fig. 2. There is a clear peak in the histogram of all feature matches’ relative orientations. All of the inlier features determined from the known homography between the images lie within this peak. The outliers, by contrast, show a more uniform distribution. Applying this method to the first two images from the ‘Graffiti’ data set retains 1099 features, of which 810 (73.7%) are inliers.
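The following sketch, continuing from the pipeline above, is one way such a filter might be implemented; the function name is illustrative, and OpenCV keypoint orientations are assumed to be in degrees.

    import numpy as np

    def filter_by_relative_orientation(matches, kp1, kp2, bin_width=5.0):
        # Relative orientation of each hypothesised match, wrapped to [0, 360).
        diffs = np.array([(kp2[m.trainIdx].angle - kp1[m.queryIdx].angle) % 360.0
                          for m in matches])
        n_bins = int(round(360.0 / bin_width))
        hist, _ = np.histogram(diffs, bins=n_bins, range=(0.0, 360.0))
        # The dominant difference is the largest bin; keep that bin and
        # its two neighbours, wrapping around at 360 degrees.
        peak = int(np.argmax(hist))
        keep = {(peak - 1) % n_bins, peak, (peak + 1) % n_bins}
        bins = np.minimum((diffs / bin_width).astype(int), n_bins - 1)
        return [m for m, b in zip(matches, bins) if b in keep]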

2.2. Scale Aware Matching

Scale can be used as a cue in a very similar way to orientation. Once again, it is the relative values between the two features, rather than their absolute scales, that are important. A histogram of relative scales is formed; the highest bin is taken as the dominant scale change between the two images and used to restrict the matches considered for later processing. Since the scales used in difference-of-Gaussian and similar detectors typically follow a geometric progression, the log of the scales is used for the histogram. This has the effect of distributing the different scales used in feature detection evenly along the axis. Unlike some earlier approaches [16, 17], the ratio rather than the difference of the scales is used, as a uniform scale change has a multiplicative rather than additive effect. If two features, f1 and f2, are a hypothesised match, and the two features are described by statistics computed over patches of radius s1 and s2 respectively, then their relative scale is given by log(s1/s2) = log(s1) − log(s2), where the logarithms may be taken to any base. A sketch of this filter is given below, after Section 2.3.

The histograms for the first two images of the ‘Graffiti’ data set are shown in Fig. 3. In this example, and in the following experiments, the range between the smallest and largest relative scale is divided into 50 equal bins. All three distributions (all matches, inliers, and outliers) show a clear central peak. The inlier distribution, however, is more tightly clustered around this peak, while the outliers show much heavier tails. The residual peak in the outlier distribution does, however, mean that the inliers and outliers are less clearly separated by scale than by orientation or Lowe’s heuristic. Removing all but the central three bins of the histogram retains 1386 features, of which 778 (56.1%) are inliers. This is lower than the inlier rate from Lowe’s heuristic or orientation filtering, but still larger than the 32.4% inlier rate of the raw matches.

2.3. Hybrid Filtering

The three methods for filtering matches (Lowe’s second-best match heuristic, relative orientation, and relative scale) need not be applied exclusively. They can be applied in various combinations in order to remove outliers from the hypothesised feature matches. Results for applying various combinations of filters to the matches are shown in Table 1. In this example, any pair of filters is superior to any single filter, and all three together provide the highest inlier rate. Li et al.’s method outperforms any single heuristic, but is weaker than any combination of two or more. In general the order in which the filters are applied matters: applying Lowe’s filter then relative orientation may not give the same results as applying relative orientation followed by Lowe’s. This is because the peaks of the histograms used in the orientation and scale filters may change when the data set is reduced by some other method. In this example, however, the results remain consistent regardless of order.
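A corresponding sketch of the scale filter, under the same assumptions as the previous listings (cv2.KeyPoint.size is taken as the patch scale s):

    import numpy as np

    def filter_by_relative_scale(matches, kp1, kp2, n_bins=50):
        # Log of the scale ratio: a uniform zoom becomes an additive shift,
        # and detector scales are spread evenly along the histogram axis.
        ratios = np.array([np.log(kp1[m.queryIdx].size / kp2[m.trainIdx].size)
                           for m in matches])
        hist, edges = np.histogram(ratios, bins=n_bins)
        # Keep only the central three bins around the dominant relative scale.
        peak = int(np.argmax(hist))
        lo, hi = edges[max(peak - 1, 0)], edges[min(peak + 2, n_bins)]
        return [m for m, r in zip(matches, ratios) if lo <= r <= hi]

    # Hybrid filtering chains the heuristics; as noted above, the result
    # can depend on the order in which the filters are applied.
    filtered = filter_by_relative_scale(
        filter_by_relative_orientation(matches, kp1, kp2), kp1, kp2)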

[Figure 2: three histograms of Number of Features against Angle (degrees), for (left to right) All Features, Inlier Features, and Outlier Features.]

Fig. 2. Histograms of relative orientations for hypothesised feature matches between the first two images of the ‘Graffiti’ data set. Histograms are given for (from left to right) all features, inliers determined from the ground truth homography, and outliers.

[Figure 3: three histograms of Number of Features against Relative Scale, for (left to right) All Features, Inlier Features, and Outlier Features.]

Fig. 3. Histograms of relative scales for hypothesised feature matches between the first two images of the ‘Graffiti’ data set. Histograms are given for (from left to right) all features, inliers determined from the ground truth homography, and outliers.

Table 1. Matching results for combinations of Lowe’s second-match threshold (L), relative orientation (O), and relative scale (S). Li et al.’s method is also included for comparison (Li).

Method(s)   Number   Inliers   Percent
None        2749     889       32.4%
L           1133     842       74.3%
O           1099     810       73.7%
S           1386     778       56.1%
L+O          914     777       85.0%
L+S          907     752       82.9%
O+S          873     715       81.9%
L+O+S        794     699       88.0%
Li           849     661       77.8%

3. EXPERIMENTS

In order to evaluate the merits of the proposed filters, they are used to match images from a standard data set [6, 14]. This consists of groups of six images, and a ground truth homography is provided between the first image and each of the others. The data sets include changes in viewpoint (‘Graffiti’ and ‘Wall’), zoom and rotation (‘Bark’ and ‘Boat’), blur

(‘Bikes’ and ‘Trees’), lighting (‘Leuven’), and JPEG compression level (‘UBC’). For each data set SIFT features are extracted and matched using approximate nearest neighbour search. The results are evaluated against the ground truth homography, H. A match x ↔ x′ is accepted as an inlier if |x′ − Hx| < T for some threshold T; a sketch of this test is given after the list below. The results reported here use T = 2 pixels, but similar results were found for other threshold values. Various heuristics were used to filter the initial matches:

• No filtering (N in Table 2)
• Lowe’s 80% of 2nd match heuristic (L in Table 2)
• The relative orientation heuristic (O in Table 2)
• The relative scale heuristic (S in Table 2)
• All three heuristics, L then O then S (3 in Table 2)
• Li’s modified distance (Equation 1; Li in Table 2)
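As referenced above, a minimal sketch of the inlier test, assuming H is a 3×3 numpy array mapping points in the first image into the second:

    import numpy as np

    def inlier_rate(matches, kp1, kp2, H, T=2.0):
        # Matched point locations in each image.
        x = np.array([kp1[m.queryIdx].pt for m in matches])
        xp = np.array([kp2[m.trainIdx].pt for m in matches])
        # Map x through the homography in homogeneous coordinates.
        Hx = (H @ np.hstack([x, np.ones((len(x), 1))]).T).T
        Hx = Hx[:, :2] / Hx[:, 2:3]
        # A match x <-> x' is an inlier if |x' - Hx| < T pixels.
        return float(np.mean(np.linalg.norm(xp - Hx, axis=1) < T))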

Table 2 shows the percentage of inliers with each heuristic across the data sets. In almost all cases Lowe’s heuristic is the best single filter to apply to the matched results. The relative scale approach is generally least successful, although in most cases it does offer an improvement in inlier rate over the raw feature matches. Except for large viewpoint changes the combination of all three methods gives the highest percentage of inliers. The combination of all three methods also outperforms the method proposed by Li et al. [18] in all cases except one case of extreme viewpoint change (Wall 1 → 6).

Table 2. The percentage inlier rate for different methods across a range of image transforms. Entries in italics have fewer than 100 inliers and the best performances are in bold.

Data set (transform)   Method   1→2     1→3     1→4     1→5     1→6
Bark (zoom/rot)        N        13.3%   5.7%    12.7%   9.0%    3.9%
                       L        71.1%   36.1%   80.7%   84.2%   63.2%
                       O        58.1%   25.9%   60.7%   55.8%   34.3%
                       S        37.6%   23.1%   74.7%   0.0%    0.0%
                       3        81.6%   39.6%   91.7%   99.2%   90.2%
                       Li       75.0%   37.4%   84.4%   90.7%   70.1%
Bikes (blur)           N        23.8%   16.5%   9.5%    6.9%    3.2%
                       L        78.9%   71.4%   59.8%   52.2%   22.1%
                       O        63.1%   57.8%   46.9%   35.7%   19.3%
                       S        51.4%   57.6%   63.7%   0.0%    0.0%
                       3        94.8%   93.8%   93.5%   90.6%   65.3%
                       Li       81.2%   72.0%   57.6%   32.0%   12.8%
Boat (zoom/rot)        N        24.1%   18.4%   7.1%    5.0%    0.9%
                       L        86.3%   85.3%   60.3%   52.4%   14.0%
                       O        68.1%   61.0%   36.1%   27.4%   6.0%
                       S        38.6%   9.8%    0.5%    0.3%    0.0%
                       3        95.0%   96.9%   90.2%   88.2%   44.6%
                       Li       89.4%   88.8%   69.0%   61.0%   19.8%
Graffiti (viewpoint)   N        32.3%   16.1%   4.7%    0.7%    0.1%
                       L        74.3%   46.0%   19.7%   2.4%    0.0%
                       O        73.7%   44.7%   20.4%   3.1%    0.0%
                       S        56.1%   14.3%   2.5%    0.5%    0.3%
                       3        88.0%   65.4%   66.7%   0.0%    0.0%
                       Li       77.9%   52.9%   25.8%   2.3%    0.0%
Leuven (lighting)      N        42.2%   35.0%   27.8%   21.9%   16.2%
                       L        87.3%   85.8%   80.0%   75.9%   67.8%
                       O        77.0%   73.5%   67.1%   62.0%   52.6%
                       S        66.5%   60.3%   51.4%   43.9%   34.7%
                       3        94.0%   94.9%   88.7%   89.8%   85.4%
                       Li       90.0%   86.4%   81.8%   79.3%   71.6%
Trees (blur)           N        12.0%   9.9%    3.1%    1.8%    0.7%
                       L        80.0%   67.0%   35.9%   30.7%   18.2%
                       O        58.9%   49.9%   22.9%   15.0%   7.6%
                       S        29.4%   21.8%   9.4%    0.0%    0.0%
                       3        90.0%   75.8%   49.1%   53.1%   41.3%
                       Li       82.8%   68.5%   38.5%   28.4%   16.6%
UBC (JPEG)             N        51.7%   39.6%   26.2%   12.1%   5.5%
                       L        94.3%   90.8%   84.9%   65.8%   51.2%
                       O        84.6%   75.7%   65.0%   42.9%   26.5%
                       S        69.6%   57.3%   41.4%   23.7%   14.3%
                       3        96.7%   94.5%   90.3%   79.2%   74.2%
                       Li       95.0%   92.0%   85.7%   67.7%   51.6%
Wall (viewpoint)       N        39.8%   35.8%   16.6%   5.8%    0.8%
                       L        81.3%   92.3%   70.5%   48.5%   8.3%
                       O        70.6%   75.5%   49.8%   24.0%   4.2%
                       S        60.0%   55.1%   26.3%   10.0%   1.1%
                       3        83.1%   94.6%   70.9%   56.6%   8.3%
                       Li       81.2%   92.6%   69.3%   48.6%   12.5%

In this case there are very few inliers, regardless of method. Li et al.’s method gives 9 inliers, while Lowe’s gives 11. The orientation and scale filters give somewhat more inliers (32 and 30 respectively), but the combination of all three heuristics keeps just one inlier point.

As a second test, the filters are applied to naive object recognition on the ‘UK-Bench’ data set [2], which contains four views of a range of objects. One image of each object is used for training and the remaining images are query images. Features are extracted from the training images, and then each query image is matched to all training images. The query image is assigned to the object category defined by the training image with the most feature matches; a sketch of this voting scheme is given after Table 3. Different combinations of heuristics are applied to filter the matches before selecting the best match. Note that matching without any heuristic does not provide useful information, since the number of raw matches is simply equal to the number of features in the query image.

Table 3 shows the number of correctly identified object categories for different filtering strategies. Results are shown for various numbers of objects. There is a general decrease in performance as more objects are added, which is to be expected as there are more chances for misclassification. Relative orientation gives the best performance of any single heuristic. The best results are generally found using relative orientation and scale together (O+S), even though scale on its own is a weak heuristic.

Table 3. Object classification results on the UK-Bench data set for various numbers of objects. ‘Number of objects’ is the training set size; there are three times as many query images.

Method   100 objects    500 objects    1000 objects
L        223 (74.3%)    806 (53.7%)    1618 (53.9%)
O        229 (76.3%)    1038 (69.2%)   2085 (69.5%)
S        147 (48.0%)    638 (42.5%)    1201 (40.0%)
L+O      244 (81.3%)    967 (64.5%)    1305 (43.5%)
L+S      220 (73.3%)    842 (56.1%)    1099 (36.6%)
O+S      237 (79.0%)    1060 (70.6%)   2103 (70.1%)
L+O+S    230 (76.7%)    974 (64.9%)    1305 (43.5%)
Li       212 (70.7%)    718 (47.9%)    1458 (48.6%)
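The naive recognition scheme above amounts to voting by filtered match count. A minimal sketch follows, where match_and_filter is a hypothetical stand-in for any of the heuristic combinations in Table 3:

    def classify(query_features, training_sets, match_and_filter):
        # Assign the query to the object whose training image retains
        # the most matches after filtering.
        best_obj, best_count = None, -1
        for obj, train_features in training_sets.items():
            count = len(match_and_filter(query_features, train_features))
            if count > best_count:
                best_obj, best_count = obj, count
        return best_obj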

4. CONCLUSIONS

Two new heuristics for removing outliers from image feature matches have been presented. These are complementary to the commonly used heuristic that the distance to the best match must be less than 80% of that of the second-best match. These heuristics are based on the relative orientation and scale of corresponding features, and identify the dominant transform between candidate matches. Matches that are consistent with this transform are retained, and those that are not are rejected as potential outliers. It has been shown that a combination of these heuristics can increase the inlier percentage across a wide range of common transforms. The combined heuristics have also been shown to outperform the modified distance metric proposed by Li et al. [18]. Relative orientation has also been shown to improve performance in simple object recognition on a subset of the UK-Bench data set, particularly when used in combination with scale information.

5. REFERENCES

[1] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2003.
[2] D. Nistér and H. Stewénius, “Scalable recognition with a vocabulary tree,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 2161–2168.
[3] T. Botterill, S. Mills, and R. Green, “Bag-of-words driven single-camera simultaneous localization and mapping,” Journal of Field Robotics, vol. 28, no. 2, pp. 204–226, 2011.
[4] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[5] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886–893.
[6] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615–1630, 2005.
[7] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust features,” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[8] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary robust independent elementary features,” in European Conference on Computer Vision (ECCV), 2010, pp. 778–792.
[9] S. Leutenegger, M. Chli, and R. Y. Siegwart, “BRISK: Binary robust invariant scalable keypoints,” in International Conference on Computer Vision (ICCV), 2011, pp. 2548–2555.
[10] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in International Conference on Computer Vision (ICCV), 2011, pp. 2564–2571.
[11] A. Alahi, R. Ortiz, and P. Vandergheynst, “FREAK: Fast retina keypoint,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 510–517.
[12] J. Beis and D. Lowe, “Shape indexing using approximate nearest-neighbour search in high-dimensional spaces,” in Conference on Computer Vision and Pattern Recognition (CVPR), 1997, pp. 1000–1006.
[13] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[14] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool, “A comparison of affine region detectors,” International Journal of Computer Vision, vol. 65, no. 1, pp. 43–72, 2005.
[15] T. Botterill, S. Mills, and R. Green, “New conditional sampling strategies for speeded-up RANSAC,” in British Machine Vision Conference (BMVC), 2009.
[16] Z. Yi, C. Zhiguo, and X. Yang, “Multi-spectral remote image registration based on SIFT,” Electronics Letters, vol. 44, no. 2, pp. 107–108, 2008.
[17] M. Teke and A. Temizel, “Multi-spectral satellite image registration using scale-restricted SURF,” in International Conference on Pattern Recognition (ICPR), 2010.
[18] Q. Li, G. Wang, J. Liu, and S. Chen, “Robust scale-invariant feature matching for remote sensing image registration,” Geoscience and Remote Sensing Letters, vol. 6, no. 2, pp. 287–291, 2009.