Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems October 9 - 15, 2006, Beijing, China
Good Image Features for Bearing-only SLAM

Xiang Wang
Department of Electrical and Computer Engineering
University of Alberta
Edmonton, AB, Canada, T6G 2V4
Email: [email protected]

Hong Zhang
Department of Computing Science
University of Alberta
Edmonton, AB, Canada, T6G 2E8
Email: [email protected]

Abstract— In this paper, we propose an algorithm for extracting and selecting SIFT (scale-invariant feature transform) visual features for bearing-only SLAM in indoor environments. The algorithm is based on analyzing the stability of the matching ratio of SIFT features at different scales, and it is capable of extracting SIFT features that can be matched reliably and, at the same time, lead to accurate landmark initialization. In addition, the algorithm is an order of magnitude more efficient than the original SIFT algorithm and is therefore appropriate for the real-time nature of SLAM. As well, the algorithm can determine the quality of the visual features without any delay, which eliminates the need for a matching or tracking procedure, as is often necessary in other feature extraction algorithms. Results from several experiments verify the performance of the proposed algorithm.
I. INTRODUCTION

Vision-based SLAM has received increasing attention of late because cameras are now cheap, small, power-saving, and capable of providing rich texture information. Earlier work on vision-based SLAM was limited to stereo systems used to capture the 3D coordinates of landmarks. A more attractive question is how to solve the SLAM problem with a single camera. This question is an example of the bearing-only SLAM problem, in which landmark initialization is one of the most important and, unfortunately, most difficult problems. Since good image features are prerequisites for all solutions to vision-based SLAM problems, this step is a critical consideration regardless of the specific algorithms used.

Though some work has been done on vision-based SLAM problems, there has been little advancement in image feature extraction and selection methods for vision-based SLAM. Instead, researchers have generally assumed that good image features are available. A comparative study [2] of several local descriptors showed that the best matching results were obtained using the SIFT algorithm [1] developed by Lowe, which was the most robust to image translation, scaling, and rotation, and partially robust to illumination changes. SIFT has become a very popular image feature extraction algorithm for multi-camera SLAM. However, the original SIFT algorithm is not practical for bearing-only SLAM, especially in environments with rich texture. Usually, of the thousands of SIFT features extracted from one image, only 10% can find their matches in another image, even when the change between the two viewpoints is small. In addition, many of the matches are incorrect. Since a SLAM
algorithm should run in real time, extracting a large number of features that can hardly find their matches does not meet the requirement of speed. In feature initialization algorithms for bearing-only SLAM, all the features extracted from an image are put into the initialization procedure, and outliers are discarded gradually by different methods based on reliable matching. As a result, there is no way to run any feature initialization algorithm online if an appropriate number of good features cannot be extracted and mismatches cannot be pruned. Consequently, how to extract good SIFT features for vision-based SLAM is critical.

In this paper, we propose a feature extraction method that selectively detects SIFT features based on analyzing stable matching ratios at different scales. A limited but sufficient number of SIFT features is extracted quickly, with a large matching ratio and without extra errors being introduced into the subsequent processes. We also propose a mismatch pruning algorithm that can efficiently prune mismatches with 100% accuracy.

II. BACKGROUND

Many kinds of sensors can be used to achieve SLAM; vision-based SLAM uses cameras as the only sensors. Unlike other popular sensors, which can directly return 3D information about physical features in the real world, cameras return 2D images with rich texture information. Obviously, images themselves cannot be input to any SLAM algorithm, so extracting good image features is a prerequisite for all vision-based SLAM solutions. For vision-based SLAM applications, we should extract enough, but not too many, features that can be obtained efficiently, matched reliably, and located accurately. These criteria come from three important requirements of vision-based SLAM. First, SLAM algorithms must run in real time. Second, state updating requires reliable data association, which is related directly to reliable feature matching. Third, since in bearing-only SLAM the 3D positions of features cannot be obtained from a single image, accurate and well-conditioned features have to be initialized with an initialization algorithm, which requires accurate image feature locations.

Currently, there are three types of popular feature detectors in the vision-based SLAM field: image segmentation based detectors, Harris-corner-related detectors such as the detector introduced in [3], and SIFT [1]. Image segmentation based detectors are used to detect artificial landmarks,
while the other two types detect features from the natural environment. Artificial objects with known color patterns and ceiling lights are the most commonly used artificial landmarks.

The feature detector proposed by Shi and Tomasi in [3] is similar to the widely used Harris corner detector [4]. The Harris detector finds step edges using only first-order derivatives: a corner is detected when the two eigenvalues of the matrix A = w ∗ (∇I ∇I^T) are large, where w is the Gaussian smoothing mask and ∇I denotes the partial derivatives of the image I. Taking a different approach from Harris' algorithm, the detector introduced by Shi and Tomasi uses areas of a larger size rather than just one pixel. To evaluate how interesting a particular area of an image is, the horizontal gradients gx and vertical gradients gy of the image intensity are calculated at each pixel position. The 2 × 2 matrix Z is then formed as follows:

$$Z = \begin{bmatrix} g_x^2 & g_x g_y \\ g_y g_x & g_y^2 \end{bmatrix} \qquad (1)$$
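To make this interest measure concrete, a minimal sketch (assuming NumPy and a grayscale float image region; the function name is ours) scores one image area by the smaller eigenvalue of Z in (1), which is exactly the value thresholded in the next paragraph:

import numpy as np

def shi_tomasi_score(patch):
    """Interest value of an image area: the smaller eigenvalue of the
    matrix Z in (1). `patch` is a 2D float array (a grayscale region);
    this is a sketch, not the authors' implementation."""
    gy, gx = np.gradient(patch)                # vertical / horizontal gradients
    Z = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gy * gx), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(Z).min()         # min(lambda1, lambda2)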
Two eigenvalues λ1 and λ2 of Z are found to judge the interest value of an area. The area has a high interest value if the smaller of λ1 and λ2 is larger than a threshold. The operator is moved over an image, and the areas giving the largest smaller eigenvalues of Z are chosen.

The general idea of SIFT [1] is to identify repeatable points (scale-space extrema) in a pyramid of scaled images built by a cascade filtering approach. There are four main steps in this algorithm. First, a pyramid is built by incrementally convolving the initial image with Gaussian kernels. Then, adjacent images separated by a constant factor in scale space are subtracted to produce the difference-of-Gaussian (DOG) images. At this point one complete octave of scale space has been built. We start building the next octave by resampling the Gaussian image at twice the value of the standard deviation σ of the initial Gaussian kernel, shrinking the size of the initial image. The procedures described above are repeated until the pyramid is completed.

The second step is to identify keypoint locations by detecting local maxima and minima in the DOG pyramid. As shown in Figure 1, this is implemented by comparing each pixel to its 8 neighbors in the current image and its 9 neighbors each in the scales above and below. The next step is to identify accurate keypoint locations by fitting a 3D quadratic function to the points around a keypoint to determine the interpolated location of the extremum, and to eliminate keypoints with low contrast or poorly localized along an edge.

The fourth step is assigning each feature an orientation. The smoothed images within each octave of the pyramid are processed to extract image gradients and orientations. At each pixel Aij in an image L, the image gradient magnitude Mij and orientation Rij are computed using pixel differences as follows:

$$M_{ij} = \sqrt{(L_{x+1,y} - L_{x-1,y})^2 + (L_{x,y+1} - L_{x,y-1})^2} \qquad (2)$$

$$R_{ij} = \tan^{-1}\!\big((L_{x,y+1} - L_{x,y-1})/(L_{x+1,y} - L_{x-1,y})\big) \qquad (3)$$

The orientation of a feature is determined by the peak in a histogram of local image (a region around the feature) gradient orientations. The features are invariant to rotation because of the canonical orientations assigned. After the whole procedure, every feature has been assigned an image location, scale and orientation. Finally, a descriptor vector with 128 elements is created to represent each feature. A descriptor is composed of a 3D histogram of gradient locations and orientations, in which each bin is weighted by the gradient magnitude and a Gaussian function. The descriptor is robust to small geometric distortions and small errors in region detection because of the quantization of gradient locations and orientations.
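As a concrete reading of (2) and (3), the following sketch (NumPy assumed; the row/column convention and border handling are our assumptions) computes the pixel-difference gradient magnitude and orientation over a smoothed image L:

import numpy as np

def gradient_magnitude_orientation(L):
    """Pixel-difference gradient magnitude (2) and orientation (3) for a
    Gaussian-smoothed image L (2D float array); border pixels are skipped."""
    dx = L[1:-1, 2:] - L[1:-1, :-2]   # L[x+1,y] - L[x-1,y]
    dy = L[2:, 1:-1] - L[:-2, 1:-1]   # L[x,y+1] - L[x,y-1]
    M = np.sqrt(dx ** 2 + dy ** 2)    # Eq. (2)
    R = np.arctan2(dy, dx)            # Eq. (3); arctan2 resolves the quadrant
    return M, R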
Fig. 1. Extrema of the DOG images are found by comparing a pixel (marked with a star) with its 26 neighbors in three adjacent scales (marked in dark color).
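The 26-neighbor comparison of Figure 1 can be sketched directly (NumPy assumed; `dog` is a list of same-sized DOG images in one octave, and boundary checks are omitted):

import numpy as np

def is_dog_extremum(dog, s, i, j):
    """True if pixel (i, j) of DOG image dog[s] is an extremum over its
    26 neighbors in scales s-1, s and s+1 (Figure 1); a sketch only."""
    val = dog[s][i, j]
    cube = np.stack([dog[t][i - 1:i + 2, j - 1:j + 2]
                     for t in (s - 1, s, s + 1)])
    # val is itself in the cube, so ties count as extrema in this sketch.
    return val == cube.max() or val == cube.min()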
III. RELATED WORK

In bearing-only SLAM, aside from [5] and [6] by Davison, who did all his work using Shi and Tomasi's feature detector, all other work either used simulation, which assumes that good feature extraction and matching are available, or used segmentation-based detectors with artificial landmarks, which offer a limited number of highly distinguishable landmarks. In [16], Bailey used simulation to validate his feature initialization algorithm. [7] used artificial sources of light in an office-like environment. [10] used cardboard boxes covered with colored construction paper. [11] used six different artificial landmarks. [12] used cylinders made from colored paper as landmarks. In [13], lines and points were extracted for bearing-only SLAM: lines were extracted using the Hough Transform, while points, which are the centers of circular ceiling lights, were extracted based on their intensity. The latest work, in [14], used a Harris corner detector in scale space. [9] and [8] used SIFT features but worked with a stereo camera system, so they do not address bearing-only SLAM.

In the bearing-only SLAM field, we can see that the critical feature extraction and mismatch pruning problems were either avoided by using simulation or sidestepped by relying on well-marked artificial landmarks. Even as the most robust feature detector, SIFT can hardly be used in any bearing-only SLAM algorithm, especially in undelayed algorithms, for two main reasons. The first is that too many features are extracted with a low matching ratio, so no feature initialization algorithm or particle-filter-related algorithm can afford the computational cost. The second is that unless one uses small images, the SIFT algorithm is not fast enough for real-time SLAM. However, with small images, we easily lose the
features we want to track. So our goal here is to develop a fast SIFT algorithm for bearing-only SLAM that keeps the good properties of SIFT and efficiently extracts a limited but sufficient number of SIFT features with high matching ratios and little introduction of error.

Feature matching is directly related to the data association problem, which is a key concern in all vision-based SLAM. No matching algorithm can provide matches with 100% accuracy, so our other goal is to develop an algorithm that can efficiently prune mismatches with high accuracy.

IV. FAST SIFT FOR VISION-BASED SLAM

It is natural to think of decreasing the number of scale levels or the image resolution in order to extract fewer features. However, giving up high-scale features will not contribute much to the goal of decreasing the number of features, because the number of high-scale features is very limited. In addition, high-scale features are good features: they are very robust to noise and offer strong matching results at low computational cost. Decreasing the image resolution can dramatically decrease the total number of features extracted, but the features lost in this way are robust to occlusions and large robot motions, and so are very important for long-term tracking. Our fast SIFT method, which extracts features from specific scales, is described below.

A. Feature extraction and matching

In order to specify the interesting scales at which good features can be extracted, a set of hypotheses about the characteristics of SIFT features at different scales is proposed as follows:

• A match with a scale higher than 0.5MAXS (MAXS is the highest scale) can be used reliably for mismatch pruning, as described in the next section, if the probability P1 is higher than 0.5, where P1 is the probability that more than hi% ("hi" is a threshold with a large value) of the matches found at scales higher than 0.5MAXS are right matches:

H10: P1 = 1/2;  H11: P1 > 1/2

• Scales lower than 2σ0 (σ0 is the lowest scale) should not be included in a Gaussian pyramid if the probability P2 is higher than 0.5, where P2 is the probability that less than lo% ("lo" is a threshold with a small value) of the features with scales lower than 2σ0 can find their matches:

H20: P2 = 1/2;  H21: P2 > 1/2

• Scales lower than 2σ0 should not be included in a Gaussian pyramid if the probability P3 is higher than 0.5, where P3 is the probability that more than mo% ("mo" is a threshold with a large value) of the matches found at scales lower than 2σ0 are mismatches:

H30: P3 = 1/2;  H31: P3 > 1/2

• Scales in the scale range Rm between 2σ0 and 8σ0 are the main parts of a Gaussian pyramid if the probability P4 is higher than 0.5, where P4 is the probability that the ratio between the number of matches and the number of features in Rm is at least double the ratio between the number of matches and the number of features over the whole scale space:

H40: P4 = 1/2;  H41: P4 > 1/2
Fig. 2. SIFT features extracted from two indoor images.
In order to test these hypotheses, the original SIFT algorithm was applied to a group of images taken in a lab under different lighting conditions and from different viewpoints. The parameter "hi" is set at 95, "lo" is set at 10, and "mo" is set at 60. The distance between two feature descriptors introduced in [1] was used to search for matches between a pair of images. Before building a pyramid, an arbitrary value was chosen for the lowest scale σ0 (σ0 = 1.6 was suggested in [1]). The MAXS value can be calculated from the original size of an image; for the 640 × 480 images used in our experiments, MAXS is approximately 50. The critical value Cr is set at 70% of the number of trials n (in our case, n = 50 and Cr = 35). All possible scores greater than Cr constitute the critical region, and all possible scores less than or equal to Cr constitute the acceptance region.
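The descriptor-distance search mentioned above can be sketched as a nearest-neighbour query with the distance-ratio check from [1] (the 0.8 ratio is the value suggested there; the brute-force loop is our simplification):

import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    """Match 128-element SIFT descriptors by Euclidean distance, keeping
    a match only when the nearest neighbour is clearly closer than the
    second nearest (the distance-ratio test of [1]); a sketch only."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)  # distance to every candidate
        n1, n2 = np.argsort(dists)[:2]             # two nearest neighbours
        if dists[n1] < ratio * dists[n2]:
            matches.append((i, n1))
    return matches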
Fig. 5. The left pair of images shows the matches found at scales smaller than 3.2. The middle pair shows the matches found in the scale range from 3.2 to 12.8. The right pair shows the matches found at scales higher than one half of MAXS.
Fig. 3. Matches between the two sets of features shown in Fig. 2. The lengths of the red lines represent scales.
Fig. 6. The left pair of images shows the matches including mismatches. The right pair shows that all mismatches were pruned.
Fig. 4. Statistical properties of the features and matches shown in Figures 2 and 3. The first row shows the distributions of features vs. scales, matches vs. scales, and matching ratios vs. scales. The second and third rows are histograms of match and mismatch properties related to scales.

In order to compute the type II error (a false hypothesis is accepted, i.e., a false negative) β, we set Hi1 as Pi = 3/4, where i = 1, 2, 3, 4. The type I error (a true hypothesis is rejected, i.e., a false positive) α and the type II error β are then defined based on a binomial distribution b(x; n, p) as follows:

$$\alpha = P\left(X > C_r \text{ when } p = \tfrac{1}{2}\right) = \sum_{x=C_r+1}^{n} b\left(x; n, \tfrac{1}{2}\right) \qquad (4)$$

$$\beta = P\left(X \le C_r \text{ when } p = \tfrac{3}{4}\right) = \sum_{x=0}^{C_r} b\left(x; n, \tfrac{3}{4}\right) \qquad (5)$$

To compute these probabilities, we used the normal distribution approximation Z with μ = np and σ = √(npq), where q = 1 − p. So α and β can be calculated as follows:

$$z = \frac{x - \mu}{\sigma} \qquad (6)$$

$$\alpha = \sum_{x=C_r+1}^{n} b\left(x; n, \tfrac{1}{2}\right) \approx P(Z > z) = 1 - P(Z < z) \qquad (7)$$

$$\beta = \sum_{x=0}^{C_r} b\left(x; n, \tfrac{3}{4}\right) \approx P(Z < z) \qquad (8)$$
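These error rates can be checked numerically; a short sketch with SciPy, using n = 50 and Cr = 35 from the text, evaluates both the exact binomial tails of (4) and (5) and the normal approximations of (6)-(8):

from scipy.stats import binom, norm

n, Cr = 50, 35  # number of trials and critical value (70% of n)

# Exact binomial tails, Eqs. (4) and (5).
alpha = 1.0 - binom.cdf(Cr, n, 0.5)    # P(X > Cr when p = 1/2)
beta = binom.cdf(Cr, n, 0.75)          # P(X <= Cr when p = 3/4)

# Normal approximation, Eqs. (6)-(8): z = (x - mu) / sigma.
def z_score(x, p):
    mu, sigma = n * p, (n * p * (1.0 - p)) ** 0.5
    return (x - mu) / sigma

alpha_approx = 1.0 - norm.cdf(z_score(Cr, 0.5))   # P(Z > z), Eq. (7)
beta_approx = norm.cdf(z_score(Cr, 0.75))         # P(Z < z), Eq. (8)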
For all the hypotheses, we obtained α = 0.017 and β = 0.05, implying that our trials give correct decisions with very low false positive and false negative rates. This indicates that the test results of our scale selection strategy are very reliable. A set of result images is shown in Figures 2-5. The results from the 50 trials are shown in Table I. From these results we can see that Hi0 (i = 1, 2, 3, 4) are all rejected.

Based on the analysis described above, the fast SIFT algorithm proceeds as follows (a sketch is given after this list):

• Calculate MAXS, resize the original image to one half of its original size, and choose 2σ0 as the initial scale.
• Build the first octave of a Gaussian pyramid with this initial scale.
• Subtract each pair of neighboring images to create the DOG images.
• Search for SIFT features in the DOG images.
• Resize the top Gaussian-smoothed image in the current octave to one half of its size.
• Repeat the above procedures until the scale is equal to one half of MAXS.
• Continue to build the pyramid from the top octave and move downwards until two SIFT features are found or the scale equal to one half of MAXS is reached.
• Create a descriptor for every extracted SIFT feature using the method introduced in [1].
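A high-level sketch of the restricted pyramid follows (OpenCV assumed; the feature search and descriptor steps are left out, and parameter names and the octave layout are our assumptions, not the authors' implementation):

import cv2
import numpy as np

def fast_sift_pyramid(image, sigma0=1.6, maxs=50.0, levels_per_octave=3):
    """Sketch of the restricted DOG pyramid: the lowest octave starts at
    2*sigma0 (dropping scales below 2*sigma0) and octaves stop once the
    scale reaches one half of MAXS. `image` is a float32 grayscale array."""
    img = cv2.resize(image, (image.shape[1] // 2, image.shape[0] // 2))
    sigma = 2.0 * sigma0                    # skip scales below 2*sigma0
    k = 2.0 ** (1.0 / levels_per_octave)    # constant factor in scale space
    dog_pyramid = []
    while sigma < 0.5 * maxs:               # stop at one half of MAXS
        # One octave: incrementally blurred images and their differences.
        gauss = [cv2.GaussianBlur(img, (0, 0), sigma * k ** i)
                 for i in range(levels_per_octave + 1)]
        dog_pyramid.append([g2 - g1 for g1, g2 in zip(gauss, gauss[1:])])
        # Next octave: halve the image size and double the base scale.
        img = cv2.resize(gauss[-1], (img.shape[1] // 2, img.shape[0] // 2))
        sigma *= 2.0
    return dog_pyramid

Dropping the scales below 2σ0 and above 0.5·MAXS is what removes the unreliable low-scale features while keeping the pyramid, and hence the computation, small.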
B. Algorithm for mismatch pruning

In our experiments, the robot motion is limited to a translation along the z axis or a translation in the x-y plane with a rotation about the y axis. An efficient, accurate and simple algorithm is introduced here to prune mismatches using the match with the highest scale and the camera motion model. The robot motion model is divided into two parts: the motion within the x-y plane and the motion along the z axis. For the first part of the model, we define θ as the angle associated with a match, calculated as

$$\tan(\theta) = (y_2 - y_1)/(x_2 - x_1) \qquad (9)$$

where (x1, y1) and (x2, y2) are the image coordinates of the matched features. We set TH1 = 2°. We then decide whether a match is a right one by the following criterion: if |θ − θm| < TH1, keep the match; otherwise, prune it. Here θm is the angle associated with the match of highest scale. When there is a displacement Δz along the z axis, we have

$$\phi_1 = \phi_2 \qquad (10)$$

where tan(φ1) = y1/x1 and tan(φ2) = y2/x2, and (x1, y1) and (x2, y2) are the image coordinates of the matched features. We set TH2 = 2°. We then decide whether a match is a right one by the following criterion: if |φ1 − φ2| < TH2, keep the match; otherwise, prune it. We tested the algorithm on all 100 pairs of images, and mismatches were 100% pruned. Because we have sufficient match candidates, it is not a severe problem if some right matches are also pruned. In Figure 6, one pair of images is chosen to show the results of our algorithm. For other robot motion models, different criteria should be applied. A sketch of the two criteria is given below.
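A sketch of the two pruning criteria (Python's math module assumed; matches are ((x1, y1), (x2, y2), scale) tuples, and coordinates relative to the principal point are our assumption):

import math

TH1 = TH2 = 2.0  # degree thresholds from the text

def prune_xy_motion(matches):
    """Criterion (9): keep a match only if its angle theta agrees with
    theta_m, the angle of the highest-scale (most reliable) match."""
    def theta(m):
        (x1, y1), (x2, y2), _ = m
        return math.degrees(math.atan2(y2 - y1, x2 - x1))
    theta_m = theta(max(matches, key=lambda m: m[2]))
    return [m for m in matches if abs(theta(m) - theta_m) < TH1]

def prune_z_motion(matches):
    """Criterion (10): under translation along the z axis the ray angle
    of a feature is preserved, so prune when |phi1 - phi2| >= TH2."""
    kept = []
    for (x1, y1), (x2, y2), s in matches:
        phi1 = math.degrees(math.atan2(y1, x1))
        phi2 = math.degrees(math.atan2(y2, x2))
        if abs(phi1 - phi2) < TH2:
            kept.append(((x1, y1), (x2, y2), s))
    return kept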
TABLE I
ANALYZING RESULTS AT DIFFERENT SCALE RANGES. ALL THE DATA WERE OBTAINED BY AVERAGING THE RESULTS OF 50 IMAGE PAIRS.

Scales (S)        | Nmp/Nfp | Nmt/Nft | Nmisp/Nmist | Nmrp/Nmp
S ≤ 2σ0           |  7.31%  |    -    |   73.99%    |    -
2σ0 < S ≤ 8σ0     | 25.83%  |  8.68%  |    9.07%    |    -
S ≥ 0.5MAXS       |    -    |    -    |      -      |  94.76%

Nfp, Nmp and Nmisp are the number of features, the number of matches and the number of mismatches within the scale range. Nft, Nmt and Nmist are the total number of features, the total number of matches and the total number of mismatches. Nmrp is the number of right matches within the scale range.
V. EXPERIMENTAL RESULTS

In order to test our fast SIFT and mismatch pruning algorithms for bearing-only SLAM applications, we used the feature initialization algorithms introduced in [15], a particle filter, and a Kalman filter to initialize the matches generated by our algorithms. The time our algorithm needs to extract right matches is only 8% of that of the original SIFT algorithm. The initialized 3D features are shown in Figures 7 and 8. The average errors associated with the means of the initialized landmark positions are shown in Table II, and the average standard deviations of the initialized landmark positions are shown in Table III. From the results, we can see that our algorithm did not introduce extra errors. The proposed algorithm can therefore efficiently extract robust features and matches that can be used directly by bearing-only SLAM algorithms while the accuracy of the final results is maintained.

TABLE II
THE AVERAGE ERRORS ASSOCIATED WITH THE MEANS OF THE ESTIMATED POSITIONS OF ALL THE MATCHES GENERATED FROM ORIGINAL SIFT AND FAST SIFT. UNIT OF THE MEAN IS MM.

Algorithms      |       SUF        |        UF        |       EKF
Original SIFT   | 0.28  0.30  3.63 | 0.28  0.30  3.64 | 0.42  0.41  5.10
Fast SIFT       | 0.33  0.24  3.14 | 0.32  0.23  3.05 | 0.68  0.53  6.70

TABLE III
THE AVERAGE STANDARD DEVIATIONS OF ALL ESTIMATED LANDMARK POSITIONS. RESULTS USING PF WERE CONSIDERED OPTIMAL. UNIT OF THE STANDARD DEVIATION IS MM.

Algorithms      |   PF   |  SUF   |   UF   |  EKF
Original SIFT   | 91.10  | 92.22  | 95.14  | 17.98
Fast SIFT       | 90.73  | 91.68  | 94.56  | 17.93
Fig. 7. 3D features initialized by output matches from original SIFT.

Fig. 8. 3D features initialized by output matches from fast SIFT.
VI. CONCLUSIONS

In this paper, we have proposed SIFT-based feature extraction and mismatch pruning algorithms to select good features for bearing-only SLAM. For the first time, the strongest feature extraction algorithm currently available can be used in the bearing-only SLAM field. The fast SIFT algorithm is an order of magnitude more efficient than the original SIFT, while the features generated have higher matching ratios and are more robust. The proposed mismatch pruning algorithm can efficiently prune mismatches with 100% accuracy, which addresses the difficult data association problem.

Two experiments were set up. The first one demonstrated the accuracy of our algorithms. The second one showed that our algorithms can work with different bearing-only SLAM algorithms in real time and provide accurate results without introducing extra errors. Our algorithms can be practically used as a feature detector for all vision-based SLAM purposes.
REFERENCES

[1] David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60(2), 2004, pp. 91-110.
[2] Krystian Mikolajczyk and Cordelia Schmid, "A performance evaluation of local descriptors," in Proc. of the IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition, Madison, USA, pp. 257-263, 2003.
[3] J. Shi and C. Tomasi, "Good features to track," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 593-600, 1994.
[4] C. Harris and M. Stephens, "A combined corner and edge detector," in Fourth Alvey Vision Conf., pp. 147-151, 1988.
[5] A. Davison, "Real-time simultaneous localization and mapping with a single camera," in Proc. International Conference on Computer Vision, Nice, October 2003. [Online]. Available: http://www.robots.ox.ac.uk/ActiveVision/Papers/davison_iccv2003/davison_iccv2003.pdf
[6] A. J. Davison, Y. G. Cid, and N. Kita, "Real-time 3D SLAM with wide-angle vision," in Proc. IFAC Symposium on Intelligent Autonomous Vehicles, Lisbon, July 2004. [Online]. Available: http://www.robots.ox.ac.uk/ActiveVision/Papers/davison_etal_iav2004/davison_etal_iav2004.pdf
[7] S. Panzieri, F. Pascucci and G. Ulivi, "Vision based navigation using Kalman approach for SLAM," [Online]. Available: http://panzieri.dia.uniroma3.it/Articoli/ICAR2003a.pdf
[8] Jaime Valls Miró, Gamini Dissanayake and Weizhen Zhou, "Vision-based SLAM using natural features in indoor environments," [Online]. Available: http://www.cas.edu.au/download.php/VallsMiro-ISSNIP05-VsualSLAM.pdf?id=1431
[9] Stephen Se, David Lowe and Jim Little, "Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks," Int. J. Robotics Research, 21(8), pp. 735-758, Aug. 2002.
[10] Frank Dellaert and Ashley W. Stroupe, "Linear 2D localization and mapping for single and multiple robot scenarios," in Proceedings of the 2002 IEEE International Conference on Robotics and Automation, Washington, DC, May 2002.
[11] Matthew Deans and Martial Hebert, "Experimental comparison of techniques for localization and mapping using a bearing-only sensor," in Proc. of the ISER'00, Seventh Int. Symposium on Experimental Robotics, pp. 395-404, December 2000.
[12] David Prasser and Gordon Wyeth, "Probabilistic visual recognition of artificial landmarks for simultaneous localization and mapping," in Proceedings of the 2003 IEEE International Conference on Robotics and Automation, Taipei, Taiwan, September 2003.
[13] J. Folkesson, P. Jensfelt and H. I. Christensen, "Vision SLAM in the measurement subspace," in IEEE ICRA05, 2005.
[14] Patric Jensfelt, Danica Kragic, John Folkesson and Mårten Björkman, "A framework for vision based bearing only 3D SLAM," IEEE International Conference on Robotics and Automation, Orlando, Florida, May 2006.
[15] Xiang Wang and Hong Zhang, "Bearing-only landmarks initialization by using SUF with undistorted SIFT features," IEEE International Conference on Robotics and Automation, Orlando, Florida, May 2006.
[16] T. Bailey, "Constrained initialization for bearing-only SLAM," IEEE International Conference on Robotics and Automation, 2003.
[17] Al Costa, George Kantor and Howie Choset, "Bearing-only landmark initialization with unknown data association," International Conference on Robotics and Automation, New Orleans, LA, April 2004, pp. 1764-1770.
[18] Joan Solà, André Monin, Michel Devy and Thomas Lemaire, "Undelayed initialization in bearing only SLAM," International Conference on Intelligent Robots and Systems, Edmonton, AB, Canada, August 2005.