Proc. IEEE Workshop on Planetary Rover Technology and Systems
April 23, 1996. Minneapolis, Minnesota, USA
A Stereovision System for a Planetary Rover: Calibration, Correlation, Registration, and Fusion

Zhengyou Zhang
INRIA Sophia-Antipolis
2004 route des Lucioles, BP-93
06902 Sophia Antipolis Cedex, France
Email: [email protected]

Abstract
This paper describes a complete stereovision system, originally developed for planetary applications but usable for other tasks such as object modeling. A new, effective on-site calibration technique has been developed, which can make use of information from the surrounding environment as well as information from the calibration apparatus. A correlation-based stereo algorithm is used, which produces sufficiently dense range maps with an algorithmic structure suited to fast implementations. A technique based on iterative closest-point matching has been developed for the registration of successive depth maps and the computation of the displacements between successive positions. A statistical method based on the distance distribution is integrated into this registration technique, which allows us to deal with important problems such as outliers, occlusion, appearance and disappearance. Finally, the registered maps are expressed in the same coordinate system and fused, erroneous data are eliminated through consistency checking, and a global digital elevation map is built incrementally.
1 Introduction

Cameras and rangefinders are the two major types of sensors used on planetary rovers to perceive the surrounding environment [29, 12, 19, 33]. They have their own advantages and shortcomings, and can be used in a cooperative way to achieve higher performance. In this paper, however, we describe only our work on the development of a complete stereovision system, from on-site calibration to incremental Digital Elevation Map (DEM) building. The work was first carried out in the context of the Autonomous Planetary Rover program (known as VAP) initiated by the French Space Agency (CNES) in 1989 [15], and then pursued in the context of the Eureka program IARES (Illustration of an Autonomous Robot for the Exploration of Space) [5].

The perception system of a planetary rover usually consists of two parts. The first is associated with the operator station on Earth, which exploits the perceptual information provided by the rover so that the operator can perform planning and operational control of the rover's mission. The second is onboard, and should be able to:

- provide the operator station with panoramas all around the rover;
- calibrate the vision system on site;
- provide information on the ground relief and nature, which is necessary for the modeling of the environment used for autonomous path generation;
- estimate precisely the displacement of the rover from visual data;
- estimate the relative position of the rover with respect to a global map produced by an orbiter.
Due to the significant delay in communication between the Earth and the planet, the rover should have the maximum possible autonomy. This requires an onboard perception system of high performance, and our work is focused on the development of that onboard system. More precisely, we present in this paper a new technique for calibrating a binocular stereo rig, a correlation-based stereo algorithm which produces a dense depth map of the local environment, a registration technique which estimates the displacement of the rover between successive positions and expresses the visual data in a common coordinate system, and finally a fusion technique which incrementally builds a global digital elevation map of the environment. Before proceeding further, we should mention that the approach we follow here is based on strong calibration, in the sense that we are able to extract 3D metric information from images. For some visual tasks, this may not be necessary. For example, in [34], a technique is presented to evaluate the traversability of terrain using only weakly calibrated stereo images, i.e. images for which only the epipolar geometry is known.
2 Stereo Calibration

Since a metric reconstruction of the environment is necessary for the path planning and navigation of the planetary rover, the stereovision system must be calibrated strongly, in the sense that a 3-D Euclidean reconstruction can be performed. The calibration should be performed on site: if we do it before launch, it will no longer be valid due to, for example, vibration and temperature variation during the flight.

Classical calibration techniques first estimate the perspective projection matrix for each camera, and then compute the epipolar geometry from the projection matrices. The calibration of each camera is performed by observing a calibration object whose geometry is known with very good precision in 3-D space, and can be done very efficiently. However, this approach is difficult to put into operation for space applications, because it usually requires human interaction to place the calibration object so that it is visible from both cameras. Since there does not yet exist a robust and fully automatic self-calibration technique, the calibration object cannot yet be thrown away; our solution is to mount the calibration object on the vehicle. Furthermore, the classical approach suffers from two problems:

- The fact that the set of image points in one camera corresponds to the set of image points in the other camera is not used in the classical approach, which degrades the precision of the estimation of the epipolar geometry.
- The calibration is only valid for the volume around the position of the calibration object. The performance of the calibration degrades as we move away from that position.
The second problem is more critical in our situation: because the calibration object is mounted on the vehicle, its distance to the cameras is only about 1.5 meters, while the range of interest extends to 10 meters or more. The calibration technique described in this section overcomes the above two problems.

The first stage of our calibration technique is to estimate the epipolar geometry between the two cameras by minimizing the sum of the squared distances between points and their epipolar lines. Matches from the environment, which do not belong to the calibration object, can and should be used in the estimation of the epipolar geometry in order to increase its validity range. Furthermore, our algorithm can take advantage of the rigidity of the geometry between the two cameras by incorporating matches established at different instants, thus yielding an estimate of the epipolar geometry asymptotically valid for the whole range.

The idea underlying our approach is based on the now well-known fact that a projective structure can be reconstructed from two views given that the epipolar geometry is known [8]. If the positions of a set of points are known in 3D Euclidean space, then the projective distortion matrix (a collineation) can be computed to bring the projective structure to a Euclidean one, which eventually allows us to recover the camera projection matrices. Our method thus consists of the following steps:
1. Estimate the epipolar geometry using all available information.
2. Compute the camera perspective projection matrices with respect to a projective reference frame.
3. Projectively reconstruct the points belonging to the calibration object in 3D space.
4. Estimate the collineation matrix between the projective structure and the known Euclidean structure.
5. Compute the camera perspective projection matrices with respect to the Euclidean reference frame attached to the calibration object.
2.1 Determining the Epipolar Geometry of the Stereo Rig
This is the most crucial step, because everything else depends on the precision of the estimated epipolar geometry. The epipolar geometry can be described by a 3x3 matrix F, known as the fundamental matrix. This matrix is defined up to a scale factor, and its determinant is zero; the fundamental matrix is therefore defined by only seven parameters. In order to estimate the fundamental matrix, we must establish correspondences between the two images. To reduce the amount of information to be processed, we first extract a set of points of interest in each image, which correspond to points of high curvature. We then try to match these points between images using a classical correlation technique followed by a relaxation procedure which uses contextual information (neighborhood) to disambiguate matching candidates. However, as the two matching techniques use heuristics and the only geometric constraint, i.e. the epipolar constraint, is not yet available, a number of false matches inevitably exist. The false matches will severely affect the precision of the fundamental matrix, or even make the estimation useless, if we directly use the whole set of matches. For this reason, we apply a robust technique called least-median-of-squares to detect these false matches in order to estimate the fundamental matrix accurately. The idea underlying this technique is to find a fundamental matrix which is consistent with a majority of matches by searching in the parameter space. It is a very robust technique, and can detect false matches even when they amount to as much as 50% of the whole data set. This stage has been fully described in [39], where an automatic and robust algorithm has been developed to estimate the unknown epipolar geometry, in terms of the fundamental matrix, between two images. All matched points are then used together to estimate the fundamental matrix F effectively:

    \min_F \sum_i w_i [ d^2(m'_i, F m_i) + d^2(m_i, F^T m'_i) ] ,

where w_i stands for the uncertainty measure associated with each measured point match (m_i, m'_i), and d(m, l) denotes the Euclidean distance from the point m to the line l. Linear and nonlinear criteria, as well as different parameterizations of F, have been considered in [27, 39] to determine F accurately.
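To make the criterion concrete, here is a minimal Python sketch of the symmetric point-to-epipolar-line residual for a candidate F, together with the median-of-squares score that the robust stage minimizes over candidates. The function names are illustrative (not from the original implementation), the weights w_i are omitted, and the convention m'^T F m = 0 is assumed.

```python
import numpy as np

def point_line_distance(point, line):
    """Euclidean distance from a 2D point (u, v) to the line (a, b, c)
    defined by a*u + b*v + c = 0."""
    a, b, c = line
    u, v = point
    return abs(a * u + b * v + c) / np.hypot(a, b)

def symmetric_epipolar_residuals(F, pts1, pts2):
    """For each match (m_i, m'_i), return the summand of the criterion:
    d^2(m'_i, F m_i) + d^2(m_i, F^T m'_i)."""
    res = []
    for m, mp in zip(pts1, pts2):
        m_h = np.array([m[0], m[1], 1.0])     # homogeneous m
        mp_h = np.array([mp[0], mp[1], 1.0])  # homogeneous m'
        line2 = F @ m_h       # epipolar line of m  in image 2
        line1 = F.T @ mp_h    # epipolar line of m' in image 1
        res.append(point_line_distance(mp, line2) ** 2 +
                   point_line_distance(m, line1) ** 2)
    return np.asarray(res)

def lmeds_score(F, pts1, pts2):
    """Least-median-of-squares score used to rank candidate matrices F:
    the median of the symmetric squared residuals."""
    return float(np.median(symmetric_epipolar_residuals(F, pts1, pts2)))
```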
2.2 Computing the Camera Projection Matrices
Once we have computed the epipolar geometry (in terms of the fundamental matrix) between the two images, we are able to compute a projective reconstruction of the scene. To this end, we first compute the camera projection matrices, P and P', with respect to a projective basis. The method we use is to choose five pairs of corresponding points between the two cameras, no four of which are coplanar, as a projective basis. We could of course choose five corresponding points we have identified, but the precision of the final projective reconstruction would then depend heavily upon the precision of those pairs of points. To overcome this problem, we have chosen the following solution. We first choose five arbitrary points in the first image, denoted by m_i (i = 1, ..., 5). For each point m_i, its corresponding epipolar line is given by l'_i = F m_i. We can then choose an arbitrary point on l'_i as m'_i, the corresponding point of m_i. Finally, we verify that no four of the points are coplanar, which can easily be done using the fundamental matrix. The advantage of this method is that the five pairs of points satisfy the epipolar constraint exactly. Once we have the five pairs of points (m_i, m'_i) (i = 1, ..., 5), we can compute the camera projection matrices as described in [8]. Another possibility is to use a canonic factorization of F [28].
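A small sketch of this basis-construction step, under the same convention; the helper names are illustrative assumptions, and the non-coplanarity check mentioned above is omitted.

```python
import numpy as np

def point_on_line(line, u=100.0):
    """Return an arbitrary point (u, v) on the line (a, b, c) with
    a*u + b*v + c = 0; assumes b != 0 (the line is not vertical)."""
    a, b, c = line
    return np.array([u, -(a * u + c) / b])

def projective_basis_pairs(F, pts1):
    """Given five points m_i chosen freely in the first image, build
    corresponding points m'_i lying exactly on the epipolar lines
    l'_i = F m_i, so that each pair satisfies the epipolar constraint."""
    pairs = []
    for m in pts1:
        l = F @ np.array([m[0], m[1], 1.0])  # epipolar line in image 2
        pairs.append((np.asarray(m, dtype=float), point_on_line(l)))
    return pairs
```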
2.3 Projective Reconstruction
Now that the camera projection matrices of the stereo rig with respect to a projective basis are available, we can reconstruct 3-D structures with respect to that projective basis from point matches. This reconstruction can be done for all matches, but for the calibration purpose we only need to reconstruct the points corresponding to the calibration object.

Consider a pair of points in correspondence: m = [u, v]^T and m' = [u', v']^T. Let x~ = [x, y, z, t]^T be the corresponding 3-D point in space with respect to the projective basis chosen before. Following the pinhole model, we have

    s [u, v, 1]^T = P x~ ,      (1)
    s' [u', v', 1]^T = P' x~ ,  (2)

where s and s' are two arbitrary scalars. Denote by p_i the vector corresponding to the i-th row of P, and by p'_i the vector corresponding to the i-th row of P'. The two scalars can then be computed as s = p_3^T x~ and s' = p'_3^T x~. Eliminating s and s' from (1) and (2) yields the following equation:

    A x~ = 0 ,    (3)

where A is the 4x4 matrix given by

    A = [ p_1 - u p_3 ,  p_2 - v p_3 ,  p'_1 - u' p'_3 ,  p'_2 - v' p'_3 ]^T .

As the projective coordinates x~ are defined up to a scale factor, we can impose ||x~|| = 1; the solution to (3) is then well known to be the eigenvector of the matrix A^T A associated with the smallest eigenvalue. If we assume that no point is at infinity, we can instead impose t = 1, and the projective reconstruction can be done in exactly the same way as a Euclidean reconstruction.

The previous method has the advantage of providing a closed-form solution, but the disadvantage that the criterion being minimized does not have a good physical interpretation. Instead, we can carry out the minimization in the image plane, that is, minimize the following criterion:

    (u - p_1^T x~ / p_3^T x~)^2 + (v - p_2^T x~ / p_3^T x~)^2
    + (u' - p'_1^T x~ / p'_3^T x~)^2 + (v' - p'_2^T x~ / p'_3^T x~)^2 .

However, no closed-form solution exists in this case, and a standard iterative minimization technique must be used; the initial estimate of x~ can be obtained with the closed-form technique above. We have implemented both methods. Our experiments show that the improvement brought by the nonlinear minimization is rather small, mainly because the points on the calibration pattern are very precisely localized. Other projective reconstruction techniques can be found in [17, 35].
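For illustration, a minimal Python version of the closed-form method of equation (3), using the SVD to obtain the eigenvector of A^T A associated with the smallest eigenvalue; the interface is mine, not the paper's.

```python
import numpy as np

def triangulate_linear(P1, P2, m, mp):
    """Closed-form projective triangulation.
    P1, P2: 3x4 projection matrices (projective or Euclidean).
    m, mp:  matched pixels (u, v) and (u', v').
    Returns the homogeneous 4-vector x~ with ||x~|| = 1 solving
    A x~ = 0 in the least-squares sense."""
    u, v = m
    up, vp = mp
    A = np.array([
        P1[0] - u  * P1[2],
        P1[1] - v  * P1[2],
        P2[0] - up * P2[2],
        P2[1] - vp * P2[2],
    ])
    # The right singular vector of A with the smallest singular value is
    # the eigenvector of A^T A with the smallest eigenvalue.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]
```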
2.4 Estimating the Projective Distortion Matrix
Now we have reconstructed a set of 3-D points x~_i = [x_i, y_i, z_i, t_i]^T (i = 1, ..., n) with respect to a projective basis. For each of these points, we know its 3-D coordinates in a Euclidean reference frame precisely, because it belongs to the calibration object. Let the set of 3-D Euclidean points be X~_i = [X_i, Y_i, Z_i, 1]^T (i = 1, ..., n). The projective points x~_i are related to the Euclidean points X~_i by a collineation (a 4x4 matrix), which we call the projective distortion matrix D, i.e.,

    \lambda_i [x_i, y_i, z_i, t_i]^T = D [X_i, Y_i, Z_i, 1]^T ,    (4)

or

    \lambda_i x~_i = D X~_i ,    (5)

where

    D = [ D_11  D_12  D_13  D_14 ]
        [ D_21  D_22  D_23  D_24 ]
        [ D_31  D_32  D_33  D_34 ]
        [ D_41  D_42  D_43  D_44 ]

and \lambda_i is an arbitrary scalar, because D is defined only up to a scale factor.

From (5), the two vectors x~_i and D X~_i are related by a scalar. Let v = D X~_i = [v_1, v_2, v_3, v_4]^T; then we have the following three independent equations:

    y_i v_1 - x_i v_2 = 0 ,
    z_i v_1 - x_i v_3 = 0 ,    (6)
    t_i v_1 - x_i v_4 = 0 .

Let x = [D_11, D_12, ..., D_44]^T be the vector of the 16 parameters of the distortion matrix to be computed. It is easy to show that (6) is equivalent to the following equation:

    B_i x = 0 ,    (7)

where

    B_i = [ y_i X~_i^T   -x_i X~_i^T    0^T            0^T          ]
          [ z_i X~_i^T    0^T          -x_i X~_i^T     0^T          ]    (8)
          [ t_i X~_i^T    0^T           0^T           -x_i X~_i^T   ]

and 0 denotes the 4-dimensional zero vector. Given n (n >= 5) correspondences (X~_i, x~_i), we have n equations of the type (7). The problem is then to estimate x by minimizing the following error function:

    F = \sum_{i=1}^{n} ||B_i x||^2 = x^T ( \sum_{i=1}^{n} B_i^T B_i ) x .    (9)

Let B = \sum_{i=1}^{n} B_i^T B_i, which is a symmetric matrix. As D is defined only up to a scale factor, we can normalize x with ||x|| = 1. It is well known that the solution to (9) is then the eigenvector of B corresponding to the smallest eigenvalue of B.

Many other methods exist for estimating the projective distortion matrix [37]. The above method is simple, but it is not clear what is being minimized physically. We are currently evaluating different methods.
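A sketch of this linear estimation in Python, assembling B = sum_i B_i^T B_i from equation (8) and taking the smallest-eigenvalue eigenvector; the interface is illustrative.

```python
import numpy as np

def estimate_distortion_matrix(X_eucl, x_proj):
    """Estimate the 4x4 projective distortion matrix D from n >= 5
    correspondences between Euclidean points X~_i = (X, Y, Z, 1) and
    projective reconstructions x~_i = (x, y, z, t)."""
    B = np.zeros((16, 16))
    for X, (x, y, z, t) in zip(X_eucl, x_proj):
        Xt = np.asarray(X, dtype=float)      # 4-vector (X, Y, Z, 1)
        Bi = np.zeros((3, 16))
        Bi[0, 0:4], Bi[0, 4:8]   = y * Xt, -x * Xt
        Bi[1, 0:4], Bi[1, 8:12]  = z * Xt, -x * Xt
        Bi[2, 0:4], Bi[2, 12:16] = t * Xt, -x * Xt
        B += Bi.T @ Bi
    # Eigenvector of the symmetric matrix B with the smallest eigenvalue,
    # under the normalization ||x|| = 1 (eigh returns ascending order).
    _, eigvecs = np.linalg.eigh(B)
    return eigvecs[:, 0].reshape(4, 4)
```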
2.5 Recovering the Euclidean Camera Projection Matrices
Let the Euclidean camera projection matrices be M and M'. Following the pinhole model, we have

    s [u, v, 1]^T = M [X, Y, Z, 1]^T ,
    s' [u', v', 1]^T = M' [X, Y, Z, 1]^T .

Comparing these equations with (1) and (2), we obtain at once the Euclidean camera projection matrices:

    M = P D ,    M' = P' D .

The calibration of the binocular stereo is now achieved. The calibration matrices are estimated with respect to the world coordinate system associated with the calibration object, but taking into account the information from the environment.

Figure 1: A two-plane model used in the classical calibration technique

Figure 2: The stereo pair of the environment taken by the same stereo system as that used in Fig. 1
2.6 Experimental Results
Figure 3: Epipolar geometry estimated from the scene, applied to the calibration apparatus
Figure 4: Epipolar geometry estimated by using the matches from both the calibration apparatus and the environment

Here, we provide an extreme example, where the classical calibration does not work because the calibration apparatus occupies only a small part of the image. In a real situation the result is fortunately much better, but using data from the environment always improves the validity of the calibration result (see [41] for a more realistic example). Figure 1 shows a stereo pair of a two-plane model which is used by the classical calibration technique [11]. The two images are taken simultaneously. The corners (indicated by crosses) are detected with very high precision by intersecting the lines of the checkered pattern. Since the coordinates of their corresponding points are known in space, we can compute the perspective projection matrix for each camera, from which we can compute the epipolar geometry. The epipolar geometry is displayed with several epipolar lines. Actually, the right image was taken by a camera mounted on top of the other camera (i.e. in vertical configuration), so we expect parallel vertical epipolar lines. This means that the epipolar geometry estimated with the classical calibration technique is not good.

Figure 2 shows a pair of images taken by the same stereo rig from the environment. The matches have been established automatically by the image-matching program described in [39], without using knowledge of the camera configuration. The corresponding epipolar geometry is shown with the epipolar lines together with the matched points. We see that this estimate is much closer to what we expect. Now, let us check whether the estimated epipolar geometry is valid for the calibration apparatus. This is shown in Fig. 3, where the epipolar lines corresponding to the top left corner, the lower right corner, and two points of the scene are shown. The conclusion is clear: it is not valid for the calibration apparatus. Next, we combine the set of matches of the two-plane model with that found from the scene, and compute another fundamental matrix, denoted by F_all. The epipolar lines corresponding to the same four points are shown in Fig. 4. Note the dramatic change in the quality of the epipolar geometry.

Figure 5: 3D reconstruction after going back to the Euclidean space

From F_all and the knowledge of the two-plane model, we calibrate the stereo rig using the technique described in this paper. The calibration is now valid both for the two-plane model and for the scene. The 3D reconstruction of the two-plane model is shown in Fig. 5. Note that the 3D reconstruction result is not very good for this example because of the poor quality of the original images of the calibration apparatus.
3 Correlation-Based Stereo

The main difficulty in stereo vision is to establish correspondences between image pairs. The constraints used for reducing the complexity of the correspondence problem are basically of three kinds [10]:

1. Geometric constraints imposed by the imaging system. The most important is the epipolar constraint, available thanks to the stereo calibration.
2. Geometric constraints arising from the objects being looked at. For example, the assumption that their distance to the cameras varies slowly almost everywhere gives rise to the disparity gradient constraint.
3. Physical constraints, such as those arising from models of the way objects interact with the illumination. The simplest and most widely used such model is the Lambertian model [22].

The class of algorithms selected among several is that of correlation-based stereo algorithms, because they are the only ones that can produce sufficiently dense depth maps, with an algorithmic structure which lends itself nicely to fast implementations because of the simplicity of the underlying computation. The hypothesis underlying this class of algorithms is that the observed objects are approximately Lambertian. This technique has also been applied to planetary rovers by other researchers, e.g., [29].
3.1 Overview of the correlation-based stereo algorithm
The original algorithm is described in [12, 23], and later improvements and two hardware implementations are described in [10]. In the following, we briefly review the main steps; a code sketch follows at the end of this subsection.

Image rectification. All epipolar lines go through a point called the epipole. Under a general configuration of the stereo imaging system, the epipole is not at infinity, which makes the scanning task fairly complex and thus inefficient. By reprojecting the images onto a plane parallel to the line joining the optical centers, we can align the epipolar lines with the image rows, which greatly simplifies the correspondence process.

Computation of the correlation score. The score is computed over a fixed disparity search range. Several criteria have been tested, and normalized cross-correlation (with subtraction of the intensity mean in each correlation window) has been found to give the best results. The candidate matches are those which maximize the correlation score with respect to the disparity. A recursive implementation of the score computation makes the method independent of the size of the correlation window. The disparity range can be computed from a priori knowledge of the maximum and minimum distances of objects in the scene to the cameras.

Validation of matches. We perform the correlation twice, reversing the roles of the two images, and consider as valid only those matches for which the reverse correlation falls on the initial point in the left image. This symmetric test greatly reduces the probability of error.

Subpixel disparity estimation. A second-degree curve is fitted to the correlation scores in the neighborhood of the extremum, and the optimal disparity is computed by interpolation.

3D reconstruction. By intersecting the optical rays of two matched pixels we reconstruct the corresponding 3D point.

To increase the density of our potentially sparse disparity map, we use windows of a fixed size to perform the matching at several levels of resolution (computed by subsampling Gaussian-smoothed images). We then merge the disparity maps by selecting, for every pixel, the highest level of resolution for which a valid disparity has been found.
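Here is the promised sketch of the core matching loop in Python: normalized cross-correlation over a disparity search range on rectified images, followed by the left-right validation test. Border handling and the recursive score computation are omitted, and all names are illustrative.

```python
import numpy as np

def zncc(w1, w2):
    """Zero-mean normalized cross-correlation between two equally sized
    windows (intensity mean subtracted in each window)."""
    a = w1 - w1.mean()
    b = w2 - w2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else -1.0

def best_disparity(left, right, row, col, d_range, half=4):
    """Disparity in d_range maximizing the ZNCC score for the pixel
    (row, col) of the rectified left image (interior pixels assumed)."""
    w1 = left[row - half:row + half + 1, col - half:col + half + 1]
    scores = [zncc(w1, right[row - half:row + half + 1,
                             col - d - half:col - d + half + 1])
              for d in d_range]
    return d_range[int(np.argmax(scores))], max(scores)

def best_disparity_rl(right, left, row, col_r, d_range, half=4):
    """Best disparity for a right-image pixel, searching the left image."""
    w2 = right[row - half:row + half + 1, col_r - half:col_r + half + 1]
    scores = [zncc(left[row - half:row + half + 1,
                        col_r + d - half:col_r + d + half + 1], w2)
              for d in d_range]
    return d_range[int(np.argmax(scores))]

def validated_disparity(left, right, row, col, d_range, half=4):
    """Left-right consistency: keep the match only if the reverse
    correlation falls back on the initial left pixel."""
    d, _ = best_disparity(left, right, row, col, d_range, half)
    if best_disparity_rl(right, left, row, col - d, d_range, half) == d:
        return d
    return None
```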
3.2 Quality measures of the disparity estimation
Two additional measures for the disparity at each pixel are computed, which we call respectively the precision and the confidence.

The precision is related to the prominence of the extremum of the correlation-score curve: the narrower the peak, the more precise the localization of the matched point. For example, the disparity estimate shown in Fig. 6a is evidently more precise than that in Fig. 6b.

Figure 6: Illustration of the precision of the disparity estimate: correlation score versus disparity, with (a) a narrow and (b) a broad peak around the subpixel disparity between the integer disparities i-1, i, i+1

Quantitatively, we approximate the parabola of the correlation curve locally by a Gaussian around the extremum. Let the parabola be described by

    S = -a (d - d_0)^2 + b ,   with a, b > 0 ,

where S is the correlation score, d is the disparity, and (a, b, d_0) are the fitted parameters. The maximum value of S is equal to b. For a Gaussian function, the value at \sigma (one standard deviation) from the center is 1/e of the maximum value, where e is the base of the natural logarithm. The precision can then be defined as \sigma_S such that

    b / e = -a \sigma_S^2 + b ,

which gives

    \sigma_S = \sqrt{(e - 1) b / (e a)} .

The above formula is obtained by matching the value of the two functions at one standard deviation. Other possibilities include matching the area between -\sigma and \sigma.
It is very useful to know whether a validated match is reliable or not. Errors occur when a wrong peak slightly higher than the right one is chosen. So, if the correlation curve exhibits several peaks of approximately the same height, the risk of choosing the wrong one increases, especially if the images are noisy. The confidence of the disparity at each pixel is thus defined as

    C = (S_1 - S_2) / S_1 ,

where S_1 is the correlation score of the best candidate match, and S_2 is that of the second-best candidate match. Thus, C is related to the matching ambiguity for the pixel under consideration. On repetitive patterns the correlation curve has a periodic-like shape, and the confidence receives a very low value. To summarize, a pixel is marked as having no disparity estimate if its correlation score is below a prefixed threshold (say, 0.8) or if its confidence measure is below a prefixed threshold (say, 0.7). The uncertainty of each reconstructed point, modeled as a 3D covariance matrix, can be computed from the disparity precision and the uncertainty of the camera parameters.
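The following sketch gathers the subpixel interpolation and the two quality measures, assuming scores[d] holds the correlation score at integer disparity d and that i is the best integer disparity; the three-point parabola fit and all names are illustrative assumptions.

```python
import numpy as np

def subpixel_and_quality(scores, i):
    """Fit S = -a (d - d0)^2 + b through the scores at i-1, i, i+1 and
    return (subpixel disparity d0, precision sigma_S, confidence C)."""
    s_m, s_0, s_p = scores[i - 1], scores[i], scores[i + 1]
    a = -(s_p + s_m - 2.0 * s_0) / 2.0       # curvature (a > 0 at a peak)
    d0 = i + (s_p - s_m) / (4.0 * a)         # subpixel extremum
    b = s_0 + a * (d0 - i) ** 2              # peak score S_1
    # Precision: width at which the Gaussian approximation drops to b/e.
    sigma_s = (np.sqrt((np.e - 1.0) * b / (np.e * a))
               if a > 0 and b > 0 else np.inf)
    # Confidence: relative gap between best and second-best peaks.
    s2 = second_best_peak(scores, i)
    conf = (b - s2) / b if b > 0 else 0.0
    return d0, sigma_s, conf

def second_best_peak(scores, i):
    """Highest correlation score away from the neighborhood of the
    best peak (a simplified stand-in for true peak detection)."""
    masked = [s for j, s in enumerate(scores) if abs(j - i) > 1]
    return max(masked) if masked else -1.0
```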
3.3 An example
Figure 7 shows an original stereo pair of a rock scene used to test the vision algorithms developed. Two corresponding epipolar lines are also shown, which are not aligned. Figure 8 shows the rectified stereo pair after estimation of the epipolar geometry, where we can observe two horizontally aligned epipolar lines. Finally, Figure 9 shows a perspective representation of the 3-D reconstruction of the scene of Figure 7.
Figure 7: Original stereo pair
Figure 8: Rectified stereo pair
Figure 9: Perspective view of the 3-D reconstruction
4 Registration of 3D Visual Maps

The objective of this step is to compute precisely the displacement of the vehicle between successive views in order to register the 3-D visual maps reconstructed by the stereovision system. This step is indispensable in order to:

- localize the mobile vehicle better;
- eliminate errors introduced in stereo matching and reconstruction;
- build a more global Digital Elevation Map (DEM) of the environment.

A wealth of work has been done in this domain [18, 14, 25, 2, 7, 38, 20] (see [38] for more references). However, geometric matching in general is a difficult, unsolved problem; we have studied the state of the art, and a robust algorithm does not yet exist. Fortunately, the problem can be simplified in our application. Indeed, our rover is equipped with several sensors, such as an odometer and inertial measurement systems, which can provide an estimate of the displacement with reasonable precision. We can first apply this rough estimate of the displacement to the first frame to produce an intermediate frame; the motion between the intermediate frame and the second frame can then be considered small. The method described in the sequel then refines this initial estimate in order to register the two frames with very good precision.

The key idea underlying our approach is the following. Given that the motion between two successive frames is small, a point in the first frame is close to the corresponding point in the second frame. By matching points in the first frame to their closest points in the second, we can find a motion that brings the two sets of points closer. Applying this procedure iteratively, the algorithm yields a better and better motion estimate.
4.1 Problem Statement
A parametric surface S is a vector function x: R^2 -> R^3. In computer vision applications such as the one described in this paper, the data of a surface are usually available in the form of a set of 3-D points. The points in the first 3-D map are denoted by x_i (i = 1, ..., m), and those in the second map by x'_j (j = 1, ..., n). These points are sampled from S and S'. In the noise-free case, if S and S' are registered by a transformation T, then the distance of a point on S, after applying T, to S' is zero. The objective of registration is then to find the motion between the two frames, i.e., the rotation R and the translation t, such that the following criterion is minimized:

    F(R, t) = (1 / \sum_{i=1}^{m} p_i) \sum_{i=1}^{m} p_i d^2(R x_i + t, S')
            + (1 / \sum_{j=1}^{n} q_j) \sum_{j=1}^{n} q_j d^2(R^T x'_j - R^T t, S) ,    (10)

where d(x, S') denotes the distance of the point x to S' (to be defined below), and p_i (resp. q_j) takes the value 1 if the point x_i (resp. x'_j) can be matched to a point on S' in the second frame (resp. S in the first frame), and 0 otherwise. The minimum of F(R, t) is zero in the noise-free case. The parameter p_i is necessary because some points are visible from only one point of view and some are outliers. However, the minimization of F(R, t) is very difficult, not only because d(R x_i + t, S') is highly nonlinear (the point on S' corresponding to x_i is not known beforehand), but also because each p_i can take either 0 or 1 (an integer programming problem).

As said earlier, we follow a heuristic approach by assuming the motion between the two frames to be small or approximately known. In the latter case, we can first apply the approximate motion estimate to the first frame to produce an intermediate frame; the motion between the intermediate frame and the second frame can then be considered small. "Small" depends essentially on the scene of interest: if the scene is dominated by a repetitive pattern, the motion should not be bigger than half of the pattern period. Otherwise, other methods based on more global criteria could be used to recover a rough estimate of the motion. The algorithm described below can then be used to obtain a precise motion estimate.
4.2 Iterative Pseudo Point Matching Algorithm
The algorithm is an iterative process, each iteration consisting of the following steps (a condensed code sketch is given at the end of this subsection).

1. Find closest points.
Let us first define the distance d(x, S') between a point x and a shape S', which is used in (10). By definition,

    d(x, S') = \min_{x' \in S'} d(x, x') ,    (11)

where d(x_1, x_2) is the Euclidean distance between the two points x_1 and x_2, i.e., d(x_1, x_2) = ||x_1 - x_2||. In our case, S' is available as a set of points x'_j (j = 1, ..., n), and we use the following simplification:

    d(x, S') = \min_{j \in \{1,...,n\}} d(x, x'_j) .    (12)

The closest point y in the second frame to a given point x is thus the one satisfying

    d(x, y) <= d(x, z)   for all z in S' .

The worst-case cost of finding the closest point is O(n), where n is the number of points in the second frame.
The total cost of performing this computation for each point in the first frame is O(mn), where m is the number of points in the first frame. Several methods can considerably speed up the search, including bucketing techniques and k-D trees (short for k-dimensional binary search trees) [32]; k-D trees are used in our algorithm.

2. Match points.
For each point x we can always find a closest point y. However, because there are spurious points in both frames due to sensor error, or because some points visible in one frame are not visible in the other due to sensor or object motion, it may not make sense to pair x with y. Many constraints can be imposed to remove such spurious pairings. For example, distance continuity in a neighborhood, which is similar to the figural continuity in stereo matching [30, 31, 16], should be very useful for discarding false matches. These constraints are not incorporated in our algorithm, in order to keep it in its simplest form. Instead, we exploit the following two simple heuristics, both unary.

The first is a maximum tolerance for distance. If the distance between a point x_i and its closest point y_i, denoted by d(x_i, y_i), is bigger than the maximum tolerable distance D_max, then we set p_i = 0 in (10), i.e., we cannot pair any reasonable point in the second frame with x_i. This constraint is easily justified: since the motion between the two frames is small, the distance between two reasonably paired points cannot be very big. In our algorithm, D_max is set adaptively and in a robust manner during each iteration by analyzing the distance statistics, as described in the next step.

The second is orientation consistency. We can estimate the surface normal (referred to below as the orientation vector) at each point. It can easily be shown that the angle between the orientation vector at a point x and that at its corresponding point y in the second frame cannot exceed the rotation angle between the two frames [40]. Therefore, we can impose that the angle between the orientation vectors at two paired points should not be bigger than a prefixed value, the maximum rotation angle expected between the two frames. This constraint is not implemented for surface registration, because the computation of surface normals from 3-D scattered points is relatively expensive.

3. Update the matching.
Instead of using all matches recovered so far, we exploit a robust technique to discard several of them by analyzing the statistics of the distances.
The basic idea is that the distances between reasonably paired points should not be very different from each other. To this end, one parameter, denoted by D, needs to be set by the user; it indicates when the registration between two frames is good. In our implementation, D is set to 10 centimeters, which corresponds roughly to twice the resolution of a 3-D map reconstructed by correlation-based stereo for a depth range of about 10 meters. This gives satisfactory results.

Let D_max^I denote the maximum tolerable distance in iteration I. At this point, each point in the first frame (after applying the previously recovered motion) whose distance to its closest point is less than D_max^{I-1} is retained, together with its closest point and their distance. Let {x_i}, {y_i}, and {d_i} be, respectively, the resulting sets of original points, closest points, and distances after the pseudo point matching, and let N be the cardinality of these sets. Now compute the mean \mu and the sample deviation \sigma of the distances:

    \mu = (1/N) \sum_{i=1}^{N} d_i ,    \sigma = \sqrt{ (1/N) \sum_{i=1}^{N} (d_i - \mu)^2 } .

Depending on the value of \mu, we adaptively set the maximum tolerable distance D_max^I as follows:

    if \mu < D              /* the registration is quite good */
        D_max^I = \mu + 3\sigma
    else if \mu < 3D        /* the registration is still good */
        D_max^I = \mu + 2\sigma
    else if \mu < 6D        /* the registration is not too bad */
        D_max^I = \mu + \sigma
    else                    /* the registration is really bad */
        D_max^I = \xi
    endif

We are thus conservative, from the statistical viewpoint, when the registration is bad. The value of \xi is chosen automatically in order to include points associated with the global peak of the distance histogram; see [38] for more explanation of \xi.

At this point, we use the newly set D_max^I to update the matching previously recovered: a pairing between x_i and y_i is removed if their distance d_i is bigger than D_max^I. The remaining pairings are used to compute the motion between the two frames, as described below. Because D_max is set adaptively based on the statistics of the distances, our algorithm is rather robust to relatively big motions and to gross outliers. Even if several false matches remain in the retained set after the update, the least-squares technique still yields a reasonable motion estimate, sufficient for the algorithm to converge to the correct solution.

4. Compute motion.
At this point, we have a set of 3-D points which have been reasonably paired with a set of closest points, denoted respectively by {x_i} and {y_i}. Let N be the number of pairs.
Because N is usually much greater than 3 (three points are the minimum for the computed rigid motion to be unique), it is necessary to devise a procedure for computing the motion by minimizing the following mean-squares objective function:

    F(R, t) = (1/N) \sum_{i=1}^{N} || R x_i + t - y_i ||^2 ,    (13)

which follows directly from (10) with the definition of distance given by (12). Any optimization method, such as steepest descent, conjugate gradient, or the simplex method, can be used to find the least-squares rotation and translation. Fortunately, several much more efficient algorithms exist for solving this particular problem, including the quaternion method [9, 21], singular value decomposition [1], the dual-number-quaternion method [36], and the method proposed by [6]. We have implemented both the quaternion method and the dual-number-quaternion one; they yield exactly the same motion estimate.

5. Apply the motion.
Apply the estimated motion to the first frame, and go to the next iteration until convergence is reached.
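A condensed Python sketch of one iteration of the whole scheme, using scipy's k-D tree for the closest-point search and the SVD method [1] for the least-squares motion; the constants follow the adaptive rules above, while the interface, the xi fallback value, and the helper names are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

D = 0.10  # "good registration" distance parameter (10 cm, as in the text)

def adaptive_dmax(dists, xi):
    """Set D_max from the mean and deviation of the matched distances."""
    mu, sigma = dists.mean(), dists.std()
    if mu < D:       return mu + 3 * sigma   # registration quite good
    elif mu < 3 * D: return mu + 2 * sigma   # still good
    elif mu < 6 * D: return mu + sigma       # not too bad
    else:            return xi               # really bad: xi from histogram

def motion_from_pairs(x, y):
    """Least-squares R, t minimizing sum ||R x_i + t - y_i||^2
    (SVD method; x and y are N x 3 arrays of paired points)."""
    cx, cy = x.mean(axis=0), y.mean(axis=0)
    H = (x - cx).T @ (y - cy)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    return R, cy - R @ cx

def icp_iteration(pts1, tree2, pts2, dmax_prev, xi):
    """One iteration: closest points, robust pruning, motion update."""
    dists, idx = tree2.query(pts1)            # closest point in frame 2
    keep = dists < dmax_prev                  # maximum-tolerance heuristic
    dmax = adaptive_dmax(dists[keep], xi)     # update D_max from statistics
    keep = dists < dmax
    R, t = motion_from_pairs(pts1[keep], pts2[idx[keep]])
    return (pts1 @ R.T + t), dmax             # apply the motion

# Usage: tree2 = cKDTree(pts2); then call icp_iteration repeatedly,
# composing the per-iteration motions, until the termination test holds.
```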
4.3 Iteration-termination condition
Several remarks should be made here. First, the motion is computed between the original points in the first frame and the points in the second frame; the final motion given by the algorithm therefore represents the transformation between the original first frame and the second frame. Second, the iteration-termination condition is defined through the change in the motion estimate between two successive iterations. The change in translation at iteration I is defined as

    \delta t = || t_I - t_{I-1} || / || t_I || .

To measure the change in rotation, we use the rotation-axis representation, a 3-D vector denoted by r. Let \theta = ||r|| and n = r / ||r||; the relation between r and the quaternion q is

    q = [ \sin(\theta/2) n^T , \cos(\theta/2) ]^T .

We do not use the quaternions directly because their difference does not make much sense. We then define the change in rotation at iteration I as

    \delta r = || r_I - r_{I-1} || / || r_I || .

We terminate the iteration when both \delta r and \delta t are less than 1%, or when the number of iterations reaches a prefixed threshold (40 in our implementation). One could also define the termination condition through absolute changes, i.e., \delta r = || r_I - r_{I-1} || and \delta t = || t_I - t_{I-1} ||, stopping when \delta r is less than a threshold, say 0.5 degrees, and \delta t is less than a threshold, say 0.5 centimeters. For more details of this method, the interested reader is referred to [38].
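A small sketch of this termination test; rotation_vector converts a rotation matrix to the axis-angle vector r used above (names are mine, and a library routine could replace the conversion).

```python
import numpy as np

def rotation_vector(R):
    """Axis-angle vector r = theta * n of a 3x3 rotation matrix."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros(3)
    n = np.array([R[2, 1] - R[1, 2],
                  R[0, 2] - R[2, 0],
                  R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
    return theta * n

def converged(R_now, t_now, R_prev, t_prev, tol=0.01):
    """True when the relative changes in rotation and translation
    between two successive iterations are both below tol (1%)."""
    r_now, r_prev = rotation_vector(R_now), rotation_vector(R_prev)
    dr = np.linalg.norm(r_now - r_prev) / max(np.linalg.norm(r_now), 1e-12)
    dt = np.linalg.norm(t_now - t_prev) / max(np.linalg.norm(t_now), 1e-12)
    return dr < tol and dt < tol
```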
4.4 An Example
In Fig. 9 we already showed a 3D reconstruction of a rock scene. We moved the rover to a new position and ran the correlation-based stereo algorithm again to obtain a new set of 3D points. The cloud of 3D points corresponding to the second position is shown in Fig. 10. We show in Figs. 11 and 12 the result of the registration between Fig. 9 and Fig. 10. The correlation-based stereo system reconstructs 71505 points for the first position and 51503 points for the second. Figure 11 shows the superposition of the two sets of 3D points before registration, where the 3D points corresponding to the first position are drawn as quadrangles, while those corresponding to the second position are drawn as a shaded surface. Between the two positions, there is a rotation of 20 degrees around an axis in the ground plane and a rotation of 10 degrees around an axis perpendicular to the ground plane; the translation is 2.56 meters.

Figure 10: The cloud of 3D points reconstructed at the second position

The registration result is quite good, as shown in Fig. 12. We can clearly observe that the two surfaces overlap nicely but cross each other in a somewhat random fashion because of the random noise inherited from stereo matching and 3D reconstruction.
5 Incremental DEM Building

A depth map delivered by the correlation-based stereo algorithm cannot be used directly for path planning in rough terrain. Indeed, path-planning algorithms need a facet representation of the scene, whereas the stereo algorithm only provides a more or less dense and noisy set of points in space which must be structured into a surface. This is a difficult problem because the 3-D points can sample space in a very irregular fashion. Moreover, the stereo algorithm sometimes makes errors, which must be removed using robust techniques. Furthermore, a depth map obtained from one position is very local, and thus not very useful for navigation and environment understanding. We would like to fuse the sequence of data acquired during the exploration in order to build a global map of the surrounding environment. The technique described here builds a Digital Elevation Model of the environment, directly usable by a path-planning module. In order for a robot to execute a relatively complex task, a single representation is usually not sufficient. The approach adopted by LAAS [26, 4, 3] is to use jointly several types of representations: Bitmap (for fast terrain classification), DEM (for very rugged terrain), and Objects (for regular terrain with a few obstacles), in order to derive full benefit from the complementary information of different representations.
5.1 Algorithm
We call a digital elevation model (DEM) a regular grid in the horizontal xy plane. At each node of this grid we compute a height z from the local cloud of neighboring 3D points measured by various sensors. Since we do not intend to model precisely vertical or overhanging areas, we can reasonably assume that there is only one height z for a given (x, y) pair in the horizontal plane. We can therefore model the ground and possible rocks as a surface z = f(x, y). The height z for a given node (x, y) is obtained by fitting a surface to the local cloud of points using a standard least-squares technique; the intersection of the surface with the vertical straight line going through the node is then computed.

Figure 11: Superposition of the two sets of 3D points before registration: front view and top view

Figure 12: Superposition of the two sets of 3D points after registration: front view and top view

Our algorithm can be broken into the following steps [24]:

1. Bucket grid point sorting.
An initial sorting of the 3D points into buckets eases addressing for subsequent local surface computations.

2. Elimination of multiple elevations.
The stereo-correlation algorithm can fail in the binocular case and even generate entire regions with significantly erroneous disparity [33]. These errors are mainly generated along the
depth direction during 3D reconstruction and produce heaps of points situated at a wrong location in the DEM; they appear to float in space well above or below correctly reconstructed points. As a result, for some nodes we have two possible elevation values, one of which is false. Since our local surface-fitting technique does not deal with such cases of multiple elevation, we check whether the elevation histogram of the local cloud's points has at least two distinct peaks. We then look at the one-peak histograms in the neighborhood of the studied node to find the valid peak. Finally, we eliminate the points which belong to the other peaks.

3. Least-squares local surface adjustment.
Planar and quadric surface models are used. The latter are better suited to rock surfaces but are computationally intensive. Planar models have proved to be fast and suitable if one chooses a sufficiently fine DEM resolution. All 3D data required for the computation of the plane coefficients and the corresponding covariances can be condensed into a small number of moments (9, exactly). Once stored, such moments enable us to merge new 3D points easily (a sketch of this moment-based fit is given after this list). We simply take as least-squares criterion the difference between the observed elevations and those which would be reached if the points really belonged to the fitted local surface. In the case of a planar surface z = ax + by + c, if we have the 3D coordinates (x_i, y_i, z_i) of n points in the local cloud at the node being considered, the criterion is

    C = \sum_{i=1}^{n} w_i^2 (z_i - a x_i - b y_i - c)^2 .

The coefficient w_i expresses the confidence that we have in the estimation of the i-th 3D point; it is provided by the stereo-correlation module. We set to zero the three partial derivatives of C with respect to a, b, and c to compute the values of the three parameters of the plane to be fitted while minimizing the criterion C. The linear system obtained is expressed as a function of the following nine moments:

    S    = \sum_i w_i^2           S_x  = \sum_i w_i^2 x_i       S_y  = \sum_i w_i^2 y_i
    S_z  = \sum_i w_i^2 z_i       S_xx = \sum_i w_i^2 x_i^2     S_yy = \sum_i w_i^2 y_i^2
    S_xy = \sum_i w_i^2 x_i y_i   S_xz = \sum_i w_i^2 x_i z_i   S_yz = \sum_i w_i^2 y_i z_i

    [ S_xx  S_xy  S_x ] [ a ]   [ S_xz ]
    [ S_xy  S_yy  S_y ] [ b ] = [ S_yz ]
    [ S_x   S_y   S   ] [ c ]   [ S_z  ]

In the same way, the covariances of the a, b, and c coefficients are expressed directly as functions of the nine moments and of the determinant \Delta of the system's 3x3 matrix, e.g.,

    \sigma_aa = (S S_yy - S_y^2) / \Delta ,
    \sigma_bb = (S S_xx - S_x^2) / \Delta ,
    \sigma_cc = (S_xx S_yy - S_xy^2) / \Delta ,

and similarly for the cross terms, each being the corresponding cofactor of the system matrix divided by \Delta.

4. Removal of erroneous points and surface correction.
This optional stage is required when the sensory 3D data include large errors, to which least squares is very sensitive. We hypothesize that erroneous points are a minority among the
local clouds and hence lie far from the surfaces adjusted at the previous stage. A first method readjusts the surfaces by recursive iteration, assigning a coefficient to each point, until convergence; such a coefficient is chosen inversely proportional to the distance from the surface (in practice we choose w_i = 1/(1 + dist_i^2), whose value is less than 1). A second, faster method consists in directly removing all dubious points by examining point-to-surface discrepancies and then readjusting the local surface coefficients.

5. Construction of a graph of neighboring-local-surface compatibility.
The aim of this step is to cancel most of the remaining global errors. We define a criterion [13] indicating whether two neighboring local surfaces belong to a common global surface. We compute this criterion using the 4-neighbors of all the local surfaces to obtain the compatibility graph, and keep only the major connected components of this graph.

6. Computation of altitudes.
This is done by intersecting the local surfaces with the vertical lines going through the nodes of the xy-plane. Small gaps in the terrain model are filled in by closest-neighbor interpolation.

7. Incremental approach.
During real operations, sampled 3D points will not all be available simultaneously but will instead complement previous samples, and the elevation model will be updated incrementally. Hence, we must plan for an incremental adjustment of the local surfaces. This is done with least squares while retaining the information from prior 3D data: we store for each node of the DEM the 9 moments defined in step 3 and update them when new 3D points become available. Another method would be to update the surface coefficients using a Kalman filter, which has the advantage of taking 3D point uncertainty into account. Since the rover's memory capacity is limited, it progressively forgets the oldest DEM 3D data during motion: it first retains only the local surface coefficients, and then only the elevations.
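As promised above, a minimal sketch of the moment-based plane fit of step 3; the class layout and names are illustrative assumptions, and the covariance computation is omitted.

```python
import numpy as np

class NodeMoments:
    """The nine weighted moments stored at a DEM node; updating them with
    new 3D points is enough to refit the local plane z = ax + by + c."""
    def __init__(self):
        self.m = np.zeros((3, 3))   # normal-equation matrix of moments
        self.rhs = np.zeros(3)      # right-hand side (S_xz, S_yz, S_z)

    def add_points(self, pts, w):
        """Accumulate points (x_i, y_i, z_i) with confidences w_i."""
        for (x, y, z), wi in zip(pts, w):
            u = wi * wi * np.array([x, y, 1.0])
            self.m += np.outer(u, [x, y, 1.0])  # S_xx, S_xy, S_x, ..., S
            self.rhs += u * z                   # S_xz, S_yz, S_z

    def fit_plane(self):
        """Solve the 3x3 normal equations for (a, b, c)."""
        return np.linalg.solve(self.m, self.rhs)

    def elevation(self, x, y):
        """Intersect the fitted plane with the vertical line at (x, y)."""
        a, b, c = self.fit_plane()
        return a * x + b * y + c
```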
5.2 Results
We now show an experimental example. We erected several rocks on a test site to simulate a Martian terrain. Images were taken from four positions around the rocks; those taken by the first camera are shown in Fig. 13. The stereo rig is at about 6 meters from the scene. The examples we saw in Sect. 3 (Fig. 7) and Sect. 4 (Fig. 10) correspond, respectively, to the second and third positions of this sequence. Figure 14 shows the DEMs obtained incrementally. It takes only a few seconds to generate one of these DEMs on a Sun Sparc 10 processor. The coloration of the facets is a function of their elevations. The fusion procedure is carried out as follows. First, the elevation of each node in the DEM is computed by fitting a plane to the points located in its neighborhood, which requires computing the 9 moments; besides the elevation, the values of these 9 moments are also saved in the DEM. When a new cloud of points is available, the registration procedure computes its displacement with respect to the previous DEM, and the displacement estimate is applied to all points in the new cloud. Finally, we update the moments for each node of the DEM for which new points are found in its neighborhood, and the elevation is recomputed. Figure 14 clearly shows the successive filling of the occluded
First position
Second position
Third position
Fourth position

Figure 13: Images taken by the first camera
parts. The DEM obtained in the last position is almost complete and almost all obstacles are completely modeled. The DEM is now suitable for the path planning of the rover.
6 Conclusion

We have presented an onboard stereovision system for a planetary rover. In particular, the following four problems have been addressed:
DEM generated in the first position
DEM generated in the second position
DEM generated in the third position
DEM generated in the fourth position
Figure 14: Incremental building of the digital elevation model of the environment
On-site calibration: An effective technique has been developed which uses both the surrounding environment and the calibration apparatus. It can asymptotically produce an estimate of the epipolar geometry valid for the whole range, as long as the cameras are rigidly attached to each other.

Correlation-based stereo: The algorithm can produce sufficiently dense range maps. Two measures are available for the precision and confidence of the disparity estimation.

Registration: The technique registers successive depth maps by iteratively matching closest points. A robust method based on the statistics of the distance distribution allows us to deal with important problems such as outliers, occlusion, appearance and disappearance.

Fusion: The registered maps are fused, erroneous data are eliminated, and a global digital elevation model is built incrementally.

Future work will address the characterization of the uncertainty in this complete vision system.
Acknowledgment

The author thanks Rachid Deriche, Olivier Faugeras, Pascal Fua, Bernard Hotz, and Catherine Proy for their contributions. This work was carried out in part in the French CNES program VAP and in the European project IARES, and has benefited from the discussions with people at CNES, ITMI, and LAAS.
References

[1] K. Arun, T. Huang, and S. Blostein. Least-squares fitting of two 3-D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(5):698-700, Sept. 1987.

[2] P. J. Besl and N. D. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239-256, Feb. 1992.

[3] S. Betgé-Brezetz. Modélisation incrémentale et localisation par amers pour la navigation d'un robot mobile autonome en environnement naturel. PhD thesis, Université Paul Sabatier de Toulouse, France, 1996.

[4] S. Betgé-Brezetz, R. Chatila, and M. Devy. Object-based modelling and localization in natural environment. In Int'l Conf. Robotics and Automation, Nagoya, Japan, May 1995.

[5] L. Boissier and L. Petitjean. System design and architecture of the IARES mobile testbed project. In Proc. 2nd IARP Workshop on Robotics in Space, Montreal, Canada, July 1994.

[6] R. Brockett. Least squares matching problems. Linear Algebra and Its Applications, 122/123/124:761-777, 1989.

[7] Y. Chen and G. Medioni. Object modelling by registration of multiple range images. Image and Vision Computing, 10(3):145-155, Apr. 1992.

[8] O. Faugeras. What can be seen in three dimensions with an uncalibrated stereo rig? In G. Sandini, editor, Proc. 2nd European Conference on Computer Vision, pages 563-578, Santa Margherita Ligure, Italy, May 1992. Springer-Verlag.

[9] O. Faugeras and M. Hébert. The representation, recognition, and locating of 3D shapes from range data. The International Journal of Robotics Research, 5:27-52, 1986.

[10] O. Faugeras, B. Hotz, H. Mathieu, T. Viéville, Z. Zhang, P. Fua, E. Théron, L. Moll, G. Berry, J. Vuillemin, P. Bertin, and C. Proy. Real time correlation-based stereo: algorithm, implementations and applications. The International Journal of Computer Vision, 1996. To appear.

[11] O. Faugeras and G. Toscani. The calibration problem for stereo. In Proc. Int'l Conf. Computer Vision and Pattern Recognition, IEEE Publication 86CH2290-5, pages 15-20, Miami Beach, FL, June 1986. IEEE.

[12] P. Fua. A parallel stereo algorithm that produces dense depth maps and preserves image features. Machine Vision and Applications, 6(1), Winter 1993.
[13] P. Fua and P. Sander. Reconstructing surfaces from unstructured 3D points. In Proc. Image Understanding Workshop, San Diego, California, Jan. 1992.

[14] D. B. Gennery. Visual terrain matching for a Mars rover. In Proc. Int'l Conf. Computer Vision and Pattern Recognition, pages 483-491, San Diego, CA, June 1989.

[15] G. Giralt and L. Boissier. The French planetary rover VAP: Concept and current developments. In Proc. International Workshop on Intelligent Robots and Systems (IROS'92), pages 1391-1398, Raleigh, North Carolina, USA, July 1992.

[16] W. Grimson. Computational experiments with a feature based stereo algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(1):17-34, 1985.

[17] R. Hartley and P. Sturm. Triangulation. In Proc. ARPA Image Understanding Workshop, pages 957-966. Defense Advanced Research Projects Agency, Morgan Kaufmann Publishers, Inc., 1994.

[18] M. Hébert, C. Caillas, E. Krotkov, I. S. Kweon, and T. Kanade. Terrain mapping for a roving planetary explorer. In Int'l Conf. Robotics and Automation, pages 997-1002, 1989.

[19] M. Hebert and E. Krotkov. 3D measurements from imaging laser radars. Image and Vision Computing, 10(3), Apr. 1992.

[20] K. Higuchi, M. Hebert, and K. Ikeuchi. Building 3-D models from unregistered range images. In Int'l Conf. Robotics and Automation, San Diego, CA, May 1994.

[21] B. K. Horn. Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A, 4(4):629-642, Apr. 1987.

[22] B. K. P. Horn. Robot Vision. MIT Press, 1986.

[23] B. Hotz. Etude de techniques de stéréovision par corrélation - application au programme véhicule autonome planétaire (V.A.P.). Rapport de stage de D.E.A. TE/AE/SE/SR No 91/242, Robotique et Vision Artificielle, Université de Nice, Sept. 1991.

[24] B. Hotz, Z. Zhang, and P. Fua. Incremental construction of local DEM for an autonomous planetary rover. In Proc. Workshop on Computer Vision for Space Applications, pages 33-43, Antibes, France, Sept. 1993.

[25] I. Kweon and T. Kanade. High-resolution terrain map from multiple sensor data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):278-292, Feb. 1992.

[26] S. Lacroix, P. Fillatreau, F. Nashashibi, R. Chatila, and M. Devy. Perception for autonomous navigation in a natural environment. In Proc. Workshop on Computer Vision for Space Applications, pages 44-58, Antibes, France, Sept. 1993.

[27] Q.-T. Luong. Matrice Fondamentale et Calibration Visuelle sur l'Environnement - Vers une plus grande autonomie des systèmes robotiques. PhD thesis, Université de Paris-Sud, Centre d'Orsay, Dec. 1992.

[28] Q.-T. Luong and T. Viéville. Canonic representations for the geometries of multiple projective views. In J.-O. Eklundh, editor, Proc. 3rd European Conference on Computer Vision, pages 589-599, Vol. 1, Stockholm, Sweden, May 1994. Springer-Verlag.
[29] L. Matthies. Stereo vision for planetary rovers: Stochastic modeling to near real-time implementation. The International Journal of Computer Vision, 8(1):71-91, July 1992.

[30] J. Mayhew and J. Frisby. Psychophysical and computational studies towards a theory of human stereopsis. Artificial Intelligence, 17:349-386, 1981.

[31] S. Pollard, J. Mayhew, and J. Frisby. PMF: a stereo correspondence algorithm using a disparity gradient constraint. Perception, 14:449-470, 1985.

[32] F. Preparata and M. Shamos. Computational Geometry. Springer-Verlag, New York, 1985.

[33] C. Proy, B. Hotz, O. Faugeras, P. Garnesson, and M. Berthod. Onboard vision system for a mobile planetary exploration robot. In Workshop on Computer Vision for Space Applications, pages 2-12, Antibes, France, Sept. 1993.

[34] L. Robert, M. Buffa, and M. Hebert. Weakly-calibrated stereo perception for rover navigation. In Proc. 5th Int'l Conf. Computer Vision, pages 46-51, Boston, MA, June 1995. IEEE Computer Society Press.

[35] C. Rothwell, G. Csurka, and O. Faugeras. A comparison of projective reconstruction methods for pairs of views. In Proc. 5th Int'l Conf. Computer Vision, pages 932-937, Boston, MA, June 1995. IEEE Computer Society Press.

[36] M. W. Walker, L. Shao, and R. A. Volz. Estimating 3-D location parameters using dual number quaternions. CVGIP: Image Understanding, 54(3):358-367, Nov. 1991.

[37] Z. Zhang. Computing the projective distortion matrix (collineation). Technical report, INRIA Sophia-Antipolis, Robotvis, Nov. 1993.

[38] Z. Zhang. Iterative point matching for registration of free-form curves and surfaces. The International Journal of Computer Vision, 13(2):119-152, 1994.

[39] Z. Zhang, R. Deriche, O. Faugeras, and Q.-T. Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence Journal, 78:87-119, Oct. 1995.

[40] Z. Zhang, O. Faugeras, and N. Ayache. Analysis of a sequence of stereo scenes containing multiple moving objects using rigidity constraints. In Proc. 2nd Int'l Conf. Computer Vision, pages 177-186, Tampa, Florida, Dec. 1988. IEEE. Also as a chapter in R. Kasturi and R. C. Jain (eds.), Computer Vision: Principles, IEEE Computer Society Press, 1991.

[41] Z. Zhang, O. Faugeras, and R. Deriche. Calibrating a binocular stereo through projective reconstruction using both a calibration object and the environment. In R. Mohr and C. Wu, editors, Proc. Europe-China Workshop on Geometrical Modelling and Invariants for Computer Vision, pages 253-260, Xi'an, China, Apr. 1995.