Cooperative Stereo-Motion: Matching and Reconstruction


F. Dornaika and R. Chung
Department of Mechanical and Automation Engineering
The Chinese University of Hong Kong
Shatin, NT, Hong Kong
E-mail: fdornaika,[email protected]

1998 IEEE International Conference on Intelligent Vehicles

Abstract: One interesting application of computer vision is the 3D structure recovery of scenes. Traditionally, two cues are used: structure from motion and structure from stereo, two subfields with complementary sets of assumptions and techniques. This paper introduces a new method of cooperation between stereo and motion which can be used with a calibrated or uncalibrated stereo rig. More precisely, we combine the advantages of both cues: i) the easy correspondence problem from motion and ii) the accurate 3D reconstruction from stereo. We show how the stereo matching can be recovered from motion correspondences using only geometric constraints. Once the stereo correspondences are recovered, the scene can be reconstructed using all stereo pairs. Experiments involving real stereo pairs indicate that rich and reliable information (about the scene and/or the stereo rig) can be derived from this cooperation. They also indicate that robust 3D reconstruction can be obtained even with short image sequences.

I. Introduction

One of the most interesting goals of computer vision is the 3D structure recovery of scenes. This recovery has many applications such as object recognition and modeling, automatic cartography, and autonomous robot navigation. One way to reconstruct 3D information of a scene is to use different views of it. For this, two cues are possible. One is the visual motion cue, in which 3D structure is recovered from a moving camera. The other is the stereo vision cue, in which 3D structure is recovered from two widely separated views of the same scene. Both cues require solutions to two subproblems: i) the correspondence problem and ii) the reconstruction problem.

The motion cue has the advantage that the correspondence problem is relatively easy to solve, but it requires a long image sequence for accurate reconstruction. The stereo cue has the advantage that the separation of the two views is significant enough to obtain accurate 3D reconstruction; however, it has a difficult correspondence problem, especially when the two views have many features. Several factors make the stereo correspondence problem difficult: occlusions, photometric and figural distortions, and deciding what window size to use in area-based matching. There has been some work on combining the stereo and motion cues [1], [2]. However, most of it concentrated on exploiting the redundancy in the input data (stereo and motion data) in recovering 3D structure; what can be gained by combining them, other than redundancy, was not explicitly addressed. In this paper, we present a new approach to combining the two cues which captures the advantages of both: the easy correspondence problem from motion and the accurate 3D reconstruction from stereo. In particular, we show how stereo correspondences can be recovered from motion correspondences using only geometrical constraints. This stereo matching can be performed without a priori information about the stereo rig parameters. Once the stereo matching has been recovered, either off-line or on-line parameters of the stereo head can be used for recovering the 3D shape of the observed scene using stereo and motion data. The method described here has been inspired by [3]. The primary difference is that our method deals with a full perspective camera model, not its affine approximation as in [3].

Several advantages can be attributed to the proposed method: i) the stereo correspondence can be performed without the use of area-based stereo matching algorithms, ii) no long image sequences are required for accurate 3D structure reconstruction, since widely separated views are always in the image data, iii) the camera model covers a full perspective projection, not its affine approximation, and iv) no knowledge about the stereo rig parameters (both the intrinsic and the extrinsic parameters of the cameras) or the relative motion is required.

This paper is organized as follows. Section II defines the problem we focus on. Section III describes the projective reconstruction from one image sequence. Section IV uses the results of section III and describes the recovery of stereo correspondences from motion correspondences. Section V discusses the problem of 3D reconstruction. Experimental results are presented in section VI.

II. Problem definition

We adopt the pin-hole model for the cameras. This means that the projection of the 3D scene onto the 2D images is described by a full perspective projection (see Figure 1). A point feature of the scene will be denoted by M. For simplicity, the 3D projective coordinates of this point, a 4-vector, will also be denoted by M. The image of this point will be denoted by m in the left image, and m′ in the right image. We assume that the stereo rig moves in front of the scene such that we can obtain f stereo pairs. Thus, we have 2 sequences, each containing f images. By applying classic tracking methods to each sequence, one can obtain 2 different sets of feature points. We point out that the features in these 2 sets are not in one-to-one correspondence, since they are obtained independently. Our goal is two-fold. First, we have to establish the matching between corresponding features in the 2 sequences, i.e. stereo matching. Second, we have to compute the 3D structure of these features, i.e. 3D reconstruction. In the next section, we will show how point-to-point correspondences in one single sequence can be represented in another useful form: the 3D projective representation [4]. The 3D projective structure is not the Euclidean one; additional knowledge is needed to upgrade this projective structure to the Euclidean one [5].
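As a minimal sketch of the pin-hole (full perspective) projection in homogeneous coordinates, the following assumes numpy and purely illustrative camera parameters (the matrix values are assumptions, not data from the paper):

```python
import numpy as np

# Illustrative 3x4 perspective projection matrix P = K [R | t]
# (all values are hypothetical, chosen only to demonstrate the model).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])    # intrinsic parameters
R = np.eye(3)                            # camera rotation
t = np.array([[0.0], [0.0], [5.0]])      # camera translation
P = K @ np.hstack([R, t])                # full perspective camera matrix

# A scene point M in homogeneous coordinates (a 4-vector, as in the paper).
M = np.array([0.1, -0.2, 1.0, 1.0])

# Pin-hole projection: the image point is P M, defined up to a scale factor.
m_h = P @ M
u, v = m_h[0] / m_h[2], m_h[1] / m_h[2]  # divide out the scale factor
print(u, v)
```

The division by the third homogeneous coordinate is what makes the model fully perspective rather than affine.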

Fig. 1. 3D reconstruction from stereo-motion. (The figure shows a 3D scene point M, its projections m and m′ in the left and right image sequences, the motion of the rig, and the stereo geometry R, t.)

III. Projective reconstruction

We show in this section how a projective reconstruction of the scene can be obtained from feature correspondences in one sequence. In other words, we estimate the position of features in a 3D projective space given their projections in at least two perspective images. We assume that most features are correctly tracked in the sequence. A tracked feature can thus be represented in a compact form, namely by its projective coordinates instead of its 2D coordinates across the sequence. There are several methods for computing a projective reconstruction [6], [7], [8]. In this section, we focus on two cases: i) the case of two images and ii) the case of multiple images.
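As a concrete sketch of the two-view case, the closed-form projective triangulation used in this section (eliminating the scale factors and solving A x = 0 via the smallest singular vector, as detailed below) can be written with numpy; the cameras and the scene point here are illustrative assumptions:

```python
import numpy as np

def triangulate(P1, P2, m1, m2):
    """Closed-form projective triangulation: eliminating the scale factors in
    s_i (u_i, v_i, 1)^T = P_i x yields four homogeneous equations A x = 0;
    the unit-norm solution is the singular vector of A for the smallest
    singular value (equivalently, the eigenvector of A^T A with the
    smallest eigenvalue)."""
    u1, v1 = m1
    u2, v2 = m2
    A = np.vstack([u1 * P1[2] - P1[0],
                   v1 * P1[2] - P1[1],
                   u2 * P2[2] - P2[0],
                   v2 * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]                    # x = (x, y, z, t)^T, up to sign

# Illustrative check with two synthetic cameras (assumed values).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X = np.array([0.5, 0.3, 4.0, 1.0])   # hypothetical ground-truth point
m1 = (P1 @ X)[:2] / (P1 @ X)[2]
m2 = (P2 @ X)[:2] / (P2 @ X)[2]
x = triangulate(P1, P2, m1, m2)
x = x / x[3]                         # fix the projective scale for comparison
print(x)                             # recovers X up to numerical error
```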

A. Two images

1) Camera matrices: We have adopted the following method, which is considered to produce the best results since the epipolar geometry (and hence the camera matrices) is estimated from a large number of image correspondences [6]. It is well known that point-to-point correspondences between two images define the epipolar geometry between the first image and the second image. Algebraically, the epipolar geometry is defined by a 3x3 homogeneous matrix of rank 2 known as the fundamental matrix. This matrix has been thoroughly studied, and robust algorithms exist for estimating it from image data [9]. In order to estimate the projective coordinates of any observed feature, we have to derive a solution for the camera matrices which is consistent with the fundamental matrix. Let P1 and P2 be two 3x4 projective matrices describing the mappings from the 3D projective space to the first and second images, respectively. Since the reconstruction is performed in the projective space, these matrices can be written as (equality holding up to a scale factor):

P1 = [I | 0]   and   P2 = [P | p]

where I is the 3x3 identity matrix, P is a 3x3 non-singular matrix, and p is a 3-vector. Matrix P1 is already known, and matrix P2 can be estimated from the fundamental matrix as follows. The fundamental matrix F12 can be written as a function of the matrix P2:

F12 = S(p) P

where S(p) is the skew-symmetric matrix associated with the 3-vector p. Therefore, P2 can be derived from F12 by decomposing the latter into a skew-symmetric matrix and a non-singular matrix. Notice, however, that this decomposition is not unique. Indeed, the matrix P + p a^T satisfies the decomposition for any arbitrary 3-vector a, because S(p) p = 0. To find a particular solution for P2 we use the following. Let U D V^T be the singular value decomposition of F12. D is a diagonal matrix and, since by definition F12 has rank 2, D = D(λ, μ, 0). Thus, we have:

F12 = U D(λ, μ, 0) V^T = [U Z U^T] [U Y D(λ, μ, γ) V^T]

where the first factor is S(p), the second factor is P, γ is a positive number (γ can be set to (λ + μ)/2), and:

        ( 0  1  0 )             ( 0 -1  0 )
    Z = (-1  0  0 )   and   Y = ( 1  0  0 )
        ( 0  0  0 )             ( 0  0  1 )

To conclude, the most general form of the projection matrix P2 is P2 = (P + p a^T | α p), where a is an arbitrary 3-vector and α is an arbitrary non-zero scalar.
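Assuming numpy, this SVD-based decomposition can be sketched as follows; the synthetic fundamental matrix used for the check is an illustrative assumption, not data from the paper:

```python
import numpy as np

def camera_from_fundamental(F, gamma=None):
    """Decompose a rank-2 fundamental matrix F into S(p) P via its SVD,
    following the construction in the text: with F = U diag(l, m, 0) V^T,
    take S(p) = U Z U^T and P = U Y diag(l, m, gamma) V^T.
    Returns P2 = [P | p] together with the two factors and p."""
    U, d, Vt = np.linalg.svd(F)
    l, m = d[0], d[1]
    if gamma is None:
        gamma = (l + m) / 2.0                 # any positive value works
    Z = np.array([[0.0,  1.0, 0.0],
                  [-1.0, 0.0, 0.0],
                  [0.0,  0.0, 0.0]])
    Y = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    Sp = U @ Z @ U.T                          # skew-symmetric factor S(p)
    P = U @ Y @ np.diag([l, m, gamma]) @ Vt   # non-singular factor
    p = np.array([Sp[2, 1], Sp[0, 2], Sp[1, 0]])  # read p off S(p)
    return np.hstack([P, p.reshape(3, 1)]), Sp, P, p

# Illustrative check: build a synthetic rank-2 matrix F = S(t) R with R = I.
t = np.array([1.0, 0.2, -0.5])
St = np.array([[0.0, -t[2], t[1]],
               [t[2], 0.0, -t[0]],
               [-t[1], t[0], 0.0]])
F = St @ np.eye(3)
P2, Sp, P, p = camera_from_fundamental(F)
print(np.allclose(Sp @ P, F))   # the decomposition reproduces F
```

Note that Z Y D(λ, μ, γ) = D(λ, μ, 0), which is why the product of the two factors reproduces F exactly.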


2) Computing the projective structure:

Closed-form solution. Given a point-to-point correspondence (m1 ↔ m2), one can compute the 3D projective coordinates of the corresponding scene point using the two projection matrices P1 and P2 estimated above. Let x = (x, y, z, t)^T be the unknown 3D projective coordinates, and let (u1, v1) and (u2, v2) be the image coordinates of m1 and m2, respectively. The vector x can be computed from these two vector equations:

s1 (u1, v1, 1)^T = P1 x   and   s2 (u2, v2, 1)^T = P2 x

By eliminating the 2 arbitrary scale factors s1 and s2 in these equations, one obtains the following four homogeneous linear equations:

A x = 0   (1)

Since the projective coordinates are defined up to a scale factor, we can impose ||x|| = 1; the solution to (1) is then well known to be the eigenvector of the matrix A^T A associated with the smallest eigenvalue.

Non-linear methods. The previous approach has the advantage of providing a closed-form solution, but the disadvantage that it does not minimize a physically meaningful quantity. Indeed, the quantity we want to minimize is the error measured in the image plane between the observation and the projection of the reconstruction, that is:

(u1 - p11^T x / p13^T x)^2 + (v1 - p12^T x / p13^T x)^2 + (u2 - p21^T x / p23^T x)^2 + (v2 - p22^T x / p23^T x)^2

with p1k^T and p2k^T being the kth rows of matrices P1 and P2, respectively. In this case, we may use standard iterative minimization techniques; we use the Levenberg-Marquardt technique [10]. The initial guess of the solution is provided by the closed-form method.

B. Multiple images

In this case, one can estimate simultaneously the projective mappings Pj, j = 1 … f, and the 3D projective coordinates xi, i = 1 … n, by minimizing a global error function of the following form (there are 11·f + 3·n unknowns):

min Σ_{j=1}^{f} Σ_{i=1}^{n} [ (uij - ûij)^2 + (vij - v̂ij)^2 ]   (2)

where f is the number of images, n is the number of tracked features, and (ûij, v̂ij) is the reprojection of feature xi using the mapping Pj.

Another method, which is simpler than (2), consists of computing 2 projective matrices (e.g. P1 and Pf) using the 2-frame scheme. The remaining camera matrices P2, …, P(f-1) are then recovered from 3D/2D correspondences between a reconstructed subset of features and their images in the corresponding frames. Once the camera matrices have been estimated, additional or updated projective coordinates can be estimated using all camera matrices, which allows us to derive a reliable projective reconstruction. Once again, the closed-form solution can be applied as well as the non-linear method (both are generalizations of the methods described in paragraph III.A.2).

IV. Stereo correspondences from motion correspondences

In the previous section, we have shown how a 3D projective reconstruction of the scene can be obtained from the motion of one camera. Thus, using the 2 sequences obtained by the 2 cameras, one can estimate two 3D projective reconstructions. The first representation is expressed in some left projective basis, the other in some right projective basis. Our goal is to put into correspondence the features extracted in the left and right views.

A. Mapping projective space - image plane

We now consider one stereo pair and, without loss of generality, assume that this stereo pair corresponds to the first position of the stereo rig. We assume that a few point-to-point correspondences have been established between the left image and the right image. To do so, one can use the method described in [11], which combines proximity and similarity criteria into one single criterion. More precisely, we denote by si, i = 1 … m, the left image points and by s′i their corresponding points in the right image. Let Si and S^r_i be the projective coordinates of the corresponding 3D points in the left and right projective bases, respectively. Since points are tracked in both the left and right sequences, we obtain m·f point-to-point correspondences (f is the number of stereo pairs). Since the fundamental matrix F of the stereo rig does not vary, we can use these correspondences to obtain an initial estimate of F. An on-line computation of the fundamental matrix is therefore possible (m·f > 7). It is straightforward that we have m 3D-to-2D point-to-point correspondences between the left (right) projective basis and the right (left) image. Let Pl be the 3x4 projective matrix describing the mapping between the left projective reconstruction of the scene and the first right image (see Figure 2). Thus


we have the following constraints (i = 1 … m):

s′i = Pl Si   (3)

The mapping Pl can be linearly estimated from (3), provided that the number of initial stereo correspondences is equal to or greater than 6 (m ≥ 6). It is important to stress that the estimation of the mapping Pl does not amount to camera calibration, since the mapping is from the 3D projective space, not from the 3D Euclidean space. Similarly, the mapping Pr between the right projective reconstruction and the first left image can be estimated from the following constraints (i = 1 … m):

si = Pr S^r_i   (4)

Fig. 2. Prediction of the 2D location of the corresponding feature. First, the left feature is projectively reconstructed (1). Second, it is projected into the right image using the mapping between the left projective basis and the right image (3) (this mapping is recovered from a few initial stereo correspondences (2)).

B. Stereo correspondences recovery

Now consider any feature m that is tracked in the left sequence. Let M be the 3D projective coordinates of this point in the left projective basis. The 2D location m′* of its correspondence in the right image can be predicted by (see Figure 2):

m′* = Pl M   (5)

In the presence of noise, m′* will not coincide with the real location of the stereo match of m. Therefore, we use this location together with the epipolar line (represented by F m) to determine a small neighborhood in the right image in which one expects to find the stereo match of m (see Figure 3). Since feature extraction is initially done in a separate manner for the left and right sequences, some of the tracked features may not have a corresponding match in the other sequence. Therefore, if some extracted right features belong to the defined neighborhood, the stereo match of m will be the feature closest to the associated epipolar line. Similarly, for any feature m′ that is tracked in the right sequence, one can predict the 2D location m* of its correspondence in the left image as:

m* = Pr Mr   (6)

where Mr represents the 3D projective coordinates of m′ in the right projective basis. In brief, matrices Pl and Pr are used to establish the stereo correspondences between the two views. This separation is very useful when features are tracked in only one sequence.

Fig. 3. The predicted location m′* together with the epipolar line define a small neighborhood in the right image. The stereo match is therefore the extracted feature which belongs to this neighborhood. In case there is more than one, the stereo match is the feature closest to the epipolar line.

V. Scene reconstruction

Once the stereo correspondences are established across the stereo pairs, the 3D Euclidean coordinates of these correspondences can be determined. Two cases arise, according to the stereo rig calibration. The first case concerns a calibrated stereo rig, i.e. the stereo rig parameters are known. In this case, we estimate the 3D structure of a subset of features using one stereo pair. Then, camera motions are recovered from 3D/2D correspondences using pose estimation algorithms [12], [13]. Eventually, the 3D structure is computed using all stereo pairs. The second case concerns an uncalibrated stereo rig. In this case, we can apply self-calibration techniques [14] in order to recover both the stereo rig parameters and the 3D shape (up to a scale factor).

VI. Experiments

First experiment. The proposed method was applied to real stereo pairs. The first experiment was performed on 2 stereo pairs (see Figure 4) of a laboratory scene in which an oscilloscope and a soda can were observable, using 2 CCD cameras with 50 mm lenses.


These 2 cameras have a baseline of 60 cm. The motion of the stereo rig is approximately 7 cm. Image points were extracted in the first pair using Moravec's interest operator. Altogether, 242 and 281 feature points were successfully tracked in the two stereo pairs. A total of 21 initial stereo matches were located in the left and right images. By applying the proposed method, we can find 120 successful stereo matches. Figure 4 shows the two stereo pairs and the recovered stereo matches (the first 2 images). Figure 5 illustrates a top view of the reconstructed scene using these 2 pairs.

Fig. 4. The 2 stereo pairs and the recovered stereo correspondences (first 2 images).

Second experiment. The second experiment was performed on a bowl 1.5 m away from the cameras, whose baseline is 40 cm. The stereo rig performed five small motions, allowing us to obtain 6 stereo pairs of the bowl. The whole displacement of the stereo rig is 5 cm. A total of 18 initial stereo matches were located in the left and right images. By applying the proposed geometrical constraints, we can find 204 successful stereo matches, whose positions are shown in Figure 6 (top two images). The third plot of this figure illustrates a top view of the reconstructed bowl using the six stereo pairs (off-line calibration parameters of the stereo rig have been used). The fourth plot illustrates the 3D reconstruction using the on-line computed parameters. Using a special metric that does not depend on the scale factor, we computed the distance between the two reconstructed shapes and found that the average equals 0.000129, which means that the on-line reconstruction is accurate.

VII. Conclusion

In this paper, we have investigated the cooperation between stereo and motion. We have developed a method that combines the advantages of both cues: easy correspondences in the motion cue and accurate 3D reconstruction in the stereo cue. We have proposed a geometrical approach that recovers stereo correspondences from motion correspondences, thereby avoiding area-based stereo matching and its associated problems. One interesting feature of the method is that no long image sequences are required. Furthermore, the approach can be applied with an uncalibrated stereo rig and without any condition on the spatial position of the scene with respect to the cameras. We have tested the method on real stereo pairs and shown that rich and reliable information can be derived from such a cooperation.

Acknowledgment

This research was supported by the Hong Kong Research Grants Council (RGC) under the 1997-8 Earmarked Grant for Research. It is also part of the project "A Next-generation Intelligent Robot with Creativity" under the Strategic Research Programme of the Chinese University of Hong Kong.


Fig. 5. A top view of the reconstructed scene using 2 stereo pairs. (The plot's axes are the X-axis and Z-axis, in mm.)

Fig. 6. One among the six stereo pairs and the recovered stereo matches (the top 2 images). The third plot illustrates the 3D reconstruction using the six stereo pairs (top view). The fourth plot illustrates the 3D structure obtained using the on-line stereo parameters. In the latter case, the structure is defined up to a scale factor.

References

[1] A. M. Waxman and J. H. Duncan, "Binocular image flows: Steps toward stereo-motion fusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, pp. 715-729, 1986.
[2] A. Mitiche, "A computational approach to the fusion of stereo and kineopsis," in Motion Understanding: Robot and Human Vision, pp. 81-95, Kluwer Academic Publishers, 1988.
[3] P. K. Ho and R. Chung, "Stereo-motion that complements stereo and motion analyses," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, June 1997, pp. 213-218.
[4] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, The MIT Press, 1993.
[5] B. Boufama, R. Mohr, and F. Veillon, "Euclidean constraints for uncalibrated reconstruction," in Proceedings of the Fourth International Conference on Computer Vision, 1993, pp. 466-470.
[6] C. Rothwell, G. Csurka, and O. Faugeras, "A comparison of projective reconstruction methods for pairs of views," in Proc. of the International Conference on Computer Vision, 1995, pp. 932-937.
[7] R. I. Hartley, "Projective reconstruction and invariants from multiple images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 10, pp. 1036-1041, 1994.
[8] R. Mohr, F. Veillon, and L. Quan, "3D relative reconstruction using multiple uncalibrated images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1993, pp. 543-548.
[9] Z. Zhang, "Determining the epipolar geometry and its uncertainty: A review," International Journal of Computer Vision, 1997.
[10] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, New York, 1992.
[11] M. Pilu, "A direct method for stereo correspondence based on singular value decomposition," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, June 1997.
[12] F. Dornaika and C. Garcia, "Pose estimation using point and line correspondences," Journal of Real-Time Imaging, 1998.
[13] D. DeMenthon and L. Davis, "Model-based object pose in 25 lines of code," International Journal of Computer Vision, vol. 15, pp. 123-141, June 1995.
[14] Z. Zhang, Q. Luong, and O. Faugeras, "Motion of an uncalibrated stereo rig: self-calibration and metric reconstruction," IEEE Trans. on Robotics and Automation, vol. 12, no. 1, pp. 103-113, 1996.
