A New Structure From Motion Ambiguity

John Oliensis ([email protected])
NEC Research Institute, 4 Independence Way, Princeton, N.J. 08540

Abstract

We demonstrate the existence of a new approximate ambiguity in structure from motion which occurs as generically as the bas-relief ambiguity but applies more strongly to scenes with larger depth variation. It occurs for moderate translations and field of view (as for the bas-relief ambiguity) and applies to multi-frame, finite-motion sequences where the camera moves roughly along a line, as well as to optical flow. Previous work on the bas-relief ambiguity gave a partial characterization of the error sensitivities in recovering the camera heading, assuming that the scene was non-planar and that the heading was sufficiently different from the view direction. Our analysis completes the understanding of the error sensitivities under these conditions.

1 Introduction

This paper demonstrates the existence of a new approximate ambiguity in structure from motion (SFM) which occurs as generically as the bas-relief ambiguity but differs from it in important ways; for instance, the new effect grows stronger as the depth variation in the scene increases. The new effect is noticeable in the data of [19], and [18][17] contain a related discussion, but this paper is the first to explain it and demonstrate its significance.

Previous analyses [9][8][6] of ambiguity in SFM focused on the bas-relief ambiguity, assuming that the scene was non-planar and the camera heading sufficiently different from the view direction. Our paper complements this previous work. Under the same assumptions, the two combine to give a complete understanding of the ambiguities in SFM, with important implications for algorithms. In a companion paper [10], we show how an algorithm can exploit our analysis to robustly recover the heading from a multi-image sequence.

We consider the standard SFM problem of reconstructing the camera's direction of motion, either from a multi-image sequence of tracked points, from two frames, or from optical flow. For the multi-frame case, we assume that the camera translates roughly along a line, with non-constant velocity and arbitrary rotations. As for the bas-relief ambiguity, the new effect strengthens for smaller translations and field of view (FOV). Thus we also assume that the camera moves a moderate distance compared to its distance to the 3D points, with $(|T|/Z) < 1/2$, and that the FOV $\theta_F$ is moderate, with $\theta_F \lesssim 90^\circ$.

Let $\hat T$, $\hat z$ denote the heading and viewing directions. The essence of the bas-relief ambiguity is that motion sequences determine the $\hat T$-$\hat z$ plane more reliably than the position of $\hat T$ within this plane, at least when the x-y projection of $\hat T$ is not too small. [9][8][6] proved this for optical flow and small FOV, but it is also true more generally (e.g., [12]).
But these analyses do not address the ambiguity of determining $\hat T$ within the $\hat T$-$\hat z$ plane. Such a study is the aim of this paper.

Without loss of generality, assume from now on that the true $\hat T$ is in the x-z plane. Figure 1 shows a typical plot of the least-squares image error as a function of $\hat T_x/\hat T_z$. Note there is a local minimum for $\hat T_x/\hat T_z < 0$ in addition to the global minimum at $\hat T_x/\hat T_z > 0$. Since the error increases away from the x-z plane [9][8][6], the error as a function of the full $\hat T$ typically has a corresponding local minimum. Figures 2, 3, and 12 demonstrate this explicitly. We refer to this as the flipped local minimum since it occurs on the wrong side of the $\hat z$ axis. This same minimum also appeared in the data of [19] and [18]. We explain it in this paper and show that it occurs generically, causing a significant robustness problem for SFM algorithms.

In addition, we note that there can be many local minima very near $\hat T \approx \hat z$. Though these occur with high errors and thus do not lead to real ambiguity, they can cause problems for naive optimization algorithms. However, a sophisticated algorithm should easily avoid these local minima.

[Figure 1: The full reprojection error for six images. Horizontal axis: Tx/Tz, from -40 to 40; vertical axis: reprojection error for 6 images, in units of 10^-4.]

1.1 Previous Work

[18][17] discuss a "rubbery" local minimum, which they claim corresponds to an invalid reconstruction of the structure at negative depths. We have verified experimentally that the flipped minimum typically produces a positive-depth structure interpretation (often flattened into a rough frontal plane compared to the true structure). [18][17] also state that the "rubbery" structure interpretation violates the rigidity constraint over multiple frames; in contrast, the flipped motion is approximately consistent with a rigid structure. They explain the "rubbery" solution as an example of the familiar illusion for a transparent rotating cylinder, which can appear to depth-reverse and rotate backward when seen from a distance.

The cylinder illusion corresponds to the alternative reconstruction of the Tomasi/Kanade algorithm [20], which has an exact two-fold ambiguity. It results partly from the exact bas-relief ambiguity for orthographic projection, but it is also an intrinsically multi-frame phenomenon: with 2 images, the alternative solution would be just one among a continuous range of ambiguous solutions. On the other hand, the flipped local minimum disappears as the scene becomes increasingly shallow and orthographic projection a better approximation. It occurs for just two images, and the flipped structure is typically not depth-reversed, unlike that for the alternative Tomasi/Kanade solution. Thus the two local minima clearly have different explanations. We will show that the flipped minimum is produced by the opposition between the bas-relief ambiguity and the effect of a nontrivial depth range in the scene.

Footnote 1: The discussion of [17] is for optical flow and a sub-optimal error.

1.2 Outline of the paper

 

Section 2.1 defines the error function $E(\hat T)$ that we explicitly study. In Section 2.2, we motivate our claim that this error function gives a good approximation to the correctly weighted coplanarity error. We also argue that minimizing $E(\hat T)$ gives an approximately unbiased estimate of $\hat T$ and that for two frames $E$ gives a good approximation everywhere to the full image-reprojection error.

In Section 3, we restrict to zero rotations and explain why this causes no loss in generality. This section also begins our analysis of the properties of $E$. Section 3.1 derives the standard bas-relief ambiguity for $E$ and presents our basic approximation leading to an analytical understanding of its properties. In addition, we complete our argument from Section 2.2 that $E$ gives a good approximation to the coplanarity error.

Section 3.2 describes the other half of the flipped ambiguity: the effect of depth variation in the scene. We demonstrate how the bas-relief ambiguity combines with this effect to produce the flipped local minimum. After giving an intuitive argument, we use a simple approximation to derive the flipped minimum more concretely. We also explain why our result applies not just to $E$ but also to the full image-reprojection error. Lastly, we note that $E(\hat T)$ can have many local minima (with large errors) when $\hat T \approx \hat z$.

Our initial discussion assumed that the true $\hat T$ was neither parallel nor perpendicular to the viewing direction $\hat z$. Section 3.2 goes on to analyze the cases when $\hat T \approx \hat z$ or $\hat T \cdot \hat z \approx 0$. Experiments on synthetic sequences confirm the correctness of our qualitative picture of $E(\hat T)$, which according to our previous argument holds for the full reprojection error as well.

Our initial discussion also assumed that the scene was far from planar. Section 3.3 describes how our analysis changes for nearly planar scenes and presents results for two real sequences where the scene is close to planar. Section 3.4 briefly discusses how the flipped minimum affects the recovered structure. Finally, Section 4 summarizes our conclusions.

2 Preliminaries

2.1 Error Definition

We begin by defining an error which we will argue captures the qualitative behavior of the true reprojection error for SFM. By the reprojection error, we mean the standard least-squares image error computed by projecting the estimated 3D points into the images for the given motions and summing over the squared discrepancies between their projected and observed positions. This error encapsulates all information apart from the constraint that the 3D points must be in front of the camera. (Unless the motion is very small or there are many distant 3D points, the positive-depth constraint typically has little effect on the reconstruction beyond disambiguating its sign. Properly combining the positive-depth and reprojection constraints into one error function is messy, and we know of no work on this in the literature.)

Choose the zeroth image as a base image. Let $p_i^h \equiv (x_i^h, y_i^h)^T$ denote the $i$-th point in the $h$-th image, where $i \in \{1, 2, \ldots, N_p\}$ and $h \in \{0, 1, \ldots, N_F - 1\}$, and also let $p_i \equiv (x_i, y_i)^T \equiv (x_i^0, y_i^0)^T = p_i^0$. With focal length 1 and neglecting the noise, the image displacements between the base and $h$-th images are

$$ d_i^h \equiv p_i^h - p_i^0 = \frac{\tau^h Z_i^{-1}\,(\hat T_z p_i - \hat T_2)}{1 - \tau^h Z_i^{-1} \hat T_z} + f(R^h, p_i^h), \tag{1} $$

where $\hat T_2 \equiv (\hat T_x, \hat T_y)^T$, $\hat T \equiv (\hat T_x, \hat T_y, \hat T_z)^T$ represent the translation direction, $\tau^h$ gives the magnitude of the translation from the base image to image $h$, $Z_i^{-1}$ is the inverse depth in the base coordinate system, and $f(R^h, p_i^h)$ is the rotational displacement. (1) is exact with no optical-flow approximation. For moderate translations, the results of [16] demonstrate that one can accurately recover and compensate the rotations from the base image to all subsequent images. Thus, without loss of generality, we take the residual rotations as small.

Define the unit vector

$$ \hat u_{Ti} \equiv \frac{\hat T_z p_i - \hat T_2}{|\hat T_z p_i - \hat T_2|} $$

and let

$$ \Delta_i^h \equiv (d_{iy}^h,\ -d_{ix}^h)^T \cdot \hat u_{Ti}. $$

Define the length-$N_p$ vectors $\Delta^h$ to have elements $\Delta_i^h$, and the $(N_F - 1) \times N_p$ matrix $\Delta$ to have rows $\Delta^h$. Let

$$ r_i^{(1)} \equiv \begin{pmatrix} -x_i y_i \\ -(1 + y_i^2) \end{pmatrix}, \qquad r_i^{(2)} \equiv \begin{pmatrix} 1 + x_i^2 \\ x_i y_i \end{pmatrix}, \qquad r_i^{(3)} \equiv \begin{pmatrix} -y_i \\ x_i \end{pmatrix} $$

denote the first-order rotational flows, and define 3 length-$N_p$ vectors $V^{(b)}$, $b = 1, 2, 3$, with elements

$$ V_i^{(b)} \equiv (r_{iy}^{(b)},\ -r_{ix}^{(b)})^T \cdot \hat u_{Ti}. $$

Let $V$ be an $N_p \times 3$ matrix whose columns are the three $V^{(b)}$, and let $P_3$ be a projection matrix annihilating the three $V^{(b)}$. $P_3$ is a function of $\hat T$ and the measured base-image points $p_i$ and can be written explicitly as

$$ P_3 = 1_{N_p} - V (V^T V)^{-1} V^T, $$

where $1_{N_p}$ is an $N_p \times N_p$ identity matrix. Let $C$ be an $(N_F - 1) \times (N_F - 1)$ matrix given by $C_{hh'} = \delta_{hh'} + 1$. Finally, the error we consider is

$$ E(\hat T) \equiv \mathrm{trace}\left( C^{-1} \Delta P_3 \Delta^T \right). \tag{2} $$
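The definition (2) can be made concrete for the two-frame case. The sketch below is only an illustration under stated assumptions, not the algorithm of [10]: it assumes the reading $\Delta_i = (d_y, -d_x) \cdot \hat u_{Ti}$, $C_{hh'} = \delta_{hh'} + 1$ (so $C$ is the 1×1 matrix (2) for two frames), and it applies $P_3$ by solving the 3×3 normal equations rather than forming the projection matrix. All function names (`rot_flows`, `epipolar_unit`, `solve`, `two_frame_error`) are hypothetical.

```python
import math


def rot_flows(x, y):
    """First-order rotational flows r^(1), r^(2), r^(3) at image point (x, y)."""
    return [(-x * y, -(1 + y * y)), (1 + x * x, x * y), (-y, x)]


def epipolar_unit(x, y, T):
    """Unit vector along T_z * p - (T_x, T_y) for the candidate direction T."""
    ex, ey = T[2] * x - T[0], T[2] * y - T[1]
    n = math.hypot(ex, ey)
    return ex / n, ey / n


def solve(A, b):
    """Solve a small dense system A w = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            k = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= k * M[c][j]
    w = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][j] * w[j] for j in range(r + 1, n))
        w[r] = (M[r][n] - tail) / M[r][r]
    return w


def two_frame_error(points, flows, T):
    """E(T) of (2) for two frames: 0.5 * Delta P_3 Delta^T (C = [2] assumed)."""
    delta, V = [], []
    for (x, y), (dx, dy) in zip(points, flows):
        ux, uy = epipolar_unit(x, y, T)
        # Flow component normal to the epipolar direction.
        delta.append(dy * ux - dx * uy)
        # Same component for each of the three rotational flows.
        V.append([ry * ux - rx * uy for rx, ry in rot_flows(x, y)])
    # Normal equations (V^T V) w = V^T Delta give the projection onto span(V).
    G = [[sum(v[a] * v[b] for v in V) for b in range(3)] for a in range(3)]
    g = [sum(v[a] * d for v, d in zip(V, delta)) for a in range(3)]
    w = solve(G, g)
    resid = [d - sum(v[a] * w[a] for a in range(3)) for v, d in zip(V, delta)]
    return 0.5 * sum(r * r for r in resid)
```

For noise-free flows built from the translational term of (1) plus a first-order rotational flow, this error vanishes (up to rounding) at the true direction, because the translational flow is parallel to the epipolar direction and the rotational part lies in the span of the $V^{(b)}$.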

2.2 Discussion

 

Apart from the factor $C^{-1}$, $E(\hat T)$ equals the unbiased, least-squares coplanarity/epipolar error for optical flow. As we show below, this is because the projection $P_3$ eliminates the contribution from the rotational flows, and the remaining translational contribution in $\Delta$ measures the translational flow normal to the epipolar directions $\hat T_z p_i - \hat T_2$. The error is unbiased since $\hat u$ is a unit vector and $P_3$ is a projection. The matrix $C^{-1}$ compensates for the noise in the base image [14]; it does not appear for optical flow since there all noise is in the flow. [4][1] derived the optical-flow error in a form similar to (2), but the advantage of (2) is that it is computable in $o(N_p N_F)$ steps and leads to a fast, efficient algorithm for recovering the heading from multiple frames [10].

For finite motion, $E(\hat T)$ gives a good approximation to the exact coplanarity/epipolar error, and minimizing it gives an almost unbiased estimate of the true $\hat T$. The main reason for this is that the finite-motion corrections to the translational optical flow, from the denominator in (1), factor out of $E(\hat T)$, so that these corrections enter only through the rotation term and are scaled by the sizes of the residual rotations. Also, the optical-flow approximation is clearly reasonable for sequences with moderate translations.

A more explicit argument goes as follows. Write

$$ \Delta_i^h(\hat T) = (d_{iy}^h,\ -d_{ix}^h)^T \cdot \hat u_{TGi} + (r_{iy}^h,\ -r_{ix}^h)^T \cdot \hat u_{Ti} + o\!\left( \frac{\eta\, \hat T_z Z^{-1} \tau}{|\hat T_z x - \hat T_2|},\ \omega Z^{-1} \tau,\ \omega^2 \right), \tag{3} $$

where $d_i^h$ represents the translational displacements including noise, $r_i^h$ is a rotational flow, i.e., a linear combination of the $r_i^{(1,2,3)}$, $\hat u_{TGi}$ is $\hat u_{Ti}$ evaluated at the ground truth for the $x_i$, and $\omega$, $Z$, $\tau$, $\eta$ represent the typical sizes of the residual rotations, depths, translations, and image noise. From [16], $o(\omega) \sim o(Z^{-1}\tau)$. The first term on the right is the exact epipolar error assuming zero rotations, which we write $\Delta_E(\hat T)$. The corrections come from the base-image point noise in $\hat u$ and the higher order rotational displacements. (Note that our rough estimate of the $\hat u$ correction exaggerates the error when the denominator is small, i.e., when an image point is close to the focus of expansion (FOE). As long as most points are far from the FOE, the corrections in the summed error (2) will be small. Also, we have neglected a factor $|\hat T_{Gz} x - \hat T_{G2}|$ in the numerator of the corrections in (3), where $\hat T_G$ is the ground truth.) Then

$$ C^{-1/2} \Delta P_3 = C^{-1/2} \Delta_E(\hat T)\, P_3 + o\!\left( \frac{\eta\, \hat T_z Z^{-1} \tau}{|\hat T_z x - \hat T_2|},\ \eta\omega,\ \omega Z^{-1} \tau,\ \omega^2 \right). \tag{4} $$

Since $P_3$ eliminates the rotational contributions from the $d_i^h$, $C^{-1/2} \Delta_E(\hat T) P_3$ is essentially the coplanarity error for unknown rotations. Letting $H_R$ be the square root of $P_3$, with $P_3 = H_R H_R^T$ ($H_R$ can be defined precisely and computed as the product of Householder matrices), the error (2) can be written as $\mathrm{trace}(\Delta_R \Delta_R^T)$, where $\Delta_R \equiv C^{-1/2} \Delta H_R$. One can show [10] that the covariance of $\Delta_R$ is approximately proportional to the identity matrix,

$$ \mathrm{Cov}(\Delta_R) = \eta^2\, \delta_{hh'} \delta_{jj'} + \eta^2\, o\!\left( \frac{\hat T_z\, (\eta,\ Z^{-1}\tau,\ \omega)}{|\hat T_z x - \hat T_2|} \right), $$

so minimizing (2) gives an approximately unbiased estimate of $\hat T$.

(4) plus the further discussion in the next section show that (2) does give a good approximation to the exact coplanarity error. Also, for two frames, the coplanarity error accurately approximates the full reprojection error, since the rigidity constraint doesn't apply and the reprojection error just reflects the coplanarity constraint. The discrepancy between the two errors comes just from the $o(\eta)$ noise terms due to the asymmetry in the treatment of the two images, as exemplified by the first correction term in (3). Thus, for two frames, (2) accurately approximates the reprojection error everywhere. Figures 2 and 3 show experimentally generated plots of (2) and the reprojection error for a two-frame sequence. The two are visually indistinguishable, and the average and standard deviation of their ratio are 0.97 and 0.05. We also argue in the next section that (2) approximates the reprojection error essentially throughout the $\hat T$-$\hat z$ plane even for multi-frame sequences.

Footnote 2: See the discussion in the next section.

[Figure 2: Error (2) versus translation direction $(T_x, T_y, 1)$ for two images. Surface plot of the optical flow error, in units of 10^-3, over Tx and Ty from -5 to 5.]

[Figure 3: Full reprojection error versus translation direction $(T_x, T_y, 1)$ for the same two images as for Figure 2. Surface plot of the full reprojection error, in units of 10^-3, over Tx and Ty from -5 to 5.]

3 The Flipped Local Minimum

For specificity, we focus on a sequence of two images, but our analysis applies also to multiple images. We assume zero rotation, which is legitimate for small residual rotations since they cancel in (2). Assuming zero rotation makes sense also because we are interested in the intrinsic difficulty of SFM. Given a rotation between two images, one can always counter-rotate and set the rotation to zero. The rotated positions of the image points depend only on the points themselves, not on the structure, so the rotation doesn't significantly change the information in the images or alter the ambiguities of the reconstruction problem. The only result is an effective change in the noise distribution, and, since SFM problems are typically strongly overconstrained, this produces an unimportant bias unless the rotation is large (see footnote 3). Our analysis for unrotated images thus describes the ambiguities of the original reconstruction problem.

We assume that the FOV is moderate, with $\theta_F \lesssim 90^\circ$. For the moment, we also assume $\hat T_x \sim \hat T_z$. The image displacements are now given by the first term on the right of (1). For just two images, the denominator in (1) can be absorbed into the inverse depths by defining

$$ Z_i^{-1\prime} \equiv Z_i^{-1} / \left( 1 - \tau Z_i^{-1} \hat T_z \right). $$

For convenience, we omit the prime below, using $Z_i^{-1}$ to denote the denominator-corrected inverse depths. But the denominator factors out of (2) in any case, so its effects can be neglected in analyzing this error even for multiple frames.

Neglecting noise, the error (2) satisfies $E(\hat T) = 0$ for the true translation direction $\hat T$. Now consider $E(T')$ for a new direction $T' \equiv (T'_x, 0, T_z)^T$. From its definition, $E(T')$ equals the minimum value of $\sum_i \Delta_i^2(T', d'_i)$ over all flows $d'_i \equiv d_i + r_i$ differing from the true flow $d_i$ by a rotational flow $r_i$:

$$ E(T') = \min_{\{r_i\}} \sum_i \Delta_i^2(T', d'_i) \equiv \min_{\{r_i\}} \sum_i \left[ (d_{iy} + r_{iy},\ -d_{ix} - r_{ix})^T \cdot \hat u_{T'i} \right]^2. \tag{5} $$

(The matrix $C$ is irrelevant and we omit it.)

Footnote 3: This explains the finding in [3] that the importance of the bas-relief ambiguity depends mainly on the translation, not on the rotation.

3.1 The Bas-Relief Effect

Suppose first that all points are at approximately the same depth, with $Z_i^{-1} \approx 1$. Since we are focusing on two-frame SFM, we take $T \equiv \tau \hat T$, so $|T| < 1$. Adding a y-axis rotational flow $r'_i \equiv (T_x - T'_x)\,(1 + x_i^2,\ x_i y_i)^T$ to $d_i$ gives

$$ d'_i \approx \begin{pmatrix} T_z x_i - T'_x + (T_x - T'_x)\, x_i^2 \\ T_z y_i + (T_x - T'_x)\, x_i y_i \end{pmatrix}, \tag{6} $$

which, for moderate FOV $\theta_F < 90^\circ$ with $|x|, |y| \lesssim 1$, mimics a flow with translation $T'$. The error contribution just comes from

$$ \Delta_{Ri}(T', d'_i) \equiv (T_x - T'_x)\,(x_i y_i,\ -x_i^2)^T \cdot \hat u_{T'i} = -N_{d'i}^{-1}\, T'_x (T_x - T'_x)\, x_i y_i, \qquad N_{d'i} \equiv \left| (T_z x_i - T'_x,\ T_z y_i)^T \right|, \tag{7} $$

where the subscript $R$ indicates the added flow, and for moderate FOV this is small. This suggests that $r'_i$ should roughly give the minimizing rotational flow and that the error (2) is initially small, with a local minimum at $T'_x = 0$, and grows proportional to $|T_x - T'_x|^2$. (The local minimum becomes a true global minimum if $Z_i^{-1} = 1$ exactly, reflecting the standard two-fold ambiguity for planar scenes.)

Intuitively, the effect of a rotation can be approximated for moderate FOV as a constant overall shift in the image point positions. For 3D points at constant depth, one can reproduce the true translational flow for any translation in the $T$-$\hat z$ plane simply by adding the appropriate rotational flow. This is the standard bas-relief effect. Because of the quadratic corrections to the rotational flow, the ambiguity is not exact and the error increases slowly as the translations deviate from the ground truth. Since the quadratic corrections are less significant at small FOV, the bas-relief ambiguity holds more nearly and the error is smaller at smaller FOV.

Figure 4 illustrates the measured behavior of $E(T'_x/T_z)$ for a randomly generated two-frame sequence with $55 \le Z \le 65$, $T = (2, 0, 2)^T$, small residual rotations, 1 pixel noise, and an FOV of $30^\circ$. As one would expect from the bas-relief effect, the error grows slowly away from the global minimum.
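The algebra behind (6) and (7) can be checked numerically. The following sketch is only an illustration under the assumptions of this section (unit inverse depth, $T_y = 0$, and the sign convention $\Delta = (d_y, -d_x) \cdot \hat u$); the function names are hypothetical. It builds the mimicking flow $d'$, measures its component normal to the epipolar direction of the wrong translation $T'$, and compares that with the closed form $-N_{d'}^{-1} T'_x (T_x - T'_x)\, x y$.

```python
import math


def mimic_flow(x, y, Tx, Tz, Tx_new):
    """d' of (6): true flow at unit inverse depth plus the y-axis rotational flow r'."""
    dx = Tz * x - Tx  # translational flow, Z^-1 = 1, T_y = 0
    dy = Tz * y
    k = Tx - Tx_new
    return dx + k * (1 + x * x), dy + k * x * y


def residual(x, y, Tx, Tz, Tx_new):
    """Component of d' normal to the epipolar direction of T' = (Tx_new, 0, Tz)."""
    dpx, dpy = mimic_flow(x, y, Tx, Tz, Tx_new)
    ex, ey = Tz * x - Tx_new, Tz * y
    n = math.hypot(ex, ey)
    return (dpy * ex - dpx * ey) / n


def residual_closed_form(x, y, Tx, Tz, Tx_new):
    """Right-hand side of (7)."""
    n = math.hypot(Tz * x - Tx_new, Tz * y)
    return -Tx_new * (Tx - Tx_new) * x * y / n
```

The residual vanishes both at $T'_x = T_x$ and at $T'_x = 0$, which is the origin of the local minimum at $T'_x = 0$ discussed above.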

[Figure 4: Errors for $T \propto (1, 0, 1)$ and shallow scene. The x-axis represents $T'_x/T'_z$ (from -40 to 40), with $T'_y = 0$; errors are in units of 10^-3. Upper left: error (solid line) and one-vector approximation. Upper right: approximate combined error $\tilde E_C + \tilde E_V$. Lower left: error from quadratic terms, $E$ and $\tilde E_C$ (dotted line). Lower right: error from depth variation, $\tilde E_V$.]

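Our analysis is clearly oversimplified. At least for large $|T_x - T'_x|$, $r'_i$ cannot give the minimizing flow, since this would imply $E(T') \sim \sum_i |d_i + r'_i|^2 \sim \sum_i |r'_i|^2 \to \infty$ as $|T_x - T'_x| \to \infty$. But the growth of $E(T')$ with $T'_x$ eventually must cut off: the coplanarity error $E$ is certainly finite for $T' \propto \hat x$. More concretely, as $r' \to \infty$ one can clearly achieve $\sum_i \Delta_i^2(T', d_i + r_i) \ll \sum_i \Delta_i^2(T', d_i + r'_i)$ by taking $|r_i| \ll |r'_i|$.

A more careful analysis goes as follows. The rotational contribution to $\Delta$ is generated by the three vectors

$$ \begin{aligned} V'^{(1)}_i &\equiv N_{d'i}^{-1} \left( -y_i T_z - x_i y_i T'_x \right), \\ V'^{(2)}_i &\equiv N_{d'i}^{-1} \left[ T_z x_i - T'_x (1 + y_i^2) \right], \\ V'^{(3)}_i &\equiv N_{d'i}^{-1} \left[ T_z (y_i^2 + x_i^2) - x_i T'_x \right]. \end{aligned} \tag{8} $$

(These are the components normal to $\hat u_{T'i}$ of the rotational flows $(1 + x^2, xy)^T$, $(xy, 1 + y^2)^T$ and $(-y, x)^T$, respectively; the overall sign of each flow does not affect the projection.)

Since $\Delta_{Ri}(T', d'_i) \propto N_{d'i}^{-1} x_i y_i$, $\Delta_R$ projects dominantly onto $V'^{(1)}$. This projection is small as long as $|T'_x|\,|\{x_i y_i\}| \ll |T_z|\,|\{y_i\}|$, but it cuts off the growth of $E(T')$ for larger $|T'_x|$. We can derive a good, simple approximation of $E(T')$ by representing the rotational projection just by the projection onto $V'^{(1)}$ (but see the discussion of approximately planar scenes below). This gives

$$ E(T') \approx (T_x - T'_x)^2\, \frac{|A|^2 |B|^2 - (A \cdot B)^2}{|A + B|^2}, \qquad A \equiv T'_x \left\{ x_i y_i / N_{d'i} \right\}, \quad B \equiv T_z \left\{ y_i / N_{d'i} \right\}, \tag{9} $$

where $\{\cdot\}$ denotes the length-$N_p$ vector of per-point values.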



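As $|T'_x| \to \infty$, $E(T') \to T_z^2 \left[ \sum y^2 - \left( \sum x y^2 \right)^2 / \sum x^2 y^2 \right]$, i.e., the error in incorrectly presuming $T' \propto \hat x$ comes essentially from the y-projections of the true image displacements. We next expand $E(T')$ for $|T'_x| \gg 1$. Note that approximating the rotation projection by $V'^{(1)}$ instead of $V'^{(1,2,3)}$ is increasingly accurate as $|T'_x| \to \infty$, since asymptotically $\Delta_R \propto V'^{(1)}$. Expanding gives

$$ \begin{aligned} |A|^2 &\approx \sum x_i^2 y_i^2 + 2 \frac{T_z}{T'_x} \sum x_i^3 y_i^2 + o\!\left( \frac{T_z^2}{T'^2_x} \left( \theta_F/2 \right)^6 \right), \\ |B|^2 &\approx \left( \frac{T_z}{T'_x} \right)^2 \left( \sum y_i^2 + 2 \frac{T_z}{T'_x} \sum x_i y_i^2 \right) + o\!\left( \frac{T_z^4}{T'^4_x} \left( \theta_F/2 \right)^4 \right), \\ A \cdot B &\approx \frac{T_z}{T'_x} \left( \sum x_i y_i^2 + 2 \frac{T_z}{T'_x} \sum x_i^2 y_i^2 \right) + o\!\left( \frac{T_z^3}{T'^3_x} \left( \theta_F/2 \right)^5 \right). \end{aligned} \tag{10} $$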

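Thus up to small corrections

$$ E \approx \tilde E_C \equiv \frac{(T_x - T'_x)^2 \left( T_z^2 \sum y^2 \right) f}{T'^2_x + T_z^2 \sum y^2 / \sum x^2 y^2 + 2\, T_z T'_x \sum x y^2 / \sum x^2 y^2}, \tag{11} $$

where

$$ f \equiv \frac{\sum x^2 y^2 \sum y^2 - \left( \sum x y^2 \right)^2}{\sum x^2 y^2 \sum y^2}. $$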

Since one can always rotate so that the image points are centered around the viewing direction, $\sum x y^2$ can be assumed small due to cancellations, so that $f \approx 1$. $\sum x^2 y^2 / \sum y^2$ is typically small, e.g., it is $\theta_F^2/12$ for uniformly distributed image points. At moderate $|T'_x|$ the second term in the denominator in (11) typically dominates the first, since it is relatively enhanced by $o(\theta_F^{-2})$. Thus the $(T_x - T'_x)^2$ factor causes $E(T')$ to grow with $|T'_x|$ up to moderately large values. The maximum of $\tilde E_C(T'_x)$ occurs at

$$ T'_x = -T_z\, \frac{T_z \sum y^2 + T_x \sum x y^2}{T_x \sum x^2 y^2 + T_z \sum x y^2}. \tag{12} $$

Assuming $|T_x| \sim |T_z|$ and that cancellations suppress the $\sum x y^2$ term, $E(T')$ grows roughly until $|T'_x / T_z| \approx |T_z / T_x| \sum y^2 / \sum x^2 y^2$, which is $12\, |T_z / T_x|\, \theta_F^{-2}$ for uniformly distributed image points.

Strictly, $\tilde E_C$ approximates $E$ only for large $|T'_x|$, but it effectively gives a good approximation everywhere, since for small $|T'_x|$, $E \sim o(|T'_x|^2)$ (from (7)) and $\tilde E_C \approx T_x^2 \sum x^2 y^2$, so that they are both small.

As indicated in (6), the rotation needed to compensate for a wrong guess for $T'$ in the $T$-$\hat z$ plane is scaled by the size of the translation (relative to the depths). For moderate translations, the compensating rotation is also small and given approximately by its first-order form. This is what we need to complete our argument in Section 2.2 that (2) gives a good approximation to the coplanarity error everywhere. As discussed above, multiplying by $P_3$ in (2) implements a minimization over all rotational flows, but for the exact coplanarity error one must minimize the squared deviation from coplanarity over all finite rotations. If the minimizing rotation in (5) is small, as we have just argued, then the multiplication by $P_3$ does give a good approximation to the finite-rotation minimization. This, together with our argument that we can assume that the true rotations are zero, implies that (2) approximates the coplanarity error for $T'$ within the $T$-$\hat z$ plane.

For $T'$ outside this plane, the bas-relief effect no longer applies [9][8][6], i.e., rotations and translations are no longer easily confusable. Thus no rotation, whether infinitesimal or finite, will reduce the coplanarity error significantly beyond that computed for the true zero rotation, and again (2) gives a good approximation to the exact coplanarity error. As stated previously in Section 2, for two frames the coplanarity error accurately approximates the reprojection error. Thus (2) gives a good approximation to the reprojection error everywhere for two frames, as Figures 2 and 3 show.
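The approximation (11) and the location (12) of its maximum depend on the image points only through three moments. As a rough numerical check, the sketch below evaluates $\tilde E_C$ from a list of image points and returns the stationary point predicted by (12). It assumes the reading of (11) with moments $\sum y^2$, $\sum x y^2$, $\sum x^2 y^2$; the function names are hypothetical.

```python
def moments(points):
    """Return (sum y^2, sum x*y^2, sum x^2*y^2) over image points (x, y)."""
    sy2 = sum(y * y for x, y in points)
    sxy2 = sum(x * y * y for x, y in points)
    sx2y2 = sum(x * x * y * y for x, y in points)
    return sy2, sxy2, sx2y2


def e_c(tx_new, Tx, Tz, points):
    """Approximate bas-relief error (11) as a function of the guessed T'_x."""
    sy2, sxy2, sx2y2 = moments(points)
    f = (sx2y2 * sy2 - sxy2 ** 2) / (sx2y2 * sy2)
    denom = (tx_new ** 2
             + Tz ** 2 * sy2 / sx2y2
             + 2 * Tz * tx_new * sxy2 / sx2y2)
    return (Tx - tx_new) ** 2 * Tz ** 2 * sy2 * f / denom


def argmax_e_c(Tx, Tz, points):
    """Location (12) of the maximum of the approximation (11)."""
    sy2, sxy2, sx2y2 = moments(points)
    return -Tz * (Tz * sy2 + Tx * sxy2) / (Tx * sx2y2 + Tz * sxy2)
```

For a symmetric 5×5 grid of image points with coordinates in {-0.4, -0.2, 0, 0.2, 0.4} and $T_x = T_z = 1$, the moments are $\sum y^2 = 2$, $\sum x y^2 = 0$, $\sum x^2 y^2 = 0.16$, so (12) places the maximum at $T'_x = -12.5$, far on the flipped side, which illustrates how slowly the bas-relief error grows.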

3.2 Depth Variation Effects

We now consider the effect of a significant depth range in the scene, with $Z_i^{-1}$ deviating significantly from its mean value. Without loss of generality, take the mean of $Z_i^{-1}$ to be 1. When one adds a rotational flow as before in (6) to mimic a translational flow for the new direction $T'$, the depth variation causes the directions of the $d'_i$ to vary so that they do not agree on a single best value for $T'_x$. This is illustrated in Figure 5. Explicitly,

$$ d'_i = \begin{pmatrix} Z_i^{-1} (T_z x_i - T'_x) + (T_x - T'_x)\,(1 - Z_i^{-1} + x_i^2) \\ Z_i^{-1} T_z y_i + (T_x - T'_x)\, x_i y_i \end{pmatrix}, \tag{13} $$

where the term $(T_x - T'_x)(1 - Z_i^{-1})$ causes the variation. Neglecting the terms quadratic in the $x$, $y$,

$$ \Delta_{Vi}(T', d'_i) \approx -N_{d'i}^{-1}\, (T_x - T'_x) \left( 1 - Z_i^{-1} \right) T_z y_i, \tag{14} $$

where the subscript $V$ labels the contribution of the depth variations to $\Delta_i$. Assuming $Z_i^{-1}$ is uncorrelated with the image point coordinates $x_i$, $y_i$ (but see the discussion below for planar scenes), $\Delta_V$ is approximately given by (14) even after canceling out the rotational contribution. Thus

$$ E_V(T') \approx (T_x - T'_x)^2\, T_z^2 \sum_i \frac{\left( 1 - Z_i^{-1} \right)^2 y_i^2}{(T_z x_i - T'_x)^2 + T_z^2 y_i^2}, \tag{15} $$

[Figure 5: This figure illustrates why scene depth variation causes a large coplanarity error when $T' \approx \hat z$. The heavy arrows label the actual translational flows $d$. Attempting to add an approximately constant rotational flow (the dotted lines) to mimic the flow for an incorrect FOE near the viewing direction fails, since the modified flow vectors $d'$ vary in direction and have no common intersection point.]

and, for moderate FOV, $E_V(T')$ is sharply peaked near $|T'_x / T_z| \lesssim \theta_F/2$ and smaller and approximately constant for $|T'_x / T_z| \gg \theta_F/2$. This holds also for the total $E$ including the quadratic terms, since: 1) as discussed previously, the quadratic effects are suppressed for small-to-moderate $|T'_x|$; 2) the cross terms between the $\Delta_V$ and the quadratic contributions in $E$ are typically small assuming the image coordinates are uncorrelated with the depth variations. Combining the previously discussed slow growth of $E$ with $|T'_x|$, due to the quadratic terms in the rotational flow, with the maximum of $E$ at small $|T'_x|$ and sharp falloff at moderate $|T'_x|$, due to the depth variation, we infer that $E$ must have a local minimum at moderate $|T'_x|$ with $T'_x T_x < 0$.

Note that we used the true inverse depths $Z_i^{-1}$ to mimic the true flow even for the incorrect translation $T'$. The mimicry succeeds, i.e., gives a small error, outside the central peak at $|T'_x / T_z| \sim o(\theta_F)$. Thus for a multi-frame sequence, apart from the denominator corrections which are small for small translations, the true depths are roughly consistent with each image. This means that the rigidity constraint that the depths are fixed over the sequence adds little new information beyond the coplanarity constraint encapsulated in (2), so (2) accurately approximates the reprojection error for $T' \not\approx \hat z$ in the $T$-$\hat z$ plane. Thus the flipped local minimum for (2) also occurs for the reprojection error and captures an intrinsic, algorithm-independent ambiguity even for multi-frame SFM. Figure 1 shows this for a typical sequence. (For this Figure, we computed the reprojection errors via a standard multi-frame Levenberg-Marquardt (LM) steepest descent algorithm for fixed translation direction, using a two-frame algorithm to initialize the rotations and structure.)

To demonstrate the flipped minimum more concretely, approximate (15) by

$$ \tilde E_V(T') \equiv (T_x - T'_x)^2\, T_z^2\, \frac{\left\langle \left( 1 - Z^{-1} \right)^2 \right\rangle \sum y^2}{T'^2_x + T_z^2 \left\langle x^2 + y^2 \right\rangle}, \tag{16} $$

where the angled brackets denote the average. One effect of this approximation is to underestimate the central peak at $T'_x = 0$, since essentially it replaces the average of the inverse denominator in (15) by this function evaluated at the average denominator, and $f(z) \equiv z^{-1}$ is concave up. The other main effect is to smooth out the error near $T'_x = 0$, since (15) typically has multiple peaks due to the multiple denominators for different image points. These peaks are (barely) visible within the central peak in Figure 1. Intuitively, they occur when the FOE approaches individual image points. They are important since they produce multiple local minima within the central peak, though at high errors. (As long as $T' \approx \hat z$ all the denominators in (15) remain small, making the error large.) We briefly discuss these minima in our Conclusion.

The approximations $\tilde E_C$ and $\tilde E_V$ have the same functional form shown in Figure 6. The crucial difference comes from the constant terms in the denominator. The maximum of $\tilde E_V$ occurs at $T'_x = -T_z^2 \left\langle x^2 + y^2 \right\rangle / T_x$, whereas that of $\tilde E_C$ occurs at the much larger scale (12). Their maximum values are:

$$ \tilde E_V:\quad T_z^2 \sum y^2 \left\langle \left( 1 - Z^{-1} \right)^2 \right\rangle \frac{T_z^2 \left\langle x^2 + y^2 \right\rangle + T_x^2}{T_z^2 \left\langle x^2 + y^2 \right\rangle}, $$

$$ \tilde E_C:\quad T_z^2 \sum y^2\, f\; \frac{T_z^2 \sum y^2 + 2\, T_x T_z \sum x y^2 + T_x^2 \sum x^2 y^2}{T_z^2 \left[ \sum y^2 - \left( \sum x y^2 \right)^2 / \sum x^2 y^2 \right]} = T_z^2 \sum y^2 + 2\, T_x T_z \sum x y^2 + T_x^2 \sum x^2 y^2. $$

[Figure 6: The functional form of $\tilde E_C$ and $\tilde E_V$ (exaggerated for display). Horizontal axis from -30 to 30; vertical axis from 0 to 5.]
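The interplay between the two approximations can be illustrated by scanning their sum over $T'_x$. The sketch below is an illustration under stated assumptions rather than a reproduction of the experiments: it uses the reading of (11) and (16) in terms of the moments $\sum y^2$, $\sum x y^2$, $\sum x^2 y^2$, the depth term $\langle (1 - Z^{-1})^2 \rangle$ (`depth_var`) and $\langle x^2 + y^2 \rangle$ (`mean_r2`), and it finds local minima by a simple grid comparison. All names and the numerical values in the usage note are hypothetical.

```python
def e_c(t, Tx, Tz, sy2, sxy2, sx2y2):
    """Quadratic-term (bas-relief) approximation (11)."""
    f = (sx2y2 * sy2 - sxy2 ** 2) / (sx2y2 * sy2)
    denom = t * t + Tz * Tz * sy2 / sx2y2 + 2 * Tz * t * sxy2 / sx2y2
    return (Tx - t) ** 2 * Tz * Tz * sy2 * f / denom


def e_v(t, Tx, Tz, sy2, depth_var, mean_r2):
    """Depth-variation approximation (16)."""
    return (Tx - t) ** 2 * Tz * Tz * depth_var * sy2 / (t * t + Tz * Tz * mean_r2)


def local_minima(Tx, Tz, sy2, sxy2, sx2y2, depth_var, mean_r2,
                 half_range=40, steps_per_unit=20):
    """Grid values of T'_x at which e_c + e_v is lower than both neighbours."""
    n = half_range * steps_per_unit
    ts = [k / steps_per_unit for k in range(-n, n + 1)]
    total = [e_c(t, Tx, Tz, sy2, sxy2, sx2y2)
             + e_v(t, Tx, Tz, sy2, depth_var, mean_r2) for t in ts]
    return [ts[k] for k in range(1, len(ts) - 1)
            if total[k] < total[k - 1] and total[k] < total[k + 1]]
```

With $T_x = T_z = 1$, $\sum y^2 = 2$, $\sum x y^2 = 0$, $\sum x^2 y^2 = 0.16$, $\langle (1 - Z^{-1})^2 \rangle = 0.1$ and $\langle x^2 + y^2 \rangle = 0.16$, the scan finds the global minimum at $T'_x = T_x$ and a second, flipped local minimum at negative $T'_x$, between the sharp $\tilde E_V$ peak near zero and the slow rise of $\tilde E_C$.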

Typically the maximum of $\tilde E_V$ is much greater than that of $\tilde E_C$ and also than its asymptotic value $T_z^2 \sum y^2 \left\langle \left( 1 - Z^{-1} \right)^2 \right\rangle$, but $\tilde E_C$ is asymptotically larger than $\tilde E_V$. Correspondingly, $\tilde E_V(T'_x)$ falls off sharply to its asymptotic value as $|T'_x|$ exceeds the value giving its maximum, while meanwhile $\tilde E_C$ continues to grow until its maximum at the larger scale (12). Because of the large size and swift descent of $\tilde E_V$ from its maximum, it initially dominates, but eventually the growth in $\tilde E_C$ dominates as $\tilde E_V$ quickly reaches its asymptote.

This characteristic interplay between the two effects explains the local minimum.

So far we have assumed that $|T_x| \sim |T_z|$. For large $|T_x| > |T_z| / \theta_F$, the maximum in $\tilde E_C$ no longer occurs at large $T'_x$. Rather than growing slowly, $\tilde E_C$ increases quickly to its maximum at a small value of $|T'_x|$ and then decreases sharply past its maximum to its asymptotic value. The $\tilde E_V$ maximum still dominates the $\tilde E_C$ maximum at small $|T'_x|$. But because $\tilde E_C$ grows quickly, even if the descent of $\tilde E_V$ from its maximum dominates initially so that a local minimum does occur between the positions of the two maxima, this minimum tends to occur at errors already comparable to the asymptotic value of $\tilde E_C$. On the other hand, the asymptotic values of both $\tilde E_C$ and $\tilde E_V$ are proportional to $T_z^2$ and thus small. In effect, the flipped minimum is absorbed into the standard bas-relief ambiguity, which makes the error uniformly small over a large range so that noise can significantly alter the position of the global minimum.

For moderately large $|T_z| > |T_x|$, the maxima in both $\tilde E_V$ and $\tilde E_C$ occur at large values of $|T'_x|$ and are well separated from each other. As long as $|T_z| \lesssim |T_x| / \theta_F$, the maximum of $\tilde E_V$ is larger than its asymptotic value and also larger than the maximum value of $\tilde E_C$. Correspondingly, $\tilde E_V$ falls off sharply past the $|T'_x|$ giving its maximum, and as before this typically produces a local minimum between the $\tilde E_V$ and $\tilde E_C$ maxima. But since the error is scaled by $T_z^2$, the local minimum again occurs at relatively large error values and may not cause significant ambiguity.

For very large $|T_z| \gg |T_x| / \theta_F$, the peak in $\tilde E_V$ no longer dominates that of $\tilde E_C$, and the maximum value of $\tilde E_V$ is close to its asymptotic value, so that it no longer declines steeply past the $|T'_x|$ giving its maximum. In addition, though strictly $\tilde E_V$ attains its maximum at the large value given above in this section, it is clear from its form in (16) that it has already risen to close to its maximum value for $|T'_x| \gtrsim |T_z| \left\langle x^2 + y^2 \right\rangle^{1/2}$. For $T'_x / T_z$ just outside the field of view, the total error at first climbs swiftly with $|T'_x|$ due to the $\tilde E_V$ contribution. When $\tilde E_V$ flattens as it reaches towards its maximum value, the growth in $\tilde E_C$ takes over so that the total error typically continues to increase monotonically with $|T'_x|$. Moreover, if $|T_z|$ is large enough so that $|T_x \sum x^2 y^2| < |\sum x y^2\, T_z|$ in (12), then the maximum of $\tilde E_C$ occurs at relatively small $|T'_x|$ rather than at a large value as before, which implies that the growth in $\tilde E_C$ is large and dominates $\tilde E_V$. All these factors tend to eliminate the local minimum.

Figures 4 and 7-11 display the measured behavior of $E(T')$ and its approximations by $\tilde E_C$ and $\tilde E_V$ for typical two-frame sequences. For these sequences, we chose the 3D points randomly in a cone of square cross-section. For Figures 7-11 we first chose the depths uniformly in the range $20 \le Z \le 100$ and then chose $X$ and $Y$ uniformly within $-L \le X, Y \le L$, with $L \equiv Z\,(L_0/100)$, where the constant $L_0$ depended on the specified FOV. The residual rotations are small (after compensation as in [14][16]), the noise was 1 pixel Gaussian (based on a $512 \times 512$ image assuming the specified FOV), and the translation had size 2. Figures 7, 8, 9, 10 show typical results for a nominal FOV of $60^\circ$, with $T_x/T_z$ given respectively by 1, 3, 1/3, 1/10. The FOVs actually spanned by the image points in the two images were respectively $45^\circ$, $47^\circ$, $53^\circ$, and $51^\circ$. Figure 11 shows a typical result for $T_x/T_z = 1$ with a nominal FOV of $30^\circ$ and an actual FOV of $24^\circ$.
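The synthetic scenes described above can be generated along the following lines. This is a minimal sketch, not the code used for the figures: the relation `l0 = 100 * tan(FOV / 2)` between the cone constant and the nominal FOV is an assumption made here for illustration (the text only says the constant depended on the specified FOV), and the function names and the seed parameter are hypothetical.

```python
import math
import random


def sample_scene(n_points, fov_deg, z_min=20.0, z_max=100.0, seed=0):
    """Sample 3D points in a cone of square cross-section.

    Depths are uniform in [z_min, z_max]; X and Y are uniform in [-L, L]
    with L = Z * l0 / 100.  The choice of l0 below is an assumption.
    """
    rng = random.Random(seed)
    l0 = 100.0 * math.tan(math.radians(fov_deg) / 2)  # assumed FOV relation
    points = []
    for _ in range(n_points):
        Z = rng.uniform(z_min, z_max)
        L = Z * l0 / 100.0
        points.append((rng.uniform(-L, L), rng.uniform(-L, L), Z))
    return points


def project(points):
    """Perspective projection with focal length 1."""
    return [(X / Z, Y / Z) for X, Y, Z in points]
```

Because the points are drawn at random, the FOV actually spanned by a finite sample is generally smaller than the nominal value, consistent with the gap between nominal and actual FOVs reported above.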