Computer Vision and Image Understanding 101 (2006) 166–176 www.elsevier.com/locate/cviu
A column-space approach to projective reconstruction W.K. Tang *, Y.S. Hung Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong Received 5 May 2004; accepted 15 July 2005 Available online 14 October 2005
Abstract

The problem of projective reconstruction for multiple views is considered using a factorization method. A common difficulty of existing formulations of the factorization problem is that they do not adequately constrain the depth parameters, thus allowing the algorithm to converge to 'view-deficient' solutions with entire views being suppressed. We propose to include a variance measure with an adaptive weighting parameter in the formulation of the factorization problem to overcome this difficulty. Algorithmic solutions with guaranteed convergence are provided to perform factorization under the condition that there may be missing data in the images. © 2005 Elsevier Inc. All rights reserved.

Keywords: Multiple views; Projective reconstruction; Structure from motion; Subspace method; Factorization
1. Introduction Projective reconstruction from uncalibrated images is a necessary step in most methods for 3D reconstruction (e.g., see [1–8]). The factorization approach to projective reconstruction has received considerable attention in recent years [9–18]. An advantage of factorization-based methods is that any number of images can be handled simultaneously without preferential treatment for any subgroup of views. Consider multiple cameras with projection matrices Pi viewing a set of 3D points Xj captured as image points wij on the image planes of the cameras. In the factorization approach to projective reconstruction, a basic problem is to estimate the depths kij for the image points wij so that the scaled measurement matrix [kijwij] can be factorized as a rank-4 product PX, where P is the joint projection matrix and X is a shape matrix containing the 3D points. Different approaches have been proposed for performing the factorization, such as the iterative eigen algorithm of Chen and Medioni [13], the subspace method of Heyden et al.
* Corresponding author. Fax: +852 2559 8738. E-mail addresses: [email protected] (W.K. Tang), [email protected] (Y.S. Hung).
1077-3142/$ - see front matter © 2005 Elsevier Inc. All rights reserved. doi:10.1016/j.cviu.2005.07.007
[14,19], and the generalized eigenvalue approach of Mahamud and Hebert [9]. Almost all factorization methods are directly or indirectly related to a minimization of the error in the equation [λ_ij w_ij] = PX. In this respect, if the depths are not constrained, there is a tendency for the optimization algorithm to attempt to reduce the error by making the depths (and correspondingly either P or X) as small as possible. For this reason, some kind of constraint must be imposed on the depths to prevent the optimization algorithm from converging to a trivial solution where all the depths in all the views are zero. Despite the imposition of such constraints in various existing formulations of the factorization problem, these constraints only rule out the trivial solution where all the depths are zero; they do not stop the depths of some (but not all) views becoming zero. We have encountered this problem in many reconstruction examples and are not aware of the issue being addressed in existing work. In this paper, we shall use a column-space method, an adaptation of [9], for the factorization of the scaled measurement matrix. The method of [9] suffers from the difficulties described above, allowing the reconstruction to approach an undesirable solution while the value of the objective function appears to be converging. To overcome this difficulty, we propose to include an additional measure
in the formulation of the factorization problem to prevent the divergence of the depths. The issue of missing points has not been dealt with in [9]. Since missing points are inevitable in multi-view problems (e.g., due to occlusion), it is important that missing points be catered for in a practical solution. We will propose two alternative methods for handling missing points. Algorithmic solutions with guaranteed convergence will be provided for solving the factorization problem from multiple views containing missing data.

The paper is organized as follows. In Section 2, we formulate the factorization problem as a constrained least-squares problem. In Section 3, we introduce a variance measure to prevent the divergence of depths and hence ensure that the optimization converges to a desired solution. In Section 4, missing points are considered and two alternative algorithmic solutions are developed to perform depth estimation with or without estimating the missing points. Examples are given in Section 5 to illustrate the performance of the proposed methods in comparison with existing methods. Section 6 contains some concluding remarks.

Notation: Given a vector v ∈ R^{p×1} and a matrix P ∈ R^{p×r}, v can be decomposed into two components along and orthogonal to the subspace spanned by the columns of P. The orthogonal component is [I − P(P^T P)^{-1} P^T] v. We shall denote the squared norm of this component as

F(P, v) = v^T [I − P(P^T P)^{-1} P^T] v.  (1)

We will extend this notation where necessary to the case of v ∈ R^{p×m} being a matrix. By a rank-4 SVD of a matrix M, we mean a rank-4 approximation obtained by performing a singular value decomposition of M, i.e., M ≈ USV^T, where S ∈ R^{4×4} is a diagonal matrix containing the largest 4 singular values of M, and U and V contain the corresponding singular vectors.
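As a concrete illustration, the orthogonal-residual measure (1) and the rank-4 SVD can be sketched in numpy as follows. This is an illustrative sketch; the function names are ours, not the paper's.

```python
import numpy as np

def F(P, v):
    """Squared norm of the component of v orthogonal to the column space
    of P, i.e. v^T [I - P (P^T P)^{-1} P^T] v, as in Eq. (1)."""
    p = P.shape[0]
    proj = np.eye(p) - P @ np.linalg.inv(P.T @ P) @ P.T
    return float(v @ proj @ v)

def rank4_svd(M):
    """Rank-4 approximation M ~ U S V^T keeping the 4 largest singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :4], np.diag(s[:4]), Vt[:4, :]
```

A vector lying in the column space of P gives F(P, v) = 0, and a matrix of rank 4 is reproduced exactly by its rank-4 SVD.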
2. Problem formulation

Suppose a set of 3D points with homogeneous coordinates X_j = [x_j y_j z_j 1]^T (j = 1, ..., n) are projected onto m cameras with projection matrices P_i (i = 1, ..., m). Let w_ij = [u_ij v_ij 1]^T be the projection of the jth point on the ith view, i.e.,

λ_ij w_ij = P_i X_j,  (2)

where λ_ij represents the projective depth of X_j measured along the optical axis of camera i. The projection of all the 3D points onto all the cameras can be represented as

[λ_ij w_ij] = PX,  (3)

where [λ_ij w_ij] ∈ R^{3m×n} is the scaled measurement matrix, P = [P_1^T, P_2^T, ..., P_m^T]^T ∈ R^{3m×4} is the joint projection matrix, and X = [X_1, X_2, ..., X_n] ∈ R^{4×n} is the projective shape. Since P and X are rank-4, [λ_ij w_ij] is at most rank-4 too.

In general, the depths {λ_ij} are unknown. In the factorization approach, a consistent set of projective depths needs to be estimated such that the scaled measurement matrix is of rank-4 and factorizable in the form of (3). In practice, due to image noise and other uncertainties, (3) will not be exactly satisfied. In the factorization approach, it is natural to consider estimating {λ_ij} by minimizing the algebraic error associated with (3) subject to the rank-4 constraint on P and X:

min_{P ∈ R^{3m×4}} min_{λ_ij, X ∈ R^{4×n}} ||[λ_ij w_ij] − PX||_F^2.  (4)

The least-squares solution for X which minimizes (4) is given by

X = (P^T P)^{-1} P^T [λ_ij w_ij],  (5)

where (P^T P) is assumed to be non-singular. Substituting (5) into (4) yields

min_{P ∈ R^{3m×4}} min_{λ_ij} ||(I − P(P^T P)^{-1} P^T)[λ_ij w_ij]||_F^2.  (6)

Note that the jth column of the scaled measurement matrix can be written as W_j λ_j, where

W_j = diag{w_1j, ..., w_mj} ∈ R^{3m×m},  (7)
λ_j = [λ_1j ... λ_mj]^T ∈ R^{m×1}.  (8)

The minimization problem (4) is not sufficiently constrained, admitting a trivial solution with all optimization parameters set to zero. Hence, additional constraints need to be imposed to turn (4) into a sensible minimization problem. To avoid the trivial solution, we may impose the condition that each column of [λ_ij w_ij] has unit length, i.e.,

||W_j λ_j|| = 1, ∀j.  (9)

Under this condition, (6) becomes

min_{P ∈ R^{3m×4}} min_{||W_j λ_j|| = 1} Σ_{j=1}^n ||(I − P(P^T P)^{-1} P^T) W_j λ_j||_F^2.  (10)

Using the notation introduced in (1), (10) may be written

min_{P ∈ R^{3m×4}} min_{||W_j λ_j|| = 1} Σ_{j=1}^n λ_j^T F(P, W_j) λ_j.  (11)
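The block-diagonal matrix W_j of (7) and the per-column cost term of (11) can be sketched in numpy as follows. This is an illustrative sketch under the paper's definitions; the function names are ours.

```python
import numpy as np

def build_Wj(points):
    """W_j = diag{w_1j, ..., w_mj} in R^{3m x m}, Eq. (7): the m homogeneous
    image points of the j-th 3D point placed block-diagonally."""
    m = len(points)
    W = np.zeros((3 * m, m))
    for i, w in enumerate(points):
        W[3 * i:3 * i + 3, i] = w
    return W

def column_cost(P, W, lam):
    """lam^T F(P, W_j) lam, the j-th term of (11): squared norm of the part
    of the scaled column W_j lam orthogonal to the column space of P."""
    p = P.shape[0]
    proj = np.eye(p) - P @ np.linalg.inv(P.T @ P) @ P.T
    return float(lam @ (W.T @ proj @ W) @ lam)
```

Because the projector is symmetric and idempotent, this quadratic form equals the squared residual of the column W_j λ_j in (10), which is the equivalence the text exploits.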
Note that (10) is equivalent to the objective function considered in [9] where a generalized eigenvalue approach is used to solve the minimization problem in an iterative manner. The approach of [9] has the advantages of fast convergence and higher chances of finding a better solution. Since problem (10) is obtained by substituting the least-squares solution for X from (5) into (4), estimating kj to minimize (10) for a given P has the effect of simultaneously estimating X (as defined by (5)) along with kj. This is in contrast to factorization methods based on the problem (4) where kj is estimated alone under the assumption that P and X are fixed. As a result, the approach of [9] not only has better computational efficiency, but is less prone to be trapped at a local
minimum because of the extra freedom in the searching process of changing λ_j and X simultaneously. We have conducted comparison studies of factorization-based algorithms formulated using (4) and (10) and found empirically that the method of [9] based on the problem (10) often converges faster, and sometimes to a better solution (with a smaller 2D reprojection error), than algorithms based on the problem (4). This motivates us to adopt the generalized eigenvalue approach of [9] for projective reconstruction.

There are, however, two issues that remain unresolved in the problem formulation (10). First, missing image points (e.g., due to occlusion) have not been taken into account. Second, despite condition (9), the problem (10) is still insufficiently constrained. Specifically, consider a set of projective depths {λ_j} satisfying the constraint ||W_j λ_j|| = 1. The cost (11) can be regarded as the sum of m components, each of which represents the error associated with one view. Suppose the ith view is the one with the largest error. One way of reducing the cost (11) is to suppress the ith view by setting P_i = 0 and

λ_ij = 0  (j = 1, ..., n)  (12)

and then rescaling λ_j so that ||W_j λ_j|| = 1. It can be shown that the cost (11) is generically reduced by the above operation. This shows that the problem (11) is ill-posed, with solutions which tend to suppress views with larger errors in favour of retaining views with smaller errors. If a set of depths {λ_j} has one or more views suppressed as in (12), it will be referred to as view-deficient. In the extreme case, a solution to (11) achieving zero cost is given by

λ_1j = 1/||w_1j||,  λ_ij = 0 for i = 2, ..., m and j = 1, 2, ..., n,
P_1 = [I 0],  P_i = 0 for i = 2, ..., m,
X_j = λ_1j w_1j for j = 1, 2, ..., n.  (13)

Basically, this solution suppresses all image points in all the views except the first. In the first view, the depths of points are chosen to satisfy the constraint (9). Since there is effectively only one view, the 3D points can readily be taken to satisfy the projection equations exactly, and therefore this solution has a zero cost for the minimization problem (10). Clearly, this is not a desired solution. Iterative algorithms for solving (11) (e.g., in [9]) will tend to converge to view-deficient solutions unless the cost function is trapped in some local minimum. It is therefore necessary to include further conditions in the depth estimation problem to prevent this from happening.

3. Measure to prevent divergence of depths

By condition (9), any image point X_j must have a nonzero projective depth in at least one view. If the image point is suppressed in some of the views by the minimization procedure, then some λ_ij → 0. In this case, the variance of the depths λ_ij (for i = 1, ..., m) of the 3D point in the different views will be large. We shall refer to this as 'divergence' of the depths. We may avoid this by including a term in (10) to penalize large variance of λ_ij. Let λ̄_j = (1/m) Σ_{i=1}^m λ_ij be the mean depth of the jth 3D point in all the views, and σ_j^2 be the variance:

σ_j^2 = (1/m) Σ_{i=1}^m (λ_ij − λ̄_j)^2
      = (1/m) Σ_{i=1}^m λ_ij^2 − [(1/m) Σ_{i=1}^m λ_ij]^2
      = λ_j^T Σ λ_j,  (14)

where Σ is a positive semi-definite matrix

Σ = (1/m) [I_m − (1/m) c_m c_m^T],  (15)

in which c_m = [1 1 ... 1]^T ∈ R^{m×1}. Incorporating (14) as a penalty term in (11) gives

min_{P ∈ R^{3m×4}} min_{||λ_j|| = 1} Σ_{j=1}^n λ_j^T [F(P, W_j) + γΣ] λ_j,  (16)

where γ > 0 is a weighting factor. Note that in (16), we have simplified the constraint on λ_j to ||λ_j|| = 1, as there is no particular reason to include W_j as a weighting factor in the constraint on the depths.
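The penalty matrix Σ of (15) is easy to form explicitly. The following numpy sketch builds it and checks that the quadratic form λ^T Σ λ reproduces the variance (14); both printed values agree.

```python
import numpy as np

def variance_matrix(m):
    """Sigma = (1/m) [I_m - (1/m) c_m c_m^T] with c_m the all-ones vector,
    so that lam^T Sigma lam equals the variance of lam's entries (Eqs. 14-15)."""
    c = np.ones((m, 1))
    return (np.eye(m) - (c @ c.T) / m) / m

lam = np.array([1.0, 2.0, 4.0, 5.0])
S = variance_matrix(4)
# The quadratic form and the (population) variance of lam's entries coincide.
print(lam @ S @ lam, np.var(lam))
```

Note that np.var computes the population variance (divisor m), matching the 1/m convention of (14).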
3.1. Choice of γ

To avoid the divergence of depths in a solution to (16), γ must be chosen sufficiently large. However, a large γ imposes a heavy weighting on the variance constraint and may result in unnecessary clustering of the projective depths. Hence, there is a need to find a compromise for the value of γ. A possible strategy is to solve a sequence of problems of the form (16), initially with conservative values of γ, and then to reduce γ while ensuring that divergence of depths will not occur.

Suppose (16) is solved with a certain γ = γ′ for a set of depths {λ′_j}. We wish to determine a new value of γ (<γ′) such that the minimization will not converge to any set of depths in a class D of undesirable (view-deficient) solutions, characterized by a bound b on the total variance:

Σ_{j=1}^n λ_j^T Σ λ_j ≥ b > Σ_{j=1}^n λ′_j^T Σ λ′_j  for all {λ_j} ∈ D.

Let

γ = [Σ_{j=1}^n λ′_j^T F(P, W_j) λ′_j] / [b − Σ_{j=1}^n λ′_j^T Σ λ′_j] > 0.  (19)
Then, for {λ_j} ∈ D, we have

Σ_{j=1}^n λ′_j^T [F(P, W_j) + γΣ] λ′_j < Σ_{j=1}^n λ_j^T [F(P, W_j) + γΣ] λ_j.

This ensures that the optimization problem (16) with γ set by (19) will not converge to any undesirable solution {λ_j} ∈ D. We will use (19) for updating γ whenever γ < γ′ (by a sufficiently large margin, say n). However, if γ > γ′, there is no guarantee that (16) will not converge to an undesirable solution in D. In this case, γ = γ′ will be taken as the final value of γ for solving (16).

The choice of the set D is a reflection of the conservatism in the strategy for reducing γ. Some possible choices of D are:

1. D = set of all view-deficient depths {λ_j} having property (12) for some i. In this case, b is given by b = n/m^2.
2. D = set of all view-deficient depths generated from {λ′_j} by deleting one or more of the m views but otherwise maintaining the ratios of the depths in the remaining views. In this case, b can be computed as the smallest variance of the m view-deficient depths obtained by deleting one view of {λ′_j} at a time.
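The update rule (19) can be sketched as follows. This is an illustrative sketch in which we read the denominator of (19) as b minus the current total variance term, which is what makes the quotient positive when b exceeds that term; the function name and argument layout are ours.

```python
def update_gamma(F_terms, var_terms, b):
    """Eq. (19): gamma = sum_j lam'_j^T F(P, W_j) lam'_j
                         / (b - sum_j lam'_j^T Sigma lam'_j).
    F_terms and var_terms hold the per-point values of the two quadratic
    forms at the current solution {lam'_j}; b lower-bounds the total
    variance of any depth configuration in the undesirable set D."""
    den = b - sum(var_terms)
    if den <= 0:
        raise ValueError("b must exceed the current total variance term")
    return sum(F_terms) / den
```

With this choice, any {λ_j} in D incurs a penalty γ b that already exceeds the current cost, so the iteration has no incentive to move into D.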
4. Missing points

We will propose two different approaches for handling missing points. In the first method, the missing points are estimated as part of the algorithm and the missing data in the measurement matrix are then filled in using the estimates. In this case, the singular value decomposition is used to estimate the joint projection matrix P. The idea is similar to approaches based on expectation-maximization (EM), where missing data are filled in and the measurement matrix is then factorized so that its rank is as close as possible to rank-3 for orthographic projection or rank-4 for perspective projection [18,20]. In the second method, the depth estimation problem (16) is modified to exclude the missing points from the objective function of the minimization problem. Hence, there is no need to estimate the missing points, as they do not enter into the problem formulation. Without estimating the missing points, the scaled measurement matrix is incomplete and the singular value decomposition method cannot be applied. In this case, a bilinear iterative method is used to estimate P.

4.1. Depth estimation incorporating estimation of missing points

Let the available image points be indexed by ordered pairs of the set

A = {(i, j) | w_ij is observed as point j on view i}.  (20)
If there are image points missing from some of the views, the measurement matrix will contain unknown entries that need to be filled in before the minimization problem (16) can be solved. Accordingly, (16) is extended to include the estimation of missing points as

min_{w_ij | (i,j) ∉ A} min_{P ∈ R^{3m×4}} min_{||λ_j|| = 1} Σ_{j=1}^n λ_j^T [F(P, W_j) + γΣ] λ_j.  (21)
The form of (21) suggests the following procedure for minimizing the objective function, where λ_j, P, and {w_ij | (i, j) ∉ A} are estimated successively in an iterative algorithm. In Algorithm 1, the variables computed in the kth iteration are labelled with superscript k.

Algorithm 1. Depth estimation incorporating estimation of missing points

1. Set k = 0. If initial values λ_ij^0, P^0, and X^0 are provided, proceed to Step 2; otherwise, fill in the missing entries of the measurement matrix using the centroids of the visible 2D points in each image, put λ_ij^0 = 1 for all i, j, and obtain an initial guess of P^0 and X^0 by means of a rank-4 SVD of the scaled measurement matrix. Set γ to some large value (e.g., 10).
2. Put k = k + 1. For j = 1, ..., n, perform an SVD of [F(P^{k−1}, W_j^{k−1}) + γΣ] and set λ_j^k to be the singular vector corresponding to the minimum singular value; update the depths in the scaled measurement matrix using λ_j^k.
3. Perform a rank-4 SVD: [λ_ij^k w_ij^{k−1}] ≈ USV^T; let P^k = U and X^k = SV^T.
4. Estimate missing points as w_ij^k = (1/λ_ij^k) P_i^k X_j^k for (i, j) ∉ A.
5. Compute e_k = Σ_{j=1}^n (λ_j^k)^T [F(P^k, W_j^k) + γΣ] λ_j^k. If |e_k − e_{k−1}| is small, set γ′ = γ and λ′_j = λ_j^k; otherwise, return to Step 2.
6. Compute γ using (19). If γ < γ′ − n, return to Step 2; otherwise, stop.

Proof of convergence of Algorithm 1. We will first show that, for a fixed γ, each of the Steps 2, 3, and 4 in Algorithm 1 results in a reduction in the value of the objective function (21). In Step 2, assuming that P^{k−1} and W_j^{k−1} are fixed, the optimal λ_j that minimizes λ_j^T [F(P^{k−1}, W_j^{k−1}) + γΣ] λ_j is given by the singular vector λ_j^k corresponding to the minimum singular value, and hence

Σ_{j=1}^n (λ_j^k)^T [F(P^{k−1}, W_j^{k−1}) + γΣ] λ_j^k ≤ Σ_{j=1}^n (λ_j^{k−1})^T [F(P^{k−1}, W_j^{k−1}) + γΣ] λ_j^{k−1} = e_{k−1}.  (22)
In Step 3, under the condition that [λ_ij^k w_ij^{k−1}] is fixed, (P, X) = (P^k, X^k) represents the best rank-4 factorization that minimizes ||[λ_ij^k w_ij^{k−1}] − PX||_F^2. Note that X^k from the rank-4 SVD is the same as the least-squares solution given by (5). By the equivalence of (4) and (6), P = P^k minimizes Σ_{j=1}^n (λ_j^k)^T F(P, W_j^{k−1}) λ_j^k. It follows that

Σ_{j=1}^n (λ_j^k)^T [F(P^k, W_j^{k−1}) + γΣ] λ_j^k ≤ Σ_{j=1}^n (λ_j^k)^T [F(P^{k−1}, W_j^{k−1}) + γΣ] λ_j^k.  (23)

In Step 4, replacing the missing points in [λ_ij^k w_ij^{k−1}] by the corresponding entries in P^k X^k through the definition w_ij^k = (1/λ_ij^k) P_i^k X_j^k, we have

||[λ_ij^k w_ij^k] − P^k X^k||_F^2 ≤ ||[λ_ij^k w_ij^{k−1}] − P^k X^k||_F^2.  (24)

It follows that

e_k = Σ_{j=1}^n (λ_j^k)^T [F(P^k, W_j^k) + γΣ] λ_j^k ≤ Σ_{j=1}^n (λ_j^k)^T [F(P^k, W_j^{k−1}) + γΣ] λ_j^k.  (25)
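The quantities appearing in (22) and (23) correspond to Steps 2 and 3 of Algorithm 1 for a filled-in measurement matrix. They can be sketched in numpy as follows; this is an illustrative sketch (it omits the missing-point update of Step 4), and the function name is ours.

```python
import numpy as np

def algorithm1_inner_step(W_list, P, Sigma, gamma):
    """One pass of Steps 2-3 of Algorithm 1 for a filled-in measurement
    matrix. W_list[j] is W_j in R^{3m x m}; Sigma is the m x m variance
    matrix of Eq. (15). Returns the new depths, P and X."""
    p = P.shape[0]
    proj = np.eye(p) - P @ np.linalg.inv(P.T @ P) @ P.T
    lams = []
    for W in W_list:
        # Step 2: lam_j is the singular vector of F(P, W_j) + gamma * Sigma
        # associated with the smallest singular value (unit norm).
        M = W.T @ proj @ W + gamma * Sigma
        _, _, Vt = np.linalg.svd(M)
        lams.append(Vt[-1])
    # Step 3: rank-4 SVD of the rescaled measurement matrix [lam_ij w_ij].
    S_mat = np.column_stack([W @ l for W, l in zip(W_list, lams)])
    U, s, Vt = np.linalg.svd(S_mat, full_matrices=False)
    return lams, U[:, :4], np.diag(s[:4]) @ Vt[:4, :]
```

On noiseless data, with P set to the true joint projection matrix and γ = 0, Step 2 recovers the true depth directions (each column of the rescaled matrix then lies in the column space of P), so the rank-4 factorization of Step 3 is exact.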
Combining (22), (23), and (25) gives e_k ≤ e_{k−1}. For a fixed γ, the algorithm loops in an inner loop consisting of Steps 2, 3, 4, and 5. By the above argument, {e_k} is monotonically decreasing in the inner loop. The algorithm quits the inner loop at Step 5 if the change in e_k is small. In Step 6, only when γ is reduced will the iteration return to Step 2 of the algorithm, in which case e_k will also be reduced. This ensures that e_k is monotonically decreasing throughout the algorithm and hence convergent. □

4.2. Depth estimation without estimating missing points

An alternative approach for handling missing points is to modify the minimization problem to exclude the missing points from the objective function. In this case, since w_ij are available for (i, j) ∈ A only, the objective function ||[λ_ij w_ij] − PX||_F^2 in (4) should be replaced by

Σ_{(i,j) ∈ A} ||λ_ij w_ij − P_i X_j||^2.  (26)

The least-squares solution for X_j (j = 1, ..., n) which minimizes (26) is given by

X_j = (P̃_j^T P̃_j)^{-1} P̃_j^T W̃_j λ̃_j,  (27)

where W̃_j and λ̃_j are defined as in (7) and (8) with the missing points removed and the dimensions reduced correspondingly, and P̃_j ∈ R^{3m_j×4} represents that part of the joint projection matrix P consisting only of the cameras (say m_j of them) for which the jth point is visible. Accordingly, (16) becomes

min_{P ∈ R^{3m×4}} min_{||λ̃_j|| = √(m_j/m)} Σ_{j=1}^n λ̃_j^T [F(P̃_j, W̃_j) + γΣ_j] λ̃_j,  (28)

where Σ_j ∈ R^{m_j×m_j} is defined as in (15) with m replaced by m_j. Note that ||λ_j|| = 1 in (16) is replaced by ||λ̃_j|| = √(m_j/m), so that points are given a weighting proportional to the number of views in which the point is visible. (28) can be solved by means of the following algorithm.

Algorithm 2. Depth estimation without estimating missing points

1. Set k = 0. If initial values λ_ij^0 and P^0 are provided, proceed to Step 2; otherwise, estimate λ_ij^0 and P^0 as in Step 1 of Algorithm 1. Set γ to some large value (e.g., 10).
2. Put k = k + 1. For j = 1, ..., n, perform an SVD of [F(P̃_j^{k−1}, W̃_j) + γΣ_j] and set λ̃_j^k to be the singular vector (with ||λ̃_j^k|| = √(m_j/m)) corresponding to the minimum singular value; for (i, j) ∈ A, update the depths in the scaled measurement matrix using λ̃_j^k.
3. For j = 1, ..., n, compute X_j^k using (27), i.e.,

X_j^k = [(P̃_j^{k−1})^T P̃_j^{k−1}]^{-1} (P̃_j^{k−1})^T W̃_j λ̃_j^k.  (29)
4. Solve for P^k from the linear least-squares problem:

min_{P^k ∈ R^{3m×4}} Σ_{(i,j) ∈ A} ||λ_ij^k w_ij − P_i^k X_j^k||^2.  (30)

5. Compute e_k = Σ_{j=1}^n (λ̃_j^k)^T [F(P̃_j^k, W̃_j) + γΣ_j] λ̃_j^k. If |e_k − e_{k−1}| is small, set γ′ = γ and λ′_j = λ̃_j^k; otherwise, return to Step 2.
6. Compute γ using (19). If γ < γ′ − n, return to Step 2; otherwise, stop.

It can be shown that each of the three Steps 2, 3, and 4 in the above algorithm reduces the value of the objective function (28), and hence, by an argument similar to that given in the proof of convergence of Algorithm 1, e_k is monotonically decreasing and is guaranteed to converge. The least-squares problem (30) can be solved as follows. Let X̃_i and W̃_i be those parts of the shape matrix and the scaled measurement matrix, respectively, consisting only of the points visible in the ith camera. Then, the P_i minimizing Σ_{(i,j) ∈ A} ||λ_ij w_ij − P_i X_j||^2 is given by

P_i = W̃_i X̃_i^†  (i = 1, 2, ..., m),

where X̃_i^† is the pseudo-inverse of X̃_i.

Since Algorithm 2 minimizes an objective function defined on available data only, there is no need to introduce additional variables for the missing data, as is necessary in Algorithm 1. Consequently, the minimization problems in Algorithm 2 have fewer optimization variables than those in Algorithm 1. This is significant if there is a large proportion of missing points. Moreover, the bilinear method for estimating X and P in Algorithm 2 is found to converge faster than the method of Algorithm 1. For these reasons, Algorithm 2 is preferable to Algorithm 1.

5. Experimental results

The proposed method is first evaluated using synthetic data and compared with the method of [9] on the issue of convergence. Then, we apply the proposed method to two real image sequences with missing data. The convergence of the proposed method on real data is presented to show the effect of the variance measure. In the examples, all visible points in each view are transformed by an isotropic transformation [10] so that the centroid of all visible points is at the origin of that view and the mean distance of all visible points from the centroid is √2.
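The closed-form per-camera solution P_i = W̃_i X̃_i^† of the least-squares problem (30) in Algorithm 2 can be sketched in numpy as follows. This is an illustrative sketch; the function name and data layout are ours.

```python
import numpy as np

def update_projections(scaled_w, X, visible):
    """Step 4 of Algorithm 2: per-camera least squares, Eq. (30).
    scaled_w[i][j] is the 3-vector lam_ij * w_ij; visible[i] lists the
    indices of the points seen by camera i (at least 4 for a unique P_i);
    X is the 4 x n shape matrix. Returns the list of 3 x 4 matrices P_i."""
    Ps = []
    for i, idx in enumerate(visible):
        Wi = np.column_stack([scaled_w[i][j] for j in idx])  # 3 x m_i
        Xi = X[:, idx]                                       # 4 x m_i
        Ps.append(Wi @ np.linalg.pinv(Xi))                   # W~_i X~_i^+
    return Ps
```

On exact data with at least four visible points per camera, X̃_i has full row rank, X̃_i X̃_i^† = I_4, and the true P_i is recovered exactly.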
5.1. Synthetic data

To test the robustness of the proposed method, a scene having large depth variation is constructed as shown in Fig. 1A, where two sets of 30 random points are generated inside each of two spheres C1 and C2 of radius 0.125 m in 3D space. Five cameras are placed randomly inside each of two boxes B1 and B2 of dimensions 0.5 m × 1 m × 1 m. All the cameras have fixed intrinsic parameters and are pointed towards the centroid of the set of points. All points are visible on all the image planes within an image area of 1000 × 1000 pixels. The image points are contaminated by different Gaussian noise levels (with standard deviation σ_n ranging from 0 to 4 pixels in increments of 0.5 pixel).
Fig. 1. Results of the synthetic data example. (A) Synthetic scene configuration. (B) 2D reprojection error. (C) Convergence. (D) 3D error. (E) Missing data estimation.
For each noise level, Algorithm 2 is applied to 20 sets of randomly generated noisy data and the mean values of various error measures are computed.

5.1.1. Performance on 2D reprojection error

Fig. 1B shows the 2D reprojection errors of the reconstructions obtained using the proposed method, plotted against different noise levels. The results are generated by simply running Algorithm 2 until convergence; there is no need to monitor the 2D reprojection error. This shows that the variance measure is able to suppress the tendency of the optimization algorithm to converge to a view-deficient solution. The RMS 2D reprojection error (RMSE) is almost equal to the added noise level.

5.1.2. Comparisons of convergence

Fig. 1C compares the convergence of our method with the method of [9] by plotting the RMS 2D reprojection error (RMSE) against the number of iterations. The 2D points are contaminated by Gaussian noise (with σ_n = 5 pixels). The maximum 2D reprojection error (MAX 2DRPE), taken over all 2D points in all views, is also shown. It should be noted that, when the method of [9] is applied to any particular data set, the 2D reprojection error typically decreases initially to a minimum point but then begins to increase as the algorithm runs into the kind of problem described in Section 2, eventually converging to a view-deficient solution for which the 2D reprojection error is large. In Fig. 1C, the RMSE and the MAX 2DRPE of the method of [9] start to diverge from the 280th iteration, whereas the proposed method is convergent.

5.1.3. Performance on 3D error

The evaluation of the 3D error is performed by upgrading the reconstructed 3D points X̂_j from the projective space to a Euclidean space by means of a collineation T ∈ R^{4×4}, which is obtained by minimizing the 3D error

e_3D = min_T √[(1/n) Σ_j ||X_j − α_j T X̂_j||^2],
where X_j is the ground-truth 3D point and α_j is a scaling factor normalizing the 4th component of T X̂_j to 1. Fig. 1D shows the RMS 3D errors plotted against different noise levels. The RMS 3D errors are small compared with the depths of the points.

5.1.4. Missing data estimation

To evaluate the performance of our algorithm for missing data estimation, we use the same synthetic data sets as before but randomly remove 20% of the 2D points in the measurement matrix to simulate missing data. The RMS 2D reprojection errors for both visible points and missing points are plotted against different noise levels in Fig. 1E, which shows that the errors of the estimated missing points (compared with the ground truth) are comparable with the levels of the added noise.

5.1.5. Remarks on the effect of the variance measure and the choice of γ

It is natural to ask whether the inclusion of the variance measure in the objective function for minimization has the side effect of forcing the projective depths to minimum variance. This depends on the relative magnitudes of the two components (namely λ^T F λ and γ λ^T Σ λ) in the objective function (21). The strategy of choosing γ given by (19) is to keep the relative weighting of the regularization small, so that it has only the mild effect of preventing the depths from diverging, rather than forcing them into minimum variance. To illustrate this, the two components of (21) in the iterative solution for the data set considered in Section 5.1.2 are plotted separately in Fig. 2A.

Fig. 2. Minimization of the algebraic error. (A) With the variance measure and (B) without the variance measure (i.e., γ = 0).

From Fig. 2A, the minimization of the cost function can be divided into two stages. In the initial stage, the algebraic error is reduced with a sharp slope as the depths settle from the initial guess to some natural depths. At this stage, since the algebraic error λ^T F λ is large and the variance measure γ λ^T Σ λ is comparatively small, the convergence of the depths to a rough solution is not hindered by the variance term. At the 26th iteration, the sharp drop in λ^T F λ triggers a reduction in γ, so that γ λ^T Σ λ exhibits a discontinuous jump and is consequently kept small compared with λ^T F λ. This ensures that the regularization is just sufficient to avoid view-deficiency and does not have the dominant effect of forcing the depths into minimum variance.

For the purpose of comparison, Algorithm 2 is run without the variance term (i.e., γ = 0), and plots of λ^T F λ and the 2D reprojection error are shown in Fig. 2B. A comparison of Figs. 2A and B shows that, with the variance measure, λ^T F λ stabilizes at a level which is close to what it would be without the variance measure (but before the depths start to diverge at about the 450th iteration). This shows that the variance measure has only a marginal effect on the minimization of the algebraic error up to the point before the divergence of depths. Another indication that the variance measure does not adversely cluster the depths is that the 2D reprojection error given in Fig. 1B is almost at the same level as the added noise. This is about the best one can expect (even after bundle adjustment), and therefore the depths could not have been unnecessarily constrained by the variance measure.

To illustrate how the final value of γ depends on the scene structure, we consider illuminating the points in the scene of Fig. 1A in three different ways. In the first scene, only the points inside the sphere C1 are illuminated. In the second scene, only the points in C2 are illuminated, and in the third scene, all points are illuminated. The values of γ at the end of the solution process are 2.27, 3.99, and 0.22 for scenes 1, 2, and 3, respectively. For scene 3, the variance of the depths is large and hence γ can afford to take on a smaller value while the regularization term γ λ^T Σ λ remains effective for preventing divergence to a view-deficient solution; whereas for scenes 1 and 2, the variance of the depths is relatively small, and hence γ needs to be large to ensure that the solution does not diverge. This example shows that (19) is effective in adjusting γ to cater for the overall scene structure, so that if the scene has large depth variations, γ will become small.
5.2. Real images with missing data

The proposed method is evaluated using two real image sequences with large percentages of missing points. The method of [9] is not applicable to these examples because of the missing data.

5.2.1. Castle model image sequence

Images of a toy castle model are captured using a Canon D-30 digital camera with fixed intrinsic parameters. The image size is 2160 × 1440 pixels. Five images (three front and two rear views) are taken around the castle at an elevation angle of around 30°. Fig. 3A shows the first image of the sequence. A total of 154 corresponding points are matched manually across the five images. None of the points is visible in all five images, and most of them are visible in only two views. Fig. 3B shows the missing-data map of the measurement matrix. There are 378 missing points among the five views (i.e., 49.1% missing data). The algorithm (implemented in Matlab 6.5 running on a Pentium-4 2.4 GHz PC) takes 141 iterations to converge in 18 s. The maximum 2D reprojection error of the visible points over all views is 3.45 pixels and the RMS 2D reprojection error is 0.985 pixels. To visualize the reconstruction, we use the normalization method of Han and Kanade [6] to upgrade the projective space to Euclidean space, assuming that the principal point is fixed and the skew is zero across all views. A view of the reconstructed Euclidean space is shown in Fig. 3C and a close-up in Fig. 3D.

Fig. 3. Castle model image sequence. (A) An image from the castle image sequence. (B) The distribution of 2D missing points. (C) A view of the Euclidean reconstruction. (D) A close-up of the reconstructed castle model.

5.2.2. Wadham College

In this example, we make use of an image sequence of 'Wadham College' (courtesy of the Visual Geometry Group at the University of Oxford). There are 5 images of size 1024 × 768. Fig. 4A shows the first image in the sequence and Fig. 4C shows the missing-data map. There are a total of 1331 3D object points and the percentage of missing data is 54.6%. Our method takes 93 iterations to converge in 194 s. The RMS 2D reprojection error for visible points is 0.227 pixels and the maximum 2D reprojection error is 1.031 pixels. After upgrading the projective space to Euclidean space, a view of the Euclidean reconstruction is shown in Fig. 4D. The top view given in Fig. 4E shows that the angle between the two front walls is approximately 90°.

5.2.3. Convergence on the 'Wadham College' image sequence

Fig. 4B shows the benefit of using γ to achieve convergence in the proposed method. For this purpose, we run Algorithm 2 with γ set to 0, which effectively removes the variance measure from (28). Without the variance measure, the depths show large fluctuations and the algorithm was unable to produce an acceptable solution before it eventually diverged.

5.3. Minimal configuration for reconstruction and level of missing data

The issue of the minimal data necessary for projective reconstruction when there are missing points in some views has been considered in [21]. These minimal data configurations require only 3 or 4 cameras and between 7 and 9 points, some of which are missing from some views. We have applied Algorithm 2 to these minimal data configurations. Since solutions in the case of minimal data are not necessarily unique (some configurations admit as many as 27 possible solutions), Algorithm 2 may converge to one of the many possible solutions, which may or may not match the ground truth. However, if the
number of points is increased (by just one point) above the minimal data so as to eliminate non-unique solutions, Algorithm 2 is capable of generating a solution matching the ground truth. These experiments suggest that the variance measure remains a valid means of preventing the divergence of depths even in cases of minimal data; the difficulty with minimal data configurations lies in the non-uniqueness of solutions, rather than in the variance measure. We have also run experiments in which Algorithm 2 is applied to the synthetic scene of Fig. 1A with varying levels of missing data. Empirical results suggest that Algorithm 2 is capable of obtaining a projective reconstruction of both the structure and the camera projection matrices at high levels (up to 50%) of missing data. We have applied Algorithm 2 to real sequences with several hundred points and up to 90% missing data, and the method still works.

Fig. 4. Wadham College image sequence. (A) First view of the Wadham College image sequence. (B) Convergence: RMS and maximum 2D reprojection errors (log10 pixels) against iterations (log scale), for adaptive γ and for γ = 0. (C) The distribution of 2D missing points. (D) A view of the Euclidean reconstruction. (E) Top view of 'Wadham College.'

6. Conclusion

In this paper, iterative solutions have been developed to estimate projective depths for multiple images in the case where not all points are visible in all views. A common problem of existing methods is that while the objective function converges to some optimal value, the reprojection error may grow steadily larger as the reconstruction approaches a view-deficient solution. We have addressed this issue by incorporating into the minimization problem a term that penalizes large variance of the depths. This ensures that while the objective function converges, the projective reconstruction also converges to a sensible solution. Furthermore, our method works for images containing missing data.

The proposed method evolved from many different approaches we have tried. For example, one way to ensure that problem (10) does not return a view-deficient solution is to impose the physical constraint that all points visible to a camera should be at some distance (greater than the focal length) from the camera centre. Hence, the depths of all visible points should be greater than a certain threshold, i.e., for all (i, j) ∈ A, λij > ε, where ε > 0 is a constant. As a result, (10) becomes a constrained least-squares problem, which is solvable by iterative methods. However, it is found that the depths in some of the views tend towards the minimum value ε (for the same reason that they would approach zero if unconstrained). These depths are subsequently stopped from further reduction by the constraint, and the solution to the minimization problem is then trapped at some local minimum on the boundary of the feasible set of depth values.

The proposed method does not require a good initial solution and is less prone to being trapped in local minima than bundle adjustment techniques. We applied bundle adjustment to the 'Wadham College' image sequence starting from the same initial condition as in the example of Section 5.2.2 and were unable to obtain any acceptable reconstruction. However, the solution obtained using the proposed method can be further refined using bundle adjustment [7,22,23] or other methods [24,25] which minimize the 2D reprojection error.

Acknowledgments

The work described in this paper was partially supported by a grant from the Research Grants Council of Hong
Kong Special Administrative Region, China (Project No. HKU7058/02E), and partially by the CRCG of the University of Hong Kong. We thank the reviewers for their valuable comments, which helped improve the paper in many respects.

References

[1] R.I. Hartley, Euclidean reconstruction from uncalibrated views, in: J. Mundy, A. Zisserman (Eds.), Applications of Invariance in Computer Vision, Lecture Notes in Computer Science, vol. 825, Springer-Verlag, 1993, pp. 237–256.
[2] O.D. Faugeras, Stratification of 3-dimensional vision: projective, affine, and metric representations, J. Opt. Soc. Am. A 12 (3) (1995) 465–484.
[3] M. Pollefeys, L.V. Gool, M. Proesmans, Euclidean 3D reconstruction from image sequences with variable focal lengths, in: Eur. Conf. on Computer Vision, 1996, pp. 31–42.
[4] P. Beardsley, A. Zisserman, D. Murray, Sequential updating of projective and affine structure from motion, Inter. J. Comput. Vis. 23 (3) (1997) 235–259.
[5] M. Pollefeys, L.V. Gool, Stratified self-calibration with the modulus constraint, IEEE Trans. Pattern Anal. Mach. Intell. 21 (8) (1999) 707–724.
[6] M. Han, T. Kanade, Scene reconstruction from multiple uncalibrated views, Tech. Rep. CMU-RI-TR-00-09, Robotics Institute, Carnegie Mellon University, 2000.
[7] B. Triggs, P. McLauchlan, R. Hartley, A. Fitzgibbon, Bundle adjustment—A modern synthesis, in: W. Triggs, A. Zisserman, R. Szeliski (Eds.), Vision Algorithms: Theory and Practice, Lecture Notes in Computer Science, vol. 1883, Springer-Verlag, 2000, pp. 298–375.
[8] G.Q. Chen, G.G. Medioni, Practical algorithms for stratified structure-from-motion, Image Vis. Comput. 20 (2002) 103–123.
[9] S. Mahamud, M. Hebert, Iterative projective reconstruction from multiple views, in: Inter. Conf. on Computer Vision and Pattern Recognition, vol. 2, 2000, pp. 430–437.
[10] P. Sturm, B. Triggs, A factorization based algorithm for multi-image projective structure and motion, in: Eur. Conf. on Computer Vision, Cambridge, England, 1996, pp. 709–720.
[11] B. Triggs, Factorization methods for projective structure and motion, in: Inter. Conf. on Computer Vision and Pattern Recognition, IEEE, San Francisco, 1996, pp. 845–851.
[12] T. Ueshiba, F. Tomita, A factorization method for projective and Euclidean reconstruction from multiple perspective views via iterative depth estimation, in: Eur. Conf. on Computer Vision, 1998, pp. 296–310.
[13] G. Chen, G.G. Medioni, Efficient iterative solutions to m-view projective reconstruction problem, in: Inter. Conf. on Computer Vision and Pattern Recognition, vol. II, 1999, pp. 55–61.
[14] A. Heyden, R. Berthilsson, G. Sparr, An iterative factorization method for projective structure and motion from image sequences, Image Vis. Comput. 17 (13) (1999) 981–991.
[15] P.M.Q. Aguiar, J.M.F. Moura, Weighted factorization, in: IEEE Inter. Conf. on Image Processing, Vancouver, BC, Canada, 2000.
[16] S. Mahamud, M. Hebert, Y. Omori, J. Ponce, Provably-convergent iterative methods for projective structure from motion, in: Inter. Conf. on Computer Vision and Pattern Recognition, Kauai, Hawaii, 2001, pp. 1018–1025.
[17] R.F.C. Guerreiro, P.M.Q. Aguiar, Factorization with missing data for 3D structure recovery, in: IEEE International Workshop on Multimedia Signal Processing MMSP'02, St. Thomas, USA, 2002.
[18] R.F.C. Guerreiro, P.M.Q. Aguiar, Estimation of rank deficient matrices from partial observations: two-step iterative algorithms, in: Energy Minimization Methods in Computer Vision and Pattern Recognition, Lecture Notes in Computer Science, Springer-Verlag, Lisboa, Portugal, 2003.
[19] A. Heyden, Projective structure and motion from image sequences using subspace methods, in: Scandinavian Conference on Image Analysis, Lappeenranta, 1997, pp. 963–968.
[20] M. Maruyama, S. Kurumi, Bidirectional optimization for reconstructing 3D shape from an image sequence with missing data, in: IEEE Inter. Conf. on Image Processing, vol. 3, 1999, pp. 120–124.
[21] L. Quan, A. Heyden, F. Kahl, Minimal projective reconstruction with missing data, in: Inter. Conf. on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, Fort Collins, Colorado, USA, 1999, pp. 210–216.
[22] D.D. Morris, K. Kanatani, T. Kanade, Uncertainty modeling for optimal structure from motion, in: Vision Algorithms: Theory and Practice, Lecture Notes in Computer Science, Springer, 1999.
[23] H.Y. Shum, Q. Ke, Z. Zhang, Efficient bundle adjustment with virtual key frames: a hierarchical approach to multi-frame structure from motion, in: Inter. Conf. on Computer Vision and Pattern Recognition, 1999.
[24] W.K. Tang, Y.S. Hung, A factorization-based method for projective reconstruction with minimization of 2D reprojection errors, in: L.V. Gool (Ed.), Proceedings of 24th Annual Pattern Recognition Symposium DAGM 2002, Lecture Notes in Computer Science, vol. 2449, Springer, Zurich, Switzerland, 2002, pp. 387–394.
[25] Y.S. Hung, W.K. Tang, Projective reconstruction from multiple views with minimization of 2D reprojection error, Inter. J. Comput. Vis., in press, doi:10.1007/s11263-005-3675-0.