Depth from Defocus via Discriminative Metric Learning

Qiufeng Wu^{1,2}, Kuanquan Wang^1, Wangmeng Zuo^1, and Yanjun Chen^1

^1 School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
^2 College of Science, Northeast Agricultural University, Harbin 150030, China
[email protected]

Abstract. In this paper, we propose a discriminative learning-based method for recovering the depth of a scene from multiple defocused images. The proposed method consists of a discriminative learning phase and a depth estimation phase. In the discriminative learning phase, we formalize depth from defocus (DFD) as a multi-class classification problem, which is solved by learning discriminative metrics from a synthetic training set by minimizing a criterion function. To enhance the discriminative and generalization performance of the learned metrics, the criterion takes both within-class and between-class variations into account and incorporates margin constraints. In the depth estimation phase, for each pixel, we compute the N discriminative functions and determine the depth level according to the minimum discriminant value. Experimental results on synthetic and real images show the effectiveness of our method in providing a reliable estimate of the depth of a scene.

Keywords: Depth estimation, Discriminative learning, Discriminative metric, Sub-gradient descent.
1 Introduction
Depth from defocus (DFD) infers the depth of each point in a scene from multiple defocused images captured with different camera settings. Compared with other image-based depth estimation approaches, e.g., depth from stereo and structure from motion, DFD avoids the correspondence problem [1]. Since the introduction of DFD for depth estimation [2], many DFD approaches have been developed, e.g., Markov random field-based approaches [3], spatial domain-based approaches [4], partial differential equation-based approaches [5, 6], and variational approaches [7–9]. All of these approaches face the problem of choosing an appropriate Point Spread Function (PSF) (e.g., a Gaussian or pillbox function) to model the imaging process. Favaro et al. [10, 11] bypass this problem with a learning-based approach, which learns, for each of a number of depths, a linear operator from a training set of blurred images at that depth using singular value decomposition (SVD). Nevertheless, this learning-based approach still has weaknesses. For example, the linear operator for a given depth is learned only from the training set of blurred images at that depth using SVD, without considering the training sets of blurred images at other depths. As a result, a new image at a given depth may fall into the null spaces of the linear operators of other depths.

In this paper, we propose a novel approach based on discriminative learning, which consists of a discriminative learning phase and a depth estimation phase. For simplicity, we assume that the surface of the scene is equifocal and that the set of admissible depth levels (classes) S = {s_1, s_2, ..., s_N} is finite. Under this hypothesis, DFD is formalized as multi-class classification. In the discriminative learning phase, the results of [10] and [11] are extended to obtain N discriminative functions, and a synthetic training set is used to learn the N discriminative metrics (the parameter values of the discriminative functions) by minimizing a criterion function, which is solved by the sub-gradient descent method [12]. In the depth estimation phase, we compute the values of the N discriminative functions at each pixel and select the class label (depth level) corresponding to the minimum discriminant value. Once the discriminative metrics are learned from the labeled training set, they can be used to estimate the depth of various scenes. Since the whole computation involves only simple matrix-vector multiplications and can be performed independently at each pixel, allowing a high level of parallelism, the algorithm can be executed efficiently. Experimental results on synthetic and real images show that the estimated depth map of the scene is significantly more accurate and robust than that of the state-of-the-art method.

B.-L. Lu, L. Zhang, and J. Kwok (Eds.): ICONIP 2011, Part III, LNCS 7064, pp. 676–683, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Related Work
Discriminative learning approaches have recently received much attention and have been used in face identification [13], 3D scan data segmentation [14], and image annotation [15], among others. In [11], a learning approach is applied to DFD: Favaro and Soatto [11] select the equifocal imaging model and the least-squares criterion to formulate DFD as an optimal inference problem. In Hilbert space, the least-squares criterion is stated as:

φ(s, r) = ‖I − H_s r‖²    (1)

where s and r are the depth and radiance of the scene respectively, and H_s : L²(R²) → R^P is the linear operator such that H_s r ≐ ⟨h_v^s(·, y), r⟩ = ∫ h_v^s(y, x) r(x) dx. The observed image I = [I_1; I_2; ...; I_K] ∈ R^P is a column vector of dimension P = MNK obtained by stacking K images of the same scene, captured with K different parameter settings v = [v_1, v_2, ..., v_K]^T, on top of each other, and h_v^s = [h_{v_1}^s, h_{v_2}^s, ..., h_{v_K}^s]^T is the PSF corresponding to each parameter setting. Using the geometry of operators in Hilbert spaces, Favaro et al. [11] proved that the extremum of (1) is identical to the extremum of the following function:

ψ(s) = ‖H_s^⊥ I‖²    (2)

where H_s^⊥ is the orthogonal operator.
If the surface is constrained to an N-dimensional set of admissible depth levels and the orthogonal operators H_s^⊥ corresponding to each depth level are known, the discriminative function for each depth level is determined. As a result, for each point in the measured images, the class (depth level) corresponding to the smallest discriminant value is selected by computing the discriminative functions (2). Favaro et al. [11] learn the orthogonal operator H_s^⊥ for each depth level from a training set by SVD; for a more detailed description of the learning procedure, please refer to [11]. A basic limitation of using SVD to obtain the orthogonal operators is that the operator H_ŝ^⊥ obtained by SVD at a given depth level ŝ can only guarantee that ‖H_ŝ^⊥ I_ŝ‖² is small for defocused images I_ŝ at depth level ŝ; it cannot guarantee that ‖H_ŝ^⊥ I_ŝ‖² is smaller than ‖H_s̃^⊥ I_ŝ‖² for some other depth level s̃. Consequently, the defocused images I_ŝ may easily be assigned the wrong depth level s̃. To avoid this scenario, we propose to learn the discriminative metrics by minimizing a criterion function that takes both within-class and between-class variations into account.
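To make the baseline concrete, the SVD-based learning step can be sketched as follows. This is a minimal illustration in the spirit of [10, 11], not the authors' exact procedure: training patches of one depth level are stacked as columns of a matrix, and H_s^⊥ is taken as the projector onto the orthogonal complement of the dominant rank-r subspace of that matrix. The function name, toy dimensions, and random data are all illustrative assumptions.

```python
import numpy as np

def learn_orthogonal_operator(patches, rank):
    """Learn an orthogonal operator for one depth level from training patches.

    patches: (P, n) matrix whose columns are stacked defocused patches of the
    same depth level; rank: truncation rank r. Returns the (P, P) projector
    onto the orthogonal complement of the rank-r dominant subspace.
    """
    U, _, _ = np.linalg.svd(patches, full_matrices=True)
    U_perp = U[:, rank:]          # directions the training data barely excites
    return U_perp @ U_perp.T      # H_s^perp = U_perp U_perp^T

# Toy usage: random "training patches" for one depth level.
rng = np.random.default_rng(0)
A = rng.standard_normal((49, 30))   # e.g. P = 49 for a single stacked 7x7 window
H_perp = learn_orthogonal_operator(A, rank=25)
# psi(s) = ||H_s^perp I||^2 should be small for images consistent with this depth.
I = A[:, 0]
print(np.linalg.norm(H_perp @ I) ** 2)
```

Note that the projector is built from one depth level in isolation, which is precisely why ‖H_ŝ^⊥ I_ŝ‖² being small says nothing about its size relative to ‖H_s̃^⊥ I_ŝ‖² at other depths.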
3 Learning Discriminative Metrics for DFD
In this section, we first present the core model for discriminative metric learning, then describe the sub-gradient descent algorithm for learning the discriminative metrics, and finally provide the procedure for depth estimation.

3.1 Problem Formalization
As discussed in Sections 1 and 2, DFD can be formalized as multi-class classification. Taking the size N of the finite set of admissible depth levels S = {s_1, s_2, ..., s_N} as the number of categories, we extend (2) to a discriminative function for each depth level. Using the notation of Euclidean and Mahalanobis distances, the N discriminative functions can be written as:

g_{s_i}(I) = ‖L_i I‖² = I^T M_i I,  i = 1, 2, ..., N    (3)

where I is an unclassified sample image represented as a column vector of dimension P (as in Section 2), L_i is a matrix of size P × P, and M_i = L_i^T L_i. Since L_i and M_i appear in the discriminative function, they are called the discriminative metric and the Mahalanobis discriminative metric at depth level s_i, respectively. For all admissible depths, if L = (L_1, L_2, ..., L_N) or M = (M_1, M_2, ..., M_N) is known, the class of I corresponding to the smallest discriminant value is selected by computing (3). The remainder of this section shows how to construct the criterion function and learn the N discriminative metrics for DFD. According to (3), for an arbitrary training sample I_ij (the jth training sample at the ith depth level s_i), if I_ij is to be classified into depth level s_i correctly, the value of ‖L_i I_ij‖² must be smaller than ‖L_k I_ij‖² for all k ≠ i, that is, ‖L_i I_ij‖² ≤ ‖L_k I_ij‖² (∀k ≠ i).
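Evaluating (3) and picking the minimizing depth label is straightforward once the metrics are known. The following is a minimal sketch, assuming the metrics M_i have already been learned; the diagonal toy metrics and function name are purely illustrative.

```python
import numpy as np

def classify_depth(I, metrics):
    """Return the index i minimizing g_i(I) = I^T M_i I.

    I: stacked observation vector of dimension P.
    metrics: list of (P, P) Mahalanobis discriminative metrics M_i.
    """
    scores = [I @ M @ I for M in metrics]
    return int(np.argmin(scores))

# Toy example with two hypothetical depth levels: M0 nearly annihilates the
# first coordinate direction, M1 the second.
M0 = np.diag([0.01, 1.0, 1.0])
M1 = np.diag([1.0, 0.01, 1.0])
I = np.array([5.0, 0.1, 0.2])       # lies mostly in the direction M0 suppresses
print(classify_depth(I, [M0, M1]))  # -> 0
```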
Moreover, to increase the robustness of classification, a large margin constraint is incorporated: ‖L_k I_ij‖² ≥ ‖L_i I_ij‖² + 1 (∀k ≠ i). Obviously, if (‖L_i I_ij‖² + 1) − ‖L_k I_ij‖² (∀k ≠ i, ∀i, j) were all below zero, robust and perfect classification accuracy would be obtained. Therefore, to obtain higher classification accuracy, the N discriminative metrics can be obtained by minimizing the following criterion function:

Σ_{i,j,k} (1 − y_ik)[1 + ‖L_i I_ij‖² − ‖L_k I_ij‖²]_+    (4)

where [z]_+ = max(z, 0), and the indicator variable y_ik = 1 if and only if i = k, otherwise y_ik = 0. Additionally, following Favaro's work [11], the training set at any depth level should lie in the null space of the corresponding discriminative metric. In other words, for any training sample I_ij at depth level s_i, the value of ‖L_i I_ij‖² should be as small as possible. Therefore, we introduce a further term into (4) and obtain the following criterion function:

ε(L) = (1 − μ) Σ_{i,j} ‖L_i I_ij‖² + μ Σ_{i,j,k} (1 − y_ik)[1 + ‖L_i I_ij‖² − ‖L_k I_ij‖²]_+    (5)
where μ ∈ [0, 1] is a balance parameter trading off the two terms. The criterion function in (5) is multi-modal and non-convex in the matrix elements of L. To avoid these difficulties, we reformulate the optimization of (5) as a semi-definite program (SDP). We first convert (5) into a criterion function with respect to M:

ε(M) = (1 − μ) Σ_{i,j} I_ij^T M_i I_ij + μ Σ_{i,j,k} (1 − y_ik)[1 + I_ij^T M_i I_ij − I_ij^T M_k I_ij]_+    (6)
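The criterion (6) is a sum of a within-class (pull) term and a hinged between-class (push) term, and can be evaluated directly. The sketch below assumes the training samples are grouped by depth index; the helper name and toy data are illustrative.

```python
import numpy as np

def criterion(M, samples, mu=0.5):
    """Evaluate the criterion (6) for Mahalanobis metrics M.

    M: list of N matrices of size (P, P); samples: dict mapping depth
    index i to a list of training vectors I_ij; mu: balance parameter.
    """
    pull = 0.0   # within-class term: keep I_ij^T M_i I_ij small
    push = 0.0   # between-class hinge term with unit margin
    for i, vecs in samples.items():
        for I in vecs:
            d_i = I @ M[i] @ I
            pull += d_i
            for k in range(len(M)):
                if k == i:
                    continue  # (1 - y_ik) zeroes out the k == i terms
                push += max(0.0, 1.0 + d_i - I @ M[k] @ I)
    return (1.0 - mu) * pull + mu * push

# Toy example: two depth levels, one training vector each.
M = [np.diag([0.01, 1.0]), np.diag([1.0, 0.01])]
samples = {0: [np.array([1.0, 0.0])], 1: [np.array([0.0, 1.0])]}
print(criterion(M, samples, mu=0.5))   # small: both samples nearly satisfy the margin
```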
Furthermore, we convert (6) into an SDP by introducing nonnegative slack variables {ξ_ijk} for all triplets that violate the large margin constraint. The SDP is stated as:

min (1 − μ) Σ_{i,j} I_ij^T M_i I_ij + μ Σ_{i,j,k} (1 − y_ik) ξ_ijk    (7)

s.t.  I_ij^T M_k I_ij − I_ij^T M_i I_ij ≥ 1 − ξ_ijk,  ξ_ijk ≥ 0,  M_i ⪰ 0.    (8)

3.2 Sub-gradient Descent
The core of learning the discriminative metrics is to minimize the criterion in (6). Here we adopt the sub-gradient descent method [12], which converges to the correct solution provided the gradient step size is sufficiently small. At each iteration, M is moved along the negative gradient of ε(M) in (6) to reduce the criterion function, and the result is then projected onto the cone of positive semi-definite matrices. The key step of the sub-gradient descent method is gradient computation. Let C_ij = I_ij I_ij^T, and define two sets of triples Φ^t and Ψ^t. Both (i, j, k) ∈ Φ^t and (i, j, k) ∈ Ψ^t index triplets that trigger the second term of (5) and (6), but (i, j, k) ∈ Φ^t means that the large margin constraint 1 + ‖L_i I_ij‖² ≤ ‖L_k I_ij‖² does not hold, while (i, j, k) ∈ Ψ^t means that the large margin constraint 1 + ‖L_k I_kj‖² ≤ ‖L_i I_kj‖² does not hold. With these definitions, at the tth iteration the gradient of ε(M^t) with respect to M_i^t can be written as:

∂ε/∂M_i^t = (1 − μ) Σ_j C_ij + μ ( Σ_{(i,j,k)∈Φ^t} C_ij − Σ_{(i,j,k)∈Ψ^t} C_kj )    (9)
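One iteration — a step along the negative gradient (9) followed by projection onto the positive semi-definite cone — might be sketched as below. This assumes the triggered-triplet sets Φ^t and Ψ^t have already been identified and their outer products collected into lists; the function names, step size, and toy data are illustrative.

```python
import numpy as np

def psd_project(M):
    """Project a symmetric matrix onto the positive semi-definite cone
    by clipping negative eigenvalues to zero."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    return (V * np.clip(w, 0.0, None)) @ V.T

def subgradient_step(M_i, C_same, C_push, C_pull, mu=0.5, lr=1e-3):
    """One update of M_i following (9), then PSD projection.

    C_same: list of C_ij = I_ij I_ij^T for the samples at depth i;
    C_push: outer products from triggered triplets in Phi^t (added);
    C_pull: outer products from triggered triplets in Psi^t (subtracted).
    """
    grad = (1.0 - mu) * sum(C_same) + mu * (sum(C_push) - sum(C_pull))
    return psd_project(M_i - lr * grad)

# Toy usage with a single outer product in each role.
rng = np.random.default_rng(1)
v = rng.standard_normal(3)
C = np.outer(v, v)
M_new = subgradient_step(np.eye(3), [C], [C], [C], mu=0.5, lr=1e-2)
```

The eigenvalue-clipping projection is the standard Euclidean projection onto the PSD cone used in [12].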
To accelerate the sub-gradient descent method, the results obtained by SVD are used as the starting point of the SDP. For a more detailed description of the sub-gradient descent method, please refer to [12].

3.3 Recovery of Depth
After L or M is obtained, (3) is used to estimate the depth of the scene. Given a collection of K defocused images of the scene, for each pixel x, the patch centered at x is extracted from each defocused image, and a column vector I(x) = [I_1(x); I_2(x); ...; I_K(x)] is obtained by stacking these patches on top of each other. We then compute the N discriminative functions according to (3). At each pixel, the depth level corresponding to the minimum value of the discriminative functions is taken as the estimated depth level s^*:

s^*(x) = arg min_{s_i ∈ S} g_{s_i}(I(x))    (10)
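The per-pixel procedure above can be sketched as follows, under the assumption that the K defocused images are grayscale arrays of equal size and the metrics act on stacked patch vectors of dimension K · patch²; the function name and toy data are illustrative.

```python
import numpy as np

def depth_map(images, metrics, patch=7):
    """Per-pixel depth recovery following (10).

    images: list of K defocused (H, W) arrays of the same scene;
    metrics: list of N Mahalanobis metrics of size (K*patch^2, K*patch^2).
    Returns an (H - patch + 1, W - patch + 1) map of depth-level indices.
    """
    r = patch // 2
    H, W = images[0].shape
    out = np.zeros((H - patch + 1, W - patch + 1), dtype=int)
    for y in range(r, H - r):
        for x in range(r, W - r):
            # Stack the K patches centered at (y, x) into one column vector.
            I = np.concatenate([im[y - r:y + r + 1, x - r:x + r + 1].ravel()
                                for im in images])
            scores = [I @ M @ I for M in metrics]
            out[y - r, x - r] = int(np.argmin(scores))
    return out

# Toy usage: two 9x9 "defocused images" and two trivial metrics. Since
# g_0(I) = ||I||^2 < g_1(I) = 2||I||^2 everywhere, every pixel gets label 0.
imgs = [np.ones((9, 9)), np.linspace(0.0, 1.0, 81).reshape(9, 9)]
metrics = [np.eye(98), 2.0 * np.eye(98)]
dm = depth_map(imgs, metrics)
```

Because each pixel is processed independently with a few matrix-vector products, the loop parallelizes trivially, which is the source of the efficiency claimed in Section 1.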
4 Experimental Results

4.1 Experiments with Synthetic Images
In the discriminative learning phase, 51 synthetic scenes are obtained by placing a randomly generated synthetic scene of 100 × 100 pixels at 51 equidistant positions in the range from 520 mm to 850 mm in front of a camera with a 35 mm lens at F-number 4. For each scene, two defocused images are captured by bringing the planes at 520 mm and 850 mm into focus, respectively. 196 patches of 7 × 7 pixels are collected from each pair of defocused images, which together constitute a training set of 196 × 51 labeled samples. The discriminative metrics learned from this training set are used on two synthetic test sets to obtain Fig. 1 and Fig. 2. Fig. 1 shows the depth estimation performance when we use the operators learned from the training set with truncation ranks of 45, 60, 70, 80, and 90 by SVD and by the proposed algorithm, respectively, on synthetic images. Both the mean and
Fig. 1. Performance test for SVD and the proposed algorithm. Top (a.1–e.1): results using the orthogonal operators learned from the training set with truncation ranks of 45, 60, 70, 80, and 90 by SVD, applied to the test set. Bottom (a.2–e.2): results using the discriminative metrics learned from the training set by the proposed algorithm, applied to the test set. Both the mean and the standard deviation of the estimated depth are plotted over the ideal curve.
standard deviation of the estimated depth (solid line) are plotted over the ideal characteristic curve (dotted line). Notice that, for SVD, when the chosen rank does not match the correct rank of the matrices, the performance degrades rapidly, and the root mean square (RMS) error varies from 1.7 mm to 36.1 mm. For the proposed algorithm, in contrast, this issue does not arise, and the RMS error varies only from 1.3 mm to 7.0 mm. Not having to select a rank is therefore one of the most important merits of the proposed algorithm. To evaluate the performance of the proposed algorithm further, we synthesize a set of two novel images of 51 × 2601 pixels that are segmented into horizontal stripes of 51 × 51 pixels (see Fig. 2). Each stripe has been generated with the same random radiance but with equifocal planes at decreasing depths, moving from the top to the bottom of the image. Fig. 2 shows the depth map and the mesh of the reconstructed depths obtained by the proposed algorithm.

4.2 Experiment with Real Images
In the experiment with real data, the training set consists of 196 × 51 labeled samples, captured by a camera with a 12 mm lens at F-number 2. The far- and near-focused planes are identical to those in Section 4.1. The discriminative metrics learned from this training set are used on Fig. 3(a) to obtain Fig. 3(b). Fig. 3 shows the two defocused images and the depth map before and after median filtering. The discriminative metrics learned by SVD and by our proposed algorithm in Section 4.1 are used on Fig. 4(a) to obtain Fig. 4(b) and Fig. 4(c), respectively. Fig. 4 shows that the depth map estimated by SVD is very similar to that obtained by our proposed algorithm.
Fig. 2. Performance test for the proposed algorithm with synthetic data. Left: two synthetic defocused images. The surface in the scene is a piecewise constant function (a stair) such that each horizontal stripe of 51 × 51 pixels corresponds to an equifocal plane; depth levels decrease from the top to the bottom of the images. Middle: the depth map of the stair scene before and after 7 × 7 pixel median filtering, respectively. Right: mesh of the reconstructed depth of the stair scene.
Fig. 3. (a) Detail of two 240 × 320 defocused images. For more details on the scene and camera settings, please refer to [6]; (b) and (c) Depth map recovered by the proposed algorithm before and after post-processing respectively.
Fig. 4. (a) Detail of two 238 × 205 defocused images. For more details on the scene and camera settings, please refer to [11]; (b) and (c) depth maps recovered by SVD and by our proposed algorithm, respectively.
5 Conclusion
We present a novel method for recovering the depth of a scene from multiple defocused images, which consists of a discriminative learning phase and a depth estimation phase. Experimental results on synthetic and real images demonstrate the effectiveness of our method in providing a reliable estimate of the depth
of a scene. The method is robust, and the discriminative metrics learned from synthetic defocused images can be used effectively on real images. The method involves only simple matrix-vector multiplications in the depth estimation phase, which can be performed independently at each pixel; the algorithm therefore offers a high level of parallelism and real-time performance.

Acknowledgments. This work was supported in part by the Natural Science Foundation of China (NSFC) under Contracts No. 60872099 and No. 60902099.
References

1. Schechner, Y.Y., Kiryati, N.: Depth from defocus vs. stereo: how different really are they? Int. J. Comput. Vision 39(1), 141–162 (2000)
2. Pentland, A.: A new sense for depth of field. IEEE Trans. Pattern Anal. Mach. Intell. 9(4), 523–531 (1987)
3. Rajagopalan, A.N., Chaudhuri, S.: An MRF model-based approach to simultaneous recovery of depth and restoration from defocused images. IEEE Trans. Pattern Anal. Mach. Intell. 21(7), 577–589 (1999)
4. Ziou, D., Deschenes, F.: Depth from defocus estimation in spatial domain. Computer Vision and Image Understanding 81, 143–165 (2001)
5. Namboodiri, V.P., Chaudhuri, S.: On defocus, diffusion and depth estimation. Pattern Recognition Letters 28, 311–319 (2007)
6. Favaro, P., Soatto, S., Burger, M., Osher, S.J.: Shape from defocus via diffusion. IEEE Trans. Pattern Anal. Mach. Intell. 30(3), 518–531 (2008)
7. Rajagopalan, A.N., Chaudhuri, S.: A variational approach to recovering depth from defocused images. IEEE Trans. Pattern Anal. Mach. Intell. 19(10), 1158–1164 (1997)
8. Jin, H., Favaro, P.: A variational approach to shape from defocus. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 18–30. Springer, Heidelberg (2002)
9. Favaro, P.: Recovering thin structures via nonlocal-means regularization with application to depth from defocus. In: CVPR, pp. 1133–1140 (2010)
10. Favaro, P., Soatto, S.: Learning shape from defocus. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 735–745. Springer, Heidelberg (2002)
11. Favaro, P., Soatto, S.: A geometric approach to shape from defocus. IEEE Trans. Pattern Anal. Mach. Intell. 27(3), 406–417 (2005)
12. Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10, 207–244 (2009)
13. Guillaumin, M., Verbeek, J., Schmid, C.: Is that you? Metric learning approaches for face identification. In: ICCV, pp. 498–505 (2009)
14. Anguelov, D., Taskar, B., Chatalbashev, V.: Discriminative learning of Markov random fields for segmentation of 3D scan data. In: CVPR, pp. 169–176 (2005)
15. Guillaumin, M., Mensink, T., Verbeek, J., Schmid, C.: TagProp: discriminative metric learning in nearest neighbour models for image auto-annotation. In: ICCV, pp. 309–316 (2009)