Robust contrast invariant stereo correspondence

Abhijit S. Ogale and Yiannis Aloimonos
Center for Automation Research, Dept. of Computer Science
University of Maryland at College Park, College Park, MD 20742, USA
{ogale,yiannis}@cfar.umd.edu

Abstract— A stereo pair of cameras attached to a robot will inevitably yield images with different contrast. Even if we assume that the camera hardware is identical, the slightly different points of view cause different amounts of light to enter the two cameras, so that dynamically adjusted internal parameters such as aperture, exposure and gain will differ. Due to the difficulty of obtaining and maintaining precise intensity or color calibration between the two cameras, contrast invariance becomes an extremely desirable property of stereo correspondence algorithms. The problem of achieving point correspondence between a stereo pair of images is often addressed by using intensity or color differences as a local matching metric, which is sensitive to contrast changes. We present an algorithm for contrast invariant stereo matching which relies on multiple spatial frequency channels for local matching. A fast global framework uses the local matching to compute the correspondences and find the occlusions. We demonstrate that the use of multiple frequency channels allows the algorithm to yield good results even in the presence of significant amounts of noise.

Index Terms— stereo, contrast invariance, Gabor, diffusion, occlusions

I. INTRODUCTION

There has been significant progress in the understanding of the stereo correspondence problem. Scharstein and Szeliski [1] have provided an exhaustive comparison of the best dense stereo correspondence algorithms. Local measurements such as image intensity (or color) are generally utilized, and information is aggregated from multiple pixels using smoothness constraints. The simplest method of aggregation is to minimize the matching error within rectangular windows of fixed size [2]. Better approaches utilize multiple windows [3], [4], adaptive windows [5], [6] which change their size in order to minimize the error, shiftable windows [7], [8], or predicted windows [9], all of which give performance improvements at discontinuities. Global approaches rely on the extremization of a global energy function, which includes terms for local matching (the 'data term'), additional smoothness terms, and in some cases, penalties for occlusions. Depending on the form of the energy function, the most efficient energy minimization scheme can be chosen. These include dynamic programming [10], simulated annealing [11], [12], relaxation labeling [13], non-linear diffusion [14], maximum flow [15] and graph cuts [16], [17]. Recently, mutual information has also been used [18] for contrast invariant matching. Some of these algorithms treat the images symmetrically and explicitly deal with occlusions (e.g. [17]). Egnal and Wildes [19] have provided comparisons of various approaches for finding occlusions.

It is a well-known fact that human observers can easily perceive depth even when the contrast of the image seen by one eye is quite different from the contrast of the image seen by the other eye [20]. Such an ability is crucial for robots which employ stereo cameras as well, since the images obtained by the two cameras will inevitably differ in contrast. Due to the different viewpoints of the two cameras, different amounts of light typically enter each camera, causing internal parameters such as aperture, exposure time and gain to be adjusted differently. (For example, one camera may have a bright light in its field of view while the other does not.) Even if these parameters are adjusted accurately, differences in the sensors themselves can lead to differences in contrast. Obtaining and maintaining precise intensity or color calibration between the two cameras is an extremely difficult task, especially in a practical application.

Many of the best algorithms in the literature rely on the equality of intensity or color of corresponding points in the two images from a stereo system. Techniques which compute disparity by making use of phase differences explicitly [21], [22], [23], [24], [25] or implicitly via phase correlation [26] do not have this limitation, but often lack a global framework for deciding the correspondences and occlusions. In this paper, we develop an approach for contrast invariant matching which also relies on multiple frequency channels and phase differences, but unlike these algorithms, we use them only as a local measure of matching. A separate global framework, more in the spirit of the machine vision approaches mentioned earlier, is used to find the final correspondences and the occlusions.

One advantage of this approach is that since we are not using the phase to explicitly compute the disparity, but only as a local matching metric, we do not need filters with large spatial extent in order to deal with large disparities. This allows us to find boundaries which are clearly better than those from phase-based methods. We also show that the use of multiple frequency channels not only makes this technique contrast invariant, but also robust to the addition of significant amounts of noise in one of the channels.

II. LOCAL MATCHING

Let us examine the stereo matching problem in one dimension. This is possible since we use rectified images
as an input, and the only disparity present is the horizontal disparity. Thus, given two corresponding scanlines Il(x) and Ir(x) from the left and right images respectively, our task is to assign a horizontal disparity d to each pixel, where d ∈ {d1, d2, ..., dn}, and the range of possible disparities is specified as an input. We begin by applying complex-valued Gabor filters of the form gx0,ω(x) to both images, where the filter is centered at x0 in space and at ω in the frequency domain. For example, the filter centered at x0 = 0 having a frequency ω is given by:

g0,ω(x) = e^(−x²/2σ²) e^(iωx)    (1)

The real and imaginary parts of this complex-valued filter form a quadrature pair (see Figure 1). We select σ to ensure a constant one-octave bandwidth in the frequency domain.

Fig. 1. Real and imaginary parts of a complex Gabor filter in two dimensions with ω in the horizontal direction.

The space-frequency domain can be sampled by a complete set of functions obtained by translation and scaling of this basic filter. In the two-dimensional case (which we use in our implementation), rotation is also present, since the filters are oriented. The output of the filter with frequency ω (written as a convolution) on the left image is denoted by

Lω(x) = g0,ω(−x) ⊗ Il(x)    (2)

and on the right image by

Rω(x) = g0,ω(−x) ⊗ Ir(x)    (3)

Now assume that we are dealing with some disparity candidate d. If the phase difference between the right and left responses for a filter with frequency ω at position x and disparity d is denoted by ∆φω,d(x), then

e^(i∆φω,d(x)) = Lω(x)Rω*(x+d) / |Lω(x)Rω*(x+d)|    (4)

Notice that by using Rω*(x+d), we are explicitly shifting the right image by the candidate disparity d before taking the product; hence, if the left and right images locally match for this disparity d in this frequency channel, we would expect the phase difference to be zero. As we shall see below, the deviation of the phase difference from zero can be used implicitly as a measure of local matching. This is unlike previous approaches which utilize phase, since they use the phase differences either explicitly or through correlation to directly find the unknown disparity. Fleet [26] discusses how phase correlation can be thought of as a voting scheme, such that when we take the inverse Fourier transform, each channel casts a vote in a sinusoidal manner in the spatial domain. The inverse Fourier transform using the outputs of filters centered at a spatial position x0 is given by (ignoring the spatial extents of the filters for the moment):

Fx0,d(x) = ∫ e^(i∆φω,d(x0)) e^(iω(x−x0)) dω    (5)

Ideally, the real parts of all the sinusoids would sum up to create a single peak at a certain position, and the imaginary parts would all cancel out. To find the degree of local matching for a pixel x0 for the disparity d, we want to measure the likelihood that this peak lies at the center position of the applied filters, i.e. at x = x0. To achieve this, we can simply use the real part of the function Fx0,d(x) evaluated at x = x0 as a measure of the likelihood that the peak lies at x = x0. Thus, if we are applying N filters to the images, and the phase difference at location x from channel ω for disparity d is denoted by ∆φω,d(x) as derived in (4), then as per the above discussion, we can define a function which sums the real parts of the inverse Fourier transform in the discrete case:

H(x,d) = (1/N) Σ_{N channels} cos(∆φω,d(x))    (6)

Note that the factor 1/N is used to ensure that H(x,d) has the same range as the cosine function, i.e. [−1, 1]. Since phase relationships can become unreliable if the power in the selected frequency channels is close to zero, we define another function J(x,d) as follows:

J(x,d) = (1 − W(x,d)) · H(x,d) + W(x,d) · 1    (7)

W(x,d) = exp(−αP(x,d))    (8)

P(x,d) = Σω |Lω(x)Rω*(x+d)|    (9)

Here, P(x,d) denotes the sum of the magnitudes (power) of all the filter response products, and W(x,d) is a weight which exponentially decays to zero as P(x,d) increases.
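As a concrete illustration of this local measure, here is a minimal 1-D numpy sketch. It is not the paper's implementation (which uses a 2-D oriented filter bank [28]); the filter parameters, the helper names `gabor_bank` and `local_match`, and the circular shift via `np.roll` (which wraps at image boundaries) are our own simplifying assumptions.

```python
import numpy as np

def gabor_bank(sigma_omega_pairs, width=16):
    """1-D complex Gabor filters g_{0,w}(x) = exp(-x^2/2s^2) exp(iwx), as in eq. (1)."""
    x = np.arange(-width, width + 1)
    return [np.exp(-x**2 / (2.0 * s**2)) * np.exp(1j * w * x)
            for s, w in sigma_omega_pairs]

def local_match(Il, Ir, disparities, filters, alpha=1.0):
    """Contrast invariant local measure M(x, d) built from eqs. (4)-(10).
    Returns an array of shape (len(disparities), len(Il))."""
    N = len(filters)
    # Channel responses L_w(x), R_w(x), eqs. (2)-(3).
    L = [np.convolve(Il, g, mode='same') for g in filters]
    R = [np.convolve(Ir, g, mode='same') for g in filters]
    M = np.zeros((len(disparities), len(Il)))
    for k, d in enumerate(disparities):
        # L_w(x) R_w*(x + d); np.roll wraps around, a sketch-level shortcut.
        prods = [l * np.conj(np.roll(r, -d)) for l, r in zip(L, R)]
        # H(x, d): mean cosine of the phase differences over the N channels, eq. (6).
        H = np.mean([np.cos(np.angle(p)) for p in prods], axis=0)
        P = np.sum([np.abs(p) for p in prods], axis=0)   # channel power, eq. (9)
        W = np.exp(-alpha * P)                           # reliability weight, eq. (8)
        J = (1.0 - W) * H + W * 1.0   # eq. (7): default to a match when power is low
        M[k] = (J + 1.0) / 2.0        # eq. (10): map [-1, 1] onto [0, 1]
    return M
```

Because only phase differences enter H(x, d), scaling one image by a constant leaves the measure unchanged, which is the contrast invariance this section is after.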

Hence, as P(x,d) tends to zero, the weighting function reduces the importance of the phase difference function H(x,d) and forces J(x,d) closer to the value 1, which indicates good matching. Thus, if the phase responses are unreliable due to the nonexistence of local variations in the intensity, we take the default position that there exists a local match. Note that J(x,d) also lies in the range [−1, 1]. Since we are interested in the local probability that a pixel x in the left image matches the pixel x + d in the right image for disparity d, we would like a function with values in the range [0, 1]. Hence, we define the function

M(x,d) = (J(x,d) + 1) / 2    (10)

where M(x,d) lies in the range [0, 1] and is a local measure of the probability that pixel x matches for disparity d.

III. GLOBAL ALGORITHM

A. Fast diffusion along scanlines

Having obtained a contrast invariant measure of local matching, we shall now proceed to develop a fast global process which decides the final correspondences and finds the occlusions. Given two scanlines I1(x) and I2(x), assume that we wish to compute a goodness measure G(x,d) for each pixel x and each disparity d, such that for each x, we can choose the disparity d which maximizes G(x,d). First, we need a method of performing local matching, so that we can extract a local measure M(x,d) which tells us how well pixels I1(x) and I2(x+d) match for the disparity d. This measure M(x,d) may be I1(x)I2(x+d) (correlation), |I1(x) − I2(x+d)| < t (thresholded absolute intensity differences), or our contrast invariant measure defined above in (10). We also define a conductivity C(x,d) at each pixel for every disparity, which allows us to propagate the influence of a pixel to its neighbors in the following manner:

1) A pixel is influenced by pixels on its left (via GLeft(x,d)) and pixels on its right (via GRight(x,d)).
2) From the left to the right, we can compute GLeft(x,d) as follows:

GLeft(x,d) = GLeft(x−1,d)·C(x,d) + M(x,d)    (11)

3) Similarly, from the right to the left,

GRight(x,d) = GRight(x+1,d)·C(x,d) + M(x,d)    (12)

4) Hence, the total influence from the left and the right is

G(x,d) = GLeft(x,d) + GRight(x,d) − M(x,d)    (13)

Note that in (13), we subtract M(x,d) since it was counted twice, once in GLeft and once in GRight. Note that like M(x,d), the conductivity C(x,d) is also a local metric. This diffusion process is fast since it involves one left-to-right pass to compute GLeft and one right-to-left pass to compute GRight, and is not iterative. To achieve contrast invariant matching, we use M(x,d) from (10) and set

C(x,d) = M(x,d)    (14)

Fig. 2. Sample computation of a measure of the influence of matching pixels on each other for some toy values of M(x,d) and C(x,d):
(A) Local matches M(x,d):                                          1 1 1 0 1 1 1 1
(B) Conductivity C(x,d) = M(x,d):                                  1 1 1 0 1 1 1 1
(C) Left to right, GLeft(x,d) = GLeft(x−1,d)·C(x,d) + M(x,d):      1 2 3 0 1 2 3 4
(D) Right to left, GRight(x,d) = GRight(x+1,d)·C(x,d) + M(x,d):    3 2 1 0 4 3 2 1
(E) Total influence, G(x,d) = GLeft(x,d) + GRight(x,d) − M(x,d):   3 3 3 0 4 4 4 4
Notice how a matching pixel influences another matching pixel by propagating its influence through intermediate matching pixels. In [27], we have shown that this diffusion process is a generalization of a previously developed connected components approach for computing the disparity.

Figure 2 shows an example computation of GLeft(x,d), GRight(x,d) and G(x,d) for given values of M(x,d) and C(x,d). In [27], we have shown that this diffusion process is a generalization of a previously developed connected components approach for computing the disparity. This relationship can be easily seen in Figure 2 (by deliberately choosing the input M(x,d) to contain only zeros and ones in this example), since the final result G(x,d) is such that each pixel has been assigned a value which equals the size of the connected component containing it; e.g., for the group of three one-valued pixels on the left, the assigned value of G(x,d) is three, and for the group of four one-valued pixels on the right, it is four. In reality, the values of M(x,d) and C(x,d) will not be either 0 or 1 but will lie in the range [0, 1]; still, the relationship with connected components is useful because it gives us an intuitive idea of the general process.
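The two diffusion passes and the combination step (11)-(13) are simple enough to state directly in code. This is a minimal sketch; the function name `diffuse_scanline` and the dense (disparity × pixel) array layout are our own choices.

```python
import numpy as np

def diffuse_scanline(M, C):
    """Non-iterative diffusion along a scanline, eqs. (11)-(13).
    M, C: arrays of shape (num_disparities, width) holding M(x, d) and C(x, d)."""
    G_left = np.zeros_like(M, dtype=float)
    G_right = np.zeros_like(M, dtype=float)
    width = M.shape[1]
    for x in range(width):                        # left-to-right pass, eq. (11)
        prev = G_left[:, x - 1] if x > 0 else 0.0
        G_left[:, x] = prev * C[:, x] + M[:, x]
    for x in range(width - 1, -1, -1):            # right-to-left pass, eq. (12)
        nxt = G_right[:, x + 1] if x < width - 1 else 0.0
        G_right[:, x] = nxt * C[:, x] + M[:, x]
    # eq. (13): M(x, d) was counted once in each pass, so subtract it once.
    return G_left + G_right - M
```

On the toy values of Fig. 2 (M = C = 1 1 1 0 1 1 1 1), this returns 3 3 3 0 4 4 4 4, i.e. each pixel receives the size of the connected component containing it, as discussed above.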

B. Matching algorithm

The final matching algorithm is shown below. It takes as input the two image scanlines I1(x) and I2(x) and a set of possible disparities S, and outputs the disparity map δ(x). The uniqueness constraint is also enforced to find the half-occlusions in a single pass through all the disparities.

1) For each shift d ∈ S, do
   a) Find M(x, d) and C(x, d) using (10) and (14).
   b) Compute G(x, d) using (11), (12), (13).
2) For each pixel x, find δ(x) = argmax_d G(x, d)

under the uniqueness constraint (to find occlusions).

IV. EXPERIMENTS

Figure 3 shows the results of the above algorithm on a few stereo pairs with different left and right image contrasts. We perform the Gabor convolutions in two dimensions using an efficient implementation by Nestares et al. [28] at four different scales and four different orientations. Each row shows a stereo pair with a different type of contrast mismatch, such as a constant contrast mismatch, a smooth but variable contrast mismatch, or a contrast mismatch in only one part of the image. The computed disparity maps are shown along with occluded regions, which are colored white. In Figure 4, we have added 25% noise in the high frequency region of the right image in addition to a change in contrast. The results indicate that the algorithm is relatively stable even if significant amounts of noise are added to one of the frequency channels. Existing quantitative comparisons of stereo algorithms and the associated datasets do not deal with the issue of contrast changes; hence such comparisons will have to form a part of future work.

V. CONCLUSION

We have presented an algorithm for contrast invariant stereo matching, which relies on multiple spatial frequency channels for local matching, and a fast non-iterative left-right diffusion process for finding a global solution. Since we use phase differences for local matching only, and not for explicitly computing the disparity, we do not need filters of large spatial extent in order to compute large disparities, which prevents the degradation of boundaries. Occlusions are found by enforcing the uniqueness constraint. The algorithm is able to handle significant changes in contrast between the two images, even if the changes are nonuniform over the image. We also showed that the algorithm performs robustly even if noise is added to one of the frequency channels.
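The disparity-selection and occlusion step recapped above (step 2 of the matching algorithm in Section III-B) can be sketched as follows, assuming G(x, d) has already been computed by the diffusion passes. The function name `select_disparities`, the −1 marker for occluded pixels, and the greedy post-hoc resolution of uniqueness conflicts are our own simplifications; the paper enforces the constraint within its single pass through the disparities.

```python
import numpy as np

def select_disparities(G, disparities):
    """Pick the disparity maximizing G(x, d) per pixel, then enforce uniqueness:
    if two left pixels claim the same right pixel x + d, keep the stronger match
    and mark the weaker one as half-occluded (-1, a hypothetical convention)."""
    width = G.shape[1]
    best = np.argmax(G, axis=0)                  # best disparity index per pixel
    score = G[best, np.arange(width)]            # its goodness value
    delta = np.array(disparities)[best]          # tentative disparity map
    claimed = {}                                 # right pixel -> (score, left pixel)
    out = delta.copy()
    for x in range(width):
        target = x + delta[x]                    # right-image pixel claimed by x
        if target in claimed and claimed[target][0] >= score[x]:
            out[x] = -1                          # weaker claim: half-occluded
        else:
            if target in claimed:
                out[claimed[target][1]] = -1     # displace the weaker earlier claim
            claimed[target] = (score[x], x)
    return out
```

For instance, if two neighboring pixels map onto the same right-image pixel, only the one with the larger G survives; the other is reported as occluded, mirroring how the white occlusion regions of Figures 3 and 4 are produced.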
We believe that the development of such robust algorithms will be useful for many practical applications involving robots which rely on stereo vision for sensing depth.

REFERENCES

[1] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," IJCV, vol. 47, no. 1, pp. 7–42, April 2002.
[2] M. Okutomi and T. Kanade, "A multiple baseline stereo," IEEE Trans. PAMI, vol. 15, no. 4, pp. 353–363, April 1993.

[3] D. Geiger, B. Ladendorf, and A. Yuille, "Occlusions and binocular stereo," ECCV, pp. 425–433, 1992.
[4] A. Fusiello, V. Roberto, and E. Trucco, "Efficient stereo with multiple windowing," CVPR, pp. 858–863, June 1997.
[5] T. Kanade and M. Okutomi, "A stereo matching algorithm with an adaptive window: theory and experiment," IEEE Trans. PAMI, vol. 16, no. 9, pp. 920–932, 1994.
[6] Y. Boykov, O. Veksler, and R. Zabih, "A variable window approach to early vision," IEEE Trans. PAMI, vol. 20, no. 12, pp. 1283–1294, Dec 1998.
[7] A. F. Bobick and S. S. Intille, "Large occlusion stereo," IJCV, vol. 33, no. 3, pp. 181–200, Sept 1999.
[8] H. Tao, H. Sawhney, and R. Kumar, "A global matching framework for stereo computation," ICCV, vol. 1, pp. 532–539, July 2001.
[9] J. Mulligan and K. Daniilidis, "Predicting disparity windows for real-time stereo," Lecture Notes in Computer Science, vol. 1842, pp. 220–235, 2000.
[10] Y. Ohta and T. Kanade, "Stereo by intra- and inter-scanline search using dynamic programming," IEEE Trans. PAMI, vol. 7, no. 2, pp. 139–154, March 1985.
[11] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. PAMI, vol. 6, no. 6, pp. 721–741, Nov 1984.
[12] S. T. Barnard, "Stochastic stereo matching over scale," IJCV, vol. 3, no. 1, pp. 17–32, 1989.
[13] R. Szeliski, "Bayesian modeling of uncertainty in low-level vision," IJCV, vol. 5, no. 3, pp. 271–302, Dec 1990.
[14] D. Scharstein and R. Szeliski, "Stereo matching with nonlinear diffusion," IJCV, vol. 28, no. 2, pp. 155–174, 1998.
[15] S. Roy and I. Cox, "A maximum-flow formulation of the n-camera stereo correspondence problem," ICCV, pp. 492–499, 1998.
[16] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Trans. PAMI, vol. 23, no. 11, pp. 1222–1239, Nov 2001.
[17] V. Kolmogorov and R. Zabih, "Computing visual correspondence with occlusions using graph cuts," ICCV, pp. 508–515, July 2001.
[18] J. Kim, V. Kolmogorov, and R. Zabih, "Visual correspondence using energy minimization and mutual information," ICCV, vol. 2, pp. 1033–1040, 2003.
[19] G. Egnal and R. Wildes, "Detecting binocular half-occlusions: empirical comparisons of five approaches," IEEE Trans. PAMI, vol. 24, no. 8, pp. 1127–1133, Aug 2002.
[20] B. Julesz, Foundations of Cyclopean Perception. University of Chicago Press, Chicago, 1971.
[21] T. Sanger, "Stereo disparity computation using Gabor filters," Biological Cybernetics, vol. 59, pp. 405–418, 1988.
[22] M. Jenkin and A. Jepson, "The measurement of binocular disparity," in Computational Processes in Human Vision, Z. Pylyshyn, Ed. Ablex Press, NJ, 1988.
[23] D. Fleet, A. Jepson, and M. Jenkin, "Phase-based disparity measurement," CVGIP: Image Understanding, vol. 53, pp. 198–210, 1991.
[24] J. Weng, "Image matching using windowed Fourier phase," IJCV, vol. 11, pp. 211–236, 1994.
[25] N. Qian, "Computing stereo disparity and motion with known binocular cell properties," Neural Computation, vol. 6, pp. 390–404, 1994.
[26] D. Fleet, "Disparity from local weighted phase-correlation," IEEE International Conference on SMC, pp. 48–56, October 1994.
[27] A. S. Ogale, "The compositional character of visual correspondence," Ph.D. dissertation, University of Maryland, College Park, August 2004.
[28] O. Nestares, R. Navarro, J. Portilla, and A. Tabernero, "Efficient spatial-domain implementation of a multiscale image representation based on Gabor functions," J. Electronic Imaging, vol. 7, pp. 166–173, 1998.

Fig. 3. Row 1: Tsukuba stereo pair with a quadratic contrast variation across the left image. The disparity map is shown on the right. Row 2: Sawtooth stereo pair with different image contrasts. Row 3: Random dot pair with a Gaussian contrast variation across the left image. Row 4: Leopard stereo pair with different contrast in a square patch in the right image. In all cases, the occlusions are shown in white.


Fig. 4. (a) Left image from the map sequence. (b) Right image with lower contrast and the addition of noise in the high frequency channel. The noise causes up to 25% variation in the intensity. (c) A portion of a scanline in the right image; the solid line shows the intensity values before the addition of noise, and the dotted line shows the values afterwards. (d) The results. Occlusions are shown in white.