FLOOR DETECTION BASED DEPTH ESTIMATION FROM A SINGLE INDOOR SCENE

Changhwan Chun¹, Dongjin Park¹, Wonjun Kim², and Changick Kim¹

¹Korea Advanced Institute of Science and Technology, Department of Electrical Engineering, Yuseong-gu, Daejeon 305-732, Republic of Korea
Email: {changhwan chon, dj park, changick}@kaist.ac.kr

²Samsung Advanced Institute of Technology, Advanced Media Lab., Future IT Research Center, Yongin-si, Gyeonggi-do 446-712, Republic of Korea
Email: [email protected]

ABSTRACT

Estimating depth information from a single image has recently attracted great attention in various vision-based applications such as mobile robot navigation. Although there are numerous depth map generation methods, little effort has been devoted to depth estimation from a single indoor scene. In this paper, we propose a novel method for estimating depth from a single indoor image via nonlinear diffusion and image segmentation techniques. One important advantage of our approach is that no learning scheme is required to estimate a depth map. With the proposed method, we obtain visually plausible depth estimation results even in the presence of occlusions or clutter in the single indoor image. From experimental results, we confirm that the proposed algorithm provides reliable depth information under various indoor environments.

Index Terms— Depth estimation, Floor detection, Nonlinear diffusion, Image segmentation, Monocular vision

1. INTRODUCTION

With the recent considerable interest in 3D image analysis, estimating depth information has become a rapidly evolving topic in computer vision research. Many researchers have studied 3D information estimation for various vision-based applications such as mobile robot navigation, surveillance systems, 3D television, and so on. This is because 3D information can reduce ambiguities in understanding scenes with various structures.

Recently, several techniques have been proposed for estimating relative depth information [1]–[4]. Hoiem et al. [1] propose an outdoor scene reconstruction method that recovers 3D geometry via a classification technique. In [2], their work is modified by adding vertical subclasses (i.e., left, center, right, porous, and solid) and extended to indoor scenes. However, generic indoor scenes contain a wide variety of structures and objects. In addition, such scenes may contain very similar colors and textures (Fig. 1(a); the figures in this paper are best viewed in color).


Fig. 1. Indoor scenes with various geometries.

Thus, such properties of indoor images can hinder the classification of indoor scenes into geometric classes (i.e., ceiling, vertical, and ground). As a more general solution for generic scenes, Nedovic et al. [3] propose a technique for inferring 3D scene geometry. They introduce various 3D scene classes (i.e., stages) for representing 3D geometry from a single image. Most recently, Jung and Kim [4] improve Nedovic's method in terms of accuracy and processing time. However, methods that classify images into global depth-profile models often fail on indoor scenes. In addition to the above-mentioned reasons, scene structure is hard to understand when scenes are crowded (Fig. 1(b)). Furthermore, indoor objects (e.g., pillars, lights, tiles, doors, and windows) may hinder the estimation of 3D scene geometry due to light reflection, illumination change, or occlusions (Fig. 1(b) and (c)).

In this paper, we address the problem of depth estimation from a single indoor scene without using any learning scheme. First, we assume that the scene geometry can be divided into three components (i.e., floor, wall, and ceiling). It is also observed that floor regions play the most important role in inferring the scene geometry of an indoor image [5]. Therefore, we detect the floor regions first. To do so, we exploit nonlinear diffusion [6] and image segmentation [7] techniques. Finally, we generate a depth map from the detected floor regions. From experimental results, we confirm that the proposed algorithm provides reliable depth information in indoor scenes.

The remainder of the paper is organized as follows: the proposed method for estimating a depth map from a single indoor image is introduced in Section 2. In Section 3, experimental results are presented, followed by concluding remarks in Section 4.


2. THE PROPOSED DEPTH ESTIMATION METHOD

The proposed algorithm is composed of three steps. The first step detects floor candidate regions by exploiting the nonlinear diffusion method [6]. After that, we detect floor regions using the image segmentation technique introduced in [7]. Finally, we estimate a depth map from the detected floor regions.

Fig. 2. An example of the primary layout representation.

2.1. Floor Region Detection

2.1.1. Detection of Floor Candidate Regions

In this subsection, we introduce a method for detecting floor candidate regions. The basic idea, motivated by Kim and Kim [6], is to use textural boundaries to represent the primary layout. Based on the total variation (TV) flow, we conduct nonlinear diffusion, which reduces smoothing in the presence of edges, using the diffusion equation [8]:

u^{k+1} = u^k + \mathrm{div}\big(g(|\nabla u^k|)\,\nabla u^k\big), \quad u^{(k=0)} = I, \tag{1}

where k denotes the number of iterations, g(·) is the diffusivity function, and I is the input gray-scale image. In this paper, we use the TV flow as the diffusivity function because it removes unwanted detail (e.g., oscillations) while preserving important structures such as edges, and it leads to piecewise-constant results [6]. Moreover, it requires no additional parameters. Such diffusion methods based on the TV flow are very useful since non-informative components (e.g., small clutter elements) are suppressed while relevant edges such as plane boundaries are preserved. The TV flow is simply given as g(|\nabla u^k|) = 1/(|\nabla u^k| + \varepsilon), where ε is a small positive constant [8]. To obtain diffused images, we solve the diffusion equation using the AOS method [9], which is both stable and fast. Then, we use higher-order statistics (HOS) to generate a textural boundary (TB) map from the diffused image [6]. In our method, the second-order moments are computed as follows:

\mathrm{TB}_L(x, y) = \frac{1}{N} \sum_{(i,j) \in B(x,y)} |u_L(i, j) - \mu(x, y)|^2, \tag{2}

where B(x, y) is the set of neighboring pixels of position (x, y), L is the optimal number of iterations, and µ(x, y) is the sample mean:

\mu(x, y) = \frac{1}{N} \sum_{(i,j) \in B(x,y)} u(i, j). \tag{3}

For implementation we use a window of 3×3 pixels (i.e., N = 9). From (2), we obtain TB_L, which highlights high-frequency components at region boundaries (i.e., edges in the diffusion space) while suppressing noise. Additionally, we determine the optimal number of iterations L adaptively based on the textural patterns of the given image. Specifically, it is determined from the difference between consecutive TB maps:

d_k = \frac{1}{M} \sum_{(x,y) \in I} |\mathrm{TB}_k(x, y) - \mathrm{TB}_{k-1}(x, y)|, \quad k \geq 1, \tag{4}

where M is the total number of pixels in the given image I. When d_k falls below a threshold τ, we set L = k (i.e., the image is sufficiently diffused). In our work, we set τ to 0.5. The final TB map is shown in Fig. 2, which gives an example of the primary layout representation. The TB map is first binarized and then inverted to represent the primary layout via connected-component analysis. Finally, a connected region is labeled as a floor candidate region if part of it belongs to the lower half of the labeled image. Figure 3(c) shows the detected floor candidate regions in indoor images.

Fig. 3. The results of detected floor candidate regions: (a) the labeled image, (b) original images, (c) corresponding floor candidate regions.
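To make the pipeline of Eqs. (1)–(4) concrete, the following Python/NumPy sketch implements the diffusion, TB-map, and candidate-labeling steps under stated assumptions: a simple explicit update stands in for the AOS solver [9] actually used in the paper, and the step size dt, the contrast parameter eps, and the binarization threshold tb_thresh are illustrative choices not specified in the text (only τ = 0.5 and the 3×3 window, N = 9, come from the paper).

```python
# Illustrative sketch of Sec. 2.1.1, not the authors' implementation.
# An explicit update replaces the AOS solver [9]; dt, eps, and tb_thresh
# are assumed values (only tau = 0.5 and the 3x3 window are from the paper).
import numpy as np
from scipy import ndimage

def tv_diffuse_step(u, eps=1.0, dt=0.2):
    """One explicit step of Eq. (1) with TV-flow diffusivity g = 1/(|grad u| + eps)."""
    uy, ux = np.gradient(u)                      # image gradients (rows, cols)
    g = 1.0 / (np.hypot(ux, uy) + eps)           # TV-flow diffusivity
    div_y = np.gradient(g * uy, axis=0)          # d/dy of the y-flux
    div_x = np.gradient(g * ux, axis=1)          # d/dx of the x-flux
    return u + dt * (div_x + div_y)              # u^{k+1} = u^k + dt*div(g grad u)

def tb_map(u, win=3):
    """Eq. (2): local second-order moment (variance) over a win x win window."""
    mu = ndimage.uniform_filter(u, size=win)     # sample mean of Eq. (3)
    return ndimage.uniform_filter(u * u, size=win) - mu * mu

def floor_candidates(gray, tau=0.5, max_iter=50, tb_thresh=10.0):
    """Diffuse until consecutive TB maps differ by less than tau (Eq. (4)),
    then keep connected low-texture regions that reach the lower image half."""
    u = gray.astype(np.float64)
    tb_prev = tb_map(u)
    for _ in range(max_iter):
        u = tv_diffuse_step(u)
        tb = tb_map(u)
        if np.mean(np.abs(tb - tb_prev)) < tau:  # d_k < tau  ->  L = k
            break
        tb_prev = tb
    labels, num = ndimage.label(tb < tb_thresh)  # binarize + invert the TB map
    h = gray.shape[0]
    candidates = np.zeros(gray.shape, dtype=bool)
    for lab in range(1, num + 1):
        region = labels == lab
        if np.nonzero(region)[0].max() >= h // 2:  # region touches lower half
            candidates |= region
    return candidates
```

Note that the explicit update needs a small dt for stability; the AOS scheme used in the paper avoids this step-size restriction, which is why it is both stable and fast.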

2.1.2. Detection of Floor Regions

From the floor candidate regions detected in Section 2.1.1, we determine the floor regions using the superpixel segmentation method [7]. First, we assume that the floor regions of indoor scenes fall into two categories: simple floors (Fig. 4(a)) and cluttered floors (Fig. 4(c)) that are occluded by objects (e.g., people or columns). In the latter case it is difficult to detect accurate floor regions, so we apply a different method to each case. To compute the clutteredness of the floor candidate regions, we use the spectral residual (SR)


Fig. 4. The SR maps of input images: (a), (c) original images ((a) simple, (c) complex); (b), (d) corresponding SR maps.

Fig. 5. Segmented regions in the floor candidate regions.

Fig. 6. The process of depth generation from the detected floor regions.

method [10], since this method globally emphasizes complicated texture regions. The clutteredness is calculated as follows:

C_{sr} = \frac{1}{|FC|} \sum_{(x,y) \in FC} S(x, y), \tag{5}

where FC denotes the set of pixels in the floor candidate regions (|FC| is its size) and S(x, y) is the SR value at pixel (x, y). Based on the clutteredness C_sr, we classify the given image as the simple or the cluttered case: when C_sr is larger than a predefined threshold ϕ, the floor candidate regions of the given image are regarded as cluttered; otherwise they are regarded as simple. In our work, we set the threshold ϕ to 0.85. Figure 4(b) and (d) show the SR maps of the simple and the complex cases.

After that, we perform image segmentation using the superpixel segmentation method [7] to determine the floor regions efficiently, since accurate floor regions can be detected by merging segmented regions that have similar color and position. The segmented maps are shown in Fig. 5. In our work, we apply the K-means clustering method to merge the segmented regions. Specifically, we employ the following features: the mean of the color channels (i.e., red, green, blue, hue, saturation, and intensity), the variance of each RGB channel, and the center position of a segmented region. As shown in Fig. 5, the cluttered floor produces more segmented regions than the simple floor, so it is desirable to reduce the number of clusters in the cluttered case. Specifically, we conduct K-means clustering with K = 5, and the three biggest merged regions are taken as the floor regions. In contrast, for the simple floor we perform K-means clustering with K set to half the total number of segmented regions. Results of the detected floor regions in a given image are shown in Fig. 7; the proposed method detects the floor regions reliably.
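As a concrete illustration of this step, the sketch below computes the clutteredness of Eq. (5) and merges per-segment features with K-means. It assumes precomputed inputs (an SR saliency map S from [10], a boolean candidate mask, and a per-segment feature matrix), and scikit-learn's KMeans is an assumed stand-in for whatever clustering implementation the authors used; only ϕ = 0.85, K = 5, and the three-largest-regions rule come from the paper.

```python
# Illustrative sketch of Sec. 2.1.2, not the authors' implementation.
# S: spectral-residual saliency map [10]; fc_mask: boolean floor candidate
# mask; features: one row per superpixel (mean RGB/HSV, RGB variances,
# center position). scikit-learn's KMeans is an assumed stand-in.
import numpy as np
from sklearn.cluster import KMeans

def clutteredness(S, fc_mask):
    """Eq. (5): mean SR saliency over the floor candidate pixels."""
    return S[fc_mask].mean()

def merge_segments(features, cluttered):
    """Merge superpixels by K-means on their feature vectors:
    K = 5 in the cluttered case, half the segment count otherwise."""
    k = 5 if cluttered else max(2, features.shape[0] // 2)
    return KMeans(n_clusters=k, n_init=10).fit_predict(features)

def floor_clusters(S, fc_mask, features, phi=0.85):
    """Classify the scene, then return merged cluster labels per segment."""
    cluttered = clutteredness(S, fc_mask) > phi     # phi = 0.85 in the paper
    labels = merge_segments(features, cluttered)
    # In the cluttered case the paper keeps the three biggest merged regions
    # as floor; region sizes would be accumulated from the underlying
    # superpixel areas (omitted here for brevity).
    return labels, cluttered
```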

2.2. Depth Map Estimation

In this section, we present a method for generating a depth map from the detected floor regions. First, we assume that the farthest position from the camera in a given image is the highest pixel point (x_h, y_h) in the detected floor regions. Based on this pixel position, depth values for the floor regions are assigned gradually along the vertical direction of the image as follows:

D_F(x, y) = 255 \times (y - y_h)/y_h, \quad \text{for } 0 \leq x \leq \mathrm{width}, \tag{6}

where (x, y) denotes a pixel point in the floor regions. In indoor images, verticals (i.e., walls) appear above the floor regions. Their depth values are copied from the depth value of the highest floor pixel in each column and assigned along that column (see Fig. 6(b)). We call this the depth for verticals, D_V(x, y).

The farthest point is also an important cue for estimating depth, so we additionally create a depth map that encodes the relative distance from that point. To do so, the distance between the farthest point and the center of each segmented region from Section 2.1.2 is computed as:

d_i = \sqrt{(x_i^S - x_h)^2 + (y_i^S - y_h)^2}, \tag{7}

where (x_i^S, y_i^S) denotes the center position of the i-th segmented region. Based on this distance, the depth map for relative distance, D_R, is expressed as:

D_R(x, y) = 255 \times (d_i - d_{\min})/(d_{\max} - d_{\min}), \tag{8}

where d_min and d_max are the minimum and maximum distances between the farthest position (x_h, y_h) and the segmented regions, respectively. Finally, the final depth map is generated as a linear combination of the three depth maps:

D(x, y) = \alpha \cdot D_F(x, y) + \beta \cdot D_V(x, y) + \gamma \cdot D_R(x, y), \tag{9}

where α, β, and γ are weights for the final depth map. In our work, we set α = 1, β = 0.5, and γ = 0.3. Then, D(x, y) is normalized for gray-scale representation. Figure 6 shows the process of the proposed depth generation method from the detected floor regions.
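The sketch below assembles Eqs. (6)–(9) into a depth map under stated assumptions: floor_mask and seg_labels are taken as given from the previous steps, and only the weights α = 1, β = 0.5, γ = 0.3 come from the paper.

```python
# Illustrative sketch of Sec. 2.2, not the authors' implementation.
# floor_mask: boolean floor region; seg_labels: integer segment label map
# from Sec. 2.1.2. The weights alpha, beta, gamma follow the paper.
import numpy as np

def depth_from_floor(floor_mask, seg_labels, alpha=1.0, beta=0.5, gamma=0.3):
    h, w = floor_mask.shape
    ys, xs = np.nonzero(floor_mask)
    yh, xh = ys.min(), xs[ys.argmin()]           # farthest floor point (x_h, y_h)

    # Eq. (6): floor depth grows linearly below the farthest floor row.
    rows = np.arange(h, dtype=np.float64)[:, None]
    DF = np.clip(255.0 * (rows - yh) / yh, 0.0, 255.0) * floor_mask

    # Depth for verticals: copy the highest floor pixel's depth up each column.
    DV = np.zeros((h, w))
    for x in range(w):
        col = np.nonzero(floor_mask[:, x])[0]
        if col.size:
            DV[:col.min(), x] = DF[col.min(), x]

    # Eqs. (7)-(8): distance of each segment center to (x_h, y_h), normalized
    # to [0, 255] and painted back onto the pixels of that segment.
    DR = np.zeros((h, w))
    cy, cx = np.indices((h, w))
    labs = np.unique(seg_labels)
    d = np.array([np.hypot(cx[seg_labels == l].mean() - xh,
                           cy[seg_labels == l].mean() - yh) for l in labs])
    d_norm = 255.0 * (d - d.min()) / (d.max() - d.min() + 1e-9)
    for l, dval in zip(labs, d_norm):
        DR[seg_labels == l] = dval

    # Eq. (9): weighted combination, normalized for gray-scale display.
    D = alpha * DF + beta * DV + gamma * DR
    return 255.0 * D / (D.max() + 1e-9)
```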

Fig. 7. Comparisons of floor regions from a single indoor image (from top to bottom: original input images, floor regions from ground truth, floor regions from Hoiem et al. [2], and floor regions from the proposed method).

Fig. 8. Results of the proposed floor detection and depth estimation methods from a single indoor scene (from top to bottom: original input images, floor regions from the proposed method, and depth from the proposed method).

3. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed depth estimation method, we have conducted experiments on various images collected from the web and the Stanford corridor dataset [11]. In detail, the images were collected through Google image search using keywords such as subway station, airport, corridor, library, and laboratory. Original images of various resolutions were resized to 640×480 pixels. The algorithm is implemented in C++, and its running time is around 0.5 seconds on an Intel Core2Duo 3.0 GHz with 2 GB RAM.

To show the efficiency and robustness of the proposed floor detection based method, our algorithm is compared with Hoiem's method [2] in Fig. 7. From Fig. 7, we can see that our method provides perceptually plausible results thanks to the proposed reliable floor detection. In particular, when the colors of the floor and ceiling are similar or the textures are complicated, Hoiem's method [2] often fails to estimate accurate floor regions, as shown in Fig. 7. In contrast, our method reliably detects the floor regions regardless of floor color or complicated texture. Figure 8 shows the results of depth generation from the detected floor regions; the proposed method provides reliable and perceptually plausible depth maps regardless of the structure of the floor. More experimental results and supplementary materials are available on our website (http://cilabs.kaist.ac.kr/IndoorDepth.html).

4. DISCUSSION

This paper introduces a novel depth estimation method from a single indoor image. Our method requires no learning scheme and can therefore be applied to a wide range of indoor scenes. As shown in the experimental results, our floor detection based method tends to provide rough estimates in the ceiling area compared to other areas. However, depth information from ceiling areas is the least useful in most indoor 3D applications. Our future work includes optimization of both the code and the algorithm for real-time use of the proposed system.


5. REFERENCES

[1] D. Hoiem, A. A. Efros, and M. Hebert, "Geometric context from a single image," in Proc. IEEE International Conference on Computer Vision, 2005, pp. 654–661.

[2] D. Hoiem, A. A. Efros, and M. Hebert, "Recovering surface layout from an image," International Journal of Computer Vision, vol. 75, no. 1, pp. 151–172, Oct. 2007.

[3] V. Nedovic, A. W. M. Smeulders, A. Redert, and J.-M. Geusebroek, "Stages as models of scene geometry," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1673–1687, Sep. 2010.

[4] C. Jung and C. Kim, "Real-time estimation of 3D scene geometry from a single image," Pattern Recognition, vol. 45, no. 9, pp. 3256–3269, Sep. 2012.

[5] G. Smith and J. Morley, "Full floor identification in images with minimal close range 3D information," in Proc. IEEE International Conference on Image Processing, Sep. 2012, pp. 1029–1032.

[6] W. Kim and C. Kim, "A texture-aware salient edge model for image retargeting," IEEE Signal Processing Letters, vol. 18, no. 11, pp. 631–634, Nov. 2011.

[7] P. Felzenszwalb and D. Huttenlocher, "Efficient graph-based image segmentation," International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.

[8] M. Rousson, T. Brox, and R. Deriche, "Active unsupervised texture segmentation on a diffusion based feature space," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2003, vol. 2, pp. 699–704.

[9] J. Weickert, B. M. ter Haar Romeny, and M. A. Viergever, "Efficient and reliable schemes for nonlinear diffusion filtering," IEEE Trans. Image Processing, vol. 7, no. 3, pp. 398–410, 1998.

[10] X. Hou and L. Zhang, "Saliency detection: a spectral residual approach," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2007, pp. 1–8.

[11] E. Delage, H. Lee, and A. Y. Ng, "A dynamic Bayesian network model for autonomous 3D reconstruction from a single indoor image," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2006, vol. 2, pp. 2418–2428.
