Proceedings of the 2004 IEEE International Conference on Robotics & Automation, New Orleans, LA, April 2004

Layered Ground Floor Detection for Vision-based Mobile Robot Navigation

Young-geun Kim
Graduate School of Information Technology and Telecommunication, Inha University, 253 Yonghyun-dong, Nam-gu, Incheon, 402-751, Korea

Hakil Kim
Dept. of Information and Communication Engineering, Inha University, 253 Yonghyun-dong, Nam-gu, Incheon, 402-751, Korea
Abstract - This paper proposes a method of detecting movable paths for visual navigation of mobile robots. The algorithm is to detect and segment the ground floor by computing plane normals from motion fields in image sequences. A plane normal in 3D space is an effective clue to detect other static or moving objects on the ground floor and can be computed from point correspondences and planar homographies. Such plane normals are combined together with iterative refinement processes based on image segmentation techniques, which allows us to detect and segment the ground floor accurately even when mismatched point correspondences are detected in image sequences. The preliminary experiments on real data demonstrate the effectiveness of the proposed method.
Index Terms - ground floor detection, layered image representation, vision-based navigation, mobile robot
I. INTRODUCTION
Fig. 1. An image of an indoor scene: (a) the image shows other static objects on the ground floor, (b) the obstacles such as walls, desks, a chair and a book, (c) the ground floor.
The ground floor is a very interesting object for mobile robots in structured environments: it represents a movable path and corresponds to the rest of the image except for other static or dynamic objects on the ground floor. Methods to obtain the ground floor can be classified into two categories. Stereo approaches reconstruct a scene structure into a 3D map, but they cannot find the ground floor directly because of varying depth. In motion approaches, it is possible to find moving objects having separate motions, but it is difficult to find the ground floor directly because the floor has no dominant motion against the surroundings [13]. Other approaches use multiple visual cues such as corner points, color and texture [9], or plane normals to detect the ground floor [13]. To overcome the above difficulty, in this paper we exploit the fact that other static or moving objects stand on the ground floor and are perpendicular to it. We also assume that their images consist of small patches, each of which corresponds to a plane in 3-space and contains at least three image points so that it can define a plane; at least three point correspondences in two or more images can define a plane in 3-space [14]. This fact and these assumptions provide the following advantages: (i) plane normals can be an effective clue to separate the ground floor from the scene; (ii) computing a plane normal from all image points in a patch increases its computational accuracy. However, our approach contains the following problems: the first, known as the correspondence problem, is determining which
In visual navigation, the purpose of local map building is to solve the problem of "where should I go?" Traditional approaches can be classified into two categories. First, stereo approaches produce a dense disparity map from two or more images and reconstruct it into a 3D map. Depth information from the 3D map provides 3D visual cues such as obstacles or paths for navigation [11,15], but these approaches require time-consuming processes and will fail when particular types of features are not present in images. Secondly, motion approaches compute a motion field from consecutive images and detect other static or moving objects when their motions are dominant against the scene [7,10]. Motion fields have also been used in mimicking the centering behavior of honeybees between walls [6] and in time-to-contact estimation [1] for navigation. These techniques work in some types of scenes having dominant motions, but they cannot provide structure information of the scene and will fail in cases of a static scene or a robot at rest. In this paper, we consider visual navigation in structured environments such as indoor scenes and focus on a method to detect obstacles and paths at the same time in images. Fig. 1 illustrates the idea with an image of an indoor scene that contains static objects such as walls, desks and a book on the floor. Given the image, we would like to detect the static objects as obstacles (Fig. 1b) and the floor as a path that robots can navigate (Fig. 1c).
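As a minimal illustration of the gradient-based motion-field estimation mentioned above, the following sketch computes a dense single-level Lucas-Kanade flow field with NumPy. It is a simplification under stated assumptions: the paper relies on a multi-scale coarse-to-fine estimator [2], whereas a single level only handles small displacements, and the function name, window size, and conditioning threshold here are our own choices, not the authors'.

```python
import numpy as np

def lucas_kanade_flow(I0, I1, win=7):
    """Dense single-level Lucas-Kanade flow from image I0 to I1.

    A hedged sketch: one pyramid level only, so it assumes small
    (sub-window) displacements between the two frames.
    """
    I0 = I0.astype(np.float64)
    I1 = I1.astype(np.float64)
    Iy, Ix = np.gradient(I0)          # np.gradient returns (d/dy, d/dx)
    It = I1 - I0                      # temporal derivative
    r = win // 2
    h, w = I0.shape
    flow = np.zeros((h, w, 2))
    for y in range(r, h - r):
        for x in range(r, w - r):
            ix = Ix[y-r:y+r+1, x-r:x+r+1].ravel()
            iy = Iy[y-r:y+r+1, x-r:x+r+1].ravel()
            it = It[y-r:y+r+1, x-r:x+r+1].ravel()
            A = np.stack([ix, iy], axis=1)   # brightness-constancy system
            AtA = A.T @ A
            if np.linalg.cond(AtA) < 1e6:    # skip ill-conditioned (textureless) windows
                u, v = np.linalg.solve(AtA, -A.T @ it)
                flow[y, x] = (u, v)
    return flow
```

On a smooth synthetic image translated by one pixel in x, the recovered interior flow is close to (1, 0) pixels, consistent with the claim that a gradient approach yields displacement (not just direction) for small motions.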
0-7803-8232-3/04/$17.00 ©2004 IEEE
Authorized licensed use limited to: University of Missouri Libraries. Downloaded on September 3, 2009 at 15:14 from IEEE Xplore. Restrictions apply.
Fig. 2. Framework of the layered ground floor segmentation algorithm. (The block diagram comprises: plane normal computation, region classification, region splitting, region growing, region merging, and two-layered images.)
point in the first image corresponds to which point in the second image [3]. The second problem is determining the shape and size of a patch so that it corresponds to a flat region in 3-space. In order to solve these problems and satisfy the above fact and assumptions, we suggest the following method: (i) in order to use all pixels for computation, we need accurate and dense image point correspondences, so we adopt multi-scale coarse-to-fine estimation, namely L-K (Lucas-Kanade) optical flow estimation [2]; (ii) we split an image into sub-regions as small as possible, so that each region is closest to a plane in 3-space, by using image splitting techniques based on color homogeneity [4]; (iii) we compute an optimal estimate of the plane normal for a patch so that the error is minimized even when mismatched image point correspondences are obtained; (iv) we design an iterative refinement process to detect and segment the ground floor in an image, based on region growing and merging; (v) we represent the segmented image with two layers in a form suitable for visual navigation [8,12].
This paper is organized as follows: Section II introduces the proposed algorithm in more detail, and Section III shows the experimental results with two real scenes and compares them with the ground truth data. Section IV discusses the validity and efficiency of the algorithm.

II. LAYERED GROUND FLOOR DETECTION
This section describes the framework to segment the ground floor by computing plane normals from motion fields in image sequences. Referring to Fig. 2, the proposed algorithm consists of three stages:
1) Image motion estimation and image segmentation: compute optical flow to obtain accurate and dense image point correspondences in consecutive images by using multi-scale coarse-to-fine estimation, and split images into small regions.
2) Iterative refinement process: select a seed region and grow it to connected regions using a queue structure; for each region, estimate the plane normal and merge it into the ground floor if it is close to the ground plane normal.
3) Two-layered representation: decompose the image into the foreground layer and the background layer.

A. Image Motion Estimation
The performance of detecting the ground floor is mainly affected by the accuracy and density of image point correspondences. In stereo approaches, accurate point correspondences can be obtained only for particular types of features such as corners or edges, because large disparity produces many mismatched points. In motion approaches, accurate and dense point correspondences can be obtained with a multi-scale coarse-to-fine algorithm based on a gradient approach [2,5]. Traditional algorithms produce motion vectors, known as optical flow, from image sequences, but each vector generally represents the direction, not the position, of its corresponding points; the motion vector then cannot define a plane in 3-space, because the same motion field can be produced by two different planes having different motions [3]. The multi-scale technique allows the vector to represent the distance to the corresponding points and hence to define a plane in 3-space.

B. Plane Normal Computation
Consider a plane in a scene and suppose the plane is projected into two views. The relation between the images projected from the scene plane is a projective relation, called the planar homography induced by the plane [14]. Fig. 3 illustrates the concept with two images: the three regions between the two images correspond to three different planes in 3-space, and each region has its own planar homography induced by its corresponding scene plane.

Fig. 3. Each plane between scenes has its own homography H induced by the plane.

1) Case I: Three Point Correspondences: A scene plane in 3-space can be specified by three image point correspondences. Suppose that three image point correspondences x_i <-> x'_i are given between two views I, I' as shown in Fig. 4. Then the homography induced by the plane of the three image points is computed by a planar projective transformation, which is a linear transformation on homogeneous
3-vectors, represented by a non-singular 3x3 matrix H:

x_i <-> x'_i = H x_i,   (1)

where x_i = P X in I, x'_i = P' X in I', and P, P' are the camera projection matrices. Suppose the camera projection matrices P, P' are calibrated with respect to the first camera, and that a plane pi in 3-space has coordinates pi = (n^T, d)^T such that n^T X + d = 0. The homography H induced by the plane is given by (see [14])

H = M - m v^T,   (2)

where M and m are the sub-matrices of the second camera projection matrix P' = [M | m], and v is the plane normal of the plane pi, parameterized as n/d. Since the left and right terms of (1) are parallel, their cross product should be zero. Thus a linear equation in the plane normal v is obtained from (1) and (2):

x_i^T v = (x'_i x m)^T (x'_i x M x_i) / ((x'_i x m)^T (x'_i x m)) = b_i   (3)

for each of the three image point correspondences. Stacking the above equation for the three image point correspondences yields a linear system in the plane normal, A v = b:

[x_1^T; x_2^T; x_3^T] v = [b_1; b_2; b_3].   (4)

Thus the plane normal v can be computed directly from three image point correspondences. Note that if the matrix A is not of full rank, a plane normal cannot be obtained because the three image points x_i are collinear, and that the accuracy of the plane normal v depends on the accuracy of the three image point correspondences.

Fig. 4. A scene plane normal with 3 image points.

We considered the image points x_i and x'_i in normalized image coordinates because we supposed the camera projection matrices are P = [I | 0] and P' = [M | m]. Thus it is necessary to normalize image points for plane normal computation. If image points p_i, p'_i are represented in pixel coordinates, then their normalized image coordinates x_i, x'_i are obtained by using the camera calibration matrices:

x_i = K^-1 p_i,  x'_i = K'^-1 p'_i,   (5)

where K, K' are the calibration matrices representing the internal parameters of the respective cameras.

Geometric meaning of x'_i x m: Let e and e' be the epipoles of the first and second views as shown in Fig. 5. The epipole e' of the second view is defined as the image point that is the projection of the first camera center C = (0^T, 1)^T onto the second view:

e' = P' C.   (6)

Since the normalized projection matrix of the second camera is P' = [M | m] and the center of the first camera is the origin, the above equation becomes

e' = m.   (7)

This means that the rightmost column vector m of the second camera projection matrix corresponds to the epipole e' of the second view. Thus the term x'_i x m may be written as

x'_i x m = x'_i x e' = l'_i,   (8)

where the line l'_i is the epipolar line of the image point x'_i in the second view. Therefore equation (3) can be rewritten as

x_i^T v = (x'_i x M x_i)^T l'_i / ||l'_i||^2 = b_i.   (9)

Fig. 5. Geometric meaning of x'_i x m.
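The computation in (2)-(4) can be checked numerically. The sketch below is an illustration under assumed values, not the authors' code: it synthesizes a homography H = M - m v^T for a known plane normal v, transfers points with it as in (1), forms b_i as in (3), and recovers v from the stacked linear system (4) (solved by least squares so that more than three points are also handled).

```python
import numpy as np

def plane_normal_from_correspondences(X, Xp, M, m):
    """Recover the plane normal v from correspondences x_i <-> x'_i via eqs (3)-(4).

    X, Xp : (n, 3) homogeneous normalized image points in views 1 and 2.
    M, m  : sub-matrices of the second projection matrix P' = [M | m].
    """
    A, b = [], []
    for x, xp in zip(X, Xp):
        c = np.cross(xp, m)                        # x'_i x m (the epipolar line l'_i, eq (8))
        b_i = c @ np.cross(xp, M @ x) / (c @ c)    # eq (3)
        A.append(x)
        b.append(b_i)
    # With exactly 3 points this is eq (4); least squares also covers n > 3.
    v, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return v

# Synthetic check with made-up camera values (assumptions, not from the paper).
rng = np.random.default_rng(0)
M = np.eye(3) + 0.1 * rng.standard_normal((3, 3))  # second-view M block
m = np.array([0.2, -0.1, 0.05])                    # translation column (= epipole e', eq (7))
v_true = np.array([0.1, 0.3, -0.5])                # plane normal n/d
H = M - np.outer(m, v_true)                        # eq (2)
X = np.column_stack([rng.uniform(-1, 1, (6, 2)), np.ones(6)])  # normalized points, view 1
Xp = (H @ X.T).T                                   # x'_i = H x_i (eq (1))
v_est = plane_normal_from_correspondences(X, Xp, M, m)
```

With noise-free correspondences the recovered v matches the true plane normal exactly, confirming that b_i in (3) equals x_i^T v.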
2) Case II: Three or More Noisy Point Correspondences: The accuracy of the plane normal v depends on the accuracy of computing, for each point in the reference image, the corresponding image point in the other image. Although the multi-scale coarse-to-fine estimation algorithm produces accurate and dense image point correspondences, poorly matched image points may exist in the case of large motion vectors, and mismatched image points may also exist when apparent brightness changes are not observed in the images. In these cases it is necessary to compute an optimal plane normal which minimizes the computation error. Assume that a region R is the sub-image that is the projection of a scene plane in 3-space. An appropriate error function may be written as the sum of squared differences of the residual of (9) over all pixels in the region R:

e(v) = sum_i (x_i^T v - b_i)^2.

The merging rule is that a region R_i is merged with R_E if the angles between their plane normals are very close:

R_E = R_i u R_E   if   theta(v_i, v_E)
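The region growing and merging of stage 2 can be sketched as follows. This is an assumption-laden illustration, not the paper's implementation: the merging condition here compares each candidate region's normal against the seed's normal with an assumed angle threshold `max_angle` (the paper's exact test and threshold are not given in this excerpt), and the data structures (`neighbors`, `normals` dictionaries) are hypothetical.

```python
import numpy as np
from collections import deque

def normal_angle(v1, v2):
    """Angle theta(v1, v2), in radians, between two plane normals."""
    c = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(c, -1.0, 1.0))

def grow_ground_floor(seed, neighbors, normals, max_angle=np.radians(5)):
    """Queue-based region growing with a normal-angle merging test (a sketch).

    seed      : index of the seed region assumed to lie on the ground floor
    neighbors : dict mapping region index -> adjacent region indices
    normals   : dict mapping region index -> estimated plane normal v
    A neighbor is merged into the ground floor set when the angle between
    its normal and the seed's normal is below `max_angle` (assumed threshold).
    """
    ground = {seed}
    queue = deque([seed])
    while queue:
        r = queue.popleft()
        for n in neighbors.get(r, []):
            if n not in ground and normal_angle(normals[n], normals[seed]) < max_angle:
                ground.add(n)
                queue.append(n)
    return ground
```

In this sketch each region's normal would be estimated by the least-squares solve of the previous section, which is exactly the minimizer of the error e(v) above; regions whose normals point away from the floor (e.g., vertical walls) fail the angle test and remain in the obstacle layer.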