Vehicle Ego-Motion Estimation and Moving Object Detection using a Monocular Camera

Koichiro Yamaguchi, Takeo Kato, Yoshiki Ninomiya
Toyota Central R&D Labs., Inc.
Abstract

This paper proposes a method for estimating the ego-motion of a vehicle and for detecting moving objects on roads by using a vehicle-mounted monocular camera. There are two problems in ego-motion estimation. Firstly, a typical road scene contains moving objects such as other vehicles. Secondly, roads display fewer feature points than background structures do. In our approach, the ego-motion is estimated from the correspondences of feature points extracted from various regions other than those in which objects are moving. After estimating the ego-motion, the three-dimensional structure of the scene is reconstructed and any moving objects are detected. Our experiments show that the proposed method is able to detect moving objects such as vehicles and pedestrians.
1. Introduction

Recently, the development of vision-based driving assistance systems has become of great interest for vehicle safety. There are two approaches in vision-based systems: monocular and stereo. We focus on the use of a monocular system, since such systems have the advantage of lower cost relative to stereo systems. For driving assistance systems, it is important to detect moving objects, such as other vehicles and pedestrians. An approach based on pattern recognition techniques has been proposed for detecting pedestrians in a monocular image [6]. However, this approach is not robust to changes in the appearance of pedestrians. In order to detect various moving objects on a road, estimation of the ego-motion of the vehicle is required. Although vehicle sensors such as speedometers provide information about the ego-motion, they do not provide the level of accuracy needed for detecting such objects. Vehicle ego-motion, i.e. the motion of the vehicle-mounted camera, can be estimated by applying Structure from Motion (SFM) algorithms [4]. However, it is difficult
to apply SFM algorithms to a road scene because of the following problems. Firstly, traffic scenes contain cluttered backgrounds, including moving objects such as other vehicles. Moving objects cause false estimations, since they violate the rigid-world assumption. Secondly, roads have few feature points, whilst background structures (such as other vehicles, buildings, etc.) display many feature points. This biased distribution of feature points reduces the accuracy of ego-motion estimation, particularly in the severe situation where the direction of motion of the vehicle-mounted camera is close to its optical axis. To overcome such problems, an approach that uses the road plane for ego-motion estimation has been proposed [7]. In this method, the roadway is assumed to be a planar structure and the ego-motion is estimated with a parametric model. However, the number of motion parameters needs to be reduced, since the estimation results are not reliable when there are a large number of unknown parameters, and it is difficult to detect moving objects reliably from a reduced set of motion parameters. In this paper, we propose a novel method for estimating vehicle ego-motion in a road scene and for detecting moving objects. Our proposed method accurately estimates the ego-motion of the vehicle by selecting feature points using the moving object detection results from the previous frame. Moving objects are detected by tracking feature points and identifying points on moving objects. This paper is organized as follows. In Section 2, we outline our proposed method. We then explain our ego-motion estimation method in Section 3. Section 4 describes our methods for detecting regions containing moving objects and the road region. We show experimental results in Section 5. Finally, we summarize the present work in Section 6.
2. Outline of Proposed Method

The process flow of our proposed method is shown in Figure 1. In this method, it is assumed that the camera is calibrated, i.e. that the internal parameters of the camera are known.
In each frame, our proposed method detects the moving objects and the road region in the current image. As shown in Figure 1(a), two consecutive images are used at any one time, i.e. the image taken at time t-1 and that taken at time t are used at time t. The detection results for moving objects and for the road region at time t-1 are also used for estimating the ego-motion and for detecting the road region at time t. In the initial frame, it is assumed that there are no moving obstacles in the previous frame, and the road region in the previous frame is determined from the height of the camera, measured while the vehicle is stationary. Figure 1(b) shows the process flow for each frame. First, feature points are extracted with the Harris corner detector [2], and the correspondences of the feature points between the two input images are found by the Lucas-Kanade method [5]. Next, the ego-motion of the vehicle is estimated from the correspondences of the feature points. For accurate ego-motion estimation, feature points are selected dispersedly from various regions, excluding those containing moving objects, by utilizing the detection results from the previous frame. Then, the three-dimensional structure of the scene is reconstructed. Finally, the moving objects and the road region are detected. These detection results are used recursively for ego-motion estimation and road region detection in the next frame.

Figure 1. Overview of proposed method: (a) flow of proposed method; (b) process flow for each frame.
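To make the front end of this pipeline concrete, the following is a minimal sketch of the feature extraction and tracking step, using OpenCV's Harris-based corner detector and pyramidal Lucas-Kanade tracker as stand-ins for the detectors of [2] and [5]; the parameter values are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch: Harris corners in the previous frame, tracked into
# the current frame with pyramidal Lucas-Kanade (illustrative parameters).
import cv2
import numpy as np

def track_features(prev_gray, curr_gray, max_corners=500):
    # Extract Harris corners from the previous grayscale frame.
    prev_pts = cv2.goodFeaturesToTrack(
        prev_gray, maxCorners=max_corners, qualityLevel=0.01,
        minDistance=10, useHarrisDetector=True, k=0.04)
    # Track the corners into the current frame.
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, prev_pts, None,
        winSize=(21, 21), maxLevel=3)
    # Keep only the successfully tracked correspondences.
    ok = status.ravel() == 1
    return prev_pts[ok].reshape(-1, 2), curr_pts[ok].reshape(-1, 2)
```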
3. Vehicle Ego-Motion Estimation

This section describes our method for estimating the ego-motion of the vehicle. First, feature points suitable for ego-motion estimation are selected from the extracted set. The ego-motion is then estimated from the correspondences of the selected points.
In a typical road scene, there are two problems that affect the estimation of ego-motion from the correspondences of the feature points. Firstly, there are usually moving objects, such as other vehicles. Feature points on such moving objects cause false estimation of the ego-motion. Secondly, roads have few feature points, while background structures (such as other vehicles, buildings, etc.) display many feature points. In particular, in cases where the direction of motion of the camera is close to its optical axis, a point correspondence-based approach would be rendered impractical by this biased distribution of feature points. To overcome these problems, we propose a new method for the selection of feature points. First, we utilize the moving object detection results from the previous frame to remove feature points on moving objects. Second, to obtain a wide distribution of feature points, each image is divided into three regions: a region which may contain the road, one which may contain low-height objects, and one which may contain high-height objects, as shown in Figure 2. The region that may contain the road is defined at the bottom of the image according to the road region detection results from the previous frame. The low-object and high-object regions are then constructed by dividing the remaining region equally into two. Feature points are selected from each region, and the number of feature points to be selected from each region is set beforehand. Figure 3 shows a set of feature points extracted from an image. As shown in Figure 3(a), when feature points are extracted from the whole image, background structures contribute many feature points, while the road region has far fewer. Moreover, some of the extracted feature points lie on the vehicle, which is actually a moving object. In contrast, as shown in Figure 3(b), the selected feature points are distributed more uniformly throughout the image and some of the feature points on the vehicle are removed. Although feature points on a moving object cannot be removed in cases where the moving objects
are not correctly detected in the previous frame, their contribution is suppressed by selecting points from the three separate regions. Therefore, the ego-motion of the vehicle can be estimated accurately and robustly in a road scene by using this selection method.

Figure 2. Image divided into three regions.

Figure 3. Feature point selection: (a) all feature points; (b) selected feature points.
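The sketch below illustrates this selection scheme under stated assumptions: points inside previously detected moving-object rectangles are discarded, the image is split into the road region and two equal bands above it, and a preset quota is drawn from each region. The quota values and helper names are hypothetical, not taken from the paper.

```python
# Sketch of the region-based feature point selection (illustrative quotas).
import numpy as np

def select_features(pts, road_top_y, moving_boxes, quotas=(30, 30, 30)):
    """pts: (N, 2) array of (x, y); road_top_y: top edge of the road
    region taken from the previous frame's detection result."""
    def on_moving_object(p):
        return any(x0 <= p[0] <= x1 and y0 <= p[1] <= y1
                   for (x0, y0, x1, y1) in moving_boxes)

    # Discard points inside previously detected moving-object rectangles.
    pts = np.array([p for p in pts if not on_moving_object(p)]).reshape(-1, 2)

    # Split the non-road area equally into high-object and low-object bands.
    band = road_top_y / 2.0
    regions = [pts[pts[:, 1] < band],                                # high objects
               pts[(pts[:, 1] >= band) & (pts[:, 1] < road_top_y)],  # low objects
               pts[pts[:, 1] >= road_top_y]]                         # road
    # Take up to the preset quota from each region.
    chosen = [r[:q] for r, q in zip(regions, quotas)]
    return np.vstack([c for c in chosen if len(c)])
```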
The essential matrix can now be estimated from the correspondences of the selected feature points using the eight-point algorithm [3] and RANSAC [1]. The motion parameters are then calculated from the estimated essential matrix. They consist of 3 rotational and 3 translational parameters, with the translation estimated only up to scale.
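As a hedged illustration of this step, the sketch below recovers the rotation and the up-to-scale translation with OpenCV. Note that cv2.findEssentialMat uses Nistér's five-point solver inside RANSAC rather than the eight-point algorithm [3] used in the paper, so this is a functional stand-in, not a reproduction of the authors' implementation.

```python
# Sketch: robust essential matrix estimation and decomposition into R, t.
import cv2

def estimate_ego_motion(pts_prev, pts_curr, K):
    # Robustly fit E from point correspondences (K: 3x3 calibration matrix).
    E, inliers = cv2.findEssentialMat(
        pts_prev, pts_curr, K, method=cv2.RANSAC,
        prob=0.999, threshold=1.0)
    # Decompose E into rotation R and unit-norm translation t; the
    # translation scale is unobservable from a monocular pair.
    _, R, t, _ = cv2.recoverPose(E, pts_prev, pts_curr, K)
    return R, t, inliers
```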
4. Moving Object Detection
After estimating the ego-motion of the vehicle, the positions of the feature points in three-dimensional space are calculated by triangulation. Outlier points, i.e. points that lie far from their epipolar lines or have a negative depth, are then detected. The set of outlier points consists of feature points on moving objects and falsely corresponded points. To extract only those points that are on moving objects, the feature points are continuously tracked over consecutive frames. Feature
points that are continuously classified as outliers are added to the set of candidate points for moving objects. Candidate points on moving objects are then grouped according to their position in the image and the direction and magnitude of their optical flow. A moving object region is finally defined as a rectangle that includes all points in the same group.
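A minimal sketch of the outlier test follows, assuming the calibration matrix K and the motion (R, t) from Section 3. A point is flagged if it reconstructs with negative depth or lies far from its epipolar line; the pixel threshold is an illustrative assumption.

```python
# Sketch: triangulate correspondences and flag epipolar/depth outliers.
import cv2
import numpy as np

def skew(v):
    # Cross-product matrix [v]x of a 3-vector.
    v = v.ravel()
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def flag_outliers(pts_prev, pts_curr, K, R, t, thresh_px=1.5):
    # Projection matrices of the two views (first camera at the origin).
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P1 = K @ np.hstack([R, t.reshape(3, 1)])
    X = cv2.triangulatePoints(P0, P1, pts_prev.T.astype(float),
                              pts_curr.T.astype(float))
    depth = X[2] / X[3]  # depth in the first camera frame

    # Fundamental matrix F = K^-T [t]x R K^-1; epipolar lines l = F x_prev.
    F = np.linalg.inv(K).T @ (skew(t) @ R) @ np.linalg.inv(K)
    x0 = np.hstack([pts_prev, np.ones((len(pts_prev), 1))])
    x1 = np.hstack([pts_curr, np.ones((len(pts_curr), 1))])
    lines = (F @ x0.T).T
    # Point-to-epipolar-line distance in the current image.
    d = np.abs(np.sum(lines * x1, axis=1)) / np.hypot(lines[:, 0], lines[:, 1])
    return (d > thresh_px) | (depth < 0)
```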
The road region is then detected as follows. Firstly, the road plane is estimated in three-dimensional space from the points contained in the region detected as road in the previous frame, as shown in Figure 4(a). In this estimation, the LMedS (Least Median of Squares) estimator is used, because some of the points identified as being in the road region in the previous frame may not actually be on the road, and some may have false positions in space due to false correspondences, as shown in Figure 4. After this, the scale ambiguity in the three-dimensional structure can be removed by using the position of the estimated road plane and the actual camera height. The input image is then divided into small patches, and each patch is evaluated to determine whether or not it belongs to the road, based on the estimated road plane and the estimated ego-motion. Moreover, the distance to a moving object is estimated from the position of its lower edge, under the assumption that moving objects are on the road.

Figure 4. Candidate points on road and their 3D positions: (a) candidate points on road; (b) side view.
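The following sketch shows one possible LMedS plane fit and the scale recovery it enables, under the assumption that the camera centre lies at the origin of the reconstruction; the iteration count is an illustrative choice.

```python
# Sketch: LMedS road plane fit and metric scale from the known camera height.
import numpy as np

def fit_road_plane_lmeds(pts3d, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    best_plane, best_med = None, np.inf
    for _ in range(n_iters):
        # Three random points define a candidate plane n.X + d = 0.
        p0, p1, p2 = pts3d[rng.choice(len(pts3d), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:
            continue  # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        d = -n @ p0
        # Keep the plane with the smallest median squared residual.
        med = np.median((pts3d @ n + d) ** 2)
        if med < best_med:
            best_med, best_plane = med, (n, d)
    return best_plane

def metric_scale(plane, true_camera_height):
    # With the camera at the origin, |d| is the up-to-scale camera height
    # above the fitted plane, so the ratio fixes the monocular scale.
    n, d = plane
    return true_camera_height / abs(d)
```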
5. Experiments

This section presents our experimental results for estimating ego-motion and for detecting moving objects using the proposed method. The images were captured using a CCD camera (Sony XC-55) mounted on a vehicle. The image resolution was 640 × 480 pixels, and the frame rate was 10 fps. Figure 5 shows the estimated yaw rate between consecutive frames for a sequence containing moving vehicles. While false estimations of the yaw rate often occur when all correspondence points are used, the proposed method estimates the yaw rate stably.

Figure 5. Estimated yaw rate.

The results of moving object detection are shown in Figure 6. The black rectangle and the white region represent a moving object region and the road region, respectively. Although some false detections occur, the vehicle is detected as a moving object in Figure 6(a), and the pedestrian walking across the road is detected in Figure 6(b).

Figure 6. Detection results: (a) vehicle; (b) pedestrian.
6. Conclusion
This paper has presented a novel method of estimating the ego-motion of a vehicle and of detecting moving objects on a road. The proposed method can accurately estimate the vehicle ego-motion in severe situations, such as when there are moving objects and when the camera is moving nearly parallel to its optical axis. Future work is aimed at the detection of stationary objects on a road and at improving the accuracy of region detection.
References
[1] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[2] C. Harris and M. Stephens. A combined corner and edge detector. In Proc. Alvey Vision Conference, pages 147–151, 1988.
[3] R. Hartley. In defence of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6):580–593, 1997.
[4] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, UK, second edition, 2004.
[5] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. International Joint Conference on Artificial Intelligence, pages 674–679, 1981.
[6] A. Shashua, Y. Gdalyahu, and G. Hayon. Pedestrian detection for driving assistance systems: Single-frame classification and system level performance. In Proc. IEEE Intelligent Vehicles Symposium, pages 1–6, 2004.
[7] G. Stein, O. Mano, and A. Shashua. A robust method for computing vehicle ego-motion. In Proc. IEEE Intelligent Vehicles Symposium, pages 362–368, 2000.