3D Object Recognition Using Segment-Based Stereo Vision

Yasushi Sumi and Fumiaki Tomita

Electrotechnical Laboratory, Tsukuba, Ibaraki, 305 Japan
Abstract. We propose a new method to recognize 3D objects using segment-based stereo vision. Predefined object models are compared with 3D boundaries extracted by the stereo vision. Boundaries may be straight lines, circular arcs, or free-form curves. The models consist of the local shapes and the whole shapes of the boundaries, and are constructed from samples of real objects or from CAD models. Candidate transformations are generated based on the local shapes, then verified and adjusted based on the whole shapes. Experimental results show the effectiveness of the method.
1 Introduction

Three-dimensional object recognition is an important topic in computer vision. Many techniques that address the problem of finding the position and orientation of a known object have been proposed for fully-automated production systems using intelligent robots. In particular, recent customized production requires versatile object recognition systems.

Approaches to 3D object recognition are generally classified into two groups. One is to analyze a monocular intensity image: the position and orientation of an object are estimated by matching 2D features extracted from the image to the object model [1, 2]. However, a combinatorial problem arises if the object and the scene are complex. The other approach is to use range images. Since 3D features extracted from range images reduce the combinatorial problem, many recognition techniques have taken this approach [3]. However, special devices, such as a laser range finder, are necessary to obtain a range image of a scene.

Recently, various techniques to reconstruct 3D information from intensity images, such as stereo vision, have been proposed. Although it is reasonable to use them for object recognition, there has been very little research in this direction. TINA, developed by Porrill et al. [4], is one of the few examples, but it does not handle free-form boundaries.

This paper proposes a new method to recognize objects with fixed edges, that is, edges originating from discontinuities of surface normal and/or reflectance. Edges may be straight lines, circular arcs, or free-form curves. Object models consist of the local shapes and the whole shapes of the boundaries. An object model is compared with the 3D boundaries extracted by segment-based stereo vision. Candidate transformations are generated based on the local shapes, then verified and adjusted based on the whole shapes. The partial pattern matching algorithm makes it possible to recognize objects that are partially occluded by other objects. A recognition system using this method has been implemented as a subsystem of a 3D vision system we call VVV (Versatile Volumetric Vision) [5].
Fig. 1. Boundary representation of an image. (a) Data structure: four layers of regions (R), boundaries (B), segments (S), and points (P), with segments linked by Prev-, Next-, and Rev-Segment pointers; (b) Segmentation of the boundary at branch, corner, inflection, and transition points.
The goal of the VVV project is to develop a versatile vision system that can be used for various applications such as hand-eye robots and autonomous vehicles. The rest of this paper describes the reconstruction of 3D information in the VVV system, the modeling of objects, and the object recognition algorithm. Experimental results are finally shown to demonstrate the effectiveness of this method.
2 Reconstruction of 3D data

2.1 Boundary representation of an image

In the VVV system, we adopt B-rep, a boundary representation of an image, as an intermediate description of both the 2D structure of an image and the 3D structure of a scene. Each of the stereo images is converted into the B-rep data structure by the image segmentation and boundary segmentation procedures [6, 7]. Fig. 1 (a) illustrates the data structure of the B-rep, which consists of four layers: region (R), boundary (B), segment (S), and point (P). The boundaries are segmented into straight, convex, or concave segments at the points shown in Fig. 1 (b). The inflection points and the transition points are particularly useful for free-form boundaries. In this paper, we call a pair of B-rep forms converted from stereo images a stereo B-rep.

2.2 Segment-based stereo vision

3D data of a scene are reconstructed by the segment-based stereo vision system [8]. We call a B-rep point with its 3D position a data point, and a B-rep containing data points a 3D B-rep. Fig. 2 shows examples of stereo images (a), a stereo B-rep (b), and a reconstructed 3D B-rep (c).

2.3 Geometrical features

Geometrical features for recognition are generated from the 3D B-rep. We define two types of geometrical features, data vertexes and data arcs.
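The paper does not give a concrete data layout for these features; the following is a minimal Python sketch of how the two feature types, whose contents are detailed in the procedure below, might be held in memory. All class and field names are our own assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DataVertex:
    """Vertex feature: a 3D position with two tangent vectors (Sect. 2.3)."""
    position: np.ndarray   # (3,) 3D position of the vertex
    tangent1: np.ndarray   # (3,) unit tangent of the first incident segment
    tangent2: np.ndarray   # (3,) unit tangent of the second incident segment

    @property
    def angle(self) -> float:
        """Opening angle between the two tangents (the theta of Sect. 4.1)."""
        c = np.clip(np.dot(self.tangent1, self.tangent2), -1.0, 1.0)
        return float(np.arccos(c))

@dataclass
class DataArc:
    """Arc feature: the segment end point plus two vectors from the fitted circle."""
    end_point: np.ndarray  # (3,) position of the end point of the segment
    vector1: np.ndarray    # (3,) first vector given by the fitted circle
    vector2: np.ndarray    # (3,) second vector given by the fitted circle
    radius: float          # radius of the fitted circle (used in matching)
```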
Fig. 2. (a) Stereo images (640×480 pixels, 256 gray levels); (b) Stereo B-rep; (c) 3D B-rep in three orthographic views.
They are generated by fitting lines and/or circles to the B-rep segments. A data vertex consists of a 3D position and two tangent vectors; a data arc consists of the position of the end point of the segment and two vectors given by the fitted circle. The geometrical features are generated by the following procedure (see Fig. 3 (a)); a sketch of steps 1 and 2 follows the list:

1. A line or circle is fitted in 3D space to the data points on each of the B-rep segments.
2. If the fitting error is large, the data points are bisected and the fitting is applied to each part recursively. As a result, the B-rep segment is approximately expressed by a line, a circular arc, or a combination of multiple lines and arcs.
3. When the segment is a single arc, the arc can be used as a data arc.
4. When the segment is expressed by a combination of lines and arcs and an arc is fitted at the end part of the segment, that arc is used as a data arc.
5. Two tangential lines are determined by the lines or arcs at the adjoining ends of two connected B-rep segments. A data vertex is defined by the two tangential vectors at the midpoint of the shortest line segment connecting the two tangential lines.

Fig. 3 (b) shows the geometrical features (vertexes) generated from the 3D B-rep in Fig. 2.
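As an illustration of steps 1 and 2, the following Python sketch shows the recursive fit-and-bisect idea using straight-line fitting only. The error threshold and the least-squares helper are our own assumptions, and the real system also fits circles.

```python
import numpy as np

def fit_line(points: np.ndarray):
    """Least-squares 3D line fit: returns (centroid, direction, rms_error)."""
    centroid = points.mean(axis=0)
    # Principal direction of the point set via SVD
    _, _, vt = np.linalg.svd(points - centroid)
    direction = vt[0]
    # Residuals: distance of each point from the fitted line
    d = points - centroid
    residual = d - np.outer(d @ direction, direction)
    rms = float(np.sqrt((residual ** 2).sum(axis=1).mean()))
    return centroid, direction, rms

def approximate_segment(points: np.ndarray, max_error: float = 0.5):
    """Recursively bisect a segment's data points until each part fits a line."""
    centroid, direction, rms = fit_line(points)
    if rms <= max_error or len(points) < 4:
        return [(centroid, direction)]          # good enough: one primitive
    mid = len(points) // 2                      # bisect and recurse on both halves
    return (approximate_segment(points[:mid], max_error) +
            approximate_segment(points[mid:], max_error))
```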
3 Object model

Object models consist of model vertexes, model arcs, and model points, as shown in Fig. 4 (a). The model vertexes and arcs are geometrical features whose data structures are the same as those of the data vertexes and arcs. The model points reflect the whole shape of the object and are sampled at equal intervals along the B-rep segments. Each model point has a 3D position and a normal vector. We have developed both sensor-based and CAD-based modeling systems; the sensor-based system uses the stereo vision system as a sensor. Fig. 4 (b) is an example of a CAD-based object model, displayed by connecting adjacent model points.
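The paper does not detail how the equal-interval sampling is performed; the sketch below shows one straightforward way to resample a 3D boundary polyline at a fixed arc-length spacing, which is an assumption on our part (the per-point normal vectors are omitted here).

```python
import numpy as np

def sample_model_points(polyline: np.ndarray, interval: float) -> np.ndarray:
    """Resample a 3D polyline (N,3) at equal arc-length intervals."""
    seg = np.diff(polyline, axis=0)
    seg_len = np.linalg.norm(seg, axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])  # cumulative arc length
    targets = np.arange(0.0, arc[-1], interval)        # equally spaced stations
    samples = []
    for s in targets:
        i = int(np.searchsorted(arc, s, side="right") - 1)
        i = min(i, len(seg_len) - 1)
        t = (s - arc[i]) / seg_len[i]                  # linear interpolation
        samples.append(polyline[i] + t * seg[i])
    return np.array(samples)
```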
Fig. 3. Geometrical features for recognition. (a) Lines fitted to the data points, with the resulting data arc and data vertex; (b) Vertexes generated from the 3D B-rep in Fig. 2, in three orthographic views.
Fig. 4. Object model. (a) Model vertex, model arc, and model points; (b) A CAD-based model displayed by connecting adjacent model points.
4 Recognition algorithm

The position and orientation of an object are expressed as a 4 × 4 transformation matrix

$$T = \begin{pmatrix} R & t \\ 0\;0\;0 & 1 \end{pmatrix},$$

where $R$ is a 3 × 3 rotation matrix and $t$ is a 3D translation vector, which together move an object model. In other words, the recognition algorithm is a procedure for calculating $T$ by comparing an object model with the scene data reconstructed by the stereo vision. An object is recognized in the following two phases: initial matching and fine adjustment.
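For concreteness, a homogeneous transformation of this form can be assembled and applied to model points as in the short sketch below (a generic illustration, not code from the paper).

```python
import numpy as np

def make_transform(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Assemble the 4x4 matrix T from a 3x3 rotation R and 3D translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def apply_transform(T: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Move an (N,3) array of model points by T."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (homogeneous @ T.T)[:, :3]
```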
4.1 Initial matching
In the initial matching phase, candidates for $T$ are roughly calculated by comparing model features (vertexes and arcs) with data features. However, the correct correspondence between a model feature and a data feature is not known in advance, so every data feature that is similar to the model feature is a candidate for the correspondence. When a model vertex $V_M$ moves to the position and orientation of a data vertex $V_D$ as shown in Fig. 5, $t$ is calculated from the 3D coordinates of $V_M$ and $V_D$, and $R$ is calculated from the two vectors of $V_M$ and $V_D$. If the angle $\theta_M$ between the two vectors of the model vertex is largely different from $\theta_D$ of the data vertex, we can assume that the model vertex does not correspond to the data vertex. That is, if $V_M(i)$ $(i = 1, \ldots, m)$ and $V_D(j)$ $(j = 1, \ldots, n)$ satisfy

$$|\theta_M(i) - \theta_D(j)| < \varepsilon_\theta, \tag{1}$$
Fig. 5. Initial matching with vertexes: a model vertex $V_M$ with opening angle $\theta_M$ is mapped onto a data vertex $V_D$ with opening angle $\theta_D$.
a candidate of the transformation matrix, $T_{ij}(0)$, is calculated, where $m$ and $n$ are the numbers of vertexes in the model and the scene, and $\varepsilon_\theta$ is a threshold value. In the case of arcs, the radii are used in place of $\theta$.
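The following Python sketch illustrates this candidate generation loop under our own assumptions about data layout: it reuses the DataVertex class sketched in Sect. 2.3 and a hypothetical estimate_transform helper that computes $T_{ij}(0)$ from a matched vertex pair.

```python
import numpy as np

EPSILON_THETA = np.deg2rad(10.0)   # assumed value for the threshold in Eq. (1)

def generate_candidates(model_vertexes, data_vertexes, estimate_transform):
    """Pair every model vertex with every similar data vertex (Eq. (1))
    and compute a candidate transformation T_ij(0) for each pair."""
    candidates = []
    for i, vm in enumerate(model_vertexes):
        for j, vd in enumerate(data_vertexes):
            # Skip pairs whose opening angles differ too much
            if abs(vm.angle - vd.angle) >= EPSILON_THETA:
                continue
            # estimate_transform (hypothetical) aligns vm's position and
            # tangents with vd's, yielding a 4x4 candidate matrix (R and t)
            candidates.append(((i, j), estimate_transform(vm, vd)))
    return candidates
```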
4.2 Fine adjustment

The fine adjustment process tests whether the candidates selected in the initial matching phase are correct, and makes the transformation matrix $T_{ij}(0)$ more accurate. The procedure, a sketch of which follows the list, is as follows:
1. Each model point $\bar{P}(\bar{k})$ $(\bar{k} = 1, \ldots, \bar{p})$ is moved to $\bar{P}'(\bar{k})$ by $T_{ij}(0)$, where $\bar{p}$ is the number of model points. The 3D positions $\bar{\boldsymbol{P}}(\bar{k})$ and unit normal vectors $\bar{\boldsymbol{N}}(\bar{k})$ of $\bar{P}(\bar{k})$ are transformed to

$$\bar{\boldsymbol{P}}'(\bar{k}) = R_{ij}(0)\,\bar{\boldsymbol{P}}(\bar{k}) + t_{ij}(0), \qquad \bar{\boldsymbol{N}}'(\bar{k}) = R_{ij}(0)\,\bar{\boldsymbol{N}}(\bar{k}). \tag{2}$$

2. If

$$\cos^{-1}\!\left(\frac{\bar{\boldsymbol{S}}'(\bar{k}) \cdot \bar{\boldsymbol{N}}'(\bar{k})}{|\bar{\boldsymbol{S}}'(\bar{k})|}\right) > \frac{\pi}{2}, \tag{3}$$

then $\bar{P}'(\bar{k})$ is observable from a camera position $\boldsymbol{C}$, where

$$\bar{\boldsymbol{S}}'(\bar{k}) = \bar{\boldsymbol{P}}'(\bar{k}) + t_{ij}(0) - \boldsymbol{C} \tag{4}$$

is a vector that expresses the direction of observing $\bar{P}'(\bar{k})$. Let $P(k)$ $(k = 1, \ldots, p;\ p \le \bar{p})$ and $\boldsymbol{P}(k)$ denote the observable model points and their 3D coordinates.

3. If a data point $D(l)$ $(l = 1, \ldots, q)$ exists in the vicinity of $P(k)$, let $(k, l)$ be the combination of $P(k)$ and $D(l)$, where $q$ is the number of data points.

4. The transformation matrix $T'_{ij}$ which moves $\boldsymbol{P}(k)$ to $\boldsymbol{D}(l)$ is given by the least-squares method, which minimizes

$$J = \sum_{(k,l)} \left| R'_{ij}\,\boldsymbol{P}(k) + t'_{ij} - \boldsymbol{D}(l) \right|^2. \tag{5}$$

5. $P(k)$ can be transformed into image coordinates $[\,\mathrm{col}_k\ \mathrm{row}_k\,]$. If the mean square error

$$\mathit{error}^2 = \frac{1}{r} \sum_{(k,l)} \left\{ (\mathrm{col}_k - \mathrm{col}_l)^2 + (\mathrm{row}_k - \mathrm{row}_l)^2 \right\} \tag{6}$$

is large, or the number of combinations $r = \sum_{(k,l)} 1$ is small compared to $p$, sufficient accuracy may not have been obtained, and the processing from step 2 to step 5 is iterated using $T_{ij}(u) = T'_{ij}(u - 1)$. If the accuracy does not converge after many iterations, the candidate is considered incorrect and is excluded.

6. After all the candidates have been processed, the transformation matrix $T_{ij}(u)$ with the largest $r$ is selected as the final recognition result.
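As promised above, here is a compact Python sketch of this verification loop for a single candidate. It is a simplification under our own assumptions: the visibility test of Eqs. (3)-(4) and the image-space error of Eq. (6) are omitted, correspondences come from a brute-force nearest-neighbor search, the rejection threshold is invented, and the least-squares step of Eq. (5) is solved with the standard SVD-based (Kabsch-style) rigid alignment rather than whatever solver the authors used.

```python
import numpy as np

def rigid_fit(P: np.ndarray, D: np.ndarray):
    """Least-squares rotation R' and translation t' moving points P onto D (Eq. (5))."""
    cp, cd = P.mean(axis=0), D.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - cp).T @ (D - cd))
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ S @ U.T
    return R, cd - R @ cp

def fine_adjust(model_pts, data_pts, R0, t0, vicinity=2.0, max_iter=20):
    """Iteratively refine one candidate (R0, t0); returns (R, t, r) or None if rejected."""
    R, t = R0, t0
    pairs = []
    for _ in range(max_iter):
        moved = model_pts @ R.T + t
        # Step 3: pair each moved model point with a nearby data point
        pairs = []
        for k, p in enumerate(moved):
            dist = np.linalg.norm(data_pts - p, axis=1)
            l = int(dist.argmin())
            if dist[l] < vicinity:
                pairs.append((k, l))
        if len(pairs) < 0.3 * len(model_pts):   # assumed cutoff: too few combinations r
            return None
        ks = [k for k, _ in pairs]
        ls = [l for _, l in pairs]
        # Steps 4-5: re-solve Eq. (5) and iterate with the improved transform
        R, t = rigid_fit(model_pts[ks], data_pts[ls])
    return R, t, len(pairs)
```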
A candidate $T_{ij}(0)$ may not be accurate enough even when it is correct, because only local features are used in the initial matching phase. Therefore, the fine adjustment procedure is applied in the following two steps:

Initial adjustment: The position and orientation of the object are roughly adjusted using only the model points near the geometrical feature.

Main adjustment: The position and orientation are further refined using all the model points.

Fig. 6 shows an example of both the initial adjustment and the main adjustment; each was iterated three times. Although corresponding data points were found only in the vicinity of the geometrical feature at first, they were finally found over the entire set of model points.
5 Experiments

Fig. 7 shows experimental results. In each case the model was moved according to the recognition result, and straight lines connecting adjoining model points were projected and superimposed on the left stereo image. In Fig. 7 (a), an object corresponding to the model in Fig. 4 was correctly found from the recognition data shown in Fig. 3. Fig. 7 (b) shows another result for the same object, recognized in a cluttered scene containing many objects of partly similar shape. Fig. 7 (c) shows a result for a partially occluded object, and Fig. 7 (d) a result for an object consisting of only a free-form boundary.

The computational time depends primarily on the complexity of the model and the scene. In the case of Fig. 7 (a), the recognition took about 6 seconds (on a 40 MHz SuperSPARC). For the scene of Fig. 7 (b), in which there were roughly twice as many data vertexes, it took about 10 seconds. The computational times for Fig. 7 (c) and (d) were approximately one second, because there were far fewer vertexes or arcs corresponding between the data and the model. The recognition error in 3D space usually does not exceed 2 mm (excluding errors that arise in the stereo vision measurement itself), which is accurate enough for a robot manipulator to pick up the recognized object.
Fig. 6. (a) Object model built by the sensor-based modeler (a piece of cardboard); (b) Stereo image (left); (c) Initial adjustment; (d) Main adjustment. Dots mark data points; +: model points that have no data point in the vicinity; the remaining symbols mark model points that have a data point in the vicinity and the position of the geometrical feature (a vertex, in this case) used in the initial matching phase.
6 Conclusion

In this paper, we have presented a method to recognize the 3D position and orientation of an object. We consider the method useful for the vision of intelligent robot systems because of the following advantages:

– It is based on stereo vision; no special sensing devices, such as a laser range finder, are required.
– The targets are not only polyhedra but also objects with free-form boundaries.
– It is robust to occlusion, as shown by the result in Fig. 7 (c), because local features of an object are used for the initial estimation of the object position.

This method is, however, only effective for objects with some fixed edges and does not cope with the apparent boundaries of curved objects. We have already developed a recognition method for apparent contours that is an extension of the method proposed in this paper [9]. Combining the two methods, we can now recognize any rigid object.
Acknowledgments We would like to express our thanks to the members of the Computer Vision Section, ETL and the VVV working group for helpful discussions.
Fig. 7. Experimental results. (a) A toy block; (b) the toy block in a cluttered scene; (c) a partially occluded mug; (d) a piece of cardboard consisting of a free-form boundary.
References

1. Kriegman, D. J., Ponce, J.: On recognizing and positioning curved 3-D objects from image contours. IEEE Trans. on PAMI, 12, 12, 1127–1137, 1990
2. Wong, K. C., Kittler, J.: Recognizing polyhedral objects from a single perspective view. Image and Vision Computing, 11, 4, 211–220, 1993
3. Arman, F., Aggarwal, J. K.: Model-based object recognition in dense-range images — A review. ACM Computing Surveys, 25, 1, 5–43, 1993
4. Porrill, J., et al.: TINA: A 3D vision system for pick and place. Image and Vision Computing, 6, 2, 91–99, 1988
5. Tomita, F.: Toward practical 3D vision. J. Robotics Society of Japan, 12, 8, 1124–1127, 1994 (in Japanese)
6. Tomita, F., Tsuji, S.: Computer Analysis of Visual Textures, chapter 3. Kluwer Academic Publishers, 1990
7. Sugimoto, K., Tomita, F.: Boundary segmentation by detection of corner, inflection and transition points. Proc. IEEE Workshop on Visualization and Machine Vision, 13–17, 1994
8. Kawai, Y., et al.: Search for stereo correspondence based on connectivity of segments. Technical Report PRMU96-135, IEICE, January 1997 (in Japanese)
9. Sumi, Y., Kawai, Y., Yoshimi, T., Tomita, F.: Recognition of 3D free-form objects using segment-based stereo vision. Proc. ICCV98, January 1998 (to appear)