Attention-driven Segmentation of Cluttered 3D Scenes
Ekaterina Potapova, Michael Zillich, Markus Vincze
Automation and Control Institute, Vienna University of Technology
{potapova,zillich,vincze}@acin.tuwien.ac.at
Abstract
Vision is an essential part of robotic systems, and visual attention plays an important role in robot applications. However, good segmentation of objects remains challenging in cluttered environments. To improve object segmentation, we first propose to attend to objects based on saliency maps calculated from color and depth. An attended object is then segmented using a probabilistic edge map, calculated from color, depth and curvature within the introduced probabilistic framework. We show that the quality of the proposed attention points is better than that of existing methods, in terms of their location within objects and the number of attended objects. The proposed attention points and probabilistic edges lead to a significant improvement of segmentation results compared to existing methods of active segmentation¹.
1. Introduction
Segmentation of unknown objects in heavily cluttered scenes remains a challenging task in computer vision. At the same time, reliable segmentation is a significant bottleneck in many robotic scenarios (such as grasping). Many segmentation approaches attempt to segment the entire scene at once [4, 10, 5]. However, in robotic applications it is often more efficient to precisely segment the single most important object and leave the rest as background. Moreover, a robot obtains additional information about its environment after it performs the necessary manipulations with the selected object. Boykov et al. [2] and Rother et al. [16] describe interactive approaches for the segmentation of color images; these, however, require input from the user, which is often not feasible in robotic applications. Moreover, algorithms for active, so-called guided or seeded, segmentation have appeared [7, 14, 13].

¹The research leading to these results has received funding from the Austrian Science Fund (FWF) under project TRP 139-N23 InSitu.
Figure 1. Segmentation examples and respective attention points: (a) active segmentation by Mishra et al. [12], (b) our method. Respective attention points and segmentation masks are shown in the same color (only the first three segmented objects are shown).
Ko et al. [7] propose to segment objects of interest based on the principles of human attention and semantic region clustering. In [14] a "seeded region growing" technique is used to segment an image w.r.t. seed points. The algorithms of [7, 14] are based on color images. Robotic scenarios, however, often require 3D information (e.g. for the selection of grasping strategies). Accordingly, there is renewed interest in segmentation methods based on both 2D and 3D cues. Mishra et al. [13, 12] propose an active segmentation algorithm based on provided seed points situated on objects. First, in [13] a probabilistic edge map, previously presented in [11], is calculated and improved using depth information or motion cues. This probabilistic edge map is remapped from Cartesian to polar coordinates with the seed points (attention points) as poles, and the optimal cut through the polar edge map is found by minimizing the path energy. This cut defines the border of an object. In this paper, we use the same approach for finding the optimal border cut, but with our proposed attention points and probabilistic edges instead of the edges proposed by Martin
et al. [11]. The algorithm described in [12] calculates attention points based on the border ownership of boundary pixels. The idea of border ownership was described in detail by Zhou et al. [17]: each border "knows" to which object it belongs, so a proper attention point inside the object can be selected. However, while active segmentation [12] is very accurate in some applications, it fails in complex situations (Fig. 1). Kootstra et al. [8] describe a seed-based segmentation algorithm built on an energy minimization approach using 2D and 3D cues: color, disparity and plane information. In their method, object locations are detected based on a symmetry principle. The algorithm requires a-priori knowledge of the number of objects present in the scene, which is not always available in real-life scenarios. Bergstrom et al. [1] present a segmentation algorithm based on the generation of segmentation hypotheses from 2D and 3D cues; the key idea is to obtain proper object segments through human-robot interaction. In our scenario the robot must be completely autonomous, which rules out any human intervention. Our work differs from the work described above in four main aspects. Firstly, the scope of the scenes for segmentation includes highly cluttered scenes of object piles, with multiple multi-colored objects partially occluding each other (Fig. 2). Secondly, we show that the way we attend to objects is more accurate and robust compared to related attention mechanisms, such as the principle of border ownership described in [12]. Thirdly, we develop a new probabilistic framework to calculate edges, which are then used for segmentation. Finally, to the best of our knowledge this is the first paper that proposes to explicitly combine 2D and 3D features (intensity, depth and curvature) for both seed (attention) point and edge calculation.
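To make the polar-cut step concrete, the sketch below remaps an edge-probability map around a given attention point and extracts one border radius per angle by dynamic programming. It is a simplified stand-in for the optimization in [13], not a reimplementation: the energy term is just one minus the edge probability, the radial smoothness window is ±1 pixel, and the wrap-around constraint that closes the contour is ignored.

```python
import cv2
import numpy as np

def polar_cut(edge_prob, seed, max_radius=200, n_angles=360):
    """Minimal-energy cut around `seed`: one border radius per angle,
    found by a seam-carving-style DP over the polar edge map."""
    polar = cv2.warpPolar(edge_prob.astype(np.float32),
                          (max_radius, n_angles), seed, max_radius,
                          cv2.WARP_POLAR_LINEAR)   # rows: angles, cols: radii
    cost = 1.0 - polar                  # cutting along strong edges is cheap
    acc = cost.copy()
    for a in range(1, n_angles):        # accumulate with |delta r| <= 1
        prev = acc[a - 1]
        cand = np.stack([np.roll(prev, 1), prev, np.roll(prev, -1)])
        cand[0, 0] = cand[2, -1] = np.inf   # no wrap along the radius axis
        acc[a] += cand.min(axis=0)
    radii = np.empty(n_angles, dtype=int)
    radii[-1] = int(np.argmin(acc[-1]))
    for a in range(n_angles - 2, -1, -1):   # backtrack the optimal path
        lo = max(radii[a + 1] - 1, 0)
        hi = min(radii[a + 1] + 2, max_radius)
        radii[a] = lo + int(np.argmin(acc[a, lo:hi]))
    return radii    # object border: one radius per angle around the seed
```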
Figure 2. Examples of attention points and ground truth labeling (blue polygons around objects): (a) attention points calculated by Mishra et al. [12], (b) our method. Each attention point has a different color and number.
2. Probabilistic Edges for Segmentation
The segmentation results of [13] strongly depend on the quality of the edges. The authors propose to use the edges of [11] (Fig. 3a), weighted according to disparity or motion cue information. In our scenarios, however, this segmentation approach did not show promising results (Fig. 3c), which can be explained by the complexity of the object configurations in the scenes. Martin et al. [11] compute the probability of an edge based on color, intensity and texture; however, their algorithm is computationally expensive, and 3D depth information is not explicitly used to calculate edges. We therefore incorporate the available 2D and 3D cues to obtain edges more suitable for our scenario. In our approach, we combine three different types of edges. Sobel Color Edges (SC) are calculated from color images. Sobel Depth Edges (SD) are computed on depth images using the Sobel operator and indicate jump (occluding) edges. Curvature Edges (CE) are computed from point normals, as curvature discontinuities are often an indicator of object boundaries: strong curvature values mean a discontinuity in the surface normal direction, i.e. a large difference between a normal and the averaged normal over its neighborhood.
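As an illustration, the three cues could be computed as follows. This is a minimal sketch, not the authors' implementation: the Sobel kernel size, the normalization, and the neighborhood used for averaging normals are our assumptions, and per-pixel normals are assumed to be precomputed from the organized point cloud.

```python
import cv2
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude from 3x3 Sobel kernels, scaled to [0, 1]."""
    gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)
    mag = cv2.magnitude(gx, gy)
    return mag / (mag.max() + 1e-9)

def sobel_color_edges(bgr):
    """SC cue: Sobel edges on the gray-scale version of the color image."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    return sobel_magnitude(gray)

def sobel_depth_edges(depth):
    """SD cue: Sobel edges on the depth image mark jump (occluding) edges."""
    return sobel_magnitude(depth.astype(np.float32))

def curvature_edges(normals, k=2):
    """CE cue: deviation of each normal from the average normal in its
    (2k+1)x(2k+1) neighborhood; large deviation = curvature discontinuity."""
    kernel = np.ones((2 * k + 1, 2 * k + 1), np.float32)
    kernel /= kernel.sum()
    avg = np.dstack([cv2.filter2D(normals[..., i].astype(np.float32), -1, kernel)
                     for i in range(3)])
    avg /= np.linalg.norm(avg, axis=2, keepdims=True) + 1e-9
    # 1 - |cos| between each normal and its neighborhood average
    return 1.0 - np.abs(np.sum(normals * avg, axis=2))
```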
To fuse the above types of edges into a single edge map, we first learn the two probability distributions $p(c_i \mid E = \text{edge})$ and $p(c_i \mid E = \overline{\text{edge}})$ for each edge cue $c \in \{SC, SD, CE\}$. The better the separation between the positive and negative distributions, the better a given observation distinguishes between edges and non-edges. Assuming that all three edge observations (SC, SD, CE) are conditionally independent, the probability of $E$ being an edge is given by Bayes' rule as:

$$p(E = \text{edge} \mid c_1, \dots, c_n) = \frac{p(E = \text{edge}) \prod_{i=1}^{n} p(c_i \mid E = \text{edge})}{p(E = \text{edge}) \prod_{i=1}^{n} p(c_i \mid E = \text{edge}) + p(E = \overline{\text{edge}}) \prod_{i=1}^{n} p(c_i \mid E = \overline{\text{edge}})} \quad (1)$$

We assume that the a-priori probabilities of $E$ being or not being an edge are equal, $p(E = \text{edge}) = p(E = \overline{\text{edge}}) = 0.5$, in which case the priors cancel.
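A sketch of the fusion step of Eq. (1) follows. We assume here that the learned likelihoods are stored as normalized histograms over cue values in [0, 1]; the binning and the smoothing constant are illustrative choices, not values from the paper.

```python
import numpy as np

def learn_likelihoods(cue, edge_mask, bins=32):
    """Learn p(c | edge) and p(c | non-edge) as histograms over a training
    cue map, given a boolean ground-truth edge mask."""
    p_pos, _ = np.histogram(cue[edge_mask], bins=bins, range=(0, 1), density=True)
    p_neg, _ = np.histogram(cue[~edge_mask], bins=bins, range=(0, 1), density=True)
    return p_pos + 1e-6, p_neg + 1e-6   # avoid zero likelihoods

def fuse_edges(cues, likelihoods, bins=32):
    """Per-pixel posterior p(E = edge | c_1, ..., c_n) of Eq. (1), with
    conditionally independent cues and equal priors (which cancel)."""
    log_pos = np.zeros(cues[0].shape)
    log_neg = np.zeros(cues[0].shape)
    for cue, (p_pos, p_neg) in zip(cues, likelihoods):
        idx = np.clip((cue * bins).astype(int), 0, bins - 1)
        log_pos += np.log(p_pos[idx])   # product of likelihoods, in log space
        log_neg += np.log(p_neg[idx])
    return 1.0 / (1.0 + np.exp(log_neg - log_pos))
```

Accumulating the products in log space avoids numerical underflow if further cues are added, e.g. `prob_edge = fuse_edges([sc, sd, ce], [h_sc, h_sd, h_ce])`.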
3. Attention Points as Seed Points for Guided Segmentation
Selecting seed points that belong to objects remains an open question. Mishra et al. [13] proposed to fixate on salient locations, so-called attention points, which can be extracted from saliency maps. There exist many approaches for the calculation of saliency maps, as presented in [6, 3].
The concept of border ownership was introduced by Mishra et al. [12] as a way to select attention points automatically. However, the proposed approach works only for "simple" objects, which Mishra et al. define as a "compact region in the scene enclosed by the edge pixels at depth and contact boundaries with 'correct' border ownership". In the case of multiply occluded objects dumped in a heap, this strategy is not successful: in Fig. 2a several attention points are not even located on the objects, i.e. the attention points are not optimal. The algorithm described in [15] proposes an effective strategy for the selection of attention points by incorporating 3D and 2D saliency cues. The approach was motivated by a robotic grasping scenario and uses specific saliency maps, Surface Height (SH) and Relative Surface Orientation (RSO), inspired by observations of how humans pick objects from occluded piles. To calculate the SH cue, the ground plane on which the objects rest (e.g. a table), $Ax + By + Cz + D = 0$, is first detected. For every point $p(i,j)$ the distance $d(i,j)$ to the supporting plane is calculated, and $d_{max}$ denotes the distance between the ground plane and the most remote point. Values of the SH cue are then calculated according to:

$$SH(i,j) = a\,(d(i,j))^2 \quad (2)$$

where $a$ is chosen such that $a\,(d_{max})^2 = 1$. The normal vector $n = (A, B, C)$ of the ground plane and the local surface normal vector $n_{ij}$ define the RSO cue:

$$RSO(i,j) = |n_{ij} \cdot n| \quad (3)$$
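In code, the two cues reduce to a few lines. This is a minimal sketch assuming an organized H×W×3 point cloud, unit-length per-pixel normals, and a ground plane (A, B, C, D) already estimated, e.g. by RANSAC:

```python
import numpy as np

def surface_height(points, plane):
    """SH cue, Eq. (2): squared point-to-plane distance, scaled so that
    the most remote point has saliency 1 (a * d_max^2 = 1).
    points: HxWx3 array; plane: (A, B, C, D) with unit normal (A, B, C)."""
    normal, d0 = np.asarray(plane[:3]), plane[3]
    d = np.abs(points @ normal + d0)    # distance to the supporting plane
    a = 1.0 / (d.max() ** 2 + 1e-9)
    return a * d ** 2

def relative_surface_orientation(normals, plane):
    """RSO cue, Eq. (3): |n_ij . n|, the absolute cosine between each local
    surface normal and the ground-plane normal. normals: HxWx3, unit length."""
    return np.abs(normals @ np.asarray(plane[:3]))
```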
SH and RSO are then combined with the classical 2D approach of [6] (IKN) to obtain a master saliency map. Attention points are extracted from this saliency map using a winner-take-all neural network and an inhibition-of-return mechanism [9].
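The extraction step can be approximated without an explicit neural network by greedily taking the saliency maximum and suppressing its surroundings. A sketch, where the inhibition radius is our assumption:

```python
import numpy as np

def extract_attention_points(saliency, n_points, inhibition_radius=20):
    """Greedy winner-take-all with inhibition of return: repeatedly pick
    the most salient pixel and zero out a disk around it."""
    s = saliency.astype(np.float64).copy()
    yy, xx = np.mgrid[0:s.shape[0], 0:s.shape[1]]
    points = []
    for _ in range(n_points):
        y, x = np.unravel_index(np.argmax(s), s.shape)   # winner-take-all
        points.append((x, y))
        s[(yy - y) ** 2 + (xx - x) ** 2 <= inhibition_radius ** 2] = 0.0  # IoR
    return points
```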
Figure 3. Examples of edges and segmentation results: (a) edges of Martin et al. [11], (b) the proposed probabilistic edges, (c) segmentation by Mishra et al. [12], (d) our method.
4. Experimental Results
RGB-D databases are not yet as common as regular 2D image databases, so we created our own RGB-D image database² for evaluation. The database consists of 244 images of table scenes that are challenging for segmentation due to the number of objects and their complex arrangements, including partial occlusions. All objects in the images were hand-labeled with outer contours represented by polygons (Fig. 2). Since our scenes only provide ground truth for objects on the table, we defined a ROI, based on the automatic extraction of the table plane, that includes only the heaps of objects on the table and disregards the background.

²The database is available for download at: https://repo.acin.tuwien.ac.at/tmp/permanent/TOSD.{zip,tar.gz}

We used 40% randomly chosen images from our database for training and evaluated the segmentation results on the remaining 60% (Fig. 3d and Table 1), using the F score:

$$F = \frac{2PR}{P + R} \quad (4)$$

where $P$ stands for the precision (the fraction of our segments located on ground truth) and $R$ for the recall (the fraction of the ground truth located on our segments).

Segmentation          Mean   Std
Mishra et al. [12]    0.40   0.13
Our Method            0.54   0.09

Table 1. F scores for edge segmentation.

Furthermore, we evaluate two aspects of the quality of attention points: the average Hit Ratio (HR) and the average Exponential Distance (ED). The HR measures the percentage of unique attention points situated inside different objects:
$$HR = \frac{n}{N} \quad (5)$$
where $N$ is the total number of calculated attention points, and $n$ is the number of different attended objects. A perfect attention mechanism would hit every object exactly once, making HR equal to one. The method in [12] and our method typically obtain different numbers of attention points; to obtain comparable results, we therefore selected the first $N$ attention points in our experiments, with $N$ equal to the number of objects in the scene. The ED between an extracted attention point $p$ and the center $c$ of the respective object is defined as:

$$ED(p, c) = \exp\left(-\frac{\|p - c\|^2}{2\sigma^2}\right) \quad (6)$$
where $\sigma$ is an application-dependent parameter (set to 14 in our experiments), chosen with respect to the size of the largest objects graspable with our gripper. The centers of the respective objects are the physical centers of the visible parts of the objects. The ED measures the accuracy of the algorithm: attention points situated closer to the object centers obtain a larger weight than those situated farther away. Because attention points often lie on the borders of objects (the SH and RSO saliency maps essentially highlight the top border regions of objects), we furthermore applied a morphological erosion (MO) operation to the master saliency map, which pulls attention points towards the interiors of objects. Table 2 summarizes our evaluation results: the HR of the proposed attention points SH+RSO+IKN+MO (Fig. 2b) is up to 35% higher than that of the attention points calculated with the approach of [12], and morphological erosion improves the ED metric by up to 43%.

Method                 HR     AED
IKN                    0.55   0.33
IKN+MO                 0.57   0.41
SH+RSO+IKN             0.62   0.35
SH+RSO+IKN+MO          0.64   0.46
Mishra et al. [12]     0.47   0.32

Table 2. Comparison of different methods for the calculation of attention points (AED: average exponential distance).
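For reference, the three evaluation measures (Eqs. 4–6) expressed in code. This is a sketch under the assumption that segments and ground truth are boolean masks and that objects are identified by an integer label image (0 = background):

```python
import numpy as np

def f_score(segment, ground_truth):
    """Eq. (4): F = 2PR / (P + R) over boolean masks."""
    overlap = np.logical_and(segment, ground_truth).sum()
    p = overlap / max(segment.sum(), 1)         # precision
    r = overlap / max(ground_truth.sum(), 1)    # recall
    return 2 * p * r / max(p + r, 1e-9)

def hit_ratio(points, labels):
    """Eq. (5): HR = n / N, distinct objects hit over attention points used."""
    hit = {labels[y, x] for (x, y) in points} - {0}
    return len(hit) / max(len(points), 1)

def exponential_distance(point, center, sigma=14.0):
    """Eq. (6): exp(-||p - c||^2 / (2 sigma^2)); sigma = 14 as in the paper."""
    d2 = (point[0] - center[0]) ** 2 + (point[1] - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))
```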
5. Conclusion and Future Work
We presented a method for attention-driven segmentation of cluttered scenes. Our method uses 3D attention points and a probabilistic framework, based on color, depth and curvature, to calculate probabilistic edges. We showed that the proposed approach for calculating attention points yields better seeds for segmentation, and that the proposed attention points and probabilistic edges significantly improve the segmentation results on cluttered scenes compared to the existing method of [12]. Future work will include the incorporation of motion and shape cues, in addition to 3D, to further improve segmentation results.
References
[1] N. Bergstrom, M. Bjorkman, and D. Kragic. Generating object hypotheses in natural scenes through human-robot interaction. In Proc. IROS, pages 827–833, 2011.
[2] Y. Boykov and M. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In Proc. ICCV, volume 1, pages 105–112, 2001.
[3] N. Bruce and J. Tsotsos. Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9(3), 2009.
[4] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. on PAMI, 24(5):603–619, 2002.
[5] P. Felzenszwalb and D. Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2):167–181, 2004.
[6] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. on PAMI, 20(11):1254–1259, 1998.
[7] B. Ko and J. Nam. Object-of-interest image segmentation based on human attention and semantic region clustering. JOSA, 23(10):2462–2470, 2006.
[8] G. Kootstra, N. Bergstrom, and D. Kragic. Fast and automatic detection and segmentation of unknown objects. In Proc. ICHR, pages 442–447, 2010.
[9] D. Lee, L. Itti, C. Koch, and J. Braun. Attention activates winner-take-all competition among visual filters. Nature Neuroscience, 2:375–381, 1999.
[10] J. Malik, S. Belongie, T. Leung, and J. Shi. Contour and texture analysis for image segmentation. IJCV, 43(1):7–27, 2001.
[11] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. on PAMI, 26(5):530–549, 2004.
[12] A. Mishra and Y. Aloimonos. Visual segmentation of simple objects for robots. In Proc. RSS, pages 1–8, 2011.
[13] A. Mishra, Y. Aloimonos, and C. Fah. Active segmentation with fixation. In Proc. ICCV, pages 468–475, 2009.
[14] N. Ouerhani, N. Archip, H. Hügli, and P. Erard. Visual attention guided seed selection for color image segmentation. In Proc. Computer Analysis of Images and Patterns, pages 630–637, 2001.
[15] E. Potapova, M. Zillich, and M. Vincze. Learning what matters: combining probabilistic models of 2D and 3D saliency cues. In Proc. ICVS, pages 132–142, 2011.
[16] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Trans. on Graphics, 23(3):309–314, 2004.
[17] H. Zhou, H. S. Friedman, and R. von der Heydt. Coding of border ownership in monkey visual cortex. Journal of Neuroscience, 20:6594–6611, 2000.