IMAGE SEGMENTATION USING SALIENT POINTS-BASED OBJECT TEMPLATES

Hui Zhang and Sally A. Goldman
Department of Computer Science and Engineering, Washington University, St. Louis, MO 63130

ABSTRACT

Using prior knowledge about object(s) is beneficial for accurate image segmentation. We present an image segmentation method that performs segmentation using a set of user-provided object templates. We call the salient points of an object template that fall within the object the characteristic points of the object. Given an arbitrary image, each object template can be mapped onto the image as a bit mask by using the correspondence between the salient points of the image and the characteristic points of the object template. To segment a desired object o in an image, the bit masks obtained using all templates for o are combined. Experiments show that this method can create good segmentations.

1. INTRODUCTION

Obtaining an accurate object/background segmentation is critical for many applications performing tasks such as target recognition, image understanding and image retrieval. Since purely data-driven segmentation algorithms often cannot generate satisfactory results, use of prior knowledge about object(s) can be beneficial [1]. In this paper, we present an object/background segmentation method that uses object templates to segment a new image. First we create object templates from training examples with human-segmented object/background, and assign characteristic points to these templates using salient points extracted from the object part of these images. Salient points are then extracted from the image to be segmented. Based on the correspondence between the salient points of an arbitrary image and the characteristic points of the objects, object templates are mapped onto the image as bit masks. The final segmentation combines the results of all mapped object templates for the same object.

The remainder of this paper is organized as follows: In Section 2, we provide an overview of prior work related to our method. In Section 3, we give a detailed description of our approach. Experimental results are presented in Section 4. We discuss future work and conclude the paper in Section 5.

Email: {huizhang, sg}@wustl.edu. This material is based upon work supported by NSF grant No. 0329241. We thank Robert Pless and Sharath Cholleti for many useful discussions.

2. RELATED WORK

Many segmentation methods have been proposed to incorporate prior knowledge about object(s) into the segmentation process. Prior knowledge is described through modeling or by training, such as using an implicit shape model [2], an active appearance model [3] or deformable templates [4]. To obtain prior knowledge, many methods learn global or local features in fixed configurations [5], which requires a large number of training examples. Some methods learn the configurations of selected object parts [6, 7]. Mori et al. [8] combine segmentation and recognition to segment the human body. Yu and Shi [9] use object-specific patch-based templates to improve the Normalized Cuts framework. Torralba et al. [10] examine relationships between local object detectors and the global scene context. Yu et al. [11] and Ferrari et al. [12] also propose methods to combine recognition and segmentation. Other knowledge-incorporated segmentation methods include the figure-ground segmentation from unsegmented training examples proposed by Borenstein and Ullman [13], and the integrated framework proposed by Tu [14].

3. OUR APPROACH

We present a segmentation method that uses object templates to guide the segmentation. The object templates are generated from a small set of human-segmented images of each object. The correspondence between the object in an image and the object templates is based on the similarity between their salient points.

3.1. Overview

The characteristic points (i.e., the salient points within the user-provided object templates) are computed for each object template. Given an image I to segment, the salient points are first extracted from I. These salient points are then compared with the characteristic points of each object template. Using the correspondence between the salient points and characteristic points, we map the object templates onto this image as bit masks using affine transforms. The final segmentation is created by combining all mapped object templates. The process is illustrated in Figure 1, and each step is detailed in the following subsections.
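The overview above can be sketched as a voting loop over templates and matched triangles. This is a minimal sketch only: the helper callables (extract, match_triangles, warp_mask) are hypothetical stand-ins for the steps detailed in the following subsections, not an API defined by this paper.

```python
def segment(image, templates, extract, match_triangles, warp_mask):
    """Sketch of the pipeline in the overview above.

    Hypothetical helpers (stand-ins for the later subsections):
      extract(image)            -> salient points of the image
      match_triangles(pts, tpl) -> (image_triangle, template_triangle) pairs
      warp_mask(tpl, tri, ttri) -> image pixels covered by the template
                                   bit mask under the induced affine map
    Returns per-pixel vote counts; thresholding them yields the final mask.
    """
    votes = {}
    pts = extract(image)
    for tpl in templates:
        for tri, ttri in match_triangles(pts, tpl):
            for px in warp_mask(tpl, tri, ttri):
                votes[px] = votes.get(px, 0) + 1
    return votes
```

Accumulating votes rather than a hard mask mirrors the combination step: every mapped template contributes, and a final threshold decides object versus background.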

Fig. 1. Image segmentation using salient points-based object templates (only one object template is shown in this figure). The pipeline: salient points extraction, salient points matching against an object template, affine transform to obtain the mapped template, combining multiple mapped templates, final segmentation.

3.2. Salient Points Extraction

In our experiments, we use a wavelet-based salient point extraction method inspired by Sebe et al. [15]. We use the Daubechies-4 wavelet and study image I at j scales (j ∈ N and j > 0). Suppose Wj(p) denotes the wavelet coefficients at scale 2^−j, and C(Wj(p)) denotes the set of coefficients for point p at scale 2^(−j+1), i.e. C(Wj(p)) = {W(j−1)(k), 2n ≤ k ≤ 2n + 2r − 1}, where r is the wavelet regularity, 0 ≤ n < 2^−j N, and N is the length of the signal. The most salient subset of Wj(p) is the one with the highest wavelet coefficients at the next scale 2^(−j+1). Recursively, we find the salient points of the whole image, with saliency value Σ(k=1..j) |C^(k)(Wj(p))|, where 1 ≤ j ≤ log2 N. These salient points are sorted by their saliency value, and the 100 most salient points are used. Alternate salient point extraction methods, such as SIFT [16], could instead be used.

3.3. Object Templates

Object templates are created from a small number of training examples, each of which is an image with a user-provided object/background segmentation. We define an object template by

P(i) = 1 if i is on the object, and P(i) = 0 otherwise,

where i is a pixel (which may fall outside the boundary of image I). Let SPj be the set of salient points in image Ij. The characteristic points for image Ij are defined as {p ∈ SPj | P(p) = 1}.

3.4. Salient Points Matching

Matching salient points is critical for finding the correspondence between the object in the image and the object templates. Suppose we have images I1 and I2 with corresponding salient point sets SP1 and SP2. For each salient point pk in SP1, we find its matching set mp(pk) = {pj | pj ∈ SP2 and dist(pk, pj) ≤ τ} for a selected threshold τ. The distance between two salient points is defined as the Euclidean distance between the SIFT descriptors [16] of these points. These 4×4 SIFT descriptors are extracted from a 16×16 window centered at the current salient point, measuring its local image gradient orientation histogram. Other shape matching techniques, such as geometric blur [17], can also be used.

After this step, we have a matching set for each salient point in SP1. If pk1, pk2 and pk3 are three non-collinear salient points in SP1, we can connect them into a triangle, and there are |mp(pk1)| × |mp(pk2)| × |mp(pk3)| different matching triangles for this triangle in I2.

3.5. Segmentation Using Object Templates

Our technique requires that an object in one image is an affine transform of itself in another image. This assumption holds in most object/background segmentation tasks when the images are not close-ups of a 3-dimensional object and the object is rigid. So, for a pixel p on a desired object in image I1, its corresponding pixel M(p) in I2 can be computed by an affine transform M(p) = Ap + B. Six equations uniquely determine the matrix A and vector B that define a linear combination of translation, rotation, scaling and/or shearing operations. A triangle and one of its matching triangles therefore uniquely define an affine transform. All pixels on the object in I1 should be on the object in I2 after this transform.

To segment image I, we first extract its salient points SP_I. Any three non-collinear salient points can be connected into a triangle, and we use Δ_I to denote all such triangles in SP_I. Then for each salient point, we find its matching characteristic points for all object templates. For each salient point triangle t ∈ Δ_I, we use MT_I^k(t) to denote all its matching triangles on object template k.
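The six equations determining A and B can be solved directly from one triangle and its matching triangle, since each vertex correspondence contributes two linear equations. A minimal numpy sketch (the function name is ours, not the paper's):

```python
import numpy as np

def affine_from_triangles(src, dst):
    """Recover A (2x2) and B (2-vector) of M(p) = A p + B from three
    non-collinear point correspondences (a triangle and its match).

    Each correspondence contributes two linear equations, so three
    vertices give the six equations that determine A and B uniquely.
    """
    src = np.asarray(src, dtype=float)     # 3x2 triangle in I1
    dst = np.asarray(dst, dtype=float)     # 3x2 matching triangle in I2
    X = np.hstack([src, np.ones((3, 1))])  # rows [p | 1]; invertible
                                           # iff vertices non-collinear
    coeffs = np.linalg.solve(X, dst)       # 3x2 solution, both axes at once
    return coeffs[:2].T, coeffs[2]         # A, B

# Example: a scale-by-2 plus translation (3, 5) is recovered exactly.
src = [(0, 0), (1, 0), (0, 1)]
dst = [(3, 5), (5, 5), (3, 7)]
A, B = affine_from_triangles(src, dst)
```

Once A and B are known, any template pixel can be pushed through A p + B to build the mapped bit mask.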

Fig. 2. Segmentation results comparison. First row: original images; second row: results using mean shift based segmentation; third row: results using hierarchical segmentation; fourth row: results using our method.

An object template can be mapped as a bit mask onto I using the affine transform determined by each t and a matching triangle t′ on object template k, which maps t′ to t. Conversely, an affine transform M_k^tt′ can be determined by t and t′ that maps t to t′, and then a pixel i in I can be mapped onto object template k as M_k^tt′(i). Finally, all object templates mapped onto I can be combined as:

P_I(i) = Σ(k=1..N) Σ(t ∈ Δ_I) Σ(t′ ∈ MT_I^k(t)) P_k(M_k^tt′(i)),

where i ∈ I, N is the number of object templates and P_k is object template k. An alternative method to define P_I is to first use RANSAC [18] to find the best match between the object in an image and each of its templates, and then use that best match for the affine transform. In this case, |MT_I^k(t)| = 1. The final segmentation S_I of I is:

S_I(i) = 1 if P_I(i) ≥ τ, and S_I(i) = 0 otherwise,

where 1 indicates that pixel i is on the object, 0 means it is on the background, and τ is a threshold.

4. EXPERIMENTAL RESULTS

To compare our segmentation method with other approaches, we applied it to the SIVAL image set [19]. SIVAL includes 25 different image categories with 60 images in each category. The categories consist of images containing a single object photographed against highly diverse backgrounds. The object may occur anywhere spatially in the image. The size of

these images is 1024×768. Six images from each category are manually segmented and used as the object templates. We compare our object/background segmentations with those generated by both a mean-shift based image segmentation method [20] and a hierarchical segmentation method [21]. To compare our method to these other methods, which are general segmentation (versus object/background segmentation) methods, we consider the optimal way to label each output segment as either object or background. That is, for the mean-shift based and hierarchical segmentation methods we apply a post-processing step in which a person optimally partitions the segments into object and background. A sample of comparisons is shown in Figure 2.

As compared to other shape matching based methods [2, 3, 4], our method is more versatile since it can segment objects that are rotated or scaled, because of the use of SIFT features in matching. Also, we do not need a large number of training images to create the templates. Our method uses the object templates to improve the segmentation, which incorporates prior knowledge about the object, so it works better than the other methods, in which no domain knowledge is used. For instance, the other two segmentation methods often regard the transparent part of the Ajax Orange bottle as part of the background. Only our method can successfully identify it as part of the object.

5. CONCLUSION

We present an object/background segmentation method that uses object templates to improve segmentation quality. The object templates and their characteristic points are generated

beforehand. The correspondence between the object in an image and the object templates is based on the similarity between their salient points. Object templates are mapped onto the image as bit masks, and the final segmentation is the combined result of all bit masks. Experiments demonstrate that our method can generate better object/background segmentations than current segmentation methods that do not use prior knowledge.

Our current object template uses only location information to define whether a pixel is on an object. So images containing the exact same object must be used to create object templates, and these templates can only be used to find this object. We can extend our definition of an object template to a probabilistic model by using P(i, f(i)) to describe the object templates, where f(i) is the image features of pixel i, and by assigning different weights to different templates, so that the templates can be used for a wider variety of objects.

Our method has shortcomings. First, since the bit masks are mapped from object templates using affine transforms, it places shape constraints on the objects. The objects must be rigid, and the images cannot be close-ups of 3-dimensional objects taken from different angles. Second, the correspondence between the object in the images and the object templates is based on salient points matching, so our method depends highly on the performance of the salient point extractor. For those images that we cannot match using SIFT features, our method will fail. Third, for images with complex backgrounds, a large number of salient points should be extracted so that enough salient points fall on objects in the image. For objects with variable appearance, a large number of templates is required so that enough characteristic points can be provided for good segmentation. For future work, we plan to extend our method to overcome these shortcomings.
6. REFERENCES

[1] H. Zhang and S. A. Goldman, "Perceptual information of images and the bias in homogeneity-based segmentation," IEEE Workshop on POCV, 2006.
[2] B. Leibe, A. Leonardis, and B. Schiele, "Combined object categorization and segmentation with an implicit shape model," ECCV Workshop on Statistical Learning in Computer Vision, 2004.
[3] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," ECCV, 1998.
[4] A. L. Yuille, D. S. Cohen, and P. W. Hallinan, "Feature extraction from faces using deformable templates," CVPR, 1989.
[5] C. Papageorgiou and T. Poggio, "A trainable system for object detection," IJCV, vol. 38, no. 1, pp. 15–33, 2000.
[6] M. C. Burl, M. Weber, and P. Perona, "A probabilistic approach to object recognition using local photometry and global geometry," ECCV, pp. 109–122, 1998.
[7] A. Mohan, C. Papageorgiou, and T. Poggio, "Example-based object detection in images by components," IEEE Trans. PAMI, vol. 23, no. 4, pp. 349–361, 2001.
[8] G. Mori, X. Ren, A. A. Efros, and J. Malik, "Recovering human body configurations: Combining segmentation and recognition," CVPR, 2004.
[9] S. X. Yu and J. Shi, "Object-specific figure-ground segregation," CVPR, pp. 39–45, 2003.
[10] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," ICCV, pp. 273–280, 2003.
[11] S. X. Yu, R. Gross, and J. Shi, "Concurrent object recognition and segmentation by graph partitioning," Neural Information Processing Systems, 2002.
[12] V. Ferrari, T. Tuytelaars, and L. Van Gool, "Simultaneous object recognition and segmentation by image exploration," ECCV, 2004.
[13] E. Borenstein and S. Ullman, "Learning to segment," ECCV, pp. 315–328, 2004.
[14] Z. Tu and S.-C. Zhu, "An integrated framework for image segmentation and perceptual grouping," ICCV, 2005.
[15] N. Sebe, Q. Tian, E. Loupias, M. S. Lew, and T. S. Huang, "Evaluation of salient point techniques," Image and Vision Computing, 2003.
[16] D. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, 2004.
[17] A. C. Berg and J. Malik, "Geometric blur for template matching," CVPR, 2001.
[18] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," CACM, 1981.
[19] SIVAL benchmark, http://www.cs.wustl.edu/~sg/accio/sival.html.
[20] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Trans. PAMI, vol. 24, pp. 603–619, May 2002.
[21] H. Zhang, J. Fritts, and S. Goldman, "An improved fine-grain hierarchical method of image segmentation," Technical report, Washington University, 2005.