Generating Visual Content from Images via Segmentation

Yingbo Zhou, Ifeoma Nwogu, and Venu Govindaraju
Computer Science and Engineering, SUNY at Buffalo
{yingbozh, inwogu, govind}@buffalo.edu
Abstract. In this paper, we propose a novel graph-based approach to image segmentation using color and texture features extracted from small image patches. The algorithm segments the image in two steps: it first conservatively over-segments the image and then progressively merges the resulting regions. We also generate visual content from the image by labeling it into different classes. We obtained 66.23% and 79.09% classification accuracy on our dataset and the Stanford background dataset, respectively, and our segmentation approach consistently gives superior performance compared to the over-segmentation method.

Keywords: image segmentation, image labeling, computer vision
1 Introduction
Human beings are naturally capable of visual recognition, e.g. identifying objects, yet this remains largely an open problem in computer vision. This may be attributed to the large structural and compositional differences between the human visual/cognitive system and its computer counterpart. Generating visual content from a static image involves a number of simultaneous computer vision tasks, such as global scene description, foreground/background separation, object recognition and detection, object-to-background interactions, object-to-object interactions, and the visual attributes of objects (color, shape, etc.). Scene and object analysis have made tremendous progress in recent years, much of it due to the introduction of the two dominant contextual models in computer vision (scene-based and object-based context models). In our work, we obtain visual content from an image-labeling perspective and employ a bottom-up framework. The main contributions of this paper are: i) an iterative, progressive image segmentation algorithm that utilizes higher-level features obtained from small image patches; and ii) an automated approach for labeling a static image with various categories. In addition, we comparatively study the effectiveness of image segmentation versus over-segmentation for the image labeling task. The rest of the paper is organized as follows: section 2 gives a brief review of previous work in the related areas. Section 3 presents the details of our segmentation method. Section 4 describes the various features employed in the labeling process, and section 5 illustrates the labeling framework. Experiments and results are reported in section 6, and conclusions are presented in section 7.
2 Related Work
Our work is related to several efforts in scene understanding and image segmentation. Normalized Cuts [12] is a popular image segmentation method that approaches segmentation from a perceptual grouping viewpoint, seeking a global criterion that optimizes the partition of the graph. Nock and Nielsen [9] proposed to segment an image by pixel-level region merging in a particular order; their method handles noise corruption by enlarging the smoothing and derivative convolution masks. A graph-based method for image segmentation was proposed by Felzenszwalb and Huttenlocher [1], in which every pixel is treated as a node of the graph and edges are selected either from neighborhood relationships within the image or from nearest neighbors in feature space. In their formulation, two nodes are combined only if their mutual difference is smaller than the internal difference. Recent advances in scene understanding have highlighted the importance of combining scene and object recognition, two related visual tasks traditionally studied separately. Contextual models for scene analysis, and specifically for object recognition, were therefore introduced [2, 8] and popularized by Oliva and Torralba [10]. These contextual models were based either (i) solely on scene statistics or (ii) on relationships among objects in the image. Although seemingly related, the two contextual approaches attempt to answer different questions: (i) what global scene does the picture represent (indoor, coastal, urban, etc.)? or (ii) what "objects" exist in this picture (sky, airplane, person)? The works in [11] and [3] address the scene labeling problem using the co-occurrence of objects in a scene. More recently, Li et al. [7] proposed a framework that combines both models, using high-level image features to improve object labeling and scene understanding. Similarly, [13] integrates holistic scene features into the contextual model to improve object detection.
3 Image Segmentation
We propose a graph-based method for image segmentation, but instead of treating each pixel as a node of the graph [1, 9], we first cluster small groups of pixels with similar intensities and treat the resulting superpixels as the nodes. Our reason for preferring a group representation of pixels over individual pixels is the following: a segment of an image should contain a region of similar properties, preferably belonging to the same object(s), and such properties are better represented by a group of pixels than by individual pixels, since similarity within a segment should be measured over all the pixels of the region rather than separately.
Fig. 1. The LM filter bank with four different scales, six orientations and a total of 60 filters.
This gives rise to the first assumption we make during segmentation: measuring all the pixels inside one superpixel jointly is more appropriate than measuring them individually. We therefore first over-segment the image into groups of pixels by applying the watershed algorithm to the gradient image (we use the Matlab implementation of this algorithm in our experiments). The output of the watershed algorithm is a set of small, relatively regularly shaped superpixels; in particular, the superpixels do not cross sharp boundaries in the image. This property of superpixels leads to our second assumption: no superpixel contains pixels that belong to different segments. From each superpixel patch we extract several features, including histograms of hue, saturation, and value, and the maximum response of the LM filters (texture features). The LM filters are shown in figure 1. We segment the image by combining adjacent superpixels, using the χ² similarity of their feature vectors as the merging predicate (the merging threshold was set to 0.3 in our experiments). Before segmentation we sort all superpixels by their mean intensity values, so that areas of similar color are combined with higher priority than areas of distinct colors. The χ² similarity is computed as

\chi^2(f_m, f_n) = \frac{1}{2} \sum_{i=1}^{N} \frac{(f_m(i) - f_n(i))^2}{f_m(i) + f_n(i)}    (1)
where f_m and f_n denote the feature vectors of the m-th and n-th nodes, and N is the dimension of the feature vectors. Two nodes are combined only if their χ² similarity is less than the predefined threshold. Let V and E represent the nodes (i.e., superpixels) and edge weights (i.e., adjacent-superpixel similarities), respectively; the whole process is illustrated in figure 2. Note that when the number of superpixels in one segment is smaller than a threshold (set to 10 in our experiments), we consider only the similarity of their colors for combination, because texture may not be representative when the occupied area is small (e.g., fewer than 40 pixels); otherwise, small areas with no visual difference may fail to merge because of differences in their texture features, producing many spikes in the segmentation. Some small image patches may not be combined after the first iteration, so the merging procedure is applied iteratively.
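To make the feature extraction and the merging predicate concrete, the following is a minimal Python sketch rather than the authors' original Matlab code. It assumes NumPy, SciPy, and scikit-image; the LM filter bank is approximated here by a few Gaussian-derivative filters, and the function names (superpixel_features, chi2_similarity, should_merge) are illustrative.

import numpy as np
from scipy import ndimage as ndi
from skimage.color import rgb2gray, rgb2hsv

def superpixel_features(image, labels, n_bins=16):
    # Concatenated H/S/V histograms plus a histogram of the maximum texture
    # response, giving one normalised feature vector per superpixel.
    hsv = rgb2hsv(image)
    gray = rgb2gray(image)
    # Crude stand-in for the LM filter bank: Gaussian derivatives at a few
    # scales and two orientations; per pixel we keep the maximum response.
    responses = [np.abs(ndi.gaussian_filter(gray, sigma=s, order=o))
                 for s in (1, 2, 4) for o in ((0, 1), (1, 0))]
    texture = np.max(np.stack(responses), axis=0)
    texture = texture / (texture.max() + 1e-12)

    feats = {}
    for lab in np.unique(labels):
        mask = labels == lab
        parts = [np.histogram(hsv[..., c][mask], bins=n_bins, range=(0, 1))[0]
                 for c in range(3)]
        parts.append(np.histogram(texture[mask], bins=n_bins, range=(0, 1))[0])
        f = np.concatenate(parts).astype(float)
        feats[lab] = f / (f.sum() + 1e-12)     # normalise the feature vector
    return feats

def chi2_similarity(fm, fn):
    # Eq. (1): 0.5 * sum_i (fm(i) - fn(i))^2 / (fm(i) + fn(i))
    denom = fm + fn
    denom[denom == 0] = 1.0                    # 0/0 terms contribute nothing
    return 0.5 * np.sum((fm - fn) ** 2 / denom)

def should_merge(fm, fn, threshold=0.3):
    # Merging predicate: combine two adjacent superpixels when their
    # chi-squared similarity falls below the threshold (0.3 in our experiments).
    return chi2_similarity(fm, fn) < threshold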
Function SegmentImage(V)
  For each v_i ∈ V
    Find the adjacent nodes N
    For each n_j ∈ N
      s = χ²(Features(v_i), Features(n_j))
      If s < threshold
        Merge v_i and n_j
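A hedged end-to-end sketch of the procedure (again in Python rather than the Matlab pipeline used in our experiments) is given below. It reuses superpixel_features and should_merge from the previous listing, obtains superpixels by running watershed on the gradient image, visits them in order of mean intensity, and greedily merges adjacent regions until no further merges occur. The colour-only rule for very small segments is omitted for brevity, and the feature vectors of merged regions are simply averaged rather than recomputed, so this illustrates the idea rather than the exact implementation; the helper names adjacent_pairs and segment_image are illustrative.

import numpy as np
from skimage.color import rgb2gray
from skimage.filters import sobel
from skimage.segmentation import watershed

def adjacent_pairs(labels):
    # All unordered pairs of 4-connected neighbouring superpixel labels.
    pairs = set()
    for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
        diff = a != b
        pairs.update(zip(a[diff].tolist(), b[diff].tolist()))
    return {tuple(sorted(p)) for p in pairs}

def segment_image(image, threshold=0.3):
    gray = rgb2gray(image)
    # Conservative over-segmentation: watershed on the gradient image.
    labels = watershed(sobel(gray))
    feats = superpixel_features(image, labels)   # from the previous listing

    # Visit superpixels in order of mean intensity, so that areas of similar
    # colour are considered for merging first.
    order = sorted(np.unique(labels).tolist(),
                   key=lambda l: gray[labels == l].mean())
    merged = True
    while merged:                                # iterate until no more merges
        merged = False
        adjacency = adjacent_pairs(labels)
        for v in order:
            if v not in feats:
                continue                         # already absorbed elsewhere
            for a, b in adjacency:
                n = b if a == v else a if b == v else None
                if n is None or n not in feats:
                    continue
                if should_merge(feats[v], feats[n], threshold):
                    labels[labels == n] = v      # absorb the neighbour into v
                    feats[v] = 0.5 * (feats[v] + feats[n])   # crude feature update
                    del feats[n]
                    merged = True
    return labels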