Object Detection via Foreground Contour Feature Selection and Part-based Shape Model

Huigang Zhang, Beihang University, [email protected]
Jun Zhou, Griffith University, [email protected]
Junxiu Wang, Beihang University, [email protected]
Xiao Bao, Beihang University, [email protected]
Jian Cheng, Chinese Academy of Sciences, [email protected]
Huijie Zhao, Beihang University, [email protected]

Abstract

In this paper, we propose a novel approach for object detection via foreground feature selection and a part-based shape model. It automatically learns a shape model from cluttered training images without requiring explicitly given bounding boxes on objects. Our approach commences by extracting a set of feature descriptors, and then iteratively selects the foreground features using Earth Mover's Distance based matching. This leads to a part-based shape model that can be used for object detection. Experimental results show that the proposed method has comparable performance with state-of-the-art shape-based detection methods, but with fewer requirements on the data at the training stage.
1. Introduction

Object detection is a challenging task in computer vision and pattern recognition. Although various features and structures have been developed for this purpose, psychophysical studies show that humans can recognize objects using fragments of outline contour alone [12]. Compared to other features, contour information is invariant to color, texture, and brightness changes, which enables a significant reduction in the number of required training examples without loss of detection accuracy [14]. In the past decade, several contour-based methods have demonstrated excellent performance in object detection and recognition [7, 13]. Shotton et al. [11] and Opelt et al. [8] simultaneously proposed similar recognition frameworks based on contour fragment codebook learning and clutter-sensitive chamfer matching, respectively.
Figure 1. Illustration of the proposed part-based shape model. (a) The original training images. (b) Foreground features selected via weight updating. (c) The category center image. (d) Foreground features after clustering. (e) Part-based shape model for object detection.
Extending the contour-based approach, Shotton et al. introduced a bootstrapping technique that augments the sparse set of training examples [12]. In [4, 3], Ferrari et al. built a codebook of pairs of adjacent segments (PAS) from cropped training images, and localized objects in the form of shape boundaries. Zhu et al. [14] introduced a contour context selection framework for detecting objects in cluttered images using only one exemplar. More recently, sophisticated training processes have been designed so that models are automatically learned from training data instead of using a single category shape prototype [13]. Lu et al. [6] used particle filters as an inference tool to select relevant contour fragments in edge images in order to match the model contours.
Most contour-based methods use ground-truth bounding-boxes during the training stage in order to obtain reliable models. This makes the training process inconvenient and less automatic, as the bounding-boxes are normally labeled manually. Moreover, the local shape features used in these methods are often too generic: they are easily matched to irrelevant parts of an image, while the geometric relationships between them are often not quantitatively evaluated. To overcome these problems, we propose a framework for object detection via foreground feature selection and a part-based shape model, as shown in Figure 1. The novelty of this framework lies in three aspects. Firstly, a novel contour context descriptor is developed, which combines the PAS feature and its context information for shape part description. Secondly, we propose a novel method to eliminate background responses during object boundary model learning without using ground-truth bounding-boxes. Finally, a simple deformable part model for object identification is proposed. The effectiveness of this framework is validated by experiments on the ETHZ shape dataset.
2. Approach

Given a set of training images with cluttered backgrounds, our method extracts feature descriptors, calculates initial similarities between them using the fast and robust Earth Mover's Distance (fast EMD) [9], and iteratively selects the foreground features through the updated fast EMDs. The selected features are used to form a part-based shape model. Object detection is then achieved by matching each testing image against this model. The details are discussed below.
2.1. Feature Description

In our method, PAS is used to extract the basic features because of its scale-invariant local property and good performance in object detection [3]. We extend PAS by combining the local feature with context information, which leads to a novel PAS context feature.

The method commences by constructing a 40-dimensional codebook via clustering a set of PAS descriptors extracted from the training set [3]. For each image, at the location where each PAS feature is extracted, we construct log-polar coordinates centered at it, which form the five circles shown in Figure 2(a). The radii of these five circles are 10, 20, 30, 40, and 50, respectively; they divide the neighborhood of a feature location into five regions. Each region is assigned a weight $\alpha$, set to 5, 3, 2, 1.5, and 1 from inside to outside. We then map all PAS features in the neighborhood to their closest codebook entries, and extend each PAS feature to a context descriptor $p = [p_1, p_2, \ldots, p_{40}]^T$ (Figure 2(b)), where

$$p_i = \sum_{j=1}^{5} \alpha_j \times \#\{q \neq p : q \in \mathrm{codebook}(i) \text{ and } q \in \mathrm{region}(j)\}$$

i.e., $p_i$ counts the PAS features around $p$ that belong to codebook entry $i$, weighted by their location weights $\alpha_j$.
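To make the construction concrete, the following minimal sketch (with function name and data layout of our own choosing, not the authors') computes the descriptor for a single feature location, assuming every neighboring PAS feature has already been assigned to its nearest codebook entry:

```python
import numpy as np

# Ring radii and inside-to-outside weights from Section 2.1.
RADII = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
ALPHA = np.array([5.0, 3.0, 2.0, 1.5, 1.0])

def pas_context_descriptor(center_xy, neighbors_xy, codebook_ids, n_words=40):
    """40-D PAS context descriptor for one feature location.

    center_xy    : (2,) location of the PAS feature being described
    neighbors_xy : (n, 2) locations of the other PAS features in the image
    codebook_ids : (n,) nearest codebook entry index of each neighbor
    """
    p = np.zeros(n_words)
    dists = np.linalg.norm(neighbors_xy - center_xy, axis=1)
    for d, word in zip(dists, codebook_ids):
        region = np.searchsorted(RADII, d)  # which of the five rings d falls in
        if region < len(RADII):             # assumption: features beyond r = 50 are ignored
            p[word] += ALPHA[region]
    return p
```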
Figure 2. PAS context computation (a) and descriptor (b).

Generally speaking, local features such as PAS are favored for foreground feature selection due to their resilience under occlusion and clutter. Yet too much locality can be problematic: features with minimal spatial extent may be too generic and easily matched to anything. The proposed PAS context descriptor alleviates these problems by encoding the descriptors and relative locations of other features in a spatial neighborhood.
2.2. Foreground Feature Selection

With the proposed PAS context descriptors available, we continue with the foreground contour feature selection. We observe that images of the same category exhibit higher mutual similarity because they share a set of common foreground features, and that the selected foreground features in turn help improve the similarity between images. Following recent research [10], the main process is divided into two steps.

2.2.1. Pairwise matching via EMD. The first step is pairwise image matching using fast EMD [9]. Let $X = \{(f_1, w_1), (f_2, w_2), \ldots, (f_{|X|}, w_{|X|})\}$ be the representation of an image $I$, where $f_i$ corresponds to a PAS context feature and $w_i \geq 0$ is the corresponding weight. Initially, all feature weights are set to 1. Given two representations $X_p$, $X_q$ of images $I_p$, $I_q$, the fast EMD is

$$fEMD(X_p, X_q) = \Big( \min_{\{f_{ij}\}} \sum_{i,j} f_{ij} d_{ij} \Big) \Big/ \sum_{i,j} f_{ij}$$

$$\text{s.t.} \quad f_{ij} \geq 0, \quad \sum_{j} f_{ij} \leq w_i, \quad \sum_{i} f_{ij} \leq w_j, \quad \sum_{i,j} f_{ij} = \min\Big( \sum_{i} w_i, \sum_{j} w_j \Big)$$

where $\{f_{ij}\}$ is the flow matrix, with element $f_{ij}$ indicating the flow between features $f_i (\in X_p)$ and $f_j (\in X_q)$, and $\{d_{ij}\}$ is the thresholded distance matrix, with element $d_{ij} = \min(d(i, j), t)$, where $d(i, j)$ is the Euclidean distance between features $f_i$ and $f_j$, and $t > 0$ is a threshold controlling the speed of the EMD computation: the smaller $t$ is, the faster the process.
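The paper uses the fast algorithm of Pele and Werman [9] to solve this problem. As a self-contained illustration only, the sketch below solves the same transportation linear program directly with SciPy; it is far slower than fast EMD, but it makes the thresholded ground distance and the flow matrix explicit (all names are ours):

```python
import numpy as np
from scipy.optimize import linprog

def thresholded_emd(w_p, w_q, D, t=2.0):
    """EMD between two weighted feature sets with a thresholded ground distance.

    w_p, w_q : (m,), (n,) feature weights of images I_p and I_q
    D        : (m, n) Euclidean distances between PAS context descriptors
    t        : distance threshold (speed/accuracy trade-off in fast EMD)
    """
    m, n = D.shape
    d = np.minimum(D, t).ravel()            # d_ij = min(d(i, j), t)
    total = min(w_p.sum(), w_q.sum())

    # Row sums <= w_p and column sums <= w_q (inequality constraints).
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([w_p, w_q])

    # Total flow equals the smaller of the two total masses.
    A_eq = np.ones((1, m * n))
    b_eq = [total]

    res = linprog(d, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    flow = res.x.reshape(m, n)
    return res.fun / total, flow            # normalized cost and flow matrix
```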
2.2.2. Feature weights update. The second step updates the feature weights according to each feature's contribution to the fast EMD calculation. From Section 2.2.1, we know that the flow matrix $\{f_{ij}\}$ reflects the correspondence between the features of two images. We define the contribution of feature $f_i$ to image $I_q$ as

$$c_q(i) = \sum_{j} f_{ij} \times \delta_j / d_{ij}, \quad \text{where} \quad \delta_j = n \times w_j \Big/ \sum_{k=1}^{n} w_k$$

and $n$ is the number of PAS context features in image $I_q$. The mean of all related contributions within one category is used as the updated weight of feature $f_i$: if $N$ is the number of images in the category, the weight of feature $f_i$ in image $I_p$ becomes

$$w_i = \frac{1}{N-1} \sum_{q=1}^{N-1} c_q(i).$$

The above two steps are performed iteratively. We measure the difference in the average feature weights, computed over all images, between two consecutive iterations. If this difference is lower than 10%, the iteration stops and the PAS context features with high weights are selected (Figure 1(b)). This step extracts the foreground contour features, which are used to build the shape model. Note that ground-truth bounding-boxes are not required here.
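A rough sketch of the resulting alternation, reusing thresholded_emd from above and assuming a per-image layout of descriptors and weights that we invented for illustration:

```python
import numpy as np

def select_foreground(images, n_iters=20, tol=0.10, t=2.0):
    """Alternate pairwise fast-EMD matching and feature weight updates.

    images : list of dicts, one per training image of a category, with
             'desc' ((n_i, 40) PAS context descriptors) and
             'w'    ((n_i,) feature weights, initialized to 1).
    """
    N = len(images)
    prev_mean = np.mean([img['w'].mean() for img in images])
    for _ in range(n_iters):
        new_w = [np.zeros_like(img['w']) for img in images]
        for p in range(N):
            for q in range(N):
                if p == q:
                    continue
                # Euclidean distances between descriptor sets of I_p and I_q.
                D = np.linalg.norm(images[p]['desc'][:, None, :]
                                   - images[q]['desc'][None, :, :], axis=2)
                _, flow = thresholded_emd(images[p]['w'], images[q]['w'], D, t)
                # delta_j = n * w_j / sum_k w_k  (Section 2.2.2)
                w_q = images[q]['w']
                delta = len(w_q) * w_q / w_q.sum()
                # c_q(i) = sum_j f_ij * delta_j / d_ij, with thresholded d_ij.
                d_t = np.maximum(np.minimum(D, t), 1e-8)
                c = (flow * delta[None, :] / d_t).sum(axis=1)
                new_w[p] += c / (N - 1)       # mean contribution over category
        for img, w in zip(images, new_w):
            img['w'] = w
        # Stop when the average weight changes by less than 10%.
        mean_w = np.mean([img['w'].mean() for img in images])
        if abs(mean_w - prev_mean) <= tol * max(prev_mean, 1e-8):
            break
        prev_mean = mean_w
    # High-weight features are kept as foreground (the cutoff is a design choice).
    return images
```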
2.3. Part-based Shape Model

Having selected the foreground features from the training images, we build a model for object detection. We use a part-based shape model inspired by the structural model of [2].

First, for the training images of the same category, we choose one image as the category center based on the fast EMD distances calculated in Section 2.2: the central image is the one whose sum of EMD distances to the other images of the category is smallest, meaning it has the highest similarity to the others. Let the central image be $M$ (Figure 1(c)); its foreground is represented as $M = \{(f_1, w_1), (f_2, w_2), \ldots, (f_m, w_m)\}$, where $m$ is the number of foreground features selected in Section 2.2.

Second, we group the selected features of $M$ into $k$ clusters according to their locations using the K-means clustering algorithm, as shown in Figure 1(d). We then assign each cluster a rectangular box representing one part of the object. By combining these rectangular boxes, we obtain the root rectangular box and the corresponding center of the object, as illustrated in Figure 1(e). Each part and the root can be represented as a set of PAS context features and a location center.

Third, to represent the model mathematically, we normalize each part and the root to a 50-dimensional histogram using the bag-of-words method [1]. Each part model $P_i$ is thus defined by a 2-tuple $(F_i, v_i)$, where $F_i$ is a 50-dimensional histogram descriptor for the $i$-th part and $v_i$ is a two-dimensional vector specifying an anchor position for part $i$ relative to the root position, $i = 1, 2, \ldots, k$. The whole part-based shape model is formally defined as $(F_0, P_1, \ldots, P_k)$, where $F_0$ is a 50-dimensional histogram descriptor for the root.
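Under the assumption that the 50-word bag-of-words codebook is given, the model construction might be sketched as follows; KMeans comes from scikit-learn, bow_histogram is a stand-in quantizer of our own, and the rectangular box extents are omitted for brevity:

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_histogram(descriptors, codebook):
    """Quantize descriptors to their nearest codebook word and histogram them."""
    D = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    hist = np.bincount(D.argmin(axis=1), minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

def build_part_model(locations, descriptors, codebook, k=4):
    """Assemble (F0, P1, ..., Pk) from the category-center image.

    locations   : (m, 2) positions of the selected foreground features
    descriptors : (m, 40) their PAS context descriptors
    codebook    : (50, 40) bag-of-words codebook (assumed given)
    """
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(locations)
    root_center = locations.mean(axis=0)
    F0 = bow_histogram(descriptors, codebook)            # 50-D root descriptor
    parts = []
    for i in range(k):
        mask = labels == i
        Fi = bow_histogram(descriptors[mask], codebook)  # 50-D part descriptor
        vi = locations[mask].mean(axis=0) - root_center  # anchor relative to root
        parts.append((Fi, vi))
    return F0, parts
```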
2.4. Object Detection

To detect objects in an image, we first extract the PAS context descriptors. Our detection system scans the image with a sliding window, traversing each testing image at 5 different scales (0.5, 0.8, 1, 1.2, and 1.5 times the shape model scale). We compute an overall score for each root location according to the best possible placement of the parts under the part model. By applying a threshold at each possible root location, we can detect multiple objects in an image: we retain all hypotheses whose cost is less than the detection threshold, which is 2 in our experiments. Across different part scales, we perform non-maximum suppression by retaining only the lowest-cost hypothesis among those with significant overlap. Given a testing image $T$, let $(F_0^t, P_1^t, \ldots, P_k^t)$ be its possible detection result at location $t$. An optimal match of the model to image $T$ is then defined as

$$t^* = \arg\min_t \Big( \sum_{i=0}^{k} d(F_i, F_i^t) + \sum_{i=1}^{k} d(v_i, v_i^t) \Big)$$

where $d(\cdot, \cdot)$ is the distance function measuring similarity; we use the Euclidean distance in our experiments.
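A minimal sketch of this matching cost, assuming the best part placements inside the window have already been found (names are illustrative, not the authors' implementation):

```python
import numpy as np

def match_cost(F0, parts, F0_t, parts_t):
    """Cost of one root placement t: appearance terms plus deformation terms.

    F0, parts     : the learned model (root histogram and [(Fi, vi), ...])
    F0_t, parts_t : histograms and relative part positions extracted from the
                    window, each part already at its best placement
    """
    cost = np.linalg.norm(F0 - F0_t)                     # d(F0, F0^t)
    for (Fi, vi), (Fi_t, vi_t) in zip(parts, parts_t):
        cost += np.linalg.norm(Fi - Fi_t)                # d(Fi, Fi^t)
        cost += np.linalg.norm(np.asarray(vi) - vi_t)    # d(vi, vi^t)
    return cost  # keep the hypothesis if cost < 2, as in the paper
```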
3. Experiments

We tested our method on the ETHZ shape dataset [5], which contains 5 different object categories with 255 images in total. It is highly challenging because of significant intra-class variation, scale changes, and many severely cluttered images in which objects comprise only a fraction of the total image area. We followed the training/testing protocol described in [4]: we used the first half of the images in each class for training and the other half for testing. We used the following criterion to evaluate the method: a detection is counted as correct if the intersection-over-union ratio (IoU) with the ground-truth bounding-box is greater than 20% (20%-IoU) or 50% (50%-IoU).
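For concreteness, this correctness test is the standard IoU check on axis-aligned boxes, which might be written as:

```python
def is_correct_detection(det, gt, min_iou=0.5):
    """True if the detection box overlaps the ground truth enough.

    Boxes are (x1, y1, x2, y2); min_iou is 0.2 or 0.5 for the two criteria.
    """
    ix1, iy1 = max(det[0], gt[0]), max(det[1], gt[1])
    ix2, iy2 = min(det[2], gt[2]), min(det[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((det[2] - det[0]) * (det[3] - det[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union > min_iou
```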
Figure 3. FPPI/DR curves of our method on the ETHZ shape classes.
Method                  Applelogos   Bottles      Giraffes     Mugs         Swans        Mean
Our method              0.846/0.902  0.966/0.966  0.883/0.883  0.849/0.867  1/1          0.909/0.924
Srinivasan et al. [13]  0.95/0.95    1/1          0.872/0.896  0.936/0.936  1/1          0.952/0.956
Lu et al. [6]           0.9/0.9      0.792/0.792  0.734/0.77   0.813/0.833  0.938/0.938  0.836/0.851
Maji et al. [7]         0.95/0.95    0.929/0.964  0.896/0.896  0.936/0.967  0.882/0.882  0.919/0.932
Ferrari et al. [4]      0.777/0.832  0.798/0.816  0.399/0.445  0.751/0.8    0.632/0.705  0.671/0.72

Table 1. Comparison of detection rates at 0.3/0.4 FPPI on the ETHZ shape classes.
Figure 3 shows the detection rate vs. false positives per image (FPPI/DR) curves of our method under the above criteria. Although we learn our shape model without using ground-truth bounding-boxes, we obtain detection results comparable to several alternatives that require them [13, 6, 7, 4]. Table 1 compares detection rates at 0.3/0.4 FPPI using the 50%-IoU criterion. We observe that our method performs well on all classes except Applelogos and Mugs when compared with the other methods. The reason is that these two classes are geometrically very simple: when their instances are small in an image, it becomes hard to distinguish their parts, so some small instances may not be detected.
4. Conclusion

This paper introduced a part-based shape model for object detection. We have shown how the proposed contour feature descriptor can be used for foreground feature selection by iteratively computing pairwise fast EMDs. Moreover, we have built a simple but effective shape model from the selected features for the detection process, which does not require ground-truth bounding-boxes. The results indicate that our detection performance is comparable to other methods reported in the literature. Future work will address a more precise framework for shape part description.
References

[1] L. Fei-Fei, R. Fergus, and A. Torralba. Recognizing and learning object categories. In ICCV Short Course, 2005.
[2] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32:1627-1645, 2010.
[3] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid. Groups of adjacent contour segments for object detection. PAMI, 30:36-51, 2008.
[4] V. Ferrari, F. Jurie, and C. Schmid. Accurate object detection with deformable shape models learnt from images. In CVPR, pages 1-8, 2007.
[5] V. Ferrari, T. Tuytelaars, and L. Van Gool. Object detection by contour segment networks. In ECCV, pages 14-28, 2006.
[6] C. Lu, L. Latecki, N. Adluru, X. Yang, and H. Ling. Shape guided contour grouping with particle filters. In ICCV, pages 2288-2295, 2009.
[7] S. Maji and J. Malik. Object detection using a max-margin hough transform. In CVPR, pages 1038-1045, 2009.
[8] A. Opelt, A. Pinz, and A. Zisserman. A boundary-fragment-model for object detection. In ECCV, pages 575-588, 2006.
[9] O. Pele and M. Werman. Fast and robust earth mover's distances. In ICCV, 2009.
[10] L. Shang and B. Xiao. Discriminative features for image classification and retrieval. Pattern Recognition Letters, 2011.
[11] J. Shotton, A. Blake, and R. Cipolla. Contour-based learning for object detection. In ICCV, volume 1, pages 503-510, 2005.
[12] J. Shotton, A. Blake, and R. Cipolla. Multiscale categorical object recognition using contour fragments. PAMI, 30:1270-1281, 2008.
[13] P. Srinivasan, Q. Zhu, and J. Shi. Many-to-one contour matching for describing and discriminating object shape. In CVPR, pages 1673-1680, 2010.
[14] Q. Zhu, L. Wang, Y. Wu, and J. Shi. Contour context selection for object detection: A set-to-set contour matching approach. In ECCV, pages 774-787, 2008.