Region annotations in hashing based image retrieval
Cheng Guo†‡, Luis Herranz†, Lifang Wu‡, Shuqiang Jiang†
† Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS, Beijing, 100190, P.R.China
‡ School of Electronic Information and Control Engineering, Beijing University of Technology, Beijing, China
[email protected], [email protected], [email protected], [email protected]
ABSTRACT
In recent years, large-scale visual data has become increasingly present in our daily lives, requiring better, faster, and more effective ways to separate unrelated content from what we are truly interested in. In particular, retrieving similar images is a fundamental problem in both image processing and computer vision. Since exhaustively comparing high-dimensional image features is not feasible in practice, approximate hashing techniques have proven very effective due to their great efficiency and reasonable accuracy. Current hashing methods focus on the image as a whole. However, the relevant content is sometimes limited to a region of interest, while the background is of little use to the user. In this paper, we propose to exploit region-level annotations, whenever they are available, both for training supervised hashing methods and in the query image. Our objective is to evaluate whether region-level features can be helpful in a supervised hashing scenario. Experimental results confirm that region-level annotations in both the query and training images increase the accuracy.
Categories and Subject Descriptors H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing.
General Terms Algorithms, Experimentation.
Keywords Region-based hashing, supervised hashing, image retrieval.
1. INTRODUCTION
Accurate image retrieval usually requires high-dimensional features to represent image information, from hundreds to thousands of dimensions. Computing similarities between such features with conventional Euclidean or cosine distances therefore becomes too slow at scale. Hashing methods have been applied to image retrieval, significantly reducing computation time while keeping good retrieval performance. The main idea in image hashing is to learn a binary code to represent the image
information, in such a way that similar images share the same (or very similar) codes. Since hashing methods produce binary codes, similar images can be found by computing the Hamming distance between the code of the query image and the codes of the images in the database. This costs far less time than exhaustively comparing the query image with the whole dataset in the original high-dimensional space, as hash codes are usually much shorter than feature vectors and Hamming distances are much cheaper to compute than distances in the feature space.

Many binary embedding techniques have been proposed [2][5]. Unsupervised methods approximate similarity in the original feature space. Locality-Sensitive Hashing (LSH) [1][7] uses projections on random hyperplanes to produce binary codes that approximate lp metrics and find nearest neighbors in sub-linear time. Recently, Liu et al. [3] proposed a unified framework for bit selection, addressing the problem for multiple-feature hashing, multiple hashing methods, and multiple-bit hashing. Supervised hashing methods also consider label information in order to retrieve semantically, rather than merely visually, similar images. Supervised Hashing with Kernels (KSH) [4] proposed a flexible approach by incorporating kernels in the hash functions, and it can be applied to both semantic and metric-based similarity hashing.

In some situations, full images may include information that is not needed for retrieval, which can be regarded as redundancy. This redundancy may disturb the retrieval results by bringing up wrong images; to some extent it is noise for users and also noisy data for learning algorithms. To reduce these noisy results, we consider leveraging region annotations in hashing, to see whether region information can better represent the important content of the image. In this paper, we evaluate using region annotations in the training process and check whether the performance is better than using full-image information. We evaluate both supervised (KSH) [4] and unsupervised (LSH) [7] hashing in a single-object scenario. We compare three cases. First, we consider the conventional image-image (I-I) hashing framework, where full images are used as query and training samples. The object-object (O-O) framework exploits region annotations in both query and training images. Finally, we consider the image-object (I-O) case, in which region annotations are only available for training, but not for the query image.

The rest of the paper is organized as follows. Section 2 introduces the proposed framework for region-based hashing. Section 3 presents the experimental evaluation, and the last section draws the conclusions.
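To make the cost argument concrete, the following minimal Python sketch (our own illustration, not code from the paper; all names are hypothetical) ranks packed binary codes by Hamming distance using a single XOR and popcount per comparison:

```python
# Minimal sketch: comparing binary hash codes with the Hamming distance.
# Codes are packed into integers, so comparing two codes takes one XOR
# and a popcount instead of a high-dimensional floating-point distance.

def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two packed binary codes."""
    return bin(code_a ^ code_b).count("1")

# Toy usage: an 8-bit query code against a small database of codes.
query = 0b10110010
database = [0b10110011, 0b01001101, 0b10100010]
ranked = sorted(database, key=lambda c: hamming_distance(query, c))
print([f"{c:08b}" for c in ranked])  # nearest codes first
```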
2. REGION-BASED HASHING
The typical hashing-based image retrieval framework has three stages. First, a subset of the database is used to learn suitable hashing functions (typically one function per bit), given a similarity measure (unsupervised) or labels (supervised). For instance, the Hamming distance using the LSH family h(x) = sign(ω^T x) approximates the arccos distance [7] in the feature space of x. Second, the images in the dataset are indexed using the set of learned hashing functions: each image is represented with a binary code and indexed using suitable structures for efficient search. Finally, during retrieval, the binary code of the query image is obtained, and images with the same code (or within a low Hamming distance) are retrieved.

The ultimate objective in image retrieval is to retrieve the images most similar to a query. In unsupervised hashing methods, no additional information is available; the objective is to efficiently approximate visual similarity in a certain feature space under a certain distance or similarity measure. Thus, given a dataset, unsupervised methods try to find the inner relationships or hidden structure that separate similar from dissimilar data, and then encode similar data with the same codes. However, visual similarity does not always reflect semantic similarity, which is usually more desirable. Thus, when category labels are available, it is possible to exploit them to learn, in a (semi-)supervised way, hashing functions that better approximate semantic similarity. Typically, semantic similarity is encoded in a pairwise similarity matrix where the similarity sim(i,j) of two training images i and j is 1 (-1) when they have the same (different) labels [2][4].

In all these methods, full images are still the basic unit for indexing and retrieval. However, users are often only interested in certain objects or parts of the query image, while the background is not really relevant. In some cases, this background may even add noise during training, resulting in retrieved images that are not related to the user's interest. Thus, separating relevant and irrelevant parts may help the system model only those aspects that matter.
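As a rough illustration of the two ingredients just described, the sketch below (with assumed shapes and toy data; `lsh_encode` and `pairwise_similarity` are hypothetical helper names) implements random-hyperplane LSH encoding, h(x) = sign(ω^T x), and the pairwise label-similarity matrix used by supervised methods:

```python
import numpy as np

# Sketch of (1) random-hyperplane LSH, one hyperplane per bit, and
# (2) the pairwise label-similarity matrix used by supervised hashing.

rng = np.random.default_rng(0)

def lsh_encode(X: np.ndarray, n_bits: int) -> np.ndarray:
    """Binary codes from random hyperplanes; X is (n_samples, dim)."""
    W = rng.standard_normal((X.shape[1], n_bits))  # one column per bit
    return (X @ W > 0).astype(np.uint8)            # sign(), mapped to {0,1}

def pairwise_similarity(labels: np.ndarray) -> np.ndarray:
    """S[i, j] = 1 if images i and j share a label, else -1."""
    same = labels[:, None] == labels[None, :]
    return np.where(same, 1, -1)

X = rng.standard_normal((6, 128))   # toy features
y = np.array([0, 0, 1, 1, 2, 2])    # toy class labels
codes = lsh_encode(X, n_bits=16)    # (6, 16) binary codes
S = pairwise_similarity(y)          # (6, 6) supervision matrix
```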
Figure 1. Region-based hashing (indexing).
Sometimes, information about the location of objects or regions is available, either from manual user-contributed annotations or automatically generated by object-specific detectors (e.g., faces). Based on this assumption, in the proposed framework (see Figures 1 and 2) we consider regions, instead of full images, as the basic indexing units. In this work we assume that each image has a region of interest, and that the region of interest also has an associated label. During the indexing stage (see Figure 1), images are cropped to the corresponding region of interest and a feature descriptor is computed. The descriptor is then transformed into a binary code using hashing. Appropriate hashing functions are learned using supervised hashing with the region labels (if available). During retrieval (see Figure 2), the same process is followed: region extraction followed by region hashing. Then the images with the same code (or within a small Hamming distance) are retrieved.
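The indexing stage of Figure 1 could be sketched as follows; this is a hypothetical illustration, with `extract_feature` standing in for any real descriptor (such as GIST or the SIFT bag of words used later) and `index_image` a made-up helper:

```python
import numpy as np
from PIL import Image

# Hypothetical sketch of the indexing stage in Figure 1: crop each image
# to its annotated region of interest, describe the crop, and hash it.

def extract_feature(region: Image.Image) -> np.ndarray:
    # Placeholder descriptor: a tiny grayscale thumbnail, flattened.
    gray = region.convert("L").resize((8, 8))
    return np.asarray(gray, dtype=np.float32).ravel()

def index_image(path: str, bbox: tuple, W: np.ndarray) -> np.ndarray:
    """bbox = (left, upper, right, lower); W holds one hyperplane per bit."""
    region = Image.open(path).crop(bbox)    # the region, not the full image
    x = extract_feature(region)
    return (x @ W > 0).astype(np.uint8)     # binary code for the index
```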
Table 1. Performance results (MAP) for SIFT features (500 words).

Method  Type  8 bits  16 bits  32 bits  48 bits  64 bits
KSH     I-I   0.2214  0.2238   0.2330   0.2479   0.2617
KSH     O-O   0.2541  0.2571   0.3016   0.2946   0.3112
KSH     I-O   0.2119  0.2358   0.2608   0.2563   0.2728
LSH     I-I   0.1297  0.1379   0.1609   0.1540   0.1616
LSH     O-O   0.1360  0.1575   0.1734   0.1580   0.1895
LSH     I-O   0.1167  0.1489   0.1482   0.1491   0.1528
3. EXPERIMENTS
3.1 Dataset
We evaluated our approach using the dataset released for the Pascal Visual Object Classes (VOC) 2012 challenge. The dataset contains images with 20 classes of objects, annotated with bounding boxes and labels. We use only the images in the training and validation sets, as the annotations for the test set are not available. In total, the dataset has 11530 images with 27450 annotated objects. In this paper we focus on images with only one object, and use this object as the region of interest for region-based hashing. Thus, we discard images with more than one object. We also discard images with objects tagged as 'difficult' in the dataset, which usually indicates that the object is too small. This results in 4956 images with one object. We further filter the dataset, keeping only those classes with a minimum of 100 images per class. This results in 15 classes and 4583 images in total.
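For illustration, this filtering might be implemented roughly as below, assuming the standard VOC directory layout (Annotations/*.xml); `single_object_images` is a hypothetical helper, not code from the paper:

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

# Sketch of the dataset filtering: keep images with exactly one object,
# drop objects flagged 'difficult', then keep classes with >= 100 images.

def single_object_images(ann_dir: str) -> dict:
    kept = {}  # image id -> class name
    for xml_file in Path(ann_dir).glob("*.xml"):
        objs = ET.parse(xml_file).findall("object")
        if len(objs) != 1:
            continue                              # multi-object image
        obj = objs[0]
        if obj.findtext("difficult", "0") == "1":
            continue                              # 'difficult' object
        kept[xml_file.stem] = obj.findtext("name")
    counts = Counter(kept.values())
    return {k: v for k, v in kept.items() if counts[v] >= 100}
```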
Figure 2. Region-based hashing (retrieval).
3.2 Settings
We evaluated our framework using two commonly used features: 512-dimensional GIST features, and bag-of-words representations. In the latter we used SIFT as local descriptors and locality-constrained linear coding (LLC) [6] for feature encoding, based on a dictionary with 500 words. We also included a spatial pyramid in the representation, with two levels (1 and 3x3 regions), for a combined dimensionality of 5000. We considered two hashing methods: LSH [7] (unsupervised) and KSH [4] (supervised). LSH uses random hyperplanes, while KSH uses kernels in the hashing functions, computed over a set of m fixed samples selected uniformly from the database. Following [4], we set m to 300 and use a Gaussian RBF kernel. We randomly selected 1000 samples for training. For testing, 20 images of each class were selected as the test set, so in total we use 300 images as queries and the remaining images as the retrieval database.
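The kernelized representation used by KSH can be sketched as follows (our own illustration; the bandwidth `sigma` and the toy shapes are assumptions): each sample is mapped to its Gaussian RBF similarities to m = 300 anchors drawn uniformly from the database, and the 5000-dimensional input matches the (1 + 3x3) x 500 spatial-pyramid dimensionality above.

```python
import numpy as np

# Sketch of the kernel features used by KSH under our settings:
# each sample becomes a vector of RBF similarities to m = 300 anchors.

def rbf_kernel_map(X, anchors, sigma=1.0):
    """Map (n, d) features to (n, m) Gaussian RBF similarities."""
    d2 = (
        (X ** 2).sum(1)[:, None]
        + (anchors ** 2).sum(1)[None, :]
        - 2.0 * X @ anchors.T
    )
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
database = rng.standard_normal((1000, 5000))  # (1 + 3*3) * 500 = 5000 dims
anchors = database[rng.choice(len(database), size=300, replace=False)]
K = rbf_kernel_map(database[:10], anchors)    # (10, 300) kernel features
```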
We compared three variants, depending on whether images or objects are used for the query and for indexing:
Image-image (I-I): full images are used for both query and indexing (and training). This is the conventional hashing framework, used as a baseline.
Object-object (O-O): object annotations are exploited in both queries and indexed images. The images in the dataset are cropped to the annotated objects, from which features are extracted.
Image-object (I-O): object annotations may be available for the images in the database, but not necessarily for query images, so we also consider this case.
3.3 Results
Table 1 shows the Mean Average Precision (MAP) computed over 300 neighbors for SIFT features. As expected, the KSH variants perform significantly better than their LSH counterparts, confirming that supervision leads to much better performance. Comparing the purely image-level (I-I) and object-level (O-O) methods, we found that the latter performs better. This suggests that exploiting region-level annotations can also be helpful in hashing, with a gain of around 3-5% over the conventional image-level version in the case of KSH. The O-O variant can be seen as the ideal case, in which both the query and the database have proper annotations. However, even in the more realistic case in which the query image has no objects detected (thus using the full image to extract the hash code), using objects for indexing and training still slightly improves the performance. In the case of unsupervised LSH, object-level annotations improve the performance, but not as much as in the supervised case.

Figure 3 shows the precision and recall at different numbers of top-ranked neighbors in the case of 48 bits. These plots and the MAP in Figure 5a show similar trends. In Figure 3a, we can see that for few neighbors, the precision of I-O KSH is close to that of O-O KSH.
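For reference, the evaluation protocol can be sketched as follows (a hypothetical implementation, assuming an AP@k variant normalized by the number of relevant items in the top k): rank the database by Hamming distance and average the precision at each relevant position over the top 300 neighbors.

```python
import numpy as np

# Sketch of the evaluation: rank the database by Hamming distance and
# compute Average Precision over the top 300 neighbors, then average
# over all queries to obtain the MAP reported in Table 1.

def average_precision(relevant: np.ndarray, k: int = 300) -> float:
    """relevant: boolean array over the ranked neighbors."""
    rel = relevant[:k]
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)
    precision_at_i = hits / (np.arange(len(rel)) + 1)
    return float((precision_at_i * rel).sum() / rel.sum())

def mean_average_precision(q_codes, db_codes, q_labels, db_labels, k=300):
    aps = []
    for code, label in zip(q_codes, q_labels):
        dist = np.count_nonzero(db_codes != code, axis=1)  # Hamming
        order = np.argsort(dist, kind="stable")
        aps.append(average_precision(db_labels[order] == label, k))
    return float(np.mean(aps))
```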
Figure 3. Performance for SIFT features (500 words, 48 bits): (a) precision, (b) recall, (c) F-score.
As in the case of local features such as the SIFT-based bag of words, global features such as GIST can also benefit from object-level hashing (see Figure 5b), with even better MAP than SIFT in the case of O-O KSH. However, GIST is a global descriptor that models the overall spatial layout of the scene, so cropping the image to one object significantly changes the descriptor. As a result, in the I-O variant, if the object is not sufficiently large, the query feature models the layout of the whole scene rather than the object, while the hashing functions have been learned from properly cropped object regions. This mismatch results in lower performance than both O-O and I-I. In contrast, the bag-of-words representation over SIFT is more robust, as it is computed over local features (see Figure 5a).
As the results with GIST suggest, encoding some spatial layout is also useful to improve the performance. Thus, we also included a two-level spatial pyramid in the SIFT representation. Figure 5c shows that it outperforms both SIFT without the spatial pyramid and GIST in the case of O-O KSH. However, it loses some of the robustness of plain SIFT: with the spatial pyramid, I-O and I-I have almost the same performance.

Figure 5. Mean Average Precision: (a) SIFT (500 words), (b) GIST, (c) SIFT (500 words) with two-level spatial pyramid.
An example query with the retrieved results for each variant is shown in Figure 4. The O-O variant returns more correct images than I-O and I-I. Correct results with I-O also tend to contain large objects, which makes their descriptors not so different from those of the corresponding full images.
Figure 4. Images retrieved using object-level annotations.
4. CONCLUSIONS
We have shown that region-level annotations, when available, can be exploited to improve retrieval performance by focusing the learning of the hashing functions on the relevant parts of the images. Both global and local features benefit from this information, with local representations such as SIFT being more robust to the absence of region-level annotations in the query image.
5. ACKNOWLEDGMENTS
This work was supported in part by the National Basic Research Program of China (973 Program): 2012CB316400, in part by the National Natural Science Foundation of China: 61322212 and 61350110237, in part by the Key Technologies R&D Program of China: 2012BAH18B02, in part by the National Hi-Tech Development Program (863 Program) of China: 2014AA015202, in part by the Chinese Academy of Sciences Fellowships for Young International Scientists: 2011Y1GB05, and in part by the National Science Fund of China under grant No. 61040052.
6. REFERENCES
[1] M. Slaney and M. Casey. Locality-Sensitive Hashing for Finding Nearest Neighbors. IEEE Signal Processing Magazine, 25(2), March 2008.
[2] K. Grauman and R. Fergus. Learning Binary Hash Codes for Large-Scale Image Search. In Machine Learning for Computer Vision, Studies in Computational Intelligence, vol. 411, pp. 49-87, 2013.
[3] X. Liu, J. He, B. Lang, and S.-F. Chang. Hash Bit Selection: A Unified Solution for Selection Problems in Hashing. CVPR, 2013.
[4] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised Hashing with Kernels. CVPR, 2012.
[5] K. Grauman. Efficiently Searching for Similar Images. Communications of the ACM, 53(6):84-94, June 2010.
[6] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification. CVPR, 2009.
[7] M. S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. STOC, 2002.