Benchmarking Large-Scale Fine-Grained Categorization

Anelia Angelova∗                         Philip M. Long∗
Google Inc.                              Microsoft
[email protected]                        [email protected]

Abstract

This paper presents a systematic evaluation of recent methods in the fine-grained categorization domain, which have shown significant promise. More specifically, we investigate an automatic segmentation algorithm, a region pooling algorithm which is akin to pose-normalized pooling [31] [28], and a multi-class optimization method. We considered the largest and most popular datasets for fine-grained categorization available in the field: the Caltech-UCSD 200 Birds dataset [27], the Oxford 102 Flowers dataset [19], the Stanford 120 Dogs dataset [16], and the Oxford 37 Cats and Dogs dataset [21]. We view this work from a practitioner's perspective, answering the question: what are the methods that can create the best possible fine-grained recognition system which can be applied in practice? Our experiments provide insights into the relative merits of these methods. More importantly, after combining the methods, we achieve the top results in the field, outperforming the state-of-the-art methods by 4.8% and 10.3% for the birds and dogs datasets, respectively. Additionally, our method achieves a mAP of 37.92 on the 2012 ImageNet Fine-Grained Categorization Challenge [1], which outperforms the winner of this challenge by 5.7 points.
1. Introduction

Fine-grained categorization [3, 11, 27] addresses classification into categories which belong to the same base-level category, for example, classification of different species of birds or flowers. It has been a topic of intense research recently, with a variety of methods proposed, e.g. [2, 4, 5, 13, 20, 21, 25, 28, 30, 29, 31]. Fine-grained classification is an important problem to address algorithmically because it is, in general, very hard and requires expert knowledge. For example, very few people in the general population can correctly name the species of an arbitrary bird or flower (Figure 1), or make the fine distinctions between categories which are very similar to one another (Figure 2). In contrast, general-object recognition, e.g. naming whether an object is a piano or a chair, although hard algorithmically, is rather easy for humans, and can be solved relatively inexpensively in practice by a human-in-the-loop approach [3]. Thus, automatic systems for fine-grained categorization can provide much value in practice.

∗ This work was done while both authors were at NEC Labs America.

Figure 1. Can you name these common flowers and birds? Top row: the flower of a tomato plant (left) and of an avocado tree (right); bottom row: nighthawk (left) and mockingbird (right).

This paper evaluates three prominent directions in fine-grained classification: 1) Segmentation, which has consistently been shown to improve recognition performance [2, 4, 19]; we used the method of [2] as it is efficient and applicable to various super-categories. 2) Region pooling, which is a simpler version of pose-normalized pooling [31] and template-based pooling [28]. Region pooling aims to align regions which correspond to common parts and to pool per region; we implemented the algorithm of Yang et al. [28], which selects the regions in an unsupervised manner. 3) Direct optimization of the multi-class cost function [7, 13] to substitute the predominantly used one-vs-all strategy for multi-class classification; we use the loss of Crammer and Singer [7]. These three methods have been shown to be among the most promising for fine-grained categorization [2, 4, 13, 28, 31]. Furthermore, our choice is guided by these methods' generality: they are applicable irrespective of the specific super-category.

Our goal is to understand which set of classification algorithms and preprocessing techniques are most advantageous for fine-grained classification. We evaluate the above-mentioned methods on the largest and most commonly used fine-grained classification datasets in the field. We cover a variety of tasks in fine-grained categorization: recognition of species of flowers, cats, dogs, and birds (Figure 3). More specifically, in our study we have included the Caltech-UCSD 200 Birds dataset [27], the Oxford 102 Flowers dataset [19], the Stanford 120 Dogs dataset [16], and the Oxford Cats and Dogs dataset [21]. We aim to understand how popular methods for classification compare on various datasets and to obtain further insights regarding generalization, scalability, etc. This work can serve as a reference for practitioners who are interested in which methods to apply given their specific performance and/or computational requirements. We are mostly interested in which components contribute most to improved performance and how much benefit to expect from each. Similar work on evaluating the performance of encoding methods [6] has been an inspiration to us. Our work is also related to [22], which focused on evaluating various learning techniques for general-object recognition.

This is the first comprehensive study that considers the most contemporary and promising methods for fine-grained classification and evaluates them on the largest datasets in this problem domain [16, 19, 21, 27]. In addition, our results clearly point to the best approaches in the context of fine-grained categorization. We propose a pipeline that accomplishes the best results for fine-grained classification to date, improving by as much as 4.8%-10.3% on the state-of-the-art results for two of the four datasets. Additionally, our method achieves a mean average precision of 37.92 (as measured by the official evaluation server) on the 2012 ImageNet Fine-Grained Categorization Challenge, which is 5.7 points better than the winner of this challenge [1]. We further point out that these algorithms, although generally known, have not been systematically evaluated or used on diverse datasets for fine-grained recognition, with some exceptions [4, 5].

Figure 2. Images from four different flower species that are similar in appearance: common daisy, ox-eye daisy, chamomile and shasta daisy. Expert knowledge is needed to make such fine distinctions.

2. Previous work

Fine-grained classification tasks can be addressed as general multi-class classification problems using generic approaches [10, 13, 14, 22]. These directions, although not specific to fine-grained categorization, are an important area of research with potentially large impact on this domain. Recently, methods specialized to the fine-grained classification domain have emerged; e.g. Farrell et al. [11] proposed to learn volumetric primitives, called 'birdlets', for fine-grained recognition of birds.

In recent works there have been two major directions for fine-grained categorization: segmentation [2, 4, 19] and pose-normalized pooling [31]. Segmentation has played a major role in many computer vision applications, and has naturally been shown to be successful in fine-grained recognition as well [2, 4, 19, 21]. Chai et al. [4, 5] proposed a set of co-segmentation methods which have been demonstrated on a variety of datasets, including birds, dogs and flowers. A recent method [2] utilizes fast Laplacian propagation and does not depend on the type of super-category at hand. Pose-normalized pooling [31] is another prominent direction which, instead of trying to segment the object, tries to find an alignment between common parts across different sub-categories. The work of Yang et al. [28] uses a similar idea but proposes a simpler approach which, most notably, does not require annotation of object parts or regions.

Other methods for fine-grained classification have also been shown to be successful [13, 25, 29], but are beyond the scope of our study. Additionally, some methods explore attributes for fine-grained classification [9] or utilize humans in the loop [3, 25] for improved categorization. Classification of web-page content, online ads or products, etc. [24], can also be formulated as a fine-grained classification task and has been a major topic of research outside the computer vision domain.

Finally, our paper is related to the works of Chatfield et al. [6], Perronnin et al. [22] and others [13] who perform comparative evaluation of multiple methods for recognition.
3. Methods

This section describes the main methods we have examined in our study. We note that we provided our own implementations of the examined methods, so as to obtain comparable performance on the same base-level feature representation. For reference, we have included the reported performances of the original publications and all prior methods on each dataset known to us.
3.1. Baseline algorithm

Feature representation. We apply a feature extraction and classification pipeline which is very similar to the one of Lin et al. [17]. In our feature extraction pipeline we first extract HOG [8] features at four different levels; those features are then encoded against a global feature dictionary using the Locality-constrained Linear Coding (LLC) method [26] as our base feature representation. All other methods below are adapted to use LLC as well. After that, a global max pooling of the encoded features over the image is done, as well as max poolings over a 3 by 3 grid of the image. Other feature encodings can also be successfully used; see the excellent survey by Chatfield et al. [6]. Note that LLC is among the best in terms of performance [6].

Learning algorithm. For our baseline, we use the one-vs-all strategy for classification with linear SVM decision functions. The one-vs-all strategy is the most common one for multi-class classification. We used the Liblinear SVM implementation [10].
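The pooling step above (one global max pooling plus a 3x3 grid of max poolings over the encoded features) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name and the assumption that patch centers come normalized to [0, 1) are ours.

```python
import numpy as np

def spatial_max_pool(codes, xy, grid=3):
    """Max-pool encoded local features globally and over a grid x grid
    partition of the image, then concatenate (1 + grid*grid) poolings.

    codes: (n, d) array of LLC-encoded descriptors (one row per patch)
    xy:    (n, 2) array of patch centers, normalized to [0, 1)
    """
    n, d = codes.shape
    pooled = [codes.max(axis=0)]                 # global max pooling
    cell = np.clip(np.floor(xy * grid).astype(int), 0, grid - 1)
    for gy in range(grid):
        for gx in range(grid):
            mask = (cell[:, 0] == gx) & (cell[:, 1] == gy)
            if mask.any():
                pooled.append(codes[mask].max(axis=0))
            else:
                pooled.append(np.zeros(d))       # empty grid cell
    return np.concatenate(pooled)                # (1 + grid**2) * d dims

# toy usage: 50 encoded patches of dimension 8
codes = np.random.rand(50, 8)
xy = np.random.rand(50, 2)
feat = spatial_max_pool(codes, xy)
```

The resulting representation has (1 + 9) * d dimensions, which is why the baseline representation is roughly 10 times larger than a single global pooling; this matters for the feature-combination weighting discussed in Section 3.5.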
3.2. Multi-class classification strategy

We start by considering the learning algorithm for multi-class classification and explore how it can be improved. As mentioned, the one-vs-all strategy for arbitrating the decisions among multiple linear classifiers has been widely used and is the de facto standard in the field. However, it has been pointed out that this strategy is suboptimal [13, 18]. The problem is exacerbated for a large number of classes.

3.2.1 Classification with Crammer-Singer loss

We here investigate one alternative to the one-vs-all strategy for classification. We use the Crammer-Singer (CS) loss [7], which is especially designed for multi-class classification. Given an example x with label y and a set of K linear classifiers h_j(x), the Crammer-Singer loss is as follows:

L(y, h_1(x), ..., h_K(x)) = max{0, 1 - h_y(x) + max_{z != y} h_z(x)}.
In other words, if the scores of all other classes are smaller than h_y(x) by at least 1, then the loss is 0. Otherwise, the loss is the maximum amount by which this constraint is violated.
Optimizing the Crammer-Singer loss, although intended for addressing multi-class classification problems typical for fine-grained classification, has achieved mixed results on related large-scale evaluations [13], which makes it a good candidate for systematic evaluation here. We used the Liblinear [10] implementation of the algorithm of Keerthi et al [14], but also coded a stochastic version for comparison.
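For concreteness, the Crammer-Singer loss for a single example can be computed directly from the definition above. This small sketch is ours (the function name is illustrative); the actual optimization in the paper uses the Liblinear solver [10, 14].

```python
import numpy as np

def crammer_singer_loss(scores, y):
    """Multi-class hinge loss of Crammer and Singer [7].

    scores: (K,) array with the scores h_j(x) for each of K classes
    y:      index of the true class
    """
    # margin of the true class over the best competing class
    margin = scores[y] - np.max(np.delete(scores, y))
    return max(0.0, 1.0 - margin)

# zero loss: the true class wins by at least 1
print(crammer_singer_loss(np.array([2.0, 0.5, -1.0]), 0))  # 0.0
# otherwise: the amount by which the margin falls short of 1
print(crammer_singer_loss(np.array([0.2, 0.5, -1.0]), 0))  # 1.3
```

Unlike the sum of K one-vs-all hinge losses, this loss penalizes only the single strongest competitor, which couples the K classifiers into one joint objective.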
3.3. Segmentation

Segmentation has been used in many computer vision applications of vast practical importance. Naturally, segmentation can be useful for fine-grained classification and has been applied to flowers [19], birds, dogs [4] and others. Also of note is that segmentation may help in some datasets, classes, or examples, but not others; e.g. for some flowers, the leaves or stems may provide useful disambiguation between similar classes. This warranted our evaluation on diverse datasets.

We selected the segmentation algorithm of [2] because it is simple, applies to a variety of types of categories, and is relatively fast. The segmentation algorithm has two main components. First, a super-pixelization of the image is done [12] and a model is learnt of which superpixels likely belong to the super-class, e.g. any flower, or any bird [2]. Second, using the high-confidence regions as initialization, Laplacian propagation [32] is performed to obtain the final foreground/background region.

Assuming the pixel labels are X and the initial regions have labels Y, the goal of the segmentation task is to find the label X_j for each pixel I_j, where X_j = 1 when the pixel belongs to the object (foreground) and X_j = 0 otherwise. The following cost function is minimized with respect to all pixel labels X:

min_X (1/2) X^T (I - S) X + (lambda/2) |X - Y|^2

where S = D^{-1/2} W D^{-1/2}, D_ii = sum_{j=1}^N W_ij, W is the affinity matrix computed over 8-connected pixel neighbourhoods, W_ij = exp(-|f_i - f_j|^2 / (2 sigma^2)), and f_i is the color feature representation of each pixel. The above-mentioned optimization is solved by the following system of linear equations:

((1 + lambda) I - S) X = lambda Y.

After the image is segmented, a global pooling over LLC-encoded features from the segmented image is computed. The new feature representation is concatenated to the existing baseline representation, as in [2]. The parameter lambda is set to 0.0001, as recommended by Zhou et al. [32].
We selected sigma so that 1/sigma^2 = 300, but other values within a range that allows the propagation to occur are viable. Both parameters are fixed across all experiments.
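The Laplacian propagation step reduces to one sparse linear solve. Below is a minimal dense NumPy sketch of that solve, assuming per-pixel (or per-superpixel) color features and an explicit neighbor list; the function name and interface are ours, not the implementation of [2] or [32].

```python
import numpy as np

def laplacian_propagation(feats, edges, y, lam=1e-4, sigma2=1.0 / 300):
    """Solve ((1 + lam) I - S) X = lam * Y for soft foreground labels.

    feats: (n, c) color feature f_i per pixel (or superpixel)
    edges: list of (i, j) neighbor pairs (8-connected in the paper)
    y:     (n,) initial labels from the high-confidence regions
    """
    n = feats.shape[0]
    W = np.zeros((n, n))
    for i, j in edges:
        # affinity W_ij = exp(-|f_i - f_j|^2 / (2 sigma^2))
        w = np.exp(-np.sum((feats[i] - feats[j]) ** 2) / (2 * sigma2))
        W[i, j] = W[j, i] = w
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # D^-1/2 W D^-1/2
    A = (1 + lam) * np.eye(n) - S
    return np.linalg.solve(A, lam * y)  # threshold for a binary mask

# toy usage: two similar pixels (node 0 seeded as foreground), one distinct
feats = np.array([[0.00], [0.05], [0.90]])
x = laplacian_propagation(feats, [(0, 1), (1, 2)], np.array([1.0, 0.0, 0.0]))
```

A real implementation would build W as a sparse matrix over the 8-connected pixel grid and use a sparse solver; the dense version here only illustrates the math.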
Figure 3. Images from the fine-grained categorization datasets considered: Caltech-UCSD 200 birds dataset (top left), Oxford 37 cats and dogs dataset (top right), Oxford 102 flowers dataset (bottom left) and Stanford 120 dogs dataset (bottom right). One example image per class is shown (flowers and cats and dogs datasets show fewer classes).
3.4. Unsupervised region pooling

Pose-normalized pooling has been introduced recently [31] and has shown significant promise in advancing fine-grained classification. Since the original work [31] required manual annotation of object parts, which may not be readily available for all datasets and could be costly, especially for the larger datasets, we considered the algorithm of Yang et al. [28], which selects the features in an unsupervised manner. We adapted this algorithm to work with the LLC-encoded feature representation.

The algorithm proceeds by learning a set of regions (templates) T_1, ..., T_K that are common across classes. Introducing auxiliary variables v_i^I, i = 1...K, indicating whether region i is found in image I, and l_i^I, i = 1...K, denoting the locations where the regions are found, we maximize:

max_{T,v,l} sum_I { sum_{i=1}^K v_i^I (1 - ||T_i - R(I, l_i^I)||) - sum_{i=1}^K sum_{j=1}^K v_i^I v_j^I d(l_i^I, l_j^I) }

where R(I, l_i^I) denotes the region of image I at location l_i^I, and d(l_i^I, l_j^I) is a penalty to limit overlap between regions: d(l_i^I, l_j^I) = infinity if the area of overlap between candidate patches at locations l_i^I and l_j^I is larger than a threshold, and 0 otherwise.

One notable difference in our implementation is that we relaxed the constraints on the features' locations, allowing features to be found anywhere in the image. We also did not use the co-occurrence criterion of [28]. In addition, to accommodate our high-dimensional baseline feature space, we limited the number of pooled regions to 10 per dataset, collected at a single scale, instead of 34, collected at two scales, as suggested by [28]. We applied this algorithm both to datasets that have bounding box information and to those that do not.
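The full objective is optimized jointly over templates, indicators and locations; a simplified greedy sketch of the per-image step, with the templates held fixed and the infinite overlap penalty realized as a hard constraint, might look as follows. This is our illustrative reading, not the algorithm of [28]; the function name, the IoU-based overlap test, and the positive-score acceptance rule are assumptions.

```python
import numpy as np

def match_templates(templates, patches, locs, overlap_thresh=0.3):
    """For each template T_i, pick the best-matching candidate patch,
    skipping candidates that overlap an already-selected region.

    templates: (K, d) region templates T_i
    patches:   (n, d) candidate patch descriptors R(I, l) from one image
    locs:      (n, 4) candidate boxes (x0, y0, x1, y1)
    Returns (v, l): per-template indicator and chosen patch index (-1 if none).
    """
    def iou(a, b):
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    K = len(templates)
    chosen, v, l = [], np.zeros(K, bool), -np.ones(K, int)
    for i, t in enumerate(templates):
        scores = 1.0 - np.linalg.norm(patches - t, axis=1)  # 1 - ||T_i - R||
        for j in np.argsort(-scores):       # best candidates first
            if all(iou(locs[j], locs[c]) <= overlap_thresh for c in chosen):
                if scores[j] > 0:           # keep only positive matches
                    v[i], l[i] = True, j
                    chosen.append(j)
                break                       # d = infinity: never pick overlaps
    return v, l

# toy usage: the first candidate exactly matches the single template
templates = np.array([[1.0, 0.0]])
patches = np.array([[1.0, 0.0], [0.0, 1.0]])
locs = np.array([[0, 0, 1, 1], [2, 2, 3, 3]])
v, l = match_templates(templates, patches, locs)
```

Once the matched regions are found, features are max-pooled within each region and the per-region poolings are concatenated, analogously to the grid pooling of the baseline.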
3.5. Implementation details

This section describes specific implementation details. In our experiments, all images are resized to 500 pixels at
the larger side. When learning the regions for region pooling, we followed the recommendations of [28] and selected image patches which were approximately 1/3 of the image size. Since we work with fixed 32x32 patches, all images are resized to an area of approximately 100x100. To train the segmentation model we used the ground truth segmentations provided with the datasets [19, 21, 27]. Since the dogs dataset has no segmentation information, we used as pseudo ground truth the inner rectangle which takes 0.5 of the image area.

To combine the methods, we concatenate the feature representations of the specified methods, including the baseline. Each individual feature representation is L2 normalized and scaled by 0.1 before concatenation, since the base feature representation is 10 times larger. In order to tune the regularization parameters for both the one-vs-all SVM and the Crammer-Singer loss optimization, we did cross-validation on one dataset (cats and dogs) and applied the selected regularization to all other datasets. This is done to keep the parameters uniform across datasets. For the stochastic version of Crammer-Singer loss optimization, we set the number of iterations to 50 for all settings. Training time was not an issue for these datasets for either of the optimization algorithms, but may differ significantly for much larger datasets. For large-scale datasets, only the stochastic versions can be used.
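The feature-combination step described above is simple enough to sketch directly. This is one plausible reading of the text (the baseline is left at unit norm while each additional representation is L2-normalized and down-weighted by 0.1); the function name is ours.

```python
import numpy as np

def combine_representations(base, extras, scale=0.1):
    """Concatenate the baseline representation with method-specific
    features (e.g. segmentation pooling, region pooling), each
    L2-normalized and scaled down since the baseline is ~10x larger."""
    def l2norm(v):
        n = np.linalg.norm(v)
        return v / n if n > 0 else v
    parts = [l2norm(base)] + [scale * l2norm(e) for e in extras]
    return np.concatenate(parts)

# toy usage: a 1000-d baseline plus two 100-d method-specific features
base = np.random.rand(1000)
seg_feat, pool_feat = np.random.rand(100), np.random.rand(100)
x = combine_representations(base, [seg_feat, pool_feat])
```

The scaling keeps the large baseline block from dominating the linear SVM purely because of its dimensionality, while the per-block L2 normalization makes the 0.1 weight meaningful across methods.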
4. Experiments

In our experiments we used the standard setups established in prior works, e.g. [19] for flowers, [27] for birds, [16] for dogs and [21] for cats and dogs. We report results of the considered methods of segmentation, region pooling and optimization with the Crammer-Singer loss. We also include the state-of-the-art results in the literature for these datasets.
4.1. Datasets

We consider the following datasets, commonly used for evaluating fine-grained classification algorithms (Figure 3):

Oxford 102 Flowers. This dataset contains 102 different flower species and 8,189 images [19]. It has been collected by Nilsback and Zisserman and is one of the first large-scale fine-grained datasets containing more than 100 classes.

Caltech-UCSD 200 Birds. This dataset contains 200 different bird species and has been introduced by Welinder et al. [27]. It consists of 6,033 images. Although manual annotations of object parts and attributes are available, they are not used in our experiments.

Stanford 120 Dogs. This dataset contains 120 different breeds of dogs and consists of 20,580 images. It has been introduced by Khosla et al. [16].

Oxford 37 Cats and Dogs. This dataset contains 37 breeds of cats and dogs, of which 25 are dog breeds and 12 are cat breeds. It contains a total of 7,349 images and has been introduced by Parkhi et al. [21].

The selected datasets are the largest publicly available for fine-grained classification. These datasets are quite challenging, contain a large variety of classes, and are collected in natural environments. There are large inter-class similarities, i.e. some classes are very similar to each other (Figure 2), which is a typical problem in fine-grained classification and makes recognition very hard. Figure 3 visualizes example images from each of these datasets.
4.2. Results

The results of our evaluation are presented in Table 1. The first two columns correspond to the birds and dogs datasets, for which bounding box information is available, whereas the latter two columns (for the flowers and cats and dogs datasets) do not have such information available. This choice is guided by the fact that previously published works report their results in these settings. Table 2 shows a comparison to state-of-the-art methods: our method outperforms the best known results in all cases, with the exception of the recently published work of Chai et al. [5] on flowers. Table 3 shows the relative improvements of each algorithm over our baseline and the relative computational times.

As seen from our results, segmentation consistently improves the performance. The improvement is significant, around 3-3.5%. The improvement when the bounding box around the object is known is smaller, which is expected since the objects in the bounding box are already well aligned and take up a larger portion of the image. Still, segmentation is very helpful for birds, with around 3% improvement. The improvement for the dogs dataset is smaller, less than 1%, which is due to the fact that no segmentation ground truth is available for this dataset and we used an inner rectangle as pseudo ground truth. These results are consistent with the conclusions of prior works which apply segmentation [2, 4, 19, 21].

The results for region pooling are also quite interesting: the datasets with bounding box information benefit much more from region pooling, with improvements of 3.4% and 6.5% for birds and dogs. We also observe an improvement of 1.6%-1.8% for the cats and dogs dataset, which does not have bounding box information. Our experiments show that region pooling is detrimental for the flowers dataset.
This is not surprising since, unlike birds or dogs, where one can find correspondences between visually similar parts, e.g. heads, ears, tails, etc., the corresponding regions in flowers exhibit much more variability in appearance. Figure 4 shows example regions that have been learned for these datasets. As seen, the learned flower regions, although roughly capturing 'corresponding' regions in flowers, exhibit too much variety, which may be the reason for the deteriorated performance.

Table 1. Summary of benchmarking results: classification accuracy (in %) for all methods and on all fine-grained datasets.

Method                            | 200 Birds | 120 Dogs | 102 Flowers | 37 Cats and Dogs
Baseline                          |   29.46   |  40.58   |    76.78    |      50.89
With Crammer-Singer (CS)          |   29.08   |  41.16   |    76.52    |      51.11
With segmentation                 |   32.51   |  41.34   |    80.43    |      54.35
With segmentation and CS          |   32.52   |  41.91   |    79.92    |      54.67
With region pooling               |   32.85   |  47.03   |    76.03    |      52.46
With region pooling and CS        |   32.32   |  47.58   |    75.66    |      52.67
With segm. and region pooling     |   34.93   |  47.55   |    79.92    |      55.39
With segm., reg. pooling and CS   |   34.45   |  48.26   |    79.56    |      55.34

Table 2. Comparison to the state-of-the-art results in the literature on all fine-grained datasets (where available).

Method                        | 200 Birds | 120 Dogs | 102 Flowers | 37 Cats and Dogs
Welinder et al. [27]          |   19.0    |    -     |      -      |        -
Yao et al. [30]               |   19.2    |    -     |      -      |        -
Khan et al. [15]              |   22.4    |    -     |      -      |        -
Khosla et al. [16]            |    -      |   22.0   |      -      |        -
Nilsback and Zisserman [19]   |    -      |    -     |    72.8     |        -
Nilsback [20]                 |    -      |    -     |    76.3     |        -
Parkhi et al. [21]            |    -      |    -     |      -      |      54.05
Chai et al. [4]               |   23.3    |    -     |    80.0     |        -
Yang et al. [28]              |   28.2    |   38.0   |      -      |        -
Angelova et al. [2]           |   30.17   |    -     |    80.66    |      54.3
Chai et al. [5]               |   26.7    |   26.0   |    85.2     |        -
Ours (from Table 1)           |   34.93   |   48.26  |    80.43    |      55.39
Ours, improvement over the
best state-of-the-art         |   +4.76   |  +10.26  |    -4.77    |      +1.09

Applying Crammer-Singer optimization has less pronounced effects, and in some experiments it underperforms its one-vs-all counterpart (Table 1). It generally gives comparable results and tends to be better for larger datasets, e.g. the dogs dataset and other larger datasets we have worked on (not included here). We also compared to a stochastic version, which works on par with the batch implementation but is worse on smaller datasets.

The main conclusion that we draw from the results is that the three methods are complementary to one another and, when used together, can improve the performance significantly. Combining segmentation and region pooling provides notable improvements over the baseline: 5.5% for birds, 7% for dogs, 3.1% for flowers and 4.5% for cats and dogs. Including the Crammer-Singer loss optimization gives improvements of 5% for birds, 7.7% for dogs, 2.8% for flowers and 4.5% for cats and dogs. It outperforms its one-vs-all counterpart on the larger dogs dataset. Otherwise, it is comparable or slightly worse than one-vs-all, but since it does not come at additional computational cost during testing, it can be used when beneficial without extra penalty.

When comparing the performance to other state-of-the-art methods (Table 2), we see significant improvements: 4.8% for birds, 10.3% for dogs and 1.1% for cats and dogs. The improvements over our own baseline are also on the order of 3.1-7.7% across all datasets. We note that the next best result for cats and dogs is specifically targeted to cats and dogs body layouts and has not been applied to other fine-grained classification problems. The overall performance on flowers, although an improvement of about 3.1% over our baseline, is lower than the recent best result achieved by co-segmentation [5].

We also estimated the test time for each method (Table 3). The table also includes improvements in performance over the baseline, so that tradeoffs between computational cost and accuracy can be made. Clearly, for datasets for which region pooling improves the performance by a large margin, it is the most worthwhile method. Applying Crammer-Singer optimization comes at virtually no additional cost during testing, so any improvement in performance that it can offer is worthwhile. We also note that even the combination of all methods is relatively fast compared to the computational times of prior methods reported in the literature, e.g. 4-5s for region pooling [28].
Table 3. Approximate computational time per image (in seconds) on all fine-grained datasets. The improvement in accuracy over the baseline (in %) per dataset is also included for reference.

Method                                | Time (sec.) | 200 Birds   | 120 Dogs    | 102 Flowers | 37 Cats/Dogs
Baseline                              |     1.0     |      -      |      -      |      -      |      -
With segmentation / CS                |     3.5     | +3.05/+3.06 | +0.76/+1.33 | +3.65/+3.14 | +3.46/+3.78
With region pooling / CS              |     1.8     | +3.39/+2.86 | +6.45/+7.00 | -0.75/-1.12 | +1.57/+1.78
With segm. and region pooling / CS    |     4.3     | +5.47/+4.99 | +6.97/+7.68 | +3.14/+2.78 | +4.50/+4.45
Figure 4. Example learned region-pooling features: birds (top left), cats and dogs (top right), flowers (bottom row). Each panel shows example patches that are representative of this region. As seen, these regions are capable of learning consistent appearances across different classes, e.g. a head plus beak for birds or a head for cats and dogs. The flower features also represent consistent visual appearances.
4.3. A new benchmark

As a result of our analysis, we can identify a new pipeline which essentially combines the three methods. As seen, it performs extremely well and outperforms the best known methods in the literature (with the exception of [5]). Most notably, we improve by 4.8% and 10.3% over the top-performing methods for the birds and dogs datasets. We also evaluated this pipeline on the 2012 ImageNet Fine-Grained Recognition Challenge [1] and obtained a mAP of 37.92, which improves on the winning entry by 5.7 points. The proposed pipeline can serve as a new benchmark for other works in the field.
5. Conclusions and future work

This paper investigates three prominent directions in fine-grained categorization (segmentation, region pooling, and optimization with the Crammer-Singer loss) and evaluates them in a unified framework. We test these methods and their combinations on the largest and most commonly used fine-grained recognition datasets. We also offer an analysis of performance vs. computational cost.
Furthermore, we propose a pipeline that demonstrates improvements across the fine-grained recognition datasets, in some cases outperforming the best known methods in the field by as much as 10%. This method also achieves a mean average precision which is 5.7 points higher than the winner of the 2012 ImageNet Fine-Grained Categorization Challenge [1]. The proposed pipeline is fast and does not depend on the fine-grained recognition task at hand.

Considering alternative segmentation methods, or more sophisticated dictionaries for region pooling, e.g. at multiple scales, are possible extensions of this work. Further investigation is needed to compare the performance of alternative feature representations or encodings, e.g. [23]. This study considered fine-grained classification on natural images: birds, dogs, cats, flowers. Other possible fine-grained classification tasks may focus on man-made objects, e.g. in image classification or retrieval for online retail. More experimentation with datasets in this domain is needed.

Acknowledgements. We are grateful to our team at NEC Labs America, where this work was developed.
References

[1] http://www.image-net.org/challenges/lsvrc/2012/.
[2] A. Angelova and S. Zhu. Efficient object detection and segmentation for fine-grained recognition. CVPR, 2013.
[3] S. Branson, C. Wah, B. Babenko, F. Schroff, P. Welinder, P. Perona, and S. Belongie. Visual recognition with humans in the loop. ECCV, 2010.
[4] Y. Chai, V. Lempitsky, and A. Zisserman. BiCoS: A bi-level co-segmentation method for image classification. ICCV, 2011.
[5] Y. Chai, E. Rahtu, V. Lempitsky, L. Van Gool, and A. Zisserman. TriCoS: A tri-level class-discriminative co-segmentation method for image classification. ECCV, 2012.
[6] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. BMVC, 2011.
[7] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2001.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, 2005.
[9] K. Duan, D. Parikh, D. Crandall, and K. Grauman. Discovering localized attributes for fine-grained recognition. CVPR, 2012.
[10] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 2008.
[11] R. Farrell, O. Oza, N. Zhang, V. Morariu, T. Darrell, and L. Davis. Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. ICCV, 2011.
[12] P. Felzenszwalb and D. Huttenlocher. Efficient graph-based image segmentation. IJCV, 2004.
[13] Z. Harchaoui, M. Douze, M. Paulin, M. Dudik, and J. Malick. Large-scale classification with trace-norm regularization. CVPR, 2012.
[14] S. Keerthi, S. Sundararajan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A sequential dual method for large scale multi-class linear SVMs. ACM KDD, 2008.
[15] F. Khan, J. van de Weijer, A. Bagdanov, and M. Vanrell. Portmanteau vocabularies for multi-cue image representation. NIPS, 2011.
[16] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. Workshop on Fine-Grained Visual Categorization, CVPR, 2011.
[17] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and T. Huang. Large-scale image classification: fast feature extraction and SVM training. CVPR, 2011.
[18] P. Long and R. Servedio. Consistency versus realizable H-consistency for multiclass classification. ICML, 2013.
[19] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. ICVGIP, 2008.
[20] M.-E. Nilsback. An automatic visual flora - segmentation and classification of flower images. DPhil Thesis, University of Oxford, UK, 2009.
[21] O. Parkhi, A. Vedaldi, C. V. Jawahar, and A. Zisserman. Cats and dogs. CVPR, 2012.
[22] F. Perronnin, Z. Akata, Z. Harchaoui, and C. Schmid. Towards good practice in large-scale learning for image classification. CVPR, 2012.
[23] J. Sanchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. CVPR, 2011.
[24] D. Shen, J.-D. Ruvini, and B. Sarwar. Large-scale item categorization for e-commerce. Proc. of the 21st ACM Int. Conf. on Information and Knowledge Management, 2012.
[25] C. Wah, S. Branson, P. Perona, and S. Belongie. Multiclass recognition and part localization with humans in the loop. ICCV, 2011.
[26] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. CVPR, 2010.
[27] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[28] S. Yang, L. Bo, J. Wang, and L. Shapiro. Unsupervised template learning for fine-grained object recognition. NIPS, 2012.
[29] B. Yao, G. Bradski, and L. Fei-Fei. A codebook-free and annotation-free approach for fine-grained image categorization. CVPR, 2012.
[30] B. Yao, A. Khosla, and L. Fei-Fei. Combining randomization and discrimination for fine-grained image categorization. CVPR, 2011.
[31] N. Zhang, R. Farrell, and T. Darrell. Pose pooling kernels for sub-category recognition. CVPR, 2012.
[32] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. NIPS, 2004.