IEEE SIGNAL PROCESSING LETTERS, VOL. 18, NO. 11, NOVEMBER 2011

Efficient Learning of Sample-Specific Discriminative Features for Scene Classification

Yina Han and Guizhong Liu, Member, IEEE

Abstract—Learning sample-specific discriminative features through numerous independent local learners may not scale well to real-world scene classification tasks and suffers from the risk of overfitting. We therefore cast the problem in an SVM-based localized multiple kernel learning framework and design a new strategy that alternately optimizes a standard SVM solver and the sample-specific kernel weights, the latter obtained by either a linear program (for the $\ell_1$-norm constraint) or closed-form solutions (for the $\ell_p$-norm constraint, $p > 1$). Experiments on both a natural scene dataset and a cluttered indoor scene dataset demonstrate the effectiveness and efficiency of our approach.

Index Terms—Localized multiple kernel learning, scene classification, support vector machine.

I. INTRODUCTION

The development of computer vision and biological vision has enjoyed great success in designing good visual features that describe different aspects of visual content: the holistic spatial layout (e.g., GIST, HOG 4×4, and the Geometric Probability Map), local visual properties (e.g., Dense SIFT, Sparse SIFT, line histograms, and textons), and local geometric layout (e.g., self-similarity). However, for complex scene images, a major challenge is that while some scenes are well characterized by the holistic spatial layout (e.g., natural outdoor scenes), others are better characterized by local properties (e.g., cluttered indoor scenes). Hence, learning sample-specific discriminative features (LSDF) by localized multiple kernel learning (LMKL) can reasonably improve classification performance. For instance, in [1], Christoudias et al. learned the covariance of a Gaussian process using a product of kernels as the localized combination. In [2], Lin et al. used kernel alignment to construct a local ensemble kernel for an SVM classifier. However, in these methods, the learning of multiple local models, each of which accounts for one sample or its neighborhood, is performed independently. Hence, even for a modest training set, they may not scale well. Furthermore, the learned sample-specific local models may suffer from the risk of overfitting to the training data.

Manuscript received August 08, 2011; revised September 13, 2011; accepted September 19, 2011. Date of publication October 03, 2011; date of current version October 10, 2011. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Xiao-Ping Zhang. The authors are with the School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China (e-mail: [email protected]. edu.cn; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LSP.2011.2170165

SVM based multiple kernel learning (MKL), which alternates between a standard SVM solver and an update of the kernel weights [6]–[10], provides a natural way to address these two unfavorable issues. Since the time spent updating the kernel weights is negligible compared with that of the SVM solver, the overall complexity is tied to the canonical SVM problem, for which many efficient toolboxes exist. On the other hand, by sharing a common SVM classifier, all the sample-specific local models establish a parametric space in which they lie and spread as a manifold-like structure. This can lessen the overfitting caused by independently learning each classifier with insufficient training data. Nevertheless, due to the sample-specific combination, LMKL presents a difficult non-convex quadratic problem. In [11], Gönen and Alpaydin predefined the local weights in the form of a soft-max gating function, and updated the parameters of the gating function by a gradient descent approach. Yang et al. [12] used a similar idea, but learned group-wise local weights shared by a set of samples of the same group. However, the introduction of a gating function does not change the non-convexity in the local weights, so the gradient-descent-based approaches are prone to being trapped in local minima [11], [12]. Moreover, the parameters of the gating function, rather than the local weights themselves, are updated at each step.

This letter introduces SVM based LMKL into the task of LSDF. Moreover, we propose a new equivalent optimization from which the associated sample-specific kernel weights can be obtained directly, by either a linear program (for the $\ell_1$-norm constraint) or closed-form solutions (for the $\ell_p$-norm constraint, $p > 1$). Experimental results on both a natural scene dataset and a cluttered indoor scene dataset demonstrate the efficacy of our approach.

II. FORMULATION OF SVM BASED LMKL

Given a collection of scene samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $y_i$ denotes the label of the $i$-th scene in the training set, $\mathbf{x}_i$ denotes the corresponding visual features, and $N$ is the number of training data, the features and similarity measures are first "kernelised" to yield $M$ base kernel matrices $\{K_m\}_{m=1}^{M}$. For simplicity, we focus on binary problems in this letter and use a one-vs.-all approach for multiclass classification. The binary decision function corresponding to the localized combination of base kernels is of the form

$$f(\mathbf{x}) = \sum_{m=1}^{M} \eta_m(\mathbf{x})\, f_m(\mathbf{x}) + b, \qquad (1)$$

where each function $f_m$ belongs to the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_m$ associated with kernel $K_m$, and $\eta_m(\mathbf{x})$ is the sample-specific weight of the $m$-th kernel.
The idea of SVM based LMKL is to learn the optimal functions $f_m$ and bias $b$, as well as to estimate the sample-specific weights $\eta_m(\mathbf{x}_i)$, so as to maximize the margin while minimizing the loss on the training data. This can be achieved by solving the following primal problem:

$$\min_{\{f_m\},\, b,\, \boldsymbol{\eta}} \ \frac{1}{2}\sum_{m=1}^{M}\|f_m\|_{\mathcal{H}_m}^{2} \;+\; C\sum_{i=1}^{N}\ell\Big(y_i,\ \sum_{m=1}^{M}\eta_m(\mathbf{x}_i)\,f_m(\mathbf{x}_i)+b\Big), \qquad (2)$$

where $\ell(\cdot,\cdot)$ is a loss function and $C$ a regularization parameter. At test time, the learnt weights $\eta_m(\mathbf{x}_i)$ are deployed on the test set by a nearest-neighbor rule. Hence, for each training sample $\mathbf{x}_i$, the associated sample-specific weights are expected to give good performance for test samples falling around $\mathbf{x}_i$. To this end, based on the similarity measure, we specify the neighborhood of $\mathbf{x}_i$ by its $k$ nearest neighbors (including $\mathbf{x}_i$ itself), denoted as $\mathcal{N}_i$. The original primal problem (2) is then modified by substituting, for the empirical loss incurred by each sample $\mathbf{x}_i$, the empirical loss incurred by its whole neighborhood $\mathcal{N}_i$ under the same local weights, namely

$$\min_{\{f_m\},\, b,\, \boldsymbol{\eta}} \ \frac{1}{2}\sum_{m=1}^{M}\|f_m\|_{\mathcal{H}_m}^{2} \;+\; C\sum_{i=1}^{N}\sum_{j\in\mathcal{N}_i}\ell\Big(y_j,\ \sum_{m=1}^{M}\eta_m(\mathbf{x}_i)\,f_m(\mathbf{x}_j)+b\Big). \qquad (3)$$
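As an illustration of the neighborhood construction just described, the following sketch builds, for every training sample, the index set $\mathcal{N}_i$ of its $k$ nearest neighbors (the sample itself included) from a precomputed pairwise distance matrix. The function name and the generic distance input are our own illustrative assumptions, not the paper's code.

```python
import numpy as np

def build_neighborhoods(dist, k):
    """Return, for each training sample i, the indices of its k nearest
    neighbors (the sample itself included) under the given distance matrix.

    dist : (N, N) array of pairwise distances between training samples.
    k    : neighborhood size; k = 1 recovers the per-sample loss of (2).
    """
    # argsort each row; the first entry is the sample itself (distance 0)
    order = np.argsort(dist, axis=1)
    return order[:, :k]          # shape (N, k): neighborhood N_i per sample
```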

When the hinge loss is employed, the dual problem of (3) can be derived as (4), with dual variables $\boldsymbol{\alpha}$.

III. SOLVING THE SVM BASED LMKL

In order to inherit the efficiency of alternating approaches, we follow the standard procedure and reformulate (4) as a nested two-step optimization (5), in which the inner problem (6) is a maximization over the dual variables $\boldsymbol{\alpha}$, and use alternating optimization between the maximization over $\boldsymbol{\alpha}$ and the minimization over the kernel weights $\boldsymbol{\eta}$.

A. Computing $\boldsymbol{\alpha}$ With Fixed $\boldsymbol{\eta}$

In the inner loop, $\boldsymbol{\eta}$ is fixed and problem (6) can be identified as the standard one-class SVM dual formulation using the combined kernel $K_{\boldsymbol{\eta}}$. As stated in [7], the SVM solution is unique if and only if $K_{\boldsymbol{\eta}}$ is strictly positive definite. Given that $\boldsymbol{\eta}$ is restricted to be non-negative, this condition is guaranteed when each base kernel $K_m$ is strictly positive definite. However, the $K_m$ cannot always be guaranteed to be positive semi-definite. Following [5], this issue is resolved by first computing the eigenvalues of $K_m$; if the smallest one is negative, its absolute value is added to the diagonal of $K_m$. The optimal $\boldsymbol{\alpha}$ can then be conveniently obtained by any off-the-shelf SVM solver.
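To make the two ingredients of this inner step concrete, here is a hedged Python sketch: shifting an indefinite base kernel to be positive semi-definite by adding the absolute value of its smallest (negative) eigenvalue to the diagonal, and forming a locally combined kernel from the base kernels and the current sample-specific weights. The quadratic weighting $\eta_m(\mathbf{x}_i)\eta_m(\mathbf{x}_j)$ in the combination is an assumption on our part (the paper's exact expression for $K_{\boldsymbol{\eta}}$ is not reproduced here), as are the function names.

```python
import numpy as np

def psd_correct(K):
    """Shift an indefinite kernel: add |smallest eigenvalue| to the diagonal."""
    lam_min = np.linalg.eigvalsh(K).min()
    if lam_min < 0:
        K = K + abs(lam_min) * np.eye(K.shape[0])
    return K

def combined_kernel(base_kernels, eta):
    """Locally combined kernel K_eta (assumed form).

    base_kernels : list of M arrays, each (N, N), already PSD-corrected.
    eta          : (N, M) array of sample-specific, non-negative weights.
    """
    N = eta.shape[0]
    K_eta = np.zeros((N, N))
    for m, K in enumerate(base_kernels):
        # assumed localized combination: eta_m(x_i) * K_m(x_i, x_j) * eta_m(x_j)
        K_eta += np.outer(eta[:, m], eta[:, m]) * K
    return K_eta
```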

B. Computing $\boldsymbol{\eta}$ With Fixed $\boldsymbol{\alpha}$

In the outer loop, with the optimal $\boldsymbol{\alpha}$ obtained from the above procedure, problem (5) is a difficult non-convex quadratic problem in $\boldsymbol{\eta}$. Hence, instead of solving (5) directly, we rewrite the localized combination and transform problem (3) accordingly, which yields the new primal problem (7), in which the primal variables of (3) and the dual variables $\boldsymbol{\alpha}$ of (4) appear together. Since (7) is convex in the primal variables, setting the corresponding derivatives to zero gives (8), so that the optimal primal variables can be calculated as in (9). Substituting these optima into (6), we obtain (10). Because of the mutual independence among samples, the minimization of (10) can be conducted separately for each training sample $\mathbf{x}_i$, $i = 1, \ldots, N$, namely as (11). Given problem (11), it is interesting to discuss the domain of $\boldsymbol{\eta}_i$.

1) $\ell_1$-Norm of Kernel Weights: When $\boldsymbol{\eta}_i$ lies in a simplex, i.e., $\sum_{m}\eta_{im} = 1$ and $\eta_{im} \ge 0$, (11) can be expressed as a standard linear program (12). Problem (12) can be conveniently solved by an off-the-shelf LP solver.

2) $\ell_p$-Norm of Kernel Weights: When $p > 1$, namely under an $\ell_p$-norm constraint on the kernel weights, (11) is unlikely to be solved by standard optimization toolboxes. Hence, we apply the Lagrangian theorem to incorporate the $\ell_p$-norm constraint into the objective of (12), which gives the Lagrangian (13). Setting to zero the gradient of the Lagrangian with respect to $\boldsymbol{\eta}_i$ and the Lagrange multiplier, we obtain the closed-form solution (14).
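For the $\ell_1$-norm case, the per-sample linear program over the simplex can indeed be handled by any off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog; the objective vector `c_i` (the per-kernel coefficients that (11) produces for sample $\mathbf{x}_i$) is left as an input because the paper's exact expression is not reproduced here, and the function name is our own.

```python
import numpy as np
from scipy.optimize import linprog

def update_eta_l1(c_i):
    """Solve a problem of the form (12) for one sample:
    minimize c_i^T eta_i over the simplex.

    c_i : (M,) per-kernel coefficients for sample x_i (assumed derived from (11)).
    Returns the optimal eta_i with sum(eta_i) = 1 and eta_i >= 0.
    """
    M = len(c_i)
    res = linprog(c=c_i,
                  A_eq=np.ones((1, M)), b_eq=np.array([1.0]),
                  bounds=[(0.0, None)] * M,
                  method="highs")
    return res.x
```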

The above local learning is summarized in Algorithm 1. Note that the alternating optimization cannot guarantee the global optimum of the obtained $\boldsymbol{\alpha}$ and $\boldsymbol{\eta}$. According to alternating projection theory, this is determined not by the specific optimization strategy within the alternating framework, but by the property of the problem itself. Nevertheless, compared with [11], [12], our method can guarantee the global optimum of the $\boldsymbol{\eta}$ obtained at each iteration.

Algorithm 1 SVM based LMKL
1: Initialize $\eta_m(\mathbf{x}_i)$ for $m = 1, \ldots, M$ and $i = 1, \ldots, N$
2: repeat
3: Solve the dual problem of the one-class SVM with the combined kernel $K_{\boldsymbol{\eta}}$ for the optimal $\boldsymbol{\alpha}$
4: Calculate the sample-wise quantities according to (8), and update $\boldsymbol{\eta}_i$ by solving (12) (or using (14)) for each $i = 1, \ldots, N$
5: until convergence

At test time, given a new sample $\mathbf{x}$, the proposed algorithm first finds its nearest training sample $\mathbf{x}_i$, and then uses the local classifier (15) to predict the label of $\mathbf{x}$.
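To show how the pieces of Algorithm 1 and the nearest-neighbor test rule (15) fit together, here is a hedged end-to-end sketch. It reuses `psd_correct`, `combined_kernel`, and `update_eta_l1` from the earlier sketches and assumes two hypothetical helpers, `solve_svm_dual` (any off-the-shelf SVM dual solver) and `weight_objective` (the per-sample coefficients of (11)); the exact form of the local decision function is likewise our assumption, not the paper's equation (15).

```python
import numpy as np

def fit_lmkl(base_kernels, y, n_iter=20):
    """Alternating optimization in the spirit of Algorithm 1 (sketch only)."""
    base_kernels = [psd_correct(K) for K in base_kernels]   # from the earlier sketch
    N = base_kernels[0].shape[0]
    M = len(base_kernels)
    eta = np.full((N, M), 1.0 / M)             # 1: initialize the weights uniformly
    for _ in range(n_iter):                    # 2: repeat (fixed iteration budget here)
        K_eta = combined_kernel(base_kernels, eta)
        alpha = solve_svm_dual(K_eta, y)       # 3: hypothetical off-the-shelf SVM solver
        for i in range(N):                     # 4: per-sample kernel-weight update
            c_i = weight_objective(base_kernels, alpha, y, i)  # hypothetical helper for (11)
            eta[i] = update_eta_l1(c_i)        # LP update from the earlier sketch
    return alpha, eta                          # 5: until convergence

def predict_label(k_new, nn_index, alpha, eta, y, b=0.0):
    """Nearest-neighbor deployment of the local weights, cf. (15) (assumed form).

    k_new    : (M, N) array of kernel values k_m(x_j, x_new) against the training set.
    nn_index : index of the training sample nearest to x_new.
    """
    # weight each base kernel by the nearest training sample's local weights,
    # then apply the usual SVM expansion (our assumption for the local classifier)
    k_local = np.einsum('m,mn->n', eta[nn_index], k_new)
    return np.sign(np.dot(alpha * y, k_local) + b)
```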

IV. EXPERIMENTS

We evaluate the proposed approach on both natural scene images (15-Scene [3]) and cluttered indoor scene images (MIT Indoor Scene [4]), following the benchmarking protocol of each dataset. Eight state-of-the-art features, i.e., GIST, HOG 4×4, Dense SIFT, Sparse SIFT, Line Histograms, SSIM, Texton, and Geometric Map, which are potentially useful for scene classification, are used together with their associated similarity measures, i.e., an RBF distance for GIST and the $\chi^2$ distance for all the others, to construct the base kernels.
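As a hedged illustration of the base-kernel construction, the sketch below computes pairwise $\chi^2$ distances between histogram-type features and exponentiates a distance matrix into a kernel; the bandwidth heuristic and the function names are our assumptions rather than the paper's exact settings (GIST would instead use an RBF kernel on its own distance).

```python
import numpy as np

def chi2_distance(H):
    """Pairwise chi-square distances between rows of a histogram matrix H of shape (N, D)."""
    # d(h, g) = 0.5 * sum_d (h_d - g_d)^2 / (h_d + g_d)
    num = (H[:, None, :] - H[None, :, :]) ** 2
    den = H[:, None, :] + H[None, :, :] + 1e-12
    return 0.5 * np.sum(num / den, axis=2)

def distance_to_kernel(D):
    """Exponentiate a distance matrix into a base kernel, K = exp(-gamma * D)."""
    gamma = 1.0 / np.mean(D[D > 0])    # mean-distance bandwidth heuristic (our assumption)
    return np.exp(-gamma * D)
```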


A. Sensitivity Study of the Number of Nearest Neighbors

Fig. 1. Sensitivity to the number of nearest neighbors. Average performance for (a) the 15-Scene dataset and (b) the MIT Indoor dataset over five-fold cross-validation with various $k$ values. Error bars indicate the standard deviation.

There is no prior knowledge about the optimal number of nearest neighbors $k$ in $k$-NN. To examine the impact of $k$, for the training set of each dataset we use five-fold cross-validation over a set of candidate $k$ values. Fig. 1 plots the mean classification accuracies and standard deviations over the five training folds for the various $k$ values. On the whole, the performance of our model is fairly insensitive to the setting of $k$, suggesting that $k$ only needs to be roughly estimated. In the following experiments, we employ the $k$ optimized by validation, that is, 4 for 15-Scene and 2 for MIT Indoor.
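A minimal sketch of this five-fold cross-validation over candidate $k$ values is given below, assuming a hypothetical helper `train_and_score` that trains the model on one fold with neighborhoods of size $k$ and returns validation accuracy; the candidate list and scoring details are illustrative, not the paper's protocol code.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_k(candidate_ks, dist, base_kernels, y, n_splits=5):
    """Five-fold cross-validation over candidate neighborhood sizes k (sketch)."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = {k: [] for k in candidate_ks}
    for train_idx, val_idx in folds.split(y):
        for k in candidate_ks:
            scores[k].append(train_and_score(k, train_idx, val_idx,
                                             dist, base_kernels, y))
    mean_acc = {k: float(np.mean(v)) for k, v in scores.items()}
    return max(mean_acc, key=mean_acc.get)     # k with the best mean accuracy
```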

B. Comparison With Global MKL Over Various $p$ Values

Table I compares our method with global MKL across a large operating range of $p$. Here we use the specialised $\ell_p$-MKL solver of [8], namely MKLGL. In terms of classification accuracy, our method achieves significant improvements over global MKL.


TABLE I THE PERFORMANCE COMPARISON OF MKLGL AND OUR LOCAL LEARNING METHOD ON THE TWO SCENE DATASETS

Fig. 2. The weight distributions over the eight features for four randomly selected training samples from (a)–(d) the 15-Scene dataset and (e)–(h) the MIT Indoor dataset, with their respective optimal values.

Fig. 3. 15-Scene dataset: comparison with three state-of-the-art local learning algorithms for different numbers of training samples per category. (a) Classification accuracy; (b) training time.

As shown in Fig. 2, the weight distributions vary from sample to sample. Hence, learning sample-specific discriminative features can more precisely characterize the diverse appearances of scene contents. In terms of efficiency, our method achieves the same order of computational time as the state-of-the-art efficient MKLGL. This is because the sample-wise computation only works on the sparse set consisting of the classifier's support vectors, and the sample-wise alternating optimization converges faster than MKLGL.

C. Comparison With State-of-the-Art Local Learning Methods

We then fix the value of $k$ to 4 and compare our method with Lin07-CVPR [2], Yang09-ICCV [12], and Lin09-ICCV [5] in Fig. 3. In terms of classification accuracy, our method achieves around 4% to 6% improvement. In terms of efficiency, by using a common SVM classifier to correlate the local classifiers, our method shows around two orders of magnitude training-time speed-up over [2] and [12], and one order of magnitude speed-up over [5].

V. CONCLUSION

Starting from state-of-the-art features that show different discriminative power for scene classification, our solution is to combine them optimally for each scene by kernel learning. Comprehensive evaluations on both a natural scene dataset and a cluttered indoor scene dataset show superior classification accuracy to global MKL and to three state-of-the-art local learning algorithms, with a one to two orders of magnitude speed-up in learning time, supporting the efficacy of our method in addressing both a) the heavy time consumption and b) the risk of overfitting of numerous independent local learning.

REFERENCES

[1] M. Christoudias, R. Urtasun, and T. Darrell, "Bayesian localized multiple kernel learning," Univ. California, Berkeley, Tech. Rep., 2009.
[2] Y.-Y. Lin, T.-L. Liu, and C.-S. Fuh, "Local ensemble kernel learning for object category recognition," in CVPR, 2007.
[3] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in CVPR, 2006, pp. 2169–2178.
[4] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in CVPR, 2009, pp. 413–420.
[5] Y.-Y. Lin, J.-F. Tsai, and T.-L. Liu, "Efficient discriminative local learning for object recognition," in ICCV, 2009, pp. 598–605.
[6] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, "Large scale multiple kernel learning," J. Mach. Learn. Res., vol. 7, pp. 1531–1565, 2006.
[7] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," J. Mach. Learn. Res., vol. 9, pp. 2491–2521, Nov. 2008.
[8] Z. Xu, R. Jin, H. Yang, I. King, and M. R. Lyu, "Simple and efficient multiple kernel learning by group lasso," in ICML, 2010, pp. 1175–1182.
[9] C.-V. Nguyen and D. B. H. Tay, "Regression using multikernel and semiparametric support vector algorithms," IEEE Signal Process. Lett., vol. 15, pp. 481–484, 2008.
[10] J. Wu and X.-L. Zhang, "Efficient multiple kernel support vector machine based voice activity detection," IEEE Signal Process. Lett., vol. 18, pp. 466–469, 2011.
[11] M. Gönen and E. Alpaydin, "Localized multiple kernel learning," in ICML, 2008, pp. 352–359.
[12] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in ICCV, 2009, pp. 436–443.