
Probability-Confidence-Kernel-Based Localized Multiple Kernel Learning With $l_p$ Norm

Yina Han and Guizhong Liu, Member, IEEE

Abstract—Localized multiple kernel learning (LMKL) is an attractive strategy for combining multiple heterogeneous features in terms of their discriminative power for each individual sample. However, models that fit a specific sample too closely hinder the extension to unseen data, while a more general form is often insufficient for characterizing diverse locality. Hence, both learning sample-specific local models for each training datum and extending the learned models to unseen test data should be equally addressed in designing an LMKL algorithm. In this paper, for an integrative solution, we propose a probability confidence kernel (PCK), which measures per-sample similarity with respect to a probabilistic-prediction-based class attribute: The class attribute similarity complements the spatial-similarity-based base kernels for more reasonable locality characterization, and the predefined form of the involved class probability density function facilitates the extension to the whole input space and ensures its statistical meaning. Incorporating PCK into the support-vector-machine-based LMKL framework, we propose a new PCK-LMKL with an arbitrary $l_p$-norm constraint implied in the definition of PCKs, where both the parameters in PCK and the final classifier can be efficiently optimized in a joint manner. Evaluations of PCK-LMKL on both benchmark machine learning data sets (ten University of California Irvine (UCI) data sets) and challenging computer vision data sets (the 15-scene data set and the Caltech-101 data set) show that it achieves state-of-the-art performances.

Index Terms—Local learning, multiple kernel learning, support vector machines.

I. INTRODUCTION

LOCALIZED multiple kernel learning (LMKL) is an attractive strategy for combining multiple heterogeneous features in terms of their discriminative power for each individual sample. Unlike multiple kernel learning (MKL) [1]–[9], which learns a global combination for the whole input space, LMKL draws on the idea that a sample-specific local combination should most likely better characterize the distinctive properties of each sample. Consequently, LMKL would achieve better performances. Some effective attempts have been made with promising results under different prediction frameworks [10]–[15].

Manuscript received March 18, 2011; revised July 15, 2011 and October 5, 2011; accepted November 23, 2011. This work is supported by the National Natural Scientific Foundation of China Project 61173110 and Key Projects in the National Science & Technology Pillar Program 2011BAK08B02. This paper was recommended by Associate Editor Dr. J. Basak. The authors are with the School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSMCB.2011.2179291

Fig. 1. Kernel matrices computed on the toy data set in Section IV-A. (a) Base kernel: Gaussian kernel with Euclidean distance. (b) PCK. (c) True label kernel: the label product of pairwise samples.

In [10], Frome et al. proposed sample-dependent distance function learning for the K-nearest neighbor classifier. Lin et al. [11] constructed a local ensemble of support vector machine (SVM) classifiers defined on a per-example basis. In the Gaussian process framework [12], a rank-constrained covariance matrix that represents per-sample similarity was designed. In [13], the sample-specific local model was a classifier derived by boosting. At test time, the learned models are deployed on the test set by the nearest neighbor rule. These local models are specifically designed for each training sample, thus characterizing the local diversity well. However, due to the nonreversibility of the neighborhood [16], [17], the nearest-neighbor-based extension can lead to degenerate models. In contrast, Gönen and Alpaydin [14] used a predefined sample-dependent gating function, which facilitates the extension to unseen data. However, the proposed fixed form seems insufficient for characterizing local diversity, since only performances statistically similar to global MKL have been reported in [14]. These works clearly point to the importance of addressing both learning sample-specific local models for each training datum and effectively extending the learned models to unseen test data. As local models are used for per-sample classification, they should not only be relevant to the associated spatial position in the feature space but also be governed by the latent class attributes, and the two are often inconsistent [as shown in Fig. 1(a) and (c)]. In this paper, by exploring the relationship between the class probability density distribution and the class prediction confidence, we define a probability confidence (PC) kernel (PCK), which measures per-sample similarity with respect to the probabilistic-prediction-based class attribute. This provides an integrative solution: The class attribute similarity [as shown in Fig. 1(b)] can complement the spatial similarity [as shown in Fig. 1(a)] provided by the base kernels for more reasonable locality characterization, and the predefined form of the class probability density function can facilitate the extension to the whole input space and ensure its statistical meaning.

Incorporating PCK into SVM-based [18] MKL, a new PCK-LMKL with an arbitrary $l_p$-norm constraint implied in the definition of PCKs is presented, where both the parameters in PCK and the final classifier can be efficiently optimized by a gradient descent wrapping a canonical SVM solver. Empirically, we demonstrate the efficacy of the proposed PCK-LMKL on ten benchmark UCI data sets and two real-world computer vision data sets. We show that PCK-LMKL outperforms SimpleMKL [4] and several LMKL algorithms and achieves state-of-the-art performances on real-world image classification, with a computational cost comparable to that of the efficient SimpleMKL.

Algorithm 1 PCK-LMKL Optimization Process
  Initialize $a^m, \sigma^m_{\pm1}$ with random small positive values for $m = 1, \ldots, M$
  Estimate the class probability density functions
    $p_{\sigma^m_{\pm1}}(x^m) = \frac{1}{N\sigma^m_{\pm1}} \sum_{\{i \mid y_i = \pm1\}} \exp\!\left(K_m(x^m, x^m_i)/\sigma^m_{\pm1}\right)$ for $m = 1, \ldots, M$
  while stopping criterion not met do
    Calculate the probability confidence kernels $P_m(x^m_i, x^m_j)$ for $m = 1, \ldots, M$
    Calculate the localized combination of kernels: $K_{\mathbf{P}}(x_i, x_j) = \sum_{m=1}^{M} P_m(x^m_i, x^m_j)\, K_m(x^m_i, x^m_j)$
    Solve the canonical SVM with kernel $K_{\mathbf{P}}$
    Initialize the step size $\mu$
    while $J(\mathbf{P}^{t+1}) < J(\mathbf{P}^{t})$ do
      Update $\mu$ using line search
      $(a^m)^{t+1} \Leftarrow (a^m)^{t} - \mu^{t}\, \partial J(\mathbf{P})/\partial a^m$ for $m = 1, \ldots, M$
      $(\sigma^m_{\pm1})^{t+1} \Leftarrow (\sigma^m_{\pm1})^{t} - \mu^{t}\, \partial J(\mathbf{P})/\partial \sigma^m_{\pm1}$ for $m = 1, \ldots, M$
    end while
  end while
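To make the structure of Algorithm 1 easier to prototype, the following is a minimal Python sketch of one outer iteration, assuming the base kernels are precomputed: the inner step solves a standard SVM on the locally combined kernel (here through scikit-learn's precomputed-kernel interface), and the outer step takes a gradient move on the PCK parameters. The helpers pck_matrices, grad_a, and grad_sigma are hypothetical placeholders for the expressions derived in Section III; this is not the authors' released code.

```python
import numpy as np
from sklearn.svm import SVC

def pck_lmkl_step(K_list, y, a, sigma, pck_matrices, grad_a, grad_sigma,
                  C=1.0, lr=0.1):
    """One outer iteration of a PCK-LMKL-style scheme (illustrative sketch).

    K_list       : list of M precomputed base kernel matrices, each N x N
    a, sigma     : current PCK parameters a^m and sigma^m_{+-1}
    pck_matrices : callable returning the list of P_m matrices (hypothetical)
    grad_a/grad_sigma : callables returning dJ/da^m and dJ/dsigma^m (hypothetical)
    """
    # Localized combination K_P = sum_m P_m o K_m (elementwise products)
    P_list = pck_matrices(a, sigma)
    K_P = sum(P * K for P, K in zip(P_list, K_list))

    # Inner step: canonical SVM on the combined kernel
    svm = SVC(C=C, kernel="precomputed")
    svm.fit(K_P, y)

    # Recover the dual vector alpha (dual_coef_ stores y_i * alpha_i on SVs)
    alpha = np.zeros(len(y))
    alpha[svm.support_] = np.abs(svm.dual_coef_.ravel())

    # Outer step: gradient descent on the PCK parameters
    a_new = a - lr * grad_a(alpha, P_list, K_list)
    sigma_new = sigma - lr * grad_sigma(alpha, P_list, K_list)
    return a_new, sigma_new, svm
```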

The rest of this paper is structured as follows. Section II presents a brief review of SVM-based MKL. A detailed description of the proposed PCK-LMKL follows in Section III. Section IV shows the empirical results, and Section V concludes.

II. BRIEF REVIEW OF SVM-BASED MKL

Kernel-based methods [19], [20] such as the SVM [21], [22] have achieved great success in diverse application areas. The power of SVMs strongly depends on the choice of a good kernel $K(x_i, x_j)$, which defines a certain spatial similarity measure in the associated reproducing kernel Hilbert space [23]. Let $D = \{(x_i, y_i)\}_{i=1,\ldots,N}$ denote the collection of $N$ labeled training examples, where $x$ lies in some input space $X$ and $y \in \{-1, +1\}$ for binary classification. The result of SVM learning is an $\alpha$-weighted linear combination of kernel elements with bias $b$ [2]

$$f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b. \tag{1}$$

The choice of a good kernel can, in principle, be made via cross-validation. However, for complex classification tasks, a subject often contains different attributes (e.g., sex or age in the UCI Heart data set) or is described by different aspects (e.g., global or local properties in a visual data set). Different attributes or aspects demand different feature representations and similarity metrics. For example, attributes in the UCI Heart data set are often described by categorical, integer, or real numbers, which are measured by classical kernels, such as polynomial and Gaussian kernels, but with different parameters. For the images in a visual data set, global GIST features are often measured by the Euclidean distance, while local scale-invariant feature transform (SIFT) features are often measured by the $\chi^2$ distance. It is difficult to handle such multiple heterogeneous features [2], [24], [25]. Let $x_i = [x^1_i, \ldots, x^M_i]$ denote the measurements of $M$ different features for sample $x_i$, with the corresponding similarity metrics represented by kernels $\{K_m\}_m$

$$K(x, x_i) = \sum_{m=1}^{M} \beta_m K_m(x^m, x^m_i). \tag{2}$$
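As a concrete illustration of (1) and (2), the sketch below (not from the paper) trains a standard SVM on a fixed, globally weighted sum of two base kernels and evaluates the resulting decision function on the training points; the weights $\beta_m$ are chosen by hand here, whereas MKL would learn them.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)   # toy labels in {-1, +1}

# Two base kernels K_m and hand-picked global weights beta_m, cf. eq. (2)
K1, K2 = rbf_kernel(X), polynomial_kernel(X, degree=2)
beta = np.array([0.7, 0.3])
K = beta[0] * K1 + beta[1] * K2

svm = SVC(kernel="precomputed").fit(K, y)

# f(x) = sum_i alpha_i y_i K(x, x_i) + b, cf. eq. (1), evaluated via the
# precomputed Gram matrix between the query points and the training points
scores = svm.decision_function(K)
print("training accuracy:", np.mean(np.sign(scores) == y))
```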

Within this framework, the problem is then transferred to the choice of the weights $\{\beta_m\}_m$. For intuitive interpretability of the learned weights, namely, singling out which features are of importance for discrimination, canonical MKL imposes a sparse $l_1$-norm constraint, namely, $|\beta|_1 = 1$. Many efficient algorithms [1]–[4], [26]–[28] have been proposed for solving this $l_1$-norm MKL. However, when the involved kernels capture relevant information about the data from different aspects, sparseness may discard useful information and hence lead to degenerate models. In [8], [29], and [30], algorithms for solving arbitrary $l_p$-norm MKL, namely, $|\beta|_p = 1$, are developed with impressive improvement over the sparse counterpart. The superiority of an arbitrary $l_p$-norm constraint on the kernel weights has also been demonstrated in the framework of kernel Fisher discriminant analysis [31].

From another point of view, the aforementioned MKL approaches assign the same weight to a kernel over the whole input space. Due to the character of the data distribution, the set of kernels important for discriminating between samples may vary locally. Gönen and Alpaydin [14] proposed an LMKL (G-LMKL), in which kernel (2) was extended to the sample-specific form

$$K(x, x_i) = \sum_{m=1}^{M} \beta^m_{x^m}\,\beta^m_{x^m_i}\, K_m(x^m, x^m_i). \tag{3}$$

This is a difficult quadratic nonconvex problem. Instead of solving $\{\beta^m_{x^m}\}_m$ directly, Gönen and Alpaydin [14] used a set of normalized exponential gating functions to approximate $\{\beta^m_{x^m}\}_m$. The normalized exponential function can be insufficient to characterize the ambiguity of the underlying data, with only performances statistically similar to global MKL. Moreover, the normalization promotes a set of sparse local weights. Yang et al. [32] used a similar idea but defined the gating function for a set of presegmented sample groups instead of individual samples.

Seeing that the sample-specific kernel weights $\{\beta^m_{x^m}\}_m$ are hard to interpret and $\beta^m_{x^m}$ only occurs in pairs, we substitute $P_m(x^m_i, x^m_j)$ for $\beta^m_{x^m_i}\beta^m_{x^m_j}$, and the localized combination becomes

$$K(x_i, x_j) = \sum_{m=1}^{M} P_m\!\left(x^m_i, x^m_j\right) K_m\!\left(x^m_i, x^m_j\right) \tag{4}$$

where $P_m$ is a matrix that measures per-sample similarity in $\{H_m\}_m$.

III. PCK-LMKL ALGORITHM

A. PCK

To complement the spatial similarity defined by $K_m$ from a class attribute view, we propose a PCK to estimate per-sample class attribute similarity.

Definition 1 (PC): Let $+1$ and $-1$ represent the positive and negative classes, respectively, in binary classification, and let $p^m(x^m \mid +1)$ and $p^m(x^m \mid -1)$ be the corresponding probability density distributions at $x^m$ in space $H_m$. The PC of $x^m$ is defined as

$$pc^m(x^m) = p^m(x^m \mid +1) - p^m(x^m \mid -1), \qquad m = 1, \ldots, M. \tag{5}$$
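To make Definition 1 concrete, the short sketch below (an illustration, not the paper's code) estimates $pc^m(x)$ as the difference of two class-conditional densities; a simple unnormalized Gaussian kernel density estimate stands in for the specific estimator introduced in Remark 4 below.

```python
import numpy as np

def probability_confidence(x, X_pos, X_neg, bandwidth=1.0):
    """pc(x) = p(x | +1) - p(x | -1), each density via a Gaussian KDE.

    x     : query point, shape (d,)
    X_pos : training samples of class +1, shape (n_pos, d)
    X_neg : training samples of class -1, shape (n_neg, d)
    """
    def kde(q, X, h):
        # unnormalized Gaussian kernel density estimate at q
        d2 = np.sum((X - q) ** 2, axis=1)
        return np.mean(np.exp(-d2 / (2.0 * h ** 2)))
    return kde(x, X_pos, bandwidth) - kde(x, X_neg, bandwidth)

# The sign of pc indicates the likely class and its magnitude the confidence
# (cf. Remark 1 below); the bandwidth plays the role of sigma^m_{+-1}.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+1.0, size=(50, 2))
X_neg = rng.normal(loc=-1.0, size=(50, 2))
print(probability_confidence(np.array([0.8, 0.8]), X_pos, X_neg))
```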

Remark 1: The sign of $pc^m$, $m = 1, \ldots, M$, indicates the likely class, namely, when $pc^m(x^m) > 0$, sample $x^m$ is likely to be of class $+1$, and vice versa. The absolute value of $\{pc^m\}_m$ represents the confidence in this likelihood.

Definition 2 (PCK): Let $pc^m(x^m_i)$ and $pc^m(x^m_j)$ be the PCs of points $x^m_i$ and $x^m_j$ in $H_m$, respectively. The PCK in $H_m$ is defined as

$$P_m\!\left(x^m_i, x^m_j\right) = \frac{\exp\!\left(a^m\, pc^m(x^m_i)\, pc^m(x^m_j)\right)}{\left(\sum_{k=1}^{M} \exp^{p}\!\left(a^k\, pc^k(x^k_i)\, pc^k(x^k_j)\right)\right)^{\frac{1}{p}}} \tag{6}$$

where $a^m > 0$, $m = 1, \ldots, M$.

Proposition 1: The PCKs $\{P_m\}_m$ defined in Definition 2 are Mercer kernels.

Proof: For any $m$, let $\tilde{K}_1(x^m_i, x^m_j) = pc^m(x^m_i)\, pc^m(x^m_j)$. Consider the one-dimensional feature map $\phi: x^m \rightarrow pc^m(x^m) \in \mathbb{R}$; then, according to [20, Th. 3.11], $\tilde{K}_1$ is the corresponding Mercer kernel. Let $\tilde{K}_2 = a^m \tilde{K}_1$, where $a^m \in \mathbb{R}_+$. Recall that a sufficient and necessary condition for a matrix to be a Mercer kernel is that it is symmetric and positive semidefinite. The symmetry of $\tilde{K}_2$ follows from the symmetry of $\tilde{K}_1$. Then, for any vector $z \in \mathbb{R}^N$, we have $z^t a^m \tilde{K}_1 z = a^m z^t \tilde{K}_1 z \ge 0$, verifying that $\tilde{K}_2$ is a Mercer kernel. Consequently, $\tilde{K}_3$ with $\tilde{K}_3(x^m_i, x^m_j) = \exp(\tilde{K}_2(x^m_i, x^m_j))$ is a Mercer kernel according to [20, Proposition 3.24]. Similar to $\tilde{K}_2$, let

$$b = \frac{1}{\left(\sum_{k=1}^{M} \exp^{p}\!\left(a^k\, pc^k(x^k_i)\, pc^k(x^k_j)\right)\right)^{\frac{1}{p}}}$$

and we have $b \in \mathbb{R}_+$. Thus, $P_m = b\tilde{K}_3$ is a Mercer kernel.

Remark 2: The PCK-based class attribute similarity between $x^m_i$ and $x^m_j$ behaves as follows.
1) When $pc^m(x^m_i)\, pc^m(x^m_j) < 0$, the value of $P_m(x^m_i, x^m_j)$ will be small, corresponding to higher confidence of belonging to different classes.
2) When $pc^m(x^m_i)\, pc^m(x^m_j) \gg 0$, the value of $P_m(x^m_i, x^m_j)$ will be large, corresponding to higher confidence of belonging to the same class.
3) When $pc^m(x^m_i)\, pc^m(x^m_j) > 0$, the value of $P_m(x^m_i, x^m_j)$ will be moderate, corresponding to moderate confidence of belonging to the same class.

Remark 3: The exponential formulation ensures the nonnegativity of the localized kernel weights. The value of $p$, selected on a separate validation data set, can reveal the intrinsic sparsity of the given set of base kernels.

Remark 4: The class probability density distributions, namely, $p^m(x^m \mid +1)$ and $p^m(x^m \mid -1)$, are obtained by kernel density estimation [33]

$$p^m(x^m \mid \pm 1) = p_{\sigma^m_{\pm1}}(x^m) = \frac{1}{n\,\sigma^m_{\pm1}} \sum_{\{i \mid y_i = \pm 1\}} k\!\left(\frac{s^m(x^m, x^m_i)}{\sigma^m_{\pm1}}\right) \tag{7}$$

where $s^m(x^m, x^m_i)$ is a certain similarity measure in space $H_m$. Given the initialization $(s^m)^0(x^m_i, x^m_j) = K_m(x^m_i, x^m_j)$, we refine it following [17] so as to regularize the distribution under the two classes

$$(s^m)^{t+1}\!\left(x^m_i, x^m_j\right) := (\delta^m_i)^t\, (s^m)^t\!\left(x^m_i, x^m_j\right)\, (\delta^m_j)^t \tag{8}$$

where

$$(\delta^m_i)^{t+1} = (\delta^m_i)^t \left(\frac{\frac{1}{N}\sum_{l=1}^{N}\sum_{j=1}^{N}(s^m)^t\!\left(x^m_l, x^m_j\right)}{\sum_{j=1}^{N}(s^m)^t\!\left(x^m_i, x^m_j\right)}\right)^{\frac{1}{2}}. \tag{9}$$

The aforementioned iteration is terminated when

$$\left|\left(\sum_{i=1}^{N}\sum_{j=1}^{N}(s^m)^t\!\left(x^m_i, x^m_j\right)\right)^{\frac{1}{N}} - \left(\sum_{i=1}^{N}\sum_{j=1}^{N}(s^m)^{t+1}\!\left(x^m_i, x^m_j\right)\right)^{\frac{1}{N}}\right| < \varepsilon. \tag{10}$$

Here, $\{\sigma^m_{\pm1}\}_m$ are data-driven parameters called bandwidths, which determine the fitness of the model to the underlying data. The bandwidths, together with the final classifier, can be efficiently optimized by a gradient descent wrapping a canonical SVM solver. The detailed formulation and learning are shown in Sections III-B and III-C.
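The PCK of Definition 2 can be assembled from precomputed probability confidences as in the sketch below, where pc is assumed to be an N × M array holding $pc^m(x^m_i)$ and a and p play the roles of $a^m$ and the $l_p$ norm in (6); this is an illustrative implementation under those assumptions, not the authors' code.

```python
import numpy as np

def pck_matrices(pc, a, p):
    """Build the M probability confidence kernels of eq. (6).

    pc : array (N, M), pc[i, m] = pc^m(x^m_i)
    a  : array (M,), positive scale parameters a^m
    p  : norm parameter of the l_p constraint
    """
    # Outer products pc^m(x_i) * pc^m(x_j) for every feature channel m
    prods = np.einsum("im,jm->mij", pc, pc)        # shape (M, N, N)
    num = np.exp(a[:, None, None] * prods)         # exp(a^m pc^m_i pc^m_j)
    den = np.sum(num ** p, axis=0) ** (1.0 / p)    # (sum_k exp(...)^p)^(1/p)
    return num / den                               # stack of P_m, shape (M, N, N)

# The localized kernel of eq. (4) is then the elementwise combination
# K_P = np.sum(pck_matrices(pc, a, p) * np.stack(K_list), axis=0)
```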

B. Formulation in SVM-Based LMKL

Given $M$ different feature mappings $\phi_m: X \rightarrow H_m$, $m = 1, \ldots, M$, each of which is endowed with the reproducing kernel $K_m$ of $H_m$, the discriminative function corresponding to kernel (3) is of the form

$$f(x) = \sum_{m=1}^{M}\left\langle w_m,\ \phi_{\beta^m}(x^m)\right\rangle + b = \sum_{m=1}^{M}\left\langle w_m,\ \beta^m_{x^m}\phi_m(x^m)\right\rangle + b. \tag{11}$$

The idea in SVM-based LMKL is to learn the optimal values of $w_m$ and $b$, as well as to estimate $\{\beta^m_{x^m_i}\}_{m=1,\ldots,M;\ i=1,\ldots,N}$, that maximize the margin while minimizing the hinge loss on the training data. This can be achieved by solving the following primal problem:

$$\begin{aligned}
\min_{\{w_m\},\,b,\,\xi,\,\{\beta^m_{x^m}\}}\ & \frac{1}{2}\sum_{m=1}^{M}\|w_m\|^2_2 + C\|\xi\|_1 + \mu\,\Omega(\mathbf{P})\\
\text{s.t.}\ & y_i\!\left(\sum_{m=1}^{M}\left\langle w_m,\ \beta^m_{x^m_i}\phi_m(x^m_i)\right\rangle + b\right) \ge 1-\xi_i\quad \forall i\\
& \xi \ge 0
\end{aligned} \tag{12}$$

where $\Omega(\mathbf{P})$ is the Tikhonov regularization [34]. According to definition (6), we have $\Omega(\mathbf{P}) = |P(x_i, x_j)|_p$ $\forall i, j$ here.

C. Optimization Strategy

In order to inherit the efficiency of existing alternating approaches [2], [4], we follow the standard procedure of formulating the primal (12) as a nested two-step optimization. This can be achieved by rewriting the primal (12) as follows:

$$\min_{\mathbf{P}}\ J(\mathbf{P}) + \mu\,\Omega(\mathbf{P}) \tag{13}$$

where

$$J(\mathbf{P}) = \left\{\begin{aligned}
\min_{\{w_m\},\,b,\,\xi}\ & \frac{1}{2}\sum_{m=1}^{M}\|w_m\|^2_2 + C\|\xi\|_1\\
\text{s.t.}\ & y_i\!\left(\sum_{m=1}^{M}\left\langle w_m,\ \beta^m_{x^m_i}\phi_m(x^m_i)\right\rangle + b\right) \ge 1-\xi_i\ \ \forall i\\
& \xi \ge 0.
\end{aligned}\right. \tag{14}$$

The dual problem is a key point for deriving the optimization algorithm [35].

Theorem 1: The dual formulation of $J$ is of the form

$$\min_{\mathbf{P}}\ W(\mathbf{P}) + \mu\,\Omega(\mathbf{P}) \tag{15}$$

where

$$W(\mathbf{P}) = \left\{\begin{aligned}
\max_{\alpha}\ & \mathbf{1}^t\alpha - \frac{1}{2}(\alpha\circ y)^t\!\left(\sum_{m=1}^{M}P_m\circ K_m\right)\!(\alpha\circ y)\\
\text{s.t.}\ & \alpha^t y = 0,\quad 0\le\alpha\le C
\end{aligned}\right. \tag{16}$$

and $\circ$ denotes the elementwise product.

Proof: According to problem (14), we have

$$C\|\xi\|_1 = \max_{\alpha\in[0,C]^N}\alpha^t\xi. \tag{17}$$

Thus

$$\min_{\xi} C\|\xi\|_1 = \min_{\xi}\max_{\alpha\in[0,C]^N}\alpha^t\xi = \sum_{i=1}^{N}\max_{\alpha_i\in[0,C]}\alpha_i\!\left(1 - y_i\!\left(\sum_{m=1}^{M}\left\langle w_m,\ \beta^m_{x^m_i}\phi_m(x^m_i)\right\rangle + b\right)\right). \tag{18}$$

Substituting (18) in (14), we have

$$\max_{\alpha\in[0,C]^N}\ \min_{\{w_m\},\,b}\ \sum_{i=1}^{N}\alpha_i(1-y_i b) + \sum_{m=1}^{M}\left(\frac{1}{2}\|w_m\|^2_2 - \sum_{i}\alpha_i y_i\left\langle w_m,\ \beta^m_{x^m_i}\phi_m(x^m_i)\right\rangle\right). \tag{19}$$

Since problem (19) is convex in $w_m$ and $b$, by setting to zero the gradient of (19) with respect to $w_m$ and $b$, we get

$$\text{(a)}\ \ w_m = \sum_{i=1}^{N} y_i\,\alpha_i\,\beta^m_{x^m_i}\,\phi_m(x^m_i),\qquad \text{(b)}\ \ \sum_{i=1}^{N}\alpha_i y_i = 0. \tag{20}$$

Substituting (20) in (19) leads to the theorem.

In the inner loop, $\mathbf{P}$ is fixed, and problem (16) can be identified as the standard SVM dual formulation using the localized combination of base kernels $K_{\mathbf{P}} = \sum_{m=1}^{M} P_m \circ K_m$. Hence, $\alpha$ can be conveniently obtained by any off-the-shelf SVM solver. In the outer loop, with the optimal $\alpha$ obtained from the aforementioned procedure, strong duality holds between $J$ and $W$, i.e., $J(\mathbf{P}) = W(\mathbf{P})$, for any given $\{P_m\}_{m=1,\ldots,M}$. As stated in [4], $W(\mathbf{P})$ is differentiable if the SVM solution is unique. Such a condition can be guaranteed by the fact that $K_{\mathbf{P}}$ is strictly positive definite. However, $K_{\mathbf{P}}$ cannot always be guaranteed to be strictly positive definite. Similar to [36], this issue is addressed by first computing the eigenvalues of $K_{\mathbf{P}}$; if the smallest one is zero, we add the smallest nonzero eigenvalue to the diagonal of $K_{\mathbf{P}}$. We then use the gradient-descent method to train the weighting function. The derivatives of (13) and (15) with respect to $a^m$ and $\sigma^m_{\pm1}$ are as follows:

$$\frac{\partial J(\mathbf{P})}{\partial a^m} = \frac{\partial W(\mathbf{P})}{\partial a^m} = -\frac{1}{2}\,(\alpha\circ y\circ pc^m)^t\left(\sum_{k} P_k\circ K_k\circ\Big(\Delta_{km}-\underbrace{P_m\circ\cdots\circ P_m}_{p}\Big)\right)(\alpha\circ y\circ pc^m) \tag{21}$$

$$\frac{\partial J(\mathbf{P})}{\partial \sigma^m_{\pm1}} = \frac{\partial W(\mathbf{P})}{\partial \sigma^m_{\pm1}} = -a^m\left(\alpha\circ y\circ \frac{\partial pc^m}{\partial \sigma^m_{\pm1}}\right)^{\!t}\left(\sum_{k} P_k\circ K_k\circ\Big(\Delta_{km}-\underbrace{P_m\circ\cdots\circ P_m}_{p}\Big)\right)(\alpha\circ y\circ pc^m) \tag{22}$$

where $\Delta_{km}$ is the all-one matrix when $m = k$ and the all-zero matrix otherwise, and $\partial pc^m/\partial \sigma^m_{\pm1} = \pm\nabla_{\sigma^m_{\pm1}} p_{\sigma^m_{\pm1}}$. In this paper, we make explicit use of the spherical Gaussian kernel $G_{\sigma^m_{\pm1}}(x^m)$

$$p_{\sigma^m_{\pm1}}(x^m) = \frac{1}{N\sigma^m_{\pm1}} \sum_{\{i \mid y_i = \pm1\}} G_{\sigma^m_{\pm1}}(x^m) = \frac{1}{N\sigma^m_{\pm1}} \sum_{\{i \mid y_i = \pm1\}} \exp\!\left(\frac{s^m(x^m, x^m_i)}{\sigma^m_{\pm1}}\right). \tag{23}$$

The derivative of $p_{\sigma^m_{\pm1}}(x^m)$ is

$$\nabla_{\sigma^m_{\pm1}} p_{\sigma^m_{\pm1}}(x^m) = -\frac{1}{\sigma^m_{\pm1}} G_{\sigma^m_{\pm1}}(x^m) - \frac{1}{N(\sigma^m_{\pm1})^3} \sum_{\{i \mid y_i = \pm1\}} s^m(x^m, x^m_i)\exp\!\left(\frac{s^m(x^m, x^m_i)}{\sigma^m_{\pm1}}\right). \tag{24}$$

The optimization algorithm of PCK-LMKL is summarized in Algorithm 1. Note that PCK-LMKL cannot guarantee convergence to the global optimum, and the initialization of $\{a^m, \sigma^m_{\pm1}\}_{m=1,\ldots,M}$ may affect the solution quality. The stopping criterion is based on the duality gap $\varepsilon$. As an interleaved procedure, the algorithm can be terminated when [4]

$$\max_m\ (\alpha\circ y)^t K_m (\alpha\circ y) - (\alpha\circ y)^t\!\left(\sum_{m} P_m\circ K_m\right)\!(\alpha\circ y) < \varepsilon. \tag{25}$$
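The two practical safeguards just described can be sketched in a few lines, assuming $\alpha$, $y$, the base kernels $K_m$, and the PCK weights $P_m$ are already available as NumPy arrays; this is an illustration of the diagonal shift and of the duality-gap test (25), not the authors' implementation.

```python
import numpy as np

def make_strictly_pd(K_P, tol=1e-12):
    """If the smallest eigenvalue of K_P is zero, add the smallest nonzero
    eigenvalue to its diagonal (cf. the discussion in Section III-C)."""
    eigvals = np.linalg.eigvalsh(K_P)
    if np.isclose(eigvals.min(), 0.0):
        shift = eigvals[eigvals > tol].min()
        K_P = K_P + shift * np.eye(K_P.shape[0])
    return K_P

def duality_gap(alpha, y, K_list, P_list):
    """Stopping quantity of eq. (25)."""
    v = alpha * y                                # alpha o y
    single = max(v @ K @ v for K in K_list)      # max_m (a o y)^t K_m (a o y)
    combined = v @ sum(P * K for P, K in zip(P_list, K_list)) @ v
    return single - combined

# terminate the outer loop of Algorithm 1 when duality_gap(...) < eps
```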

At test time, given a new sample $z = [z^1, \ldots, z^M]$, $\{P_m(z^m, x^m_i)\}_i$ can be directly obtained by calculating (6) with the learned parameters $\{a^m, \sigma^m_{\pm1}\}_m$.

IV. EXPERIMENTS

This section investigates the effectiveness of PCK-LMKL in three sets of experiments: 1) for an intuitive illustration of the proposed approach, we compare the classification boundaries and support vectors computed by the SVM, SimpleMKL [4], G-LMKL [14], and PCK-LMKL on a synthesized two-dimensional two-class toy data set; 2) for a further statistical significance study, we compare PCK-LMKL against SimpleMKL and G-LMKL, as well as three state-of-the-art local learning approaches, on ten benchmark machine learning data sets (UCI data sets); and 3) to assess the applicability of PCK-LMKL to challenging real-world problems, we evaluate PCK-LMKL on the 15-scene data set and the Caltech-101 data set. The main body of PCK-LMKL is implemented in Matlab, and the associated optimization problem is solved using the Mosek software. The step size of each iteration is obtained by line search. The relative duality gap is set to $10^{-4}$, together with a maximum iteration number of 50 for termination.

Fig. 2. Toy data set: (Colored solid lines) The three types of regions involved in classification: (1) densely distributed separable regions, (2) less densely distributed overlapping regions, and (3) sparsely distributed regions, together with the Gaussian distributions from which data are sampled and (gray dashed lines) the optimal Bayesian decision boundary.

A. Toy Data Set

We first synthesize a two-dimensional two-class toy data set, which consists of 200 samples drawn from each of four Gaussian distributions with covariance matrix $(0.8, 0.0; 0.0, 2.0)$ and means $(3.0, 1.0)$, $(1.0, 1.0)$, $(1.0, 2.5)$, and $(3.0, 2.5)$. As shown in Fig. 2, classification on this toy data set mainly involves three types of regions: 1) densely distributed separable regions; 2) less densely distributed overlapping regions; and 3) sparsely distributed regions. Performance in region 1) reflects the capability of capturing the main separable trend of the given training data, while in regions 2) and 3), there is either no clear separable trend or only sparsely distributed data. Hence, performance in these two regions reflects the capability of extending to an unseen data set. Given the true distribution of the samples, the Bayesian boundary has the minimum generalized classification error [37], i.e., the Bayesian boundary is the optimal solution both for characterizing the local training data and for generalizing to unseen test data. Hence, how well the computed boundaries match the optimal Bayesian boundary is used as an intuitive evaluation. Moreover, the number of stored support vectors reflects the algorithm's testing efficiency. Since the two dimensions can be thought of as two different "views" on the toy data set, we use two Gaussian kernels with the width being half the mean squared distance of the associated view. Certainly, more sophisticated model selection approaches, e.g., [38] and [39], could be used.

Fig. 3 shows the classification boundaries and support vectors computed by the single-view-based canonical SVM and the two-view-based multiple kernel approaches: SimpleMKL [4], G-LMKL [14], and the two versions of our PCK-LMKL. Because the projection of the two classes onto either view alone is heavily overlapping, the two single-kernel-based canonical SVMs show poor classification boundaries and store more support vectors, as shown in Fig. 3(a) and (b). The four multiple kernel approaches all show a good approximation to the optimal Bayesian boundary in region 1), demonstrating the capability of characterizing the main separable trend of the training data. In region 3), sample-specific kernel combination, e.g., G-LMKL and PCK-LMKL (with $p = 1$), overfits the sparsely distributed samples, while the global SimpleMKL (with weights of 0.58 and 0.42) and PCK-LMKL (with $p = 5$) show a good approximation to the Bayesian boundary. The superiority of the global SimpleMKL could be due to the Gaussian distribution of the data, which makes the global trend determined by the majority of the data in region 1) continue into region 3) with better generalization. For PCK-LMKL, by tuning the norm $p$, it also obtains a good approximation to the Bayesian boundary in this region. Moreover, in region 2), which is a great challenge to the other methods, only our PCK-LMKL (with $p = 5$) achieves a good approximation to the Bayesian boundary. In terms of stored support vectors, PCK-LMKL is similar to the efficient SimpleMKL and G-LMKL. This intuitively demonstrates the effectiveness of PCK-LMKL in both characterizing the training data and extending to unseen test data, as well as its testing efficiency.

Fig. 3. Toy data set: Separating (gray solid lines) hyperplanes and (filled points) support vectors calculated by SimpleMKL, G-LMKL, and two versions of PCK-LMKL. Gray dashed lines show the Gaussian distributions from which data are sampled and the optimal Bayesian decision boundary. (a) SVM for view 1. (b) SVM for view 2. (c) SimpleMKL. (d) G-LMKL. (e) PCK-LMKL ($p = 1$). (f) PCK-LMKL ($p = 5$).
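The toy setup above can be reproduced with the short script below; the sampling parameters are taken from this subsection, the kernel width follows one reading of "half the mean squared distance of the associated view," and the assignment of clusters to classes is an assumption for illustration, since the text does not spell it out.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[0.8, 0.0], [0.0, 2.0]])
means = [(3.0, 1.0), (1.0, 1.0), (1.0, 2.5), (3.0, 2.5)]

# 200 samples from each of the four Gaussians (Section IV-A)
clusters = [rng.multivariate_normal(m, cov, size=200) for m in means]
X = np.vstack(clusters)
# NOTE: which clusters form which class is not stated in the text;
# the alternating assignment below is only an assumption.
y = np.concatenate([np.full(200, c) for c in (+1, -1, +1, -1)])

def view_kernel(v):
    """Gaussian kernel on one view, width = half the mean squared distance."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (0.5 * d2.mean()))

K1, K2 = view_kernel(X[:, 0]), view_kernel(X[:, 1])   # the two base kernels
```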

B. UCI Data Sets

We then evaluate PCK-LMKL on ten UCI data sets, i.e., "Banana," "Germannumeric," "Heart," "Ionosphere," "Liverdisorder," "Pima," "Ringnorm," "Sonar," "Spambase," and "Wdbc." Following the experimental methodology in [14], given a data set, a random split into two-thirds training and one-third testing data is prepared. The training data are normalized to have zero mean and unit variance, and the testing data are then normalized using the mean and variance of the training data. We generate Gaussian kernels with the same kernel setting as in Section IV-A for each individual dimension of the feature vector. For each training set, the regularization parameter $C$ in the SVM and $p$ in the PCK are jointly optimized using 5 × 2 cross-validation over a grid of values: $C = \{0.01, 0.1, 1, 10, 100\}$ and $p = \{1, 2, 3, 4, 5, 6, 7, 8\}$. The final results are reported as the mean performances over the ten 5 × 2 training folds.

Given a pair of samples, Fig. 4 shows the weights over the given set of base kernels on the validation set of the Wdbc data set with various $p$ norms, where, for each $p$, the PCK learned with the optimal $C$ value is plotted. We can observe that, as $p$ increases, the sparsity of the learned weights decreases.

Fig. 4. Wdbc data set: Entries of PCKs learned on the training set with various $p$ norms for a given pair of samples.

Next, we plot in Fig. 5 the testing accuracies on the validation set for each of the ten UCI data sets over various $p$ norms, where, again, for each $p$, the $C$ value that gives the best testing accuracy on the validation set is plotted. It is clear that the testing accuracies are quite different under varied $p$ norms and that, in all cases, $p > 1$ outperforms its $p = 1$ counterpart. This is because the base kernels are often complementary to each other for classification. Hence, learning $p$ in the $l_p$-norm constraint using a validation set can reveal the intrinsic sparsity of the base kernels.

Fig. 5. UCI data sets: Learning the norm $p$ for each UCI data set on their respective validation sets.

Finally, we show an example of the evolution of the PCK matrices computed on the training set of the Spambase data set in Fig. 6. We can observe that, as the iteration progresses, similarities between samples of the same class are strengthened.

Fig. 6. Spambase data set: An example of the evolution of the PCK $\sum_{m=1}^{M} P_m$. (a) Initial. (b) Iteration 1. (c) Iteration 2. (d) Iteration 3.

Table I summarizes the average testing accuracies and support vector percentages for the two baseline approaches, i.e., SimpleMKL and G-LMKL [14], and the two versions of PCK-LMKL on the ten UCI data sets. For more reliable results than would be expected by chance, we use three kinds of statistical tests, namely, the t-test [40], direct comparison, and Wilcoxon's signed rank test [41]. In the t-test and direct comparison, we use W-T-L to record the counts of win-tie-loss of our approach against the other methods on the ten benchmark machine learning data sets. In Wilcoxon's signed rank test, W means that our approach has a statistically significant superiority to the compared method, T means that there is no statistically significant difference between the two compared methods, and L means that our approach has a statistically significant inferiority to the compared method. Specifically, in terms of testing accuracy, using the 5 × 2 cross-validation paired t-test, PCK-LMKL ($p > 1$) obtains 8-2-0 against SimpleMKL, that is to say, PCK-LMKL ($p > 1$) outperforms SimpleMKL on eight out of ten data sets and achieves the same performances as SimpleMKL on two out of ten data sets.
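For reference, the 5 × 2 cross-validation paired t-test used above (Dietterich [40]) can be computed from the per-fold accuracy differences as in the sketch below; the two (5, 2) accuracy arrays are assumed to come from the ten 5 × 2 folds of two competing classifiers. The Wilcoxon signed-rank comparison over the ten data sets can likewise be run with scipy.stats.wilcoxon.

```python
import numpy as np
from scipy import stats

def five_by_two_cv_ttest(acc_a, acc_b):
    """Dietterich's 5x2cv paired t-test [40].

    acc_a, acc_b : arrays of shape (5, 2) with the accuracies of two
                   classifiers on the 2 folds of each of the 5 replications.
    Returns the t statistic (5 degrees of freedom) and a two-sided p value.
    """
    p = np.asarray(acc_a, float) - np.asarray(acc_b, float)  # differences p_i^(j)
    p_bar = p.mean(axis=1, keepdims=True)                    # per-replication mean
    s2 = np.sum((p - p_bar) ** 2, axis=1)                    # variance estimates s_i^2
    t = p[0, 0] / np.sqrt(s2.mean())                         # p_1^(1) / sqrt(mean s_i^2)
    return t, 2 * stats.t.sf(abs(t), df=5)
```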

Similarly, PCK-LMKL ($p > 1$) outperforms G-LMKL and PCK-LMKL ($p = 1$) on seven and eight out of ten data sets, respectively, and achieves the same performances as them on three and two out of ten data sets, respectively (namely, 7-3-0 and 8-2-0). Using direct comparison, PCK-LMKL ($p > 1$) shows a similar superiority, i.e., 8-0-2, 9-0-1, and 9-1-0, against SimpleMKL, G-LMKL, and PCK-LMKL ($p = 1$), respectively. The significance is further verified by Wilcoxon's signed rank test (namely, W). The improvements of PCK-LMKL over the global SimpleMKL can be attributed to the diverse locality of the underlying data. As reported in [14], G-LMKL only shows statistically similar performance to SimpleMKL. Hence, the improvements of PCK-LMKL over the local G-LMKL demonstrate the ability of PCK-LMKL to capture such diversity. This can be seen in Fig. 7, which shows five different entries in the PCKs, corresponding to the localized weight distributions of five different pairs of samples. In terms of efficiency, the results of Wilcoxon's signed rank test on the ten data sets show that PCK-LMKL ($p > 1$) stores a number of support vectors comparable to the efficient SimpleMKL and G-LMKL (namely, T), justifying the testing efficiency of PCK-LMKL. Table II lists the exact p values for the 5 × 2 cross-validation paired t-test over each of the ten data sets and for Wilcoxon's signed rank test over all ten data sets.

We also compare the testing accuracies of PCK-LMKL to our in-house implementations of three state-of-the-art local learning methods [10]–[12]. These local models are all specifically designed for each training sample; hence, local models for a testing sample could not be obtained without the use of additional heuristics, such as elaborate nearest neighbor search [10], feature space smoothing [12], or nearest-neighbor-based referencing to the training set [11], which would significantly increase the complexity of the algorithms. As shown in the left part of Table I, by integrating the characterization of diverse training data with effective extension to unseen test data, PCK-LMKL consistently outperforms the three local learning approaches on all ten UCI data sets.

TABLE I. AVERAGE ACCURACIES AND SUPPORT VECTOR PERCENTAGES ON THE UCI DATA SETS.

Fig. 7. Ringnorm data set: An example of five different entries in PCKs (corresponding to the localized weight distribution of five different pairs of samples) with the optimal $\{p, C\}$ combination on the validation set.

TABLE II. p-VALUES FOR THE TWO STATISTICAL TESTS ON THE UCI DATA SETS. (Notice the difference between p values and p norms used in this paper: the former refers to the p values in the statistical tests, and the latter refers to the values of $p$ in the $l_p$-norm constraint imposed on the kernel weights.)

C. Computer Vision Data Sets

We finally evaluate the performances of PCK-LMKL on two challenging computer vision data sets.

1) 15-scene: This is a data set of 4485 images over 15 natural scene classes provided by Lazebnik et al. [42]. We follow the standard setup by using 100 images in each class for training and the rest for testing.

2) Caltech-101: This is a data set of 9197 images over 101 object categories and one background category assembled by Fei-Fei et al. [43]. We respectively use 5, 10, 15, 20, 25, and 30 images from each class for training and the rest for testing.

Fig. 8. Computer vision data sets: (a) Sample images from the 15-scene data set and (b) sample images from the Caltech-101 data set.

Fig. 8 shows sample images from each data set. For both data sets, we use eight state-of-the-art descriptors, i.e., GIST, histogram of oriented gradients 2 × 2, dense SIFT, sparse SIFT, line histograms, self-similarity, texton, and geometric map. The associated kernels are computed based on the radial basis function distance for GIST and $\chi^2$ distances for all the others and are then normalized to unit trace. The regularization parameter in the SVM is set to $C = 100$, as it provides reasonable performances on the test set. The norm $p$ is learned on the training set using fivefold cross-validation. The multiclass classification is done using the one-versus-all rule.

Fig. 9. Comparison on computer vision data sets. (a) 15-scene data set: Performances of eight state-of-the-art features and two multiple kernel classifiers. (b) Caltech-101 benchmark comparison: The performance of PCK-LMKL along with the most recently reported results.

TABLE III. CLASSIFICATION ACCURACIES AND COMPUTATIONAL TIMES ON COMPUTER VISION DATA SETS.
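The kernel construction just described can be sketched as follows; chi2_kernel is scikit-learn's exponential chi-squared kernel, the unit-trace normalization simply rescales each Gram matrix, and the descriptor matrices H and G are illustrative placeholders rather than the actual features used in the paper.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel, rbf_kernel

def unit_trace(K):
    """Normalize a Gram matrix to unit trace, as done for all base kernels."""
    return K / np.trace(K)

rng = np.random.default_rng(0)
H = rng.random((50, 300))        # nonnegative histogram features (e.g., dense SIFT)
G = rng.normal(size=(50, 512))   # GIST-like descriptors

K_sift = unit_trace(chi2_kernel(H))   # chi-squared-based kernel for histograms
K_gist = unit_trace(rbf_kernel(G))    # RBF kernel on the GIST view
```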

The performances of the aforementioned features, together with two multiple kernel classifiers, are compared in Fig. 9(a), where the training sets are consistently decreased while the testing sets are kept unchanged. One multiple kernel classifier is based on a set of prior weights proportional to the performance of the associated individual feature [44], and the other is the proposed PCK-LMKL. However, it is difficult, if not impossible, to evaluate the performance of each feature before combining them, particularly when the number of features is huge. Since the aim of MKL is precisely to single out the discriminative power of each feature automatically and combine them into a strengthened classifier, the multiple kernel classifier in [44] serves as the "ground truth." It is clear in Fig. 9(a) that both multiple kernel classifiers yield significant improvements over each individual feature, and PCK-LMKL achieves performances comparable to those in [44], demonstrating that the weights learned by our approach can indeed reveal the discriminative power of the associated features.

In Fig. 9(b), the accuracy rates of several recent techniques, including ours, on Caltech-101 are plotted over different numbers of training data. The recognition rates of PCK-LMKL are either better than or comparable to those of other published systems.

SimpleMKL and G-LMKL are known to be efficient [4], [14]. When the values of $N_{\text{train}}$ are 100 and 15 for 15-scene and Caltech-101, respectively, we compare PCK-LMKL with SimpleMKL and G-LMKL in Table III. In terms of recognition rates, again, due to the possibly large intraclass appearance variance in real-world data sets, PCK-LMKL achieves more than 2% and 3% improvements over SimpleMKL and G-LMKL, respectively. In terms of computational costs, PCK-LMKL has the same order of computational time as SimpleMKL and a computational time that is one order lower than that of G-LMKL.


Due to the online computation of the PCK for each category, the computational time gap between SimpleMKL and PCK-LMKL increases with the number of categories, namely, from 1.14 × 10² s on 15-scene to 2.19 × 10³ s on Caltech-101. Hence, further efficiency could be gained in PCK-LMKL if such information were precomputed offline.

V. CONCLUSION

As we have argued, both the characterization of individual training samples and the effective extension to unseen test samples should be equally addressed in LMKL, and we have proposed a PCK that serves both goals at once. For the facet of locality characterization, the philosophy is that the classification of local samples can rely not only on their spatial position but also on their latent class attribute. Based on probability density estimation, the PCK with an arbitrary $l_p$-norm constraint is defined to measure the confidence that two samples belong to the same or different classes. Selecting the norm $p$ on separate validation data reveals the intrinsic sparsity of the underlying data with further improved performances. In addition to characterizing the diverse individuality, the probability-density-distribution-based PCK also accommodates the universality shared within the whole space. Hence, for the facet of extension to unseen data, a statistically meaningful extension to the test set is conveniently obtained. We also incorporated the PCK into the existing SVM-based MKL framework. In terms of empirical performance, we compared PCK-LMKL to SimpleMKL and several state-of-the-art local learning algorithms. PCK-LMKL has been shown to achieve state-of-the-art performances on various classification problems, with computational costs comparable to the efficient SimpleMKL. In the future, we will further incorporate the PCK into other recent formulations, such as nonlinear kernel combination and multiclass and multilabel MKL, as well as build the PCK in a more sophisticated form.

REFERENCES

[1] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. E. Ghaoui, and M. I. Jordan, "Learning the kernel matrix with semidefinite programming," J. Mach. Learn. Res., vol. 5, pp. 27–72, Dec. 2004.
[2] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, "Large scale multiple kernel learning," J. Mach. Learn. Res., vol. 7, pp. 1531–1565, 2006.
[3] A. Zien and C. S. Ong, "Multiclass multiple kernel learning," in Proc. ICML, 2007, pp. 1191–1198.
[4] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," J. Mach. Learn. Res., vol. 9, pp. 2491–2521, Nov. 2008.
[5] C. Cortes, M. Mohri, and A. Rostamizadeh, "Learning non-linear combinations of kernels," in Proc. NIPS, 2009, pp. 396–404.
[6] M. Varma and B. R. Babu, "More generality in efficient multiple kernel learning," in Proc. ICML, 2009, pp. 1065–1072.
[7] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola, "On kernel-target alignment," in Proc. NIPS, 2001, pp. 367–373.
[8] Z. Xu, R. Jin, H. Yang, I. King, and M. R. Lyu, "Simple and efficient multiple kernel learning by group lasso," in Proc. ICML, 2010, pp. 1175–1182.
[9] Y. Tang, L. Li, and X. Li, "Learning similarity with multikernel method," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 1, pp. 131–138, Feb. 2011.
[10] A. Frome, Y. Singer, F. Sha, and J. Malik, "Learning globally-consistent local distance functions for shape-based image retrieval and classification," in Proc. ICCV, 2007, pp. 1–8.
[11] Y.-Y. Lin, T.-L. Liu, and C.-S. Fuh, "Local ensemble kernel learning for object category recognition," in Proc. CVPR, 2007, pp. 1–8.
[12] M. Christoudias, R. Urtasun, and T. Darrell, "Bayesian localized multiple kernel learning," Univ. California Berkeley, Berkeley, CA, 2009.

[13] Y.-Y. Lin, J.-F. Tsai, and T.-L. Liu, "Efficient discriminative local learning for object recognition," in Proc. ICCV, 2009, pp. 598–605.
[14] M. Gönen and E. Alpaydin, "Localized multiple kernel learning," in Proc. ICML, 2008, pp. 352–359.
[15] Y. Han and G. Liu, "Efficient learning of sample-specific discriminative features for scene classification," IEEE Signal Process. Lett., vol. 18, no. 11, pp. 683–686, Nov. 2011.
[16] R. Sinkhorn, "A relationship between arbitrary positive matrices and doubly stochastic matrices," Ann. Math. Statist., vol. 35, no. 2, pp. 876–879, 1964.
[17] H. Jegou, C. Schmid, H. Harzallah, and J. J. Verbeek, "Accurate image search using the contextual dissimilarity measure," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 2–11, Jan. 2010.
[18] W.-F. Zhang, D.-Q. Dai, and H. Yan, "Framelet kernels with applications to support vector regression and regularization networks," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 4, pp. 1128–1144, Aug. 2010.
[19] B. Schölkopf and A. J. Smola, Learning with Kernels, T. Dietterich, Ed. Cambridge, MA: MIT Press, 2002.
[20] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[21] B. E. Boser, I. Guyon, and V. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. COLT, 1992, pp. 144–152.
[22] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, Sep. 1995.
[23] F. Girosi, "An equivalence between sparse approximation and support vector machines," Neural Comput., vol. 10, no. 6, pp. 1455–1480, Aug. 1998.
[24] K. P. Bennett, M. Momma, and M. J. Embrechts, "MARK: A boosting algorithm for heterogeneous kernel models," in Proc. KDD, 2002, pp. 24–31.
[25] L. Cao, J. Luo, F. Liang, and T. S. Huang, "Heterogeneous feature machines for visual recognition," in Proc. ICCV, 2009, pp. 1095–1102.
[26] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, "Multiple kernel learning, conic duality, and the SMO algorithm," in Proc. ICML, 2004, p. 6.
[27] M. Varma and D. Ray, "Learning the discriminative power-invariance trade-off," in Proc. ICCV, 2007, pp. 1–8.
[28] Z. Xu, R. Jin, I. King, and M. R. Lyu, "An extended level method for efficient multiple kernel learning," in Proc. NIPS, 2009, pp. 1–8.
[29] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien, "Efficient and accurate lp-norm multiple kernel learning," in Proc. NIPS, 2009, pp. 997–1005.
[30] S. Vishwanathan, Z. Sun, and N. Theera-Ampornpunt, "Multiple kernel learning and the SMO algorithm," in Proc. NIPS, 2010, pp. 1–9.
[31] F. Yan, K. Mikolajczyk, M. Barnard, H. Cai, and J. Kittler, "Lp norm multiple kernel Fisher discriminant analysis for object and image categorisation," in Proc. CVPR, 2010, pp. 3626–3632.
[32] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proc. ICCV, 2009, pp. 436–443.
[33] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London, U.K.: Chapman & Hall, 1986.
[34] T. Poggio and F. Girosi, "Regularization algorithms for learning that are equivalent to multilayer networks," Science, vol. 247, no. 4945, pp. 978–982, Feb. 1990.
[35] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, Mar. 2004.
[36] H. Zhang, A. C. Berg, M. Maire, and J. Malik, "SVM-KNN: Discriminative nearest neighbor classification for visual category recognition," in Proc. CVPR, 2006, pp. 2126–2136.
[37] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Hoboken, NJ: Wiley, 2001.
[38] Z. Xu, M. Dai, and D. Meng, "Fast and efficient strategies for model selection of Gaussian support vector machine," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 5, pp. 1292–1307, Oct. 2009.
[39] M. Varewyck and J.-P. Martens, "A practical approach to model selection for support vector machines with a Gaussian kernel," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 2, pp. 330–340, Apr. 2011.
[40] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Comput., vol. 10, no. 7, pp. 1895–1923, Oct. 1998.
[41] G. W. Corder and D. I. Foreman, Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach. Hoboken, NJ: Wiley, 2009.
[42] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. CVPR, 2006, pp. 2169–2178.


[43] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," in Proc. Workshop Generative-Model Based Vision, 2004, pp. 1–9.
[44] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN database: Large-scale scene recognition from abbey to zoo," in Proc. CVPR, 2010, pp. 3485–3492.

Yina Han received the B.S. degree in electronic and information engineering from Xi’an Jiaotong University, Xi’an, China, in 2004, where she is currently working toward the Ph.D. degree in the School of Electronic and Information Engineering. From 2007 to 2008, she was a joint-training Ph.D. student with Laboratoire Traitement et Communication de l’Information, Telecom ParisTech, Paris, France. Her research interests include machine learning, computer vision, and visual information retrieval.


Guizhong Liu (M’06) received the B.S. and M.S. degrees in computational mathematics from Xi’an Jiaotong University, Xi’an, China, in 1982 and 1985, respectively, and the Ph.D. degree in mathematics and computing science from Eindhoven University of Technology, Eindhoven, The Netherlands, in 1989. He is currently a Full Professor with the School of Electronic and Information Engineering, Xi’an Jiaotong University. His research interests include nonstationary signal analysis and processing, image processing, and multimedia compression, transmission, and retrieval.