A Conditional Entropy Minimization Criterion for Dimensionality ...

Report 4 Downloads 88 Views
LETTER

Communicated by Gert Lanckriet

A Conditional Entropy Minimization Criterion for Dimensionality Reduction and Multiple Kernel Learning Hideitsu Hino [email protected]

Noboru Murata [email protected] School of Science and Engineering, Waseda University, 3-4-1 Ohkubo, Shinjuku, Tokyo 169-8555, Japan

Reducing the dimensionality of high-dimensional data without losing its essential information is an important task in information processing. When class labels of training data are available, Fisher discriminant analysis (FDA) has been widely used. However, the optimality of FDA is guaranteed only in a very restricted ideal circumstance, and it is often observed that FDA does not provide a good classification surface for many real problems. This letter treats the problem of supervised dimensionality reduction from the viewpoint of information theory and proposes a framework of dimensionality reduction based on classconditional entropy minimization. The proposed linear dimensionalityreduction technique is validated both theoretically and experimentally. Then, through kernel Fisher discriminant analysis (KFDA), the multiple kernel learning problem is treated in the proposed framework, and a novel algorithm, which iteratively optimizes the parameters of the classification function and kernel combination coefficients, is proposed. The algorithm is experimentally shown to be comparable to or outperforms KFDA for large-scale benchmark data sets, and comparable to other multiple kernel learning techniques on the yeast protein function annotation task. 1 Introduction Dimensionality reduction is a technique for obtaining a compact data representation that keeps the intrinsic information of the original data as much as possible. During the past decades, the importance of dimensionality reduction has grown as the size and dimensionality of available target data have increased. When we deal with extremely high-dimensional data, such as images, sounds, texts, and gene expressions, an appropriate dimensionality reduction of raw data helps to improve computational time and burden and also allows capturing the intrinsic structure of target data as a technique of data visualization. Neural Computation 22, 2887–2923 (2010)

! C 2010 Massachusetts Institute of Technology

2888

H. Hino and N. Murata

Fisher discriminant analysis (FDA; Fisher, 1936), one of the most famous supervised classification techniques, finds a projection axis for a good separation of classes based on the ratio of between-class covariance to withinclass covariance, and projected values on the obtained projection axis can be used as a new feature variable that compactly represents the class information of data. It is known that the optimality of FDA is assured when all the class-conditional distributions are gaussians with the same covariance structure. However, in practice, FDA often fails to find the optimal axis because this assumption rarely holds. To overcome such a problem, local Fisher discriminant analysis (LFDA; Sugiyama, 2007) that utilizes local information of data by means of the affinity matrix is used. Sugiyama (2007), successfully used LFDA as a preprocessing technique for classification, and it is shown to be superior to original FDA in some experiments. In this letter, we explore dimensionality reduction in a supervised setting from the viewpoint of information theory. We also consider nonlinearization of the proposed framework by kernel methods. We argue that the proposed framework can be used as a criterion for kernel optimization and propose a novel method of multiple kernel learning. First, as a typical technique of supervised dimensionality reduction, we interpret FDA in terms of entropy. Then we propose to use the conditional entropy as an objective for supervised dimensionality reduction. So far, many dimensionality-reduction techniques have been proposed from the viewpoint of information theory. Some of them approximate the data distribution by the mixture of gaussians, and others use surrogates of the Shannon differential entropy. Our proposed framework makes no assumption on the data distribution; therefore, it is expected to work well under any data distribution. We carried out a simple experiment with two synthetic dichotomy problems to show that the proposed technique works properly even when conventional FDA fails to find a reasonable projection axis for classification. The result of this experiment is illustrated in Figure 1. The dashed lines are the classification axes found by FDA, and the solid lines are those found by the proposed technique. For the simpler data set depicted in Figure 1a, the projected samples are nicely separated into different classes (◦ and ") on both axes found by FDA and the proposed method. Figure 1b depicts a bimodal data set, that is, samples in one class form two distinct clusters. In this case, FDA collapses the samples from different classes into a single cluster, while the proposed technique gives a perfect separation. The proposed framework of conditional entropy minimization is quite general, and it can be easily extended to dimensionality reduction with nonlinear transformations. In this letter, we propose a technique based on ker¨ nel Fisher discriminant analysis (KFDA; Mika, R¨atsch, Weston, Scholkopf, ¨ & Muller, 1999). Our proposal is optimizing data projection axes and kernel combination coefficients in the context of multiple kernel learning (MKL; Lanckriet, Cristianini, Bartlett, Ghaoui, & Jordan, 2004; Lanckriet, Deng, Cristianini, Jordan, & Noble, 2004; Lewis, Jebara, & Noble, 2006a, 2006b;

2889

3

3

A Conditional Entropy Minimization Criterion

−2

−1

0

2nd axis

1

2

FDA proposed

−3

−3

−2

−1

0

2nd axis

1

2

FDA proposed

−3

−2

−1

0

1st axis

1

2

(a) Unimodal data.

3

−3

−2

−1

0

1st axis

1

2

3

(b) Bimodal data.

Figure 1: Examples of dimensionality reduction by FDA and the proposed technique. Two-dimensional dichotomy samples are projected onto an axis. The lines in these figures denote the axes on which the data samples are projected.

Do, Kalousis, Woznica, & Hilario, 2009). While there has been much research on MKL based on support vector machines (SVM; Cristianini & Shawe-Taylor, 2000) and maximum margin criteria, our proposal is based on class-conditional entropy minimization. The rest of the letter is organized as follows. Section 2 describes an information-theoretic understanding of FDA. Section 3 argues the validity of conditional entropy for dimensionality-reduction, and a novel supervised dimensionality-reduction framework is proposed based on conditional entropy minimization criterion. In this section, an entropy estimation method and an optimization method are also explained. For linear dimensionality reduction, it is experimentally shown that the proposed framework performs comparable to other dimensionality-reduction methods or even better. In section 4, the proposed technique is utilized to derive a novel multiple kernel learning method. Section 5 is devoted to discussion of information-theoretic dimensionality-reduction techniques. The last section gives concluding remarks. 2 Information-Theoretic Aspect of Dimensionality Reduction N Given a set of vector data D = {x i }i=1 , x i ∈ Rn , a problem of dimensionality reduction is formulated as finding a good transformation f : Rn → Rm that maps a datum x i ∈ Rn to an m-dimensional vector f (x i ) = zi ∈ Rm , where m ≤ n. In linear dimensionality reduction, the transformation f can be represented by a matrix A ∈ Rn×m as

zi = AT x i ,

A ∈ Rn×m .

(2.1)

2890

H. Hino and N. Murata

In this section, we consider the dimensionality-reduction problem from the viewpoint of information theory (Cover & Thomas, 1991). When we refer to the term entropy in this letter, we mean the Shannon differential entropy for a random variable X defined as ! H(X) = − p(x) log p(x)d x, (2.2) where p is the probability density function of X. As a supervised dimensionality-reduction technique, FDA is commonly N N used. Given a sample data set D = {x i }i=1 and their class labels {yi }i=1 , yi ∈ {1, 2, . . . , C}, FDA finds a linear projection of the data that is suitable for a classification task. Let Dy be a set of data that belong to the class y, Ny = |Dy | " be the number of data in the class y, and N = Cy=1 Ny be the total number of the data. We denote the mean vector and covariance matrix of the data in Dy by µ y and ! y , respectively, and the mean vector of all the data in D by µ. In FDA, the transformation matrix A is found by maximizing the ratio of the within-class covariance matrix AT !w A and the between-class covariance matrix AT !b A of the transformed data, where the matrices !w and !b are defined by !w = !b =

C C # Ny 1 ## (x − µ y )(x − µ y )T = !y, N N x∈D y=1

1 N

C # y=1

y=1

y

Ny (µ y − µ)(µ y − µ)T .

We note that with some abuse of notation, D indicates both the data N and the index set of the data {1, 2, . . . , N}. Then the objective set {x i }i=1 of FDA is minimizing the log ratio of the transformed matrices, that is, log(|AT !w A|/|AT !b A|), where |M| denotes the determinant of a square matrix M. Since multiplication of both a denominator and a numerator of |AT !w A|/|AT !b A| by a nonzero scalar does not change the value of the objective, FDA is defined as the following constrained minimization problem: $ % min |AT !w A| = min log |AT !w A| subject to |AT !b A| = const. A

A

(2.3)

Now we give an information-theoretic understanding of the FDA optimization problem 2.3. Consider the class-conditional entropy H(AT X|Y) =

C # Ny H(AT X|Y = y) N y=1

(2.4)

A Conditional Entropy Minimization Criterion

2891

of a transformed random variable AT X. Let HG (X) be the entropy of a gaussian distribution with the same covariance structure of the random variable X. The relationship between the conditional entropy and the FDA criterion 2.3 is described by the following inequalities: H(AT X|Y) ≤ HG (AT X|Y)

(2.5) C #

= log(2π)m/2 e +

1 2

≤ log(2π)m/2 e +

1 log |AT !w A|, 2

y=1

Ny log |AT ! y A| N

(2.6) (2.7)

where e is the base of natural logarithm. The first inequality, 2.5, comes from the fact that among infinite support distributions with a fixed covariance matrix, the maximum entropy is achieved by the gaussian distribution (Cover & Thomas, 1991). The second inequality, 2.7, comes from the definition of !w and Jensen’s inequality. In general, the projection found by FDA is not Bayes optimal (Duda, Hart, & Stork, 2000). However, when a datum x in each class y is subject to a gaussian distribution with the same covariance !c , that is, all ! y ’s are equal to !c and thus !w is equal to !c , inequalities 2.5 and 2.7 become equalities. From the above argument, we can conclude that FDA is a minimization problem for an upper bound of the class-conditional entropy of a variable on the projected axes. In the next section, we propose a framework for supervised dimensionality-reduction through minimizing the class-conditional entropy. 3 Proposed Framework of Dimensionality Reduction For supervised dimensionality reduction, transformed data representation in a lower-dimensional space should be compactly aggregated in each class. A random variable that is concentrated on a certain small region has small entropy. Taking account of the fact that FDA minimizes the upper bound of the class-conditional entropy, we now propose a framework of supervised dimensionality reduction that constructs a transformation f : x (→ z that minimizes the class-conditional entropy H(Z|Y). We note that the conditional entropy H(Z|Y) is minimized by any functions that map all data x to a single point. Furthermore, if the representational power of the transformation f is too high, optimization with respect to f might result in overfitting. To avoid trivial solutions and overfitting, we need restriction or regularization in optimizing H(Z|Y). In this letter, we introduce a parameter ε to control the extent of regularization and a regularization functional $( f, D),

2892

H. Hino and N. Murata

which may depend on both the function f and the given data D. Therefore, the regularized conditional entropy minimization is defined as min H(Z|Y) + ε$( f, D).

(3.1)

f :x(→ z

The form of the regularization functional $( f, D) should be appropriately designed according to each problem at hand. For example, in the linear transformation formulation, 2.1, where the determinant of the betweenclass covariance matrix is constant, say 1, we use $( f, D) = $(A, D) = (|AT !b A| − 1)2 . In the following sections, we describe a method for estimating and optimizing the entropy. 3.1 Entropy Estimation. We first consider the minimization problem, 3.1, where the transformation is linear and expressed by AT : Rn → Rm . To minimize the entropy, we estimate the entropy of one-dimensional data in a nonparametric manner. Nonparametric entropy estimation methods are roughly divided into two categories: methods based on kernel density esti¨ mators (Beirlant, Dudewicz, Gyorfi, & Meulen, 1997) and methods based on k-nearest neighbors (k-NN) (Kozachenko & Leonenko, 1987). In this letter, we adopt a k-NN-based method proposed by Faivishevsky and Goldberger (2009) because of a small variance in the estimate, fast computation, and implementation simplicity. For n-dimensional random variable X, it is known that the Shannon differential entropy has an unbiased k-NN estimator, N

Hk (X) = ψ(N) − ψ(k) + log(c n ) +

n # log &i,k , N

(3.2)

i=1

where ψ(x) is the digamma function, c n is the volume of the n-dimensional π n/2 unit ball (i.e., c n = '(1+n/2) ), and &i,k is the distance from x i to its kth nearest neighbor. Since the k-NN estimator 3.2 is valid for all k ∈ {1, . . . , N − 1}, the differential entropy can be also estimated by the average of all estimators with different values of k. Then, averaging all Hk (X), Faivishevsky and Goldberger (2009) proposed a novel entropy estimator, called MeanNN: HMNN (X) =

N−1 1 # Hk (X) N−1 k=1

& ' N−1 N 1 # n # = log(c n ) + ψ(N) + log &i,k −ψ(k) + N−1 N k=1

i=1

= log(c n ) + ψ(N) −

N−1 # 1 # n ψ(k) + log ||x i − x j ||. N−1 N(N − 1) i)= j k=1

(3.3)

A Conditional Entropy Minimization Criterion

2893

In this letter, we estimate all the entropy by the MeanNN(MNN in short) estimator 3.3 and omit the subscript MNN in HMNN . Since it is difficult to estimate the joint entropy of a high-dimensional multivariate random variable with enough accuracy, we propose to estimate the upper bound of the joint entropy by the sum of its marginal entropies (Cover & Thomas, 1991). Let al be the lth column vector of the transformation matrix A, and consider the lth element zl = alT x of the transformed vector z. Then the marginal entropy of zl is given by ! Hl (alT X) = Hl (Zl ) = − p(zl ) log p(zl )dzl , l = 1, . . . , m, and the sum of them gives the upper bound of the joint entropy of z = (z1 , . . . , zm ) as H(Z) =

m # l=1

Hl (Zl ) ≥ H(Z).

(3.4)

The upper bound of the class-conditional entropy for a class Y = y is also defined by H(Z|Y = y) =

m # l=1

Hl (Zl |Y = y) ≥ H(Z|Y = y),

(3.5)

and the weighted sum of H(Z|Y = y) with the class prior probability defines the upper bound of the class-conditional entropy as H(Z|Y) =

C #

=

C #

y=1

y=1

p(y)H(Z|Y = y) p(y)

m # l=1

Hl (Zl |Y = y) ≈

C m # Ny # Hl (Zl |Y = y), N y=1

(3.6)

l=1

where the class prior probability p(y) is estimated by Ny /N. When we minimize the entropy of a multivariate random variable, we assume that the sum of the marginal entropies is minimized henceforth. We will show a simple experimental result to support the upper-bound argument in appendix A. 3.2 Optimization Using Gradient Descent. Now we minimize the class-conditional entropy by gradient descent. We calculate the gradient vector of the class-conditional entropies of the transformed data zl = alT x, l = 1, . . . , m and update each column of A. The optimization problem to be solved for linear dimensionality reduction is min H(AT X|Y) + ε$(A, D), A

(3.7)

2894

H. Hino and N. Murata

where the conditional entropy is calculated as H(AT X|Y) =

C C m # # Ny Ny # H(AT X|Y = y) = H(alT X|Y = y). N N y=1

y=1

l=1

Since the MNN estimate of H(alT X|Y = y) is given by H(alT X|Y = y) =

# 1 log ||alT x i − alT x j || + const., N(N − 1) i, j∈Dy , i)= j

the derivative of the marginal class-conditional entropy with respect to al is given by C # # ∂ H(alT X|Y) (x j − x i ) 2 = , T T 2 N(N − 1) ∂ al (a l (x j − x i )) y=1 i, j∈Dy , i)= j

and we can minimize H(AT X|Y) =

"m

l=1

H(alT X|Y) by gradient descent.

3.3 Quasi-Orthogonalization. Since we minimize the sum of the marginal entropies, a naıve optimization for each marginal entropy may lead us to the same single transformation vector al = a, (l = 1, . . . , m) for all the marginal entropies. To avoid this, we apply quasi-orthogonalization to the transformation matrix in each iteration of gradient descent. In order to simplify operations and accelerate convergence of the algorithm, we propose to prewhiten the data in advance. A random vector x is said to be white if its covariance matrix is the unit matrix. Let the eigenvalue decomposition of the covariance matrix be E[(x − µ)(x − µ)T ] = U)U T ; then a whitened 1 vector is given by )− 2 U T x. N Now, under the assumption that the given data D = {x i }i=1 are whitened, n×m the matrix A ∈ R obtained in each step of marginal entropy minimization is modified so as to approximately satisfy ||AT A − Im || F where Im is the m × m identity matrix and || · || F denotes the Frobenius norm. This quasiorthogonalization corresponds to defining the regularization function in equation 3.7 as $(A, D) = ||AT A − Im || F . The quasi-orthogonalization of A is realized by iterating the following three steps until convergence: Step 1: Divide A by square root of the largest eigenvalue of AT A. Step 2: A ← 32 A − 12 AAT A. Step 3: Normalize the norm of each column of A to 1. This procedure for quasi-orthogonalization is validated as follows (Hyv¨arinen, Karhunen, & Oja, 2001). Let AT A = E DE T be the eigenvalue

A Conditional Entropy Minimization Criterion

2895

decomposition of the symmetric matrix AT A, where E ∈ Rm×m is an orm thogonal matrix and D is a diagonal matrix with eigenvalues {di }i=1 of T T A A. Then, by step 2 of the procedure, A A is modified as 1 AT A(→ (3A − AAT A)T (3A − AAT A) 4 ) 1 ( = E 9D − 6D2 + D3 E T . 4

Noting that di ∈ [0, 1] because the maximum eigenvalue of the matrix AT A is normalized to one in step 1, the eigenvalues of AT A after this transformation become h(di ) =

1 (9di − 6di2 + di3 ), 4

i = 1, . . . , m.

Because h(di ) − di = d4i {(di − 3)2 − 4} ≥ 0, eigenvalues of AT A converge to 1 by iterating those three steps. In actual experiments, we iterate these three steps 2 × m times to obtain an approximately orthogonalized matrix. Summarizing the above discussion, for a linear transformation AT : x (→ z, we obtain an algorithm for minimizing the class-conditional entropy depicted in Algorithm 1. We call this algorithm LCEM: linear dimensionalityreduction algorithm based on conditional entropy minimization. We have already shown a simple example of dimensionality reduction using FDA and LCEM in Figure 1. Before showing experimental results, we note that there is a study on supervised distance metric learning based on probabilistic extension of the k-NN method with a seemingly similar objective function of ours. Supervised distance metric learning aims at obtaining an appropriate distance metric matrix W = AAT for classification. This is equivalent to learning a transformation A so that the transformed data in the same class should be concentrated in a small region, and data in different classes should be separated as much as possible. Goldberger, Roweis, Hinton, and Salakhutdinov (2005) defined a probability that a datum x j is in a neighborhood of a datum x i by a Boltzmann-type distribution: p A(x j |x i ) = "

exp(−||AT x j − AT x i ||2 ) . T T 2 k)=i exp(−||A x k − A x i || )

(3.8)

Then, defining a set Ci whose elements belong to the same class with x i , they proposed to maximize the objective function f (A) =

N # # i=1 j∈Ci

p A(x j |x i )

(3.9)

2896

H. Hino and N. Murata

N N Input: Training data D = {x i }i=1 , x i ∈ Rn and class label data {yi }i=1 , yi ∈ {1, 2, . . . , C}. The dimension m (≤ n) of the transformed data. A gradient parameter ξ > 0. Initialization: Choose initial transformation matrix A ∈ Rn×m so that N rankA = m. Whiten the given data D = {x i }i=1 using its empirical mean and covariance. Iteration: Until convergence: Gradient step: Update each column of the transformation matrix:

alT := alT − ξ

∂ H(alT X|Y) , ∂ alT

l = 1, . . . , m.

Quasi-orthogonalization step: Until convergence: 1. Divide A by square root of the largest eigenvalue of AT A. 2. A := 32 A − 12 AAT A. 3. al := al /||al ||, l = 1, . . . , m.

Output: Converged transformation matrix A.

Algorithm 1: Linear dimensionality-reduction algorithm based on conditional entropy minimization. At the gradient step, a marginalized entropy is minimized by gradient descent method for each column of transformation A, and in the quasi-orthogonalization step, columns of A are quasiorthogonalized. using gradient ascent with respect to A. This metric learning algorithm is named NCA (neighborhood component analysis). Globerson and Roweis (2006) proposed an alternative method, MCML (maximally collapsing metric learning). Let p0 (x j |x i ) be the ideal distribution of p A(x j |x i ) defined by p0 (x j |x i ) ∝

*

1, yi = y j , 0, yi = ) yj .

(3.10)

¨ Then the objective function of MCML is defined by the Kullback-Leibler divergence of p A and p0 as N #

DK L ( p0 (x|x i )|| p A(x|x i )).

(3.11)

i=1

Since the distance metric matrix AAT must be positive semidefinite, the objective function is optimized by gradient descent with the positive definiteness constraint.

A Conditional Entropy Minimization Criterion

2897

The objective function of MCML and the class-conditional entropy are ¨ similar in appearance. In MCML, we optimize the sum of the KullbackLeibler divergence: arg min A

N # i=1

DK L ( p0 (x j |x i )|| p A(x j |x i ))

= arg min − A

= arg max A

N N # # i=1 j=1

N # # i=1 j∈Ci

p0 (x j |x i ) log p A(x j |x i )

log p A(x j |x i ).

(3.12)

On the other hand, in LCEM, we optimize the class-conditional entropy: arg min H(AT X|Y) A

= arg min − A

≈ arg max A

C #

p(y)

y=1

C #

#

y=1 j∈C y

!

p(AT x|Y = y) log p(AT x|Y = y)d x

log p(AT x i |Y = y).

(3.13)

Although they look similar, MCML uses a probability p A(x j |x i ) that a datum x i selects another datum x j as its neighbor, which is different from the distribution of data themselves. Moreover, in MCML, the probability p A(x j |x i ) is restricted to the form of a Boltzmann distribution to make the objective function simple and convex. 3.4 Experimental Study. We apply the proposed dimensionalityreduction technique as a preprocess in classification task. As a measure of separability of data in the transformed space, we adopt the onenearest-neighbor classifier. We employ the IDA data sets (http://ida.first .fraunhofer.de/projects/bench/benchmarks.htm), which are standard bi¨ nary classification data sets originally used in R¨atsch, Onoda, and Muller (2001). Table 1 lists the names of data sets, the dimensionalities of feature vectors, the numbers of training data and test data, and the numbers of realizations (pairs of training and test data sets) of the data. The dimensionalities of the original data are reduced by principal component analysis (PCA), FDA, MCML, LFDA, and LCEM. We estimated suitable embedding dimensionalities for PCA, MCML, LFDA and LCEM in the same manner as R¨atsch et al. (2001) used. That is, we ran five-fold cross-validation on the

2898

H. Hino and N. Murata

Table 1: IDA Data Specifications. Data Name Banana Breast-cancer Diabetes Flare-solar German Heart Image Ringnorm Splice Thyroid Titanic Twonorm Waveform

Input Data Dimensionality

Number of Training Samples

Number of Test Samples

Number of Realizations

2 9 8 9 20 13 18 20 60 5 3 20 21

400 200 468 666 700 170 1300 400 1000 140 150 400 1000

4900 77 300 400 300 100 1010 7000 2175 75 2051 7000 1000

100 100 100 100 100 100 20 100 20 100 100 100 100

first five realizations of the training sets and estimated the reduced dimensionality by median over the five estimates in each data set. Denoting the class-conditional entropy after the tth iteration by Ht , we stopped the iteration in LCEM when |Ht − Ht−1 |/|Ht−1 | < 10−4 holds. Table 2 shows means and standard deviations of the misclassification rates in percentages. The best results and comparable ones based on the t-test with a significance level of 5% are shown in boldface. The chosen embedding dimensionalities Dim are written in the table as [Dim]. Table 2 tells us that the classification accuracies obtained by LCEM are superior to PCA, FDA, and MCML for many data sets and comparable to LFDA. We also show the classification results in the original spaces in the column labeled “Euclidean.” Compared to the classification results in the Euclidean spaces, LCEM preserves classification accuracy for most data sets, and it even improves the accuracy for some data sets. From this experiment, we can speculate that there are following tendencies among the linear-supervised dimensionality-reduction methods we tested. Among the IDA data set, Banana, Thyroid, and Waveform are multimodal data, while other data are not. Fisher discriminant analysis is not appropriate for multimodal data, as Sugiyama (2007), pointed out and also a simple example is shown in Figure 1. Maximum collapsing metric learning is originally proposed as a distance metric learning technique, and it seems not to be appropriate for dimensionality reduction. This conclusion is drawn from the fact that the optimal reduced dimensionality of MCML found by cross-validation is relatively high compared to other dimensionality-reduction methods. As for LFDA and our proposed LCEM, they show similar results. Sugiyama (2007) claims that LFDA is appropriate for multimodal data. Our LCEM shows as good performance as LFDA when applied to Banana and Thyroid but not for Waveform. At this point, it is difficult to draw a

14.0(0.8)[2] 40.7(7.1)[3] 38.4(5.0)[4] 48.6(6.9)[5] 41.8(4.5)[2] 46.3(23.9)[4] 37.3(9.5)[2] 28.0(5.1)[10] 43.9(4.9)[2] 9.1(4.4)[2] 26.4(8.4)[1] 7.6(18.8)[3] 31.7(18.7)[9]

Banana Breast-cancer Diabetes Flare-solar German Heart Image Ringnorm Splice Thyroid Titanic Twonorm Waveform

38.3(4.0) 34.9(5.1) 31.3(2.8) 36.4(1.9) 32.0(2.6) 22.9(4.1) 22.1(0.9) 31.7(1.0) 20.4(0.8) 17.9(4.9) 22.5(1.1) 3.5(0.5) 18.6(1.2)

FDA 39.6(1.3)[1] 34.5(4.4)[4] 31.3(1.9)[7] 36.6(2.0)[5] 31.4(2.4)[17] 24.5(3.4)[10] 4.1(0.6)[15] 23.5(1.1)[8] 27.0(0.7)[43] 4.9(2.1)[4] 22.5(1.1)[1] 8.0(0.7)[19] 17.8(0.7)[17]

MCML 13.7(0.8)[2] 33.3(4.6)[6] 32.3(2.6)[3] 36.8(1.9)[2] 30.2(2.47)[11] 21.6(4.3)[5] 3.7(1.0)[13] 20.4(1.0)[6] 16.4(0.8)[5] 4.3(2.3)[3] 22.6(1.5)[1] 3.5(0.4)[6] 11.7(0.7)[2]

LFDA 13.6(0.8)[2] 33.6(4.4)[4] 30.1(2.1)[3] 36.5(1.9)[3] 31.2(2.6)[9] 22.7(4.0)[3] 3.4(1.0)[16] 19.7(0.8)[8] 20.6(0.6)[2] 4.4(2.2)[4] 22.5(1.1)[1] 3.6(0.4)[2] 16.3(1.0)[17]

LCEM

13.6(0.8) 32.7(4.8) 30.1(2.1) 36.5(1.9) 29.5(2.5) 23.2(3.7) 3.4(0.5) 35.0(1.4) 28.8(1.5) 4.4(2.2) 22.5(1.1) 6.7(0.7) 15.8(0.7)

Euclidean

Notes: The numbers in brackets denote standard deviations. The best results and comparable ones based on the t-test with a significance level of 5% are shown in boldface type.

PCA

Data Name

Table 2: Average and Standard Deviation of Misclassification Rates (in Percent) of Linear Dimensionality-Reduction Techniques.

A Conditional Entropy Minimization Criterion 2899

2900

H. Hino and N. Murata

general conclusion on which method is preferable to which kinds of data, and it remains future work for us. We show further experimental results in appendix C. 4 Multiple Kernel Learning Based on Conditional Entropy Minimization Fisher discriminant analysis has been extended to a nonlinear variant known as kernel Fisher discriminant analysis (KFDA; Mika et al., 1999) and has been shown to work well for data that are not linearly separable. In this section, through KFDA and the proposed dimensionality-reduction framework, we propose a novel method of multiple kernel learning (Lanckriet, Deng et al., 2004; Lanckriet, Cristianini et al., 2004; Lewis et al., 2006b; Do et al., 2009). 4.1 Kernel Fisher Discriminant Analysis. In this section, we consider dimensionality reduction to only one dimension for simplicity. Let f (x) = aT x be a linear projection from Rn to R. Since datum x is classified by comparing the value of this function and a certain threshold value, we call f (x) a classification function. Suppose a datum x ∈ Rn is . . mapped to n. -dimensional feature space Rn by a map + : Rn → Rn ; the . classification function then becomes a projection from the n -dimensional feature space to R as f (x) = aT +(x). We note that the projection vector a in the expression f (x) = aT x is an n-dimensional vector, while a in f (x) = aT +(x) is an n. -dimensional vector. Now we use the fact that in the kernel method, with some appropriate regularity condition, we can apply the representer theorem (Shawe-Taylor & Cristianini, 2004) to get expres"N sion a = i=1 αi +(x i ) with real-valued weight parameter α = (α1 , . . . , α N ). In this case, the inner product in the feature space is written by a kernel function as /+(x i ), +(x j )0 = k(x i , x j ), and we obtain the kernel expression of the classification function as f (x) =

N #

αi k(x, x i ).

(4.1)

i=1

Mika et al. (1999) proposed KFDA, a nonlinear extension of FDA by the kernel method. Let K ∈ R N×N be the Gramian matrix of the given data set N D = {x i }i=1 such that K i j = k(x i , x j ), and ki be the ith column vector of K . With this Gramian matrix and its column vectors, a sample mean vector of each class is given by 1 # y k¯ = ki , Ny i∈D y

A Conditional Entropy Minimization Criterion

2901

and a sample mean vector of all the data is given by 1 # k¯ = ki . N i∈D Then the between-class covariance matrix Vb and the within-class covariance matrix Vw in the feature space are written as C

Vb =

1 # y ¯ k¯ y − k) ¯ T, Ny ( k¯ − k)( N

Vw =

1 ## y y (ki − k¯ )(ki − k¯ )T . N i∈D

y=1 C

y=1

y

Then the objective of KFDA is minimizing log(α T Vw α/α T Vb α), and in the same way as FDA, it is formulated as a minimization problem of α T Vw α under the constraint that α T Vb α is constant. When we use a kernel function to represent the inner product in the high-dimensional feature space, minimizing log(α T Vw α/α T Vb α) sometimes results in overfitting. In this letter, we replace the within-class covariance matrix by the regularized withinclass covariance matrix Vw + ζ K , where ζ is a nonnegative regularization parameter. Now the KFDA problem is formulated as min α T (Vw + ζ K )α α

subject to

α T Vb α = const.

(4.2)

4.2 Multiple Kernel Learning Algorithm with Conditional Entropy Criterion. In this section, we combine multiple kernel functions with a coefficient vector β, and we optimize the coefficient β and weight α of the classification function by minimizing the conditional entropy. The difficulty in choosing a suitable kernel function and kernel parameter for a given data set is a serious drawback of the kernel method. One of the approaches proposed to address this problem is multiple kernel learning (MKL), in which several kernels are adaptively combined for a given data set. For large classes of combinations of kernel functions that preserve symmetry and positive definiteness, the resulting function also becomes a new valid kernel function (Shawe-Taylor & Cristianini, 2004). Consider a parameterized family of kernel functions K = {k( · , · ; λ); λ ∈ )},

(4.3)

where λ is a parameter that takes a value in a parameter space ) and characterizes the kernel function. For example, when we consider a family of gaussian kernels, ( ) k(x j , x i ; λ) = exp −λ||x j − x i ||2 , (4.4)

2902

H. Hino and N. Murata

λ is an accuracy parameter of the gaussian kernel, and ) = {λ ∈ R; λ > 0}. Then, noting that any convex combination of kernel functions becomes a kernel function again, we choose S kernel functions k( · , · ; λs ), s = 1, . . . , S with fixed parameters λs from this family K and define a new kernel function by a convex combination of them:

k( · , · ; β, λ) =

S # s=1

βs k( · , · ; λs ),

S # s=1

βs = 1, βs ≥ 0, s = 1, . . . , S. (4.5)

The idea of our proposed MKL technique is to solve the following optimization problem: min H( f (X; α, β)|Y)

(4.6)

α,β

subject to H( f (X; α, β)) = const., ||α||2 = 1, S # s=1

βs = 1, βs ≥ 0, s = 1, . . . , S,

"N αi k(x, x i ; β, λ) where we define a classification function f (x; α, β) = i=1 depending on α and β. We note that formally, it is equivalent to equation 3.1 with a regularization function $( f, D) = $(α, β, D)

( )2 = (H( f (X; α, β)) − 1)2 + ||α||2 − 1 & S '2 & S '2 # # + βs − 1 + (βs − |βs |) , s=1

s=1

for example. Since the direct simultaneous optimization of equation 4.6 with respect to both α and β is apparently difficult, we adopt an iterative optimization approach. We denote parameters after the tth iteration by α(t) and β(t). First, let us consider optimization of α for fixed β. We write the withinand between-class covariance matrices in the feature space as Vw (β), Vb (β) to show the dependency on β explicitly, and we omit ζ K in equation 4.2 for simplicity of description. As denoted in section 2, KFDA minimizes the upper bound of the class-conditional entropy; thus, the relationship between the class-conditional entropy and the KFDA objective function is

A Conditional Entropy Minimization Criterion

2903

written as H( f (X; α, β(t − 1))|Y) ≤ HG ( f (X; α, β(t − 1))|Y)

(4.7)

C

= log(2π)1/2 e + ≤ log(2π)1/2 e +

) ( 1 # Ny log α T Vy (β(t − 1))α 2 N y=1

) ( 1 log α T Vw (β(t − 1))α , (4.8) 2

" y y where Vy = N1y i∈Dy (ki − k¯ )(ki − k¯ )T . Inequality 4.8 is an upper bound of the class-conditional entropy, and for fixed β, the optimal upper-bounding solution α is given by KFDA. We next minimize the conditional entropy with respect to the kernel combination coefficient β with fixed α. The regularization term H( f (X; α, β)) = const. contains β, and as a new optimization objective, we can put this entropy term and conditional entropy term together using a tuning parameter η > 0 such as min H( f (X; α, β)|Y) − ηH( f (X; α, β)). β

(4.9)

In this letter, we take a simple strategy to avoid adding a parameter η. The minimization problem considered here is as follows: min H( f (X; α, β)|Y) β

subject to

S # s=1

βs = 1,

βs ≥ 0, s = 1, . . . , S.

(4.10) (4.11)

As a result of this β optimization step, we achieve a new kernel function, equation 4.5, with the updated coefficient β. With this new kernel function, we can update the covariance matrices Vw (β(t)), Vb (β(t)); then we again minimize the updated objective function of KFDA with respect to α. These two steps are iterated until both α and β are converged or until satisfying some predetermined stopping criterion. We name this algorithm MCEM (multiple kernel learning algorithm based on conditional entropy minimization). It is summarized in Algorithm 2. An intuitive explanation of the algorithm is given in Figure 2. The optimization method of H( f (X; α, β)|Y) with respect to β is arbitrary. In this letter, we devised two methods: the first is a random search algorithm, and the second is based on the convex (quadratic) approximation. In the former random search algorithm, we generate P candidates {β p } Pp=1 of β by a gaussian random number generator with mean vector β(t − 1). Then we calculate the conditional entropy H( f (X; α, β p )|Y) with these candidates {β p } Pp=1 and adopt one that minimizes the conditional

2904

H. Hino and N. Murata

N N Input: Training data D = {x i }i=1 , x i ∈ Rn and class label data {yi }i=1 , yi ∈ S {1, 2, . . . , C}. Kernel parameter λ = {λs }s=1 for S element kernels S {k( · , · ; λs )}s=1 , regularization parameter ζ > 0 for KFDA. S Initialization: Initialize the combination coefficients β(0) = {βs (0)}s=1 of "S element kernels by random values such that s=1 βs (0) = 1 and βs (0) ≥ 0, s = 1, . . . , S. Repetition: Until convergence, from t = 1: α optimization step: Solve KFDA minimization problem for a fixed β(t − 1) to get α(t): minα α T (Vw (β(t − 1)) + ζ K )α

subject to α T Vb α = const., ||α||2 = 1. β optimization step: Minimize the conditional entropy of the classification function f (X; α(t), β) for fixed α(t) to get β(t): minβ H( f (X; α(t), β)|Y) subject to

S # s=1

βs = 1, βs ≥ 0, s = 1, . . . S.

Output: Converged parameters α and β, used to construct the classification "N αi k(x i , x; β, λ). function as f (x; α, β) = i=1 Algorithm 2: Multiple kernel learning algorithm based on conditional entropy minimization. The algorithm iteratively optimizes the classification function that defines one-dimensional classification axis. entropy. Although this algorithm is naıve, it works well and is applicable to arbitrary form of kernel combinations other than a convex combination. The latter algorithm is described in appendix D. Depending on the method used in β optimization step, we call the random search version MCEM.R and the quadratic approximation version MCEM.Q. 4.2.1 Related Works on Multiple Kernel Learning. Several attempts have been made to learn kernel functions from the given data. The most popular approach in the context of MKL considers a finite set of predefined element kernels that are combined so that the margin-based objective function of SVM is optimized. Lanckriet, Cristianini et al. (2004) and Lanckriet, Deng et al. (2004) have proposed a framework to combine multiple kernel functions for support vector machines (SVMs). They have modified the classification function of SVM,

f (x) =

N # i=1

yi αi k(x i , x) + b,

(4.12)

A Conditional Entropy Minimization Criterion

2905 upper bounding

and

H(f (X; α, β)|Y )

α optimization

β optimization

∼ log(αT Vb (β(0))α)

α(1), β(0)

H(f (X; α, β(0))|Y )

α(2), β(1)

∼ log(αT Vb (β(1))α) H(f (X; α, β(1))|Y )

α(1), β(1) α(3), β(2)

∼ log(αT Vb (β(2))α) H(f (X; α, β(2))|Y )

α(2), β(2)

α

Figure 2: A conceptual diagram of the proposed multiple kernel learning algorithm. Dashed curves denote level curves of the conditional entropy. Solid curves denote level curves of the upper bound of the conditional entropy, which are equivalent to the objective functions of KFDA. The proposed algorithm iterates the upper bounding approximation and KFDA to minimize the conditional entropy with respect to α with fixed β, and minimizing the conditional entropy with respect to β with fixed α.

with f (x) =

N # i=1

yi αi

S # s=1

βs ks (x i , x) + b,

(4.13)

N and maximized the margin of the SVM classifier with respect to α = {αi }i=1 S and β = {βs }s=1 simultaneously by using semidefinite programming (SDP). Recently, Do et al. (2009) proposed a novel MKL method considering the fact that the theoretical error bound of SVM depends on both the margin and the radius of the smallest sphere that contains all the training samples. They derived an iterative algorithm named R-MKL to optimize the margin and the radius with respect to the weight vector α and the combination parameter β. In the next section, we compare the performance of our proposed MCEM algorithms to the representative MKL method using SDP (Lanckriet, Cristianini et al., 2004; Lanckriet, Deng et al., 2004) and R-MKL (Do et al., 2009).

2906

H. Hino and N. Murata

4.3 Experimental Study. In the same manner as we did in the linear dimensionality reduction, we conduct experiments with one-nearestneighbor (1-NN) classifiers. As a comparative study of kernel combination techniques, we also tackle the yeast protein function annotation task. In Table 3, we show classification results by KFDA, KLFDA, MCEM.R, and MCEM.Q algorithms, where KLFDA is a kernelized version of LFDA (Sugiyama, 2007). Except for KFDA, the reduced dimensionalities are arbitrary. However, we fixed them to one for the sake of simplicity. We used gaussian kernels for all algorithms. For KFDA, we applied two methods of determining the kernel parameter. One is the so-called Jaakkola’s heuristics, which uses the median of smallest Euclidean distance between the feature vectors in one class and the other class (Jaakkola, Diekhans, & Haussler, 1999) (KFDA(H) in Table 3). The other is cross-validation by first five realizations of each data set, in the same manner as in the linear case (KFDA(CV) in Table 3). The regularization parameter ζ in equation 4.2 is fixed to ζ = 0.001 for all experiments. In the proposed MCEM algorithms, we used 20 gaussian kernels with parameters λ = (10, 9, . . . , 1, 0.75, 0.5, 0.25, 0.1, 0.075, 0.05, 0.025, 0.01, 0.005, 0.001). To see the effect of kernel combination optimization, we also see the classification result by KFDA with unweighted combination of kernel functions (KFDA(UC) in Table 3). Table 3 shows that the nonlinear dimensionality-reduction techniques based on the kernel method outperform linear dimensionality-reduction techniques shown in Table 2 for many data sets. We note that KLFDA does not work well for these data sets. We conjecture that the reason the kernel methods do not work for some data sets is their ease of separation in the original Euclidean space. It is sometimes observed that nonlinearization by the kernel method degenerates the separability when the raw data are easily separated by linear methods. From Table 3, the kernel methods perform worse than 1-NN classifications in Euclidean space for Banana, Image, and Thyroid data. We can observe that in Euclidean space, the classification error of these three data sets is relatively small, and we think it is the reason that the kernel methods do not work well for them. One of the favorable features of the proposed MCEM algorithms is that it does not need to determine kernel parameters by cross-validation like KFDA. We only need to prepare several kernel functions with different kernel parameters and relegate the optimization of the kernel parameter to the optimization of the combination parameter. We can see that without kernel parameter tuning, MCEM algorithms perform comparable to or better than KFDA. As an experiment to compare with other MKL techniques, we apply the MCEM algorithms to the problem of yeast protein function annotation. We compare the proposed techniques against SVMs using single kernel and

31.26(3.40) ◦ 31.76(4.84) 30.23(2.44) ◦ 35.48(2.09) ◦ 28.91(2.88) ◦ 21.12(3.72) 12.86(1.23) ◦ 2.06(0.45) ◦ 18.14(0.76) 5.45(2.27) 22.61(1.05) ◦ 3.21(0.45) ◦ 11.67(0.74)

Banana Breast-cancer Diabetes Flare-solar German Heart Image Ringnorm Splice Thyroid Titanic Twonorm Waveform

15.00(0.98) ◦ 31.8(4.91) ◦ 29.44(2.21) ◦ 35.66(2.18) 29.31(2.67) ◦ 21.39(3.58) 11.8(1.4) ◦ 2.06(0.38) ◦ 20.16(1.13) 5.93(2.39) 22.37(1.06) ◦ 3.21(0.45) ◦ 12.03(0.82)

KFDA(H) 16.25(1.48) 32.34(5.24) ◦ 27.34(2.63) 36.24(2.48) ◦ 26.06(2.83) ◦ 20.71(5.23) 13.13(1.33) ◦ 2.93(1.49) ◦ 17.76(1.69) 7.97(3.78) 22.57(1.30) ◦ 3.74(1.08) ◦ 10.94(1.15)

KFDA(UC) 36.79(4.44) 35.89(5.01) 36.22(2.75) 37.01(1.83) 41.90(2.72) 33.93(4.85) 28.46(1.84) ◦ 2.28(0.51) 37.95(14.84) 11.12(3.61) 22.44(1.03) 44.57(5.36) 28.85(1.88)

KLFDA (CV) 16.18(1.23) 32.44(4.29) 26.96(2.26) 36.19(1.96) ◦ 26.10(2.48) ◦ 19.86(4.73) 10.53(1.53) ◦ 2.91(1.46) ◦ 18.10(2.25) 7.16(2.90) 22.26(1.04) ◦ 3.24(0.49) ◦ 10.98(1.01)

MCEM.R 17.78(2.18) ◦ 28.13(4.93) ◦ 26.18(2.46) ◦ 35.46(1.99) ◦ 25.30(2.27) ◦ 17.48(3.79) 18.77(1.44) ◦ 2.69(1.27) 24.15(1.33) 7.87(3.27) 22.46(1.08) ◦ 3.24(0.49) ◦ 12.26(1.35)

MCEM.Q

13.64(0.76) 32.73(4.82) 30.12(2.05) 36.47(1.88) 29.46(2.47) 23.16(3.74) 3.38(0.54) 35.03(1.36) 28.77(1.52) 4.36(2.210) 22.50(1.057) 6.68(0.72) 15.83(0.65)

Euclidean

Notes: The best results and comparable ones based on the t-test with a significance level of 5% are shown in boldface type. Figures are marked by ◦ when they improved the classification results in Euclidean space based on the t-test with a significance level of 5%.

KFDA(CV)

Data Name

Table 3: Misclassification Rates (in Percentages) by KFDA, KLFDA, and MCEMs.

A Conditional Entropy Minimization Criterion 2907

2908

H. Hino and N. Murata

two MKL techniques in Lanckriet, Cristianini et al. (2004) and Do et al. (2009) using data available from the support Web site of Lanckriet, Deng et al. (2004). We use three attributions of yeast protein data represented by kernels: representing gene expression, protein domain content, and protein sequence similarity. We train 12 binary classifiers for each of 12 functional classes of yeast genes. We randomly sample from the data set to reduce its size to 500 genes and then perform three-fold cross-validation, repeating the entire procedure five times. Table 4 summarizes the mean area under the ROC curves (AUCs) over 15 trials for all techniques. We tested various values for the soft margin parameter for SDP and R-MKL by running full classification experiments and adopted the best values. We note that the proposed MCEM framework is intended to learn a kernel function that regulates the distribution of the given data in the feature space so that the data are compactly aggregated in each class. The MCEM framework is flexible enough to be used as a kernel learning preprocessing, and it can be combined with other classifiers besides KFDA. In this experiment, we also show results of the SVM classification with kernel matrices learned by MCEM algorithms. In this experiment, we simply set the regularization parameter ζ for KFDA used in MCEM algorithms to ζ = 0.001, and soft margin parameter for SVM to one. From Table 4, we see that the proposed methods show comparable accuracy to other MKL methods, such as SDP and R-MKL. 5 Discussions on Information-Theoretic Dimensionality Reduction Methods Since there are enormous numbers of studies on supervised dimensionality reduction or feature extraction, we devote this section to the literature survey. Supervised dimensionality-reduction techniques can be divided into two categories: one based on margin maximization (Weston et al., 2000; Tao, Chu, & Wang, 2008), and the other on covariance structures such as FDA and LCEM. In FDA, an equivalent gaussian distribution for each class is assumed. By considering entropy or mutual information, covariance-based approaches can be generalized. Since I (Z; Y) = H(Z) − H(Z|Y) holds, maximizing the mutual information of the transformed data Z and the class label Y is equivalent to minimizing the class-conditional entropy H(Z|Y) with a regularization term, for example, $(A) = 1/H(Z), in our approach. These general dimensionality-reduction methods, referred to as informationtheoretic dimensionality reduction, are based on the Shannon entropy. Basically, these methods need to estimate joint entropy or mutual information. Since entropy is calculated from the density function of the transformed data, there are various methods depending on density estimation. Methods of estimating density functions can be divided into parametric and nonparametric approaches. In the parametric approach, a gaussian mixture model (GMM) is often adopted to approximate the distribution of Z = AT X. For example, by gradient ascent, Leiva-Murillo and Artes-Rodriguez (2004)

0.682 0.708 0.619 0.706 0.854 0.590 0.570 0.612 0.686 0.622 0.612 0.657

1 2 3 4 5 6 7 8 9 10 11 12

0.767 0.676 0.689 0.733 0.789 0.655 0.678 0.635 0.744 0.658 0.585 0.911

Dom

0.774 0.689 0.688 0.758 0.777 0.688 0.708 0.669 0.741 0.701 0.608 0.883

Seq 0.778 0.737 0.683 0.786 0.856 0.692 0.714 0.711 0.783 0.698 0.586 0.875

SDP 0.778 0.725 0.699 0.776 0.874 0.680 0.703 0.716 0.775 0.660 0.593 0.895

R-MKL 0.784 0.736 0.699 0.769 0.804 0.690 0.710 0.700 0.752 0.705 0.613 0.885

MCEM.R 0.766 0.748 0.692 0.771 0.817 0.682 0.695 0.746 0.768 0.673 0.611 0.832

MCEM.Q 0.796 0.713 0.697 0.769 0.803 0.692 0.714 0.684 0.750 0.703 0.597 0.886

MCEM.R+ SVM

0.776 0.728 0.695 0.770 0.834 0.688 0.704 0.726 0.784 0.674 0.582 0.848

MCEM.Q+ SVM

Notes: The table lists, for each functional class (row) and each classification technique (column), the mean AUC from five times three-fold crossvalidation. The optimal mean AUC per data set is shown in boldface type. The first three columns correspond to SVMs with single kernels (gene expression, protein domain content, and sequence similarity, respectively). The SDP and R-MKL columns correspond to SVMs with the combined kernel optimized by methods in Lanckriet, Cristianini et al. (2004) and Do et al. (2009), respectively.

Exp

Function

Table 4: Comparison of MKL Techniques on the Yeast Protein Function Annotation Task.

A Conditional Entropy Minimization Criterion 2909

2910

H. Hino and N. Murata

maximized the mutual information I (Z; Y) calculated by means of density estimation by a GMM. Kaski and Peltonen (2003), Sajama and Orlitsky (2005), and Goldberger, Peltonen, and Kaski (2007) also used a GMM to estimate the conditional probability p(AT x|y). Then, using Bayes’s theorem, they estimated p(y|AT x) and maximized a conditional likelihood, L(A) =

N # i=1

p(yi |AT x i ),

(5.1)

by gradient ascent. In the nonparametric approach, no assumption is made on distributions, and in general, only a small number of tuning parameters such as kernel bandwidth are predefined. Recently He, Hu, and Yuan (2009) have proposed a supervised dimensionality-reduction method by means of entropy maximization: max H(AT X) s.t. A

H(AT X|Y) = const., AT A = Im .

(5.2)

This entropy maximization criterion is similar to our framework proposed in section 3 because our quasi-orthogonalization corresponds to a constant constraint on HG (Z) ≥ H(Z) under the assumption that all data are whitened. He et al. (2009) also gave a theoretical validation for their proposal based on the relationship between the class-conditional entropy and the objective function of FDA. However, they made a strong assumption for the data distribution on the projected space to reduce the constrained entropy maximization problem to a generalized eigenvalue problem. Furthermore, their approach makes use of the Renyi quadratic entropy (Renyi, 1960), which is defined as ! HR2 (X) = − log p(x)2 d x, (5.3) instead of the Shannon entropy. The use of the Renyi entropy for unsupervised learning such as ICA is proposed in Fisher and Principe (1997), and there is a lot of related work using the Renyi entropy to avoid the difficulty in estimating the Shannon entropy (Principe & Dongxin, 1999; Torkkola & Campbell, 2000; Torkkola, 2003; Hild, Erdogmus, Torkkola, & Principe, 2006, for instance). The Renyi quadratic entropy gives a lower bound of the Shannon entropy as H(X) ≥ HR2 (X),

(5.4)

and most existing work makes use of this property to validate maximizing the Renyi quadratic entropy instead of the Shannon entropy. However, the relationship between the class-conditional entropy and the objective function of FDA is described in terms of Shannon’s original definition of entropy, and other theoretical validations of the proposed framework depicted in appendix B also rely on the definition of the Shannon entropy.

A Conditional Entropy Minimization Criterion

2911

As noted, a lot of studies on information-theoretic dimensionality reduction exist. However, most of existing parametric approaches tackled the Renyi entropy manipulation problem instead of the original Shannon entropy manipulation. By introducing a simple entropy estimator (Faivishevsky & Goldberger, 2009) and upper bounding entropy by the sum of marginal entropies, we can estimate and optimize the Shannon entropy efficiently. 6 Conclusion In this letter, we treated the dimensionality-reduction technique as an information-theoretic optimization problem and proposed a general framework of supervised dimensionality reduction based on conditional entropy minimization. By simple experiments, we show that the proposed framework can find the optimal classification surface even when the conventional Fisher discriminant analysis fails to do so. We also clarified the mechanism responsible for the discriminative dimensionality-reduction effect obtained by the proposed criterion. We implemented a linear dimensionality-reduction technique based on the proposed framework and applied it to large-scale benchmark data sets. We demonstrated that the classification accuracy after reducing dimensionalities by LCEM is better than conventional dimensionality-reduction techniques such as PCA and FDA and comparable to the state-of-the-art methods such as LFDA. There has been an increase in research on dimensionality reduction or manifold learning techniques that take account of the local metric structure of data distribution. Besides LFDA considered in this letter, the locality preserving projection (LPP; He & Niyogi, 2003) and the Laplacian eigenmap (LE; Belkin & Niyogi, 2003) are well known as examples for techniques that use the affinity between data points explicitly. In LFDA, the FDA criterion was generalized to reflect the affinity of data points. In LPP and LE, data points are projected onto a low-dimensional space so that the points close to each other in the original space are kept close in the projected space. The optimization problem of LPP and LE is formalized with the Laplacian matrix defined by the affinity matrix of data points and reduced to the generalized eigenvalue problems. On the other hand, the objective function of the proposed framework is the class-conditional entropy of the transformed data and does not explicitly consider the locality of the data. However, in estimating entropy (Faivishevsky & Goldberger, 2009), locality of the data distribution is naturally reflected, and thus comparable performance to locality-conscious methods, such as LFDA, is obtained. It will be important future work to investigate the probability model underlying other dimensionality-reduction methods and the relationship to LCEM from the viewpoint of information theory. We also considered multiple kernel learning with the conditional entropy minimization framework. To the best of our knowledge, there is no MKL technique based on the conditional entropy criterion. The proposed

2912

H. Hino and N. Murata

algorithm is not only novel; it also worked well for real-world data. As shown in Tables 3 and 4, it can acquire comparable or superior accuracy to KFDA without kernel parameter tuning, and it is also comparable to other MKL methods. Furthermore, it is shown that the proposed MCEM framework can be used as a kernel optimization process for other classifiers such as SVM. To keep the β optimization step simple, we omitted the entropy regularization term and optimized a conditional entropy term only in equation 4.10. At the cost of optimization with respect to η in equation 4.9, we may obtain improved classification results. We considered only a linear combination of kernels in this letter; however, there are other kernel combinations. Lewis et al. (2006a) generalized a way of kernel combination that allows the coefficients of kernels to depend on data points. Our proposed framework is also applicable to such a combination to improve classification accuracy. In future work, we would like to address the relationship between the proposed framework and sufficient dimensionality-reduction (SDR) research. The problem of SDR is finding a subspace such that the projection of the data vector x onto the subspace captures the statistical dependency of the class y (response, in the literature of regression) on x as much as possible. It is of great interest to develop procedures for estimating this subspace, and it has been studied (Li, 1991; Cook & Yin, 2001; Fukumizu, Bach, & Jordan, 2009). The subspace obtained by the proposed framework is, by definition, the one in which projected data distribution has low conditional entropy. As stated in section 3 and illustrated in Figure 1b, the data must be locally distributed when conditioned by the class label in the projected subspace. Another way of characterizing of the subspace obtained by conditional entropy minimization in the context of SDR is important future work for us. The convergence properties of the proposed MCEM algorithms have not been investigated yet. The study of the property and condition of convergence remains as interesting future work. We also would like to examine techniques for simultaneously optimizing the weight parameter α and coefficients β in problem 4.6 as Lanckriet, Cristianini et al. (2004) did. Appendix A: On Entropy Estimation and Approximation Methods In this appendix, we support our selection of the entropy estimator and our approximation approach for the joint entropy with the sum of marginal entropies. We first compare two nonparametric entropy estimation methods. First is a traditional leave-one-out (LOO) method based on kernel density estiN mation (Beirlant et al., 1997). Given a data set D = {xi }i=1 , xi ∈ R, we first estimate the probability density function p(x) by N

$$
\hat{p}(x; \mathcal{D}, h) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sqrt{2\pi h^2}} \exp\!\left( -\frac{\|x - x_i\|^2}{2h^2} \right), \tag{A.1}
$$


Table 5: Performance of the MNN Entropy Estimator in Comparison with an LOO Entropy Estimator.

                      LOO           MNN
Mean square error     0.08154036    0.0155177328
Standard deviation    0.02514127    0.0001276099

where $h$ is a kernel bandwidth parameter. We determine this parameter by a simple heuristic, Silverman's rule of thumb (Wand & Jones, 1994). The estimated probability density function $\hat{p}$ can be used to approximate the entropy of $X$ as

$$
H(X) \approx \tilde{H}(X) = -\mathbb{E}[\log \hat{p}(X; \mathcal{D}, h)]. \tag{A.2}
$$

Then we replace $\hat{p}(x; \mathcal{D}, h)$ by $\hat{p}(x_j; \mathcal{D} \setminus \{x_j\}, h)$ and approximate the expectation by the LOO method as

$$
\tilde{H}(X) \approx \hat{H}(X) = -\frac{1}{N} \sum_{j=1}^{N} \log \hat{p}(x_j; \mathcal{D} \setminus \{x_j\}, h). \tag{A.3}
$$

We compare the LOO and MNN entropy estimators on exponentially distributed data. The density function of the exponential distribution is $p(x; \mu) = \frac{1}{\mu} e^{-x/\mu}$, $x \geq 0$, and in this case the entropy can be calculated analytically as $H(X) = \log \mu + 1$. We generate $N = 500$ samples from exponential distributions with 10 values of the parameter, $\mu = 0.2, 0.4, \ldots, 2.0$. Table 5 shows the mean squared errors and standard deviations of the entropy estimates. From this table, the MNN estimator is more accurate than the LOO estimator in terms of mean squared error. It is also notable that the standard deviation of the MNN estimator is far smaller than that of the LOO estimator. This property is favorable when we evaluate the gradient of the entropy.

We next show a simple experimental result to support our approach to entropy estimation. Entropy estimation of high-dimensional random variables is prone to giving poor results because of the curse of dimensionality. To avoid this problem, the joint entropy is bounded from above by the sum of marginal entropies as

$$
H(Z) \leq \sum_{l=1}^{m} H_l(Z_l). \tag{A.4}
$$
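To make the comparison concrete, the following is a minimal sketch of the LOO estimator of equations A.1 to A.3 on one-dimensional data, together with one possible reading of the MNN (MeanNN) estimator obtained by averaging Kozachenko-Leonenko k-nearest-neighbor estimates over k. The bandwidth heuristic and the additive constant of the MNN estimator are our assumptions and should be checked against Silverman's rule and Faivishevsky and Goldberger (2009).

```python
import numpy as np
from scipy.special import digamma

def loo_kde_entropy(x, h=None):
    """LOO entropy estimate of eqs. A.1-A.3 with a Gaussian kernel (1-D data)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if h is None:
        # Silverman-style bandwidth heuristic (assumption): h = 1.06 * std * n^(-1/5)
        h = 1.06 * x.std(ddof=1) * n ** (-0.2)
    d2 = (x[:, None] - x[None, :]) ** 2                  # pairwise squared distances
    k = np.exp(-d2 / (2 * h ** 2)) / np.sqrt(2 * np.pi * h ** 2)
    np.fill_diagonal(k, 0.0)                             # leave one out
    p_loo = k.sum(axis=1) / (n - 1)                      # \hat{p}(x_j; D \ {x_j}, h)
    return -np.mean(np.log(p_loo))

def mean_nn_entropy(x):
    """One reading of the MeanNN estimator: average of kNN estimates over k (1-D data)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = np.abs(x[:, None] - x[None, :])
    logs = np.log(d[~np.eye(n, dtype=bool)])             # log of all pairwise distances
    # Additive constant from averaging Kozachenko-Leonenko estimators (assumption);
    # log(2.0) is the length of the unit "ball" in one dimension.
    const = digamma(n) - digamma(np.arange(1, n)).mean() + np.log(2.0)
    return logs.sum() / (n * (n - 1)) + const

rng = np.random.default_rng(0)
for mu in (0.2, 1.0, 2.0):
    x = rng.exponential(scale=mu, size=500)
    true_h = np.log(mu) + 1.0
    print(f"mu={mu:.1f}  true={true_h:.3f}  LOO={loo_kde_entropy(x):.3f}  MNN={mean_nn_entropy(x):.3f}")
```

The additive constant of the MNN estimator does not depend on the data configuration (only on the sample size and dimension), so it does not affect the location of minima when the estimator is used inside an optimization loop.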

In general, even after decorrelating each dimension of the transformed vector $Z$ by whitening (quasi-orthogonalization), there remains a gap between $H(Z)$ and $\sum_{l=1}^{m} H_l(Z_l)$ that stems from higher-order moments. Because our objective in estimating the entropy is to find the transformation matrix $A$ that minimizes the joint entropy $H(Z) = H(A^\top X)$, it is important that the minimizers of $H(Z)$ and $\sum_{l=1}^{m} H_l(Z_l)$ be close enough.

Table 6: Mean Square Error and Standard Deviation of the Differences |φ_JS − φ_MS| and |φ_JS − φ_MR|.

                      |φ_JS − φ_MS|    |φ_JS − φ_MR|
Mean square error     0.09047787       0.1146681
Standard deviation    0.05246031       0.09060831

It is difficult to show a general result, but we show the following simple experimental result. We transform a nongaussian three-dimensional variate $X$ to a two-dimensional subspace by a family of matrices $A(\phi) \in \mathbb{R}^{3 \times 2}$ with a parameter $\phi$. Let the minimizers of the estimated Shannon joint entropy, the sum of the Shannon marginal entropies, and the sum of the Renyi marginal entropies be $\phi_{JS}$, $\phi_{MS}$, and $\phi_{MR}$, respectively. We then show experimentally that $|\phi_{JS} - \phi_{MS}| < |\phi_{JS} - \phi_{MR}|$ holds. For a nongaussian multivariate distribution, we adopt a three-dimensional gaussian mixture distribution with two components,

$$
p(x; r, (\mu_1, \Sigma_1), (\mu_2, \Sigma_2)) = r \, \mathcal{N}(\mu_1, \Sigma_1) + (1 - r) \, \mathcal{N}(\mu_2, \Sigma_2), \tag{A.5}
$$
$$
\mu_1 = (-1, 1, 2)^\top, \quad \mu_2 = (1, -1, -2)^\top, \tag{A.6}
$$
$$
\Sigma_1 = \begin{pmatrix} 1 & 0.5 & 0.1 \\ 0.5 & 1 & 0.2 \\ 0.1 & 0.2 & 1 \end{pmatrix}, \quad
\Sigma_2 = \begin{pmatrix} 1 & 0.7 & 0.3 \\ 0.7 & 1 & 0.2 \\ 0.3 & 0.2 & 1 \end{pmatrix}, \tag{A.7}
$$

where $r = 0.3$. We define the column-orthonormal transformation matrix as $A(\phi) = (a_1(\phi), a_2(\phi))$, where

$$
a_1(\phi) = (\cos(\pi/3)\cos\phi, \; \cos(\pi/3)\sin\phi, \; \sin(\pi/3))^\top, \tag{A.8}
$$
$$
a_2(\phi) = (-\sin(\pi/3)\cos\phi, \; -\sin(\pi/3)\sin\phi, \; \cos(\pi/3))^\top. \tag{A.9}
$$

We vary the parameter $\phi$ from $\pi/5$ to $3\pi/5$ and transform the original data $\{x_i\}_{i=1}^{N}$ by $A(\phi)$. We generate $N = 5000$ samples from the gaussian mixture of equation A.5, estimate $\phi_{JS}$ with all 5000 samples, and estimate $\phi_{MS}$ and $\phi_{MR}$ with 500 samples. We note that it is difficult to calculate the joint entropy analytically, so we used the MNN estimator with a large number of samples ($N = 5000$) for the joint entropy estimation. We repeat this procedure 100 times and show the mean squared differences of the estimates $(\phi_{JS} - \phi_{MS})$ and $(\phi_{JS} - \phi_{MR})$ in Table 6. From this table, we can see that the minimizer $\phi_{MS}$ is closer to $\phi_{JS}$ than $\phi_{MR}$ is, and its estimates are more stable. We also plot one result of this experiment in Figure 3. The minimum value of each estimate is attained at $\phi = 0.75\pi$, $0.72\pi$, and $0.70\pi$, respectively. The parameter
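A minimal sketch of this sanity check, reusing the mean_nn_entropy function from the previous snippet as the one-dimensional marginal entropy estimator; the grid over φ and the sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_mixture(n, r=0.3):
    """Draw n samples from the two-component gaussian mixture of eqs. A.5-A.7."""
    mu1, mu2 = np.array([-1.0, 1.0, 2.0]), np.array([1.0, -1.0, -2.0])
    s1 = np.array([[1.0, 0.5, 0.1], [0.5, 1.0, 0.2], [0.1, 0.2, 1.0]])
    s2 = np.array([[1.0, 0.7, 0.3], [0.7, 1.0, 0.2], [0.3, 0.2, 1.0]])
    z = rng.random(n) < r
    return np.where(z[:, None],
                    rng.multivariate_normal(mu1, s1, size=n),
                    rng.multivariate_normal(mu2, s2, size=n))

def a_matrix(phi):
    """Column-orthonormal A(phi) of eqs. A.8-A.9."""
    c, s = np.cos(np.pi / 3), np.sin(np.pi / 3)
    a1 = np.array([c * np.cos(phi), c * np.sin(phi), s])
    a2 = np.array([-s * np.cos(phi), -s * np.sin(phi), c])
    return np.stack([a1, a2], axis=1)                    # shape (3, 2)

x = sample_mixture(500)
phis = np.linspace(np.pi / 5, 3 * np.pi / 5, 41)
# Sum of marginal entropies of Z = A(phi)^T X, estimated coordinate by coordinate.
scores = [sum(mean_nn_entropy(z_col) for z_col in (x @ a_matrix(p)).T) for p in phis]
print("phi minimizing the sum of marginal entropies:", phis[int(np.argmin(scores))])
```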

Figure 3: Estimated Shannon joint entropy (solid line), the sum of marginal Shannon entropies (dashed line), and the sum of marginal Renyi entropies (dotted line), plotted against the rotation angle (rad). The minimum point is indicated by a circle on each curve.

values that minimize the estimated joint entropy and the sum of the marginal Shannon entropies are close, so this simple experimental result supports our upper-bounding approach.

Appendix B: Validations of the Proposed Framework

In this appendix, we validate the proposed framework of supervised dimensionality reduction from a point of view different from that of section 2. In dimensionality-reduction problems, it is desirable that the transformed data be compactly aggregated in each class. Putting the notion of compact representation in the perspective of information theory, a good transformation for dimensionality reduction is one with small mutual information,

$$
I(X; Z) = H(Z) - H(Z|X), \tag{B.1}
$$

because a small $I(X; Z)$ indicates a high compression rate when we regard the transformation $x \mapsto z$ as a data compression process (Cover & Thomas, 1991). Since the mutual information, equation B.1, is determined only by the distributions of the original data $X$ and the transformed data $Z$, this equation can be regarded as a criterion for unsupervised dimensionality reduction. In supervised dimensionality reduction, the data are required to be compactly distributed in each class. In this case, it is natural to measure the goodness of the transformation by the class-conditional mutual information,

$$
I(X; Z|Y) = H(Z|Y) - H(Z|X, Y). \tag{B.2}
$$

It is also natural to suppose that the transformation $x \mapsto z$ is deterministic; in such cases, $H(Z|X, Y)$ is equal to 0, and the goodness of the transformation can be essentially measured by the class-conditional entropy $H(Z|Y)$. From the above discussion, we claim that the proposed framework is reasonable in the context of data compression theory.

We next consider the negentropy and the class-conditional negentropy of a random variable $Z$ (Hyvärinen et al., 2001), defined as

$$
J(Z) = H_G(Z) - H(Z), \tag{B.3}
$$
$$
J(Z|Y) = H_G(Z|Y) - H(Z|Y), \tag{B.4}
$$

respectively. With this negentropy expression, we get

$$
\begin{aligned}
H(Z) - H(Z|Y) &= \{H_G(A^\top X) - H_G(A^\top X|Y)\} - \{J(A^\top X) - J(A^\top X|Y)\} \\
&= \frac{1}{2} \log \frac{|A^\top \Sigma A|}{\prod_{y=1}^{C} |A^\top \Sigma_y A|^{p(y)}} - \{J(A^\top X) - J(A^\top X|Y)\},
\end{aligned}
$$

where $\Sigma$ and $\Sigma_y$ are the covariance matrices of $\mathcal{D}$ and $\mathcal{D}_y$, respectively, and $p(y)$ is the class prior distribution. Then the conditional entropy $H(Z|Y)$ can be divided into three terms as

$$
\begin{aligned}
H(Z|Y) = H(A^\top X|Y) &= H(A^\top X) + \{J(A^\top X) - J(A^\top X|Y)\} - \frac{1}{2} \log \frac{|A^\top \Sigma A|}{\prod_{y=1}^{C} |A^\top \Sigma_y A|^{p(y)}} \\
&= H_G(A^\top X) - J(A^\top X|Y) - \frac{1}{2} \log \frac{|A^\top \Sigma A|}{\prod_{y=1}^{C} |A^\top \Sigma_y A|^{p(y)}}.
\end{aligned} \tag{B.5}
$$
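As a rough numerical illustration (not part of the original derivation), the gaussian-entropy term and the log-determinant term of equation B.5 can be computed directly from sample covariances; the two-class synthetic data and the fixed projection A below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two classes in R^3 with different covariances, projected to 2-D by a fixed A.
n = 2000
x1 = rng.multivariate_normal([0, 0, 0], np.diag([1.0, 1.0, 1.0]), size=n)
x2 = rng.multivariate_normal([2, 0, 0], np.diag([0.3, 2.0, 1.0]), size=n)
X = np.vstack([x1, x2])
y = np.repeat([0, 1], n)
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])     # project onto the first two coordinates

def gaussian_entropy(cov):
    """Entropy of a gaussian with the given covariance matrix."""
    d = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(cov))

Sigma = np.cov(X.T)
priors = np.bincount(y) / len(y)
Sigma_y = [np.cov(X[y == c].T) for c in (0, 1)]

h_g = gaussian_entropy(A.T @ Sigma @ A)                 # first term of eq. B.5
hda = 0.5 * (np.log(np.linalg.det(A.T @ Sigma @ A))
             - sum(p * np.log(np.linalg.det(A.T @ S @ A)) for p, S in zip(priors, Sigma_y)))
print("H_G(A^T X)            :", h_g)
print("HDA log-det term      :", hda)
print("upper bound H_G - HDA :", h_g - hda)             # equals H(Z|Y) when the negentropy terms vanish
```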

In the following, we consider the meanings of these three terms.

B.1 Joint Entropy Under Gaussian Assumption. The first term $H_G(A^\top X)$ of equation B.5 is the entropy when the distribution of all the data is gaussian. The value of this term is completely determined by the determinant of the covariance matrix $\Sigma$ of all the data, and minimizing it is equivalent to expressing all the data compactly. However, this term can be made arbitrarily small by scalar multiplication of the data, so it does not influence the classification ability. As a consequence, under the assumption that the transformation for dimensionality reduction is regularized in some manner, minimizing this first term $H_G(A^\top X)$ does not make a significant contribution to classification accuracy.

B.2 Conditional Negentropy. The negentropy of a distribution is a normalized version of entropy defined by equation B.3 and is mainly used in research on independent component analysis (ICA; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001) as a measure of nongaussianity. In this sense, negentropy is also used as a measure of how interesting a data distribution is. From equation B.5, we can see that minimizing the conditional entropy $H(A^\top X|Y)$ contributes to maximizing the conditional negentropy $J(A^\top X|Y)$. As a result, we can expect the obtained transformation to increase the nongaussianity of the transformed data in each class.

B.3 Heterogeneous Discriminant Analysis Criteria. The last term of equation B.5 is exactly the objective function of heteroscedastic discriminant analysis (HDA) defined by Kumar and Andreou (1998), and optimizing this term leads to good class discrimination. In HDA, the covariance structure can differ for each class; this setting has been investigated by many researchers (Kumar & Andreou, 1998; Hastie & Tibshirani, 1996; Loog & Duin, 2004; Zhang & Yeung, 2009) to overcome the strict assumption of FDA.

Appendix C: Further Experimental Results on Linear Dimensionality Reduction

We show further experimental results on linear dimensionality-reduction techniques. We first compare the classification accuracies when the dimensionality of the data is reduced to one in Table 7. From this table, we can conclude that, for most of the data sets, LCEM performs comparably to or slightly better than the other supervised dimensionality-reduction methods when the dimensionality is reduced to one. We next show the misclassification rates as functions of the reduced dimensionality. The results show that LCEM works well, but overall there is no single best method that consistently outperforms the others. As seen from Figure 4, the misclassification rate basically gets smaller as the dimensionality increases. However, the Ringnorm data seem to attain their minimum misclassification rate near d = 7, which suggests the need for some sort of model selection procedure to find the best dimensionality.
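A minimal sketch of such a model selection step, assuming a hypothetical project(X_train, X_val, d) routine that stands in for fitting any of the linear methods above on the training split and mapping both splits to d dimensions; the holdout split and the hand-rolled one-nearest-neighbor classifier are illustrative choices.

```python
import numpy as np

def one_nn_error(Z_train, y_train, Z_test, y_test):
    """Misclassification rate of a one-nearest-neighbor classifier."""
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    pred = y_train[np.argmin(d2, axis=1)]
    return float(np.mean(pred != y_test))

def select_dimension(X_train, y_train, X_val, y_val, project, max_dim):
    """Pick the reduced dimension d with the smallest validation 1-NN error."""
    errors = {}
    for d in range(1, max_dim + 1):
        Z_train, Z_val = project(X_train, X_val, d)      # hypothetical projection routine
        errors[d] = one_nn_error(Z_train, np.asarray(y_train), Z_val, np.asarray(y_val))
    best_d = min(errors, key=errors.get)
    return best_d, errors
```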


Table 7: Average Misclassification Rate (in Percentage) of Linear Techniques When Dimensionality Is Reduced to One.

Data Name        PCA          FDA         MCML        LFDA        LCEM
Banana           36.5(0.6)    38.3(4.0)   39.4(1.3)   36.2(1.2)   34.4(1.6)
Breast-Cancer    38.9(5.5)    34.9(5.1)   33.8(5.4)   33.9(4.7)   34.5(4.8)
Diabetes         40.0(4.2)    31.3(2.8)   40.6(2.2)   34.1(2.4)   30.7(2.5)
Flare-Solar      43.8(5.7)    36.4(1.9)   36.2(2.6)   36.8(1.9)   36.6(2.0)
German           42.0(2.3)    32.0(2.6)   39.9(3.3)   38.4(3.3)   31.8(2.8)
Heart            44.0(25.9)   22.9(4.1)   41.8(5.6)   22.5(3.2)   22.9(3.2)
Image            44.0(9.0)    22.1(0.9)   29.3(1.5)   31.2(1.6)   22.6(1.4)
Ringnorm         36.1(8.4)    31.7(1.0)   41.9(0.9)   31.6(1.6)   31.9(1.1)
Splice           42.7(4.3)    20.4(0.8)   45.4(2.0)   20.9(0.9)   20.6(0.6)
Thyroid          9.3(3.8)     17.9(4.9)   19.6(3.3)   7.4(3.4)    17.2(4.2)
Titanic          23.0(1.6)    22.5(1.1)   22.2(1.0)   22.6(1.5)   22.5(1.0)
Twonorm          3.6(0.3)     3.5(0.5)    40.9(1.2)   3.4(0.4)    3.5(0.5)
Waveform         36.8(19.0)   18.6(1.2)   40.4(1.2)   18.6(1.1)   18.7(1.1)

Note: The best results and comparable ones based on the t-test with a significance level of 5% are shown in boldface type.

Figure 4: Mean misclassification rates as functions of reduced dimensionality. Four linear dimensionality-reduction methods (LCEM, LFDA, MCML, PCA) are used to map the data into spaces of dimensionality lower than or equal to the original, and the projected data are classified by one-nearest-neighbor classifiers; one panel per data set.

Appendix D: Quadratic Optimization for the MCEM.Q Algorithm

We give the detailed derivation of the approximate β optimization step in the MCEM.Q algorithm. We again consider the relationship between the conditional entropy and its upper bound optimized in KFDA, represented by equations 4.7 and 4.8. The final right-hand side of equation 4.8 is equivalent to the objective function of KFDA. In KFDA, this upper bound of the conditional entropy is minimized with respect to $\alpha$. For the kernel combination coefficients $\beta$, we can also derive the same kind of upper bound of the conditional entropy. We first rewrite $V_y(\beta)$ explicitly using the element kernels. Let $K(s)$ be the Gramian matrix of the $s$th element kernel function $k(\cdot, \cdot; \lambda_s)$ and $k_i(s)$ be the $i$th column vector of $K(s)$. Then $V_y(\beta)$ can be written as

$$
V_y(\beta) = \frac{1}{N_y} \sum_{i \in \mathcal{D}_y} (k_i - \bar{k}^y)(k_i - \bar{k}^y)^\top, \tag{D.1}
$$
$$
k_i = \sum_{s=1}^{S} \beta_s k_i(s), \quad \bar{k}^y = \sum_{s=1}^{S} \beta_s \bar{k}^y(s). \tag{D.2}
$$
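A minimal sketch of equations D.1 and D.2, assuming S precomputed element Gram matrices and the index set of one class; the names gram_list, beta, and idx_y are illustrative.

```python
import numpy as np

def class_scatter_vy(gram_list, beta, idx_y):
    """V_y(beta) of eqs. D.1-D.2 built from element Gram matrices K(1), ..., K(S)."""
    # The combined column k_i = sum_s beta_s k_i(s) is the ith column of sum_s beta_s K(s).
    K = sum(b * K_s for b, K_s in zip(beta, gram_list))  # N x N combined Gram matrix
    cols = K[:, idx_y]                                   # k_i for i in class y, shape (N, N_y)
    k_bar = cols.mean(axis=1, keepdims=True)             # class mean column, \bar{k}^y
    diff = cols - k_bar
    return diff @ diff.T / len(idx_y)                    # V_y(beta), shape (N, N)
```

The KFDA-style objective term for class y is then obtained as the scalar α^T V_y(β) α.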

Now we ignore a constant term and multiplicative factors and obtain an upper bound of the conditional entropy as

$$
\begin{aligned}
\sum_{y=1}^{C} \frac{N_y}{N} \log \left| \alpha^\top V_y(\beta) \alpha \right|
&= \sum_{y=1}^{C} \frac{N_y}{N} \log \left| \alpha^\top \left\{ \frac{1}{N_y} \sum_{i \in \mathcal{D}_y} (k_i - \bar{k}^y)(k_i - \bar{k}^y)^\top \right\} \alpha \right| \\
&= \sum_{y=1}^{C} \frac{N_y}{N} \log \left| \alpha^\top \left\{ \frac{1}{N_y} \sum_{i \in \mathcal{D}_y} (\tilde{K}_i - \tilde{K}^y) \beta \beta^\top (\tilde{K}_i - \tilde{K}^y)^\top \right\} \alpha \right| \\
&= \sum_{y=1}^{C} \frac{N_y}{N} \log \left| \frac{1}{N_y} \sum_{i \in \mathcal{D}_y} \gamma_{iy}^\top \beta \beta^\top \gamma_{iy} \right| \\
&= \sum_{y=1}^{C} \frac{N_y}{N} \log \left| \beta^\top \left\{ \frac{1}{N_y} \sum_{i \in \mathcal{D}_y} \gamma_{iy} \gamma_{iy}^\top \right\} \beta \right| \\
&= \sum_{y=1}^{C} \frac{N_y}{N} \log \left| \beta^\top \Gamma_y \beta \right|
\leq \log \left| \beta^\top \sum_{y=1}^{C} \frac{N_y}{N} \Gamma_y \beta \right|
= \log \left| \beta^\top \Gamma_w \beta \right|,
\end{aligned}
$$

where we bundle the $i$th column vectors of the $S$ Gramian matrices into the matrix $\tilde{K}_i = (k_i(1), \ldots, k_i(S)) \in \mathbb{R}^{N \times S}$ to obtain the equality $k_i = \sum_{s=1}^{S} \beta_s k_i(s) = \tilde{K}_i \beta$, and bundle the class-$y$ average column vectors of the $S$ Gramian matrices into the matrix $\tilde{K}^y = (\bar{k}^y(1), \ldots, \bar{k}^y(S)) \in \mathbb{R}^{N \times S}$ to obtain the equality $\bar{k}^y = \sum_{s=1}^{S} \beta_s \bar{k}^y(s) = \tilde{K}^y \beta$. We also defined $\gamma_{iy} = (\tilde{K}_i - \tilde{K}^y)^\top \alpha \in \mathbb{R}^S$, $\Gamma_y = \frac{1}{N_y} \sum_{i \in \mathcal{D}_y} \gamma_{iy} \gamma_{iy}^\top$, and $\Gamma_w = \sum_{y=1}^{C} \frac{N_y}{N} \Gamma_y$. We used Jensen's inequality to derive the last inequality. As a result, the minimization of this upper bound of the conditional entropy with respect to $\beta$ is formulated as the following optimization problem:

$$
\min_{\beta} \; \beta^\top \Gamma_w \beta \tag{D.3}
$$
$$
\text{subject to} \quad \sum_{s=1}^{S} \beta_s = 1, \quad \beta_s \geq 0, \quad s = 1, \ldots, S. \tag{D.4}
$$

This problem is a quadratic optimization problem, and a unique solution is obtained efficiently by, for example, an interior point method.
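As a sketch, the same simplex-constrained program can also be handed to a general-purpose solver; the example below uses scipy's SLSQP routine as a stand-in for a dedicated interior-point method, and the random Γ_w is only a placeholder for the matrix built from the γ_iy vectors.

```python
import numpy as np
from scipy.optimize import minimize

def optimize_beta(gamma_w):
    """Solve min_beta beta^T Gamma_w beta s.t. sum(beta) = 1, beta >= 0 (eqs. D.3-D.4)."""
    s = gamma_w.shape[0]
    beta0 = np.full(s, 1.0 / s)                          # start from the uniform combination
    res = minimize(
        fun=lambda b: b @ gamma_w @ b,
        x0=beta0,
        jac=lambda b: 2.0 * gamma_w @ b,
        method="SLSQP",
        bounds=[(0.0, None)] * s,
        constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}],
    )
    return res.x

# Toy example with a random positive semidefinite Gamma_w (placeholder).
rng = np.random.default_rng(3)
G = rng.standard_normal((4, 10))
gamma_w = G @ G.T / 10.0
print(optimize_beta(gamma_w))
```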

Acknowledgments

We are grateful to Nima Reyhani for helpful suggestions. We also express special thanks to the anonymous reviewers, whose comments led to valuable improvements of this letter.

References

Beirlant, J., Dudewicz, E. J., Györfi, L., & Meulen, E. C. (1997). Nonparametric entropy estimation: An overview. International Journal of the Mathematical Statistics Sciences, 6, 17–39.


Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287–314.
Cook, D. R., & Yin, X. (2001). Dimension reduction and visualization in discriminant analysis (with discussion). Australian and New Zealand Journal of Statistics, 43(2), 147–199.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. Hoboken, NJ: Wiley.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
Do, H., Kalousis, A., Woznica, A., & Hilario, M. (2009). Margin and radius based multiple kernel learning. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (Vol. 1, pp. 330–343). Los Alamitos, CA: IEEE Computer Society Press.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. Hoboken, NJ: Wiley-Interscience.
Faivishevsky, L., & Goldberger, J. (2009). ICA based on a smooth estimation of the differential entropy. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 433–440). Cambridge, MA: MIT Press.
Fisher, J. W., & Principe, J. C. (1997). Entropy manipulation of arbitrary nonlinear mappings. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing (pp. 14–23). Piscataway, NJ: IEEE Press.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Fukumizu, K., Bach, F. R., & Jordan, M. I. (2009). Kernel dimension reduction in regression. Annals of Statistics, 37, 1871–1905.
Globerson, A., & Roweis, S. (2006). Metric learning by collapsing classes. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18 (pp. 451–458). Cambridge, MA: MIT Press.
Goldberger, J., Peltonen, J., & Kaski, S. (2007). Fast semi-supervised discriminative component analysis. In Proceedings of Machine Learning for Signal Processing (pp. 312–317). CSREA Press.
Goldberger, J., Roweis, S., Hinton, G., & Salakhutdinov, R. (2005). Neighborhood component analysis. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 513–520). Cambridge, MA: MIT Press.
Hastie, T., & Tibshirani, R. (1996). Discriminant analysis by gaussian mixtures. Journal of the Royal Statistical Society, Series B, 58, 155–176.
He, R., Hu, B. G., & Yuan, Z. (2009). Robust discriminant analysis based on nonparametric maximum entropy. In Proceedings of the First Asian Conference on Machine Learning (pp. 120–134). Berlin: Springer.
He, X., & Niyogi, P. (2003). Locality preserving projections. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (pp. 153–160). Cambridge, MA: MIT Press.


Hild, K. E., Erdogmus, D., Torkkola, K., & Principe, J. C. (2006). Feature extraction using information-theoretic learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9), 1385–1392.
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys, 2, 94–128.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. Hoboken, NJ: Wiley.
Jaakkola, T., Diekhans, M., & Haussler, D. (1999). Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (pp. 149–158). Menlo Park, CA: AAAI Press.
Kaski, S., & Peltonen, J. (2003). Informative discriminant analysis. In Proceedings of the 20th International Conference on Machine Learning (pp. 329–336). Menlo Park, CA: AAAI Press.
Kozachenko, L. F., & Leonenko, N. N. (1987). Sample estimate of entropy of a random vector. Problems of Information Transmission, 23, 95–101.
Kumar, N., & Andreou, A. G. (1998). Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Communication, 26(4), 283–297.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.
Lanckriet, G. R. G., Deng, M., Cristianini, N., Jordan, M. I., & Noble, W. S. (2004). Kernel-based data fusion and its application to protein function prediction in yeast. In Proceedings of the Pacific Symposium on Biocomputing (pp. 300–311). Singapore: World Scientific.
Leiva-Murillo, J. M., & Artes-Rodriguez, A. (2004). A gaussian mixture based maximization of mutual information for supervised feature extraction. In Proceedings of the Fifth International Conference on Independent Component Analysis and Blind Signal Separation (pp. 271–278). Berlin: Springer.
Lewis, D. P., Jebara, T., & Noble, W. S. (2006a). Nonstationary kernel combination. In Proceedings of the 23rd International Conference on Machine Learning (pp. 553–560). San Francisco: Morgan Kaufmann.
Lewis, D. P., Jebara, T., & Noble, W. S. (2006b). Support vector machine learning from heterogeneous data: An empirical analysis using protein sequence and structure. Bioinformatics, 22, 2753–2760.
Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327.
Loog, M., & Duin, R. P. W. (2004). Linear dimensionality reduction via a heteroscedastic extension of LDA: The Chernoff criterion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), 732–739.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K. R. (1999). Fisher discriminant analysis with kernels. In Proceedings of the 1999 IEEE Signal Processing Society Workshop (pp. 41–48). Piscataway, NJ: IEEE Press.
Principe, J. C., & Dongxin, X. (1999). An introduction to information theoretic learning. In Proceedings of the International Joint Conference on Neural Networks (pp. 1783–1787). Cambridge, MA: MIT Press.


Rätsch, G., Onoda, T., & Müller, K. R. (2001). Soft margins for AdaBoost. Machine Learning, 42(3), 287–320.
Renyi, A. (1960). On measures of information and entropy. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (pp. 547–561). Berkeley: University of California Press.
Sajama, & Orlitsky, A. (2005). Supervised dimensionality reduction using mixture models. In Proceedings of the 22nd International Conference on Machine Learning (pp. 768–775). Menlo Park, CA: AAAI Press.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
Sugiyama, M. (2007). Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8, 1027–1061.
Tao, Q., Chu, D., & Wang, J. (2008). Recursive support vector machines for dimensionality reduction. IEEE Transactions on Neural Networks, 19(1), 189–193.
Torkkola, K. (2003). Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3, 1415–1438.
Torkkola, K., & Campbell, W. M. (2000). Mutual information in learning feature transformations. In Proceedings of the 17th International Conference on Machine Learning (pp. 1015–1022). San Francisco: Morgan Kaufmann.
Wand, M. P., & Jones, M. C. (1994). Kernel smoothing. London: Chapman & Hall/CRC.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2000). Feature selection for SVMs. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 668–674). Cambridge, MA: MIT Press.
Zhang, Y., & Yeung, D.-Y. (2009). Heteroscedastic probabilistic linear discriminant analysis with semi-supervised extension. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (Vol. 2, pp. 602–616). San Francisco: Morgan Kaufmann.

Received October 30, 2009; accepted May 6, 2010.