Gaussian Process for Dimensionality Reduction in Transfer Learning∗

Bin Tong†

Junbin Gao‡

Nguyen Huy Thach§

Einoshin Suzuki¶

Abstract

Dimensionality reduction has been considered as one of the most significant tools for data analysis. In general, supervised information is helpful for dimensionality reduction. However, in typical real applications, supervised information in multiple source tasks may be available, while the data of the target task are unlabeled. In such a scenario, an interesting problem arises: how to guide the dimensionality reduction for the unlabeled target data by exploiting useful knowledge, such as label information, from multiple source tasks. In this paper, we propose a new method for dimensionality reduction in the transfer learning setting. Unlike traditional paradigms, where the useful knowledge from multiple source tasks is transferred through a distance metric, our proposal first converts the dimensionality reduction problem into integral regression problems in parallel. A Gaussian process is then employed to learn the underlying relationship between the original data and the reduced data. Such a relationship can be appropriately transferred to the target task by exploiting the prediction ability of the Gaussian process model and designing different kinds of regularizers. Extensive experiments on both synthetic and real data sets show the effectiveness of our method.

Keywords: Dimensionality reduction, Gaussian process, Transfer learning

∗ This work is partially supported by the grant-in-aid for scientific research on fundamental research (B) 21300053 from the Japanese Ministry of Education, Culture, Sports, Science and Technology, and also partially supported by Charles Sturt University Competitive Research Grant OPA 4818.
† Graduate School of Systems Life Sciences, Kyushu University, Japan.
‡ School of Computing and Mathematics, Charles Sturt University, Australia.
§ Graduate School of Systems Life Sciences, Kyushu University, Japan.
¶ Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan.

Copyright © SIAM. Unauthorized reproduction of this article is prohibited.

1 Introduction

In various applications, such as bioinformatics and image retrieval, one is often confronted with high dimensional data. Dimensionality reduction [12], which maps data points in a high-dimensional space into those in a low-dimensional space, is viewed as one of the most crucial preprocessing steps of data analysis. Traditionally, dimensionality reduction methods can be divided into two categories: supervised ones [18, 16], including the conventional Linear Discriminant Analysis (LDA), and unsupervised ones [21, 17, 1], including the classical Principal Component Analysis (PCA). In general, the label information is helpful for the dimensionality reduction task [27]. However, in many typical real-world applications, only a small number of labeled data points are available due to the high cost of obtaining them. As a consequence, the performance of the model learned from few labeled data is often not satisfactory.

To alleviate this labeled data deficiency problem [27], two kinds of learning paradigms have been widely investigated: semi-supervised learning [27, 20, 24] and transfer learning ([13] and references therein). Semi-supervised learning can be regarded as an extension of the conventional supervised learning paradigm that augments the labeled data set with unlabeled data, such that the local data structure embedded in the unlabeled data can be discovered to boost the learning performance. However, typical semi-supervised learning assumes that both the labeled data and the unlabeled data are from the same task. That is, the labeled data and unlabeled data are always assumed to be drawn from the same distribution. Typically, many related data sets in different tasks with different distributions are available, which satisfies the typical setting of transfer learning [13]. In this paper, we consider an interesting scenario often met in real applications, in which plentiful label information is available in the source tasks, while the data of the target task are unlabeled. The focus of our work is restricted to investigating the dimensionality reduction for the target task in the transfer learning setting.

One major challenge of our work is how to transfer useful knowledge from the source tasks to the target task to facilitate the dimensionality reduction of the unlabeled target data. Specifically, we refer to the useful knowledge as the label information, or as the local data structure in each source task. The two most relevant works [23, 26] also handle a similar problem in the metric learning context. In the two methods, the useful knowledge is transferred to the target task by exploiting a distance metric learned from multiple source tasks in advance. However, there exist the following potential weaknesses, as two kinds of relationships may not be fully taken into account. One of the potential weaknesses is that the two methods may fail to discover the implicit relationship between the original data and the data in the transformed space, which would be more informative than the distance metric itself. The other is that only considering the relationships on the distance metric between tasks would not necessarily reflect the underlying data relationships between tasks, which are important concerns in transfer learning [13]. In addition, one significant difference in the problem setting between our work and the two methods is that we assume the data in the target task are unlabeled, while the above two methods assume some of the data in the target task are labeled. This kind of problem setting makes our task more difficult.

In order to handle this difficult problem and overcome the potential weaknesses, we propose a new framework named GPDRTL (Gaussian Process for Dimensionality Reduction in Transfer Learning). We first convert our problem into integral regression problems in parallel by using the SR method [4]. We then employ Gaussian process for regression [15] to discover the implicit relationship between the original data and the data in the reduced space. In addition, the underlying relationship on data between each source task and the target task can also be indirectly established by the predictive procedure of the Gaussian process model. In our method, we derive two variants with different regularizers, which can be deemed as bridges between each source task and the target task. The utility of the regularizers is to transfer the useful knowledge from multiple source tasks to the target task.

The rest of the paper is organized as follows.
In Section 2, the related works are discussed. Section 3 discusses the problem setting of our method. In Section 4, the preliminaries, including Gaussian process for regression and Spectral Regression for subspace learning, are introduced. We then present our method, including its two variants, in Section 5. The experimental results on both a toy example and real data sets are discussed in Section 6. Finally, we draw a conclusion in Section 7.

2 Related Works

In this section, we review past research works related to ours, including dimensionality reduction for transfer learning and metric learning for transfer learning.

Recently, Pan et al. [13] invented a new dimensionality reduction method, named MMDE, for domain adaptation by exploiting the Maximum Mean Discrepancy (MMD) criterion. A common feature space is obtained to make the distribution difference between a source task and a target task small. MMDE was extended in [14] to the out-of-sample case. Wang et al. [22] proposed a framework for transferred dimensionality reduction in the clustering context. The selection of the reduced feature space shared by a source task and a target task is determined by the discriminative analysis, where the supervised information is derived from the clustering procedure in the target task. Although we also handle dimensionality reduction in the transfer learning setting, our problem setting assumes that multiple source tasks are available. It is not an easy task to extend the methods above to the case of multiple source tasks. In addition, the reduced feature space derived from our method is only for the target task and is not shared by all tasks.

Zha et al. [23] proposed a metric learning method from auxiliary knowledge. In its framework, two kinds of regularization techniques are employed. One of them is the Log-determinant regularization function, where the information-based distance, Bregman divergence, is used to measure the distance between two distance metric matrices. The other one is to exploit the manifold regularization to formulate the local structure and the relationship between distance metric matrices. Zhang et al. [26] proposed a method for transfer metric learning by using task relationships, where the RDML method [11] is extended to the setting of multi-task learning. In this method, the task relationships are formulated to avoid the negative transfer [13]. However, the data relationship between different tasks is not fully taken into consideration. In our method, such a relationship can be easily established by the Gaussian process model.

3 Problem Setting

Suppose that there are $m$ source tasks $T_1, \ldots, T_m$, and a target task $T^0$ for dimensionality reduction. For the $i$-th source task $T_i$ ($i = 1, \ldots, m$), we are given the data $X_i$, which consists of $n_i$ data points $\{x_j^i\}_{j=1}^{n_i}$ with the $j$-th data point $x_j^i \in \mathbb{R}^D$. $X_i$ can then be represented by a $D \times n_i$ matrix. We assume that the data in each source task are labeled. For the target task $T^0$, we are given the data $X^0$, which consists of $n^0$ unlabeled data points $\{x_j^0\}_{j=1}^{n^0}$ with the $j$-th data point $x_j^0 \in \mathbb{R}^D$. $X^0$ can then be represented by a $D \times n^0$ matrix.

Our task is to perform dimensionality reduction for the data in the target task. Suppose that the dimensionality of the reduced data in the target task is $L$ ($L \ll D$). If the transformation matrix for dimensionality reduction is explicitly required, our goal is to obtain a $D \times L$ transformation matrix $W$ for the target task with an orthogonal constraint $W^T W = I$. $W$ consists of $L$ projective vectors $\{w_1, w_2, \ldots, w_L\}$, where $w_l$ ($l = 1, 2, \ldots, L$) represents the $l$-th projection direction for dimensionality reduction. If the transformation matrix is not required, we directly aim at obtaining the reduced data of the target task. The reduced data of the target task is represented as $Y = \{y_1, y_2, \ldots, y_L\}$, where $y_l$ ($l = 1, 2, \ldots, L$) is a vector denoting the reduced data in the $l$-th dimensionality.

4 Preliminary

In this section, we are concerned with the preliminaries for our method, including Spectral Regression for subspace learning and Gaussian process for regression.

4.1 Spectral Regression for Subspace Learning

Spectral Regression for subspace learning [4] (SR) converts the dimensionality reduction problem into a regression problem. The SR algorithm mainly includes two steps. The first step is to obtain the reduced data without knowing the transformation matrix for dimensionality reduction. The second step is to formulate the relationship between the original data and the reduced data as a regression problem to derive the transformation matrix. In the following, we introduce the SR algorithm in the context of our problem setting.

For the $i$-th source task, we build a graph $G_i$ with $n_i$ vertices, where each vertex represents a data point. Let $H^i$ be a symmetric $n_i \times n_i$ matrix with $H^i_{pq}$ being the weight of the edge joining vertices $p$ and $q$. Let $y^i = [y_1^i, y_2^i, \ldots, y_{n_i}^i]^T$ be the mapping from the graph $G_i$ to one of the projection directions for dimensionality reduction. The optimal $y^i$ is given by minimizing

$$\sum_{p,q} (y_p^i - y_q^i)^2 H^i_{pq} \tag{4.1}$$

Since $\sum_{p,q} (y_p^i - y_q^i)^2 H^i_{pq} = 2\, y^{iT} (D^i - H^i)\, y^i$, minimizing (4.1) under the normalization $y^{iT} D^i y^i = 1$ is the same as maximizing $y^{iT} H^i y^i$. The above optimization problem therefore has the following equivalent variation.

$$y^{i*} = \arg\max_{y^{iT} D^i y^i = 1} y^{iT} H^i y^i = \arg\max \frac{y^{iT} H^i y^i}{y^{iT} D^i y^i} \tag{4.2}$$

where $D^i$ is a diagonal matrix whose entries are the column (or row, since $H^i$ is symmetric) sums of $H^i$, $D^i_{pp} = \sum_q H^i_{qp}$. The optimal $y^i$ can be obtained by solving the maximum eigenvalue eigen-problem

$$H^i y^i = \lambda D^i y^i \tag{4.3}$$

If a linear function is chosen, i.e., $y_j^i = f(x_j^i)$ ($j = 1, 2, \ldots, n_i$), we have $y^i = X_i^T a^i$. Eq. (4.2) can be written as

$$a^{i*} = \arg\max \frac{y^{iT} H^i y^i}{y^{iT} D^i y^i} = \arg\max \frac{a^{iT} X_i H^i X_i^T a^i}{a^{iT} X_i D^i X_i^T a^i} \tag{4.4}$$

The solution of Eq. (4.4) can also be obtained by solving an eigen-problem. Note that with different choices of $H^i$, various dimensionality reduction algorithms can be derived, e.g., Linear Discriminant Analysis (LDA), Locality Preserving Projections (LPP) [10] and Neighborhood Preserving Embedding (NPE) [9]. For the details of choosing $H^i$ for these algorithms, refer to [4].

In the SR algorithm, instead of directly deriving $a^i$ in Eq. (4.4), we first obtain the reduced data $y^i$ on one of the projection directions in the $i$-th source task by solving the eigen-problem in Eq. (4.3). We then perform the second step to obtain $a^i$ which satisfies $X_i^T a^i = y^i$. It turns out to be a typical regression problem, i.e., a least squares problem.

$$a^i = \arg\min_{a^i} \sum_{j=1}^{n_i} (a^{iT} x_j^i - y_j^i)^2 \tag{4.5}$$

4.2 Gaussian Process for Regression

As a probabilistic regression approach, the Gaussian Process (GP) model [15] is widely used in machine learning as a nonlinear regression technique. The following is a brief review of Gaussian process for regression.

We are given a training set $\mathcal{D} = \{x_i, y_i\}_{i=1}^{n}$ of $n$ pairs of (vectorial) inputs $x_i$ and noisy (real, scalar) outputs $y_i$. We denote $X$ as the training data $\{x_1, x_2, \ldots, x_n\}$, and denote $y$ as $\{y_1, y_2, \ldots, y_n\}$. The goal of Gaussian process for regression is to compute the predictive distribution of the function values $f_*$ (or noisy $y_*$) at test data $x_*$. We assume that the noise is additive, independent and Gaussian, such that

$$y_i = f(x_i) + \epsilon \quad \text{with} \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

where $\sigma^2$ is the variance of the noise, and $\mathcal{N}(\mu, \sigma^2)$ denotes the Gaussian distribution with mean $\mu$ and covariance $\sigma^2$.

Gaussian process for regression is a kind of Bayesian method which assumes a GP prior over functions, i.e., the function values behave a priori according to

$$p(f \mid x_1, x_2, \ldots, x_n) = \mathcal{N}(0, K)$$

where $f = [f(x_1), f(x_2), \ldots, f(x_n)]^T$ is a vector of latent function values and $K$ is a covariance matrix, whose entries are given by a covariance function, $K_{ij} = k(x_i, x_j; \theta)$, where $\theta$ denotes all hyperparameters in the covariance function. The covariance function $k(x_i, x_j; \theta)$ encodes one's prior knowledge about the

function over the data. A typical covariance function is the Gaussian kernel

$$k(x_i, x_j; \theta) = \exp\left(-\frac{1}{2w^2} \|x_i - x_j\|^2\right)$$

where the hyperparameter $w \in \theta$ is called the width of the Gaussian kernel and controls the decay rate of the covariance. The predictive distribution of a GP model for the test data $x_*$ is given by [15]:

$$p(f(x_*) \mid y, X, \theta) = \mathcal{N}\left(K_{*,f}(K_{f,f} + \sigma^2 I)^{-1} y,\; K_{*,*} - K_{*,f}(K_{f,f} + \sigma^2 I)^{-1} K_{f,*}\right)$$

where $K_{*,*} = k(x_*, x_*; \theta)$, $K_{f,*}^T = K_{*,f} = [k(x_*, x_1; \theta), \ldots, k(x_*, x_n; \theta)]$ and $K_{f,f}$ is an $n \times n$ matrix whose entries are denoted by $k(x_i, x_j; \theta)$ where $i, j \in \{1, 2, \ldots, n\}$. Thus there exists $\alpha = [\alpha_1, \ldots, \alpha_n]^T$, namely $\alpha = (K_{f,f} + \sigma^2 I)^{-1} y$, such that the predictive value can be represented as

$$y_* = \sum_{j=1}^{n} \alpha_j k(x_*, x_j; \theta) = \alpha^T K_{f,*} \tag{4.6}$$

A solution of the GP regression given by Eq. (4.6) is straightforward with a given covariance function $k(x_i, x_j; \theta)$. However, it is not easy to define an appropriate covariance function for the regression problem at hand. Thus, in order to make the GP model a practical tool, it is essential to address the problem of learning the hyperparameters $\theta$.

A common choice of the objective function for setting the hyperparameters is the marginal likelihood $p(y \mid X, \theta)$. For the GP regression, it can be derived in the following form by integrating out $f$.

$$\log p(y \mid X, \theta) = -\frac{n}{2} \log(2\pi) - \frac{1}{2} \log |K + \sigma^2 I| - \frac{1}{2} y^T (K + \sigma^2 I)^{-1} y \tag{4.7}$$

The marginal likelihood can be maximized with respect to the hyperparameters $\theta$ by using a gradient-based algorithm. For more details, refer to [15].

5 Our Framework: GPDRTL

In this section, we first discuss the basic idea of our GPDRTL framework. We then derive two variants of our framework, considering different loss functions and different regularizers in the target task.

5.1 Basic Idea

We can see that the first step of the SR algorithm is to convert the problem of dimensionality reduction into a regression problem by obtaining the reduced data without knowing the transformation matrix. Due to the nature of Gaussian process for regression, it is able to serve as the regression solver in the SR algorithm. Therefore, we gain access to the underlying relationship between the original data and the reduced data in each source task. In addition, Gaussian process for regression is capable of making predictions for unobserved data. The relationship on data between each source task and the target task can thus also be implicitly encoded.

In our problem setting, for the $i$-th source task, we suppose the first $L$ eigenvectors in the eigen-problem (4.3) are represented by $\{y_1^i, y_2^i, \ldots, y_L^i\}$, which are also regarded as the reduced data in the $i$-th source task. For the $i$-th source task ($i = 1, \ldots, m$) and the $l$-th dimensionality of the reduced data $y_l^i$ ($l = 1, 2, \ldots, L$), we consider the data set $D_l^i$ defined as follows:

$$D_l^i = \{(x_1^i, y_{l1}^i), (x_2^i, y_{l2}^i), \ldots, (x_{n_i}^i, y_{l n_i}^i)\}$$

where $y_{lj}^i$ is the $j$-th component of $y_l^i$ for $j = 1, \ldots, n_i$. In total, there are $L \times m$ GP models. According to Eq. (4.6), for the $i$-th source task, the predictive value $z_{lp}^i$ on the $l$-th dimensionality of the reduced data of $x_p^0 \in X^0$ ($p = 1, 2, \ldots, n^0$) in the target task is computed as follows.

$$z_{lp}^i = \sum_{j=1}^{n_i} \alpha_j^{il}\, k(x_p^0, x_j^i; \theta_l^i) \tag{5.8}$$

where $\alpha_j^{il}$ corresponds to the weights of the GP models as defined in Eq. (4.6) and $\theta_l^i$ represents the hyperparameters of the covariance function learnt under the marginal likelihood criterion defined in Eq. (4.7).

Since we have $L$ eigenvectors for each source task, without loss of generality, we assume that the eigenvectors are sorted in descending order of the corresponding eigenvalues. Our framework is built on the assumption that the $l$-th eigenvectors of the source tasks have a latent relationship with the $l$-th eigenvector of the target task, which can be interpreted as the reduced data of the target task in the $l$-th dimensionality. That is, we can view our framework as $L$ regression problems in parallel. Specifically, the reduced data of the target task in the $l$-th dimensionality is correlated with the predictions from the GP models for the $l$-th dimensionality of the reduced data from multiple source tasks.

Basically, we consider that the objective function of our framework consists of two parts: the loss function of the target task and a regularizer. Since the data of the target task are unlabeled, the loss function of the target task is defined as a kind of unsupervised learner. The utility of the regularizer is to help the unsupervised learner to obtain optimal reduced data by transferring useful knowledge from multiple source tasks to the target task.
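The two SR steps that produce the reduced source data $\{y_1^i, \ldots, y_L^i\}$ referred to above can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions: the function names, the LDA-style class graph used to build $H^i$, and the small ridge term in the least-squares step are choices made here, not details given in the paper. The generalized eigen-problem (4.3) is solved through the symmetric form $D^{-1/2} H D^{-1/2}$, which assumes every row sum of $H$ is positive.

```python
import numpy as np

def class_graph(labels):
    """Assumed LDA-style graph: H_pq = 1/n_c if p and q share class c, else 0."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    counts = same.sum(axis=1)              # size of the class of each point
    return same / counts[:, None]

def sr_embedding(H, L):
    """Top-L solutions of H y = lambda D y (Eq. 4.3), D = diag(row sums of H)."""
    d = H.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(d)
    # Symmetric form: (D^-1/2 H D^-1/2) u = lambda u, with y = D^-1/2 u.
    S = d_isqrt[:, None] * H * d_isqrt[None, :]
    evals, evecs = np.linalg.eigh(S)       # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:L]    # keep the L largest
    Y = d_isqrt[:, None] * evecs[:, order]
    return evals[order], Y                 # Y is n x L, each column has y^T D y = 1

def sr_regression(X, y, ridge=1e-8):
    """Second SR step (Eq. 4.5): a = argmin sum_j (a^T x_j - y_j)^2.

    X is D x n with data points as columns, as in the paper's notation.
    The ridge term is an assumption added for numerical stability.
    """
    A = X @ X.T + ridge * np.eye(X.shape[0])
    return np.linalg.solve(A, X @ y)
```

In GPDRTL the second step is replaced by GP regression, so `sr_regression` is shown only to make the linear case of Eq. (4.5) concrete.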


Considering different viewpoints of how to transfer the useful knowledge, we derive two variants of our framework. In the first variant, inspired by the idea of smoothness of function [7], we assume that the reduced data of the target task in the $l$-th dimensionality smoothly match the data predicted by the GP models for the data set $D_l^i$ ($i = 1, 2, \ldots, m$) from the $i$-th source task. The degree of the smoothness can be interpreted as the degree of similarity. We name the first variant GPDRTL with Function Smoothness Regularization (GPDRTL-FSR).

In the second variant, by using the GP model for the target task, we preserve the weight information between the original data in the target task and the reduced data predicted from each source task. Such weight information encodes an implicit relationship between the data in the target task and the data in each source task. Inspired by the idea of modeling task relationships [26], we design a regularizer to describe both task relationships and the data relationship between each source task and the target task. We name the second variant GPDRTL with Data and Task Relationship Regularization (GPDRTL-DTRR).
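Both variants rely on one GP model per source task and per reduced dimensionality. A minimal NumPy sketch of such a model is given below, covering the Gaussian kernel, the weights $\alpha$ of Eq. (4.6), the log marginal likelihood of Eq. (4.7), and the prediction on target points of Eq. (5.8). Several things here are assumptions rather than the paper's implementation (which uses the GPML toolbox): the function names, the convention that data points are stored as rows, and the grid search over the kernel width, which stands in for gradient-based hyperparameter learning.

```python
import numpy as np

def gaussian_kernel(A, B, w):
    """k(a, b) = exp(-||a - b||^2 / (2 w^2)); A is n x D, B is m x D (rows are points)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * w**2))

def gp_fit(X, y, w, noise_var):
    """Return alpha = (K + sigma^2 I)^{-1} y and the log marginal likelihood (Eq. 4.7)."""
    n = X.shape[0]
    Ky = gaussian_kernel(X, X, w) + noise_var * np.eye(n)
    Lc = np.linalg.cholesky(Ky)                      # Ky = Lc Lc^T
    alpha = np.linalg.solve(Lc.T, np.linalg.solve(Lc, y))
    log_det = 2.0 * np.sum(np.log(np.diag(Lc)))
    lml = -0.5 * n * np.log(2.0 * np.pi) - 0.5 * log_det - 0.5 * y @ alpha
    return alpha, lml

def gp_predict_mean(X_train, alpha, X_test, w):
    """Predictive mean of Eq. (4.6); with target points as X_test this is Eq. (5.8)."""
    return gaussian_kernel(X_test, X_train, w) @ alpha

def select_width(X, y, widths, noise_var):
    """Assumed stand-in for hyperparameter learning: best width on a grid by Eq. (4.7)."""
    return max(widths, key=lambda w: gp_fit(X, y, w, noise_var)[1])
```

For source task $i$ and dimensionality $l$, one would call `gp_fit` on the source points with targets $y_l^i$ and then `gp_predict_mean` on the target points to obtain the vector $z_l^i$.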

5.2 The First Variant: GPDRTL-FSR

Recall that the reduced data in the target task is defined as $Y = \{y_1, y_2, \ldots, y_L\}$, i.e., the reduced data in the $l$-th dimensionality is represented as $y_l$ ($l = 1, 2, \ldots, L$). Given the $l$-th dimensionality, the prediction for the reduced data $y_l$ in the target task can be obtained by GP models running on the data set $D_l^i$ ($i = 1, 2, \ldots, m$) from the $i$-th task. Combining the predictions from the $m$ source tasks together, we have an $n^0 \times m$ matrix $Z_l = [z_l^1, z_l^2, \ldots, z_l^m]$ where $z_l^i = [z_{l1}^i, z_{l2}^i, \ldots, z_{l n^0}^i]^T$, and each element in $z_l^i$ is defined in Eq. (5.8). Given a specified value for $l$, it is natural to assume that $y_l$ is similar to $z_l^i$ ($i = 1, 2, \ldots, m$). This idea can be viewed as a kind of smoothness of function [7].

Let $Z = [Z_1, Z_2, \ldots, Z_L]$ by integrating $Z_l$ ($l = 1, 2, \ldots, L$). By combining all predictions from the source tasks and the reduced data in the target task, we have $F = [Z, Y] = [Z_1, Z_2, \ldots, Z_L, Y] \in \mathbb{R}^{n^0 \times (m+1)L}$. We use $f_p$ to denote the $p$-th column of $F$. We then define a regularizer based on the smoothness of function in the following way.

$$\delta = \frac{1}{2} \sum_{p,q} G_{pq} \|f_p - f_q\|^2 \tag{5.9}$$

where $G$ is a similarity matrix defined as follows:

$$G_{qp} = \begin{cases} \exp\left(-\frac{1}{\gamma} \|f_q - f_p\|^2\right), & f_q \,\infty\, f_p \\ 0, & \text{otherwise} \end{cases}$$

where $f_q \,\infty\, f_p$ represents a special relationship between the predictions $z_l^i$ ($i = 1, 2, \ldots, m$) from the GP models and the reduced data $y_l$ in the target task. Specifically, we only consider the similarity degree between $y_l$ and $z_l^i$ ($i = 1, 2, \ldots, m$) with the same value for $l$. It is easy to see that $G$ is a symmetric matrix. Eq. (5.9) can be rewritten as follows:

$$\delta = \mathrm{tr}(F S F^T) \tag{5.10}$$

where $S = R - G$ is defined as a graph Laplacian matrix, and $R$ is a diagonal matrix whose entries are the column sums of $G$, such that $R_{pp} = \sum_q G_{pq}$.

Now we consider the following objective function for the first variant (GPDRTL-FSR):

$$\min_{W^T W = I,\; Y} \|W^T X^0 - Y^T\|_F^2 + \lambda\, \mathrm{tr}(F S F^T) \tag{5.11}$$

where $\lambda$ is a parameter to control the balance between the two terms. The first part of Eq. (5.11) defines the loss function in an unsupervised way. The reduced data $Y$ in the target task is smoothly controlled by the regularization defined in the second part of Eq. (5.11). The application of GP here is to model the underlying relationship between the original data and the reduced data in each source task. Since the GP model is able to make predictions for the data in the target task, the relationship on data between each source task and the target task is, therefore, automatically and implicitly established.

Since $S$ is a symmetric matrix, we are able to handle $S$ as a block matrix, such that

$$S = \begin{bmatrix} A & C \\ C^T & B \end{bmatrix}$$

where $A$ is an $mL \times mL$ matrix, $B$ is an $L \times L$ matrix, and $C$ is an $mL \times L$ matrix. Due to the specific structure of $S$, we can see that the block $C$ encodes the relationships between the $m$ source tasks and the target task, to which the variable $Y$ is only related. The second part of the objective function (5.11) can be written as follows:

$$\mathrm{tr}(F S F^T) = \mathrm{tr}\left( \begin{bmatrix} Z & Y \end{bmatrix} \begin{bmatrix} A & C \\ C^T & B \end{bmatrix} \begin{bmatrix} Z^T \\ Y^T \end{bmatrix} \right) = \mathrm{tr}(Z A Z^T) + 2\, \mathrm{tr}(Y C^T Z^T) + \mathrm{tr}(Y B Y^T)$$

The first part of the objective function (5.11) can be written as follows:

$$\|W^T X^0 - Y^T\|_F^2 = \mathrm{tr}(W^T X^0 X^{0T} W) - 2\, \mathrm{tr}(W^T X^0 Y) + \mathrm{tr}(Y Y^T)$$

Therefore, after dropping the term $\mathrm{tr}(Z A Z^T)$, which does not depend on $W$ or $Y$, the optimal $Y$ and $W$ can be derived by solving the objective function written as follows:

$$J = \mathrm{tr}(W^T X^0 X^{0T} W) - 2\, \mathrm{tr}(W^T X^0 Y) + \mathrm{tr}(Y (I + \lambda B) Y^T) + 2 \lambda\, \mathrm{tr}(Y C^T Z^T) \quad \text{s.t.} \quad W^T W = I \tag{5.12}$$
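The construction of $F$, $G$ and $S$ and the block split of $S$ can be checked numerically with the short sketch below. It is written under an explicit assumption about the pairing rule: a nonzero similarity is placed only between the column $y_l$ of $Y$ and the columns $z_l^i$ of $Z_l$ with the same $l$, which is how the text describes the special relationship. The function names and the column ordering of $F$ are choices made here for illustration.

```python
import numpy as np

def fsr_similarity(Z_blocks, Y, gamma):
    """Build F = [Z_1, ..., Z_L, Y] and the similarity matrix G of Eq. (5.9).

    Z_blocks: list of L arrays, each n0 x m (GP predictions from the m source tasks).
    Y: n0 x L array (reduced target data).
    Assumption: only column pairs (z_l^i, y_l) with the same l are linked.
    """
    L = len(Z_blocks)
    m = Z_blocks[0].shape[1]
    F = np.hstack(list(Z_blocks) + [Y])
    G = np.zeros((F.shape[1], F.shape[1]))
    for l in range(L):
        y_col = m * L + l                      # column of y_l inside F
        for i in range(m):
            z_col = l * m + i                  # column of z_l^i inside F
            g = np.exp(-np.sum((F[:, z_col] - F[:, y_col]) ** 2) / gamma)
            G[z_col, y_col] = G[y_col, z_col] = g
    return F, G

def graph_laplacian(G):
    """S = R - G with R the diagonal matrix of column sums of G."""
    return np.diag(G.sum(axis=0)) - G

def split_blocks(S, m, L):
    """Return the blocks A (mL x mL), B (L x L) and C (mL x L) of S."""
    k = m * L
    return S[:k, :k], S[k:, k:], S[:k, k:]
```

With these pieces, `np.trace(F @ S @ F.T)` agrees with the pairwise sum in Eq. (5.9) and with the three-term expansion in terms of $A$, $B$ and $C$.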


Although the objective function (5.12) is not convex, it can be iteratively optimized with respect to $W$ by fixing $Y$, and vice versa. In general, the optimization procedure of the problem (5.12) is as follows.

Optimizing W when Y is fixed. The objective function (5.11) becomes the following Procrustes problem [8].

$$\min_{W^T W = I} \|W^T X^0 - Y^T\|_F^2 \tag{5.13}$$

This problem is equivalent to finding the nearest orthogonal matrix $W$ to a given matrix $X^0 Y$. To find the orthogonal matrix, let the thin Singular Value Decomposition (SVD) of $X^0 Y$ be $U \Sigma V^T$, such that $U$ is an orthogonal matrix with the size $D \times L$ and $V$ is an orthogonal matrix with the size $L \times L$. We then have $W = U V^T$.

Optimizing Y when W is fixed. Let $\frac{\partial J}{\partial Y} = 0$, then we have

$$Y (I + \lambda B) = X^{0T} W - \lambda Z C$$

Thus, the solution of $Y$ is given by $Y = (X^{0T} W - \lambda Z C)(I + \lambda B)^{-1}$.

The above two steps are alternated until the convergence condition is satisfied. The whole optimization procedure of the objective function (5.11) is summarized in Algorithm 1.

Algorithm 1 GPDRTL-FSR
Input: Source tasks and the target task, the reduced dimensionality $L$
Output: The transformation matrix $W$
1: For each $i$-th source task $T_i$, we perform Eq. (4.3) to obtain the reduced data $\{y_1^i, y_2^i, \ldots, y_L^i\}$.
2: For each $i$-th source task $T_i$ and each $l$-th reduced dimensionality, we run the GP model and perform Eq. (5.8) to predict the reduced data of the target task $z_l^i$.
3: Initialize $Y$ by using PCA.
4: Set $t = 1$
5: while not converged and $t \le T$ do
6: Fix $Y$, solve the Procrustes problem (5.13) with respect to $W$.
7: Fix $W$, optimize $Y$ by using $\frac{\partial J}{\partial Y} = 0$.
8: $t = t + 1$
9: end while

5.3 The Second Variant: GPDRTL-DTRR

We first discuss the loss function of the objective function. Considering a regression problem over a reduced vector space $Y$ for the target task, the loss function described below can be reformulated as a so-called standard multiple GP regression [2]. Here, we use a simplified version of multiple GP regression by stacking up several GP models over each dimensionality. For this purpose, let the predictive model for the data in the target task $x_p^0$ ($x_p^0 \in X^0$) on the $l$-th projection direction be given by

$$f(x_p^0) = \sum_{j=1}^{n^0} \beta_j^l\, k_f(x_p^0, x_j^0) = \beta_l^T K_f(x_p^0) \tag{5.14}$$

where $K_f(x_p^0) = [k_f(x_p^0, x_1^0), \ldots, k_f(x_p^0, x_{n^0}^0)]^T$ and $\beta_l = [\beta_1^l, \beta_2^l, \ldots, \beta_{n^0}^l]^T$. By expanding Eq. (5.14) to the case considering all the data $x_p^0$ ($p = 1, 2, \ldots, n^0$) in the target task and all the $L$ dimensionalities, we have the following unsupervised GP model.

$$Y^T = V K_{f,f} + \epsilon$$

where $V = [\beta_1, \beta_2, \ldots, \beta_L]^T$ and $K_{f,f}$ is a kernel matrix over $X^0 = \{x_p^0\}_{p=1}^{n^0}$.

We then consider the regularization part of the objective function. In Eq. (5.8), for fixed $i$ and $l$, we obtain a new data set over the target task.

$$\{(x_1^0, z_{l1}^i), (x_2^0, z_{l2}^i), \ldots, (x_{n^0}^0, z_{l n^0}^i)\}$$

With the data set and the learnt kernel function $k(\cdot, \cdot; \theta_l^i)$¹, we learn new GP models over the data $x_p^0$ ($p = 1, 2, \ldots, n^0$) in the target task:

$$z_{lp}^i = \sum_{j=1}^{n^0} \Phi_j^{il}\, k(x_p^0, x_j^0; \theta_l^i) + \epsilon \tag{5.15}$$

where the weights $\Phi_j^{il}$ can be easily determined by the standard GP regression, and the weight vector is defined as $\Phi^{il} = [\Phi_1^{il}, \Phi_2^{il}, \ldots, \Phi_{n^0}^{il}]^T$. By integrating the weight vectors from the $L$ dimensionalities, we define an $L \times n^0$ matrix $U_i = [\Phi^{i1}, \Phi^{i2}, \ldots, \Phi^{iL}]^T$. We can consider that $U_i$ indirectly depicts the relationship on data over the $L$ dimensionalities between each source task and the target task.

¹ Actually, in the implementation, we are only interested in the learnt kernel hyperparameters $\theta_l^i$.

Inspired by the idea of modeling the task relationships [26], we design a regularizer considering both task relationships and the data relationship between each source task and the target task. The regularizer is defined as follows:

$$\mathrm{tr}(\tilde{\Sigma}\, \Omega\, \tilde{\Sigma}^T)$$

e = (vec(U1 ), vec(U2 ), . . . , vec(Um ), vec(V)). Ω M = Pm ωi Ui where ωi is the i-th element of ω . We where Σ i=1 is a symmetric matrix describing the task relationships. denote M as [m1 , m2 , . . . , mL ]T . When ω and σ are fixed, the objective function It is defined as follows: µ ¶ (5.16) can be written as: σIm ω Ω= ωT 1 min kYT − VKf,f k2F + λ01 kVk2F Y,V where ω denotes the task relationships between the +λ02 tr(VT M) target task and the source tasks. If Ω is considered (5.17) as a Laplacian matrix, the regularizer becomes a kind where λ01 = λ1 + λ2 and λ02 = 2λ2 . The obof manifold regularization [10] defined on the space of solved by an iterative e Finally, the objective function of the second variant jective function (5.17) can be Σ. procedure. Note that tr(VT M) can be written as is: β TL mL ). β T1 m1 +, . . . , +β tr(β First, when Y is fixed, this problem can be decommin kYT − VKf,f k2F + λ1 kVk2F Y,V posed into L separable optimization problems. That is, e Σ eT ) for l = 1, 2, . . . , L, we have +λ2 tr(ΣΩ s.t. Ωº0 β l k2F min kyiT − β Tl Kf,f k2F + λ01 kβ βl tr(Ω) = c (5.16) β Tl ml ) +λ02 tr(β (5.18) where the first term defines the loss function over the target task, the second term represents the penalty of Lemma 5.1. We assume that both a and b are real the complexity for V. Both λ1 and λ2 are positive vectors in the column form. The following inequality real parameters to control the balance among the three is satisfied. kak kbk ≥ tr(aT b), where k · k represents 1 1 1 terms. Since Ω is considered as a Laplacian matrix, the l norm. 1 we can impose a positive semidefinite constraint for it. The last constraint in Eq. (5.16) is to restrict the It is easy to prove this lemma by noting that, for scale of Ω. The innovation of this objective function any two numbers a and b, we have |a| · |b| ≥ ab. We consists in two points. The first one is that we formulate omit the proof here for brevity. 
the loss function for the target task in a different According to Lemma 5.1, the problem (5.18) can be viewpoint where the multiple GP regression model is relaxed as a conventional elastic net problem [28]. used. The other point is the design of the novel regularizer, considering both task relationships and the β l k2F min kyiT − β Tl Kf,f k2F + λ01 kβ βl relationship on data between each source task and the target task. β l k1 (5.19) +λ02 kml k1 kβ In the sequel, we discuss how to solve the objective function (5.16). Although it is not a convex problem, Second, when V is fixed, the optimization problem the optimization can be adaptively performed in the (5.17) becomes the following problem. following way. (5.20) min kYT − VKf,f k2F Y Optimizing w.r.t V and Y when ω and σ are fixed. Since we assume the reduced data for the source tasks Since tr(Ω) = c, it is easy to get σ = c−1 We are orthogonal, it is natural to assume that the reduced m . e as the following form [U, vec(V )] where data for the target task, Y, has the same property, define Σ e Σ e T ) such that YT Y = I. Therefore, the problem (5.20) [6] U = [vec(U1 ), vec(U2 ), . . . , vec(Um )]. Then, tr(ΣΩ can be written as: becomes a Procrustes problem [8]. Let the thin SVD of T Kf,f VT be U ΣV T . Then Y = U V T . e Σ eT ) tr(ΣΩ µ µ ¶µ ¶¶ ¡ ¢ σIm ω UT Optimizing w.r.t ω and σ when V and Y U vec(V) = tr ωT 1 vec(V)T are fixed. Since we have constraints Ω º 0 and tr(Ω) = c, this ω vec(V)T ) + kVk2F = σtr(UUT ) + 2tr(Uω ω T ω ≤ c − 1 according constraint is equivalent to mω ω . We can to the Schur complement [3]. When optimizing with Let M be a matrix such that vec(M) = Uω see that M is a combination of Ui i = 1, 2, . . . , m as respect to ω and σ, the optimization problem is


Copyright © SIAM. Unauthorized reproduction of this article is prohibited.

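formulated as:

$$\begin{aligned} \min_{\boldsymbol{\omega},\sigma,\Omega}\quad & \mathrm{tr}(\tilde{\Sigma}\Omega\tilde{\Sigma}^T) \\ \text{s.t.}\quad & m\boldsymbol{\omega}^T\boldsymbol{\omega} \le c - 1, \\ & \Omega = \begin{pmatrix} \sigma I_m & \boldsymbol{\omega} \\ \boldsymbol{\omega}^T & 1 \end{pmatrix} \end{aligned} \tag{5.21}$$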
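The equivalence between the constraints on $\Omega$ and the scalar inequality $m\boldsymbol{\omega}^T\boldsymbol{\omega} \le c - 1$ can be checked numerically. The following sketch is not part of the original method. It builds $\Omega$ for a given $\boldsymbol{\omega}$ with $\sigma = (c-1)/m$ and tests positive definiteness with a small hand-written Cholesky factorization. The helper names (`build_omega`, `is_positive_definite`, `schur_condition`) and the sample values of $\boldsymbol{\omega}$ are illustrative assumptions, and strict inequalities are used so that the check stays away from the semidefinite boundary.

```python
import math


def build_omega(omega, c):
    """Build the (m+1) x (m+1) matrix [[sigma*I_m, omega], [omega^T, 1]]
    with sigma = (c - 1) / m, so that tr(Omega) = c."""
    m = len(omega)
    sigma = (c - 1.0) / m
    size = m + 1
    mat = [[0.0] * size for _ in range(size)]
    for i in range(m):
        mat[i][i] = sigma
        mat[i][m] = omega[i]
        mat[m][i] = omega[i]
    mat[m][m] = 1.0
    return mat


def is_positive_definite(mat, tol=1e-12):
    """Return True if a Cholesky factorization succeeds, i.e. every
    diagonal pivot is strictly positive (hypothetical helper)."""
    n = len(mat)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = mat[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if not s > tol:
                    return False
                low[i][i] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


def schur_condition(omega, c):
    """Strict version of the scalar constraint m * omega^T omega <= c - 1."""
    m = len(omega)
    return c - 1.0 > m * sum(w * w for w in omega)


# Toy-example sizes from the paper: m = 2 source tasks, c = 3 (so sigma = 1).
for omega in ([0.5, 0.5], [0.9, 0.9]):
    print(omega, is_positive_definite(build_omega(omega, 3.0)),
          schur_condition(omega, 3.0))
```

For both sample vectors the two tests agree, which is what the Schur complement argument predicts when $\sigma > 0$.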

The optimization problem (5.21) is an SDP problem, which can be solved with a standard solver. The above two steps are performed alternately until the convergence condition is satisfied. The whole optimization procedure for the objective function (5.16) is summarized in Algorithm 2.

Algorithm 2 GPDRTL-DTRR
Input: Source tasks and the target task, the reduced dimensionality $L$
Output: $Y$
1: For each $i$-th source task $T_i$, perform Eq. (4.3) to obtain the reduced data $\{\mathbf{y}_1^i, \mathbf{y}_2^i, \ldots, \mathbf{y}_L^i\}$.
2: For each $i$-th source task $T_i$ and each $l$-th reduced dimensionality, run the GP model and perform Eq. (5.8) to predict the reduced data of the target task $\mathbf{z}_l^i$.
3: Initialize $Y$ by using PCA.
4: Set $t = t_1 = 1$
5: while not converged and $t \le T$ do
6:     while not converged and $t_1 \le T_1$ do
7:         for $l = 1$ to $L$ do
8:             Fix $Y$, $\boldsymbol{\omega}$ and $\sigma$, solve the elastic net problem (5.19) with respect to $\boldsymbol{\beta}_l$.
9:         end for
10:        Fix $V$, $\boldsymbol{\omega}$ and $\sigma$, solve the problem (5.20) with respect to $Y$.
11:        $t_1 = t_1 + 1$
12:    end while
13:    Fix $V$ and $Y$, solve the SDP problem (5.21) with respect to $\boldsymbol{\omega}$ and $\sigma$.
14:    $t = t + 1$
15: end while
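Step 8 of Algorithm 2 solves one elastic net problem per reduced dimension. The paper does not state which elastic net solver is used. As one possibility, the sketch below applies cyclic coordinate descent with soft-thresholding to an objective of the same form as Eq. (5.19), writing `X` for $K_{f,f}^T$, `y` for $\mathbf{y}_l$, `lam1` for $\lambda'_1$ and `gamma` for $\lambda'_2\|\mathbf{m}_l\|_1$. The function names, the iteration count and the tiny inputs are illustrative assumptions.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if -z > t:
        return z + t
    return 0.0


def elastic_net_cd(X, y, lam1, gamma, n_iter=100):
    """Minimize ||y - X b||^2 + lam1 * ||b||_2^2 + gamma * ||b||_1
    by cyclic coordinate descent. X is a list of rows, y a list.
    Assumes lam1 > 0 or that no column of X is all zeros."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    # Squared norm of each column of X, reused in every update.
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Inner product of column j with the residual that leaves b[j] out.
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * b[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            # Setting the subgradient w.r.t. b[j] to zero gives this update.
            b[j] = soft_threshold(rho, gamma / 2.0) / (col_sq[j] + lam1)
    return b


# With an identity design matrix the solution has a closed form per coordinate.
beta = elastic_net_cd([[1.0, 0.0], [0.0, 1.0]], [3.0, -1.0], lam1=1.0, gamma=2.0)
print(beta)
```

In the setting of Eq. (5.19), a larger $\|\mathbf{m}_l\|_1$ increases `gamma` and therefore drives more entries of $\boldsymbol{\beta}_l$ to zero.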

6 Experimental Results
In this section, we empirically study our two variants, GPDRTL-FSR and GPDRTL-DTRR, on artificial data sets and real-world data sets. We use the CVX solver² to solve the SDP problem (5.21), and use GPML³ to implement the Gaussian process model.

² http://www.stanford.edu/~boyd/software.html
³ http://www.gaussianprocess.org/gpml/code/matlab/doc/index.html


6.1 Toy example
We first generate a toy data set to conduct a “proof of concept” experiment before turning to the real data sets. The toy data sets are generated as follows. Suppose we have three types of tasks: type ‘A’, type ‘B’ and type ‘C’. For the type ‘A’ task, we have data of two classes drawn from Gaussian distributions, as shown in Fig. 1a. Their respective means are $(-1, 0)$ and $(1, 0)$; their covariance matrices are both $\begin{pmatrix} 0.05 & 0 \\ 0 & 50 \end{pmatrix}$. For the type ‘B’ task, we also have data of two classes drawn from Gaussian distributions, as shown in Fig. 1b. Their respective means are $(0, -1)$ and $(0, 1)$; their covariance matrices are both $\begin{pmatrix} 50 & 0 \\ 0 & 0.05 \end{pmatrix}$. For the type ‘C’ task, the distributions of the two-class data follow those of type ‘B’. However, their class labels are different, as shown in Fig. 1c. We examine GPDRTL-FSR and GPDRTL-DTRR in three cases. In the first case, the task sequence is { type ‘A’, type ‘A’, type ‘A’ }. In the second case, it is { type ‘A’, type ‘B’, type ‘A’ }. In the third case, it is { type ‘C’, type ‘B’, type ‘A’ }. In all three cases, the first two tasks are treated as source tasks, and the last task is treated as the target task. In each source task, the class labels of the data are available, while the data in the target task are unlabeled. The two source tasks and the target task are very similar in the first case, while both source tasks differ from the target task in the third case. Our goal is to derive a projection direction in the target task from the 2-dimensional space to a 1-dimensional space, such that the data of the two classes are well separated along that direction. The following parameters are set for the toy example: $m = 2$, $D = 2$, $L = 1$, $c = 3$. For each source task in the three cases, supervised Locality Preserving Projections (LPP) is used in the SR algorithm to obtain the reduced data.
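The type ‘A’ and type ‘B’ distributions described above can be sampled with a few lines of standard-library Python. This is only a sketch of the data generation: the number of points per class, the random seed and all function names are assumptions, since the text does not state them. Type ‘C’ is omitted because the text says only that its labels differ from type ‘B’ and does not give the relabeling rule.

```python
import random


def sample_task(means, variances, n_per_class, rng):
    """Draw n_per_class 2-D points per class from axis-aligned Gaussians.
    means: one (x, y) mean per class; variances: (var_x, var_y), shared
    by both classes because the covariance matrices are diagonal."""
    std_x, std_y = variances[0] ** 0.5, variances[1] ** 0.5
    points, labels = [], []
    for label, (mean_x, mean_y) in enumerate(means):
        for _ in range(n_per_class):
            points.append((rng.gauss(mean_x, std_x), rng.gauss(mean_y, std_y)))
            labels.append(label)
    return points, labels


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)   # fixed seed, chosen arbitrarily
N_PER_CLASS = 200        # assumption: sample size is not given in the text

type_a = sample_task([(-1.0, 0.0), (1.0, 0.0)], (0.05, 50.0), N_PER_CLASS, rng)
type_b = sample_task([(0.0, -1.0), (0.0, 1.0)], (50.0, 0.05), N_PER_CLASS, rng)

# For type 'A' the classes separate along x, but most variance lies along y.
var_x = variance([p[0] for p in type_a[0]])
var_y = variance([p[1] for p in type_a[0]])
print(var_x, var_y)
```

Because the variance along the vertical axis dominates in type ‘A’, an unsupervised method that keeps the direction of largest variance would be expected to pick a nearly vertical direction, which does not separate the two classes.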
In the first case, Fig. 1d shows that the two projection directions predicted by the GP models for the two source tasks are almost the same, because the target task is very similar to the two source tasks. Fig. 1e shows the results of our two variants and PCA in the first case. Our two variants are clearly superior to PCA, probably because the useful knowledge from the two source tasks is well transferred to the target task. In the second case, since the distribution of the second source task is different from that of the target task, we can see from Fig. 1f that the direction predicted by the GP models for the second source task is worse than that predicted from the first source task. However, Fig. 1g shows that GPDRTL-DTRR almost retains the best performance. Although GPDRTL-FSR still outperforms PCA, the predictive direction of GPDRTL-FSR seems to be a tradeoff between the predictive direction from the first source task and that from the second source task. The reason is probably that the regularizer for the smoothness of function restricts the reduced data in the target task to be similar to the two directions predicted by the GP models for the two source tasks.

Figure 1: Toy example. Panels: (a) Type A; (b) Type B; (c) Type C; (d) GP prediction (1st case); (e) Results (1st case); (f) GP prediction (2nd case); (g) Results (2nd case); (h) GP prediction (3rd case); (i) Results (3rd case). Legend entries: 1st Source Task, 2nd Source Task; GPDRTL-DTRR, GPDRTL-FSR, PCA.

In the third case, Fig. 1i shows that the performance of GPDRTL-FSR becomes worse than in the second case shown in Fig. 1g, as the direction for dimensionality reduction tends more toward a vertical one. The reason is probably that the distributions of both source tasks are different from that of the target task, so that the predictive directions given by the GP models for the two source tasks, as shown in Fig. 1h, are worse than those shown in Fig. 1f. This suggests that GPDRTL-FSR may require the distributions of the source tasks to be similar to that of the target task. Fig. 1i shows that GPDRTL-DTRR still outperforms other
methods. The reason is probably that the task relationships are delicately modeled in GPDRTL-DTRR, such that it is able to mitigate the negative effect incurred by irrelevant tasks [25].

6.2 Experiments on real data sets
In this section, we perform experiments on the real data sets to evaluate the effectiveness of our proposals. In our experiments, we compare GPDRTL-FSR and GPDRTL-DTRR with the following typical methods: k-means (KM), PCA+k-means (PCA+KM), TCA [14] and TDA [22]. The two methods [23, 26] are not compared because their problem settings, unlike ours, assume that a few data in the target task are labeled. We use two metrics, clustering accuracy (AC) [24] and F1-Score [5], to measure the clustering performance. For each data set, we repeated the experiments for 10 trials, and report the averages and standard deviations.

Figure 2: The results on wine data set. Panels: (a) ACC: Wine-W; (b) F1-Score: Wine-W; (c) Wine-W (LDA); (d) ACC: Wine-R; (e) F1-Score: Wine-R; (f) Wine-R (LDA). Legend entries: PCA+KM, TCA, FSR+KM, DTRR+KM, Kmeans; ACC, FSCORE. Axis labels: ACC, Percentage.

Recall that the useful knowledge from multiple source tasks we define in our paper includes either label information or the local structure of the data. In our experiments, Linear Discriminant Analysis (LDA) and supervised Locality Preserving Projections (LPP) are employed in the SR algorithm to obtain the reduced data of each source task. Note that the supervised LPP algorithm is able to preserve the local structure of data by using label information.

The parameter settings of GPDRTL are as follows. The value of the balance parameter $\lambda$ in GPDRTL-FSR is set to 100. The values of the balance parameters $\lambda_1$ and $\lambda_2$ in GPDRTL-DTRR are set to 0.01. The value of $c$ in the objective function (5.16) of GPDRTL-DTRR is set to the total number of tasks. Since the objective functions of our two variants are solved iteratively, we set the maximum number of iterations to $T = T_1 = 10$. For supervised LPP, we used a weighted 5-nearest-neighbor graph. In addition, the Gaussian kernel function was used, with the decay rate set to 5. When supervised LPP is used in the SR algorithm, we examine the performance as the reduced dimensionality varies. When LDA is used in the SR algorithm, the reduced dimensionality in the target task is set to the number of classes minus one. Note that the TCA and TDA methods consider only one source task and one target task. To compare them with the other methods, we simply combine the multiple source tasks into a single set of source data. Because of its discriminative nature, TDA can be compared only in the case where LDA is employed in the SR algorithm.
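Clustering accuracy (AC), one of the two metrics named in Section 6.2, requires matching cluster indices to class labels before agreements are counted. The exact definition is the one given in [24] and is not restated in this section. The sketch below assumes the common best-mapping form, in which AC is the largest fraction of correctly assigned points over all one-to-one mappings from clusters to classes. It enumerates permutations, which is practical only for a handful of clusters; the function name is an illustrative choice.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Best-mapping clustering accuracy (assumed definition).
    Tries every one-to-one assignment of clusters to classes and
    returns the highest fraction of points whose mapped cluster
    equals the true class."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    # Pad with None so a one-to-one mapping exists when there are
    # more clusters than classes; a cluster mapped to None scores 0.
    size = max(len(classes), len(clusters))
    candidates = classes + [None] * (size - len(classes))
    best_hits = 0
    for perm in permutations(candidates, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for t, c in zip(true_labels, cluster_ids)
                   if mapping[c] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


truth = [0, 0, 1, 1, 2, 2]
print(clustering_accuracy(truth, [1, 1, 0, 0, 2, 2]))  # relabeled but perfect
print(clustering_accuracy(truth, [1, 1, 0, 0, 2, 0]))  # one point misplaced
```

For larger numbers of clusters, an assignment solver such as the Hungarian algorithm would replace the exhaustive search; that substitution is outside what the paper describes.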

6.2.1 Experiments on Wine Quality Data Set
The wine data set⁴ is about wine quality and includes red and white wine samples. The 11 features include objective tests, e.g., pH values, and the output is based on sensory data. The labels are given by experts with grades between 0 (very bad) and 10 (very excellent). We observe that only the grades from 3 to 8 are shared by both red wine and white wine. Therefore, we extract the records with grades from 3 to 8 as the data sets used in our experiment. In total, we have 1599 records for the red wine and 4893 for the white wine. Each task is treated in turn as the target task, with the other task as the source task. We denote by Wine-R the data set where the red wine is regarded as the target task, and by Wine-W the data set where the white wine is regarded as the target task.

6.2.2 Experiments on 20 Newsgroups Data Set
We performed another experiment on the 20 Newsgroups corpus⁵, which consists of approximately 20000 newsgroup documents collected evenly from 20 different newsgroups. The documents from several different topics of newsgroups are related. For example, the newsgroups rec.sport.baseball and rec.sport.hockey are both relevant to recreation. According to the typical setting of transfer learning, we employed the strategy from [5] to construct the data sets in the following way. First, we applied the typical pre-processing steps [19]: (1) removed stop words; (2) ignored file headers; (3) selected top words by mutual information. In our experiments, we selected the top 100 words as features. For each task, we then randomly selected 400 documents for each upper-level category, i.e., sci and talk. Specific details of the data set NG1 are described in Table 1. We denote by NG1-i (i = 1, 2, 3, 4) the data set where the i-th task from NG1 is regarded as the target task and the remaining tasks are the source ones.

⁴ http://archive.ics.uci.edu/ml/datasets/Wine+Quality/
⁵ http://people.csail.mit.edu/jrennie/20Newsgroups/

Figure 3: The results on the NG1 data set. Panels: (a) ACC: NG1-1; (b) F1-Score: NG1-1; (c) ACC: NG1-2; (d) F1-Score: NG1-2; (e) ACC: NG1-3; (f) F1-Score: NG1-3; (g) ACC: NG1-4; (h) F1-Score: NG1-4; (i) NG1-1 (LDA); (j) NG1-2 (LDA); (k) NG1-3 (LDA); (l) NG1-4 (LDA). Legend entries: PCA+KM, TCA, FSR+KM, DTRR+KM, Kmeans; ACC, FSCORE. Axis labels: ACC, FSCORE, Percentage.

Table 1: Statistics of the NG1 data set

Task    Class label: sci    Class label: talk    #doc
1       crypt               politics.guns        800
2       electronics         politics.mideast     800
3       med                 politics.misc        800
4       space               religion.misc        800
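The NG1-i protocol, in which each task in turn serves as the target and the remaining tasks serve as sources, can be written down directly from Table 1. The sketch below only illustrates how the four data sets are formed. The full newsgroup names are obtained by prefixing the table entries with their upper-level category, and the names `NG1_TASKS` and `ng1_split` are hypothetical.

```python
# Task id -> (sci newsgroup, talk newsgroup), taken from Table 1.
NG1_TASKS = {
    1: ("sci.crypt", "talk.politics.guns"),
    2: ("sci.electronics", "talk.politics.mideast"),
    3: ("sci.med", "talk.politics.misc"),
    4: ("sci.space", "talk.religion.misc"),
}


def ng1_split(target_task):
    """Return (source task ids, target task id) for data set NG1-i,
    where i = target_task and all other tasks are sources."""
    if target_task not in NG1_TASKS:
        raise ValueError("unknown task id: %r" % (target_task,))
    sources = [t for t in sorted(NG1_TASKS) if t != target_task]
    return sources, target_task


for i in sorted(NG1_TASKS):
    sources, target = ng1_split(i)
    print(f"NG1-{i}: target={NG1_TASKS[target]}, sources={sources}")
```

Only the labels of the source tasks would be passed to the method; the labels of the target task are held out for evaluation, as in the problem setting of the paper.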

6.2.3 Analysis on Results
The results on the wine data set are shown in Fig. 2. Fig. 2a, Fig. 2b, Fig. 2d and Fig. 2e illustrate that GPDRTL-DTRR outperforms other methods significantly when the reduced dimensionality varies from 8 to 2. The reason why GPDRTL-DTRR is superior to GPDRTL-FSR is probably that the distribution of red wine records is different from that of white wine records, such that the projection directions predicted by the GP models for the source tasks are different from the optimal directions in the target task. However, GPDRTL-DTRR is able to model the task relationships automatically, such that it may not suffer from the distribution difference. As shown in Fig. 2c, GPDRTL-DTRR keeps the best performance on ACC and F1-Score.

The results on the NG1 data set are shown in Fig. 3. For the case where supervised LPP is used in the SR algorithm, Fig. 3a, Fig. 3c, Fig. 3e and Fig. 3g show that GPDRTL-DTRR performs better than the other methods in most cases. Fig. 3a and Fig. 3g show that GPDRTL-FSR is superior to the previous methods. However, in other cases, we observe that our two variants improve only slightly on, or are even worse than, the other methods. This is probably because supervised LPP is not effective at obtaining optimal reduced data for the NG1 data set, so that the underlying relationships from the source tasks between the original data and the reduced data may not be very helpful for the target task. We can see that GPDRTL-FSR performs well in the cases shown in Fig. 3i to Fig. 3l.

7 Conclusions
In this paper, we investigate a new problem for dimensionality reduction in the transfer learning setting, where the class labels are available in multiple source tasks and the data in the target task are unlabeled. We first convert the problem of dimensionality reduction into integral regression problems. By using the Gaussian process model, we are able to discover the underlying relationship between the data in the original space and the reduced data in each source task. The relationships on data between each source task and the target task can also be modeled indirectly. We propose two variants, GPDRTL-FSR and GPDRTL-DTRR, by considering different loss functions and regularizers. The experimental results show the effectiveness of our methods.

References
[1] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15:1373–1396, 2003.
[2] Edwin V. Bonilla, Kian Ming A. Chai, and Christopher K. I. Williams. Multi-task Gaussian Process Prediction. In NIPS 20, pages 153–160. 2008.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, March 2004.
[4] D. Cai, X. He, and J. Han. Spectral Regression: A Unified Approach for Sparse Subspace Learning. In ICDM, pages 73–82, 2007.
[5] B. Chen, W. Lam, I. Tsang, and T. Wong. Extracting Discriminative Concepts for Domain Adaptation in Text Mining. In KDD, pages 179–188, 2009.
[6] W. Dou, G. Dai, C. Xu, and Z. Zhang. Sparse Unsupervised Dimensionality Reduction Algorithms. In ECML/PKDD, pages 361–376, 2010.
[7] L. Duan, I. W. Tsang, D. Xu, and T. Chua. Domain Adaptation from Multiple Sources via Auxiliary Classifiers. In ICML, 2009.
[8] J. C. Gower and G. B. Dijksterhuis. Procrustes Problem. Oxford University Press, 2004.
[9] X. He, D. Cai, S. Yan, and H.-J. Zhang. Neighborhood Preserving Embedding. In ICCV, pages 1208–1213, 2005.
[10] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang. Face Recognition Using Laplacianfaces. IEEE Trans. Pattern Anal. Mach. Intell., 27(3):328–340, 2005.
[11] R. Jin, S. Wang, and Y. Zhou. Regularized Distance Metric Learning: Theory and Algorithm. In NIPS, pages 862–870, 2009.
[12] L. J. P. Maaten, E. O. Postma, and H. J. Herick. Dimensionality Reduction: A Comparative Review. Technical Report TiCC-TR 2009-005, Tilburg University, 2009.
[13] S. J. Pan, J. T. Kwok, and Q. Yang. Transfer Learning via Dimensionality Reduction. In AAAI, pages 677–682, 2008.
[14] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain Adaptation via Transfer Component Analysis. In IJCAI, pages 1187–1192, 2009.
[15] C. E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[16] I. Rish, G. Grabarnik, G. Cecchi, F. Pereira, and G. J. Gordon. Closed-Form Supervised Dimensionality Reduction with Generalized Linear Models. In ICML, pages 832–839, 2008.
[17] S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290:2323–2326, 2000.
[18] Sajama and A. Orlitsky. Supervised Dimensionality Reduction Using Mixture Models. In ICML, pages 768–775, 2005.
[19] N. Slonim and N. Tishby. Document Clustering Using Word Clusters via the Information Bottleneck Method. In SIGIR, pages 208–215, 2000.
[20] M. Sugiyama, T. Idé, S. Nakajima, and J. Sese. Semi-Supervised Local Fisher Discriminant Analysis for Dimensionality Reduction. In PAKDD, pages 333–344, 2008.
[21] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290:2319–2323, 2000.
[22] Z. Wang, Y. Song, and C. Zhang. Transferred Dimensionality Reduction. In ECML/PKDD, pages 550–565, 2008.
[23] Z. Zha, T. Mei, M. Wang, Z. Wang, and X. S. Hua. Robust Distance Metric Learning with Auxiliary Knowledge. In IJCAI, pages 1327–1332, 2009.
[24] D. Zhang, Z. H. Zhou, and S. Chen. Semi-Supervised Dimensionality Reduction. In SDM, pages 629–634, 2007.
[25] Y. Zhang and D.-Y. Yeung. A Convex Formulation for Learning Task Relationships in Multi-Task Learning. In UAI, pages 733–742, 2010.
[26] Y. Zhang and D.-Y. Yeung. Transfer Metric Learning by Learning Task Relationships. In KDD, pages 1199–1208, 2010.
[27] X. Zhu. Semi-Supervised Learning Literature Survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
[28] H. Zou and T. Hastie. Regularization and Variable Selection via the Elastic Net. Journal of Royal Statistical Society Series B, 67(2):301–320, 2005.