Functional Data Classification for Temporal Gene Expression Data with Kernel-Induced Random Forests

Guangzhe Fan, Jiguo Cao and Jiheng Wang

Guangzhe Fan is an assistant professor at the Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1 (email: [email protected]). Jiguo Cao is an assistant professor at the Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6 (email: [email protected]). Jiheng Wang is a graduate student at the Department of Statistics and Actuarial Science, University of Waterloo (email: [email protected]).
Abstract— Scientists today often collect samples of curves and other functional data. Multivariate data classification methods cannot be directly applied to functional data because of the curse of dimensionality and the difficulty of taking into account the correlation and ordering of functional observations. We extend the kernel-induced random forest method to discriminate functional data by defining kernel functions of two curves. The method is demonstrated by classifying temporal gene expression data. Simulation studies and applications show that the kernel-induced random forest method improves classification accuracy over available methods. The kernel-induced random forest method is easy for naive users to implement. It is also appealing in its flexibility, allowing users to choose different curve estimation methods and appropriate kernel functions.
I. INTRODUCTION

Classification and discriminant analysis methods have grown in depth during the past 20 years. Fisher linear discriminant analysis (LDA) is the basic but standard approach. As the structure and dimension of data become more complex in a wide range of applications, such as functional data [1], [2], there is a need for more flexible nonparametric classification and discriminant analysis tools, especially when the ratio of learning sample size to number of covariates is low, the covariates are highly correlated and the covariance matrix is highly degenerate, or when a large number of covariates are individually weak in predicting the class labels.

Another nonparametric approach for multivariate data classification is CART (Classification and Regression Trees), proposed by [3] and later developed in [4], [5]. The data space is partitioned sequentially using a set of nonparametric splitting rules, and simple classification rules are then estimated in the subspaces defined by the tree nodes. Although a single tree model is still weak in many hard classification problems, ensembles of trees such as bagging [4] and random forests [5] are powerful nonparametric classification and prediction tools. However, the random forests approach cannot be directly used for functional data classification.

Some alternative approaches have been proposed to improve LDA for functional data classification, such as the flexible discriminant analysis [6] and the penalized discriminant analysis approach [7].
[8] developed a generalized linear regression approach that considers the functional features of the data. [9] proposed a nonparametric curve discrimination method based on functional principal component analysis. [10] proposed a nonparametric curves discrimination (NPCD) method that takes into account both the regularity and nonlinearity problems in classification. When the observations for individual curves are sparse, such as in time-course gene expression data, [11] used functional principal component analysis to estimate the curves and then applied functional logistic regression for classification.

[12] proposed the Kernel-Induced Random Forests (KIRF) method for the multivariate classification problem. Kernel-induced classification trees are built using kernels of each pair of distinct training observations as candidate splitting rules. The tree algorithm is applied to these kernels recursively to build one large tree. Using this concept, if a random vector is applied in choosing these kernelized splitting rules, as suggested in [5], a kernel-induced random forest (KIRF) can also be constructed for classification purposes. [12] showed that KIRF has some advantages over regular random forests. First, nonlinear relationships in the data can be recursively and locally discovered in the tree search. Second, since each observation can be used in a kernel splitting rule, there is potentially more variability among the trees in the forest and the voting power can be strengthened. Finally, the kernel-induced random forest procedure can be used on various types of data as long as appropriate kernels are defined, whereas traditional random forests are limited in this sense.

We extend KIRF to functional data classification by defining some kernels for functional data. The curves are estimated by nonparametric smoothing methods, such as functional principal component analysis [13], [14], [15]. The KIRF method is then applied to implement the classification by defining kernel functions of two curves. This method is flexible in allowing users to choose different smoothing methods and appropriate kernel functions.

The rest of the article is organized as follows. Section 2 introduces the KIRF method for functional data classification. The method is then demonstrated with the classification of temporal gene expression data in Section 3. Conclusions and discussion are given in Section 4.

II. METHOD

Suppose we have noisy observations for m curves:

y_ij = x_i(t_ij) + ε_ij ,    (1)
where x_i(t), i = 1, ..., m, are unknown smooth curves, t_ij, j = 1, ..., n_i, are discrete points with observations in the interval [0, T], and ε_ij are measurement errors with mean 0 and variance σ²_ij. Our objective is to classify these m curves using kernel-induced random forests. Before defining the kernel of two curves, the curves have to be estimated from the noisy data using a nonparametric estimation method, such as functional principal component analysis for sparse functional data, which is introduced as follows.

A. Functional Principal Component Analysis

When the number of observations or measurements for individual curves is small, it is hard to use the penalized spline smoothing method to estimate each curve from its own observations. [14] proposed to use functional principal component analysis (FPCA) via the Principal Analysis by Conditional Estimation (PACE) algorithm. PACE does not use pre-smoothing of trajectories, which is problematic when functional data are sparsely sampled. Instead, PACE pools the observations from all curves together to estimate the covariance function, and then obtains the principal components. It has been shown to analyze sparse functional data efficiently ([16]). FPCA is introduced briefly below.

Suppose the m curves are independent realizations of a square integrable stochastic process X(t) on [0, T] with mean E[X(t)] = μ(t) and covariance function cov[X(t), X(s)] = G(s, t). The covariance function has an orthogonal expansion in L²([0, T]) by Mercer's theorem:

G(s, t) = Σ_g λ_g ρ_g(s) ρ_g(t) ,
where the eigenvalues λ_g, g = 1, 2, ..., are ordered by size, λ_1 ≥ λ_2 ≥ ..., and ρ_g(t) are the associated eigenfunctions. We call ρ_g(t) the functional principal components (FPCs). Each individual curve then has the following Karhunen-Loève representation:

x_i(t) = μ(t) + Σ_g c_ig ρ_g(t) ,
where c_ig are called the FPC scores, and are calculated as

c_ig = ∫_0^T (x_i(t) − μ(t)) ρ_g(t) dt .
The mean function μ(t) is estimated with any nonparametric smoothing method by pooling all the data together. We usually choose a small number of FPCs, say M, which explain most of the total curve variation. [14] discussed how to choose the number of FPCs in detail. The covariance function G(s, t) is estimated by two-dimensional nonparametric smoothing of the empirical covariances

C(s_k, s_l) = (1/m) Σ_{i=1}^{m} [(x_i(s_k) − μ̂(s_k))(x_i(s_l) − μ̂(s_l))] ,

where s_k, k = 1, ..., n_grid, form a dense grid on [0, T]. A spectral analysis is performed on the estimated covariance
matrix, and we obtain the first M eigenvalues and eigenfunctions. The number M can be chosen such that the fraction of variance explained (FVE) exceeds some threshold. The FVE is calculated as

FVE(J) = Σ_{k=1}^{J} λ̂_k / Σ_{k=1}^{n_grid} λ̂_k ,    J = 1, ..., n_grid .

The FPC scores are obtained by numerical quadrature, for example Simpson's rule [17]. A Matlab package implementing the PACE algorithm has been developed by Professor Hans-Georg Müller and his collaborators, and can be downloaded from the website http://anson.ucdavis.edu/˜wyang/PACE/ .

We define the kernel function of two curves as

κ(x_i(t), x_j(t)) = Σ_{g=1}^{M} (c_ig − c_jg)² .    (2)
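To make the steps above concrete, the following is a minimal sketch in Python, assuming the curves are observed (or have been pre-smoothed) on a common, uniformly spaced grid. The PACE algorithm handles sparse data differently (FPC scores are obtained by conditional expectation), so this is only an illustration of the dense-grid case; the function and variable names are ours, not from the paper.

```python
import numpy as np

def fpc_score_kernel(Y, t, fve_threshold=0.99):
    """Illustrative FPCA pipeline for curves on a common, uniform grid.

    Y : (m, n_grid) array of curve values; t : (n_grid,) grid on [0, T].
    Returns the pairwise kernel matrix of Eq. (2) and the FPC scores,
    with the number of FPCs M chosen by the FVE criterion.
    """
    mu = Y.mean(axis=0)                       # estimate of the mean function mu(t)
    R = Y - mu                                # centered curves
    G = (R.T @ R) / Y.shape[0]                # empirical covariance on the grid

    # Spectral decomposition; order eigenvalues decreasingly.
    eigvals, eigvecs = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = np.clip(eigvals[order], 0, None), eigvecs[:, order]

    # Fraction of variance explained (FVE) determines M.
    fve = np.cumsum(eigvals) / eigvals.sum()
    M = int(np.searchsorted(fve, fve_threshold) + 1)

    # Rescale eigenvectors to approximate L2-normalized eigenfunctions rho_g(t);
    # assumes a uniform grid spacing dt.
    dt = t[1] - t[0]
    rho = eigvecs[:, :M] / np.sqrt(dt)

    # FPC scores c_ig = int (x_i(t) - mu(t)) rho_g(t) dt, by the trapezoidal rule.
    scores = np.array([[np.trapz(R[i] * rho[:, g], t) for g in range(M)]
                       for i in range(Y.shape[0])])

    # Kernel of Eq. (2): squared Euclidean distance between FPC score vectors.
    diff = scores[:, None, :] - scores[None, :, :]
    return (diff ** 2).sum(axis=-1), scores
```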
B. General Kernel Functions

Besides the kernel function defined above, there are a large number of kernel functions to choose from, which are introduced as follows. Let x_i = (x_i1, ..., x_iq) ∈ ℝ^q be a q-dimensional vector; [18] and [19] define the inner product in ℝ^q as

< x_i, x_j > = Σ_{k=1}^{q} x_ik x_jk .
Now suppose that x_i are not vectors, but rather functions with values x_i(t). The natural functional counterpart of the above inner product is

< x_i(t), x_j(t) > = ∫ x_i(t) x_j(t) dt .

Let ν : W_1 → W_2 be a linear or nonlinear mapping from the input space to an inner product feature space, such that x_i(t) → ν(x_i(t)), where W_1 and W_2 are two functional spaces. A kernel is a function κ such that, for all x_i(t) ∈ W_1,

κ(x_i(t), x_j(t)) = < ν(x_i(t)), ν(x_j(t)) > .

One advantage of kernel methods is that the mapped feature function ν(x_i(t)) does not need to be evaluated explicitly once the kernel function is defined. Here we give a few examples of kernels. A polynomial kernel with degree d ≥ 2 is

κ(x_i(t), x_j(t)) = (< x_i(t), x_j(t) > + c)^d ,

where c is a tuning parameter. Another popular kernel function is the Gaussian kernel

κ(x_i(t), x_j(t)) = exp(−||x_i(t) − x_j(t)||² / (2σ²)) ,

where σ² is a tuning parameter and ||x_i(t) − x_j(t)||² = < x_i(t) − x_j(t), x_i(t) − x_j(t) >. The Gaussian kernel is also called a radial basis function kernel.
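A small sketch of these kernels for discretized curves follows, assuming two curves are evaluated on the same grid t and using the trapezoidal rule for the L2 inner product; the default values of the tuning parameters c and sigma are purely illustrative.

```python
import numpy as np

def l2_inner(x, y, t):
    """<x, y> = int x(t) y(t) dt, approximated by the trapezoidal rule on grid t."""
    return np.trapz(x * y, t)

def polynomial_kernel(x, y, t, c=1.0, d=2):
    """Polynomial kernel (<x, y> + c)^d for two discretized curves."""
    return (l2_inner(x, y, t) + c) ** d

def gaussian_kernel(x, y, t, sigma=1.0):
    """Gaussian (radial basis) kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_norm = l2_inner(x - y, x - y, t)
    return np.exp(-sq_norm / (2.0 * sigma ** 2))
```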
C. Ensemble Methods of Tree Models

1. Bagging (Bootstrap Aggregating) [4]. We have a learning set L = {(y_i, x_i), i = 1, 2, ..., n} and a procedure (algorithm or model) that uses this learning set to form a predictor φ(x, L). If we construct many bootstrap samples {L^(B)} from L and build a sequence of predictors {φ(x, L^(B))}, we can form the ensemble predictor

φ_B(x) = voting{φ(x, L^(B))} .

A critical factor in bagging is the stability of the procedure constructing φ(x, L); unstable procedures tend to benefit from bagging. A brief code sketch of this bootstrap-and-vote procedure is given at the end of this subsection.

2. Random Forests [5]. A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x; Θ_k), k = 1, 2, ...}, where the {Θ_k} are independently and identically distributed random vectors. The nature and dimensionality of {Θ_k} depend on their use in the tree construction; bagging is one such example. Currently, random forests use randomly selected inputs or combinations of inputs at each node to grow the tree. There are some questions related to random forests that we need to consider:
• Strength: How strong is each individual tree in prediction? We prefer strong trees.
• Correlation: How strongly are the individual trees correlated with each other? We prefer low correlation.
• How many input variables should be tried at each node? Usually an integer close to the square root of the number of x variables; one can try more or fewer than this value, or use the default given by the software.
• How many trees are needed? The more, the better; usually 100 to 200 trees are used. Including more trees will not overfit the data, and the result converges instead.
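The sketch below illustrates the bootstrap-and-vote idea of bagging using scikit-learn decision trees as the unstable base learner; the library and the function name are our choices for illustration, not part of the paper's method, and class labels are assumed to be coded as integers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_new, n_trees=200, seed=0):
    """Fit one tree per bootstrap sample L^(B) and combine phi(x, L^(B)) by vote.
    Assumes class labels are coded as integers 0, 1, ..."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    votes = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                     # bootstrap sample of L
        tree = DecisionTreeClassifier(random_state=seed)
        tree.fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_new))                    # phi(x, L^(B))
    votes = np.vstack(votes).astype(int)
    # majority vote across trees for each new observation
    return np.array([np.bincount(col).argmax() for col in votes.T])
```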
D. Kernel-Induced Random Forests

A kernel is a function K such that, for all x_i and x_j ∈ X ⊂ ℝ^P, i, j = 1, 2, ..., n,

K(x_i, x_j) = < φ(x_i), φ(x_j) > ,    (3)

where φ is a (nonlinear) mapping from the input space to an (inner product) feature space. If observation i is fixed in the training sample and observation j is a new input, then the kernel function above can be treated as a new feature defined by observation i, denoted K(x_i, ·). Some popular kernels are the inner product kernel, the polynomial kernel and the Gaussian (radial basis) kernel.

A classification tree model is a recursive partitioning procedure in the feature space. Starting from the root node, at each step a greedy exhaustive search is implemented to find the best splitting rule, such as X_i < c for numerical features. If the answer is yes, the observation moves to the left child node, and otherwise to the right child node. The procedure is applied recursively until a very large binary tree is constructed. A large tree usually overfits the training sample, so a cross-validation process is then employed to prune the tree back to its proper size. A single classification tree constructed using this procedure is highly interpretable but quite unstable and weak in prediction. Figure 1 gives an example.

A random forest algorithm is simply a replication of the classification tree procedure while introducing a random vector into the construction. This may be implemented by limiting the number of features to be searched at every step of growing the tree and/or by bootstrapping the data set. The trees in a random forest are usually grown very large and need no pruning. Due to the instability of classification trees, the trees are quite diversified because of the random vector introduced in the process. Each classification tree is generally a low-bias but high-variance model; when the trees are combined to vote for a decision, the variance is reduced and the classification power becomes very strong. Another attractive property is that including more trees in the random forest will not overfit the training samples. However, the random forest approach described above usually works in a multivariate feature space; for functional types of data, such as images, it is difficult to apply directly.
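For concreteness, the greedy exhaustive split search at a single node can be sketched as follows. The paper does not name the impurity criterion, so Gini impurity (the usual CART choice) is used here only as an assumption, and the function names are ours.

```python
import numpy as np

def gini(y):
    """Gini impurity of the class labels in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustively search all features and thresholds X_i < c for the split
    that minimizes the weighted Gini impurity of the two child nodes."""
    n, p = X.shape
    best = (None, None, np.inf)                    # (feature, threshold, score)
    for j in range(p):
        for c in np.unique(X[:, j]):
            left = X[:, j] < c
            if left.sum() == 0 or left.sum() == n:
                continue                           # skip empty child nodes
            score = (left.sum() * gini(y[left]) +
                     (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, c, score)
    return best
```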
Fig. 1. An example of a classification tree with 3 terminal nodes. Note that Xi and Xj are features in the data space.
Here we use the concept of a kernel-induced classification tree to overcome this limitation. Instead of using the raw signal features in the data space to construct splitting rules, kernel functions based on observations are employed. Since the definition of a kernel is flexible enough to handle various types of data, the potential of the random forest is greatly enhanced and extended. Not only is the feature space enlarged, but complicated, nonlinear hidden patterns between observations can also be captured by the random forest learning procedure. An example of a kernel-induced classification tree is illustrated in Figure 2.

After defining the kernel of two curves, the kernel-induced random forest can be used for functional data classification. Each curve x_i(t) in the learning set can interact with other curves via the kernel function κ(x_i(t), ·). The value of this kernel function is uniquely determined by the specified curve x_i(t) and an input curve z(t). We can view such an observation-based kernel function as a new feature,
defined as κ_i(z(t)) = κ(x_i(t), z(t)). If we have m curves in the learning set, then we have m such kernel-induced features, κ_i(·), i = 1, ..., m. The kernel-induced splitting rule is then κ(x_i(t), ·) < c. Similar to the CART procedure in a traditional multivariate data space, we can recursively partition the curve space using these newly defined kernel-induced features, aiming for more homogeneous distributions of the curves in the resulting terminal nodes of the tree. The tree can grow very large, with only one observed curve in each terminal node. We can then adopt the random forests procedure, i.e., use randomly selected kernels during the tree construction so that we can simultaneously grow many trees and use them as an ensemble model to vote for the classification.
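A hedged sketch of this idea is shown below. It assembles the m kernel-induced features κ_i(·) into a feature matrix, using the FPC-score kernel of Eq. (2), and then trains an off-the-shelf random forest on it. This approximates, rather than reproduces, the KIRF algorithm of [12]: KIRF selects random kernelized splitting rules inside its own tree-growing routine, whereas here random selection of kernel features is delegated to scikit-learn's max_features option; all names are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def kirf_fit_predict(scores_train, y_train, scores_new, n_trees=200, seed=0):
    """Approximate kernel-induced random forest on kernel-induced features.

    scores_train : (m, M) FPC scores of the m training curves.
    scores_new   : (k, M) FPC scores of curves to classify.
    Column i of the feature matrix is kappa_i(.) = kappa(x_i(t), .) from Eq. (2),
    so every training curve induces one feature.
    """
    def kernel_features(scores):
        diff = scores[:, None, :] - scores_train[None, :, :]
        return (diff ** 2).sum(axis=-1)            # (n, m) matrix of kappa(x_i, .)

    # 'sqrt' features per split mimics the random selection of kernelized
    # splitting rules; adding more trees does not overfit, as noted in the text.
    forest = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                                    random_state=seed)
    forest.fit(kernel_features(scores_train), y_train)
    return forest.predict(kernel_features(scores_new))
```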
Fig. 2. An example of a kernel-induced classification tree with 3 terminal nodes. Note that X_i and X_j are observation vectors in the data space and the K(·, ·) are kernel functions defined on these observations.
Fig. 3. The gene expression for the yeast cell cycle measured every 7 minutes between 0 and 119 minutes, published in Spellman et al. 1998. The left panel displays the 44 genes related to G1 phase regulation, and the right panel displays the 46 genes related to non-G1 phase regulation.
III. CLASSIFICATION FOR TEMPORAL GENE EXPRESSION DATA

[20] measured the expression of 6178 yeast genes over the cell cycle every 7 minutes between 0 and 119 minutes. Traditional methods identified and classified 90 genes into two groups: 44 genes related to G1 phase regulation and 46 genes related to non-G1 phase regulation. Figure 3 displays the temporal expression of these 90 genes. The estimated covariance surface is shown in Figure 4. A spectral decomposition is applied to the estimated covariance to construct the FPCs. We choose five FPCs, which explain 99.7% of the total variance (FVE = 99.7%). The corresponding five FPC scores are calculated for each gene.

Fig. 4. The estimated covariance surface from the 90 gene expression data.

We then use the kernel-induced random forest to classify the 90 genes with the kernel function (2). The classification error rate is 6.8% for the G1 group and 8.7% for the non-G1 group, and the overall classification error rate is 7.8%. [11] used functional logistic regression to classify these 90 genes; their classification error rate is 11.4% for the G1 group and 8.7% for the non-G1 group, with an overall classification error rate of 10.0%. They also tried a mixed effects model with a B-spline basis, as proposed in [21]
and [22], and the corresponding overall classification error rate is 11.1%. The comparison of classification errors for the yeast data for the three methods is listed in Table I.

IV. CONCLUSIONS AND DISCUSSION

Multivariate data classification methods, such as linear discriminant analysis and the random forest method, cannot be directly used to classify functional data. Functional data often have a large number of observations or measurements for each curve, in which case multivariate data classification methods suffer from the curse of dimensionality. Moreover, the observations for functional data are correlated and are collected in order of time or space, which is hard to take into account with multivariate data classification methods.
TABLE I
A COMPARISON OF CLASSIFICATION ERRORS (%) FOR THE YEAST DATA FOR THE THREE METHODS

           KIRF    logistic    B-spline
G1         6.8     11.4        N/A
non-G1     8.7     8.7         N/A
overall    7.8     10.0        11.1
The kernel-induced random forest method is extended to functional data classification by defining kernel functions of two curves. When a large number of observations or measurements are available for each curve, each individual curve can be estimated by penalized spline smoothing. When the functional data are sparse, the curve estimation tends to be more accurate when all functional data are pooled together using functional principal component analysis. With appropriate kernel functions, the kernel-induced random forest method achieves better classification accuracy than other available approaches.

It is of great interest to apply the kernel-induced random forest method to discriminate other kinds of functional data by defining appropriate kernel functions. For instance, for human height data, the curve shapes described by the first and second derivatives contain more interesting features. In this case, the kernel function may be defined with the curve derivatives, and the kernel-induced random forest method can then be used for growth curve classification.

ACKNOWLEDGEMENTS

The authors thank Dr. Xiaoyan Leng for kindly providing the yeast gene expression data and their computing codes. They also thank Dr. Hans-Georg Müller for inspirational discussions. This research was partially supported by two Discovery Grants from the Natural Sciences and Engineering Research Council of Canada to both authors.

REFERENCES

[1] J. O. Ramsay and B. W. Silverman, Functional Data Analysis, 2nd ed. New York: Springer, 2005.
[2] F. Ferraty and P. Vieu, Nonparametric Functional Data Analysis. New York: Springer, 2006.
[3] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth, 1984.
[4] L. Breiman, Bagging predictors, Machine Learning, vol. 24, pp. 123-140, 1996.
[5] L. Breiman, Random forests, Machine Learning, vol. 45, pp. 5-32, 2001.
[6] T. Hastie, A. Buja, and R. Tibshirani, Flexible discriminant analysis by optimal scoring, Journal of the American Statistical Association, vol. 89, pp. 1255-1270, 1994.
[7] T. Hastie, A. Buja, and R. Tibshirani, Penalized discriminant analysis, Annals of Statistics, vol. 23, pp. 73-102, 1995.
[8] B. Marx and P. Eilers, Generalized linear regression on sampled signals and curves: a P-spline approach, Technometrics, vol. 41, pp. 1-13, 1999.
[9] P. Hall, D. Poskitt, and B. Presnell, A functional data-analytic approach to signal discrimination, Technometrics, vol. 43, pp. 1-9, 2001.
[10] F. Ferraty and P. Vieu, Curves discrimination: a nonparametric functional approach, Computational Statistics and Data Analysis, vol. 44, pp. 161-173, 2003.
[11] X. Leng and H. G. Müller, Classification using functional data analysis for temporal gene expression data, Bioinformatics, vol. 22, no. 1, pp. 68-76, 2006.
[12] G. Fan, Kernel-induced classification trees and random forests, Department of Statistics and Actuarial Science, University of Waterloo, Tech. Rep., 2009.
[13] P. Besse, H. Cardot, and F. Ferraty, Simultaneous nonparametric regressions of unbalanced longitudinal data, Computational Statistics and Data Analysis, vol. 24, pp. 255-270, 1997.
[14] F. Yao, H. G. Müller, and J. L. Wang, Functional data analysis for sparse longitudinal data, Journal of the American Statistical Association, vol. 100, pp. 577-590, 2005.
[15] L. Zhou, J. Z. Huang, and R. Carroll, Joint modelling of paired sparse functional data using principal components, Biometrika, vol. 95, pp. 601-619, 2008.
[16] F. Yao, H. G. Müller, and J. L. Wang, Functional linear regression analysis for longitudinal data, Annals of Statistics, vol. 33, pp. 2873-2903, 2005.
[17] R. L. Burden and J. D. Faires, Numerical Analysis, 7th ed. Pacific Grove, California: Brooks/Cole, 2000.
[18] B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, Massachusetts: The MIT Press, 2002.
[19] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, UK: Cambridge University Press, 2004.
[20] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher, Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, Mol. Biol. Cell, vol. 9, pp. 3273-3297, 1998.
[21] J. Rice and C. O. Wu, Nonparametric mixed effects models for unequally sampled noisy curves, Biometrics, vol. 57, pp. 253-259, 2001.
[22] M. Shi, R. E. Weiss, and J. M. G. Taylor, An analysis of paediatric CD4 counts for acquired immune deficiency syndrome using flexible random curves, J. R. Statist. Soc. C, vol. 45, pp. 151-163, 1996.