BIOINFORMATICS
ORIGINAL PAPER
Vol. 22 no. 10 2006, pages 1259–1268 doi:10.1093/bioinformatics/btl065
Gene expression
Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data

Desheng Huang1,2 and Wei Pan2,*

1Department of Mathematics, China Medical University, Shenyang, China and 2Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455-0392, USA

*To whom correspondence should be addressed.
Received on November 28, 2005; revised on January 16, 2006; accepted on February 20, 2006
Advance Access publication February 24, 2006
Associate Editor: Joaquin Dopazo

ABSTRACT
Motivation: Because co-expressed genes are likely to share the same biological function, cluster analysis of gene expression profiles has been applied for gene function discovery. Most existing clustering methods ignore known gene functions in the process of clustering.
Results: To take advantage of accumulating gene functional annotations, we propose incorporating known gene functions into a new distance metric, which shrinks a gene expression-based distance towards 0 if and only if the two genes share a common gene function. A two-step procedure is used. First, the shrinkage distance metric is used in any distance-based clustering method, e.g. K-medoids or hierarchical clustering, to cluster the genes with known functions. Second, while keeping the clustering results from the first step for the genes with known functions, the expression-based distance metric is used to cluster the remaining genes of unknown function, assigning each of them to either one of the clusters obtained in the first step or some new clusters. A simulation study and an application to gene function prediction for the yeast demonstrate the advantage of our proposal over the standard method.
Contact: [email protected]

1 INTRODUCTION
This article concerns incorporating biological knowledge into clustering genes for gene function discovery using microarray expression data, though the methodology can have broader applications. It has been observed that genes with the same function or involved in the same biological process are likely to co-express; hence clustering genes' expression profiles provides a means for gene function prediction; see, e.g. Eisen et al. (1998), Brown et al. (2000), Wu et al. (2002), Xiao and Pan (2005) and references therein. However, most existing approaches ignore known functions of the genes in the process of clustering. The limitations of these approaches and the importance of incorporating gene functional or other types of biological knowledge into clustering analysis have been increasingly recognized. In particular, it is well known that results of clustering gene expression data are not stable (Zhang and Zhao, 2000; Kerr and Churchill, 2001). In general, given the relatively high noise levels of genomic data, it is recognized that incorporating biological knowledge into statistical analysis is a reliable way to maximize statistical efficiency and enhance the interpretability of analysis results.

Hanisch et al. (2002) proposed incorporating a metabolic pathway, while Cheng et al. (2004) proposed incorporating the Gene Ontology (GO) (Ashburner et al., 2000), into clustering. Both approaches work by first defining a distance metric based on a biological network, either a pathway or GO. This metric is then combined with the usual expression-based metric by averaging, and the combined metric is used in a clustering algorithm. As will be discussed, owing to incomplete biological knowledge, such a combined metric is in general biased: it over-estimates the distance between any two genes of unknown function, as compared with the genes with known functions. Specifically, two genes that in truth share the same function, which however is not yet known, will have an incorrectly large distance in the biological function metric, and thus an incorrectly large distance in the combined metric. We propose a novel idea to handle this problem.

Among other attempts, Fang et al. (2006) used GO to guide clustering; i.e. only GO nodes are possible clusters. A downside is that it does not permit the discovery of new gene functional categories. Pan (2006) and Huang et al. (2006) proposed two modifications to standard model-based clustering so that the genes sharing the same biological function are allowed to have a common prior probability of belonging to any cluster, while the genes with different functions may have varying prior probabilities. The two modifications are only applicable to model-based clustering, which has been extensively studied in the context of clustering gene expression data (e.g. Yeung et al., 2001; Broet et al., 2002; Ghosh and Chinnaiyan, 2002; McLachlan et al., 2002; Pan et al., 2002); see McLachlan and Peel (2002) for a nice introduction to model-based clustering. Here we consider another class, distance-based clustering, ranging from partitional to hierarchical methods, some of which are among the most competitive for gene expression data; see Datta and Datta (2003). To be concrete, we focus on the K-medoids method, also called partitioning around medoids (PAM) (Kaufman and Rousseeuw, 1990), which is a robust version of K-means and has been shown to work well for gene expression data (van der Laan et al., 2003).

In our proposal, we assume that each gene of a genome can be assigned to at least one, possibly several, of a number of groups: one group contains the genes of unknown function, while each of the others contains the genes sharing a common function. The grouping can be based on biological pathways or gene functional annotation systems, such as GO (Ashburner et al., 2000), MIPS (Mewes et al., 2004) and KEGG (Kanehisa, 1996). As usual, we can calculate a distance matrix for all the genes based on their gene expression profiles using, for example, Euclidean distance or Pearson's
correlation. Then a two-step procedure is applied. In the first step, we shrink the distance towards 0 for any two genes coming from the same functional group. Then, leaving out the genes of unknown function, we cluster the genes with known functions using this shrinkage distance metric. In the second step, while fixing the clustering results for the genes with known functions, we cluster the left-out genes using the original expression-based distance metric. The genes of unknown function are assigned either to clusters of the genes with known functions or to their own new clusters, corresponding to possibly new functional categories.

Incorporating gene functions into clustering remains an unsupervised learning task, not supervised learning. In particular, because of possible heterogeneity in each gene functional group, we assume neither that the genes from the same functional group come from the same cluster, nor that a cluster consists only of genes from the same functional group. This is consistent with an important aim of clustering: to discover new gene functions.

This article is organized as follows. In Section 2, we first briefly review the standard K-medoids method, then propose our new method. In Section 3, we demonstrate the advantage of our proposal using simulated data, and then using real data for gene function prediction. We end the paper with a short discussion.
2 METHODS

2.1 Set-up
First, based on gene expression profiles, we can calculate an expression-based distance matrix D = (d_ij), where d_ij is the distance between gene i and gene j, i, j = 1, ..., n. The distance metric used can be Pearson's correlation, the L1 distance, the L2 distance, etc. Second, based on F known gene functions (or pathways), each of the n genes of a genome can be assigned to one or more of F + 1 groups, G_0, G_1, ..., G_F; G_0 contains the genes of unknown function, while each of G_1, ..., G_F contains the genes belonging to one of the F functional categories. Note that we allow a gene to have more than one known function and hence to be in more than one group.

The goal of the study is, as in many previous studies (e.g. Eisen et al., 1998), to formally or informally predict functions of the genes in G_0 by clustering them along with the other genes in G_1, ..., G_F. For instance, if a gene in G_0 is in a cluster enriched with the genes from G_1, then it is reasonable to predict that the gene has the same function as those in G_1. The method proposed here is applicable to any distance-based clustering algorithm; i.e. the method works as long as clustering can be conducted with a supplied distance matrix. For concreteness, we restrict attention to one such algorithm, the K-medoids method or PAM (Kaufman and Rousseeuw, 1990).
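As an illustration, a minimal Python sketch of computing such an expression-based distance matrix (the function name expression_distance_matrix is ours; SciPy's pdist supplies the metrics named above):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def expression_distance_matrix(X, metric="correlation"):
    """Pairwise distance matrix D = (d_ij) between gene expression profiles.

    X is an (n_genes x n_experiments) array; the metric choices mirror
    those named in the text: Pearson's correlation, L1 or L2 distance.
    """
    if metric == "correlation":
        # 1 - Pearson correlation, so strongly co-expressed genes are close
        return squareform(pdist(X, metric="correlation"))
    if metric == "L1":
        return squareform(pdist(X, metric="cityblock"))
    return squareform(pdist(X, metric="euclidean"))  # L2

Any other metric computable as an n x n matrix could be substituted, since the clustering below only consumes the matrix.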
2.2 K-medoids clustering
The K-medoids method is similar to the K-means method, but more robust. The main difference is that, rather than using the mean of each cluster as the centroid of the cluster as in K-means, K-medoids finds an observation from the cluster to serve as the centroid. Specifically, given a distance matrix D = (d_ij) and a specified number of clusters, say K, it works in the following steps:

Step 0. Randomly select K observations i_k as medoids, k = 1, ..., K.

Step 1. Find the cluster membership of each observation i:

$$C(i) = \arg\min_k d_{i, i_k}.$$

Step 2. Update the medoids: the new medoid for cluster k is the observation i_k with

$$i_k = \arg\min_{\{i : C(i) = k\}} \sum_{\{j : C(j) = k\}} d_{ij}.$$

Step 3. Repeat Steps 1 and 2 until convergence.

Some authors have found good performance of K-medoids for clustering gene expression data (e.g. van der Laan et al., 2003). In addition, some diagnostic tools are available, and the method can be modified for very large datasets; all these functions are implemented in the R package cluster. The K-medoids method is also available in Michael Eisen's popular software Cluster (http://rana.lbl.gov/EisenSoftware.htm).
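A minimal Python sketch of Steps 0–3 (our own function name k_medoids; it omits PAM's build/swap refinements, so it illustrates the plain iteration above rather than the full PAM of the R package cluster):

import numpy as np

def k_medoids(D, K, max_iter=100, rng=None):
    """Basic K-medoids on a precomputed distance matrix D (n x n)."""
    rng = np.random.default_rng(rng)
    n = D.shape[0]
    medoids = rng.choice(n, size=K, replace=False)           # Step 0
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)            # Step 1
        new_medoids = medoids.copy()
        for k in range(K):                                   # Step 2
            members = np.where(labels == k)[0]
            if members.size:
                # candidate minimizing total distance to its cluster
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[k] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):             # Step 3
            break
        medoids = new_medoids
    return medoids, labels

In practice one would restart from several random Step-0 initializations and keep the solution with the smallest total within-cluster distance.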
2.3 New method
2.3.1 Algorithm We first define a new distance metric d̃_ij based on an expression-based distance metric d_ij and gene functional annotations:

$$\tilde d_{ij} = \begin{cases} r\, d_{ij} & \text{if there is an } f \text{ such that } 1 \le f \le F \text{ and } i, j \in G_f, \\ d_{ij} & \text{otherwise,} \end{cases}$$

where 0 ≤ r ≤ 1 is a shrinkage parameter to be determined. Trivially, r = 1 leads to the standard method of ignoring gene functions in the process of clustering.

There are two major steps in our new clustering method. In the first step, we apply the K-medoids to the genes in G_1, ..., G_F using the new distance matrix D̃ = (d̃_ij), obtaining clusters for the genes with known functions. Here we have to supply the number of clusters, say K_0. In the second step, we apply a modified K-medoids to the original distance matrix D such that the genes in G_0 can be assigned either to one of the K_0 clusters obtained earlier or to one of K_1 new clusters, while fixing the medoids and cluster assignments of any gene in G_1, ..., G_F. For simplicity of notation, we assume that the first n_0 genes are in G_0 and the remaining n − n_0 ones are in G_1, ..., G_F, indexed as gene 1, ..., gene n.

Step 1. Apply the K-medoids to the genes {n_0 + 1, ..., n} (i.e. those with known functions in G_1, ..., G_F) using the shrinkage distance matrix D̃, with K_0 clusters. Suppose that the K_0 medoids are {i_k : k = 1, ..., K_0} and each gene i has cluster membership C(i) for i = n_0 + 1, ..., n.

Step 2. Apply a modified K-medoids to the genes {1, ..., n_0} (i.e. those of unknown function in G_0) using the expression-based distance matrix D, creating K_1 new clusters while fixing the K_0 clusters obtained in Step 1.

Step 2.0. Randomly select K_1 genes from {1, ..., n_0} as medoids, say i_k for k = K_0 + 1, ..., K_0 + K_1.

Step 2.1. Find the cluster membership of each gene i ∈ {1, ..., n_0}:
$$C(i) = \arg\min_{1 \le k \le K_0 + K_1} d_{i, i_k}.$$

Step 2.2. Update the K_1 medoids: for K_0 + 1 ≤ k ≤ K_0 + K_1, the new medoid for cluster k is the gene i_k with

$$i_k = \arg\min_{\{i : C(i) = k,\; 1 \le i \le n_0\}} \sum_{\{j : C(j) = k,\; 1 \le j \le n_0\}} d_{ij}.$$
Step 2.3. Repeat the above two steps until convergence.

It is emphasized that in Step 2.1, only genes in G_0 are assigned to any of the K_0 + K_1 clusters; in contrast, genes in G_1, ..., G_F remain in the K_0 clusters obtained using the shrinkage distance matrix. In Step 2.2, we only update the centroids of the K_1 new clusters.

Note that in Step 2 we have to use the original expression-based distance matrix D, not the shrinkage distance matrix D̃. Because of incomplete biological knowledge, d̃_ij is in general biased for the true distance between two genes if either of them is not annotated (i.e. is in G_0). Consider an extreme case with r close to zero: under the shrinkage distance metric D̃, the genes in G_1, ..., G_F will be correctly assigned to their corresponding clusters; however, all the genes in G_0 will be quite distant from any of G_1, ..., G_F, thus forming their own clusters even if some genes in G_0 indeed belong to one of G_1, ..., G_F. The methods proposed in Hanisch et al. (2002) and Cheng et al. (2004) are limited in this regard.

The proposed method is flexible enough to allow multiple known functions for genes; i.e. the functional groups G_1, ..., G_F may overlap: first, the shrinkage distance d̃_ij is still well-defined if gene i is in two or more
functional groups; second, d̃_ij does not change if genes i and j co-appear in two or more groups, as compared with their co-appearing in only one group; third, the clustering algorithm uses the shrinkage distance matrix D̃ only in Step 1.

Note that we allow K_1 > 0, so that a gene of unknown function need not be assigned to any cluster containing genes with known functions. Owing to the possibility of some as yet undiscovered gene functions, this flexibility is desirable. Furthermore, even when no new gene functions are needed, the use of K_1 > 0 allows some genes not to be classified into one of the known functional categories, possibly because of inconclusive evidence from their quite different expression profiles.
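To make the algorithm concrete, here is a minimal Python sketch under our own naming (shrinkage_distance, two_step_cluster), reusing the k_medoids sketch from Section 2.2; it illustrates the two steps above rather than providing an optimized implementation:

import numpy as np

def shrinkage_distance(D, groups, r):
    """Shrink d_ij towards 0 iff genes i and j share a known function.

    groups is a list of index arrays G_1, ..., G_F (possibly overlapping);
    genes absent from all of them form G_0."""
    Dt = D.copy()
    shared = np.zeros_like(D, dtype=bool)
    for G in groups:
        G = np.asarray(G)
        shared[np.ix_(G, G)] = True        # i and j in the same G_f
    np.fill_diagonal(shared, False)
    Dt[shared] *= r
    return Dt

def two_step_cluster(D, groups, r, K0, K1, rng=None):
    """Step 1: K-medoids on annotated genes with the shrunken metric;
    Step 2: assign unannotated genes with the original metric,
    allowing K1 new clusters while K0 medoids stay fixed."""
    rng = np.random.default_rng(rng)
    n = D.shape[0]
    known = np.unique(np.concatenate([np.asarray(G) for G in groups]))
    unknown = np.setdiff1d(np.arange(n), known)
    Dt = shrinkage_distance(D, groups, r)
    med_idx, lab_known = k_medoids(Dt[np.ix_(known, known)], K0, rng=rng)
    medoids = list(known[med_idx])         # fixed medoids of the K0 clusters
    medoids += list(rng.choice(unknown, size=K1, replace=False))  # Step 2.0
    labels = np.full(n, -1)
    labels[known] = lab_known
    for _ in range(100):
        # Step 2.1: assign only G_0 genes, using the original D
        labels[unknown] = np.argmin(D[np.ix_(unknown, medoids)], axis=1)
        changed = False
        for k in range(K0, K0 + K1):       # Step 2.2: update new medoids only
            members = unknown[labels[unknown] == k]
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                m = members[np.argmin(within)]
                if m != medoids[k]:
                    medoids[k], changed = m, True
        if not changed:                    # Step 2.3
            break
    return np.array(medoids), labels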
2.3.2 Justifications for the shrinkage distance metric Here we provide two ways to justify shrinking the expression-based distance. The first is based on a more general form of the shrinkage distance metric,

$$\tilde d_{ij} = r\, d_{ij} + (1 - r)\, m_{ij}, \tag{1}$$
where m_ij is the population distance between gene i and gene j. So far we have simply used m_ij = 0 if both genes share a common biological function, and m_ij = d_ij otherwise. More generally, m_ij can represent the distance between the two genes based on a metabolic pathway or gene functional annotations; see Hanisch et al. (2002) and Cheng et al. (2004) for such a metric. In this way, our method can be extended to incorporate a whole biological network, such as GO, into clustering.

There is a Bayesian interpretation for (1): for instance, if we assume that

$$d_{ij} \mid \delta_{ij} \sim N(\delta_{ij}, \sigma^2), \qquad \delta_{ij} \sim N(m, \tau^2),$$

then the posterior mean of the true distance δ_ij is d̃_ij with m_ij replaced by m and r = τ²/(σ² + τ²) (e.g. Carlin and Louis, 2000). Here, d_ij can be regarded as a distance measured with error, owing to the noise in, and the limited number of, microarray experiments, while δ_ij is the true distance. Hence, the shrinkage distance d̃_ij is a shrinkage or Bayesian estimator of the true distance after incorporating biological knowledge of gene functions.

The second justification for the shrinkage distance is based on the intuitive idea that a given gene expression dataset provides only partial information on the underlying true relationships among the genes. Formally, suppose that all relevant attributes for a gene are x_1 and x_2, where x_1 is the observed expression profile of the gene while x_2 represents missing attributes. If complete attributes are available, then the corresponding distance is d_ij(x_1, x_2), a function of the two distances d_ij(x_1) and d_ij(x_2) based on the first and second sets of attributes respectively. For simplicity, suppose that

$$d_{ij}(x_1, x_2) = r\, d_{ij}(x_1) + (1 - r)\, d_{ij}(x_2). \tag{2}$$
Assuming that gene function is related to the missing attributes x_2, and that d_ij(x_2) ≈ 0 for any genes i and j sharing the same function, we see that the shrinkage distance d̃_ij is an approximation to d_ij(x_1, x_2), the true distance based on complete data. Again, d_ij(x_2) can in general represent the distance based on other biological data sources, such as GO; in this way, our method can be extended to incorporate other data sources.
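For completeness, the standard normal–normal posterior calculation behind the Bayesian interpretation above (only its conclusion is stated in the text) is

$$E(\delta_{ij} \mid d_{ij}) = \frac{\tau^2}{\sigma^2 + \tau^2}\, d_{ij} + \frac{\sigma^2}{\sigma^2 + \tau^2}\, m = r\, d_{ij} + (1 - r)\, m, \qquad r = \frac{\tau^2}{\sigma^2 + \tau^2},$$

which is exactly (1) with m_ij replaced by the prior mean m; a smaller measurement-error variance σ² (e.g. from more arrays) pulls r towards 1, i.e. less shrinkage.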
2.4 Model selection
We regard our proposed clustering method as an exploratory data analysis tool, as is the use of hierarchical clustering (Eisen et al., 1998), K-means clustering (Tavazoie et al., 1999) and self-organizing maps (Tamayo et al., 1999) in similar contexts. As demonstrated in an important application of clustering analysis to gene function discovery (Wu et al., 2002), various values of the tuning parameters, i.e. the shrinkage parameter r and the numbers of clusters K_0 and K_1, can be tried, and a statistical enrichment analysis can then be conducted to predict gene functions. In some applications, however, it may be desirable to have an objective model selection criterion for these parameters. In general, model selection is a quite difficult problem for clustering, though there is a huge literature on it (see Dudoit and Fridlyand, 2002 and references therein). Here we report some preliminary results in our current context.

2.4.1 Silhouette The standard method of selecting the number of clusters in PAM is the use of the average silhouette (Kaufman and Rousseeuw, 1990). For any gene i and cluster C, define the distance between i and C as

$$d(i, C) = \sum_{j \in C} d_{ij} / |C|,$$

i.e. the average distance between i and the genes j belonging to C. Suppose that gene i is in cluster C(i); define

$$a_i = d(i, C(i)), \qquad b_i = \min_{C \ne C(i)} d(i, C).$$

Then the silhouette for gene i and the average silhouette are defined respectively as

$$sil_i = (b_i - a_i) / \max(a_i, b_i), \qquad sil = \sum_i sil_i / n.$$

A model with the largest average silhouette is preferred. In the current context, we experimented with using both the original distance metric D and the shrinkage metric D̃, and found that, as expected, the former always favored r = 1 while the latter favored r = 0. As an analog to supervised learning, this phenomenon is related to the bias of a training error rate. Therefore, regarding the genes in G_0 as test data (and those in G_1, ..., G_F as training data), we propose using only the test data when calculating the average silhouette.
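A minimal Python sketch of this computation (our own function name average_silhouette; it follows the definitions above, including i itself in the average for d(i, C(i)), and the optional test_idx argument restricts the average to the test genes as proposed):

import numpy as np

def average_silhouette(D, labels, test_idx=None):
    """Average silhouette from a distance matrix and cluster labels.

    Requires at least two clusters; if test_idx is given, average only
    over those genes (e.g. the unannotated 'test' genes in G_0)."""
    n = D.shape[0]
    clusters = np.unique(labels)
    sil = np.zeros(n)
    for i in range(n):
        # average distance from i to each cluster, per the definition above
        d_iC = {c: D[i, labels == c].sum() / (labels == c).sum()
                for c in clusters}
        a = d_iC[labels[i]]                               # own cluster
        b = min(v for c, v in d_iC.items() if c != labels[i])  # nearest other
        sil[i] = (b - a) / max(a, b)
    idx = np.arange(n) if test_idx is None else np.asarray(test_idx)
    return float(sil[idx].mean())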
2.4.2 Cross-validation As will be discussed, the average silhouette does not work well in choosing the shrinkage parameter r. As an alternative, we use V-fold cross-validation (CV) (Breiman et al., 1984). The main idea is that, for the purpose of model selection, we treat the problem as a supervised learning task with the known functions of the genes as the outcome. The procedure is to

(1) partition the genes in G_1, ..., G_F into V subgroups of about equal size;
(2) treat the genes in subgroup v as having no known function, include them in G_0 and cluster all the genes as before;
(3) calculate the prediction accuracy rate (see the next section for more details) for the genes in subgroup v;
(4) repeat steps (2) and (3) for v = 1, ..., V.

The final CV accuracy rate is the average of the V accuracy rates from step (3). To estimate the optimal combination of tuning parameter values, such as (K_0, K_1, r), we try various combinations of their values and choose the one with the highest CV accuracy rate (a code sketch is given at the end of this subsection).

CV has its own limitations in the current context, mainly in its assumption that the goal of clustering is to identify groups of genes corresponding to (or, more generally, closely related to) the known functions of groups G_1, ..., G_F. In particular, in the context of new gene function discovery, if many genes of unknown function (i.e. in G_0) turn out to have a function different from those of G_1, ..., G_F, then CV will not work, as will be shown later; on the other hand, at the genomic scale, one may argue that this is not typical.
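A minimal Python sketch of this CV loop, reusing the two_step_cluster sketch from Section 2.3.1 (the function name cv_accuracy is ours; held-out genes falling into one of the K_1 new clusters are counted as errors, i.e. the conservative hard-classification rule of Section 3.2):

import numpy as np

def cv_accuracy(D, groups, K0, K1, r, V=5, rng=None):
    """V-fold CV accuracy for one (K0, K1, r) combination."""
    rng = np.random.default_rng(rng)
    # true class of each annotated gene (first group containing it)
    truth = {int(g): f for f, G in enumerate(groups) for g in G}
    annotated = rng.permutation(sorted(truth))
    accs = []
    for held_out in np.array_split(annotated, V):
        # pretend fold v is unannotated: drop its genes from every group
        train_groups = [np.setdiff1d(G, held_out) for G in groups]
        _, labels = two_step_cluster(D, train_groups, r, K0, K1, rng=rng)
        train = np.concatenate(train_groups)
        correct = 0
        for k in range(K0):
            members = train[labels[train] == k]
            counts = np.bincount(np.array([truth[g] for g in members],
                                          dtype=int), minlength=len(groups))
            majority = counts.argmax()      # majority class of cluster k
            for g in held_out[labels[held_out] == k]:
                correct += (truth[int(g)] == majority)
        accs.append(correct / len(held_out))
    return float(np.mean(accs))

One would call this over a grid of (K0, K1, r) values and keep the combination with the highest returned accuracy.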
3 RESULTS
3.1 Simulated data
3.1.1 Performance We did a simulation study to evaluate the performance of the methods. Bivariate data were generated from three true clusters with bivariate Normal distributions N((0, 0)', σI), N((2, 0)', σI) and N((1, 1)', σI) respectively, where I is a 2 × 2 identity matrix. Three simulation set-ups were considered:

Set-up 1: σ = 1;
Set-up 2: σ = 3;
Set-up 3: σ = 1, with 18 noise attributes added, each with a standard Normal distribution N(0, 1). Hence, each observation was a 20-element vector with the last 18 elements as noise.
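A minimal Python sketch of this data generation (the function name simulate is ours; cluster sizes default to those used below, 300, 300 and 200 genes, and σ is read as the per-coordinate standard deviation in N(μ, σI)):

import numpy as np

def simulate(setup, sizes=(300, 300, 200), rng=None):
    """Generate one simulated dataset as described in Section 3.1.1."""
    rng = np.random.default_rng(rng)
    means = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
    sigma = {1: 1.0, 2: 3.0, 3: 1.0}[setup]
    X = np.vstack([rng.normal(mu, sigma, size=(n, 2))
                   for mu, n in zip(means, sizes)])
    y = np.repeat([0, 1, 2], sizes)        # true cluster of each observation
    if setup == 3:
        # set-up 3: append 18 standard-Normal noise attributes
        X = np.hstack([X, rng.normal(size=(X.shape[0], 18))])
    return X, y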
Table 1. Average predicted counts in each class and average prediction accuracy (Acc) for test data in 1000 simulations

Set-up  Truth   Standard: r = 1      New: r = 0.9         New: r = 0.7         New: r = 0.5
                1     2     3        1     2     3        1     2     3        1     2     3
1       1       72    17    11       74    14    12       74    14    12       74    14    12
        2       10    76    14       11    73    16       11    73    16       11    73    16
        3       44    49    107      48    42    110      48    42    110      48    42    110
        Acc     0.6375               0.6425               0.6425               0.6425
2       1       36    37    27       47    37    16       47    37    16       43    36    21
        2       33    50    17       28    47    25       28    47    25       26    48    26
        3       95    74    31       69    73    58       69    73    58       67    72    61
        Acc     0.2925               0.3800               0.3800               0.3800
3       1       48    46    6        62    26    12       50    24    26       50    24    26
        2       33    45    22       11    51    38       8     58    34       8     58    34
        3       83    50    67       44    121   35       34    101   65       34    101   65
        Acc     0.4000               0.3700               0.4325               0.4325
It was assumed that the genes coming from the same cluster shared the same biological function. For each simulation set-up, we assumed that some genes from the first two clusters had known functions while the others had unknown functions. Specifically, G_1 contained 200 genes from cluster 1, G_2 contained 200 genes from cluster 2, and G_0 contained 100, 100 and 200 genes from the three clusters respectively. As before, the genes within G_1 and within G_2 were known to share the same function, while those in G_0 did not have known functions. We considered the ideal case with K_0 = 2 and K_1 = 1, and tried r = 0, 0.1, 0.2, ..., 1.

Some representative results from 1000 simulations are shown in Table 1. Note that the results are for test data; e.g. the prediction accuracy was calculated as the percentage of correct predictions for the genes in the test data (i.e. G_0). It can be seen that the standard and the new methods had close performance in set-up 1. However, as the noise level increased (set-up 2), or noise attributes were added (set-up 3), the new method outperformed the standard method with an appropriate r.

3.1.2 Silhouette As will be discussed in Section 3.2.3, it was surprising that the average silhouette did not work in selecting the shrinkage parameter. The reason turns out to be that, more generally, the average silhouette cannot be used for variable selection, as demonstrated by the simulation study below.

For each simulated training dataset, we generated 50 observations from each of the two bivariate Normal distributions N((0, 0)', I) and N((1, 1)', I), representing two clusters respectively. PAM was applied to the training data with (1) both attributes or (2) only the first attribute, to obtain two clusters. We generated another 500 observations from each of the two distributions to form a test dataset. Then each observation in the test data was assigned to whichever of the two clusters was closer to it. Under each case, we calculated the number of misclassified observations and the corresponding average silhouette for the test data. Note that, in calculating the average silhouette, we used only the first attribute for both cases. The simulation results below are based on 1000 independent replications.

First, using both attributes in clustering, the mean number of misclassified observations and the mean average silhouette (with
standard deviation in parentheses) were 265.4 (38.3) and 0.286 (0.130) respectively; in contrast, using only the first attribute, these numbers were 312.9 (15.2) and 0.550 (0.012). Hence, although it was confirmed that using both attributes gave better clustering performance with fewer misclassifications, the corresponding average silhouette was smaller than that of clustering with only the first attribute! In fact, even when the average silhouette was calculated with both attributes after clustering with both attributes, the mean average silhouette was only 0.340 (0.019), still smaller than that of clustering with only one attribute.

In summary, this example clearly demonstrates that the average silhouette cannot be used as a criterion to select attributes: although clustering with complete attributes gave better performance than using partial attributes, the average silhouette incorrectly indicated otherwise. As discussed in Section 2.3.2, the shrinkage method can be regarded as an approximation to using complete attributes while the standard method uses only partial attributes; therefore the average silhouette cannot be used to select between the shrinkage method and the standard method or, more generally, to select the shrinkage parameter.
3.2 Gene function prediction using gene expression profiles
An important application of microarray experiments is to cluster gene expression profiles for gene function discovery; see, e.g. Eisen et al. (1998), Brown et al. (2000), Wu et al. (2002), Zhou et al. (2002), Xiao and Pan (2005) and references therein. The premise is that genes sharing the same biological function are likely to co-express. Here we considered such an application for the yeast Saccharomyces cerevisiae. We used a large dataset containing 300 microarray experiments with gene deletions and drug treatments (Hughes et al., 2000). The data were centered and scaled so that the sample mean and variance of the expression profile for each gene were 0 and 1, respectively. Gene functions were downloaded from the MIPS database (Mewes et al., 2004). For illustration, we considered three gene functions, 'mitotic cell cycle and cell cycle control', 'mitochondrion' and 'C-compound and carbohydrate utilization', referred to as classes 1, 2 and 3 respectively in the sequel. There were 319, 312 and 218 genes in the three classes respectively.
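The per-gene standardization is the usual one; a short sketch (the function name standardize_rows is ours; rows of X are genes, and ddof=1 gives the sample variance):

import numpy as np

def standardize_rows(X):
    """Center and scale each row (gene profile) to sample mean 0, variance 1."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)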
For the purpose of evaluation, we randomly selected 200 genes from each of the first two classes and treated their functions as known, while treating those of the others (i.e. 119, 112 and 218 genes from the three classes respectively) as unknown. More specifically, the 200 genes from class 1 constituted G_1, and G_2 contained the 200 genes from class 2; for the remaining genes, we did two experiments: in the first, G_0 contained the 119 and 112 genes from the first two classes respectively; in the second, G_0 contained the 119, 112 and 218 genes from the three classes respectively. Note that the genes in G_0 were treated as test data: their functions were pretended to be unknown and were predicted based on clustering, and then compared with their true functions.

3.2.1 Two-class experiment Only the first two classes were considered: G_0 contained only the 119 and 112 genes from the first two classes. Recall that there are two steps in our method, generating K_0 and K_1 clusters respectively. For any of the K_0 clusters generated in step one, we counted the numbers of genes from the training data (i.e. G_1 and G_2), and assigned the cluster to class 1 or 2 depending on whether there were more genes from G_1 or from G_2; if there was an equal number of genes from G_1 and G_2, we randomly assigned one of the two classes to the cluster. Any gene in the test data (i.e. G_0) assigned to one of the K_0 clusters was predicted to be in the same class as that cluster. We conservatively treated any gene assigned to one of the K_1 clusters (generated in step two) as an incorrect prediction, though it would probably be better to treat it as no prediction being made. This is called hard-classification.

The hard-classification scheme goes with the majority class in a cluster while ignoring the winning margin, i.e. the difference between the majority and minority proportions. Perhaps a better method is soft-classification: for any cluster C, if the sample proportion of the genes in class s among the training data in C is p_{C,s}, then a test gene assigned to cluster C is predicted to be in class s with probability p_{C,s}. In this way, we can calculate the expected counts and thus a prediction accuracy (see the code sketch following Section 3.2.2).

The results for K_1 = 0, 1 and 2, based on hard- and soft-classification, are shown in Figures 1 and 2 respectively. It is clear that the shrinkage method mostly dominated the standard method; the advantage of the shrinkage method was more evident with soft-classification.

3.2.2 Three-class experiment Now consider a more practical situation: we know the functions of some genes, and the other genes may or may not have the known functions. Specifically, we knew two gene functional groups, G_1 and G_2, but among the remaining genes in G_0, some (the 119 and 112 genes respectively) indeed had one of the two known functions while others (i.e. the 218 genes from class 3) did not. To predict the function of a gene in G_0, if it was assigned to one of the K_0 clusters (generated in Step 1), we used the same rule as discussed before; if it was assigned to one of the K_1 clusters (generated in Step 2), it was predicted to have the function of class 3. The prediction accuracy rates based on hard- and soft-classification are shown in Figures 1 and 2 respectively, evidently indicating that the new method performed much better than the standard method.
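A minimal Python sketch of the expected accuracy under soft-classification (the function name soft_accuracy is ours; genes landing in the K_1 new clusters score zero here, matching the conservative hard treatment of the two-class experiment):

import numpy as np

def soft_accuracy(labels, truth, train_idx, test_idx, n_class, K0):
    """Expected prediction accuracy under the soft-classification rule.

    A test gene in cluster k (k < K0) is predicted to be in class s with
    probability p_{C,s}, the class-s proportion among the training genes
    in that cluster; truth maps gene index -> class index."""
    expected = 0.0
    for k in range(K0):
        train_k = [g for g in train_idx if labels[g] == k]
        if not train_k:
            continue
        # p_{C,s}: class proportions of the training genes in cluster k
        p = np.bincount([truth[g] for g in train_k],
                        minlength=n_class) / len(train_k)
        for g in test_idx:
            if labels[g] == k:
                expected += p[truth[g]]
    return expected / len(test_idx)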
3.2.3 Model selection Figure 3 presents the average silhouettes for various combinations of the shrinkage parameter r and the numbers of clusters K_0 and K_1. First, for the standard method (i.e. r = 1), the general trend of the average silhouette paralleled that of the prediction accuracy (Figs 1 and 2) for various values of K_0 and K_1. For example, the average silhouette was maximized at (K_0, K_1) = (2, 0) for the two-class problem and at (K_0, K_1) = (2, 2) for the three-class problem, leading to correct identification of the (K_0, K_1) that maximized the prediction accuracy. However, as discussed in Section 3.1.2, it was confirmed that the average silhouette appeared to incorrectly favor the standard method over the shrinkage method. Hence, we conclude that the average silhouette can be used to select the numbers of clusters, but not the shrinkage parameter r.

By comparison, CV performed well in the two-class experiment (Fig. 4): the CV accuracy estimates closely reflected the patterns of those for the test data (Figs 1 and 2). It should also work for other similar multi-class problems. As mentioned earlier, however, CV did not work in the three-class experiment (results not shown), because the third class was new and not covered by the training data.
4 DISCUSSION
Given the relatively high noise levels in genomic data, the importance of incorporating biological knowledge into statistical analysis has been increasingly recognized: in addition to the several methods reviewed earlier for clustering, there is also emerging work on incorporating pathway or functional annotations into detecting differential gene expression (e.g. Mootha et al., 2003; Al-Shahrour et al., 2005; Pan, 2005; Tian et al., 2005) and into classification (Lottaz and Spang, 2005). Nevertheless, current practice seems to be mainly restricted to using biological knowledge as an evaluation criterion to validate analysis results after the analysis is done; see two recent reviews by Handl et al. (2005) and Khatri and Draghici (2005), and the many references therein.

We regard our method as an exploratory statistical analysis tool, as in popular applications of hierarchical clustering (Eisen et al., 1998), K-means (Tavazoie et al., 1999) and self-organizing maps (SOMs) (Tamayo et al., 1999) to clustering gene expression profiles. Although we studied formal model selection methods to determine a proper choice of some tuning parameters, we achieved only mixed results. First, for the standard method with the shrinkage parameter r = 1, the average silhouette could correctly select the numbers of clusters; however, it did not work well in choosing a proper value for the shrinkage parameter. Indeed, it is a surprising finding that, in general, the average silhouette cannot be used to select attributes in clustering. Second, if it is reasonable to assume that only a small proportion of the genes fall into new functional categories not yet known, which may be the case for model organisms, cross-validation can be applied to choose the shrinkage parameter (and the numbers of clusters). As a challenging topic, model selection in the context of clustering needs further investigation. On the other hand, as an alternative to model selection, model averaging can be applied (e.g. Medvedovic and Sivaganesan, 2002; Wu et al., 2002): a number of different parameter values can be tried and their corresponding clustering results combined.

There is some similarity between our proposal and mixture discriminant analysis (MDA) (Hastie and Tibshirani, 1995).
Fig. 1. Prediction accuracy (based on hard-classification) for two classes (left panels) and for three classes (right panels) with gene expression data; from top to bottom row, K1 = 0, 1 and 2 respectively.
However, there is a key difference: ours is for unsupervised learning while MDA is for supervised learning. In MDA, all the genes known to have the same function are clustered separately from the others, and the genes of unknown function are forced to be assigned to those clusters. In contrast, in our proposal we allow genes with different functions to fall into the same cluster; in particular, some genes of unknown function can form new clusters.
In our numerical examples, we have assumed that, based on some biological knowledge, such as metabolic networks or gene functional annotations, the genome can be partitioned into several groups, in which one group consists of the genes of unknown function while each of the others contains the genes sharing a common function. More generally, as discussed in Section 2.3, our method can be extended to cases where biological knowledge is in the form of a network, such as a pathway or the GO: we can
simply adopt the approach of Hanisch et al. (2002) and Cheng et al. (2004) to define a distance metric based on the network, combine the network metric with an expression-based metric using the shrinkage formula (1) (where m_ij and d_ij correspond to the two metrics respectively), and proceed as before. We emphasize that a departure of our method from Hanisch et al. (2002) and Cheng et al. (2004) is that in Step 2 of our algorithm, we need to use the expression-based metric to avoid bias for the genes of unknown function.

Fig. 2. Prediction accuracy (based on soft-classification) for two classes (left panels) and for three classes (right panels) with gene expression data; from top to bottom row, K1 = 0, 1 and 2 respectively.

Although we used a simple majority vote for gene function prediction, other more elaborate methods that account for the possibly different sizes of functional categories using a hypergeometric
distribution can easily be applied (Tavazoie et al., 1999). In addition, rather than predicting gene functions, clustering genes by their expression profiles can also be used as a first step in inferring transcriptional regulatory networks (Tavazoie et al., 1999). Our proposal applies to a large class of distance-based clustering methods,
including various versions of partitioning methods (e.g. PAM) and hierarchical clustering, while those of Pan (2006) and Huang et al. (2006) are only suitable for model-based clustering. It is unclear how to extend the idea to some other existing clustering methods, such as K-means and SOMs (e.g. Tavazoie et al., 1999; Tamayo
et al., 1999; Tseng and Wong, 2005). These are interesting topics to be studied in the future.

Fig. 3. Average silhouette for two classes (left panels) and for three classes (right panels) with gene expression data; from top to bottom row, K1 = 0, 1 and 2 respectively.

Fig. 4. Five-fold CV accuracy based on hard-classification (left panels) and soft-classification (right panels) for two classes with gene expression data; from top to bottom row, K1 = 0, 1 and 2 respectively.
ACKNOWLEDGEMENTS
The authors thank Guanghua Xiao and Peng Wei for helpful discussions and assistance with the gene expression data. The authors are grateful to the reviewers for stimulating and constructive comments. D.H. was supported by National Natural Science Foundation of China grant No. 70503028 and a CMU grant, and W.P. by NIH grant HL65462 and a UM AHC Development grant.

Conflict of Interest: none declared.
REFERENCES
Al-Shahrour,F. et al. (2005) Discovering molecular functions significantly related to phenotypes by combining gene expression data and biological information. Bioinformatics, 21, 2988–2993.
Ashburner,M. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.
Breiman,L., Friedman,J.H., Olshen,R.A. and Stone,C.J. (1984) Classification and Regression Trees. Chapman and Hall, New York.
Broet,P. et al. (2002) Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. J. Comput. Biol., 9, 671–683.
Brown,M.P. et al. (2000) Knowledge-based analysis of microarray gene expression data using support vector machines. Proc. Natl Acad. Sci. USA, 97, 262–267.
Carlin,B.P. and Louis,T.A. (2000) Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall/CRC Press, Boca Raton, Florida.
Cheng,J. et al. (2004) A knowledge-based clustering algorithm driven by gene ontology. J. Biopharm. Stat., 14, 687–700.
Datta,S. and Datta,S. (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics, 19, 459–466.
Dudoit,S. and Fridlyand,J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol., 3, research0036.
Eisen,M. et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868.
Fang,Z. et al. (2006) Knowledge guided analysis of microarray data. J. Biomed. Inform., in press.
Ghosh,D. and Chinnaiyan,A.M. (2002) Mixture modeling of gene expression data from microarray experiments. Bioinformatics, 18, 275–286.
Handl,J. et al. (2005) Computational cluster validation in post-genomic data analysis. Bioinformatics, 21, 3201–3212.
Hanisch,D. et al. (2002) Co-clustering of biological networks and gene expression data. Bioinformatics, 18, 145–154.
Hastie,T. and Tibshirani,R. (1995) Discriminant analysis by mixture modelling. J. R. Statist. Soc. B, 58, 155–176.
Hastie,T., Tibshirani,R. and Friedman,J. (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
Huang,D. et al. (2006) Combining gene annotations and gene expression data in model-based clustering: a weighted method. OMICS, in press.
Hughes,T.R. et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.
Kanehisa,M. (1996) Toward pathway engineering: a new database of genetic and molecular pathways. Sci. Technol. Japan, 59, 34–38.
Kaufman,L. and Rousseeuw,P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
Kerr,M.K. and Churchill,G.A. (2001) Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc. Natl Acad. Sci. USA, 98, 8961–8965.
Khatri,P. and Draghici,S. (2005) Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, 21, 3587–3595.
Lottaz,C. and Spang,R. (2005) Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics, 21, 1971–1978.
McLachlan,G.J. et al. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413–422.
McLachlan,G.J. and Peel,D. (2002) Finite Mixture Models. John Wiley & Sons, Inc., New York.
Medvedovic,M. and Sivaganesan,S. (2002) Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics, 18, 1194–1206.
Mewes,H.W. et al. (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res., 32, D41–D44.
Mootha,V.K. et al. (2003) PGC-1 alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet., 34, 267–273.
Pan,W. (2005) Incorporating biological information as a prior in an empirical Bayes approach to analyzing microarray data. Stat. Appl. Genet. Mol. Biol., 4, Article 12.
Pan,W. (2006) Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics, 22, 795–801.
Pan,W. et al. (2002) Model-based cluster analysis of microarray gene-expression data. Genome Biol., 3, research0009.
Tamayo,P. et al. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 2907–2912.
Tavazoie,S. et al. (1999) Systematic determination of genetic network architecture. Nat. Genet., 22, 281–285.
Tian,L. et al. (2005) Discovering statistically significant pathways in expression profiling studies. Proc. Natl Acad. Sci. USA, 102, 13544–13549.
Tseng,G.C. and Wong,W.H. (2005) Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics, 61, 10–16.
van der Laan,M.J. et al. (2003) A new partitioning around medoids algorithm. J. Stat. Comput. Sim., 73, 575–584.
Wu,L.F. et al. (2002) Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet., 31, 255–265.
Xiao,G. and Pan,W. (2005) Gene function prediction by a combined analysis of gene expression data and protein–protein interaction data. J. Bioinform. Comput. Biol., 3, 1371–1389.
Yeung,K.Y. et al. (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics, 17, 977–987.
Zhang,K. and Zhao,H. (2000) Assessing reliability of gene clusters from gene expression data. Funct. Integr. Genomics, 1, 156–173.
Zhou,X. et al. (2002) Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA, 99, 12783–12788.