BIOINFORMATICS
ORIGINAL PAPER
Vol. 21 no. 7 2005, pages 1069–1077 doi:10.1093/bioinformatics/bti095
Gene expression
Clustering of gene expression data using a local shape-based similarity measure Rajarajeswari Balasubramaniyan1 , Eyke Hüllermeier2,∗ , Nils Weskamp2 and Jörg Kämper1,∗ 1 Max-Planck
Institute for Terrestrial Microbiology, Department of Organismic Interactions, Karl-von-Frisch-Straße, 35043 Marburg, Germany and 2 Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Straße, 35032 Marburg, Germany Received on July 2, 2004; revised on September 22, 2004; accepted on October 11, 2004 Advance Access publication October 28, 2004
ABSTRACT Motivation: Microarray technology enables the study of gene expression in large scale. The application of methods for data analysis then allows for grouping genes that show a similar expression profile and that are thus likely to be co-regulated. A relationship among genes at the biological level often presents itself by locally similar and potentially time-shifted patterns in their expression profiles. Results: Here, we propose a new method (CLARITY; Clustering with Local shApe-based similaRITY) for the analysis of microarray time course experiments that uses a local shape-based similarity measure based on Spearman rank correlation. This measure does not require a normalization of the expression data and is comparably robust towards noise. It is also able to detect similar and even time-shifted sub-profiles. To this end, we implemented an approach motivated by the BLAST algorithm for sequence alignment. We used CLARITY to cluster the times series of gene expression data during the mitotic cell cycle of the yeast Saccharomyces cerevisiae. The obtained clusters were related to the MIPS functional classification to assess their biological significance. We found that several clusters were significantly enriched with genes that share similar or related functions. Contact:
[email protected]; eyke@mathematik. uni-marburg.de Availability: Upon request from the authors.
INTRODUCTION DNA microarray technology allows for studying the expression of thousands of genes in parallel (Shalon et al., 1996; Brown and Botstein, 1999; Duggan et al., 1999; Lipshutz et al., 1999; Hegde et al., 2000). Especially, the data obtained from microarray analyses provide us with the opportunity to analyze the relationship between genes in response to a particular experimental condition. Thus, it becomes possible to extract biologically meaningful knowledge, e.g. to assign functionality to genes whose function is yet unknown and to assemble them into genetic networks (Gerstein and Jansen, 2000; Slonim, 2002). If the expressions of genes are measured at various time points during the course of an experimental study, each gene can be ∗ To
whom correspondence should be addressed.
characterized by means of an expression profile in the form of a time series (sequence). Grouping genes that share similar expression profiles into clusters is usually the first step towards the understanding of the huge amount of data produced by a DNA array experiment. The clustering of expression profiles is done in two steps. First, the similarity or distance between pairs of profiles is determined, using standard measures such as, e.g. Euclidean distance or Pearson correlation. In the second step, a clustering method is employed in order to group the expression profiles on the basis of their pairwise similarities. A clustering structure thus obtained can support the understanding of the transcription-control mechanism of the organism under study. For an in-depth analysis that gives more insight into the functional and biochemical role of the genes in the clusters, this transcriptional information can be combined with additional information on protein–protein interactions, sub-cellular localizations or functional classifications. Most methods that have been applied to time series microarray data so far derive similarity degrees between genes by comparing complete expression profiles, typically after these profiles have been normalized. Moreover, these methods do often filter the data by removing profiles containing only insignificant changes (2-fold changes are usually considered as significant). This approach has several drawbacks. First, with regard to filtering the data, it deserves mentioning that genes encoding regulatory factors, such as transcriptional regulators that are the acting switches in a transcriptional network, may be expressed only at very low levels. Moreover, small changes of these levels may produce significant changes in the expression of the genes regulated by the corresponding factors. By considering only significant changes of expression values, such key players in a regulatory network might be missed. Second, while normalization is clearly reasonable when comparing profiles whose expression values, in terms of absolute numbers, differ by orders of magnitude, it may also lead to undesirable side effects. For example, normalization is extremely sensitive toward outliers and noisy data. Third, a relationship between genes often presents itself by locally rather than globally similar patterns in their expression profiles. That is, the complete profiles share similar sub-profiles, which might furthermore be time-shifted, as, e.g. in the case of transcriptional
© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email:
[email protected] 1069
R.Balasubramaniyan et al.
regulators and the genes controlled by them. Apart from that, it is conceivable that not all genes respond to all conditions (or time points) in an experiment. In all these cases, the use of a global similarity measure, i.e. the computation of a similarity degree between the complete expression profiles, will not reveal the relationship between two genes, simply because this degree might be rather small. In this paper, we propose a new method for time course experiments that addresses all of the aforementioned problems. In particular, we employ a shape-based similarity measure based on the Spearman rank correlation. This measure circumvents the problem of normalization and is more robust towards noise in data. Moreover, our approach does not require profiles to be globally similar. Rather, it also detects similar and even time-shifted sub-sequences of expression values. To this end, we implemented an approach motivated by the BLAST algorithm for sequence alignment (Altschul et al., 1990). The remainder of the paper is organized as follows: In Section 2, we give a brief overview of similarity (distance) measures and existing algorithms for discovering relationships between gene expression profiles. In Section 3, we describe our algorithm CLARITY in detail. In Section 4, we discuss the results that we have obtained for the time series data of gene expression during the mitotic cell cycle of the yeast Saccharomyces cerevisiae, focusing on the biological significance of clustered genes. The paper concludes with a brief summary in Section 5.
RELATED WORK Various methods for extracting meaningful information from gene expression data have already been proposed in literature. This section gives a brief overview of related work on similarity measures. For a more comprehensive and general overview on microarray time series data analysis we refer the interested reader to the review of Möller-Levet et al. (2003). The most popular and probably most simple measures that have been used in the context of gene expression data are the Pearson correlation, a statistical measure of (linear) dependence between random variables, and the Euclidean distance (Gerstein and Jansen, 2000; Slonim, 2002). Apart from that, more sophisticated measures have been employed such as, e.g. mutual information (Herwig et al., 1999). A method for approximating an ideal similarity measure by training a neural network has been suggested in (Sawa and Ohno-Machado, 2003). Spellman et al. (1998) introduced a method particularly suitable for cyclic data such as, e.g. cell cycle time series. Roughly speaking, they compare the Fourier transformations of the time series rather than the original series themselves. Qian et al. (2001) addressed the problem of identifying local similarities in gene expression data by means of a method that is based on the Smith–Waterman algorithm for (local) sequence alignment. However, they first normalize the data by converting each expression value to its z-score (i.e. subtracting the mean and dividing by the standard deviation of a profile). A global normalization of this type is critical since it is not in agreement with the goal to discover local similarities. In fact, one should realize that normalizing a complete profile means that two sub-profiles cannot be compared independently of all other expression values. All of the aforementioned indices are numerical measures in the sense that they depend on the concrete numbers in the expression profiles. As opposed to this, Wen et al. (1998) suggest a shape-based similarity measure that compares two profiles on the basis of qualitative
1070
changes of expression values. Thus, two sequences are considered as similar if they increase and decrease more or less simultaneously. However, this measure is still a global one in the sense that all time points are taken into consideration. As mentioned in the previous section, exclusively looking for global similarity is often too demanding and comes along with a risk of missing interesting (local) relationships. Filkov et al. (2002) proposed a kind of ‘edge detection’ method for periodic datasets with small sequences. This method looks for local regions in pairs of expression profiles where major changes in expression occur (edges). Roughly speaking, the profiles are regarded as similar if they do have similar edges. Kwon et al. (1999) suggested an ‘event-based’ edge detection method. An event in a specific time interval is considered as the directional change of the gene expression curve at that instant. This method converts the raw data to a string of events, such as: ‘R’ representing changes greater than a certain (upper) threshold value, ‘F’ for changes less than a (lower) threshold and ‘C’ for insignificant changes. The event strings are then aligned using a modified version of the Needleman–Wunsch algorithm for global sequence alignment. By converting a time series into a sequence of events such as increase or decrease, the methods in the last paragraph tend to oversimplify the original data. This makes them robust toward noise and outliers, but also loses a lot of information contained in the original time series. Our method, detailed below, can be seen as a good compromise between the numerical and event-based approaches described in this section.
METHOD Recall that we are looking for local relationships (similarities) between expression profiles and, furthermore, that we seek to incorporate the possibility of time-shifts. Thus, we consider two genes’ expression profiles X and Y , represented by sequences (x1 . . . xn ) and (y1 . . . yn ), as similar if there are similar subsequences X[i, j ] and Y [k, l], where X[i, j ] =def (xi , xi+1 . . . xj ) for 1 ≤ i ≤ j ≤ n. In analogy to sequence analysis, one can consider a tuple (X[i, j ], Y [k, l]) as a (local) ‘alignment’ (of length j − i + 1 = l − k + 1).
Exact similarity computation The simplest approach for discovering local, time-shifted relationships between two profiles is of course to enumerate all possible alignments in a systematic way. Thus, the similarity SIM(X, Y ) between two profiles X and Y of length n is computed as SIM(X, Y ) =
def
max SIMk (X, Y ),
kmin ≤k≤n
(1)
where SIMk (X, Y ) is given by max
1≤i,j ≤n−k+1
S(X[i, i + k − 1], Y [j , j + k − 1])
(2)
for an underlying basic similarity measure S. Our particular choice of S will be discussed and motivated below. As can be seen, SIMk (X, Y ) corresponds to the similarity of the best alignment of length k. Of course, deriving the similarity between two profiles from very short local alignments is questionable (and usually not statistically significant). Therefore, the parameter kmin specifies a lower bound to the length of an alignment.
Approximate similarity computation A straightforward implementation of Equation (1) leads to a ‘sliding window’ algorithm [i.e. SIMk (X, Y ) is computed by sliding two windows of size k over X and Y ] whose time complexity is O(n3 ). Note that the complexity is reduced to O(n2 ) if no time-shifts are allowed [and, hence, i = j in
Clustering of gene expression data
Equation (2)]. Both cases are completely acceptable for small n. For longer expression profiles, however, the exact computation of SIM(X, Y ) might become too expensive. In that case, we suggest the use of an approximate algorithm that is inspired by the well-known BLAST method for sequence alignment. The idea of this approach is to find an initial ‘hit’ in the form of a short optimal alignment. Then, in a second step, this alignment is extended in both directions. More precisely, our heuristic approach works as follows: (1) Hit: SIMk (X, Y ) is computed for k = kmin . Suppose that this similarity degree is obtained for the ‘best match’ X[ax , bx ], Y [ay , by ], i.e. SIMk (X, Y ) = S(X[ax , bx ], Y [ay , by ]). If the best match is not unique, the second step is carried out for all other candidates as well.
60 50 40 30 20 10 0 –10
(2) Extend: The similarity degrees
–20
S(X[ax − u, bx + v], Y [ay − u, by + v]) are derived for all 0 ≤ u ≤ min{d, ax − 1, ay − 1}
15
and the best match (X[ax − u∗ , bx + v ∗ ], Y [ay − u∗ , by + v ∗ ]) is determined. In the case of ties, longer matches are preferred to shorter ones. If there are still several optimal matches, one of them is chosen at random.
10
(3) Iterate: The optimal local alignment is updated by setting ax ← ax − u∗ , bx ← bx + v ∗ , ay ← ay − u∗ , by ← by + v ∗ , and the second step is repeated. This process is iterated until the optimal alignment does not change (u∗ = v ∗ = 0).
0
Basic similarity measure So far, the algorithm outlined above is completely independent of the basic similarity measure S. As noted before, measures commonly used in gene expression analysis include the Euclidean distance and the Pearson correlation. Such numerical measures are easy to compute but suffer from some disadvantages. Particularly, they are quite sensitive toward outliers and measurement errors, a point of critical importance in connection with gene expression data. Moreover, in the context of expression analysis, we prefer a concept of similarity that is based on the qualitative behavior or, say, the shape of the profiles to one that is very sensitive to the precise values of fold changes. Our similarity measure S is therefore defined by the Spearman rank correlation (SRC). The SRC between two profiles X and Y is Given by n 6 SRC(X, Y ) = 1 − (rX (xi ) − rY (yi ))2 , n(n2 − 1) i=1
where rX (xi ) is the rank of xi in the profile (x1 . . . nn ): rX (xi ) = k ⇔ |{j | xj < xi }| = k − 1. Actually, we used an extended version of the SRC which takes the possibility of ties, i.e. xi = xj for i = j , into account. The SRC satisfies −1 ≤ SRC(X, Y ) ≤ 1 for all X, Y . To exemplify the aforementioned difference between the Pearson correlation and SRC, Figure 1 shows two profiles (of length 7), which are highly correlated according to the latter but almost uncorrelated according to the
2
3
4
5
6
7
Fig. 1. The SRC for the two profiles is 0.93 whereas the Pearson correlation is 0.3.
0 ≤ v ≤ min{d, n − bx , n − by }
The parameter d in the second step is a pre-specified constant that determines the size of the neighborhood to be searched and, hence, the complexity of this step, which is obviously O(d 2 ). Note that the ‘myopic’ strategy obtained for d = 1 carries a high risk of getting caught in local maxima. On the other hand, experience has shown that large values for d are usually not necessary for this type of ‘look-ahead search’. Most often, sufficiently good approximations or even exact results are already obtained for d = 2.
1
5
5 10 15
1
2
3
4
5
6
7
Fig. 2. The SRC for the two profiles is 0.93 whereas Equation (3) yields a similarity of 0.3. former. This is mainly caused by the comparatively large value of the third fold change in one of the sequences. As opposed to this, the SRC correctly reflects the fact that both profiles have a rather similar shape. In fact, even though SRC ignores some information, it seems that it retains just the relevant information, making it much more robust than the Pearson correlation. In this connection, it should also be noted that SRC retains more information than a frequently used qualitative measure that compares the sign of the first-order differences (i.e. the ups and downs): n−1 1 sgn((xi+1 − xi )(yi+1 − yi )), n
(3)
i=1
where sgn(z) is the sign of z. For example, this measure suggests a similarity of only 0.3 for the two profiles in Figure 2, even though both profiles do again have a rather similar shape. Figure 4 shows an example obtained from the mitotic cell cycle time course experiment (Cho et al., 1998, see below) of two expression profiles where local SRC yields a similarity of 0.8, whereas Pearson correlation gives only 0.4. The overall similarity of two profiles X and Y , as defined by Equation (1), is the maximum of similarity degrees for sequences of different lengths k.
1071
R.Balasubramaniyan et al.
1
significance
0.8 0.6 0.4 0.2 0
1
0.5
0 measure
0.5
1
Fig. 3. The empirical distribution functions of the (maximal) SRC obtained from simulations for n = 17 and k = 15 (solid line in green) resp. k = 10 (dashed line in blue).
In order to guarantee the comparability of the similarities SIMk (X, Y ), kmin ≤ k ≤ n, these similarities have been defined by their corresponding P -value rather than by the SRC directly. Thus, if S ∗ denotes the optimal SRC that has been found for sequences of length k, then SIMk is given by the probability of obtaining a correlation of at most S ∗ under the null hypothesis of completely unrelated profiles (of length n). Note that the statistical distribution of this measure is an extreme value distribution which depends on the parameters k and n. As there is apparently no simple analytical expression for this distribution, we derived approximations from simulated data. Figure 3 shows the result of such a simulation for profiles with 17 time points. In order to decide whether or not two profiles X and Y are ‘significantly similar’, we also need the P -value of the overall similarity SIM(X, Y ). Again, we derived approximations of these P -values from simulated distributions.
EXPERIMENTAL RESULTS Data and clustering We used data from a mitotic cell cycle time course experiment in the yeast S.cerevisiae that included 6331 open reading frames and has been measured over 17 time points by Cho et al. (1998). The yeast cell cultures were synchronized using the so-called Cdc28 arrest and sampled at uniform intervals covering nearly two complete cycles of cell cycle. The experiments were done using Affymetrix oligonuleotide array. The dataset is scaled to account for the experimental differences between the arrays used. As a first step some data points that appeared to be aberrant were eliminated. The dataset was then converted to ratio style measurements by dividing each measurement by the average value of the measurements for that gene as described in Spellman et al. (1998). A total of 6145 genes were taken for further analysis. We first applied CLARITY with kmin = 15 in order to calculate the pairwise similarity matrix between the expression profiles of the individual genes, i.e. we allow for time-shifts of maximally two time points as longer shifts are hard to explain from a biological point of view. Additionally, it is difficult to obtain significant degrees of similarity if much shorter sub-profiles are used as can be seen from the simulations described above. Figure 5 shows an example of a time-shifted relationship among genes that was described by Yu et al. (2003) and that was also successfully identified by our approach.
1072
Clusters of genes were derived from the similarity matrix thus obtained using CLUTO, a program package for graph-based clustering (Karypis, 2002, http://www.cs.umn.edu/∼ cluto). CLUTO first constructs a graph where each gene is represented by a node, and edges between nodes are labeled with corresponding similarity degrees. Dense regions in this graph thus correspond to sets of genes that are highly co-expressed and thus form good candidates for clusters. For reasons of computational efficiency, and in order to avoid a bias of the results due to insignificant relationships between genes, we simplified the graph in a preprocessing step: The edge between two nodes is deleted whenever the corresponding similarity degree falls below a similarity threshold. In order to derive a clustering structure, CLUTO partitions the graph thus obtained by means of optimal (minimal) cuts. This is repeated in a recursive manner until a pre-specified number of clusters has been constructed. As can be seen, two critical parameters have to be specified for our clustering approach, namely the number of clusters and the similarity threshold. Fortunately, we found that in our case the clustering results are quite robust toward variations of the above parameters. More specifically, computations with various similarity thresholds showed that thresholds above 0.7 yield almost identical clustering structures. Likewise, we found that the clustering structure did not change appreciably if the number of clusters was raised above 25. We therefore decided to use this number to obtain a maximal number of ‘independent’ sets of genes. Additionally, the generated clusters appear to be quite homogeneous and show a high internal similarity. See Figure 6 for two examples.
Functional evaluation The MIPS functional categories (Mewes et al., 2002) provide comprehensive information about the known functions of the genes in the yeast genome. In several cases it has been shown that genes with similar functions can be co-expressed (Eisen et al., 1998). To elucidate the biological significance of the clusters generated by our procedure, the clusters produced from CLUTO were mapped to the 400 different MIPS functional categories, and one or several predominant categories were assigned to each cluster. Moreover, in order to prove that the occurrence of a predominant category is statistically significant, we derived corresponding P -values for each cluster. The probability of observing at least m ORFs from a functional category within a cluster of size n is given by P =1−
m−1 fi i=0
−f · Nn−i N
(4)
n
where f is the total number of genes within a functional category and N is the total number of genes within the genome (6331)1 . For reasons of numerical stability, we computed the above value using the binomial distribution as an approximation of the hypergeometric distribution. Clusters with P -values less than 10−4 and average similarity among cluster members smaller than 0.5 were not reported. We found several clusters to be significantly enriched with genes of a similar function2 . The results are summarized in Table 1. 1 This P -value obviously ignores the problem of multiple-hypothesis testing and should therefore be interpreted with caution. 2 The parameters for edge pruning and vertex pruning of the CLUTO program
were set to 0.2 to increase the internal similarity of the generated clusters.
Clustering of gene expression data
2
1
0 1
2
3
4
5
6
7
8
9
10 11 12 13 14 15 16 17
–1
–2
YPL241C Cin2
YPL256C Cln2
Fig. 4. The local SRC for this pair of genes is 0.8 over the first 15 time points, whereas the global Pearson correlation is only 0.4. For each of the time points, the expression ratio is plotted. The genes Cin2 and Cln2 were detected as coinduced in a Cdc28-13 mutant during late G1 phase of the cell cycle (Cho et al., 1998).
1
0 1
2
3
4
5
6
7
8
9
10 11 12 13 14 15 16 17
–1
–2
YOR372C (Ndd1)
YHR178W (Stb5)
YDR318W (Mcm21)
Fig. 5. An example for a time-shifted relationship that was also identified by Yu et al. (2003). For each of the time points, the expression ratio is plotted. The profiles are shifted one time point, starting from time point 2. These genes are found in the same cluster 19 if one uses kmin = 15. If one sets kmin = 17, they are found in different clusters.
In general, the clusters can be divided into those that follow the periodicity of the cell cycle [clusters 3, 4, 8; see Fig. 6(a) for an example] and those that are not cell cycle related [clusters 0, 5, 6, 15, 16 and 21; see Fig. 6(b) for an example]. Among the clusters with non-periodically expressed genes, the most significant functional grouping occurs in cluster 0. This cluster consists of 43 genes, 33 of them being associated with ribosome biogenesis (P -value 5.0 × 10−34 ). Cluster 16 also contains a significant number of protein synthesis related genes (46 out of 168, P -value 10−18 ), including 29 ribosomal genes (P -value 1.2 × 10−20 ). In addition, this cluster contains 14 genes related to amino acid metabolism (P -value 1.5 × 10−4 ), indicating that genes within this cluster might play a role in protein synthesis and related functions. Cluster 21 has a significant enrichment of genes that can be related to energy (P -value 4.5 × 10−12 ) in the broader sense, including genes related to mitochondrial protein synthesis (4.9 × 10−11 ), mitochondrial organization (P -value 5.8 × 10−15 ), and energy and
carbohydrate metabolism (P -value 2.5×10−7 ). Six among the seven genes for the nuclear encoded proteins of the cytochrome C oxidase protein complex IV (Cox4, Cox5a, Cox7, Cox8, Cox12 and Cox13) are present within this cluster, and, similarly, several genes encoding proteins of the mitochondrial protein synthesis turnover complex (Mrpl10, Mrpl17, Mrpl24, Mrpl28, Mrpl3, Ydr116C and Ypr100W). These findings support the functional relationship of these genes; the proteins encoded should be co-expressed in stoichiometric amounts required for the assembly of the respective protein complexes. One of the advantages of the CLARITY algorithm is that timeshifted relations can be discovered. The time-shifted relations constitute up to 55.4% of the total number of relationships within individual clusters (Table 2). To elucidate whether the implementation of time-shifted relations can aid the discovery of biological implications, we analyzed cluster 15, comprising the highest portion of time-shifted correlations, in more detail. We compared this cluster with the clusters from Tavazoie et al. (1999) that have been generated using Euclidean distance and
1073
R.Balasubramaniyan et al.
18 16 14 12 10 8 6 4 2 0 1
2
3
4
5
6
7
8
9
10 11 12 13 14 15 16 17
1
2
3
4
5
6
7
8
9
10 11 12 13 14 15 16 17
18 16 14 12 10 8 6 4 2 0
Fig. 6. The profiles of two generated clusters. (a) a cluster of genes that are highly related to the cell cycle, (b) a cluster of cell-cycle independent genes. For each of the time points, the ranks of the expression value are plotted to clarify why the method assigned the genes to the same cluster. See Table 1 for a quantitative analysis of these clusters.
Table 1. Analysis of the relations among n entries of the respective clusters
Cluster
Number of genes
Number of relationships
Number of time-shifted relationships
Percentage
0 3 4 5 6 7 15 16 19 21
43 74 60 61 86 113 77 168 130 201
903 2701 1770 1830 3655 6328 2926 14028 8385 20100
20 822 368 407 3 711 1622 653 3242 7062
2.2 30.4 20.8 22.2 0.08 11.2 55.4 4.7 38.7 35.0
For each of the 21 n(n−1) possible relationships, the number of time-shifted relationships found by CLARITY is shown. Some of the clusters are constituted heavily by timeshifted relationships. Disregarding of such relations would thus likely lead to a different cluster structure.
1074
k-means clustering. From 77 genes in cluster 15, 38 were found in clusters 4, 5 and 8 of Tavazoie et al. The remaining 39 genes have not been assigned to any cluster. Among such genes is Tps3, encoding the regulatory component of the trehalose-6-phosphate synthase/phosphatase complex consisting of Tps1p, Tps2p and Tps3p. Although Tps1, encoding the trehalose-6-phosphatase synthase is present in cluster 8 of Tavavoie et al., the time-shifted relation with Tps3 has not been detected by these authors [see Fig. 7(a)]. Similarly, Put1 (Proline oxidase) and Put2 (Delta-1-pyrroline-5carboxylate dehydrogenase) are found to be in cluster 15, but were found to be in different clusters by Tavazoie et al. Put2p in conjunction with Put1p converts proline to glutamate in the mitochondrion. In addition, cluster 15 harbors several other genes involved in glutamate metabolism that show local similarities with Put1 and Put2: Put4, a high affinity proline permease, Agp1, the principal transporter of asparagine and glutamine, Dip5, an amino acid permease for the transport of alanine, glycine, serine, asparagine and glutamine, and the glutamate decarboxylase Gad1 [see Fig. 7(b)]. As expected, among the clusters with a periodic profile, many of the genes encode proteins with functions in cell-cycle
Clustering of gene expression data
Table 2. Results of our clustering analysis (for each of the generated cluster, we show some MIPS functional categories that were found to be significantly enriched)
Cluster number
Internal similarity
Number of ORFs (n)
0
0.737
43
3 4
0.766 0.740
74 60
5
0.755
61
6 8
0.803 0.824
86 113
15 16
0.732 0.722
77 168
19
0.622
130
21
0.800
201
MIPS category Number of Code ORFs(m)
ORFs within category (k)
P -value
Name
380 234 492 271 157 492 354 832 75 31 624 271 103 157 492 354 186 186 380 234 492 354 430 343 186 121 260 93 37 25 378 380 234 52 100 128
Protein synthesis Ribosome biogenesis Cell cycle DNA processing DNA recombination and repair Cell cycle Mitotic cell cycle control Transcription rRNA synthesis tRNA synthesis Protein fate DNA processing DNA synthesis and replication DNA recombination and repair Cell cycle Mitotic cell cycle control Amino acid metabolism Amino acid metabolism Protein synthesis Ribosome biogenesis Cell cycle Mitotic cell cycle control mRNA synthesis Transcription control Amino acid metabolism Amino acid biosynthesis Energy Energy: respiration Energy: glycolysis TCA cycle C-compound and carbohydrate metabolism Protein synthesis Ribosome biogenesis Mitochondrial ribosomal proteins Cytoplasmic and nuclear degradation Control of cellular organization: mitochondria
33 33 16 10 8 19 11 24 10 3 19 35 17 17 29 19 9 14 46 39 33 18 19 16 15 10 33 14 6 6 32 29 20 14 11 26
7.5 · 10−27 5.0 · 10−34 9.5 · 10−5 6.9 · 10−5 2.1 · 10−5 9.3 · 10−8 1.8 · 10−4 1.0 · 10−6 1.9 · 10−10 2.2 · 10−4 4.7 · 10−4 1.7 · 10−20 4.3 · 10−13 6.0 · 10−10 1.0 · 10−8 8.3 · 10−6 1.0 · 10−4 1.5 · 10−4 4.7 · 10−18 1.2 · 10−20 1.5 · 10−9 1.7 · 10−4 7.3 · 10−4 8.9 · 10−4 1.9 · 10−6 4.4 · 10−5 4.5 · 10−12 2.5 · 10−7 1.4 · 10−4 9.4 · 10−6 2.5 · 10−7 6.4 · 10−6 2.3 · 10−5 4.9 · 10−11 8.2 · 10−5 5.8 · 10−15
05 05.05 03.03 03.01 03.01.05 03.03 03.03.01 04 04.01.01 04.03.01 06 03.01 03.01.03 03.01.05 03.03 03.03.01 01.01 01.01 05 05.01 03.03 03.03.01 04.05.01 04.05.01.04 01.01 01.01.01 02 02.13 02.01 02.10 01.05 05 05.01 05.01.01 06.13.01 30.16
dependent processes, like DNA- processing, DNA-synthesis and DNA-replication. The periodic clusters are defined by the timing of the maximum expression of the genes within the cluster. For example, cluster 8 can be considered as G1 specific as it harbors 95 out of 300 reported genes that peak during the G1 phase of the cell cycle, while cluster 3 includes 33 out of 197 genes regulated in the M phase (Spellman et al., 1998). Let us finally note that not all clusters show a significant enrichment for any of the functional categories. It can be assumed that the genes in these clusters participate in several of the processes defined by the functional categories.
CONCLUSION We presented a new approach to solve the problem of finding locally similar regions in gene expression profiles. These local regions can
be time-shifted to allow, e.g. for the detection of transcription control relationships. Our measure of similarity is based on Spearman rank correlation and can be seen as a good compromise between numerical measures (like, e.g. Pearson correlation or Euclidean distance) and simple qualitative measures (like, e.g. measures that consider only ‘ups’ and ‘downs’ of a time series) that ignore much of the relevant information. Simulations were performed to assess the statistical significance of the obtained degrees of similarity. The actual comparison of the profiles is then performed with a heuristic Sliding window approach. The latter has the advantage that it does not impose restrictions on properties of the similarity measure as do methods that rely on dynamic programming (Qian et al., 2001). For example, Spearman rank correlation could not be used with the approach of Qian et al. (2001), as the former does not allow one to calculate the similarity of two profiles directly from the similarities of its sub-profiles.
1075
R.Balasubramaniyan et al.
1
0 1
2
3
4
5
6
7
8
9
10 11 12 13 14 15 16 17
–1 YBR126C (Tps1)
YMR261C (Tps3)
3 2 1 0 1
2
3
4
5
6
7
8
9
10 11 12 13 14 15 16 17
–1 –2 –3 YCL025C Agp1 YLR142W Put1 YPL265W Dip5
YCLX09W AGP1 YMR250W Gad1
YHR037W Put2 YOR348C Put4
Fig. 7. Examples of expression profiles that are involved in related biological processes and that have been assigned to the same cluster by CLARITY. The relations among these genes were not detected by the previous work of Tavazoie et al. (1999).
We applied our method to a dataset of gene expression profiles from the yeast S.cerevisiae that was measured to study the mitotic cell cycle (Cho et al., 1998). The similarities among the profiles were then used to assign co-expressed genes to clusters. Some of the obtained clusters contain mainly cell-cycle related genes that show—as expected—a periodic behavior. Other clusters contain genes that have a non-periodic expression profile and thus are not directly related to the cell cycle. This result is expected, as our approach considers similar profiles, which show a similar shape— and a cyclic behavior is just one of the many possible shapes of a profile. The obtained clusters were then compared against an existing functional classification of the genes of S.cerevisiae. Some of the clusters showed a significant enrichment of genes of a particular functional category, reflecting the fact that genes with a similar function are often co-regulated and thus co-expressed.
ACKNOWLEDGEMENTS We would like to thank S. Tornow. I. Tetko and W. Mewes for stimulating discussions and helpful advice.
1076
REFERENCES Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410. Brown,P. and Botstein,D. (1999) Exploring the new world of the genome with DNA microarrays. Nat. Genet. Suppl., 21, 33–37. Cho,R.J., Campbell,M.J., Winzeler,E.A., Steinmetz,L., Conway,A., Wodicka,L., Wolfsberg,T.G., Gabrielian,A.E. Landsman, Lockhart,D.J. and Davis,R.W. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell, 2, 65–73. Duggan,D.J., Bittner,M., Chen,Y., Meltzer,P. and Trent,J.M. (1999) Expression profiling using cDNA microarrays. Nat. Genet. Suppl., 21, 10–14. Eisen,M.B., Spellman,P.T., Brown,P.O. and Botstein,D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci., USA, 95, 14863–14868. Filkov,V., Skiena,S. and Zhi,J. (2002) Analysis techniques for microarray time-series data. J. Comput. Biol., 9, 317–330. Gerstein,M. and Jansen,R. (2000) The current excitement in bioinformatics-analysis of whole-genome expression data: how does it relate to protein structure and function? Curr. Opin. Struct. Biol., 10, 574–584. Hegde,P., Qi,R., Abernathy,K., Gay,C., Dharap,S., Gaspard,R., Hughes,J.E., Snesrud,E., Lee,N. and Quackenbush,J. (2000) A concise guide to cDNA microarray analysis. Biotechniques, 2, 548–550. Herwig,R., Poustka,A.J., Muller,C., Bull,C., Lehrach,H. and O’Brien,J. (1999) Largescale clustering of cDNA-fingerprinting data. Genome Res., 9, 1093–1105.
Clustering of gene expression data
Karypis,G. (2002) CLUTO—a clustering toolkit. Technical Report 02-017, Dept. of Computer Science, University of Minnesota. Kwon,A.T., Hoos,H.H. and Ng,R. (1999) Inference of transcriptional regulation relationships from gene expression data. Bioinformatics, 19, 905–912. Lipshutz,R.J., Fodor,S.P.A., Gingeras,T.R. and Lockhart,D.J. (1999) High density synthetic oligonucleotide arrays. Nat. Genet. Suppl., 21, 20–24. Mewes,H.W., Frishman,D., Güldener,U., Mannhaupt,G., Mayer,K., Mokrejs,M., Morgenstern,B., Münsterkoetter,M., Rudd,S. and Weil,B. (2002) MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 30, 31–34. Möller-Levet,C., Cho,K.H., Yin,H. and Wolkenhauer,O. (2003) Clustering of gene expression time-series data. Technical report, University of Rostock, Department of Computer Science. Qian,J., Dolled-Filhart,M., Lin,J., Yu,H. and Gerstein,M. (2001) Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions. J. Mol. Biol., 314, 1053–1066. Sawa,T. and Ohno-Machado,L. (2003) A neural network-based similarity index for clustering DNA microarray data. Comput. Biol. Med., 33, 1–15.
Shalon,D., Smith,S. and Brown,P. (1996) A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Res., 6, 639–645. Slonim,D.K. (2002) From patterns to pathways: gene expression data analysis comes of age. Nat. Genet. Suppl., 32, 502–508. Spellman,P.T., Sherlock,G., Zhang,M.Q., Iyer,V.R., Anders,K., Eisen,M., Brown,P., Botstein,D. and Futcher,B. (1998) Comprehensive identification of cell cycleregulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297. Tavazoie,S., Hughes,J.D., Campbell,M.J., Cho,R.J. and Church,G.M. (1999) Systematic determination of genetic network architecture. Nat. Genet., 22, 281–285. Wen,X., Fuhrman,S., Michaels,G.S., Carr,D.B., Smith,S., Barker,J.L. and Somogyi,R. (1998) Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl Acad. Sci., USA, 95, 334–339. Yu,H., Luscombe,M.N., Qian,J. and Gerstein,M. (2003) Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet., 19, 422–427.
1077