BIOINFORMATICS
Vol. 20 no. 16 2004, pages 2759–2766 doi:10.1093/bioinformatics/bth323
Mining gene expression data based on template theory
Zheng Rong Yang
School of Engineering and Computer Science, Exeter University, Exeter EX4 4QF, UK
Received on August 21, 2003; revised on November 6, 2003; accepted on November 13, 2003
Advance Access publication May 27, 2004
INTRODUCTION
DNA microarrays are a new technology for comparing gene expression measurements in different populations of cells rapidly and efficiently, because the technique can monitor thousands of genes at the same time (Schena et al., 1995; Lockhart et al., 1996; Shalon et al., 1996). A gene profile is a collection of DNA microarray hybridization experiments. A gene can be represented by a set of experimental measurements. Each measurement is the ratio of the expression strength of a gene under two different experimental conditions (DeRisi et al., 1997); the numerator and the denominator are the measurements under experimental conditions B and A, respectively. Genes of similar function are believed to have similar patterns of expression (Eisen et al., 1998).
Bioinformatics vol. 20 issue 16 © Oxford University Press 2004; all rights reserved.
With the rapid increase in gene data, it becomes progressively harder to find the underlying biological knowledge that governs the patterns without the support of pattern recognition algorithms. Analysing gene expression data to search for correlations among genes is challenging, and cluster analysis has been the most popular family of pattern recognition algorithms in this area. The great biological merit of clustering genes using gene expression data has been recognised (DeRisi et al., 1997). Eisen et al. (1998) indicated that genes of similar function should demonstrate similar patterns of expression. The importance of clustering genes is therefore to validate and understand genes related to different biological or medical properties, such as cancer. Of the many clustering algorithms, hierarchical clustering algorithms have been the most widely used (DeRisi et al., 1997; Chu et al., 1998; Eisen et al., 1998; Iyer et al., 2001; Spellman et al., 1998; Herrero et al., 2001). However, hierarchical clustering algorithms have some problems, such as weak robustness, non-uniqueness and linear ordering, which make the resulting hierarchical relationship much harder to interpret (Lukashin and Fuchs, 2001). Besides hierarchical clustering algorithms, other algorithms have also been used, such as the self-organizing map (Tamayo et al., 1999), minimum spanning tree (Xu et al., 2002), binary tree clusters (Sultan et al., 2002), a co-clustering algorithm (Hanisch et al., 2002) and coupled two-way clustering (Getz et al., 2000). Other work on analysing gene data covers cluster structure optimization (Lukashin and Fuchs, 2001), missing value estimation (Troyanskaya et al., 2001), gene classification (Brown et al., 2000) and sample class discovery (Li et al., 2003). Gene clustering does not aim only to form and interpret clusters. In fact, the key objective is to use the found knowledge structure for the classification of novel genes.
However, little attention has been paid to the problem of representing the found knowledge structure effectively for enhancing the performance of inference. Template theory has been one of the most powerful theories in cognitive psychology (Gobet, 1998) and pattern recognition (Dasarathy, 1991, 1994; Kuncheva and Bezdek, 1998; Bezdek
Downloaded from http://bioinformatics.oxfordjournals.org/ at Pennsylvania State University on February 27, 2013
ABSTRACT
Motivation: It is understood that clustering genes is useful for exploring scientific knowledge from DNA microarray gene expression data. The explored knowledge can ultimately be used for annotating the biological function of novel genes. How efficiently the explored knowledge is represented is therefore closely related to the classification accuracy. However, this issue has not yet received the attention it deserves.
Result: A novel method based on template theory in cognitive psychology and pattern recognition is developed in this study for effectively representing knowledge extracted from cluster analysis. The basic principle is to represent knowledge according to the relationship between genes and a found cluster structure. Based on this novel knowledge representation method, a pattern recognition algorithm (the decision tree algorithm C4.5) is then used to construct a classifier for annotating biological functions of novel genes. Experiments on five published datasets show that this method improves the classification performance compared with the conventional method. Statistical tests indicate that this improvement is significant.
Availability: The software package can be obtained upon request from the author.
Contact:
[email protected] Z.R.Yang
SYSTEMS AND METHODS
Fuzzy c-means algorithm
The FCM algorithm is a powerful pattern recognition algorithm (Bezdek, 1981) and has been widely used in different areas, including the analysis of gene data. For instance, combined with binary hierarchical clustering, FCM has been used to partition genes so as to construct a discriminant function (Szeto et al., 2003). FCM has also been used to recognize potential transcription factor sites in genomic sequences (Pickert et al., 1998). FCM does not make any assumption about the data distribution, and is therefore better suited than the widely used model-based clustering algorithms (Fraley and Raftery, 1998; Yeung et al., 2001; McLachlan et al., 2002) when the data in each cluster do not follow a normal distribution.
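As an illustration of the clustering step, the following is a minimal sketch of the standard FCM iteration (alternating centre and membership updates) in plain Python. The fuzzifier m = 2, the random initialisation, the stopping tolerance and the toy data are assumptions made for this example and are not taken from the study.

```python
import random


def fcm(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Fuzzy c-means on a list of equal-length numeric tuples.

    Returns (centres, memberships); memberships[i][k] is the degree to
    which point i belongs to cluster k, and each row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, normalised so that each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centres = []
    for _ in range(iters):
        # Centre update: mean of the points weighted by membership ** m.
        centres = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centres.append([sum(w[i] * points[i][d] for i in range(n)) / total
                            for d in range(dim)])
        # Membership update from the squared distances to every centre.
        shift = 0.0
        for i in range(n):
            d2 = [sum((points[i][d] - centres[k][d]) ** 2 for d in range(dim))
                  for k in range(c)]
            if min(d2) == 0.0:
                # The point coincides with a centre: full membership there.
                new = [1.0 if v == 0.0 else 0.0 for v in d2]
            else:
                new = [v ** (-1.0 / (m - 1.0)) for v in d2]
            total = sum(new)
            new = [v / total for v in new]
            shift = max(shift, max(abs(a - b) for a, b in zip(new, u[i])))
            u[i] = new
        if shift < tol:
            break
    return centres, u


# Toy data (hypothetical): two well-separated groups of normalised profiles.
data = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
        (0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
centres, u = fcm(data, c=2)
# Harden the fuzzy partition by taking the cluster of largest membership.
labels = [max(range(2), key=lambda k: row[k]) for row in u]
```

Hardening the memberships by their maximum is one simple way to obtain the crisp cluster labels that the later classification steps rely on.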
Template theory
There are four popular theories in cognitive psychology, namely chunking theory, seek theory, long-term working memory theory and template theory (Gobet, 1998). Of these, template theory (Chase and Simon, 1973) has been regarded as the best of the four (Gobet, 1998). Under template theory, an inference process is based on knowledge abstracted from the cases studied. This means that after a learning process has been completed, all the real cases are thrown away, leaving abstracted knowledge that represents the most powerful discriminating factors for future inference. The most significant advantage of template theory is that possible noise in the originally collected data can be removed, and hence the inference process can be very efficient. The search for templates is not very difficult. We can define the centre
of each cluster as a template since it is an abstract of all the genes in the cluster. In pattern recognition, template theory has also been widely investigated for classification analysis (Hart, 1968; Dasarathy, 1991, 1994; Kuncheva and Bezdek, 1998; Bezdek et al., 1998). In these studies, template theory was also called prototype theory, where classification tasks were conducted using the nearest neighbour method.
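The idea of a template as a cluster centre, and of nearest-prototype classification, can be sketched as follows. This is a minimal illustration in plain Python, assuming crisp cluster labels are already available and using Euclidean distance; the function names and the toy profiles are hypothetical.

```python
import math


def find_templates(genes, labels):
    """Return one template per cluster: the mean profile of its members."""
    groups = {}
    for profile, label in zip(genes, labels):
        groups.setdefault(label, []).append(profile)
    return {label: [sum(col) / len(members) for col in zip(*members)]
            for label, members in groups.items()}


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_template(profile, templates):
    """Label of the template closest to the given profile."""
    return min(templates, key=lambda label: euclidean(profile, templates[label]))


# Toy data (hypothetical): four genes with two measurements each.
genes = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels = ["A", "A", "B", "B"]
templates = find_templates(genes, labels)
predicted = nearest_template((0.15, 0.25), templates)
```

Once the templates are computed, the original profiles are no longer needed for inference, which is the point made above about discarding the real cases.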
Decision tree algorithm, C4.5
Decision tree algorithms are a popular type of pattern recognition algorithm whose basic principle is divide and conquer. Each decision node in a tree divides a dataset into two parts. A partition happens only when it is needed, that is, when the dataset contains two classes of patterns and is therefore not pure, or not pure enough. In practice, an impurity measure such as entropy is commonly used. The most commonly used decision tree algorithms are ID3, C4.5, C5 (Quinlan, 1986) and CART (Breiman et al., 1984). C4.5 is a free software package (Quinlan, 1986). In C4.5, each variable (an experimental measurement in this study) is selected by maximizing its capability to separate the given data, and impurity is used to measure this capability. If a variable can separate two classes perfectly, the impurity measure will be zero. For a variable with an impurity measure of zero, the genes whose experimental measurement values are less than an optimally determined threshold belong to one class, whereas the genes whose values are greater than the threshold belong to the other class. The variable with the lowest impurity measure is selected first, as the root node of the decision tree. For instance, if experimental measurement α has a lower impurity measure than experimental measurement β, α will be selected as a decision node before β. After a C4.5 tree has been constructed, a decision is made in a progressive manner. The root passes the decision to the left sub-tree if the specified experimental measurement of a novel gene is less than the optimally determined threshold associated with the root node, and otherwise to the right sub-tree. Regarding the top node of each sub-tree as its root, the above process continues until a leaf with a class label attached is reached.
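The selection of a variable and its threshold by impurity can be illustrated with a short sketch. It follows the simplified description above (lowest weighted entropy after a binary split) rather than the full C4.5 package, which by default ranks splits by gain ratio and also handles pruning and missing values. The variable names and values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def best_threshold(values, labels):
    """Best binary split `value <= threshold` for a single variable.

    Returns (threshold, weighted_entropy); the threshold is None when
    all values are equal and no split exists.
    """
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


def select_variable(columns, labels):
    """Pick the variable whose best split has the lowest impurity."""
    scored = {name: best_threshold(vals, labels)
              for name, vals in columns.items()}
    name = min(scored, key=lambda k: scored[k][1])
    return name, scored[name]


# Toy data (hypothetical): alpha separates the classes, beta does not.
columns = {"alpha": [0.1, 0.2, 0.8, 0.9], "beta": [0.1, 0.9, 0.2, 0.8]}
classes = ["A", "A", "B", "B"]
root, (threshold, impurity) = select_variable(columns, classes)
```

Here alpha is chosen as the root because a single threshold on it leaves both sides pure, which is the zero-impurity case described in the text.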
When a leaf is reached, the novel gene is classified as a member of the class associated with the leaf. In this study, it is assumed that a single gene experimental measurement may not contain enough information for accurate classification. The use of templates is therefore proposed. Each gene will be represented according to its relation with the found templates. Technically, each gene is represented using the Mahalanobis distance between itself and the found templates. For a cluster structure with
et al., 1998). The basic principle is to discard the original data and use templates for inference. This study aims to use template theory to investigate a novel knowledge representation method for gene expression data. This method makes the representation of the found knowledge structure simpler and more robust. The basic principle is first to cluster gene expression data using the fuzzy c-means (FCM) algorithm (Bezdek, 1981) and then to search for templates. Finally, the classified genes are represented in terms of their relationship with the templates rather than the original experimental measurements. From this, a pattern recognition algorithm (the decision tree algorithm C4.5 in this study) is used to construct a classifier for future inference based on this novel knowledge representation method. The method has been validated on five published datasets. The t-test on the experiments shows that the classifiers constructed using templates perform significantly better than the classifiers constructed using the original experimental measurements of genes.
Table 1. Five data sets for simulation

| No. | Author, year | Publish details | Name | Experimental measurements | Genes |
|---|---|---|---|---|---|
| 1 | DeRisi et al. (1997) | Science | TUP1 | 13 | 6153 |
| 2 | DeRisi et al. (1997) | Science | YAP1 | 13 | 6153 |
| 3 | DeRisi et al. (1997) | Science | Complete | 55 | 6153 |
| 4 | Horton and Nakai (1996) | ISMB | Yeast | 8 | 1484 |
| 5 | Yeung et al. (2001) | Bioinformatics | Yeast | 17 | 6601 |
| ORF | Ch.1 | Ch.1bkg | Ch.2 | Ch.2bkg | Green | Red | [G–R] | G/R | R/G | G/R expt II | R/G expt II | Avg. G/R | Avg. R/G |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YHR007C | 8705 | 1880 | 13 546 | 1626 | 6825 | 11 920 | 5095 | 0.57 | 1.75 | 0.54 | 1.84 | 0.56 | 1.79 |
| YBR218C | 6549 | 1311 | 8321 | 1484 | 5238 | 6837 | 1599 | 0.77 | 1.31 | 0.74 | 1.35 | 0.75 | 1.33 |
| YAL051W | 6219 | 1866 | 4950 | 1502 | 4353 | 3448 | 905 | 1.26 | 0.79 | 0.92 | 1.09 | 1.09 | 0.94 |
| YAL053W | 4151 | 1256 | 7190 | 1447 | 2895 | 5743 | 2848 | 0.5 | 1.98 | 0.49 | 2.03 | 0.5 | 2.01 |
| YAL054C | 2824 | 1876 | 1996 | 1517 | 948 | 479 | 469 | 1.98 | 0.51 | 0.85 | 1.17 | 1.41 | 0.84 |
| YAL055W | 4355 | 1253 | 4294 | 1458 | 3102 | 2836 | 266 | 1.09 | 0.91 | 1.36 | 0.73 | 1.23 | 0.82 |
| YAL056W | 3924 | 1858 | 3217 | 1501 | 2066 | 1716 | 350 | 1.2 | 0.83 | 1.07 | 0.94 | 1.14 | 0.89 |
| YAL058W | 3792 | 1292 | 4418 | 1421 | 2500 | 2997 | 497 | 0.83 | 1.2 | 0.91 | 1.09 | 0.87 | 1.14 |
| YOL109W | 6320 | 1240 | 7933 | 1417 | 5080 | 6516 | 1436 | 0.78 | 1.28 | 0.88 | 1.14 | 0.83 | 1.21 |
| YAL065C | 4059 | 2270 | 9436 | 1874 | 1789 | 7562 | 5773 | 0.24 | 4.23 | 0.39 | 2.54 | 0.31 | 3.38 |
m clusters, each gene will be represented by a vector with m elements, each of which is the Mahalanobis distance to a specific template. Instead of inputting the original experimental measurements of genes into the C4.5 package, the Mahalanobis distances between genes and templates in a cluster structure are input to the C4.5 package to construct a classifier, referred to as a template classifier.
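The template representation can be sketched as follows. In general the Mahalanobis distance between a gene x and a template t is sqrt((x − t)ᵀ Σ⁻¹ (x − t)), where Σ is the covariance matrix of the cluster. The text does not state how the covariance was estimated, so this sketch assumes a diagonal covariance (per-measurement variances only), which reduces the distance to a standardised Euclidean distance. The function names and toy clusters are hypothetical.

```python
import math


def cluster_stats(members):
    """Template (mean) and per-measurement sample variance of one cluster.

    Requires at least two member genes.
    """
    n = len(members)
    columns = list(zip(*members))
    mean = [sum(col) / n for col in columns]
    var = [sum((v - mu) ** 2 for v in col) / (n - 1)
           for col, mu in zip(columns, mean)]
    return mean, var


def mahalanobis_diag(x, mean, var, eps=1e-12):
    """Mahalanobis distance assuming a diagonal covariance matrix.

    `eps` guards against division by a zero variance (an assumption of
    this sketch, not part of the definition).
    """
    return math.sqrt(sum((xi - mu) ** 2 / (v + eps)
                         for xi, mu, v in zip(x, mean, var)))


def template_features(x, stats):
    """Represent one gene by its distance to each of the m templates."""
    return [mahalanobis_diag(x, mean, var) for mean, var in stats]


# Toy data (hypothetical): two clusters of genes, two measurements each.
cluster_a = [(1.0, 2.0), (3.0, 2.5), (2.0, 1.5)]
cluster_b = [(8.0, 9.0), (9.0, 7.0), (10.0, 8.0)]
stats = [cluster_stats(cluster_a), cluster_stats(cluster_b)]
features = template_features((2.0, 2.0), stats)
```

The resulting vector has one element per cluster, so the input dimension of the classifier becomes m regardless of how many experimental measurements the dataset has.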
Algorithm
Step 1. Cluster genes using the fuzzy c-means algorithm.
Step 2. Classify genes into different clusters.
Step 3. Find a template for each cluster.
Step 4. Calculate the Mahalanobis distance (Duda and Hart, 2002) between each gene and each template.
Step 5. Represent each gene in terms of its Mahalanobis distances to all the templates.
Step 6. Apply C4.5 to the genes represented by their Mahalanobis distances to all the templates.
Step 7. Use the constructed classifier for annotating biological functions of novel genes.
Data
Table 1 shows a list of the datasets used for this study. Datasets 1–3 were collected from DeRisi et al. (1997). Dataset 4 was collected from Horton and Nakai (1996). Dataset 5 was collected from Yeung et al. (2001). Details of these five datasets are listed in Table 1. Table 2 shows a few lines of the TUP1 dataset. All 13 experimental measurements are listed in this table.
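The relations between the columns of Table 2 are not stated in the text. Judging from the listed values, the derived columns appear to be background-corrected channel intensities and their ratios. The short sketch below encodes that inferred relation, which should be read as an assumption, and reproduces the first row (YHR007C).

```python
def derived_columns(ch1, ch1_bkg, ch2, ch2_bkg):
    """Derived measurements from the two raw channels and their backgrounds.

    Assumed relations (inferred from the values in Table 2):
    Green = Ch.1 - Ch.1bkg, Red = Ch.2 - Ch.2bkg, [G-R] = |Green - Red|.
    """
    green = ch1 - ch1_bkg
    red = ch2 - ch2_bkg
    return {
        "Green": green,
        "Red": red,
        "[G-R]": abs(green - red),
        "G/R": round(green / red, 2),
        "R/G": round(red / green, 2),
    }


# Raw values of the first row of Table 2 (ORF YHR007C).
row = derived_columns(8705, 1880, 13546, 1626)
```

The two ratios are reciprocal views of the same comparison between the two experimental conditions, as described in the Introduction.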
DISCUSSION
For all five datasets, cluster analysis was conducted first. After clustering, each gene was classified and one template was found for each cluster. Each gene was then represented by the Mahalanobis distance between it and each template. For comparison, each gene was also represented using the original experimental measurements with a label assigned after clustering. Genes were divided into two sets, one for constructing a C4.5 classifier and the other for testing. A C4.5 classifier constructed using the original experimental measurements to represent genes was called a non-template classifier, whereas one constructed using the Mahalanobis distances was called a template classifier. The two methods were compared while varying the number of clusters from 2 to 10. After a classifier was constructed, the testing data were input to the classifier for model assessment. Figure 1 shows the classification accuracy on the testing data when varying the number
Table 2. A few lines of TUP1 dataset
Fig. 1. The classification accuracy on DeRisi’s TUP1 dataset.
of clusters from 2 to 10 for DeRisi’s TUP1 dataset. It can be seen that the template classifiers performed better than the non-template classifiers. Note that this difference increased as the number of clusters increased. Figure 2 shows a C4.5 tree constructed for DeRisi’s TUP1 dataset with two clusters, where the original gene experimental measurements were used to represent the genes. It can be seen that 21 nodes were used for branching and 22 terminals were used for decision-making, which is a very complicated decision-making system. Each inequality was used for passing the decision from one node to another based on the value of the experimental measurement used by the current node. For instance, if the value of the experimental measurement ‘X3’ of a gene was ≤0.14 (normalized value), the decision was passed to the experimental measurement ‘X5’, otherwise to ‘X6’. Note that ‘A’ and ‘B’ denote the two clusters, and ‘X3’, ‘X5’ and ‘X6’ are the three experimental measurements ‘Ch.2’, ‘Green’ and ‘Red’ (Table 2). Figure 3 shows a C4.5 decision tree constructed using template theory for DeRisi’s TUP1 gene data with two clusters. Each inequality was used to pass the decision from one node to another based on the Mahalanobis distance to the template used by the current node. For instance, if the Mahalanobis distance between a gene and the second template ‘T2’ was ≤0.5, the decision was passed to the first terminal, where the gene was classified as a member of the first cluster, denoted by ‘A’ (the other was denoted by ‘B’). It can be seen that only one node was used for branching and two terminals were used for decision-making. The C4.5 classifier constructed using template theory was thus much simpler than the one constructed using the original experimental measurements. It should be noted that the templates found through cluster analysis are used only to represent the knowledge discovered.
To classify a novel gene into clusters, a classifier must be constructed using this knowledge. The classification of a novel gene will be based on the generalization of the knowledge about the gene, i.e. the relationship between the novel gene and the templates. In this study, C4.5 decision trees are the classifiers constructed for classification
and prediction. In other words, C4.5 decision trees are a way to organize the discovered knowledge efficiently for the classification and prediction of novel genes. Figure 4 shows the histograms of the first four gene experimental measurements for DeRisi’s TUP1 gene data, where the horizontal and the vertical axes represent the experimental measurement intervals and the hits, respectively. The open and filled bars represent the genes in the first and the second clusters, respectively. It can be seen that none of the measurements demonstrated a good capability of discrimination. Because of this, the C4.5 classifier structure was very complicated, with 21 nodes and 22 terminals (Fig. 2). In comparison, the histograms of the Mahalanobis distances are shown in Figure 5, constructed in the same way as those in Figure 4. The open and the filled bars represent the two clusters of genes. It can be seen that the two classes of genes were well separated using either template. This is the reason that the template C4.5 classifier had a much simpler structure. For instance, template 2 was used in Figure 3 for constructing a template classifier. Figure 6 shows the template values as well as their standard deviations for the two clusters in DeRisi’s TUP1 dataset. All the gene experimental measurement values were normalized prior to using the FCM clustering algorithm. It can be seen that experimental measurements 2, 4, 7, 8, 9, 10, 11, 12 and 13 were not sensitive to the cluster structure. Meanwhile, experimental measurements 1, 3, 5 and 6 were well separated. Figures 7–10 show the classification accuracy on the testing datasets for DeRisi’s YAP1 dataset, DeRisi’s complete dataset, Horton and Nakai’s dataset and Yeung’s dataset. It can be seen that in all these cases, the performance of the template classifiers was superior to that of the non-template classifiers. Figure 11 shows the P-values of the paired t-tests on the five datasets.
In each test, the two paired groups contained the measured classification accuracies of the non-template classifiers and of the template classifiers. The null hypothesis was that the template classifiers do not perform better than the non-template ones. If the P-value was high, we would not be able to reject the null hypothesis. For instance, the P-value was below 0.05 for DeRisi’s TUP1 dataset, hence the null hypothesis was rejected and the template classifiers were judged superior to the non-template ones. Among the tests, the P-value of the t-test on DeRisi’s complete dataset was only 0.003106. This means that, if the null hypothesis were true, a difference at least this large would be observed with a probability of only about 0.3%, and hence the null hypothesis is strongly rejected. Although the largest P-value, which occurred for DeRisi’s TUP1 dataset, was 0.047539, the null hypothesis is still rejected at the 5% level.
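The paired t statistic behind such a test can be computed as in the following sketch. It assumes that the accuracies are paired by the number of clusters (2 to 10), which the text does not state explicitly, and the accuracy values are hypothetical placeholders rather than the results of this study. The P-value would then be read from the t distribution with n − 1 degrees of freedom.

```python
import math


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for the differences b - a.

    A large positive t supports the one-sided alternative mean(b - a) > 0.
    """
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1


# Hypothetical accuracies (%) for 2 to 10 clusters; NOT the paper's results.
non_template = [95.0, 92.0, 90.0, 88.0, 85.0, 84.0, 82.0, 80.0, 78.0]
template = [97.0, 95.0, 94.0, 91.0, 90.0, 88.0, 87.0, 86.0, 85.0]
t_stat, dof = paired_t(non_template, template)

# One-sided 5% critical value of the t distribution with 8 degrees of freedom.
T_CRITICAL_8DF = 1.860
reject_null = t_stat > T_CRITICAL_8DF
```

Pairing the two classifiers on the same cluster counts removes the variation that is common to both, so the test looks only at the per-setting difference in accuracy.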
Fig. 3. The C4.5 tree constructed for DeRisi’s TUP1 gene data with two clusters using the template theory. Note ‘A’ and ‘B’ denoted two clusters, respectively.
From the above discussion, we can see that the use of template theory can improve the classification accuracy significantly compared with the use of the original experimental measurement values for inference. It is therefore reasonable to suggest using the Mahalanobis distance for representing genes when constructing gene predictors. This is because the experiments in this study have shown that using the found templates to represent genes removes possible noise and enhances the inference efficiency. It has been noted that the difference in performance varies with datasets. This is because the distribution of gene experimental measurements is determined by many factors. For instance, experimental measurements from different sets of genes cannot be expected to show similar distributions. Therefore, the correlation pattern among the experimental measurements varies. Non-template classifiers constructed using C4.5 are based on the assumption that there is very weak correlation among the variables (experimental measurements), whereas template classifiers are constructed using templates that are more likely to be independent after cluster analysis.
Fig. 2. The C4.5 tree constructed for DeRisi’s TUP1 gene data with two clusters using the original experimental measurements to represent the genes. Note ‘A’ and ‘B’ denoted two clusters, respectively. ‘X3’, ‘X5’ and ‘X6’ are three experimental measurements ‘Ch.2’, ‘Green’ and ‘Red’.
Fig. 4. The histograms of the first four gene experimental measurements for DeRisi’s TUP1 gene data. [Panels: Ch.1, Ch.1.bkg, Ch.2 and Ch.2.bkg; horizontal axis: measurement intervals (0 to 1); vertical axis: hits.]
Fig. 5. The histograms of the Mahalanobis distances. [Panels: Mahalanobis distance with template 1 and with template 2; horizontal axis: distance intervals (0 to 1); vertical axis: hits; open bars: genes of cluster 1; filled bars: genes of cluster 2.]
Fig. 6. The template values and their standard deviations for the DeRisi’s TUP1 dataset.
Fig. 7. The classification accuracy on DeRisi’s YAP1 dataset.
This is why the template C4.5 classifiers never perform worse than the non-template C4.5 classifiers among these five experiments.
ACKNOWLEDGEMENTS The author would like to thank Dr R. Quinlan for the free C4.5 package and the reviewers for their valuable comments.
Fig. 8. The classification accuracy on DeRisi’s Complete dataset. It can be seen that the template classifiers performed much better than the non-template classifiers. Note that the difference was 20% when the number of clusters was 10.
Fig. 9. The classification accuracy on Horton and Nakai’s yeast dataset.
[Plot axes: accuracy, 50 to 100, against number of clusters, 2 to 10; legend: non-template, template.]
Fig. 10. The classification accuracy on Yeung’s dataset.
Fig. 11. The paired t-test on five datasets. [Bar chart: P-values (0 to 0.05) for the Yeung, UIC, Complete, YAP1 and TUP1 datasets.]

REFERENCES
Bezdek,J. (1981) Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.
Bezdek,J., Thomas,R., Reichherzer,R., Lim,G.S. and Attikiouzel,Y. (1998) Multiple-prototype classifier design. IEEE Trans. Syst. Man Cybernet., 28, 354–359.
Breiman,L., Friedman,J.H., Olshen,R.A. and Stone,C.J. (1984) Classification and Regression Trees. Chapman & Hall/CRC.
Brown,M.P.S., Grundy,W.N., Lin,D., Cristianini,N., Sugnet,C.W., Furey,T.S.,Jr, Ares,M. and Haussler,D. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl Acad. Sci., USA, 97, 262–267.
Chase,W.G. and Simon,H.A. (1973) Perception in chess. Cogn. Psychol., 4, 55–81.
Chu,S., DeRisi,J.L., Eisen,J., Mulholland,D., Botstein,D., Brown,P.O. and Herskowitz,I. (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 699–705.
Dasarathy,B.V. (1991) Nearest Neighbour (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, CA.
Dasarathy,B.V. (1994) Minimal consistent set (MCS) identification for optimal nearest neighbour decision systems design. IEEE Trans. Syst. Man Cybernet., 24, 511–517.
DeRisi,J.L., Iyer,V.R. and Brown,P.O. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278, 680–686.
Duda,R.O. and Hart,P.E. (2002) Pattern Classification and Scene Analysis. Wiley, New York.
Eisen,M., Spellman,P., Brown,P. and Botstein,D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci., USA, 95, 14863–14868.
Fraley,C. and Raftery,A.E. (1998) How many clusters? Which clustering method? Answer via model-based cluster analysis. Comput. J., 41, 578–587.
Getz,G., Levine,E. and Domany,E. (2000) Coupled two-way clustering analysis of gene microarray data. Proc. Natl Acad. Sci., USA, 97, 12079–12084.
Gobet,F. (1998) Expert memory: a comparison of four theories. Cognition, 66, 115–152.
Hanisch,D., Zien,A., Zimmer,R. and Lengauer,T. (2002) Co-clustering of biological networks and gene expression data. Bioinformatics, 18, S145–S154.
Hart,P.E. (1968) The condensed nearest neighbour rule. IEEE Trans. Inform. Theory, 14, 515–516.
Herrero,J., Valencia,A. and Dopazo,J. (2001) A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17, 126–136.
Horton,P. and Nakai,K. (1996) A probabilistic classification system for predicting the cellular localization sites of proteins. Proc. Intell. Syst. Mol. Biol., 4, 109–115.
Iyer,V.R., Horak,C.E., Scafe,C.S., Botstein,D., Snyder,M. and Brown,P.O. (2001) Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature, 409, 533–538.
Kuncheva,L.I. and Bezdek,J. (1998) Nearest prototype classification: clustering, genetic algorithms, or random search? IEEE Trans. Syst. Man Cybernet., 28, 160–164.
Li,W., Fan,M. and Xiong,M. (2003) SamCluster: an integrated scheme for automatic discovery of sample classes using gene expression profile. Bioinformatics, 19, 811–817.
Lockhart,D.J., Dong,H., Byrne,M.C., Follettie,M.T., Gallo,M.V., Chee,M.S., Mittmann,M., Wang,C., Kobayashi,M., Horton,H. and Brown,E.L. (1996) Expression monitoring by hybridisation to high density oligonucleotide arrays. Nat. Biotechnol., 14, 1675–1680.
Lukashin,A.V. and Fuchs,R. (2001) Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics, 17, 405–414.
McLachlan,G.J., Bean,R.W. and Peel,D. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413–422.
Pickert,L., Reuter,I., Klawonn,F. and Wingender,E. (1998) Transcription regulatory region analysis using signal detection and fuzzy clustering. Bioinformatics, 14, 244–251.
Quinlan,J.R. (1988) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
Schena,M., Shalon,D., Davis,R.W. and Brown,P.O. (1995) Quantitative monitoring of gene expression patterns with a DNA microarray. Science, 270, 467–470.
Shalon,D., Smith,S.J. and Brown,P.O. (1996) A DNA microarray system for analysing complex DNA samples using two-colour fluorescent probe hybridisation. Genome Res., 6, 639–645.
Spellman,P.T., Sherlock,G., Zhang,M.Q., Iyer,V.R., Anders,K., Eisen,M.B., Brown,P.O., Botstein,D. and Futcher,B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297.
Sultan,M., Wigle,D.A., Cumbaa,C.A., Maziarz,M., Glasgow,J., Tsao,M.S. and Jurisica,I. (2002) Binary tree-structured vector quantization approach to clustering and visualizing microarray data. Bioinformatics, 18, S111–S119.
Szeto,L.K., Liew,A.W., Yang,H. and Tang,S. (2003) Gene expression data clustering and visualization based on a binary hierarchical clustering framework. J. Vis. Lang. Comput., 14, 341–362.
Tamayo,P., Slonim,D., Mesirov,J., Zhu,Q., Kitareewan,S., Dmitrovsky,E., Lander,E.S. and Golub,T.R. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci., USA, 96, 2907–2912.
Troyanskaya,O., Cantor,M., Sherlock,G., Brown,P.O., Hastie,T., Tibshirani,R., Botstein,D. and Altman,R.B. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525.
Xu,Y., Olman,V. and Xu,D. (2002) Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics, 18, 536–545.
Yeung,K.Y., Fraley,C., Murua,A., Raftery,A.E. and Ruzzo,W.L. (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics, 17, 977–987.