GERC: Tree Based Clustering for Gene Expression Data H. A. Ahmed, P. Mahanta Deptt. of Comp. Sc. and Engg. Tezpur University Napalm -784028, India Email:
[email protected],
[email protected] D. K. Bhattacharyya Deptt. of Comp. Sc. and Engg. Tezpur University Napalm -784028, India Email:
[email protected] Abstract—Measurement of gene expression using DNA microarrays have revolutionized biological and medical research. This paper presents a divisive clustering algorithm that produces a tree of genes called GERC1 tree along with the generated clusters. Unlike a dendrogram, a GERC tree is a general tree and it is an ample resource for biological information about the genes in a dataset. The leaves of the tree represent the desired clusters. The clustering method was tested with several real-life datasets and the proposed method has been found satisfactory. Keywords-Hierarchical Clustering; Recursive Clustering; Gene Expression Data; Mean Squared Residue
I. I NTRODUCTION DNA microarray technology has proved to be a fundamental tool in studying gene expression data. Due to the large number of genes and complex gene regulation networks, clustering is a useful exploratory technique for analyzing such data. It divides data of interest into a small number of relatively homogeneous groups or clusters where the intra group object similarity is minimized and the inter group object dissimilarity is maximized. In this paper we present a polythetic divisive clustering algorithm that operates in two distinct steps. The first step generates a number of initial clusters and in the second step, these initial clusters are further processed to form a set of finer clusters. The strengths of the algorithm include the following (a) Extraction of initial clusters is fast and independent of traditional proximity measures. (b) During discovery of final clusters, a proximity measure called MRS is used to find mutual similarity among genes within a particular initial cluster. (c) The algorithm is capable of tuning the threshold to be used to decompose a node to its child nodes itself. (d) The algorithm stores the tree structure which can be later used in other applications. (e) The algorithm allows overlapping of genes among child nodes of a particular node. The rest of the paper is organized as follows. In Section II, we discuss related work. Section III presents the algorithm. Experimental results are reported in Section IV. Finally, conclusion and future work are presented in Section V. 1 GERC
stands for Gene Expression Recursive Clustering
Jugal K. Kalita Deptt. of Computer Science University of Colorado Colorado Springs, USA Email:
[email protected] II. R ELATED W ORK A large number of clustering techniques have been reported for analyzing gene expression data, such as Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [1], Self Organizing Tree Algorithm (SOTA) [2], Divisive Correlation Clustering Algorithm (DCCA) [3], Density-Based Hierarchical clustering method (DHC) [4] and Dynamically Growing Self-Organizing Tree (DGSOT) [5]. A comparison of these algorithms is presented in the Table I. Based on our survey, we observe that most hierarchical algorithms focus on the final clusters. Biologists are not only interested in the clusters of genes, but also in the relationships among the clusters and their sub-clusters, and the relationship among the genes within a cluster [6]. A clustering algorithm, which also provides some graphical representation of the cluster structure is favored by biologists. To address this issue, this paper presents a hierarchical clustering algorithm that generates a tree along with the set of clusters. Table I COMPARISON OF EXISTING HIERARCHICAL TECHNIQUES Technique UPGMA
Approach Hierarchical
DGSOT
Hierarchical, Model based
SOTA
Hierarchical, Model-based
Any
DHC
Density based, Hierarchical
Pearson’s correlation
Graph based Hierarchical
Pearson’s correlation
DCCA
Proximity Measure Euclidean / pearson’s correlation Any
Input Parameters cut-off for the dendrogram Heterogeneity threshold, Cluster separation threshold Heterogeneity threshold, Distance value threshold radius, similarity thresh-old, minimum no .of object None
III. T HE P ROPOSED A LGORITHM GERC is a polythetic divisive hierarchical clustering algorithm that operates in two distinct steps. The algorithm is dependent on four parameters, viz., reference gene, step down ratio, preferred node volume in a cluster and an MRS threshold. It operates on any high dimensional numeric
domain to produce a set of clusters. The last three parameters can be statistically computed from the expression data. A. Data Preprocessing The various preprocessing tasks we carry out are presented in this section. 1) Handling missing values: We use the Local Least Squares Imputation method [7] to compute missing values in the datasets. 2) Normalization: The datasets are normalized using a common statistical method called Z score normalization [6] or Mean 0 Standard Deviation 1 normalization. 3) Discretization: The normalized matrix G is converted to the discretized matrix Gd in the following manner. 0, if G(i, j) = G(i, j + 1) d if G(i, j) < G(i, j + 1) G (i, j) = 1, −1, if G(i, j) > G(i, j + 1)
where, G(i,j) is the value of ith row and jth column of G. B. Proximity Measure: Mean Residue Similarity
MRS is a Mean Squared Residue [8] based measure that works satisfactorily to detect coherence of constant valued genes, constant row genes, constant column genes and additive genes. The need for detecting these different types of genes in clustering of gene expression data has been discussed in [9]. But unlike Mean Squared Residue, Mean Residue Similarity can operate in a mutual mode, i.e., it can detect mutual relationship between a pair of genes. MRS of a gene g1 =(a1 , a2 ,..., an ) with respect to another gene g2 =(b1 , b2 ,..., bn ) is defined by % M RSλ (g1 , g2 ) = |ai − amean − bi + bmean | (1) i∈λ
where amean is the mean of all the elements of gene g1 , bmean is the mean of all the elements of gene g2 and λ is the set of selected conditions. Following definitions and theorems provide the theoretical basis and soundness of the MRS measure. Definition 1: Expression pattern of a gene is defined as the discretised form of the gene. Two genes are said to have similar expression patterns if their discretized values are the same. Mathematically, two genes gi and gj have similar expression pattern if Gd (i, k) = Gd (j, k), for k = 1, 2, 3..., n
where n is the total number of conditions Definition 2: For two genes gi =(a1 , a2 ,...., an ) and gj =(b1 , b2 ,..., bn ) if a1 =a2 =...=an =b1 =b2 =...=bn , the genes are called constant valued genes. Definition 3: For two genes gi =(a1 , a2 ,...., an ) and gj =(b1 , b2 ,..., bn ) if a1 =a2 =...=an and b1 =b2 =...=bn , the genes are called constant row genes.
Definition 4: For two genes gi =(a1 , a2 ,..., an ) and gj =(b1 , b2 ,..., bn ) if a1 =b1 , a2 =b2 ,..., an =bn , the genes are called constant column genes. Definition 5: For two genes gi =(a1 , a2 ,...,an ) and gj =(b1 , b2 ,...,bn ) if b1 =a1 +d, b2 =a2 +d,..., bn =an +d, where d is an additive constant, the genes are called additive genes. Theorem 1. MRS of two constant valued genes is zero. Proof: Let the two genes be g1 =(a1 , a2 ,..., an ) and g2 =(b1 , b2 ,..., bn ). Since the two genes are constant valued, a1 =a2 =...=an =b1 =b2 =...=bn =x (say). Now mean of the two genes &n will be, amean =bmean =x. MRS{1,2,...,n} (g1 ,g2 ) = &i=1 |ai -amean -bi +bmean | n =&i=1 |ai -bi +amean -bmean | n = i=1 |x-x+x-x| = 0 Theorem 2. MRS of two constant row genes is zero. Proof: Let the two genes be g1 =(a1 , a2 ,..., an ) and g2 =(b1 , b2 ,..., bn ). Since these are constant row genes, a1 =a2 =a3 =...=an =x (say) and b1 =b2 =b3 =...=bn =y (say). Now mean of the first gene & n amean =1/n i=1 (x)=1/n(n×x)=x. Similarly mean of the second &n gene bmean =y. MRS{1,2,...,n} (g1 ,g2 ) = &i=1 |ai -amean -bi +bmean | n =&i=1 |(ai -amean )-(bi -bmean )| n =&i=1 |(x-x)-(y-y)| n = i=1 |0-0|= 0 Theorem 3. MRS of two constant column genes is zero. Proof: Let the two genes be g1 =(a1 , a2 ,..., an ) and g2 =(b1 , b2 ,..., bn ). Since these are constant column genes, a1 =b1 , a2 =b2 ,..., an =bn . Now mean of two genes will be amean =bmean =m (say). & n MRS{1,2,...,n} (g1 ,g2 ) = &i=1 |ai -amean -bi +bmean | n =&i=1 |(ai -amean )-(bi -bmean )| n =&i=1 |(ai -m)-(bi -m)| n =&i=1 |(ai -bi )-(m-m)| n = i=1 |0| = 0 since ai =bi Theorem 4. MRS of two additive genes is zero. Proof: Let the two genes be g1 =(a1 , a2 ,..., an ) and g2 =(b1 , b2 ,..., bn ). Since the genes are additive genes, bi =ai +d, for i=1,2,3,...,n. Here d is an&additive constant. Now mean of n the first gene amean =1/n i=1 (ai )=m (say). Mean the second & gene &nof & n n bmean =1/n i=1 (bi )=1/n (a +d)=d+1/n (a )=m+d. i=1 i &n i=1 i MRS{1,2,...,n} (g1 ,g2 ) = &i=1 |ai -amean -bi +bmean | n =&i=1 |(ai -bi )-(amean -bmean )| n =&i=1 |(ai -ai -d)-(m-m-d)| n = i=1 |d-d| = 0
C. Proposed Algorithm: GERC
Following definitions, symbols and notations are used to describe the GERC algorithm. Definition 6: Two genes gi and gj are said to be neighbours of each other with respect to a threshold β if dist(gi ,gj ) ≤ β.
Table II SYMBOLIC REPRESENTATIONS SYMBOL Nc Rg gi T C η(i) Nη dist(gi ,gj ) α δ neighbourβ (gi ) Exp(gi ) father(η(j)) degreeβ (gi ) η(j).threshold η(0)
MEANING Total number of condtions Reference gene ith gene Preferred Node Volume, minimum no of genes in a cluster initial cluster ith node Number of tree nodes MRS between ith andjth genes as in the distance matrix MRS threshold Step down ratio Neighbours of ith gene with respect to MRS threshold β Expression pattern of ith gene presented by ith row of the discretized matrix father of jth node Degree of ith gene with respect to MRS threshold β MRS threshold of jth tree node The root node of the GERC tree containing all the genes
Definition 7: Degree of a gene gj with respect to threshold β is defined as the number of genes gk that are within β neighbourhood of gj . Definition 8: An initial cluster is defined as a set S of all genes G such that for any two genes gi , gj ∈G, Exp(gi ) = Exp(gi ) in terms of Nc numbers of conditions. Definition 9: A fine cluster is defined as a set of genes S such that for any two genes gi , gj such that i Exp(gi ) = Exp(gi ) in terms of Nc numbers of conditions, and ii dist(gi ,gi )≤α. In the first step of GERC, all genes which have expression pattern similar to the reference gene in terms of N2c +1 number of conditions are placed in an initial cluster. In the second step of the algorithm, this initial cluster is processed to produce finer clusters. While processing an initial cluster, all the genes are put in the root node. Then iterative clustering is performed using a tuned MRS threshold on genes of each node to produce its child nodes. The process is repeated till all nodes in the tree contain a number of genes greater than user specified preferred node volume. The leaves of the tree represent the generated clusters. The algorithm can be used to produce a wide range of clusters by considering more than one reference genes. If we want to explore the entire dataset we can use each of the available genes or a set of stochastically selected genes as reference genes. The algorithm is presented next. STEP 1: Place Rg in C Add all genes gk to C such that Exp( gk )= Exp( Rg ) in terms of N2c +1 number of conditions STEP 2: Place all the genes gk such that gk ∈ C in η(1) Initialize Nη to 1
η(1).threshold= α/ δ for each η(j), j=1,2,..., Nη , if number of genes in η(j)≥T do Compute distance matrix among all genes gk s such that gk ∈ η(j) using MRS Compute degreeη(j).threshold×δ ( gk ) and neighbourη(j).threshold×δ ( gk ) for each gk ∈ η(j) Place all the genes gk ∈ η(j) in L while L is not empty do Find gk ∈ L such that degreeα (gk ) is highest Increment Nη Create a new node η(Nη ) Add neighbourη(j).threshold×δ ( gm ) and gm to ηi (Nη ) η(Nη ).threshold= η(j).threshold×δ father( η(Nη ))= η(j) Remove gm and neighbourη(j).threshold×δ (gm ) from L end while if η(Nη ) is the only child of father( η(Nη )) then Decrement Nη Bicols ← Bicols ∪ { p} η(j).threshold= η(j).threshold×δ Decrement j end if end for Make η(0) father of η(1) i.e. father( η(1))= η(0) IV. E XPERIMENTAL R ESULT We implemented the GERC algorithm in MATLAB and tested it on the three publicly available benchmark microarray datasets mentioned in Table III. Figure 1 and Figure 2 present a part of the tree generated from an initial cluster for Dataset2 and Dataset3 respectively. Table III DATASET USED IN THE PAPER Dataset
Name
Dataset1
Subset of Yeast Cell Cycle Yeast Diauxic Shift Yeast Cell cycle
Dataset2 Dataset3
Source http://faculty.washington/ .edukayee/cluster http://www.ncbi.nlm.nih/ .govgeo/query Sample input files in expander
Instances/ Attributes 384/17 6089/7 698/72
A. Cluster Quality The GERC algorithm was compared with various clustering algorithms and the results were validated using average homogeneity score [10] and p values. The homogeneity values for the GERC algorithm and some other algorithms are reported in Table IV. We observe that the homogeneity value for GERC is highest, from which we can conclude that the coherence of the clusters produced by GERC is better than that produced by competing algorithms. To compute pvalue, we used a web based tool called FuncAssociate [11].
ACKNOWLEDGMENT This paper is an outcome of a research project supported by (1) DST, Govt. of India in collaboration with ISI, Kolkata and (2) National Science Foundation, USA under grants CNS-095876 and CNS-085173. R EFERENCES [1] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proc. Natl. Acad. Sci. U.S.A., vol. 95, pp. 14 863– 14 868, 1998.
Figure 1.
Tree generated from an initial cluster for Dataset2
[2] J. Dopazo and J. Carazo, “Phylogenetic reconstruction using an unsupervised neural network that adopts the topology of a phylogenetic tree,” Journal of Molecular Evolution, vol. 44, pp. 226–233, 2002. [3] A. Bhattacharya and R. K. De, “Divisive correlation clustering algorithm (dcca) for grouping of genes,” Bioinformatics, vol. 24, pp. 1359–1366, June 2008. [4] D. Jiang, J. Pei, and A. Zhang, “Dhc:a density-based hierarchical clustering method for time series gene expression data,” Bioinformatic and Bioengineering, IEEE International Symposium on, vol. 0, p. 393, 2003. [5] F. Luo, L. Khan, F. Bastani, I.-L. Yen, and J. Zhou, “A dynamically growing self-organizing tree (dgsot) for hierarchical clustering gene expression profiles,” Bioinformatics, vol. 20, pp. 2605–2617, November 2004.
Figure 2.
Tree generated from an initial cluster for Dataset3
The clusters obtained by GERC on Dataset1 were found to contain genes associated with DNA metabolic process, DNA replication, nuclear chromosome part, chromosomal part, cell cycle process and cell cycle with p-values of 1.67 × 10−11 , 2.34 × 10−12 , 1.35 × 10−9 , 1.98 × 10−12 , 3.68× 10−8 and 1.12 × 10−10 respectively, being some of the highly enriched GO categories.
Dataset3
Method Applied K-means[12] SOM[13] CLICK[10] DCCA[3] GERC K-means SOM CLICK[10] DCCA[3] GERC
No. of. Clusters 16 16 3 15 10 5 6 5 43 11
[7] H. Kim, G. H. Golub, and H. Park, “Missing value estimation for dna microarray gene expression data: local least squares imputation,” Bioinformatics, vol. 21, pp. 187–198, 2005. [8] Y. Cheng and G. M. Church, “Biclustering of expression data,” in Proc. of the 8th ISMB. AAAI Press, 2000, pp. 93–103. [9] S. C. Madeira and A. L. Oliveira, “Biclustering algorithms for biological data analysis: A survey,” IEEE/ACM Trans. Comput. Biol. Bioinformatics, vol. 1, pp. 24–45, 2004.
Table IV C OMPARISON OF HOMOGENEITY VALUES Datasets Dataset1
[6] S. Sharma, R. Das, and D. K. Bhattacharyya, “An effective density based hierarchical clustering technique to identify coherent patterns from gene expression data,” Proc of PAKDD 2011, Advances in Knowledge Discovery and Data Mining,, vol. 6634/2011, pp. 225–236, 2009.
Homogeneity 0.671 0.710 0.549 0.818 0.878 0.577 0.514 0.501 0.699 0.8053
[10] R. Sharan and R. Shamir, “Click: A clustering algorithm with applications to gene expression analysis,” in ISMB’00. AAAI Press, Menlo Park, CA, pp. 307–316. [11] G. F. Berriz, O. D. King, B. Bryant, C. Sander, and F. P. Roth, “Characterizing gene sets with FuncAssociate.” Bioinformatics (Oxford, England), vol. 19, pp. 2502–2504, 2003.
V. C ONCLUSION AND F UTURE W ORK
[12] J. A. Hartigan and M. A. Wong, “A k-means clustering algorithm,” JSTOR: Applied Statistics, vol. 28, pp. 100–108, 1979.
The complexity of our approach and quality of the clusters can be improved by using a more efficient and effective approach to tune the threshold for subsequent levels of the GERC tree. The values of α, T and δ can also be calculated statistically from the gene expression data.
[13] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub, “Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation,” Proc. Natl. Acad. Sci. U.S.A., vol. 96, pp. 2907–2912, 1999.