Biclustering and Feature Selection Techniques in Bioinformatics

Bhavik Desai¹, Pankaj Andhale¹, Manjeet Rege¹, and Qi Yu²

¹ Department of Computer Science, Rochester Institute of Technology, Rochester, NY, USA
{bxd9449,pma7893,mr}@cs.rit.edu
² Department of Information Sciences and Technology, Rochester Institute of Technology, Rochester, NY, USA
[email protected]

Abstract. This paper describes several data mining techniques developed to solve problems faced by biologists in bioinformatics. Several biclustering algorithms, which cluster along both dimensions of a data matrix simultaneously, are described. The paper also covers feature selection methods, which help reduce noise and improve the performance of classification models.

Keywords: biclustering, data transform, feature selection, filter methods.
1 Introduction
Biology is among the most important sciences, because understanding the various biological processes in living organisms provides the reasoning behind several, possibly all, diseases. Tremendous amounts of data are constantly generated by the biological experiments that are part of genomics research. Biologists needed techniques to interpret these data, which led to the emergence of bioinformatics.
2 Biclustering Algorithms
Conventional clustering algorithms can cluster either genes with respect to a set of conditions or conditions with respect to a set of genes, but they cannot cluster genes and conditions simultaneously. Biclustering algorithms, in contrast, cluster along both dimensions, rows and columns, at the same time. When applying biclustering to gene expression data, the relevant biological facts are: a cellular process involves only a small number of active (participating) genes; such a process becomes active only under a subset of conditions; and a gene can participate in several cellular processes simultaneously. To meet these criteria, a cluster of genes (or conditions) should be defined with respect to a subset of conditions (or genes), the clusters formed should be neither exhaustive nor exclusive, and a gene or condition should be able to belong to several clusters or to none. The biclustering problem is further complicated by the noise present in gene expression data generated from various experiments.
2.1 Biclustering Algorithm
Problem Definition. While explaining the various biclustering algorithms, we use the definitions given in this section. The gene expression data is represented as an $m \times n$ matrix, where $m$ is the number of genes, $n$ is the number of conditions, and $a_{ij}$ is a real value representing the expression level of gene $i$ under condition $j$.
Fig. 1. Gene expression data matrix
In the given matrix, a row cluster is defined as a subset of rows ($I$) that show similar behavior across all $n$ columns, whereas a column cluster ($J$) is defined as a subset of columns that show similar behavior across all $m$ rows. A bicluster $(I, J)$ is a subset of genes ($I$) that exhibit similar behavior across a subset of conditions ($J$), or vice versa, where $I = \{i_1, \ldots, i_k\}$ with $k \le m$ and $J = \{j_1, \ldots, j_s\}$ with $s \le n$. Thus a bicluster $(I, J)$ is a $k \times s$ submatrix of the original $m \times n$ matrix.
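To make the definition concrete, the following minimal Python sketch (with made-up values and index sets) extracts a bicluster $(I, J)$ as a submatrix of a toy expression matrix:

    import numpy as np

    # Toy m x n expression matrix: m = 4 genes, n = 5 conditions.
    A = np.array([[1.0, 2.0, 1.0, 5.0, 1.0],
                  [4.0, 2.1, 1.1, 5.2, 0.9],
                  [1.2, 2.0, 1.0, 5.1, 1.0],
                  [7.0, 0.5, 3.0, 2.0, 6.0]])

    I = [0, 1, 2]  # subset of genes, k <= m
    J = [1, 2, 3]  # subset of conditions, s <= n

    # The bicluster (I, J) is the k x s submatrix induced by I and J.
    bicluster = A[np.ix_(I, J)]
    print(bicluster.shape)  # (3, 3)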
2.2 Weighted Bipartite Graph Representation of the Data Matrix
A weighted bipartite graph is defined as $G(V, E)$, where $E$ is the set of edges and $V$ is the set of vertices, divided into two sets $L$ and $R$ such that $V = L \cup R$ and every edge in $E$ has one end in $R$ and the other in $L$. The gene expression data matrix defined in the previous section can be viewed as a weighted bipartite graph in which the $m$ genes are nodes belonging to $L$, the $n$ conditions are nodes belonging to $R$, and the matrix element $a_{ij}$ is the weight of the edge from node $i \in L$ to node $j \in R$. The weighted bipartite graph for the gene expression data matrix is shown in Fig. 2. The complexity of finding a bicluster in the matrix depends on the score function used to measure bicluster quality. Even if we restrict the weights $a_{ij}$ to be 0 or 1, finding a bicluster is equivalent to finding a biclique in the bipartite graph, which is an NP-complete problem. The case becomes harder still when the weights $a_{ij}$ are real values rather than 0 or 1. Using graph theory, it has been proved that all variants of the bicluster-finding problem are NP-complete. Thus, most biclustering algorithms use heuristic approaches, and some apply data preprocessing techniques before running the algorithm in order to better expose the patterns of interest.
Fig. 2. Weighted bipartite graph for gene expression data matrix
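To illustrate this representation, the following sketch builds the weighted bipartite view of a small matrix with the networkx library; the matrix values and node names are made up for the example.

    import numpy as np
    import networkx as nx

    A = np.array([[1.0, 0.2],
                  [0.8, 3.0],
                  [0.1, 2.5]])  # 3 genes x 2 conditions (toy values)

    G = nx.Graph()
    genes = [f"g{i}" for i in range(A.shape[0])]       # left node set L
    conditions = [f"c{j}" for j in range(A.shape[1])]  # right node set R
    G.add_nodes_from(genes, bipartite=0)
    G.add_nodes_from(conditions, bipartite=1)

    # Edge (gene i, condition j) carries weight a_ij.
    for i, g in enumerate(genes):
        for j, c in enumerate(conditions):
            G.add_edge(g, c, weight=A[i, j])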
2.3 Types of Bicluster
There are four major types of biclusters: (1) biclusters with constant values, (2) biclusters with constant values on rows or columns, (3) biclusters with coherent values, and (4) biclusters with coherent evolution. The first three types analyze the data matrix and try to discover subsets of rows or columns that show similar behavior. In the coherent-evolution type, the matrix elements are treated as symbols, and their behavior is examined to see whether they follow a certain order or show coherent positive or negative changes with respect to normal values. In this type of bicluster, coherent behavior is thus found without taking the exact numeric values into account.
2.4 Examples of Bicluster Types
In Fig. 3, table (a) shows the ideal bicluster with constant values, whereas tables (b) and (c) show a constant-rows bicluster and a constant-columns bicluster, respectively. Table (d) shows the additive model of the coherent-values type of bicluster, whereas table (e) shows the multiplicative model of coherent values. Table (f) shows an actual constant-value bicluster contaminated with noise. Tables (g), (h), (i), and (j) are examples of coherent evolution: in tables (g) and (h) the state changes along rows and columns, respectively; in table (i) the value increases from the first column to the second, then decreases, and then increases again in column four; table (j) shows negative or positive changes with respect to a normal value, irrespective of the numeric values.

Notations. Consider a data matrix $A = (X, Y)$, where $X$ is the set of rows and $Y$ is the set of columns. Let $(I, J)$ be a bicluster consisting of a set of rows $I \subseteq X$ and a set of columns $J \subseteq Y$, and let $a_{ij}$ be the element in the $i$-th row and $j$-th column of the matrix. We use the following notations.

Mean of the $i$-th row of the bicluster:

$a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}$   (1)

Mean of the $j$-th column of the bicluster:

$a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}$   (2)

Mean of the bicluster:

$a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} a_{ij}$   (3)
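Eqs. (1)-(3) translate directly into numpy, assuming the bicluster is stored as the $k \times s$ submatrix B from the earlier sketch; the function name is this sketch's own:

    import numpy as np

    def bicluster_means(B):
        """Row means a_iJ (Eq. 1), column means a_Ij (Eq. 2) and
        the overall mean a_IJ (Eq. 3) of a bicluster submatrix B."""
        a_iJ = B.mean(axis=1)  # one mean per row of the bicluster
        a_Ij = B.mean(axis=0)  # one mean per column of the bicluster
        a_IJ = B.mean()        # grand mean over all |I||J| entries
        return a_iJ, a_Ij, a_IJ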
To find a constant bicluster, the approach is to arrange rows and columns in such an order that similar values cluster together, but this works only on a noiseless, ideal data matrix. Ideally, in a constant-value bicluster $(I, J)$, where all values are the same, $a_{ij} = \mu$ for all $i \in I$ and $j \in J$, where $\mu$ is the mean of all values in the cluster. For real data with noise, $a_{ij} = \mu + \eta_{ij}$, where $\eta_{ij}$ is the noise associated with the data. Fig. 3(a) shows the ideal bicluster with constant values, whereas Fig. 3(f) shows a constant bicluster in noisy data. The most common merit function used to measure the quality of a constant bicluster is the variance:

$\mathrm{VAR}(I, J) = \sum_{i \in I,\, j \in J} (a_{ij} - a_{IJ})^2$   (4)

Cheng and Church [3] used the mean squared residue as the merit function for evaluating the quality of a bicluster. The residue $r(a_{ij})$ of an element $a_{ij}$ in a bicluster with coherent values is defined as

$r(a_{ij}) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$   (5)

The mean squared residue is defined as

$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} r(a_{ij})^2$   (6)

Fig. 3. Types of bicluster
This merit function must be minimized to find better biclusters with coherent values. Fig. 3(d) and (e) show biclusters with coherent values. [1] describes a bicluster as an order-preserving submatrix (OPSM), i.e., a set of rows whose values follow the same linear order across a set of columns. In Fig. 3(i), the coherent-evolution bicluster has the order $a_{i4} > a_{i2} > a_{i3} > a_{i1}$. In such biclusters with coherent evolution, the emphasis is on the order of values across columns rather than on the exact numeric values. In Fig. 3(g) and (h) the state changes across rows and columns, respectively, so these are also examples of coherent evolution, whereas (j) portrays sign changes across rows and columns irrespective of the numeric values. There can also be overlapping biclusters consisting of two different bicluster types.
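The sketch below computes the residue of Eq. (5) and the mean squared residue $H(I, J)$ of Eq. (6) for a candidate bicluster; the function name and the toy usage are this sketch's own, not the paper's.

    import numpy as np

    def mean_squared_residue(A, I, J):
        """H(I, J) of Cheng and Church: the average squared residue
        r(a_ij) = a_ij - a_iJ - a_Ij + a_IJ over the bicluster (I, J)."""
        B = A[np.ix_(I, J)]                    # k x s submatrix
        a_iJ = B.mean(axis=1, keepdims=True)   # row means, Eq. (1)
        a_Ij = B.mean(axis=0, keepdims=True)   # column means, Eq. (2)
        a_IJ = B.mean()                        # bicluster mean, Eq. (3)
        residue = B - a_iJ - a_Ij + a_IJ       # Eq. (5), elementwise
        return np.mean(residue ** 2)           # Eq. (6)

    # A perfectly additive (coherent-values) bicluster has H = 0.
    A = np.add.outer([0.0, 1.0, 2.0], [10.0, 20.0, 30.0])
    print(mean_squared_residue(A, [0, 1, 2], [0, 1, 2]))  # 0.0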
3 Biclustering Methods

3.1 Iterative Row and Column Clustering
For biclustering, the first method that comes to mind is to apply existing clustering algorithms to the rows and columns of the data matrix separately and then combine the results to obtain biclusters. Coupled two-way clustering (CTWC) [4] uses this iterative row and column clustering approach. CTWC starts with one pair consisting of the set of all rows and the set of all columns. To find stable row and column clusters, a hierarchical clustering algorithm is applied to each set of rows and columns, finding one set of biclusters at a time. A tunable parameter $T$ controls the clustering decisions. At the beginning, $T = 0$ and there exists a single bicluster containing all rows and columns. As the value of $T$ increases, the current biclusters are divided at each step, until at a sufficiently high value of $T$ each bicluster consists of a single row and column. The stability of a bicluster is measured via the control parameter $T$, by recording the range of values of $T$ over which the cluster remains unchanged. CTWC maintains two lists of stable clusters (row clusters and column clusters) and a list of pairs of row and column subsets. At each step, one subset of rows and one subset of columns are clustered and combined, and the newly formed biclusters are added to the row and column lists. The iteration continues until clusters of a predefined size are formed and no new cluster satisfying the stability criterion is found. A minimal sketch of the underlying iterative row/column idea is given below.
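This sketch clusters rows and columns independently and pairs the resulting groups into candidate biclusters; it uses scikit-learn's AgglomerativeClustering as a stand-in for the hierarchical clustering CTWC applies, and it omits the stability analysis over $T$.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def two_way_clusters(A, n_row_clusters=2, n_col_clusters=2):
        """Cluster rows and columns independently, then pair the
        resulting groups into candidate biclusters (I, J)."""
        row_labels = AgglomerativeClustering(n_clusters=n_row_clusters).fit_predict(A)
        col_labels = AgglomerativeClustering(n_clusters=n_col_clusters).fit_predict(A.T)
        biclusters = []
        for r in range(n_row_clusters):
            for c in range(n_col_clusters):
                I = np.where(row_labels == r)[0]
                J = np.where(col_labels == c)[0]
                biclusters.append((I, J))
        return biclusters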
4 Recent Biclustering Approaches
Important approaches from computer science, such as greedy iterative search, exhaustive enumeration, and divide and conquer, are implemented by block clustering for biclustering. These techniques are computationally very fast, but a good bicluster may be missed because it can be split before being identified. The distribution parameter identification approach assumes a certain statistical model and determines the parameters used to generate the data by minimizing certain score functions over several iterations; the spectral biclustering of Kluger et al. [7] is an example of this approach. The techniques used in [8] to improve Minimum Sum Squared Residue Coclustering are as follows.
4.1 Data Transformation
Data transformation is performed as a preprocessing step and is important in many data mining tasks, since the variance of a variable decides its relevance in the model.
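As one concrete example of such a transform (not necessarily the exact scaling used in [8]), the sketch below standardizes each gene to zero mean and unit variance so that no variable dominates purely through its scale:

    import numpy as np

    def standardize_rows(A, eps=1e-12):
        """Transform each row of A to zero mean and unit variance,
        so every gene contributes on a comparable scale."""
        mu = A.mean(axis=1, keepdims=True)
        sigma = A.std(axis=1, keepdims=True)
        return (A - mu) / (sigma + eps)  # eps guards constant rows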
4.2 Incremental Local Search (LS)
The local search strategy looks for the move of some row or column that maximizes the change in the score function; this helps the algorithm escape poor local solutions. A detailed description of how the local search is performed can be found in [3]. [8] describes an improved Minimum Sum Squared Residue Coclustering algorithm that uses local search to escape local minima. In this algorithm, cluster updating and local search are performed alternately: in the first step the row and column clusters are updated; in the second step, a local search checks for a row or column that can be moved to improve the score. Each iteration performs these two steps.

A very different approach to biclustering is presented in [6], which describes diametrical clustering of anti-correlated genes. It is observed that genes that are functionally related often exhibit strong anti-correlation in their expression levels; this can happen because one gene may be strongly suppressed to allow another to be fully expressed. Such functionally related genes would ordinarily be placed in different groups. The algorithm described in [6] clusters highly correlated and anti-correlated genes together into a diametric cluster; the positively and negatively correlated genes can then be separated with a simple post-processing step, as the sketch below illustrates.
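A minimal sketch of the diametrical idea, under simplifying assumptions: genes are unit-normalized rows, assignment maximizes the absolute cosine similarity, the centroid update takes the leading eigenvector of the cluster scatter matrix, and a final sign check separates positively from negatively correlated genes. Initialization and convergence details of [6] are omitted.

    import numpy as np

    def diametrical_clusters(X, k=2, n_iter=20, seed=0):
        """Group correlated AND anti-correlated genes together by
        maximizing |x . c|; rows of X are unit-normalized genes."""
        rng = np.random.default_rng(seed)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign each gene to the centroid with largest |cosine|.
            labels = np.argmax(np.abs(X @ centroids.T), axis=1)
            for c in range(k):
                members = X[labels == c]
                if len(members) == 0:
                    continue
                # Centroid update: leading eigenvector of the cluster
                # scatter matrix (maximizes mean squared correlation).
                scatter = members.T @ members
                eigvals, eigvecs = np.linalg.eigh(scatter)
                centroids[c] = eigvecs[:, -1]
        # Post-processing: split each cluster by the sign of x . c.
        signs = np.sign(np.take_along_axis(X @ centroids.T,
                                           labels[:, None], axis=1)).ravel()
        return labels, signs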
5 Feature Selection Techniques
One of the important steps in discriminant analysis is feature selection, wherein a subset of features is selected for the discriminant system instead of using all the attributes or features in the data. The computational cost is reduced along with the number of features or dimensions, and classification accuracy improves as noise is reduced. More interpretable features also become available, which can help in identifying and monitoring the function type or target diseases. For example, Golub et al. [5] showed that 50 informative genes are usually sufficient for a two-class cancer subtype classification. Feature selection is usually performed under two approaches: filters and wrappers. In filter methods, features are selected on the basis of their intrinsic characteristics, which determine their discriminative power with respect to the target class. In wrapper methods, the feature selection process is wrapped around the learning algorithm, i.e., a feature is selected by determining its usefulness through the estimated accuracy of the learning method. There are two approaches for starting the search for the best features: forward selection and backward elimination.
Filter methods are much faster than wrapper methods, but wrappers typically deliver better classification accuracy than filters. The disadvantage of the wrapper method is that it requires the classifier to be called repeatedly during the feature selection process. Some of the methods described above are the following.

(a) Filter methods. Filter methods apply a filter criterion to select genes from among all the genes. One such filter is the information gain ($I$), which measures the significance of a feature with respect to the class:

$\text{Information gain} = H(\text{class}) - H(\text{class} \mid \text{feature})$   (7)

where

$H(\text{class}) = -\sum_{class_i \in class} P(class_i) \log_2 P(class_i)$   (8)

$H(\text{feature}) = -\sum_{feature_i \in feature} P(feature_i) \log_2 P(feature_i)$   (9)

$H(\text{class} \mid \text{feature}) = -\sum_{feature_i \in feature} P(feature_i) \sum_{class_i \in class} P(class_i \mid feature_i) \log_2 P(class_i \mid feature_i)$   (10)

However, information gain is biased toward features with high variance. This bias was removed in the symmetric uncertainty [9]:

$SU = 2 \times \text{Information gain} \,/\, (H(\text{class}) + H(\text{feature}))$   (11)
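A small Python rendering of Eqs. (7)-(11) for discrete feature and class vectors; the helper names and the toy data are this sketch's own.

    import numpy as np
    from collections import Counter

    def entropy(values):
        """H(X) = -sum p(x) log2 p(x), estimated from counts."""
        counts = np.array(list(Counter(values).values()), dtype=float)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(feature, cls):
        """Eq. (7): H(class) - H(class | feature)."""
        h_cond = 0.0
        for v in set(feature):
            subset = [c for f, c in zip(feature, cls) if f == v]
            h_cond += (len(subset) / len(cls)) * entropy(subset)  # Eq. (10)
        return entropy(cls) - h_cond

    def symmetric_uncertainty(feature, cls):
        """Eq. (11): 2 * IG / (H(class) + H(feature))."""
        return 2 * information_gain(feature, cls) / (entropy(cls) + entropy(feature))

    # Toy example: a discretized gene (low/high) vs. a two-class label.
    gene = ["low", "low", "high", "high", "high", "low"]
    label = ["A", "A", "B", "B", "B", "A"]
    print(information_gain(gene, label))       # 1.0 bit: fully informative
    print(symmetric_uncertainty(gene, label))  # 1.0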
The filter methods can be applied in two steps:

1. Rank all the features using a filter criterion similar to the one described above.
2. Choose the $n - 1$ highest-ranked features as the best feature subset.

(b) Wrapper methods. Feature selection with a wrapper method proceeds in the following steps (a sketch is given at the end of this section):

1. Select a machine learning algorithm, such as Naïve Bayes, to evaluate the score of a feature subset.
2. Select a search strategy, such as the forward selection explained above.
3. Search the feature space for subsets of features, keeping track of the best subset found so far.
4. Output the best subset as the result of feature selection.

Feature selection for unsupervised clustering is a chicken-and-egg problem: the feature selection must be performed using prior knowledge of the cluster structure, which in turn is what needs to be determined. The approach described in [2] identifies irrelevant genes rather than relevant ones and improves clustering by discarding them and working with the remaining relevant genes. In this technique, the non-discriminant genes are ordered toward the middle and can thus be discarded.
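To make the wrapper procedure concrete, the sketch below performs greedy forward selection, scoring each candidate subset by cross-validated Naïve Bayes accuracy; function and parameter names are illustrative.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    def forward_selection(X, y, max_features=5, cv=3):
        """Greedy wrapper: grow the feature set one feature at a time,
        keeping the addition that most improves CV accuracy."""
        selected, best_score = [], -np.inf
        remaining = list(range(X.shape[1]))
        while remaining and len(selected) < max_features:
            scores = {
                f: cross_val_score(GaussianNB(), X[:, selected + [f]], y, cv=cv).mean()
                for f in remaining
            }
            f_best = max(scores, key=scores.get)
            if scores[f_best] <= best_score:
                break                      # no candidate improves the score
            best_score = scores[f_best]
            selected.append(f_best)
            remaining.remove(f_best)
        return selected, best_score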
6 Conclusion
Data mining techniques such as clustering, biclustering, feature selection, discriminant methods, and graph models have been explained throughout the paper. One can combine two or more of the approaches or methods explained above to derive a method that performs better than existing algorithms on a particular problem. Moreover, beyond their application to bioinformatics, the algorithms presented in this paper are domain independent.
References

1. Ben-Dor, A., Chor, B., Karp, R., Yakhini, Z.: Discovering local structure in gene expression data: The order-preserving submatrix problem. In: 6th International Conference on Computational Biology (2002)
2. Ding, C.H.Q.: Unsupervised feature selection via two-way ordering in gene expression analysis. NERSC Division, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA, USA
3. Dhillon, I.S., Guan, Y., Kogan, J.: Iterative clustering of high dimensional text data augmented by local search. In: 2nd IEEE International Conference on Data Mining, ICDM (2002)
4. Getz, G., Levine, E., Domany, E.: Coupled two-way clustering analysis of gene microarray data. Proc. National Academy of Sciences USA (2000)
5. Golub, T., Slonim, D.K., et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring
6. Dhillon, I.S., Marcotte, E.M., Roshan, U.: Diametrical clustering for identifying anti-correlated gene clusters. Bioinformatics, 1612–1619 (2003)
7. Kluger, Y., Basri, R., Chang, J.T., Gerstein, M.: Spectral biclustering of microarray data: Coclustering genes and conditions. Genome Research, 703–716 (2003)
8. Livne, O.E., Golub, G.H.: Scaling by binormalization. Numerical Algorithms, 97–120 (2004)
9. Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T.: Numerical Recipes in C (1988)