Gene Expression Data Mining for Functional Genomics using Fuzzy Technology

Reinhard GUTHKE, Wolfgang SCHMIDT-HECK, Daniel HAHN and Michael PFAFF (1)
Hans Knöll Institute for Natural Products Research, PO Box 100813, 07708 Jena, Germany
(1) BioControl Jena GmbH, Wildenbruchstr. 15, 07745 Jena, Germany

Key words:

fuzzy clustering, machine learning, biotechnology, DNA chips, microarrays, Escherichia coli

Abstract:

Methods for supervised and unsupervised clustering and machine learning were studied in order to automatically model relationships between gene expression data and gene functions of the microorganism Escherichia coli. For a pre-selected subset of 265 genes (belonging to 3 functional groups), the function was predicted with an accuracy of 63-71 % by the various data mining methods described in this paper. Some of these methods, i.e. K-means clustering, Kohonen’s self-organizing maps (SOM), Eisen’s hierarchical clustering and Quinlan’s C4.5 decision tree induction algorithm, have already been applied to gene expression data analysis in the literature, whereas the fuzzy approach to gene expression data analysis is introduced by the authors. The fuzzy-C-means algorithm (FCM) and the Gustafson-Kessel algorithm for unsupervised clustering as well as the Adaptive Neuro-Fuzzy Inference System (ANFIS) were successfully applied to the functional classification of E. coli genes.

1.

INTRODUCTION

Advances in Computational Intelligence and Learning

Genome projects and other large-scale biological research projects are now producing enormous quantities of biological data. The entire human genome, for instance, with its sequence of about 3·10^9 base pairs (bp) represented by the letters A, T, C and G, would fill approximately 1000 books with 1000 pages each when printed. However, huge amounts of genomic data are being gathered with little practical value so far. The physiological functions of genome sequences are largely unknown.

To overcome this situation, different analysis tools have been developed to detect and understand the phenomena of gene regulation and physiological function, in particular of the protein-coding genes (so-called open reading frames, ORFs). Most of these tools search for sequence similarities by comparing unknown genes with genes of known function from other organisms. This method is strictly limited to the assignment of genes with known functions. Therefore, to learn more about functionally unassigned ORFs (about 30 % in the well-known microorganisms Escherichia coli and Saccharomyces cerevisiae), gene expression studies are to be combined with functional characterization, assuming that individual genes may be expressed differently under different physiological conditions. Specific responses to certain stimuli, like the addition of certain natural products or the supply of certain substrates, will provide indications with respect to the functions of the induced genes.

A promising approach is to analyze transcription profiles of all genes under changing conditions using DNA microarrays, in connection with the knowledge available in databases. This can be described as supervised learning if knowledge is partially available and as unsupervised learning if not. In the entire genome sequence of the microorganism E. coli, which is widely used in biotechnology for the production of recombinant (e.g. human) proteins as well as in microbial research, 4290 ORFs were identified (Blattner et al., 1998) and used to produce DNA arrays. E. coli gene expression data published by Tao et al. (1999) using these arrays are studied in this paper and related to gene functions by different data mining methods.

2.

EXPERIMENTAL DATA

The study described here was based on E. coli gene expression data published by Tao et al. (1999), which are also publicly available via the internet (http://www.ou.edu/cas/botany-micro/faculty/tconway/global.html). The data originate from E. coli MG1655 cultures grown under different conditions on: i) minimal medium containing 0.2 % glucose (“MinGlc”), ii) rich medium with Luria broth containing 0.2 % glucose (“LB+Glc”), iii) gluconate medium (results not shown in this paper).

Intelligent Applications in Biomedicine


The data were determined with Panorama™ E. coli Gene Arrays (Sigma-GenoSys Biotechnologies, Inc.): mRNA isolated from E. coli cells grown under different conditions was hybridized with the ORF-specific DNA fragments immobilized on the array, followed by radioactivity detection and image analysis (see Figure 172). Functions of 67 % of all 4290 genes are known, as shown in Table 49 (functional groups 1 to 21). The expression data of all 4290 genes under the two cultivation conditions are shown in Figure 173. In order to focus on methodological aspects of data mining algorithms, data of only 265 genes were considered in this paper. This reduced set comprises the genes of the functional groups 1 (amino acid biosynthesis), 10 (translation, post-translational modification) and 19 (putative cell structure) whose expression intensity is positive (i.e. the value is higher than the background value; for 3 genes the determined intensity was smaller, and these genes were excluded). These pre-selected data are shown in Figure 174.
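The pre-selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `preselect`, the array layout (one row per gene, one column per growth condition, background already subtracted) and the toy numbers are hypothetical and do not come from the Tao et al. data set.

```python
import numpy as np

def preselect(net_intensity, group, keep_groups=(1, 10, 19)):
    """Keep genes of the chosen functional groups whose background-corrected
    intensity is positive under every condition; return log10 intensities.

    net_intensity: array (n_genes, n_conditions), background already subtracted
                   (assumed layout, not taken from the source)
    group:         array (n_genes,) with the functional group number per gene
    """
    in_groups = np.isin(group, keep_groups)
    above_background = np.all(net_intensity > 0, axis=1)
    mask = in_groups & above_background
    return np.log10(net_intensity[mask]), group[mask]

# Toy example with four hypothetical genes and two conditions (MinGlc, LB+Glc).
net = np.array([[1000.0, 5000.0],   # group 1, kept
                [200.0, -30.0],     # group 10, dropped (below background)
                [50.0, 80.0],       # group 19, kept
                [10.0, 100.0]])     # group 5, dropped (not a selected group)
group = np.array([1, 10, 19, 5])
log_x, kept_groups = preselect(net, group)
```

Taking the logarithm only after filtering avoids undefined values for genes whose measured intensity lies below the background.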

Figure 172. Panorama™ E. coli Gene Arrays showing the expression of all 4290 protein coding genes of E. coli grown on minimal medium containing glucose (“MinGlc”, from Tao et al. 1999)


Figure 173. Logarithmic (log10) expression intensities under two different growth conditions (“MinGlc” and “LB+Glc”) of all 4290 E. coli genes (data from Tao et al., 1999). Different symbols indicate functional groups (see Table 49), e.g. black dots represent the 1428 genes of unknown function.

Figure 174. Logarithmic expression intensities for the 265 pre-selected genes used in the study (horizontal axis: MinGlc, vertical axis: LB+Glc). The three symbols represent the 97 genes related to amino acid biosynthesis (+), the 127 genes related to translation, post-translational modification (∗) (one gene with a negative expression value was ignored) and the 41 genes related to putative cell structure (third symbol; two genes with negative expression values were ignored).


Table 49. E. coli genes annotated by 21 functional groups (Tao et al., 1999). The genes of the three functions in capital letters are considered in the study.

No.  Function                                                     Number of genes
     Total                                                        4290
1    AMINO ACID BIOSYNTHESIS                                      97
2    Putative transport proteins                                  291
3    Central intermediary metabolism                              149
4    Biosynthesis of cofactors, prosthetic groups and carriers    106
5    Putative enzymes                                             453
6    Regulatory function                                          208
7    Cell processes                                               170
8    Phage, transposon, or plasmid                                91
9    Transport and binding proteins                               254
10   TRANSLATION, POST-TRANSLATIONAL MODIFICATION                 128
11   Putative regulatory proteins                                 167
12   Putative factors                                             67
13   Nucleotide biosynthesis and metabolism                       66
14   DNA replication, repair, restriction/modification            105
15   Carbon compound catabolism                                   124
16   Energy metabolism                                            136
17   Cell structure                                               84
18   Putative membrane protein                                    54
19   PUTATIVE CELL STRUCTURE                                      43
20   Transcription, RNA processing and degradation                28
21   Fatty acid and phospholipid metabolism                       41
22   Unknown function                                             1428

3.

RESULTS OF DATA ANALYSIS

Data shown in Figure 174 were clustered and the results compared to the 3 functional groups No. 1, 10, 19 (Table 49). Cluster analysis was started unsupervised (i.e. known gene functions were not used for learning) and continued supervised using the physiological functions during the learning process. Calculations were carried out using MATLAB tools.
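Comparing an unsupervised clustering with known functional groups requires some rule for deciding which cluster stands for which group. The source does not state how this assignment was made, so the following Python sketch simply assumes a one-to-one assignment chosen to maximize the fraction of correctly assigned genes; the function name and the toy labels are hypothetical.

```python
from itertools import permutations

import numpy as np

def best_match_accuracy(cluster_labels, group_labels):
    """Fraction of genes whose cluster maps to their functional group under the
    best one-to-one assignment of clusters to groups (an assumed scoring rule)."""
    cluster_labels = np.asarray(cluster_labels)
    group_labels = np.asarray(group_labels)
    clusters = np.unique(cluster_labels)
    groups = np.unique(group_labels)
    best = 0.0
    # Try every assignment of distinct groups to the clusters; feasible for 3 classes.
    for perm in permutations(groups, len(clusters)):
        mapping = dict(zip(clusters, perm))
        predicted = np.array([mapping[c] for c in cluster_labels])
        best = max(best, float(np.mean(predicted == group_labels)))
    return best

# Toy example: six hypothetical genes, cluster indices versus group numbers.
accuracy = best_match_accuracy([0, 0, 1, 1, 2, 2], [10, 10, 1, 19, 19, 19])
```

With only three classes an exhaustive search over the six possible assignments is sufficient; for many classes an assignment algorithm would be the more practical choice.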

3.1

Unsupervised Clustering of Gene Expression Data

Data shown in Figure 174 were clustered into 3 classes. Figure 175 to Figure 180 show the results of unsupervised clustering by different methods:
 - K-means (MacQueen, 1967)
 - Self-organizing maps (SOM; Kohonen, 1987; Tamayo et al., 1999)
 - Hierarchical clustering by dendrograms (Eisen, 1998; results not shown)
 - Fuzzy-C-means (Bezdek, 1981)
 - Fuzzy clustering by the Gustafson-Kessel algorithm (Gustafson and Kessel, 1979)
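As an illustration of the first method in the list, crisp K-means on two-dimensional data can be sketched as follows. This is a generic textbook version in Python, not the MATLAB tool used in the study; the function name, the explicit `init_centers` argument and the toy points are assumptions made for the example.

```python
import numpy as np

def kmeans(x, init_centers, n_iter=100):
    """Plain K-means: alternate nearest-center assignment and center update."""
    centers = np.array(init_centers, dtype=float)
    k = len(centers)
    for _ in range(n_iter):
        # Euclidean distance of every point to every center, shape (n_points, k).
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # New center = mean of its members; an empty cluster keeps its old center.
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy example: three obvious pairs of points, one initial center in each pair.
points = np.array([[0.0, 0.0], [0.0, 1.0],
                   [5.0, 5.0], [5.0, 6.0],
                   [10.0, 0.0], [10.0, 1.0]])
labels, centers = kmeans(points, init_centers=points[[0, 2, 4]])
```

K-means depends on its initial centers and may stop in a local optimum, which is why the starting centers are passed in explicitly here.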


Figure 175. Clustering of the 265 pre-selected genes by the K-Means algorithm. Symbols as in Figure 174; three areas separated by the classifier labelled with the functional group numbers 1, 10 and 19.

Figure 176. Clustering by Self-Organizing Maps (SOMs), labelled as in Figure 175.


Figure 177. Clustering by the Fuzzy-C-Means algorithm after defuzzification, labelled as in Figure 175.

Figure 178. Fuzzy clustering by the Gustafson-Kessel algorithm after defuzzification, labelled as in Figure 175.

Figure 179. Clustering by the Fuzzy-C-Means algorithm (panel title: Fuzzy-C-Means; horizontal axis: MinGlc, vertical axis: LB+Glc). The degree of membership of the genes to the 3 classes is indicated by the three symbols (+, ∗ and a third symbol) and their size.

Figure 180. Fuzzy clustering by the Gustafson-Kessel algorithm (panel title: Gustafson-Kessel; horizontal axis: MinGlc, vertical axis: LB+Glc). The degree of membership of the genes to the 3 classes is indicated by the symbols and their size.

Although crisp clustering by K-means, SOMs and dendrograms (Eisen, 1998) was applied to gene expression pattern recognition earlier, fuzzy clustering was first introduced by Guthke et al. (1999, 2000). After crisp clustering by K-means and SOMs, as well as after defuzzification of fuzzy clustering results (see Figure 175 to Figure 178), the degree of membership m(i,k) of each gene (i=1,...,265) with respect to each of the 3 classes (k=1,2,3) is either 1 (when gene i belongs to class k) or 0 (when gene i does not belong to class k). After fuzzy clustering without defuzzification, the degree of membership m(i,k) is a real number between zero and one. These real numbers are indicated by the size of the symbols in Figure 179 and Figure 180. Because the probabilistic version of fuzzy clustering is applied in this paper, the sum of m(i,k) over all k (=1,2,3) equals one for each gene i.
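The difference between fuzzy memberships and their defuzzified (crisp) counterparts can be made concrete with a small Python sketch of the standard probabilistic fuzzy-C-means iteration after Bezdek (1981). It is a minimal illustration, not the implementation used in the study: the fuzzifier value of 2, the random initialization, the synthetic data and all names are assumptions. The membership matrix is called `u` here (corresponding to m(i,k) in the text) to avoid confusion with the fuzzifier exponent.

```python
import numpy as np

def fuzzy_c_means(x, n_clusters=3, fuzzifier=2.0, n_iter=200, tol=1e-6, seed=0):
    """Probabilistic fuzzy-C-means; returns memberships u (n_points, n_clusters)
    and cluster centers. Each row of u sums to one."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(x), n_clusters))
    u /= u.sum(axis=1, keepdims=True)          # random start with rows summing to one
    for _ in range(n_iter):
        w = u ** fuzzifier
        centers = (w.T @ x) / w.sum(axis=0)[:, None]   # membership-weighted means
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)         # guard against division by zero
        inv = dist ** (-2.0 / (fuzzifier - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)   # normalize per data point
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    return u, centers

# Synthetic stand-in for two-dimensional log-expression data (hypothetical values).
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc, 0.2, size=(20, 2))
                  for loc in ([2.5, 2.5], [3.5, 4.0], [4.5, 3.0])])
u, centers = fuzzy_c_means(data)

# Defuzzification: assign each point to its class of maximum membership,
# which turns every membership row into a 0/1 vector.
crisp = np.eye(3)[u.argmax(axis=1)]
```

The per-row normalization in the membership update is what enforces the probabilistic constraint that the memberships of each gene sum to one.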

3.2

Supervised Clustering of Gene Expression Data

Figure 181 and Figure 182 show the results of supervised clustering by the Adaptive Neuro-Fuzzy Inference System (ANFIS; Jang, 1991, 1993). This method generates a Sugeno-type fuzzy rule set (consisting of n*2=4 rules, where n=2 is the number of input variables) with crisp real output values that code for gene functions. The input data are the logarithmic expression intensities of the 265 selected E. coli genes determined under the two growth conditions (“MinGlc” and “LB+Glc”).

Figure 183 shows the clustering results obtained by applying decision trees generated by the C4.5 algorithm (Quinlan, 1993). The original logarithmic E. coli gene expression data determined under the two growth conditions “MinGlc” and “LB+Glc” served as input for the rules, and the functional groups 1, 10 and 19 as output. The decision tree induced consists of three crisp rules:
 - IF MinGlc > 2.9 AND LB+Glc < 3.4 THEN functional group 1 (Amino acid biosynthesis)
 - IF MinGlc > 2.9 AND LB+Glc > 3.4 THEN functional group 10 (Translation, post-translational modification)
 - IF MinGlc