A Probabilistic Method to Correlate Ion-pairs to Protein Thermostability

Shir-Ly Huang1,*, Li-Cheng Wu2, Hsien-Da Huang3, Han-Kuen Liang4, Ming-Tat Ko4, and Jorng-Tzong Horng1,2

1 Department of Life Science, National Central University, Taiwan
2 Department of Computer Science and Information Engineering, National Central University, Taiwan
3 Department of Biological Science and Technology, Institute of Bioinformatics, National Chiao-Tung University, Taiwan
4 Institute of Information Science, Academia Sinica, Taiwan
*E-mail: [email protected]

Abstract: Recent research on the stability of proteins, based on comparisons of ion-pairs between homologous structures, shows that ion-pairs potentially contribute to the thermostability of proteins. This study proposes a probabilistic Bayesian method that efficiently predicts the thermostability of proteins from the properties of their ion-pairs. The experimental results suggest that the number, types, and bond distances of ion-pairs can be used to predict the thermostability of proteins with similar functions with an accuracy of up to 80%. The predictions have a high precision of 99%, especially for hyperthermophilic proteins. The experimental results for proteins with different functions also indicate that the number of ion-pairs is related to the thermostability of proteins, and that thermostability can also be predicted for proteins with different functions.

Keywords: Ion-Pairs, Protein Thermostability, Bayesian

* Correspondence: Shir-Ly Huang, E-mail: [email protected], Department of Life Science, National Central University, No. 300, Jungda Rd., Jhongli City, Taoyuan, Taiwan 320, R.O.C.

Introduction

Proteins are the end-products of most gene expression. Some of these proteins are used widely in industry as biocatalysts (Burton 2003). Chemical reactions often need to be performed at high temperatures to accelerate industrial processes. However, not many enzymes are stable when heated (Gracia et al. 2003; Voronov et al. 2002). Research is required to make proteins remain active and stable while being heated, to overcome current limits on their industrial applications. Additionally, investigating the differences between the structures of homologous proteins from mesophilic and thermophilic organisms is a crucial topic in basic research.

In the natural environment, the nitrogen-containing groups of the histidine (His), arginine (Arg) and lysine (Lys) side-chains are positively charged, and the carboxyl groups of the aspartic acid (Asp) and glutamic acid (Glu) side-chains are negatively charged. These charges are distributed throughout the structure of proteins, and both attractive and repulsive electrostatic interactions are possible between charged residues. When the close approach of oppositely charged residues is energetically favorable, ion-pairs (or salt-bridges) are formed (Barlow and Thornton 1983). Statistical studies of the distribution and geometry of simple and complex ion-pairs indicate that ion-pairs are very important to the structure and function of proteins (Barlow and Thornton 1983; Kumar and Nussinov 1999). Studies of monomeric proteins show that ion-pair geometry is critical in determining ion-pair stability (Kumar and Nussinov 1999). Comparisons of homologous structures show that ion-pairs play a key role in the thermostability of proteins (Szilagyi and Zavodszky 2000). Previous research has assumed that proteins from thermophilic organisms are substantially more thermally stable than their counterparts from mesophilic organisms (Szilagyi and Zavodszky 2000; Vogt et al. 1997). That is, proteins of the same function from different organisms, whether thermophilic or mesophilic, may contain information that relates to the different thermostabilities of proteins.
Structural differences among mesophilic, moderately thermophilic and extremely thermophilic proteins have been surveyed (Szilagyi and Zavodszky 2000). Among the structural properties compared (cavities, hydrogen bonds, ion-pairs, secondary structure, and polarity of surface), the number of ion-pairs is correlated strongly with the thermostability of the protein: thermostability generally increases with the number of ion-pairs (Szilagyi and Zavodszky 2000). However, that work does not relate ion-pairs to the thermostability of proteins through a general model. The Bayesian network (Jensen 1996), a data mining and machine learning tool, is used in many bioinformatics applications such as protein function prediction (Qu et al. 1998), multiple sequence alignment (Zhu et al. 1998), protein-protein interaction prediction (Jansen et al. 2003) and RNA secondary structure prediction (Hatzivassiloglou et al. 2001). This study applies a Bayesian network to predict the thermostability of proteins based on the numbers, types, and bond distances of ion-pairs.

Methods

Accurate data on the thermostability of proteins are available, but new data are not easy to obtain. An Internet database, ProTherm (thermodynamic database for proteins and mutants; see footnote 1) (Gromiha et al. 2002), provides many data on the thermodynamics of proteins, measured under different experimental conditions. However, the database contains many mutant proteins, and some proteins have multiple temperatures corresponding to different conditions. Therefore, the information about proteins in this database is not suitable for the purposes of this study. Research (Gromiha et al. 1999) has shown a direct relationship between the optimal growth temperature and the melting temperature. The original literature was therefore extensively reviewed to retrieve optimal growth temperature records and construct our own data set for mining. Because the optimal growth temperature information thus retrieved is also valuable to other research, we created an Internet database called PGTdb to provide optimal growth temperature information for prokaryotes (Huang et al. 2004). The database can be accessed online at http://pgtdb.csie.ncu.edu.tw/.

Proteins from thermophilic organisms are usually much more stable intrinsically than their counterparts from mesophilic organisms, although they retain the basic folded characteristics of the particular protein family (Vogt et al. 1997). Based on this assumption, the proteins in our dataset are grouped by thermostability as mesophilic, thermophilic and hyperthermophilic, according to whether the optimal growth temperature of the source organism is 20-45°C, 45-80°C, or >80°C, respectively (Madigan et al. 2000). Protein structures can be retrieved from the PDB (Berman et al. 2002; see footnote 2). In a protein structure, two oppositely charged residues that are close to each other are considered an ion-pair.
1 http://www.rtc.riken.go.jp/jouhou/protherm/protherm.html
2 http://www.rcsb.org/pdb/

There are three positively charged residues (His, Arg, Lys) and two negatively charged residues (Asp, Glu). Therefore, six combinations of residues can form an ion-pair. Usually, two oppositely charged residues within a distance of 4 Å are considered to be a strong ion-pair, and distances of 6 Å and 8 Å are considered to define weaker electrostatic interactions (Vogt et al. 1997). Thus, ion-pairs are classified into three types according to the distance between their residues: strong, medium and weak, for distances in the ranges < 4.0 Å, 4.0-6.0 Å, and 6.0-8.0 Å, respectively. If one of the residues has two charged atoms, the smallest distance between charged atoms is chosen. Eighteen types of ion-pairs are thus defined by both the strength of interaction and the type of residue pair. The total number of ion-pairs of each strength type is also considered as a feature in this study. Therefore, a protein structure has 21 features. Given an ordering of these features, the numbers of ion-pairs of each type in a protein structure can be represented as a feature vector with 21 dimensions.

For example, let the dimensions of the feature vector be ordered as His-Asp strong, His-Asp medium, His-Asp weak, His-Glu strong, His-Glu medium, His-Glu weak, Arg-Asp strong … Lys-Glu weak, Ion-pair strong, Ion-pair medium, Ion-pair weak. Let protein P have 4 ion-pairs: His-Glu 3.2 Å, His-Glu 5 Å, Arg-Asp 6.6 Å, and His-Asp 6.4 Å. The feature vector of protein P is then (0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2).

The number of ion-pairs varies from hundreds to thousands depending on the size of the protein. In practice, the feature vector of a protein includes only a few zero values. It is reasonable that a large protein has more ion-pairs than a small protein. Thus, the feature vector of the protein is normalized by the number of residues of the protein; that is, every value in the feature vector is divided by the number of residues. If the protein in the above example has 50 residues, then the feature vector becomes (0, 0, 0.02, 0.02, 0.02, 0, 0, 0, 0.02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.02, 0.02, 0.04).
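The construction of the normalized 21-dimensional feature vector described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the input format (a list of residue, residue, distance tuples) are hypothetical, and the handling of distances exactly equal to 4.0 Å or 6.0 Å is an assumption, since the text does not say which class owns the boundary.

```python
from itertools import product

POSITIVE = ("His", "Arg", "Lys")
NEGATIVE = ("Asp", "Glu")
STRENGTHS = ("strong", "medium", "weak")

# 18 residue-pair/strength features, followed by 3 per-strength totals,
# in the order used by the worked example in the text.
FEATURES = [f"{p}-{n} {s}" for p, n in product(POSITIVE, NEGATIVE) for s in STRENGTHS]
FEATURES += [f"Ion-pair {s}" for s in STRENGTHS]


def strength(distance):
    """Classify an ion-pair by distance in angstroms; None beyond 8 A.

    Assumption: boundary values 4.0 and 6.0 go to the weaker class.
    """
    if distance < 4.0:
        return "strong"
    if distance < 6.0:
        return "medium"
    if distance <= 8.0:
        return "weak"
    return None


def feature_vector(ion_pairs, n_residues):
    """Count ion-pairs per feature and normalize by the number of residues."""
    counts = dict.fromkeys(FEATURES, 0)
    for positive, negative, distance in ion_pairs:
        s = strength(distance)
        if s is None:
            continue  # too far apart to count as an ion-pair
        counts[f"{positive}-{negative} {s}"] += 1
        counts[f"Ion-pair {s}"] += 1
    return [counts[name] / n_residues for name in FEATURES]


# Protein P from the text: four ion-pairs, 50 residues.
pairs = [("His", "Glu", 3.2), ("His", "Glu", 5.0),
         ("Arg", "Asp", 6.6), ("His", "Asp", 6.4)]
vec = feature_vector(pairs, 50)
```

For protein P this yields 0.02 in the His-Asp weak, His-Glu strong, His-Glu medium and Arg-Asp weak positions, and 0.02, 0.02, 0.04 for the three totals, matching the normalized vector given in the text.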
This study considers only ion-pairs within a subunit; ion-pairs between residues that belong to different subunits are discarded when calculating the feature vector. A naïve Bayesian classifier is created to represent the relationship between ion-pairs and thermostability. The classifier uses the feature vector, whose elements represent the different types of ion-pairs, to predict the thermostability of a protein. Figure 1 shows an example of the naïve Bayesian classifier model for the feature vector and the thermostability of a protein. Let c denote a class (i.e., a thermostability class) and ε denote the evidence available to the machine learning algorithm (i.e., the ion-pair feature vector f1, f2,…,fk of the protein structure). The conditional probability P( ε | c ) is the probability that the evidence ε is observed given that class c occurs. This conditional probability can be estimated from the known data. The conditional probability P( c | ε ) is the probability of class c given the evidence ε. Given evidence ε, the naïve Bayesian classifier aims to find the class c that maximizes the conditional probability P( c | ε ). Using Bayes’ rule,

$$P(c \mid \varepsilon) = P(\varepsilon \mid c) \times \frac{P(c)}{P(\varepsilon)} = P(f_1, f_2, \ldots, f_k \mid c) \times \frac{P(c)}{P(f_1, f_2, \ldots, f_k)} \qquad (1)$$

where P(c) is the prior probability of class c. The prior probability of the evidence P(f1, f2,…,fk) in Equation (1) is the same constant for all classes c, so it does not affect the maximization of Equation (1). Maximizing Equation (1) is therefore equivalent to maximizing

$$P(c) \times P(f_1, f_2, \ldots, f_k \mid c) \qquad (2)$$

Estimating P(f1, f2,…,fk | c) for a large k is, however, impossible for most applications because of the sparseness of the available data. We therefore assume independence between the features in the feature vector (hence the “naïve” qualifier in the name of the method), and Equation (2) is rewritten as

$$P(c) \times \prod_{i=1}^{k} P(f_i \mid c) \qquad (3)$$
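The maximization of Equation (3) over the classes can be sketched as below. This is an illustrative sketch under stated assumptions: the function name, the dictionary layout of the probability tables, and the toy probabilities are all hypothetical and are not taken from the study. The product is evaluated as a sum of logarithms, which leaves the maximizing class unchanged and avoids numerical underflow; this requires every probability to be strictly positive.

```python
import math


def classify(priors, cond_probs, groups):
    """Return the class c maximizing P(c) * prod_i P(f_i | c), in log space.

    priors:     {class: P(c)}
    cond_probs: {class: [{group: P(f_i = group | c)} for each feature i]}
    groups:     observed group for each feature of the protein
    All probabilities are assumed to be strictly positive.
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for i, g in enumerate(groups):
            score += math.log(cond_probs[c][i][g])
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy tables with two features and two groups each (made-up numbers).
priors = {"mesophilic": 0.7, "hyperthermophilic": 0.3}
cond_probs = {
    "mesophilic": [{1: 0.8, 2: 0.2}, {1: 0.6, 2: 0.4}],
    "hyperthermophilic": [{1: 0.1, 2: 0.9}, {1: 0.3, 2: 0.7}],
}
low_counts = classify(priors, cond_probs, [1, 1])   # 0.336 vs 0.009
high_counts = classify(priors, cond_probs, [2, 2])  # 0.056 vs 0.189
```

With these toy tables the first protein is assigned to the mesophilic class and the second to the hyperthermophilic class, because the prior-weighted products in the comments favor those classes.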

Although the assumption of independence between features is obviously a simplification, the method often offers a sufficient approximation to the true classification (Holte 1993). This study cannot prove that the numbers of ion-pairs of the different residue types are independent of each other in every protein; the independence between features is therefore adopted as an approximation.

The elements of the feature vectors are not integers after normalization, so the normalized values of the different types of ion-pairs are grouped. Grouping has the advantage of bounding the number of distinct values, which prevents the naïve Bayesian classifier from reserving probability for a very large number of values. Available guidelines for grouping continuous data into discrete values include expert recognition, equal division, exponential division, and the entropy maximization heuristic (Cooper 1989). In this study, the grouping method combines equal division and exponential division: equal division is used for small numbers and exponential division for large numbers. For example, the numbers of strong ion-pairs between aspartic acid and histidine are grouped into groups 1 to 10, with ranges (1~9), (10~19), (20~29), (30~39), (40~59), (60~99), (100~179), (180~339), (340~659) and (660~1299), respectively.

If one of the probabilities P(fi | c) is zero, Equation (3) multiplies all of the other probabilities by this zero, so that the value of Equation (3) equals zero. The maximization of P( c | ε ) will then never choose that class unless P( c | ε ) is zero for every class, and the naïve Bayesian classifier will be dominated by the feature dimension with the observed zero value. A method called m-estimation is adopted to overcome this difficulty, by increasing the probabilities associated with feature values whose observed probability is zero (Mitchell 1997). For example, no protein in the training set has 660 or more ion-pairs (normalized) between aspartic acid and histidine with a distance of less than 4.0 Å, so group 10 of the strong Asp-His ion-pair type has an observed probability of zero. The m-estimate method is applied: the probability associated with group 10 is predefined as a small value (for example, 0.02), and a probability of 0.0022 is subtracted from each of groups 1 to 9 to maintain a total probability of 1.

The goal of the naïve Bayesian classifier is to maximize P( c | ε ), so the probability of each group is calculated for each ion-pair type and P( c | ε ) is calculated from Equation (3). The thermostability class with the maximum P( c | ε ) is chosen as the result.

The four steps in building the conditional probabilities P(fi | c) of the evidence are as follows.
1. Find all ion-pairs of the proteins in the training set.
2. Classify the ion-pairs by strength and type of residue, and generate the feature vectors.
3. Normalize the values in the feature vectors.
4. Calculate P(fi | c) for i = 1 to 21, applying an m-estimate to avoid zero values.

Given the probabilities P(fi | c), the thermostability of a protein can be predicted from Equation (3). The five steps of the prediction are as follows.
1. Find all ion-pairs of the protein.
2. Classify the ion-pairs by strength and type of residue, and generate the feature vector.
3. Normalize the values in the feature vector.
4. Calculate P( c | ε ) using Equation (3).
5. Select the thermostability class associated with the highest probability in step 4.
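The grouping and training steps above can be sketched as follows. Several points are assumptions rather than details from the study: the function names are hypothetical; the group boundaries of the Asp-His example are reused for every feature; values below 1 are placed in group 1 and values above 1299 in group 10; and the smoothing uses the textbook m-estimate (n_g + m·p)/(n + m) with a uniform prior p = 1/10, which may differ from the redistribution (0.02 added, 0.0022 subtracted) used in the worked example.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of groups 1..10 from the Asp-His example in the text.
UPPER_BOUNDS = [9, 19, 29, 39, 59, 99, 179, 339, 659, 1299]
N_GROUPS = len(UPPER_BOUNDS)


def group_of(value):
    """Map a (scaled) ion-pair count to a group number in 1..10."""
    return min(bisect_left(UPPER_BOUNDS, value) + 1, N_GROUPS)


def m_estimate(groups, m=1.0):
    """Smoothed P(group | c) for one feature within one class.

    Uses (n_g + m * p) / (n + m) with a uniform prior p = 1 / N_GROUPS,
    so that no group receives a probability of exactly zero.
    """
    counts = Counter(groups)
    n = len(groups)
    p = 1.0 / N_GROUPS
    return {g: (counts[g] + m * p) / (n + m) for g in range(1, N_GROUPS + 1)}


def train(vectors, labels, m=1.0):
    """Return (priors, cond_probs) for Equation (3) from feature vectors."""
    classes = sorted(set(labels))
    priors = {c: labels.count(c) / len(labels) for c in classes}
    cond_probs = {}
    for c in classes:
        rows = [v for v, y in zip(vectors, labels) if y == c]
        # zip(*rows) iterates over the values of one feature at a time.
        cond_probs[c] = [m_estimate([group_of(x) for x in column], m)
                         for column in zip(*rows)]
    return priors, cond_probs


# Four observations, none in group 10: group 10 still gets 0.1 / 5 = 0.02.
probs = m_estimate([1, 1, 2, 3])
```

The returned tables have the same shape as the input expected by a scoring routine for Equation (3): one prior per class, and one group-to-probability mapping per feature and class.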

Results

Three functional families of proteins with different thermostabilities are included in the experimental data: α-amylase, GAPDH, and Xylanase. These families include 119 proteins, of which 86, 27 and 6 are in the mesophilic, thermophilic and hyperthermophilic classes, respectively. Table 1 shows the detailed numbers of proteins in each functional family and thermostability class. The Appendix lists the proteins selected for mining.

In the experiments, 70% of the proteins are selected randomly as the training set for the naïve Bayesian classifier, and the other 30% form the test set. For example, in the experiment on all 119 proteins, 83 proteins are selected randomly as the training set, and the other 36 proteins form the test set. The process of randomly selecting the training set, training the naïve Bayesian classifier, and testing is repeated one hundred times. This evaluation process is very similar to 30-fold cross-validation (Mitchell 1997), except that the training and testing cycle is repeated 100 times for a more accurate test result, and the test sets are not disjoint because the data set is not large. The performance indices of the naïve Bayesian classifier are described first, followed by the experimental results for the 119 proteins in the three functional families and the experimental results for the Xylanase proteins.

Four performance indices are used in our evaluation: the overall rate of classification accuracy (Hatzivassiloglou et al. 2001), and the precision (Hatzivassiloglou et al. 2001), recall/sensitivity (Hatzivassiloglou et al. 2001), and specificity (Feinstein 1977) for a class. The overall accuracy is the total percentage of correct decisions made by the classification algorithm. Recall/sensitivity measures the ability of the naïve Bayesian classifier to find the proteins in the various thermostability classes. The precision represents the percentage of the predictions that are correct. The specificity represents the proportion of items not in class C that are predicted as not being in class C. The overall accuracy is defined as follows:

$$\text{Overall accuracy} = \frac{\text{Number of proteins in the test set whose thermostability is correctly predicted}}{\text{Number of all proteins in the test set}}$$
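The repeated random 70%/30% evaluation and the overall accuracy defined above can be sketched as follows. The function name and the `fit` and `predict` callables are hypothetical placeholders for whatever classifier is being evaluated; the majority-class model in the usage example is only a stand-in and is not the classifier of this study.

```python
import random


def repeated_holdout(items, labels, fit, predict,
                     train_frac=0.7, repeats=100, seed=0):
    """Average overall accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    n_train = int(len(items) * train_frac)  # 83 of 119 proteins
    indices = list(range(len(items)))
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(indices)
        train_idx, test_idx = indices[:n_train], indices[n_train:]
        model = fit([items[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        correct = sum(predict(model, items[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Stand-in data with the class sizes of the full dataset (86 / 27 / 6).
labels = ["mesophilic"] * 86 + ["thermophilic"] * 27 + ["hyperthermophilic"] * 6
items = list(range(len(labels)))


def fit_majority(train_items, train_labels):
    """Placeholder model: remember the most frequent training label."""
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, item):
    """Placeholder prediction: always return the remembered label."""
    return model


baseline_accuracy = repeated_holdout(items, labels, fit_majority, predict_majority)
```

Because the splits are drawn independently, a protein can appear in the test set of several repetitions, which is the sense in which the test sets are not disjoint.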

Most proteins in the dataset are mesophilic. Thus, the overall accuracy alone is not sufficient for measuring performance: a dummy model that predicts every protein to be mesophilic would still yield an overall accuracy of more than 60% (86 of the 119 proteins, about 72%, are mesophilic), yet no thermophilic or hyperthermophilic protein would be correctly identified. Hence, measurements other than overall accuracy are important. Accordingly, the correctness of our prediction is evaluated by using precision, recall/sensitivity (Ding and Lawrence 1999), and specificity (Feinstein 1977), as indicated in Figure 2. For a given thermostability class C, which may be mesophilic, thermophilic, or hyperthermophilic, let TP, TN, FP and FN refer to True Positive, True Negative, False Positive and False Negative. TP refers to proteins that were correctly predicted to be in thermostability class C. FN refers to proteins that are in thermostability class C but were predicted to be in another thermostability class. FP refers to proteins that were predicted to be in thermostability class C even though they were actually not in thermostability class C. TN refers to proteins that are not in thermostability class C and whose predicted thermostability class was other than C. The precision, recall/sensitivity (Hatzivassiloglou et al. 2001), and specificity (Feinstein 1977) are defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall/Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$
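The three per-class measures defined above can be computed in a one-versus-rest fashion, as in the following sketch. The function name and the short label lists are hypothetical and only illustrate the counting of TP, FP, FN and TN for one class C at a time.

```python
def one_vs_rest_metrics(actual, predicted, cls):
    """Precision, recall/sensitivity and specificity for class `cls`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == cls and p == cls for a, p in pairs)
    fp = sum(a != cls and p == cls for a, p in pairs)
    fn = sum(a == cls and p != cls for a, p in pairs)
    tn = sum(a != cls and p != cls for a, p in pairs)

    def ratio(numerator, denominator):
        # Undefined (NaN) when the class never occurs or is never predicted.
        return numerator / denominator if denominator else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Made-up labels: M = mesophilic, T = thermophilic, H = hyperthermophilic.
actual = ["M", "M", "M", "T", "T", "H"]
predicted = ["M", "M", "T", "M", "T", "H"]
meso = one_vs_rest_metrics(actual, predicted, "M")   # TP=2, FP=1, FN=1, TN=2
hyper = one_vs_rest_metrics(actual, predicted, "H")  # TP=1, FP=0, FN=0, TN=5
```

Note that the TN count for class C includes proteins whose own class was predicted wrongly, as long as they were neither in C nor predicted as C.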

Precision and recall are widely used in many information systems (Frankes and Baeza-Yates 1992). Sensitivity and specificity are widely used in clinical analysis applications (Feinstein 1977). Precision is also called the “predictive value positive” in clinical analysis. Specificity is very useful in clinical analysis, where most tests involve only two classes of outcomes, diseased or healthy. In such predictions of “diseased”, TN represents a healthy subject who is correctly reported as healthy by the clinical test. The experiment conducted herein involves three classes of thermostability, so TN of class C refers only to a protein that is not in thermostability class C and is not predicted as being in thermostability class C. Notably, this in no way indicates whether the proteins in TN are correctly predicted. Thus, the specificity in this experiment represents only the proportion of proteins not in thermostability class C that were not mistakenly assigned to thermostability class C by the proposed method.

Table 2 shows the results for the dataset consisting of α-amylase, GAPDH, and Xylanase proteins. The overall accuracy is 77.47%, showing that the naïve Bayesian classifier has an expected accuracy of over 75%. About 83.68% of the mesophilic proteins in the test data are predicted as mesophilic; in other words, only 16.32% of the mesophilic proteins are predicted as thermophilic or hyperthermophilic. Of the proteins predicted as mesophilic, 87.70% are truly mesophilic, which shows that most mesophilic predictions of our model are correct. The specificity for mesophilic proteins is 52.90%, which means that about half of the non-mesophilic proteins are mistaken for mesophilic proteins.

The recall of 51.77% shows that 51.77% of the thermophilic proteins in the dataset are correctly predicted as thermophilic. The precision for thermophilic proteins is also not high, at 54.37%; that is, only 54.37% of the proteins predicted as thermophilic are truly thermophilic. The specificity for thermophilic proteins is 87.87%, which indicates that only 12.13% of the non-thermophilic proteins are mistaken for thermophilic proteins. This result indicates that ion-pairs may contribute less to the stability of thermophilic proteins; a similar observation is reported by Szilagyi and Zavodszky (2000). Since most proteins are not thermophilic, this 12.13% of the non-thermophilic proteins accounts for 45.63% (1 − 54.37%, the complement of the precision) of the thermophilic predictions.

Only 38.20% of the hyperthermophilic proteins in the test data are correctly predicted as hyperthermophilic, so more than half of the hyperthermophilic proteins are not identified by the naïve Bayesian classifier. However, the precision is as high as 78.64%, which means that most proteins predicted as hyperthermophilic are truly hyperthermophilic, as Figure 3 shows; the proportion of FP in Figure 3 is very small. Most non-hyperthermophilic proteins are not identified as hyperthermophilic, since the specificity is high. The results for the thermostability classes can be summarized as follows. The naïve Bayesian classifier may predict a hyperthermophilic or thermophilic protein as mesophilic; however, it seldom predicts a mesophilic protein as thermophilic or hyperthermophilic. Although some hyperthermophilic proteins are not identified, almost every protein predicted as hyperthermophilic is truly hyperthermophilic.

Table 3 presents similar experimental results obtained using the dataset consisting of Xylanase proteins only. The Xylanase dataset includes a total of 63 proteins, of which 4 are hyperthermophilic, 17 are thermophilic and 42 are mesophilic. The overall accuracy obtained using the Xylanase dataset is 82%, revealing that the naïve Bayesian classifier has an expected accuracy of about 80% when applied to a dataset consisting of a single functional class of proteins. The results in Table 3 are similar to those in Table 2. The recall for hyperthermophilic proteins is 81.75%, and both the precision and the specificity are 100%, which means that there are no FP proteins and every protein predicted as hyperthermophilic is truly hyperthermophilic. The recall for hyperthermophilic proteins is also much higher than that in Table 2. That is, our model has high precision and recall in predicting both mesophilic and hyperthermophilic proteins in the Xylanase dataset. Comparing the results in Table 2 and Table 3 indicates that our model is more precise for thermophilic proteins of a single functional family than for thermophilic proteins of three different functional families. Our model is also more accurate for hyperthermophilic proteins than for thermophilic and mesophilic proteins.

Discussions

The proposed model has very high specificity and high precision when applied to hyperthermophilic proteins. That is, most proteins predicted as hyperthermophilic are indeed hyperthermophilic. The results obtained for hyperthermophilic proteins indicate that the number of ion-pairs is strongly correlated with thermostability. Thermophilic proteins are associated with lower precision and recall. A previous study (Szilagyi and Zavodszky 2000) stated that ion-pairs may destabilize a protein at thermophilic temperatures, but previous works did not relate ion-pairs to thermophilic proteins through a general model. The lower precision for thermophilic proteins in this study supports the view that the number of ion-pairs may not be the key to the stability of thermophilic proteins.

A high overall accuracy of thermostability prediction was obtained by using the naïve Bayesian classifier, showing that the number of ion-pairs is related to the thermostability class of the protein, especially in the case of hyperthermophilic proteins. The prediction for a single functional family is clearly better than the prediction for three functional families. The naïve Bayesian classifier has high precision and recall in predicting both mesophilic and hyperthermophilic proteins in the Xylanase dataset. Although 18.25% of the hyperthermophilic proteins are not identified, every protein identified as hyperthermophilic in the Xylanase dataset is in fact hyperthermophilic. The naïve Bayesian classifier is therefore suited for use by biologists in identifying the hyperthermophilic proteins in a single functional family of proteins.

The dielectric constant (also called permittivity, DK and Er) is the characteristic of a material that determines the speed at which an electrical signal travels in that material. The dielectric constant varies with temperature and pressure.
For example, the dielectric constant of water at 273.16 K and 100 kPa (1 bar) is 80.20, while the dielectric constant of water at 400 K and 1000 kPa (10 bar) is 49.06. By Coulomb’s law, the electrostatic force between the two charged atoms of an ion-pair is proportional to the product of their charges and inversely proportional to both the dielectric constant and the square of the distance between them, so the strength of the ion-pair bond is related not only to the distance but also to the dielectric constant. Accordingly, the ion-pair features considered in this research can be improved by considering the dielectric constant. Most proteins are dissolved in water. However, the dielectric constant of water decreases sharply as the temperature increases. At fixed pressure, the decrease with increasing temperature is linear. Given a dielectric constant D and temperature T, the regression equation is D = -0.3232 T + 86.765 when the pressure is fixed at 100 kPa (1 bar) (Lubert 1995). The dielectric constant is related to the energy of the ion-pair, so increasing temperature may reduce the dielectric constant and cause the ion-pair to enter a low energy state. Research on calculating free energy also shows that an ion-pair has different meanings in different environments (Lubert 1995). Thus, the dielectric constant is related to temperature, and the dielectric constants at various temperatures must be considered, with reference to the ion-pair strength, in further work.

Considering the high dielectric constant of the solvent, which is often H2O, the solvent accessible surface area (ASA) of each protein should also be considered. Residues exposed to the solvent have different dielectric constants from buried residues and contribute different amounts of energy to the structure of the protein. The ASA of the protein must be calculated, and the ways in which the dielectric constant affects the stability of proteins can be addressed in further work.

Only intra-subunit ion-pairs, which are located inside a single subunit of a protein, are considered here. Ion-pairs between subunits may also influence the thermostability of proteins. The ion-pairs between subunits may have different dielectric constants and energies. These ion-pairs contribute not only to the thermostability but also to the coupling of different subunits (Lebbink et al. 1999).

The precision associated with hyperthermophilic proteins is high, so almost every protein predicted as hyperthermophilic is truly hyperthermophilic, revealing that this prediction process can be used as a screening tool by protein engineering scientists to reduce the cost and time of experimentation. The overall accuracy of the prediction process for Xylanase proteins is 82%. This result suggests that incorporating the functional category of the protein may improve the quality of prediction results.
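The temperature dependence of the ion-pair interaction discussed above can be illustrated with a short sketch. It rests on several assumptions that are not stated in the text: the temperature T in the regression D = -0.3232 T + 86.765 is taken to be in degrees Celsius (this choice gives values close to the dielectric constant of water near room temperature); the interaction energy is taken in the standard Coulomb form, proportional to q1·q2/(D·r); the constant 332.06 converts to kcal/mol for charges in elementary units and distances in Å; and the bulk-water dielectric constant is applied to the ion-pair, which is a crude simplification for residues inside a protein. The function names are hypothetical.

```python
# Coulomb constant in kcal * angstrom / (mol * e^2).
COULOMB_CONSTANT = 332.06


def dielectric_of_water(temp_celsius):
    """Regression from the text at 100 kPa; T assumed to be in Celsius."""
    return -0.3232 * temp_celsius + 86.765


def coulomb_energy(q1, q2, distance, dielectric):
    """Coulomb interaction energy (kcal/mol) of two point charges.

    q1, q2:     charges in units of the elementary charge
    distance:   separation in angstroms
    dielectric: dimensionless dielectric constant of the medium
    """
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * distance)


# A +1/-1 ion-pair at 4 A, evaluated at 25 C and at 100 C.
energy_25 = coulomb_energy(+1, -1, 4.0, dielectric_of_water(25.0))
energy_100 = coulomb_energy(+1, -1, 4.0, dielectric_of_water(100.0))
```

Under these assumptions the energy at 100°C is more negative than at 25°C, which agrees with the statement that a lower dielectric constant moves the ion-pair to a lower energy state.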

References

1. Barlow, D.J. and Thornton, J.M. 1983. Ion-pairs in proteins. J Mol Biol 168, 867-885.
2. Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., Fagan, P., Marvin, J., Padilla, D., Ravichandran, V., Schneider, B., Thanki, N., Weissig, H., Westbrook, J.D., and Zardecki, C. 2002. The Protein Data Bank. Acta Crystallogr D Biol Crystallogr 58, 899-907.
3. Burton, S.G. 2003. Oxidizing enzymes as biocatalysts. Trends Biotechnol 21, 543-549.
4. Cooper, G. 1989. Current Research Directions in The Development of Expert Systems Based on Belief Networks. Applied Stochastic Models and Data Analysis 5, 39-52.
5. Ding, Y. and Lawrence, C.E. 1999. A bayesian statistical algorithm for RNA secondary structure prediction. Comput Chem 23, 387-400.
6. Feinstein, A.R. 1977. Clinical biostatistics. XXXIX. The haze of Bayes, the aerial palaces of decision analysis, and the computerized Ouija board. Clin Pharmacol Ther 21, 482-496.
7. Frankes, W.B. and Baeza-Yates, R. 1992. Information retrieval: data structures and algorithms. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
8. Gracia, M.I., Latorre, M.A., Garcia, M., Lazaro, R., and Mateos, G.G. 2003. Heat processing of barley and enzyme supplementation of diets for broilers. Poult Sci 82, 1281-1291.
9. Gromiha, M.M., Oobatake, M., and Sarai, A. 1999. Important amino acid properties for enhanced thermostability from mesophilic to thermophilic proteins. Biophys Chem 82, 51-67.
10. Gromiha, M.M., Uedaira, H., An, J., Selvaraj, S., Prabakaran, P., and Sarai, A. 2002. ProTherm, Thermodynamic Database for Proteins and Mutants: developments in version 3.0. Nucleic Acids Res 30, 301-302.
11. Hatzivassiloglou, V., Duboue, P.A., and Rzhetsky, A. 2001. Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics 17 Suppl 1, S97-106.
12. Holte, R.C. 1993. Very Simple Classification Rules Perform Well on Most Commonly Used Dataset. In Machine Learning, pp. 63-91.
13. Huang, S.L., Wu, L.C., Liang, H.K., Pan, K.T., Horng, J.T., and Ko, M.T. 2004. PGTdb: a database providing growth temperatures of prokaryotes. To appear in Bioinformatics 20.
14. Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N.J., Chung, S., Emili, A., Snyder, M., Greenblatt, J.F., and Gerstein, M. 2003. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302, 449-453.
15. Jensen, F. 1996. An Introduction to Bayesian Networks. Springer Verlag, New York.
16. Kumar, S. and Nussinov, R. 1999. Salt bridge stability in monomeric proteins. J Mol Biol 293, 1241-1255.
17. Lebbink, J.H., Knapp, S., van der Oost, J., Rice, D., Ladenstein, R., and de Vos, W.M. 1999. Engineering activity and stability of Thermotoga maritima glutamate dehydrogenase. II: construction of a 16-residue ion-pair network at the subunit interface. J Mol Biol 289, 357-369.
18. Lubert, S. 1995. Biochemistry. W.H. Freeman Press, New York.
19. Madigan, M.T., Martinko, J.M., and Parker, J. 2000. Brock Biology of Microorganisms. Prentice-Hall Inc., New Jersey.
20. Mitchell, T. 1997. Machine Learning. McGraw-Hill Companies, New York.
21. Qu, K., McCue, L.A., and Lawrence, C.E. 1998. Bayesian protein family classifier. Proc Int Conf Intell Syst Mol Biol 6, 131-139.
22. Szilagyi, A. and Zavodszky, P. 2000. Structural differences between mesophilic, moderately thermophilic and extremely thermophilic protein subunits: results of a comprehensive survey. Structure Fold Des 8, 493-504.
23. Vogt, G., Woell, S., and Argos, P. 1997. Protein thermal stability, hydrogen bonds, and ion pairs. J Mol Biol 269, 631-643.
24. Voronov, S., Zueva, N., Orlov, V., Arutyunyan, A., and Kost, O. 2002. Temperature-induced selective death of the C-domain within angiotensin-converting enzyme molecule. FEBS Lett 522, 77-82.
25. Zhu, J., Liu, J.S., and Lawrence, C.E. 1998. Bayesian adaptive sequence alignment algorithms. Bioinformatics 14, 25-39.

Appendix

Protein PDB entries in our dataset

α-amylase (PDB ID):
1VIW 1QI4 1B2Y 1CLV 1QI3 1AVA 1JAE 1PPI 2CPU 1TMQ 1PIG 1E3X 7TAA 3CPU
1QHP 1PIF 1DHK 1QHO 1SMD 1CPU 1G1Y 1BIP 1BVN 1BVZ 1E43 1BPL
1QPK 1QI5 1BLI 1OSE 1E3Z 1E40 6TAA 1VJS

Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) (PDB ID):
1DC5 1A7K 2GD1 1DC6 1GYP 1CER 3GPD 1GYQ 1GD1
1NLG 3DBV 1GAE 1DSS 1NLH 4DBV 1SZJ 1GGA 1EUH 1B7G 1I33
2DBV 1CF2

Xylanase (PDB ID):
1E0X 1E0V 1E0W 1XAS 1B3V 1BK1 1B3Z 1B3Y
1B30 1B3X 1B3W 1B31 1BG4 1FH8 1HEH 1FH9
1YNA 1QLD 1C5H 1FHD 1DYO 1RED 1QH7 1E5B
1XYZ 1XYS 1XNB 1XBD 1TAX 1REF 1ENX 1XNC
1PVX 1XND 1QH6 1BVV 1I82 1E5N 2BVV 1C5I
1CLX 1E8R 1TIX 1F5J 2EXO 1XYF 1XYO 1XYP
1E5C 1HEJ 2XYL 2HIS 2XBD 1BCX 1I8A 1XYN
1FXM 1TUX 1I8U 1XYN 1GNY 1REE 1FH7

Figure 1. An example of the Bayesian belief network created for the prediction of protein thermostability based on the number, types, and bond distance of ion-pairs.

Figure 2. Graphical illustration of precision, recall/sensitivity and specificity of mesophilic proteins. The left circle (TP+FP) refers to the proteins that are predicted to be mesophilic. The right circle (TP+FN) represents the truly mesophilic proteins. The intersection of the two circles (TP) includes proteins correctly predicted to be mesophilic. The precision and recall are TP/(TP+FP) and TP/(TP+FN). The sensitivity and specificity are TP/(TP+FN) and TN/(TN+FP).

Figure 3. The precision of hyperthermophilic proteins is high, and most proteins predicted as hyperthermophilic proteins are really hyperthermophilic proteins.

Table 1. The proteins in the data set are functionally classified into α-amylase, GAPDH and Xylanase. The proteins in each functional category have three classes of thermostability: mesophilic, thermophilic, and hyperthermophilic. For instance, sixty-three proteins belong to Xylanase, of which 42 are mesophilic, 17 are thermophilic and four are hyperthermophilic.

Functional class                                    Mesophilic (20-45°C)   Thermophilic (45-80°C)   Hyperthermophilic (>80°C)   Total
α-amylase                                           27                     7                        0                           34
Glyceraldehyde-3-phosphate dehydrogenase (GAPDH)    17                     3                        2                           22
Xylanase                                            42                     17                       4                           63
Total                                               86                     27                       6                           119

Table 2. The precision, recall/sensitivity and specificity percentages of predicted mesophilic, thermophilic and hyperthermophilic proteins in the dataset consisting of three functional families. For example, 83.68% of mesophilic proteins are predicted correctly as mesophilic, and 87.7% of the proteins predicted as mesophilic are truly mesophilic.

Thermostability             Precision   Recall/Sensitivity   Specificity
Mesophilic (20-45°C)        83.68%      87.70%               52.90%
Thermophilic (45-80°C)      54.37%      51.77%               87.87%
Hyperthermophilic (>80°C)   78.64%      38.20%               99.47%

Table 3. Precision, recall/sensitivity and specificity percentages of predicted mesophilic, thermophilic, and hyperthermophilic proteins using the Xylanase dataset.

Thermostability             Precision   Recall/Sensitivity   Specificity
Mesophilic (20-45°C)        81.80%      93.08%               62.29%
Thermophilic (45-80°C)      77.70%      56.80%               93.82%
Hyperthermophilic (>80°C)   100%        81.75%               100%