Prediction of Protein Functions from Protein Interaction Networks: A Naïve Bayes Approach

Cao D. Nguyen1, Katheleen J. Gardiner2, Duong Nguyen3, and Krzysztof J. Cios1,4

1 Virginia Commonwealth University, USA
2 University of Colorado Denver, USA
3 Raytheon, USA
4 IITiS PAN, Poland
{cdnguyen,kcios}@vcu.edu

Abstract. Predicting protein functions is one of the most challenging problems in bioinformatics. Among several approaches, such as analyzing phylogenetic profiles, homologous protein sequences or gene expression patterns, methods based on protein interaction data are very promising. We propose here a novel method using Naïve Bayes which takes advantage of protein interaction network topology to improve low-recall predictions. Our method is tested on proteins from the Human Protein Reference Database (HPRD) and on the yeast proteins from BioGRID, and compared with other state-of-the-art approaches. Analyses of the results, using several methods that include ROC analyses, indicate that our method predicts protein functions with significantly higher recall without lowering precision.

Keywords: protein function, protein interaction networks, Bayes methods.

Supplementary Materials: www.egr.vcu.edu/cs/dmb/Bayesian.
T.-B. Ho and Z.-H. Zhou (Eds.): PRICAI 2008, LNAI 5351, pp. 788–798, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

To discover how proteins function within the living cell is one of the central goals for life scientists. Genome sequences have been published at a dramatic rate, but a large fraction of newly discovered genes have no functional characterization. For example, in a simple organism such as baker’s yeast, approximately one third of the proteins have no functional annotation. For more complex organisms, functional annotation is lacking for a much larger fraction of the proteome. Because experimental determination of protein function is expensive, successful predictive methods have an important role to play.

Several computational methods for protein function prediction have been developed. In the conceptually simplest approach, homologous proteins are identified in protein databases by using protein sequence similarity, and functions are assigned to the query protein based on the known functions of the matches [8]. Another approach [9] infers protein interactions from genomic sequences using the observation that some pairs of interacting proteins have homologs in another organism fused into a single protein chain; the functional relatedness of
some such protein pairs has been confirmed. Pellegrini et al. [1], and subsequently other groups [2,3,4,5], inferred functional similarity among proteins based on phylogenetic profiles of orthologous proteins. Grouping proteins by correlated evolution, correlated messenger RNA expression patterns and patterns of domain fusion has been successfully applied to yeast proteins [6]. Correlations between genes that have similar expression patterns are used to detect similar functions [7]. Other approaches integrate heterogeneous types of high-throughput biological data for protein function prediction. Bayesian reasoning has been used to combine large-scale yeast two-hybrid (Y2H) screens and multiple microarray analyses [10]. Gene functional annotations were identified from a combination of protein sequence and structure data by support vector machines [11]. While each approach has had some success, in general these methods are severely limited by low reliability.

Use of protein-protein interaction (PPI) data to annotate protein function has been extensively studied. Proteins interact with each other for a common purpose, and thus a protein may be annotated using information on the functions of its interaction neighbors. Large-scale protein physical interaction data have been generated by high-throughput experiments for worm [12], fly [13], yeast [14,15,16,17], and human [18]. Applied to the yeast PPI network, the Majority method assigns functions to a protein using the most frequent annotations among its nearest neighbors [15]. The drawback of the Majority method, however, is that some functions may have a very high frequency in the network but are not assigned if they do not occur in the nearest neighbor set. The Majority method was extended in [19] to predict functions by exploiting indirect neighbors and using a topological weight to estimate functional similarity.
Another approach [20] makes use of the χ² statistic by looking at all proteins within a specified radius, thus taking into account the frequency of all proteins having a particular function. However, the χ² statistic does not take into account the underlying topology of the PPI network. Global optimization approaches based on Markov random fields (MRF) and belief propagation in PPI networks [21,22,23] assign functions based on a probabilistic analysis of graph neighborhoods in the network. These methods assume that the probability distribution for the annotation of any node is conditionally independent of all other nodes, given its neighbors. The methods are sensitive to the neighborhood size and the parameters of the prior distribution. FunctionalFlow [24] considers each protein of known function as a source of functional flow for that function. The functional flow spreads through the neighborhoods of the sources. Proteins receiving the highest amount of flow of a function are assigned that function. This algorithm does not consider indirect flow of functions to other proteins after labeling the functions.

Other global approaches integrate PPI networks with more heterogeneous data sources. ClusFCM [25] assigns biological homology scores to interacting proteins in a PPI network and performs an agglomerative clustering on the weighted network to cluster the proteins by known functions and cellular location. Functions are then assigned to proteins by a fuzzy cognitive map technique. The MRF methods were extended by combining PPI data, gene expression data, protein motif information, mutant phenotype data, and protein localization data to specify which proteins might be active in a given biological process [26,27].
In this work we use the Naïve Bayes method in a way that takes into account the underlying topology of a PPI network. For each protein we analyze the predicted functions by using association rules to discover interesting relationships among the assigned functions, i.e., when one set of functions occurs in a protein then the protein may be annotated with an additional set of other specific functions at some confidence level. We test our method on the PPI networks of yeast and human proteins, and compare its performance with the Majority and χ² statistics methods.
2 Materials

Yeast Interaction Dataset

We use the yeast molecular interaction network from the BioGRID database [28] (release 4/2008, version 2.0.39). After eliminating direct interactions based solely on high-throughput Y2H assays because of noise levels [29], the yeast dataset includes 39,128 direct molecular interactions. The yeast dataset without Y2H comprises a total of 3,727 unique yeast proteins, of which 3,724 proteins are annotated with 3,000 distinct GO functions from the GO database [30]. Among the 3,000 GO terms there are 1,182 molecular functions, 494 cellular component functions, and 1,324 biological process functions.

Human Interaction Dataset

The human interaction data were obtained from the HPRD [31] (release 9/2007). The entire dataset contains 37,106 direct molecular interactions from three types of experiments (in vivo, in vitro and Y2H). There are 9,463 distinct proteins annotated with 424 GO functions in three categories: 201 molecular functions, 150 biological process functions and 73 cellular component functions. We limit the HPRD data by excluding direct interactions supported only by Y2H experiments. The HPRD dataset without Y2H comprises 28,148 interactions from in vivo and in vitro experiments. In this dataset, of the 411 GO functions annotating the 7,764 unique proteins, there are 195 molecular functions, 143 biological process functions and 73 cellular component functions. The statistics are shown in Table 1 in Supplementary Materials.
3 Methods

3.1 Definitions

The PPI network is described as an undirected graph G = (V, E), where V is a set of proteins and E is a set of edges, with an edge connecting proteins u and v if the corresponding proteins interact physically. We use the following notation.

K: the total number of proteins in the PPI network,
F: the whole GO function collection set,
|F|: the cardinality of the set F,
$f_i$: a function in the set F (i = 1..|F|),
$C_u$: the cluster coefficient of protein u,
$N_u$: the neighbor set of protein u (proteins interacting directly with protein u),
$N_u^{f_i}$: the number of proteins annotated with function $f_i$ in $N_u$,
$N_u^{\bar{f}_i}$: the number of proteins un-annotated with function $f_i$ in $N_u$, where $|N_u| = N_u^{f_i} + N_u^{\bar{f}_i}$.
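As an illustration of the notation above, the following Python sketch computes $N_u^{f_i}$, $N_u^{\bar{f}_i}$ and $C_u$ for a single protein. The toy network, the protein and function names, and the helper names `neighbor_counts` and `clustering_coefficient` are hypothetical; they are not taken from HPRD, BioGRID, or the Java implementation used in this work.

```python
from itertools import combinations

def neighbor_counts(adj, annotations, u, f):
    """Return (N_u^f, N_u^not-f): direct neighbors of u with / without function f."""
    with_f = sum(1 for v in adj[u] if f in annotations.get(v, set()))
    return with_f, len(adj[u]) - with_f

def clustering_coefficient(adj, u):
    """C_u: links observed among the neighbors of u over the maximum possible."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbors: no neighbor pair can be linked
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

# Hypothetical undirected toy network, stored as symmetric adjacency sets.
adj = {
    "p1": {"p2", "p3", "p4"},
    "p2": {"p1", "p3"},
    "p3": {"p1", "p2"},
    "p4": {"p1"},
}
# Hypothetical annotations: protein -> set of function labels.
annotations = {"p1": {"f"}, "p2": {"f"}, "p3": set(), "p4": {"f", "g"}}

a1, a2 = neighbor_counts(adj, annotations, "p1", "f")
c_u = clustering_coefficient(adj, "p1")
```

For protein `p1` the two counts sum to the size of its neighbor set, matching the identity $|N_u| = N_u^{f_i} + N_u^{\bar{f}_i}$.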
3.2 Predictive Modeling

Multinomial Naïve Bayes

We first explain the basic idea of our approach. For a function of interest $f_i$, we want to annotate the proteins in a PPI network with $f_i$. We pose the functional annotation problem as a classification problem. The training data are available in the form of observations $d \in \mathbb{R}^k$ (k dimensions) and their corresponding classes. For each protein u in the network, the class label for a function of interest $f_i$ is 1 if the protein u is annotated with $f_i$, and 0 otherwise. Below we discuss how to select features to deduce class information. Exploiting the fact that proteins of known functions tend to cluster together [15], the first feature we take into account, A1, is the number of proteins annotated with the function $f_i$ in the neighborhood set of protein u (i.e., A1 = $N_u^{f_i}$). The second feature, A2, is the number of proteins un-annotated with the function $f_i$ in the neighborhood set of the protein u (i.e., A2 = $N_u^{\bar{f}_i}$). Figure 1 illustrates an observation in the HPRD without Y2H data for the Acetylcholine acetylhydrolase protein (gene symbol: ACHE, HPRD id: 00010) with the function Extracellular region (GO:0005576, cellular component).

[Figure 1: interaction neighborhood of ACHE; node labels: HAND1, LAMA1, LGTN, COLQ, ACHE, PRIMA1, APP, COL4A1, LAMB1.]

Fig. 1. Proteins annotated with GO:0005576 are black. The observation for ACHE is A1=4, A2=4 and class=1 because there are 4 proteins annotated with GO:0005576 in the neighborhood set, the other 4 proteins are not annotated, and ACHE is itself annotated with GO:0005576.
Several studies indicate that other features can be useful to predict functions and drug targets for a protein, such as the number of functions annotated in proteins in the neighborhood set at level 2 of the protein [19,20], the connectivity (the total number of incoming and outgoing arcs of a protein, which is equal to $N_u^{f_i} + N_u^{\bar{f}_i}$), the betweenness (the number of times a node appears in the shortest path between two other nodes) and the clustering coefficient $C_u$ (the ratio of the actual number of direct connections between the neighbors of protein u to the maximum possible number of such direct arcs between its neighbors) [31]. We included those features in our experiments and also performed classification using a Radial Basis Function network, a Support Vector Machine and Logistic Regression (data not shown). To our surprise, the Multinomial Naïve Bayes using only three features, A1 = $N_u^{f_i}$, A2 = $N_u^{\bar{f}_i}$ and A3 = $C_u$, performed best. Next, we briefly explain how the Multinomial Naïve Bayes method is used in our study. If $d = \langle A_1, A_2, A_3 \rangle$ is an observation for a protein u, we decide the class membership for the observation d (corresponding to a function of interest $f_i$) by assigning d to the class with the maximal probability, computed as follows:

$$\mu(d, f_i) \propto \arg\max_{c \in \{0,1\}} \hat{P}(c \mid d, f_i) \propto \arg\max_{c \in \{0,1\}} \frac{\hat{P}(d \mid c, f_i)\,\hat{P}(c \mid f_i)}{\hat{P}(d \mid f_i)} \tag{1}$$
Since $\hat{P}(d \mid f_i)$ is the same for all classes, it can be ignored:

$$\mu(d, f_i) \propto \arg\max_{c \in \{0,1\}} \hat{P}(d \mid c, f_i)\,\hat{P}(c \mid f_i) \tag{2}$$
The likelihood $\hat{P}(d \mid c, f_i)$ is the probability of obtaining the observation d for a protein u in class c and is calculated as:

$$\hat{P}(d \mid c, f_i) = \left(N_u^{f_i} + N_u^{\bar{f}_i} + C_u\right)!\; \frac{\hat{P}(A_1 \mid c, f_i)^{N_u^{f_i}}}{N_u^{f_i}!}\; \frac{\hat{P}(A_2 \mid c, f_i)^{N_u^{\bar{f}_i}}}{N_u^{\bar{f}_i}!}\; \frac{\hat{P}(A_3 \mid c, f_i)^{C_u}}{C_u!} \tag{3}$$
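Thus equation (2) becomes:

$$\mu(d, f_i) \propto \arg\max_{c \in \{0,1\}} \left(N_u^{f_i} + N_u^{\bar{f}_i} + C_u\right)!\; \frac{\hat{P}(A_1 \mid c, f_i)^{N_u^{f_i}}}{N_u^{f_i}!}\; \frac{\hat{P}(A_2 \mid c, f_i)^{N_u^{\bar{f}_i}}}{N_u^{\bar{f}_i}!}\; \frac{\hat{P}(A_3 \mid c, f_i)^{C_u}}{C_u!}\; \hat{P}(c \mid f_i) \tag{4}$$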
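Since the factorials in equation (4) do not depend on the class c, they are constant in the maximization and we can rewrite the maximum a posteriori class as follows:

$$\mu(d, f_i) \propto \arg\max_{c \in \{0,1\}} \hat{P}(A_1 \mid c, f_i)^{N_u^{f_i}}\; \hat{P}(A_2 \mid c, f_i)^{N_u^{\bar{f}_i}}\; \hat{P}(A_3 \mid c, f_i)^{C_u}\; \hat{P}(c \mid f_i) \tag{5}$$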
Two key issues arise here. First, the problem of zero counts can occur when a given class and feature value never appear together in the training data. This is problematic because the resulting zero probabilities wipe out the information in all other probabilities. We use the Laplace correction to avoid the problem [32]. Second, in equation (5), the conditional probabilities are multiplied and this can result in a floating point underflow. Therefore, it is better to perform the computation by using logarithms of probabilities instead of simply multiplying the probabilities. Equation (5) can be calculated as:

$$\mu(d, f_i) \propto \arg\max_{c \in \{0,1\}} \exp\!\left[ N_u^{f_i} \log \hat{P}(A_1 \mid c, f_i) + N_u^{\bar{f}_i} \log \hat{P}(A_2 \mid c, f_i) + C_u \log \hat{P}(A_3 \mid c, f_i) \right] \hat{P}(c \mid f_i) \tag{6}$$
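The log-space decision rule of equation (6) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names `log_score` and `posterior_annotated`, all parameter values, and the value of $C_u$ are hypothetical. The sketch also moves the prior inside the logarithm, which leaves the maximizing class unchanged, and normalizes the two class scores so that a posterior for class 1 is obtained.

```python
import math

def log_score(d, likelihoods, prior):
    """Log of the class score in equation (5) for an observation d = (A1, A2, A3)."""
    n_with, n_without, c_u = d
    p_a1, p_a2, p_a3 = likelihoods
    return (n_with * math.log(p_a1)
            + n_without * math.log(p_a2)
            + c_u * math.log(p_a3)
            + math.log(prior))

def posterior_annotated(d, likelihoods_by_class, prior_by_class):
    """Normalized P(c = 1 | d, f_i), computed from the two log scores."""
    scores = {c: log_score(d, likelihoods_by_class[c], prior_by_class[c])
              for c in (0, 1)}
    top = max(scores.values())
    # Subtracting the larger score before exponentiating avoids underflow to 0/0.
    weights = {c: math.exp(s - top) for c, s in scores.items()}
    return weights[1] / (weights[0] + weights[1])

# Hypothetical parameter values, chosen only for illustration.
likelihoods_by_class = {1: (0.6, 0.3, 0.1), 0: (0.2, 0.7, 0.1)}
prior_by_class = {1: 0.1, 0: 0.9}

d = (4, 4, 0.25)  # A1 = 4, A2 = 4, and a made-up clustering coefficient
p1 = posterior_annotated(d, likelihoods_by_class, prior_by_class)
predicted_class = 1 if p1 > 0.5 else 0
```

With only two classes, comparing the normalized posterior with 0.5 is equivalent to taking the arg max in equation (6).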
The parameters of the model, in our case $\hat{P}(A_1 \mid c, f_i)$, $\hat{P}(A_2 \mid c, f_i)$, $\hat{P}(A_3 \mid c, f_i)$ and $\hat{P}(c \mid f_i)$, can be estimated as follows:

$$\hat{P}(A_1 \mid c = 1, f_i) = \frac{\sum_u N_u^{f_i} + lc_1}{\sum_u \left(N_u^{f_i} + N_u^{\bar{f}_i} + C_u\right) + lc_1}, \quad \text{where } u \in \{\text{proteins annotated with } f_i\} \tag{7}$$
$$\hat{P}(A_1 \mid c = 0, f_i) = \frac{\sum_u N_u^{f_i} + lc_1}{\sum_u \left(N_u^{f_i} + N_u^{\bar{f}_i} + C_u\right) + lc_1}, \quad \text{where } u \in \{\text{proteins not annotated with } f_i\} \tag{8}$$

$$\hat{P}(c = 1 \mid f_i) = \frac{|\{\text{proteins annotated with } f_i\}| + lc_1}{K + lc_2} \tag{9}$$

$$\hat{P}(c = 0 \mid f_i) = \frac{|\{\text{proteins not annotated with } f_i\}| + lc_1}{K + lc_2} \tag{10}$$
where $lc_1 = 1$ and $lc_2 = 2$ are the Laplace corrections; the parameters for the attributes A2 and A3 are estimated similarly.

Below we briefly describe the Majority and χ² statistics methods used for comparison with the Multinomial Naïve Bayes method.

Majority: For each protein u in a PPI network, we count the number of times each function $f_i \in F$ occurs in the neighbors of the protein u. The functions with the highest frequencies are assigned to the query protein u.

χ² statistics: For each function of interest $f_i$ we derive the fraction $\pi_{f_i}$ (number of proteins annotated with function $f_i$ / K). Then, we calculate $e_{f_i}$ as the expected number of neighbors of a query protein u annotated with $f_i$: $e_{f_i} = |N_u|\,\pi_{f_i}$. The query protein u is annotated with the function with the highest χ² value among the functions of all proteins in its neighborhood, where $\chi^2 = (N_u^{f_i} - e_{f_i})^2 / e_{f_i}$.

3.3 Assessment of Prediction

We use the leave-one-out method to evaluate the predictions made by each method. For each query protein u in a PPI network we assume that it is un-annotated. Then, we use the above methods to deduce the protein functions for protein u. Let A be the annotated function set and P be the predicted function set. We calculate the number of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) as follows: TP = |A ∩ P|, FP = |P \ A|, FN = |A \ P| and TN = |F \ (A ∪ P)|. The following measures are used for assessing the performance of the methods: precision, recall, Matthews correlation coefficient (MCC) and harmonic mean (HM) [33,35,36,37].
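The set-based counts above, together with the four performance measures, can be sketched in Python as follows. The formulas for precision, recall, HM (taken here as the harmonic mean of precision and recall) and MCC are the standard textbook definitions and are an assumption, since they are not written out in this paper; the function names and the toy sets are hypothetical.

```python
import math

def confusion_counts(annotated, predicted, all_functions):
    """TP, FP, FN, TN for one query protein, from the sets A, P and F."""
    tp = len(annotated & predicted)
    fp = len(predicted - annotated)
    fn = len(annotated - predicted)
    tn = len(all_functions - (annotated | predicted))
    return tp, fp, fn, tn

def performance_measures(tp, fp, fn, tn):
    """Precision, recall, HM and MCC (standard definitions, assumed here)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    hm = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, hm, mcc

# Hypothetical example: ten functions in F, three annotated, four predicted.
F = {f"f{i}" for i in range(1, 11)}
A = {"f1", "f2", "f3"}
P = {"f2", "f3", "f4", "f5"}

counts = confusion_counts(A, P, F)
precision, recall, hm, mcc = performance_measures(*counts)
```

How the per-protein counts are aggregated over all query proteins in the leave-one-out loop is not specified in the text, so the sketch stops at a single protein.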
4 Results and Discussion

We implemented the Multinomial Naïve Bayes, Majority and χ² statistics methods in Java and tested them on three datasets: Yeast without Y2H, HPRD, and HPRD without Y2H. We examine predictions based on the entire GO function set, and separately for each of the three categories: biological process, cellular component and molecular function. To compare the performance of our method we use implicit thresholds, namely, we normalize the posterior probability of a query protein u being annotated with the function $f_i$, $P(c = 1 \mid d, f_i)$, and decide that the protein u is annotated with the function $f_i$ if the normalized $P(c = 1 \mid d, f_i) > \tau$, where τ assumes a value between 0 and 1, in increments of 0.1.
In our method, we assume that a newly annotated protein can pass its newly acquired function(s) to its direct neighbors. Thus, the method is repeated in two iterations. In the second iteration, to calculate the value A1 = $N_u^{f_i}$ for a protein u, we count both the proteins in its neighborhood annotated with $f_i$ and those predicted to have $f_i$ in the first iteration. For the Majority and χ² statistics methods we select the top k functions having the highest scores (k ranges from 0, 1, … to 20) and assign these functions to the query protein. For each method, we choose the threshold which yields the highest HM measure. Interestingly, we found that with the selected thresholds the MCC values also achieve the highest value for each method. Figure 2 shows the relationship between precision and recall using different thresholds for the normalized probabilities of query proteins in the HPRD without Y2H dataset. The thresholds resulting in the highest HM measures in the Yeast without Y2H, HPRD and HPRD without Y2H datasets are .2, .3, and .3, respectively. Table 2 in Supplementary Materials shows the performance of the algorithm. Note that functional annotations for the proteins are incomplete at present. Therefore, a protein may have a function that has not yet been experimentally verified. We wish to decrease the number of annotated functions that are not predicted and increase the number of predicted functions that are actually annotated. The fact that the values of recall are always higher than the values of precision in all datasets increases confidence in our method.

[Figure 2: precision and recall (percentage, 0–100) plotted against threshold (0–1); panel c) HPRD without two-hybrid network.]

Fig. 2. Precision and recall results for the multinomial Naïve Bayes prediction on the HPRD without Y2H dataset
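The threshold selection described above (sweep τ from 0 to 1 in steps of 0.1 and keep the value with the highest HM) can be sketched in Python as follows. This is only an illustration: it assumes that precision and recall are pooled over all (protein, function) pairs, which the text does not specify, and the function names and posterior values are hypothetical.

```python
def harmonic_mean(precision, recall):
    """HM of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def precision_recall_at(posteriors, truth, tau):
    """Pooled precision and recall when predicting every pair with posterior > tau."""
    predicted = {pair for pair, p in posteriors.items() if p > tau}
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

def best_threshold(posteriors, truth):
    """Threshold in {0.0, 0.1, ..., 1.0} with the highest HM (first one on ties)."""
    taus = [i / 10 for i in range(11)]
    return max(taus,
               key=lambda t: harmonic_mean(*precision_recall_at(posteriors, truth, t)))

# Hypothetical normalized posteriors P(c = 1 | d, f_i) per (protein, function) pair.
posteriors = {
    ("a", "f1"): 0.95,
    ("a", "f2"): 0.35,
    ("b", "f1"): 0.25,
    ("b", "f2"): 0.05,
}
truth = {("a", "f1"), ("a", "f2")}  # hypothetical annotated pairs

tau_star = best_threshold(posteriors, truth)
```

Because `max` returns the first maximizer, ties between thresholds are broken in favor of the smaller τ; this tie-breaking rule is a choice of the sketch, not of the paper.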
The results of Naïve Bayes in the three categories (biological process, cellular component and molecular function) are shown in Figures 3, 4 and 5 for the three datasets (see Supplementary Materials). We observe that in the cellular component and biological process categories, the Naïve Bayes performs better than in the molecular function category. This is consistent with the view that PPI networks are more reflective of cellular component and biological process. Performance measures of our method at the selected thresholds are shown in Table 3 in Supplementary Materials, for each category. The comparison of the Multinomial Naïve Bayes method with Majority and χ² statistics is shown in Figure 6. It
shows that for any given precision, the recall of Naïve Bayes is higher than the recalls of the other methods. At the selected thresholds (those of Majority are 8, 3, 4 and those of χ² statistics are 13, 10, 10 for the Yeast without Y2H, HPRD and HPRD without Y2H networks, respectively) the performance measures of the three methods are shown in Table 4. A comparison of the ROC curves is shown in Figure 7 in Supplementary Materials. The closer the curve follows the top left-hand area of the ROC space, the more accurate the classifier. A random classifier would have its ROC curve lying along the diagonal line connecting points (0,0) and (1,1). The Naïve Bayes performs consistently better on the three datasets for functional prediction.

Next, we ask the question: if the predicted functions {fX} appear together in a protein, can we derive other functions {fY}? To answer this question we use association rule learning [38] to discover potentially interesting relationships between functions in PPI networks. Association rules are statements of the form {fX} => {fY}, meaning that if we find all of {fX} in a protein, we have a good chance of finding {fY} with some user-specified confidence (the probability of finding {fY} given {fX}) and support (the proportion of proteins containing functions {fX} in the entire network). With 0.1% support and 75% confidence, we found 837 association rules in Yeast without Y2H, 1,504 in HPRD, and 1,837 in HPRD without Y2H (see Table 5 in Supplementary Materials for details). We derive new functions from the predicted functions of the three methods by using the mined rules and three axioms [39]: 1. if X ⊇ Y then X → Y, 2. if X → Y then XZ → YZ for any Z, and 3. if X → Y and Y → Z then X → Z. Interestingly, the performance measures for Majority and χ² statistics improved, as can be seen in Table 6 (see Supplementary Materials), while this is not the case for the Naïve Bayes. Based on that observation we believe that our method is able to find hidden correlations among functions.
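The support and confidence of a single candidate rule {fX} => {fY}, and the use of a rule to extend a predicted function set, can be sketched in Python as follows. The sketch follows the definition of support given above (the proportion of proteins containing {fX}) and evaluates one rule by direct counting; it is not the rule mining algorithm of [38]. The function names, protein names and function labels are hypothetical.

```python
def rule_stats(protein_functions, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_x = [fs for fs in protein_functions.values() if antecedent <= fs]
    with_xy = [fs for fs in with_x if consequent <= fs]
    n_total = len(protein_functions)
    # Support as defined in the text: share of proteins containing the antecedent.
    support = len(with_x) / n_total if n_total else 0.0
    confidence = len(with_xy) / len(with_x) if with_x else 0.0
    return support, confidence

def apply_rule(predicted, antecedent, consequent):
    """Add the consequent to a predicted set when it contains the whole antecedent."""
    predicted = set(predicted)
    if set(antecedent) <= predicted:
        return predicted | set(consequent)
    return predicted

# Hypothetical annotations: protein -> set of function labels.
protein_functions = {
    "p1": {"fa", "fb", "fc"},
    "p2": {"fa", "fb", "fc"},
    "p3": {"fa", "fb"},
    "p4": {"fa"},
    "p5": {"fd"},
}

support, confidence = rule_stats(protein_functions, {"fa", "fb"}, {"fc"})
extended = apply_rule({"fa", "fb"}, {"fa", "fb"}, {"fc"})
```

In this toy example the rule would fail a 75% confidence cut-off, so it would not be used to derive new functions; `apply_rule` only shows what applying an accepted rule looks like.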
As mentioned above, the functional annotation of proteins is incomplete, particularly for human protein data. This suggests the possibility that predicted functions that are now counted as false positives may actually be yet-to-be-discovered true positives. We list in Table 7 in Supplementary Materials all the proteins from the HPRD without Y2H network classified with functions at very high probabilities (>.9) but termed as “false” at present.

[Figure 6: precision (0–100) plotted against recall (0–100) for the Majority, Chi-square and Bayesian methods; panel c) HPRD without two-hybrid network.]

Fig. 6. Precision and recall of the three methods on the HPRD without Y2H network
Table 4. Performance of the three methods on three datasets using leave-one-out validation; (1): Multinomial Naïve Bayes, (2): Majority, (3): χ² statistics

             Yeast without Y2H       HPRD                    HPRD without Y2H
             (1)    (2)    (3)       (1)    (2)    (3)       (1)    (2)    (3)
Precision    0.29   0.17   0.13      0.47   0.42   0.20      0.49   0.47   0.22
Recall       0.54   0.15   0.31      0.58   0.25   0.51      0.62   0.30   0.53
MCC          0.40   0.16   0.20      0.52   0.32   0.31      0.55   0.37   0.33
HM           0.38   0.16   0.19      0.52   0.32   0.28      0.55   0.37   0.31
5 Conclusions

We introduce a novel method based on the Multinomial Naïve Bayes for protein function prediction. Our algorithm uses a global optimization approach that takes into account several characteristics of interaction networks: direct and indirect interactions, underlying topology (cluster coefficients), and functional protein clustering. We have shown the robustness of our method by testing it on three interaction datasets using leave-one-out cross-validation. Results show that the Multinomial Naïve Bayes consistently outperforms the Majority and the χ² statistics methods for prediction of protein functions. In addition, we discovered hidden relationships among the predicted functions by using association rule learning; we believe that it finds new functions of proteins.

Acknowledgments. The authors acknowledge support from the National Institutes of Health (1R01HD05235-01A1) (KG and KC) and the Vietnamese Ministry of Education (CN).
References

1. Pellegrini, M., Marcotte, E., Thompson, M., Eisenberg, D., Yeates, T.: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA 96, 4285–4288 (1999)
2. Bowers, P., Cokus, S., Eisenberg, D., Yeates, T.: Use of logic relationships to decipher protein network organisation. Science 306, 2246–2259 (2004)
3. Pagel, P., Wong, P., Frishman, D.: A domain interaction map based on phylogenetic profiling. J. Mol. Biol. 344, 1331–1346 (2004)
4. Sun, J., Xu, J., Liu, Z., Liu, Q., Zhao, A., Shi, T., Li, Y.: Refined phylogenetic profiles method for predicting protein–protein interactions. Bioinformatics 21, 3409–3415 (2005)
5. Ranea, J., Yeats, C., Grant, A., Orengo, C.: Predicting Protein Function with Hierarchical Phylogenetic Profiles: The Gene3D Phylo-Tuner Method Applied to Eukaryotic Genomes. PLoS Comput. Biol. 3(11), e237 (2007)
6. Marcotte, E., Pellegrini, M., Thompson, M., Yeates, T., Eisenberg, D.: A combined algorithm for genome-wide prediction of protein function. Nature 402, 83–86 (1999)
7. Zhou, X., Kao, M., Wong, W.: Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl. Acad. Sci. USA 99, 12783–12788 (2002)
8. Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997)
9. Marcotte, E., Pellegrini, M., Ng, H., Rice, D., Yeates, T., Eisenberg, D.: Detecting protein function and protein–protein interactions from genome sequences. Science 285, 751–753 (1999)
10. Troyanskaya, O., Dolinski, K., Owen, A., Altman, R., Botstein, D.: A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl. Acad. Sci. USA 100, 8348–8353 (2003)
11. Lewis, D., Jebara, T., Noble, W.: Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure. Bioinformatics 22, 2753–2760 (2006)
12. Li, S., et al.: A map of the interactome network of the metazoan C. elegans. Science 303, 540–543 (2004)
13. Giot, L., et al.: A protein interaction map of Drosophila melanogaster. Science 302, 1727–1736 (2003)
14. Fromont-Racine, M., et al.: Toward a functional analysis of the yeast genome through exhaustive Y2H screens. Nat. Genet. 16, 277–282 (1997)
15. Schwikowski, B., Uetz, P., Fields, S.: A network of protein-protein interactions in yeast. Nature Biotechnology 18, 1257–1261 (2000)
16. Uetz, P., et al.: A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature 403, 623–627 (2000)
17. Ho, Y., et al.: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415, 180–183 (2002)
18. Peri, S., et al.: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Research 13, 2363–2371 (2003)
19. Chua, H., Sung, W., Wong, L.: Exploiting indirect neighbours and topological weight to predict protein function from protein–protein interactions. Bioinformatics 22, 1623–1630 (2006)
20. Hishigaki, H., Nakai, K., Ono, T., Tanigami, A., Takagi, T.: Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 18, 523–531 (2001)
21. Deng, M., Zhang, K., Mehta, S., Chen, T., Sun, F.: Prediction of protein function using protein-protein interaction data. Journal of Computational Biology 10, 947–960 (2003)
22. Letovsky, S., Kasif, S.: Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics 19, 197–204 (2003)
23. Vazquez, A., Flammi, A., Maritan, A., Vespignani, A.: Global protein function prediction from protein-protein interaction networks. Nature Biotechnology 21(6), 697–670 (2003)
24. Nabieva, E., Jim, K., Agarwal, A., Chazelle, B., Singh, M.: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 21, 302–310 (2005)
25. Nguyen, C., Mannino, M., Gardiner, K., Cios, K.: ClusFCM: An algorithm for predicting protein functions using homologies and protein interactions. J. Bioinform. Comput. Biol. 6(1), 203–222 (2008)
26. Deng, M., Chen, T., Sun, F.: An integrated probabilistic model for functional prediction of proteins. Journal of Computational Biology 11, 463–475 (2004)
27. Nariai, N., Kolaczyk, E.D., Kasif, S.: Probabilistic Protein Function Prediction from Heterogeneous Genome-Wide Data. PLoS ONE 2(3), e337 (2007)
28. Stark, C., Breitkreutz, B.J., Reguly, T., Boucher, L., Breitkreutz, A., Tyers, M.: BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 34, D535–D539 (2006)
29. Sprinzak, E., Sattath, S., Margalit, H.: How reliable are experimental protein–protein interaction data? Journal of Molecular Biology 327, 919–923 (2003)
30. Ashburner, M., et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000)
31. Yao, L., Rzhetsky, A.: Quantitative systems-level determinants of human genes targeted by successful drugs. Genome Res. 18(2), 206–213 (2008)
32. Niblett, T.: Constructing decision trees in noisy domains. In: Proceedings of the Second European Working Session on Learning, pp. 67–78. Sigma, Bled, Yugoslavia (1987)
33. van Rijsbergen, C.: Information retrieval: theory and practice. In: Proceedings of the Joint IBM/University of Newcastle upon Tyne Seminar on Data Base Systems, pp. 1–14 (1979)
34. Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16, 412–424 (2000)
35. Kurgan, L., Cios, K., Scott, D.: Highly Scalable and Robust Rule Learner: Performance Evaluation and Comparison. IEEE Transactions on Systems Man and Cybernetics, Part B 36(1), 32–53 (2006)
36. Cios, K., Kurgan, L.: CLIP4: Hybrid Inductive Machine Learning Algorithm that Generates Inequality Rules. Information Sciences 163(1-3), 37–83 (2004)
37. Cios, K., Pedrycz, W., Swiniarski, R., Kurgan, L.: Data Mining: A Knowledge Discovery Approach. Springer, Heidelberg (2007)
38. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules Between Sets of Items in Large Databases. In: SIGMOD Conference, pp. 207–216 (1993)
39. Armstrong, W.: Dependency Structures of Data Base Relationships. In: Information Processing 74. North Holland, Amsterdam (1974)