Computers in Biology and Medicine 43 (2013) 1196–1204
Improving protein complex classification accuracy using amino acid composition profile
Chien-Hung Huang a, Szu-Yu Chou a, Ka-Lok Ng b,*
a Department of Computer Science and Information Engineering, National Formosa University, 64, Wen-Hwa Road, Hu-wei, Yun-Lin 632, Taiwan
b Department of Biomedical Informatics, Asia University, 500 Lioufeng Road, Wufeng Shiang, Taichung 41354, Taiwan
Article info
Article history: Received 28 October 2012; Accepted 30 May 2013
Keywords: Protein complex; Protein–protein interaction; Gene Ontology; Sequence alignment; Physicochemical property; Hydrophobic; Hydrophilic; Amino acid composition profile; Machine learning method

Abstract
Protein complex prediction approaches are based on the assumptions that complexes have dense protein–protein interactions and high functional similarity between their subunits. We investigated those assumptions by studying the subunits' interaction topology, sequence similarity and molecular function for human and yeast protein complexes. Inclusion of amino acids' physicochemical properties can provide better understanding of protein complex properties. Principal component analysis is carried out to determine the major features. Adopting amino acid composition profile information with the SVM classifier serves as an effective post-processing step for complexes classification. Improvement is based on primary sequence information only, which is easy to obtain. © 2013 Elsevier Ltd. All rights reserved.
* Corresponding author. Tel.: +886 423 394541. E-mail address: [email protected] (K.-L. Ng).
0010-4825/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.compbiomed.2013.05.026

1. Introduction
It is known that protein complexes are involved in many biological processes. Some of the well-known protein complexes are enzyme–inhibitor complexes, antibody–protein complexes, and protein–receptor complexes [1]. Enzyme–inhibitor complexes include trypsin-like serine proteinases and subtilisins (PDB code 2six), antibody–protein complexes include immunoglobulin FAB complexed with lysozyme (PDB code 2hfl), and protein–receptor complexes include human growth hormone, hGHbp (PDB code 3hhr). The spoke model hypothesizes that all the subunits inside the complex directly interact with the bait protein, whereas the matrix model assumes all possible interacting pairs among the complex's subunits [2–5]. The correctness of these two models is still an open question and needs further investigation. Subunits refer to the protein constituents of a protein complex.
Recent experimental studies indicate that a protein complex can be visualized as a unit composed of cores, modules and attachments [6,7]. Core proteins are proteins that have comparatively more interactions among themselves and belong to a unique protein complex [8,9]. Attachment proteins bind to the core proteins with relatively fewer interactions among them. Module proteins are a subset of the attachment proteins that are always present together, and module proteins can be present in more than one complex. A recent study has suggested that the prediction of protein complexes based on the core-attachment model can achieve better performance than graphical approaches [10]. Furthermore, it is reported that subunits of a complex tend to have highly correlated gene expression patterns [11].
In this study we propose to characterize human and yeast protein complexes by adopting protein–protein interaction (PPI) data. This allows us to quantify the interaction topology among the subunits of a complex. It is known that many protein complex prediction calculations are based on the identification of pseudocliques [12–15] and dense PPI regions [16–19]. Our study may serve as a test of whether a protein complex is composed of highly interacting subunits. Secondly, the average of the pairwise sequence similarity, i.e. the bit score, of subunits inside a protein complex will be computed. This number can be used to characterize the overall sequence similarity of a complex. Thirdly, the Jaccard index (JI) of the Gene Ontology (GO) annotation is computed; here the molecular function descriptions for protein subunits are used.
In our previous study [20] it was conjectured that prediction approaches based on the assumption that complexes are composed of highly PPI dense regions can predict a rather limited number of complexes. In this study we propose characterizing protein complexes by considering their physicochemical properties. Amino acids’ physicochemical properties are used in characterizing PPI interfaces for protein complexes [21]. Also, there has
been an attempt to use physicochemical properties in detecting remote protein homology [22], with successful results. In a previous work [23], it was suggested that pI and sequence length could be used to help predict the probability that a protein belongs to a particular complex. AAindex is a database [24] that collects various physicochemical and biochemical properties of amino acids. AAindex (version 9.1) documents a list of 544 amino acid indices. A recent work [25] proposed the use of fuzzy clustering techniques to categorize these 544 indices into three high quality subsets, and demonstrated the effectiveness of this approach for prediction of post-translational modification sites. Many physicochemical property calculations require secondary or tertiary structure information, which limits the usefulness of such approaches. Here we propose to consider the following physicochemical properties of a complex: the composition profile of the 20 amino acids, hydrophobicity, hydrophilicity, pI value, and subunit sequence length. The numerical values of these properties are derived from the primary sequence information, which is much easier to access. For instance, the ExPASy tool ProtParam [26] computes the numeric physicochemical properties of a protein using sequence data only. Therefore, instead of trying to predict complexes from PPI data only, the major objective of the present study is to identify important physicochemical parameters for protein complex classification. It is proposed that the results of this work will be helpful in improving the accuracy of protein complex classifications. This is achieved by post-processing protein complex prediction results. For the purpose of protein complex classification, physicochemical parameters are used to construct the feature vectors that are used to train a support vector machine (SVM) [27], neural networks (NN), a decision tree (DT) and a naïve Bayes classifier (NBC) [28].
Principal component analysis (PCA) is a useful technique in bioinformatics; it reduces the dimensionality of the original data set [29] and improves performance by removing correlations among the feature variables. To identify the major features or capture the contribution due to physicochemical properties, PCA [30] is adopted to determine the major feature spaces (a space spanned by the linear combination of the original features) before using the machine learning classifiers. PCA has been presented as a feature selection method [31–33] for extracting a reduced set of feature variables, which preserve the main features of the whole data set. This approach has found applications in corn fungi detection [34], machine defect classification [35], and image classification [36,37]. Once the major feature spaces are determined, they are used to train the above four machine learning methods. This is followed by a ten-fold cross-validation test to validate the classification accuracy based on the major feature spaces.
2. Methods
2.1. Interaction topology of protein complex subunits
A total of 1818 protein complexes were retrieved from MIPS [38] for human; for yeast, a total of 1643 protein complexes were retrieved from Bond [39], and 491 complexes from a database maintained by a group of scientists at Cellzome AG and the European Molecular Biology Laboratory in Heidelberg, Germany [http://yeast-complexes.russelllab.org/]. This database is denoted as ‘Yeast’ in our study. Protein subunits’ accession numbers are labeled according to the gene index for the protein. A topological parameter is defined to test whether or not protein complexes are found in PPI dense regions [20]. This parameter is the density of interaction, which describes the experimentally recorded PPI among the subunits of a protein complex relative to the maximum possible PPI (i.e. clique). Given a protein complex with N subunits, there can be $N(N+1)/2$ possible PPIs, including self-interaction. The density of PPI, ρ, among the subunits of a protein complex, is then given by
\[ \rho = \frac{2s}{N(N+1)} \times 100\% \tag{1} \]
where s is the observed number of PPIs among the subunits. PPI data are obtained from the BioGrid database [40].
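As an illustration of Eq. (1), the following minimal Python sketch computes ρ for one complex. The function name `ppi_density` and the representation of interactions as pairs of protein identifiers are assumptions made for this example; they are not taken from the original study.

```python
def ppi_density(subunits, interactions):
    """Return the PPI density rho of Eq. (1), in percent.

    subunits     -- iterable of protein identifiers forming one complex
    interactions -- iterable of (protein_a, protein_b) pairs (hypothetical format)
    """
    members = set(subunits)
    n = len(members)
    if n == 0:
        raise ValueError("a complex needs at least one subunit")
    # Keep only interactions whose partners both belong to the complex.
    # frozenset makes (a, b) and (b, a) the same pair; a self-interaction
    # (a, a) collapses to a one-element set and is counted once.
    observed = {frozenset(pair) for pair in interactions if set(pair) <= members}
    s = len(observed)
    # Maximum number of pairs including self-interactions is N(N+1)/2.
    return 2.0 * s / (n * (n + 1)) * 100.0


example = ppi_density(
    ["A", "B", "C"],
    [("A", "B"), ("B", "A"), ("B", "C"), ("C", "C"), ("A", "X")],
)
print(example)  # 50.0
```

In this toy case three distinct interactions are observed out of a maximum of six, so ρ is 50%. The pair involving the outside protein X is ignored.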
2.2. Sequence bit score of protein complex subunits
An all-against-all pairwise sequence alignment is performed using the BLAST program. Output files reported by the BLAST program for all the protein complexes are parsed and the bit score value for each complex is kept for further analysis. The average of the bit score of a complex D, $I_D$, is defined by
\[ I_D = \frac{2}{N(N-1)} \sum_{i<j \in D} I_{ij} \tag{2} \]
where $I_{ij}$ denotes the bit score values reported by BLAST, and i and j are labels (i = 1, …, N−1) which denote the complex subunits. Since the average bit score value $I_D$ varies from complex to complex, a normalized index V is introduced to represent the complex. Given a property, complex D has an average value d over its subunits; in other words, d is a generalized symbol for the average value of a property, such as the bit score or any other physicochemical property. Let V(D) represent the normalized computed index for a complex D, which is defined by
\[ V(D) = \frac{d - \min(D)}{\max(D) - \min(D)} \tag{3} \]
where max and min correspond to the maximum and minimum operations, respectively. The max and min operations do not run over the subunit indices i and j but over the complete set of protein complexes. It is noted that V(D) lies between 0 and 1 for a property. Although there is concern that the use of normalized values may filter out information, we demonstrated that the use of normalized values resulted in better classification accuracy (see Appendix Table 1).
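Eqs. (2) and (3) can be sketched as follows. This is only an illustration: the helper names, the dictionary of made-up bit scores, and the choice to return 0.0 when all complexes share the same value are assumptions, not details given in the text.

```python
from itertools import combinations


def average_pairwise(subunits, score):
    """Average of score(i, j) over all unordered subunit pairs, as in Eq. (2)."""
    pairs = list(combinations(subunits, 2))  # N(N-1)/2 pairs with i < j
    if not pairs:
        raise ValueError("at least two subunits are needed")
    return sum(score(a, b) for a, b in pairs) / len(pairs)


def min_max_normalize(averages):
    """Map each complex average d to V(D) of Eq. (3).

    averages -- dict: complex name -> average value d
    The min and max run over all complexes, not over subunits.
    """
    lo, hi = min(averages.values()), max(averages.values())
    if hi == lo:
        # Assumption: a constant property is mapped to 0.0 for every complex.
        return {name: 0.0 for name in averages}
    return {name: (d - lo) / (hi - lo) for name, d in averages.items()}


# Hypothetical bit scores for one complex with three subunits.
bit_scores = {
    frozenset(("P1", "P2")): 100.0,
    frozenset(("P1", "P3")): 200.0,
    frozenset(("P2", "P3")): 600.0,
}
i_d = average_pairwise(["P1", "P2", "P3"], lambda a, b: bit_scores[frozenset((a, b))])
print(i_d)  # 300.0
print(min_max_normalize({"D1": i_d, "D2": 50.0, "D3": 1050.0})["D1"])  # 0.25
```

Dividing the sum by the number of pairs is the same as multiplying by 2/(N(N−1)), which is the prefactor in Eq. (2).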
2.3. Gene Ontology of protein complex subunits
It is suggested that a protein complex is a biologically functional module composed of subunits performing similar functions [41]. Although evolutionary mechanisms drive the emergence of functional modules, the function of the core component of the complex appears to be more conserved among duplicate complexes; hence each complex remains functionally similar. Molecular function (MF) annotations for the subunits are carried forward from GO, which is used to characterize the whole complex. JI is a quantity that is used to quantify the similarity between two sets; hence, given two subunits i and j, the JI is given by
\[ JI_{MF}(i,j) = \frac{|i \cap j|}{|i \cup j|} \tag{4} \]
where $|i \cap j|$ and $|i \cup j|$ denote the cardinality of $i \cap j$ and $i \cup j$, respectively. It is noted that JI lies between 0 and 1. For example, given that the MF annotation for subunit i = {a, b, c} and subunit j = {b, c, d}, then JI(i, j) = 2/4 = 0.50. An all-against-all pairwise subunits’ $JI_{MF}$ is computed, and the average JI score for a complex D, $JI_{MF}(D)$, with N subunits is defined by
\[ JI_{MF}(D) = \frac{2}{N(N-1)} \sum_{i<j \in D} JI_{MF}(i,j) \tag{5} \]

Table 1
Tools used for computing the physicochemical properties of a protein subunit.
Physicochemical property      Tool
Hydrophobic, hydrophilic      KD scale
pI value                      ProtParam
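The Jaccard index of Eq. (4) and its complex-level average in Eq. (5) can be sketched in a few lines of Python. The function names and the convention of returning 0.0 for two empty annotation sets are assumptions for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two annotation sets, Eq. (4)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two unannotated subunits score 0
    return len(a & b) / len(union)


def average_jaccard(annotations):
    """Average pairwise JI over the subunits of one complex, Eq. (5).

    annotations -- dict: subunit -> set of GO molecular function terms
    """
    pairs = list(combinations(annotations.values(), 2))
    if not pairs:
        raise ValueError("at least two subunits are needed")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# The worked example from the text: i = {a, b, c}, j = {b, c, d}.
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```

For a complex with a third subunit k = {a, b, c}, the three pairwise values are 0.5, 1.0 and 0.5, so `average_jaccard` would return 2/3.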
2.4. Physicochemical properties of protein complex subunits
Physicochemical properties for the subunits can be computed using two bioinformatics tools (see Table 1 for details). The hydrophobic and hydrophilic values for the 20 amino acids are obtained from the Kyte and Doolittle (KD) scale [42]. The following amino acids are taken to be hydrophobic in our study: Ala, Met, Cys, Phe, Leu, Val, and Ile. The remaining amino acids are considered to be hydrophilic. ExPASy provides the tool ProtParam [26] to compute the physicochemical parameters for protein sequences, such as amino acid composition profile, pI (isoelectric point), and sequence length. The isoelectric point (pI) is the pH value at which a particular molecule, such as a protein, carries no net electrical charge. It is related to the dissociation constant, pKa, of a protein [43,44]. Sequence length and amino acid composition profile are obtained from the subunit's FASTA sequence information. Since the physicochemical value varies from subunit to subunit, the normalized index, V(D), defined by Eq. (3), is introduced to represent the normalized physicochemical parameter of a complex among complexes.
The Kolmogorov–Smirnov (KS) test was employed to examine whether there is any difference between the overall distributions of various parameters for the ‘Yeast’, Bond and MIPS data. For example, the overall distributions of various parameters derived from ‘Yeast’ and Bond are compared; the same test is also applied for Bond versus MIPS, and ‘Yeast’ versus MIPS. The KS test examines the differences between two cumulative distributions. It rejects the null hypothesis of no difference between two cumulative distributions if the p-value is less than 0.05. The KS test is performed using the ks.test function of R (http://www.r-project.org/).
2.5. Major features selection and machine learning methods for classification
With a few features computed, it is necessary to extract the major features.
Here, we make use of PCA to identify major features. PCA is an unsupervised method for data dimensionality reduction. This method aims to find a projection of the data with the maximum variance. The kth largest eigenvalue corresponds to the kth principal component or eigenvector. The first important feature is given by the largest component (absolute value) of the eigenvector with the largest eigenvalue. In general, the kth most important feature is given by the largest component of the eigenvector with the kth largest eigenvalue. PCA accepts the use of a covariance matrix or a correlation matrix for analysis. In this study, a correlation matrix of the 27 features (ρ, GO, bit score, hydrophobic, hydrophilic, pI, sequence length, and the composition profile for the 20 amino acids) is used for PCA analysis because feature attributes are mixed, e.g. density
of PPI, GO annotation, bit score, hydrophobicity and amino acid composition profile. A machine learning method, SVM, is trained on the input feature vectors. In particular, LIBSVM [45] is used for complex classification. LIBSVM provides several kernel functions for classification, i.e. linear, polynomial, sigmoid and radial basis functions (RBF). The same set of features is adopted for classification using NN, DT and NBC, which are provided by MS-SQL using their default settings. A ten-fold cross-validation test is conducted to validate the classification accuracy (ACC) based on the major features.
Randomized samples are generated in order to train protein complex classification using various machine learning methods. Tests are performed in which the assignment of protein subunits is randomized, with each complex consisting of the same number of randomly assigned subunits. The physicochemical values for each complex in the randomized set (the so-called negative set, which comprises a total of 1636 randomized complexes for human) are then computed. The results of the randomized set are taken together with the original complexes’ feature values to form the training set, and are input into a machine learning method, for instance SVM, for training. Classification accuracy, ACC, is defined by
\[ ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{6} \]
where TP, TN, FP and FN denote true positive, true negative, false positive and false negative, respectively. The above procedures are repeated for the other three classifiers, namely, NN, DT and NBC. The test set has a size of 182 human protein complexes.
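The negative-set construction and Eq. (6) can be illustrated with the sketch below. The text does not specify the pool from which random subunits are drawn or whether sampling is with replacement; drawing without replacement from a user-supplied pool of proteins is an assumption here, as are the function names.

```python
import random


def randomized_complexes(complexes, protein_pool, seed=0):
    """Build one size-matched random complex per real complex.

    complexes    -- list of lists of subunit identifiers (positive set)
    protein_pool -- proteins to draw from (assumed pool; not given in the text)
    Each pool must hold at least as many proteins as the largest complex.
    """
    rng = random.Random(seed)
    pool = sorted(protein_pool)  # sorted list for reproducible sampling
    return [rng.sample(pool, len(c)) for c in complexes]


def accuracy(true_labels, predicted_labels):
    """Classification accuracy ACC of Eq. (6); labels are truthy for 'complex'."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return (tp + tn) / (tp + tn + fp + fn)


print(accuracy([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.75
```

In the usage line, one true complex is missed (FN = 1) and the other three samples are classified correctly, giving ACC = 3/4.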
2.6. Improving protein complex prediction methods, COACH, MCL, MCODE and MINE
To study the impacts of the amino acid composition profile and physicochemical property on protein complex classification, we demonstrate how available methods, such as COACH [46–48], MCL [49–51], MCODE [52] and MINE [53], can be improved by post-processing their predictions.
The protein complex prediction method, COACH, infers protein complexes using graph clustering techniques. When compared with other existing protein complex prediction methods, COACH claims to achieve the highest F-measure due to its balanced precision and recall. Default settings are used; for example, parameter t is set equal to 0.225 in our test runs, where COACH can achieve a good balance of both F-measure and coverage rate.
MCL uses the Markov Clustering algorithm to discover clusters [49,50]. MCL simulates random walk or flow within the PPI network by transforming the flow into a stochastic Markov matrix. A self-loop is first added to each node in the input graph, and then two operations, i.e. expansion and inflation, are introduced to identify the optimal partition of the graph. Default parameters are used for our test runs, where the inflation parameter in MCL was set to 2.0. It is interesting to note that a recent work [51] proposed a refined MCL method for detecting yeast complexes from PPI data by incorporating ‘core’ and ‘attachment’ proteins.
The MCODE algorithm [52] is a graph clustering technique that detects protein complexes based on the highly interconnected PPI assumption. The algorithm proceeds in three steps. The first step is vertex weighting, which raises the weight of locally dense connected modules. The second step is molecular complex prediction: MCODE takes the vertex weighted module as a seed complex and moves outward to include vertices according to given parameters. The last step is post-processing: a complex is removed if it does not contain a vertex with degree connectivity two. Default parameters are used for test runs; for instance, the depth levelling is set to unlimited.
Module Identification in Networks (MINE) [53] identifies highly interconnected clusters from a PPI network. This method detects functional modules based on an agglomerative (bottom-up) clustering algorithm using a modified vertex weighting strategy. It was reported that MINE performs very competitively with existing methods, such as MCL, MCODE, CFinder [54] and SPICi [55]. Default settings for the node score cutoff, modularity score cutoff, complex merge score and maximum depth parameters are used for test runs.
To test the performance, the BioGrid PPI data (release 3.1.82) for human and yeast are adopted as an input for the four prediction methods. BioGrid collected a total of 70,308 and 276,218 PPIs for human and yeast, respectively. Other species are not considered because their PPI data are rather limited; for instance, mouse and fruit fly have 6076 and 34,901 PPI records, respectively. Predicted results are compared with the three datasets, MIPS, ‘Yeast’ and Bond, which provide high quality human and yeast protein complex data. Here, the parameter JI was used to rank the overall performance of the four methods on predicting an individual complex. Given the predicted human complexes, each one of the complexes is compared with the MIPS complexes and a JI value is computed, the largest JI value being retained. The whole process was repeated for each remaining complex, giving the average JI value for each protein complex prediction method.
To examine the effectiveness of using an amino acid composition profile for protein complex prediction, complexes obtained from a prediction tool, for instance COACH, are filtered based on their amino acid composition profiles. The average JI value for this reduced set of complexes is computed. In comparison, a better JI value strongly suggests the effectiveness of using an amino acid composition profile.
The whole process was repeated for the other methods, MCL, MCODE and MINE.
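The evaluation just described (best-matching JI per predicted complex, averaged over all predictions, then compared before and after filtering) can be sketched as follows. Function names are hypothetical, and the percentage improvement follows the definition of Δ given with Table 5; the classifier-based filtering step itself is not reproduced here.

```python
def jaccard(a, b):
    """Jaccard index between two sets of subunit identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_best_jaccard(predicted, reference):
    """For each predicted complex keep its largest JI against the
    reference complexes, then average over all predicted complexes."""
    if not predicted or not reference:
        raise ValueError("both complex lists must be non-empty")
    best = [max(jaccard(p, r) for r in reference) for p in predicted]
    return sum(best) / len(best)


def improvement_percent(ji_before, ji_after):
    """Percentage change of the average JI after filtering."""
    return (ji_after - ji_before) / ji_before * 100.0


predicted = [{"A", "B", "C"}, {"X", "Y"}]
reference = [{"A", "B"}, {"X", "Y", "Z"}, {"Q"}]
print(average_best_jaccard(predicted, reference))
print(round(improvement_percent(0.076, 0.168)))  # 121, the COACH entry in Table 5
```

In the toy data both predicted complexes have a best JI of 2/3, so the average is also 2/3. Applying `average_best_jaccard` to the full prediction list and to the filtered list would give the before and after values that enter `improvement_percent`.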
3. Results
3.1. Interaction topology of protein complex subunits
Fig. 1 summarizes the results of ρ for the species, human and yeast. Suppose a ρ value of 0.7 or above represents a highly interactive dense region in the PPI network; our results showed that around 82%, 49% and 72% of the ‘Yeast’, Bond and MIPS complexes are below this ρ value. In other words, quite a significant number of complexes do not have dense PPIs among their subunits [56]. It is evident from Fig. 1 that the distributions of ρ for ‘Yeast’ and Bond data are quite different, i.e. the ‘Yeast’ data has a much smaller ρ value. The p-value derived from the KS test is 1.46e−13, as shown in Table 2. The difference is probably due to the fact that almost all the complexes collected by ‘Yeast’ have ten or more subunits, whereas Bond documented complexes with a smaller number of subunits. The ‘Yeast’ data indicates that protein complexes do not have dense interactions among their subunits. These results suggest that algorithms based on the assumption that complexes are composed of highly PPI dense regions may have limitations in predicting large complexes.

Fig. 1. The plot of relative frequency of complexes versus ρ for the ‘Yeast’, Bond and MIPS datasets.

Table 2
The results of the p-values for the Kolmogorov–Smirnov test using 100 bins.
Feature            ‘Yeast’ vs. Bond p-value   ‘Yeast’ vs. MIPS p-value   Bond vs. MIPS p-value
GO                 0.00232                    0.0366                     0.581*
PPI                1.46e−13                   3.73e−05                   3.61e−12
bit score          1.29e−09                   5.10e−10                   9.57e−06
pI                 0.0158                     0.000787                   0.281*
Sequence length    3.73e−05                   0.00136                    0.0243
Hydrophobic        0.000445                   0.0101                     0.0541*
Hydrophilic        7.14e−05                   4.71e−06                   0.0063
* p-value above 0.05.

3.2. Sequence bit score of protein complex subunits
Fig. 2 (Appendix Fig. 1) shows the relative frequency of protein complexes versus the normalized (un-normalized) bit score for the ‘Yeast’, Bond and MIPS datasets. It is noted that over 90% of the protein complexes have a normalized bit score below 0.1. The main reason is that some complexes are composed of highly similar subunits, which results in large bit score values of around 1000.

Fig. 2. The plot of relative frequency of complexes versus bit score for the ‘Yeast’, Bond and MIPS datasets.

3.3. Gene Ontology annotation of protein complex subunits
The plot of the relative frequency of protein complexes versus JI for GO molecular function annotation is shown in Fig. 3. About 4%, 15% and 10% of the ‘Yeast’, Bond and MIPS complexes have average JI values above 0.5. In other words, relatively few complexes are composed of subunits with similar molecular function annotations.

Fig. 3. The plot of relative frequency of complexes versus JI for GO molecular function annotation.

3.4. Hydrophobic and hydrophilic properties of a protein complex
Amino acids’ hydrophobic (Hb) and hydrophilic (Hp) values for each subunit are computed. Since both the Hb and Hp values vary from subunit to subunit within a complex, their index V values are computed; these two indexes are used to represent the protein complex. The plots of the relative frequency of protein complexes versus the normalized (un-normalized) index V of the hydrophobic and hydrophilic values are shown in Figs. 4 and 5 (Appendix Figs. 2 and 3), respectively. Our results indicate that about 93%, 98% and 98% of the ‘Yeast’, Bond and MIPS complexes have V values less than or equal to 0.5 for the hydrophobic and hydrophilic values.

Fig. 4. The plot of relative frequency of normalized hydrophobic score for protein complexes.
Fig. 5. The plot of relative frequency of normalized hydrophilic score for protein complexes.

3.5. pI value of protein complex
The pI value for every protein complex's subunit is obtained from ProtParam. For each complex the normalized pI value is computed and the normalized values are grouped into 10% intervals. The plot of the relative frequency of protein complexes versus the normalized (un-normalized) pI value is shown in Fig. 6 (Appendix Fig. 4). It is found that around 92% of the MIPS complexes have a normalized value less than or equal to 0.5. The Bond data shows a smaller pI value relative to the ‘Yeast’ database. Our results indicate that about 68% and 90% of the ‘Yeast’ and Bond complexes have a normalized pI value less than or equal to 0.5.

Fig. 6. The plot of relative frequency of normalized pI values for protein complexes.

3.6. Sequence length distribution of subunits within a complex
The sequence length of every protein subunit for every complex obtained from MIPS is computed. For each complex, the normalized value of the sequence length is computed, and the normalized values are then grouped into intervals with a width of 0.1. The plot of the relative frequency versus the normalized (un-normalized) sequence length is shown in Fig. 7 (Appendix Fig. 5). The Bond data shows a smaller normalized value relative to the ‘Yeast’ database. It is found that around 93%, 98% and 97% of the ‘Yeast’, Bond and MIPS complexes have a normalized value less than or equal to 0.5.

Fig. 7. The plot of relative frequency of normalized value of sequence length for protein complexes.

The results of the p-values for the two-sample KS test using 100 bins are summarized in Table 2. Since most of the p-values are below 0.05, the null hypothesis is rejected. In other words, the distributions of the various parameters are different for the ‘Yeast’, Bond and MIPS data, except for the Bond and MIPS data using the GO, pI and hydrophobic features (see the entries marked with an asterisk in Table 2). We would like to point out that if 10 bins are used for representing the features, then there is no difference between the overall distributions of the studied parameters for the three sets of data, because the p-values obtained from the KS test are all above 0.05. In other words, conducting the test using quasi-continuous data (100 bins) instead of categorized data (10 bins) gave a different conclusion.
3.7. Amino acid composition profile of a protein complex
The normalized amino acid composition profiles for every complex are deduced from the subunits’ sequences. Every amino acid composition value is summed and averaged over all the complexes. The 20 composition values were then compared. Fig. 8 is the box-plot of the normalized amino acid composition profile for human protein complexes. The median of the normalized composition value for the 20 amino acids is somewhere between 0.1 and 0.2.
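A composition profile of this kind can be derived from sequence alone. The sketch below computes a 20-value profile per subunit, averages it over the subunits of a complex, and also reports the fraction of residues in the hydrophobic group named in Section 2.4. The function names, the handling of non-standard residue letters (they are ignored), and the use of a simple residue fraction for hydrophobicity are assumptions; the original study used ProtParam and the KD scale.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 amino acids
HYDROPHOBIC = set("AMCFLVI")          # Ala, Met, Cys, Phe, Leu, Val, Ile


def composition_profile(sequence):
    """Fraction of each of the 20 amino acids in one subunit sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)  # ignores letters such as X
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return {a: counts[a] / total for a in AMINO_ACIDS}


def complex_profile(sequences):
    """Average the subunit profiles to get one profile per complex."""
    profiles = [composition_profile(s) for s in sequences]
    return {a: sum(p[a] for p in profiles) / len(profiles) for a in AMINO_ACIDS}


def hydrophobic_fraction(sequence):
    """Fraction of residues belonging to the hydrophobic group."""
    profile = composition_profile(sequence)
    return sum(profile[a] for a in HYDROPHOBIC)


print(complex_profile(["AAKK", "KKKK"])["K"])  # 0.75
print(hydrophobic_fraction("AAKK"))            # 0.5
```

Each of the 20 averaged values could then be passed through the min–max normalization of Eq. (3) across all complexes before being used as a feature.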
Finally, we ranked the area under the curve (AUC) values [56,57] for each of the 27 features; the results are given in Appendix Table 1. It was found that PPI and GO are the two most significant features in classification performance. Most of the amino acids are ranked next, which suggests that adopting the amino acid profile for classification is feasible.
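One common way to obtain a single-feature AUC is the rank-based (Mann–Whitney) estimate, sketched below. The text does not state how the AUC values were computed, so this is an assumed method, and the function name and toy values are illustrative only.

```python
def feature_auc(positive_values, negative_values):
    """AUC of one feature used directly as a ranking score.

    Equals the probability that a randomly chosen real complex has a
    larger feature value than a randomly chosen randomized complex,
    with ties counted as one half.
    """
    if not positive_values or not negative_values:
        raise ValueError("both classes need at least one value")
    wins = 0.0
    for p in positive_values:
        for n in negative_values:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_values) * len(negative_values))


# Toy feature values for three real and three randomized complexes.
print(feature_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```

In the toy data eight of the nine positive–negative pairs are ordered correctly, giving an AUC of 8/9. Ranking features by this value would mirror the ordering reported in Appendix Table 1, under the stated assumption about how the AUC was obtained.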
3.8. Major features determined by PCA
A correlation matrix of the 27 features is computed. Feature selection calculation is conducted by using the PCA function provided by MATLAB. For human complexes, it is found that the first 17 eigenvectors or principal components (PCs) account for more than 95% of the total sum of eigenvalues. The first PC, which is associated with the largest eigenvalue, captures the largest variation, whereas the other orthogonal PCs capture the variation left in the data. For the first PC, only the first 20 largest components are kept, because the differences among these 20 components are small in comparison to the difference between the 20th and 21st components. This criterion is carried forward from the work by Isebrands and Crow [58]. For instance, the magnitudes of the first, 20th and 21st components are 0.239, 0.205 and 0.184, respectively; we discarded the 21st component and beyond because the difference between the 20th and 21st components is much bigger than the consecutive differences between the 1st and the 20th components. The first 20 components correspond exactly to the 20 amino acids. In other words, the most important feature is the linear combination of the 20 amino acid compositions. The first four most significant features (accounting for more than 80% of the total eigenvalues) determined by the PCA are given in Table 3. Using the same criteria as for the first PC, only components with a significant magnitude are kept in the 2nd, 3rd and 4th PCs.
The above analyses were repeated for the Bond and ‘Yeast’ data. In both cases, the first 19 PCs account for 95% of the total sum of eigenvalues. The results of the features given by the largest four PCs are the same as the result obtained from human, i.e. same as
Fig. 8. Box-plot of the normalized amino acid composition value for human protein complexes.
Table 3. This indicates that the feature spaces spanned by PCA are the same for both species, human and yeast.
3.9. Machine learning methods classification results
Guided by the PCA results, i.e. Table 3, we examined the dependence of the classification accuracy (ACC) on the feature spaces. Five types of feature spaces were considered. The first type of feature space, denoted by I, is given by the first PC (only the 20 amino acid compositions are considered). The second feature space, denoted by II, is defined by the 3rd and 4th PCs in Table 3. It is noted that II does not consist of any physicochemical property. The third type of feature space, denoted by III, is the linear combination of pI, sequence length, hydrophilicity and hydrophobicity. The fourth feature space, denoted by IV, is the combination of I and III, i.e. the linear combination of all physicochemical properties. Finally, the fifth feature space, denoted by V, is the combination of II and IV, i.e. the complete set of 27 features.
To examine the relative importance of the amino acid profile and the physicochemical properties, we studied the AUC values for feature spaces I and III; the results are given in Appendix Table 2. It was found that the AUC value for I was about 0.99, which supports the use of the amino acid profile for classification.
Using the RBF kernel, the ACC results predicted by the SVM for the five feature spaces are given in Table 4. The classification calculations are repeated using NN, DT and NBC; the results are also given in Table 4. Among the four classifiers, SVM achieves the best ACC in classifying the three protein complex datasets, i.e. ‘Yeast’, Bond and MIPS. The differences in ACC among the three datasets are rather insignificant. It is interesting to notice that excellent ACC, above 97.3%, was obtained just by using the amino acid composition profile (feature space I). A similar level of ACC was also achieved by using II. However, the main advantage of using I is that only protein sequence information is required; there is no need to (i) compute the density of PPI, (ii) compute JI for GO, or (iii) perform pairwise sequence alignment, thus allowing a much easier and more efficient way of classifying protein complexes. The effectiveness of adopting feature space I in identifying putative complexes will be demonstrated later in this section.
In comparison, the use of III resulted in lower ACC values. In other words, the use of pI, sequence length, hydrophilicity and hydrophobicity does not perform as well as the choice of other features. If feature space I is combined with III, i.e. using all the physicochemical properties, an excellent ACC value is achieved, which is above 98.9% using the SVM classifier and slightly better than using I alone. When the complete set of 27 features is used, ACC can be further improved, reaching 99.2% or higher.
To demonstrate the effectiveness of this approach, we compared the average JI value for the predicted complexes before and after the classifying or filtering process using feature space I and SVM. Overall, tests are performed using four protein complex programs over three different datasets; that is a total of 15 experimental runs. The MCL prediction program reported complexes composed of a single subunit; if these are excluded there are only 12 runs. The results for the human complex study are summarized in Table 5. Percentage
Table 3 The results of the feature spaces given by the largest four PCs for human protein complexes.

| First PC | Second PC | Third PC | Fourth PC |
|---|---|---|---|
| The 20 amino acids | Sequence length, hydrophilic, hydrophobic, pI | Bit score^a, ρ | GO^b |

^a Bit score is given by an all-against-all pairwise sequence alignment by using the BLAST program.
^b The Jaccard Index is a quantity which is used to quantify the similarity between two sets of Gene Ontology (GO) molecular function annotations.
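The Jaccard Index mentioned in the footnote of Table 3 can be illustrated with a short sketch. It uses the standard set definition, |A ∩ B| / |A ∪ B|; the GO identifiers below are arbitrary illustrative values, and returning 0 for two empty annotation sets is an assumption made here, not a convention stated in the paper.

```python
def jaccard_index(terms_a, terms_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of GO term identifiers."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:
        # Assumed convention: two unannotated proteins get similarity 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical GO molecular function annotations for two subunits.
subunit_1 = {"GO:0005515", "GO:0003677", "GO:0005524"}
subunit_2 = {"GO:0005515", "GO:0005524", "GO:0016887"}

ji = jaccard_index(subunit_1, subunit_2)
print(f"JI = {ji:.2f}")  # 2 shared terms out of 4 distinct terms -> JI = 0.50
```

A JI of 1 means the two subunits carry identical annotation sets, and a JI of 0 means they share none.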
Table 4 The results of classification accuracy of ‘Yeast’, Bond and MIPS complexes for five different features.

| Feature space^b | Database | SVM^a, ACC (%) | NN, ACC (%) | DT, ACC (%) | NBC, ACC (%) |
|---|---|---|---|---|---|
| I | Yeast | 98.0 | 97.4 | 86.3 | 67.7 |
| I | Bond | 97.3 | 96.8 | 87.6 | 71.7 |
| I | MIPS | 99.6 | 99.3 | 93.7 | 71.4 |
| II | Yeast | 99.3 | 99.0 | 99.1 | 98.1 |
| II | Bond | 96.0 | 95.3 | 95.4 | 94.9 |
| II | MIPS | 98.8 | 97.9 | 97.5 | 98.2 |
| III | Yeast | 75.7 | 70.1 | 70.1 | 65.4 |
| III | Bond | 72.8 | 67.5 | 67.7 | 64.1 |
| III | MIPS | 81.1 | 80.7 | 75.0 | 62.4 |
| IV | Yeast | 98.9 | 97.2 | 86.2 | 68.1 |
| IV | Bond | 99.5 | 96.3 | 87.6 | 69.7 |
| IV | MIPS | 99.8 | 99.5 | 94.4 | 69.2 |
| V | Yeast | 99.4 | 99.3 | 99.1 | 91.3 |
| V | Bond | 99.2 | 98.9 | 96.4 | 90.0 |
| V | MIPS | 99.8 | 99.3 | 97.6 | 92.9 |

^a Part of the SVM results have been reported in a previous work [59].
^b In the ‘Feature space’ column, I denotes 20 amino acid composition, II denotes GO, ρ and bit score, III denotes pI, sequence length, hydrophilicity and hydrophobicity, IV denotes pI, sequence length, hydrophilicity, hydrophobicity and 20 amino acid composition, and V denotes the combination of the 27 features.
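The grouping of the 27 features into the five feature spaces of Table 4 can be written down as column selections over a feature table. The following is a minimal sketch under stated assumptions: the column names (`comp_A`, `go_ji`, `rho` and so on) are hypothetical labels introduced for illustration, and ρ is taken here to be the PPI density. Only the grouping itself follows the footnote of Table 4.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical column names; only the grouping follows the Table 4 footnote.
FEATURE_SPACES = {
    # I: the 20 amino acid composition features
    "I": [f"comp_{aa}" for aa in AMINO_ACIDS],
    # II: GO (Jaccard Index), rho and BLAST bit score
    "II": ["go_ji", "rho", "bit_score"],
    # III: pI, sequence length, hydrophilicity and hydrophobicity
    "III": ["pI", "seq_length", "hydrophilicity", "hydrophobicity"],
}
# IV combines I and III; V combines II and IV (all 27 features).
FEATURE_SPACES["IV"] = FEATURE_SPACES["III"] + FEATURE_SPACES["I"]
FEATURE_SPACES["V"] = FEATURE_SPACES["II"] + FEATURE_SPACES["IV"]


def select_features(record, space):
    """Return the values of one record restricted to the given feature space."""
    return [record[name] for name in FEATURE_SPACES[space]]


# Dummy record with every feature set to 0.0, used only to show vector sizes.
record = {name: 0.0 for name in FEATURE_SPACES["V"]}
for space in ("I", "II", "III", "IV", "V"):
    print(space, len(select_features(record, space)))
```

The printed vector lengths are 20, 3, 4, 24 and 27, which matches the feature counts implied by the definitions of I to V.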
Table 5 The results of average JI for human complexes before and after the classifying process.

| Program | JIbefore | JIafter | Δ (%) |
|---|---|---|---|
| COACH | 0.076 | 0.168 | 121 |
| MCL | 0.0783 | 0.0993^a/0.0712 | 26.8^a/−9.07 |
| MCODE | 0.161 | 0.164 | 1.86 |
| MINE | 0.152 | 0.170 | 11.8 |

^a Data include protein complexes composed of only one subunit.
of improvement in JI is denoted by Δ, which is defined by (JIafter − JIbefore)/JIbefore × 100%. For the human complex test run, the filtering approach improves the COACH result from 0.076 to 0.168, a 121% improvement (Table 5), which is rather significant. COACH, MCODE and MINE also achieved relatively better JIafter values than MCL. The results for the yeast complex study are summarized in Table 6. MCODE and MINE achieved higher JIbefore and JIafter values than COACH and MCL, which indicates that MCODE and MINE perform better in complex prediction. If the COACH or MCL predictions are post-processed, the improvement in JI is rather significant for the Bond data, with Δ values of 36.8% and 70.6%, respectively. This may be because COACH and MCL reported a larger number of putative complexes than MCODE and MINE (see Table 7). Although MCODE and MINE gave better JI values, they predicted fewer complexes by a factor of at least two. We also note that the best JI value, i.e. 0.348, corresponds to the 70.6% improvement. Overall, better JI values are obtained, i.e. the Δ values are positive, which corresponds to better protein complex prediction. From Tables 5 and 6, the present approach achieved a better JI value in 15 (100%) or 11 (91.6%) of the runs, depending on whether the single-subunit complexes reported by MCL are included or excluded. MCL predicted protein complexes composed of a single subunit; if these complexes are excluded, the performance degrades slightly, with Δ values of −9.07%, 2.73% and 8.25% for the human, ‘Yeast’ and Bond datasets, respectively. In summary, a positive Δ value is obtained in almost all cases, suggesting the effectiveness of using the amino acid composition profile to improve protein complex classification. A certain fraction of the predicted protein complexes are false positives, and these are filtered out by the amino acid composition profile approach. The number of predicted complexes, also called coverage, before and after the filtering process is given in Table 7. The percentage of coverage is denoted by δ, which is defined by Nafter/Nbefore × 100%. Relatively speaking, coverage values for yeast are higher than those for human. Most of the coverage values for yeast remained above 70%, except for the MCODE result. Relative to the other prediction tools, MCODE tends to report a smaller number of protein complex predictions. A web service has been set up, which can be accessed at http://bioinfo.csie.nfu.edu.tw:8080/ProteinComplex/ProteinComplexsvm.aspx. It provides protein complex predictions based on FASTA sequence information only. The user simply uploads a flat file of FASTA sequences, and the web server reports the prediction instantly.
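As a quick check of the coverage definition, δ = Nafter/Nbefore × 100%, the sketch below recomputes δ for three of the ‘Yeast’ entries quoted from Table 7. The function name `coverage_percent` is a hypothetical helper introduced here for illustration.

```python
def coverage_percent(n_before, n_after):
    """Coverage delta = N_after / N_before * 100 (in percent)."""
    if n_before <= 0:
        raise ValueError("n_before must be positive")
    return n_after / n_before * 100.0


# (N_before, N_after) pairs quoted from Table 7 for the 'Yeast' dataset.
yeast_counts = {
    "COACH": (888, 795),
    "MCODE": (87, 13),
    "MINE": (324, 293),
}

for tool, (n_before, n_after) in yeast_counts.items():
    print(f"{tool}: {coverage_percent(n_before, n_after):.1f}%")
```

Rounded to one decimal place, the values agree with the δ column of Table 7 (89.5, 14.9 and 90.4).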
4. Discussion

A topological parameter, the density of PPI, is defined to test whether or not protein complexes are found in PPI-dense regions. The present data indicate that interaction-dense regions represent protein complexes in only a certain fraction of the cases; a large proportion of protein complexes have a lower density of PPI. It is conjectured that prediction approaches based on the assumption that complexes are composed of highly PPI-dense regions can predict only a limited number of complexes. Several approaches have been developed for protein complex prediction, such as (1) using graph theory to study dense PPI regions [16–18], (2) using experimental data, such as tandem mass spectrometry, (3) the core-attachment approach [10], and (4) heterogeneous data integration [60–63]. All of these approaches have certain limitations, for they consider only the static, non-biochemical properties of a protein complex. In the present study, several physicochemical parameters, i.e. the composition profile of the 20 amino acids, hydrophobicity, hydrophilicity, pI value and sequence length, are introduced to describe protein complexes and to gain better insight into protein complex architecture. Important features are inferred by PCA. It is found that the best feature classification accuracy is achieved by using the amino acid composition profile. Given the FASTA sequences of the subunits of a predicted complex, a possible application of the present study is to improve the sensitivity of determining whether or not a protein subunit belongs to an inferred protein complex. We validated this finding by
Table 6 The results of average JI for the yeast complexes before and after the classifying process.

| Program | ‘Yeast’ JIbefore | ‘Yeast’ JIafter | ‘Yeast’ Δ (%) | Bond JIbefore | Bond JIafter | Bond Δ (%) |
|---|---|---|---|---|---|---|
| COACH | 0.159 | 0.160 | 0.63 | 0.155 | 0.212 | 36.8 |
| MCL | 0.084^a/0.0878 | 0.0867^a/0.0902 | 3.21^a/2.73 | 0.204^a/0.194 | 0.348^a/0.210 | 70.6^a/8.25 |
| MCODE | 0.179 | 0.192 | 7.26 | 0.330 | 0.346 | 4.85 |
| MINE | 0.188 | 0.190 | 1.06 | 0.329 | 0.330 | 0.30 |

^a Data include protein complexes composed of only one subunit.
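The percentage improvement Δ = (JIafter − JIbefore)/JIbefore × 100% can be recomputed directly from the tabulated JI values. The sketch below does this for a few entries quoted from Tables 5 and 6; `improvement_percent` is a hypothetical helper name, and small deviations from the published Δ values are possible because the tabulated JI values are themselves rounded.

```python
def improvement_percent(ji_before, ji_after):
    """Delta = (JI_after - JI_before) / JI_before * 100 (in percent)."""
    if ji_before == 0:
        raise ValueError("ji_before must be non-zero")
    return (ji_after - ji_before) / ji_before * 100.0


# (JI_before, JI_after) pairs quoted from Tables 5 and 6.
runs = {
    "human, COACH": (0.076, 0.168),
    "human, MCL (single-subunit complexes excluded)": (0.0783, 0.0712),
    "Bond, COACH": (0.155, 0.212),
    "Bond, MCL (single-subunit complexes included)": (0.204, 0.348),
}

for label, (before, after) in runs.items():
    print(f"{label}: {improvement_percent(before, after):+.2f}%")
```

A negative Δ, as in the human MCL run without single-subunit complexes, means that the average JI dropped after filtering.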
Table 7 The number of predicted complexes (coverage) before and after the filtering process.

| Program | Human Nbefore | Human Nafter | Human δ (%) | ‘Yeast’ Nbefore | ‘Yeast’ Nafter | ‘Yeast’ δ (%) | Bond Nbefore | Bond Nafter | Bond δ (%) |
|---|---|---|---|---|---|---|---|---|---|
| COACH | 1824 | 111 | 6.09 | 888 | 795 | 89.5 | 888 | 623 | 70.2 |
| MCL | 2126^a/1775 | 39^a/797 | 1.83^a/44.9 | 1212^a/959 | 1191^a/916 | 98.3^a/95.5 | 1212^a/959 | 41^a/815 | 3.38^a/85.0 |
| MCODE | 333 | 25 | 7.51 | 87 | 13 | 14.9 | 87 | 63 | 72.4 |
| MINE | 890 | 118 | 13.3 | 324 | 293 | 90.4 | 324 | 270 | 83.3 |

^a Data include protein complexes composed of only one subunit.
demonstrating that better complex prediction is achieved by post-processing the COACH, MCL, MCODE and MINE predictions. The present work presents an interesting finding, which strongly suggests that one may have to integrate amino acid composition profiles to improve protein complex prediction. The main advantage of using the amino acid composition profile is that only primary sequence information is required, which is much easier to work with.
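Because the approach needs only primary sequences, its input features can be derived from a FASTA file with a few lines of code. The sketch below is an illustration rather than the authors' implementation: it assumes that the composition profile is the fraction of each of the 20 standard amino acids in a sequence, it ignores non-standard residue codes, and the example sequences and headers are made up.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (fixed feature order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def composition_profile(sequence):
    """Return 20 residue fractions in AMINO_ACIDS order.

    Assumption: residues outside the 20 standard codes are ignored.
    """
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Made-up sequences, used only to exercise the functions.
example = """>subunit_1 hypothetical
MKTAYIAK
QR
>subunit_2 hypothetical
GGGA
"""

profiles = {h: composition_profile(s) for h, s in parse_fasta(example)}
for header, profile in profiles.items():
    print(header, [round(x, 2) for x in profile])
```

Each 20-dimensional vector corresponds to feature space I and could then be passed to a classifier such as an SVM with an RBF kernel.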
Conflict of interest statement

None.
Acknowledgments

The work of Chien-Hung Huang is supported by the National Science Council of Taiwan under Grants NSC 100-2221-E-150-069 and NSC 101-2221-E-150-088-MY2. The work of Ka-Lok Ng is supported by NSC 99-2221-E-468-016-MY2, NSC 100-2221-E468-013 and NSC 101-2221-E-468-027. We also thank Kun-Ting Chao for his contributions to data collection and data parsing, and Dr. Tim Williams of Asia University for providing an English proofreading service for this article.
Appendix A. Supporting information

Supplementary data associated with this article can be found in the online version at: http://dx.doi.org/10.1016/j.compbiomed.2013.05.026.
References

[1] C. Kleanthous, Protein–protein Recognition, Oxford University Press, 2000.
[2] G.D. Bader, C.W. Hogue, Analyzing yeast protein–protein interaction data obtained from different sources, Nat. Biotechnol. 20 (2002) 991–997.
[3] D. Scholtens, M. Vidal, R. Gentleman, Local modeling of global interactome networks, Bioinformatics 21 (2005) 3548–3557.
[4] L. Hakes, D.L. Robertson, S.G. Oliver, S.C. Lovell, Protein interactions from complexes: a structural perspective, Comp. Funct. Genomics 2007 (2007) 1–5.
[5] L. Hakes, D.L. Robertson, S.G. Oliver, Effect of dataset selection on the topological interpretation of protein interaction networks, BMC Genomics 6 (2005) 131.
[6] Z. Lubovac, J. Gamalielsson, B. Olsson, Combining functional and topological properties to identify core modules in protein interaction networks, Proteins: Struct. Funct. Bioinform. 64 (2006) 948–959.
[7] C.N.I. Pang, J.R. Krycer, A. Lek, M.R. Wilkins, Are protein complexes made of cores, modules and attachments? Proteomics 8 (2008) 425–434.
[8] A.C. Gavin, P. Aloy, P. Grandi, R. Krause, M. Boesche, M. Marzioch, C. Rau, L.J. Jensen, S. Bastuck, B. Dümpelfeld, et al., Proteome survey reveals modularity of the yeast cell machinery, Nature 440 (2006) 631–636.
[9] B. Zhang, B.H. Park, T. Karpinets, N.F. Samatova, From pull-down data to protein interaction networks and complexes with biological relevance, Bioinformatics 24 (2008) 979–986.
[10] H.C.M. Leung, Q. Xiang, S.M. Yiu, F.Y.L. Chin, Predicting protein complexes from PPI data: a core-attachment approach, J. Comput. Biol. 16 (2009) 133–144.
[11] S. Zanivan, I. Cascone, C. Peyron, I. Molineris, S. Marchio, M. Caselle, F. Bussolino, A new computational approach to analyze human protein complexes and predict novel protein interactions, Genome Biol. 8 (2007) R256.
[12] V. Spirin, L.A. Mirny, Protein complexes and functional modules in molecular networks, Proc. Natl. Acad. Sci. 100 (2003) 12123–12128.
[13] M. Altaf-Ul-Amin, Y. Shinbo, K. Mihara, K. Kurokawa, S. Kanaya, Development and implementation of an algorithm for detection of protein complexes in large interaction networks, BMC Bioinform. 7 (207) (2006) 1–13.
[14] H. Yu, A. Paccanaro, V. Trifonov, M. Gerstein, Predicting interactions in protein networks by completing defective cliques, Bioinformatics 22 (2006) 823–829.
[15] E. Zotenko, K.S. Guimarães, R. Jothi, T.M. Przytycka, Decomposition of overlapping protein complexes: a graph theoretical method for analyzing static and dynamic protein associations, Algorithms Mol. Biol. 1 (7) (2006) 1–11.
[16] H. Zheng, H. Wang, D.H. Glass, Integration of genomic data for inferring protein complexes from global protein–protein interaction networks, IEEE Trans. Syst. Man Cybern. 38 (2008) 5–16.
[17] A.D. King, N. Przulj, I. Jurisica, Protein complex prediction via cost-based clustering, Bioinformatics 20 (2004) 3013–3020.
[18] Y. Qi, F. Balem, C. Faloutsos, K.S. Judith, B.J. Ziv, Protein complex identification by supervised graph local clustering, Bioinformatics 24 (2008) 250–268.
[19] P.P.C. Tan, D. Dargahi, F. Pio, Predicting protein complexes by data integration of different types of interactions, Int. J. Comput. Biol. Drug Des. 3 (2010) 19–30.
[20] C.H. Huang, K.T. Chao, K.L. Ng, Protein complexes subunits interaction topology and sequence identity, in: IEEE International Conference on Computer Research and Development (IEEE ICCRD 2011), vol. 1, Shanghai, China, March 11–13, 2011, pp. 229–232.
[21] K. Cho, K. Lee, K. Lee, D. Kim, D. Lee, Specificity of molecular interactions in transient protein–protein interaction interfaces, Proteins 65 (2006) 593–606.
[22] Y. Yang, E. Tantoso, K.B. Li, Remote protein homology detection using recurrence quantification analysis and amino acid physicochemical properties, J. Theor. Biol. 252 (2008) 145–154.
[23] P. Wong, S. Althammer, A. Hildebrand, A. Kirschner, P. Pagel, B. Geissler, P. Smialowski, F. Blöchl, M. Oesterheld, T. Schmidt, et al., An evolutionary and structural characterization of mammalian protein complex organization, BMC Genomics 9 (629) (2008) 1–16.
[24] S. Kawashima, P. Pokarowski, M. Pokarowska, A. Kolinski, T. Katayama, M. Kanehisa, AAindex: amino acid index database, progress report, Nucleic Acids Res. 36 (2008) D202–205.
[25] I. Saha, U. Maulik, S. Bandyopadhyay, D. Plewczynski, Fuzzy clustering of physicochemical and biochemical properties of amino acids, Amino Acids 43 (2012) 583–594.
[26] E. Gasteiger, C. Hoogland, A. Gattiker, S. Duvaud, M.R. Wilkins, R.D. Appel, A. Bairoch, Protein identification and analysis tools on the ExPASy server, in: M. John Walker (Ed.), The Proteomics Protocols Handbook, Humana Press, 2005, pp. 571–607.
[27] S. Abe, Support Vector Machines for Pattern Classification, Springer, London, 2005.
[28] E. Keedwell, A. Narayanan, Intelligent Bioinformatics, John Wiley & Sons Ltd, New York, 2005.
[29] S. Ma, Y. Dai, Principal component analysis based methods in bioinformatics studies, Brief Bioinform. 12 (2011) 714–722.
[30] I.T. Jolliffe, Principal Component Analysis, Springer, New York, 2002.
[31] I.T. Jolliffe, Discarding variables in a principal component analysis I: Artificial data, Appl. Stat. 21 (1972) 160–173.
[32] W.J. Krzanowski, Selection of variables to preserve multivariate data structure, using principal component analysis, J. R. Stat. Soc. Ser. C Appl. Stat. 36 (1987) 22–33.
[33] Y. Lu, I. Cohen, X.S. Zhou, Q. Tian, Feature selection using principal feature analysis, in: Proceedings of the 15th International Conference on Multimedia, 2007, pp. 301–304.
[34] Z. Cataltepe, H.M. Genc, T. Pearson, A PCA/ICA based feature selection method and its application for corn fungi detection, in: Proceedings of the 15th European Signal Processing Conference, 2007, pp. 970–974.
[35] A. Malhi, R.X. Gao, PCA-based feature selection scheme for machine defect classification, IEEE Trans. Instrum. Meas. 53 (2004) 1517–1525.
[36] I.S. Bajwa, M.S. Naweed, M.N. Asif, S.I. Hyder, Feature based image classification by using principal component analysis, Int. J. Graphic. Vision Image Process 9 (2009) 11–17.
[37] O. Khayat, H.R. Shahdoosti, M.H. Khosravi, Image classification using principal feature analysis, in: Proceedings of the 7th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases, 2008, pp. 198–203.
[38] P. Pagel, S. Kovac, M. Oesterheld, B. Brauner, I. Dunger-Kaltenbach, G. Frishman, C. Montrone, P. Mark, V. Stümpflen, H.W. Mewes, A. Ruepp, D. Frishman, The MIPS mammalian protein–protein interaction database, Bioinformatics 21 (2005) 832–834.
[39] C. Alfarano, C.E. Andrade, K. Anthony, N. Bahroos, M. Baiec, K. Bantoft, D. Betel, B. Bobechko, K. Boutilier, E. Burgess, et al., The biomolecular interaction network database and related tools 2005 update, Nucleic Acids Res. 33 (2005) D418–424.
[40] B.J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R. Oughtred, D.H. Lackner, J. Bähler, V. Wood, K. Dolinski, M. Tyers, The BioGRID interaction database: 2008 update, Nucleic Acids Res. 36 (2008) D637–640.
[41] J.B. Pereira-Leal, E.D. Levy, S.A. Teichmann, The origins and evolution of functional modules: lessons from protein complexes, Philos. Trans. R. Soc. London B Biol. Sci. 361 (2006) 507–517.
[42] J. Kyte, R.F. Doolittle, A simple method for displaying the hydropathic character of a protein, J. Mol. Biol. 157 (1982) 105–132.
[43] J. Kim, J. Mao, M.R. Gunner, Are acidic and basic groups in buried proteins predicted to be ionized? J. Mol. Biol. 348 (2005) 1283–1298.
[44] P.J. Kundrotas, E. Alexov, Electrostatic properties of protein–protein complexes, Biophys. J. 91 (2006) 1724–1736.
[45] C.C. Chang, C.J. Lin, LIBSVM: A Library for Support Vector Machines, Department of Computer Science, National Taiwan University, 2010.
[46] M. Wu, X. Li, C.K. Kwoh, S.K. Ng, A core-attachment based method to detect protein complexes in PPI networks, BMC Bioinform. 10 (169) (2009) 1–16.
[47] X. Li, M. Wu, C.K. Kwoh, S.K. Ng, Computational approaches for detecting protein complexes from protein interaction networks: a survey, BMC Genomics 11 (Suppl. 1) (2010) 1–19, S3.
[48] A.J. Enright, S.V. Dongen, C.A. Ouzounis, An efficient algorithm for large-scale detection of protein families, Nucleic Acids Res. 30 (2002) 1575–1584.
[49] S. van Dongen, Graph clustering by flow simulation (Ph.D. thesis), University of Utrecht, 2000.
[50] J. Vlasblom, S.J. Wodak, Markov clustering versus affinity propagation for the partitioning of protein interaction graphs, BMC Bioinformatics 10 (2009) 99.
[51] S. Srihari, K. Ning, H.W. Leong, MCL-CAw: a refinement of MCL for detecting yeast complexes from weighted PPI networks by incorporating core-attachment structure, BMC Genomics 11 (2010) 504.
[52] G.D. Bader, C.W.V. Hogue, An automated method for finding molecular complexes in large protein interaction networks, BMC Bioinform. 4 (2003) 2.
[53] K. Rhrissorrakrai, K.C. Gunsalus, MINE: module identification in networks, BMC Bioinform. 12 (2011) 192.
[54] B. Adamcsek, G. Palla, I.J. Farkas, I. Derényi, T. Vicsek, CFinder: locating cliques and overlapping modules in biological networks, Bioinformatics 22 (2006) 1021–1023.
[55] P. Jiang, M. Singh, SPICi: a fast clustering algorithm for large biological networks, Bioinformatics 26 (2010) 1105–1111.
[56] E. Alpaydin, Introduction to Machine Learning, 2nd ed., The MIT Press, London, 2011.
[57] R.J. van Berlo, L.F. Wessels, D. De Ridder, M.J. Reinders, Protein complex prediction using an integrative bioinformatics approach, J. Bioinform. Comput. Biol. 5 (4) (2007) 839–864.
[58] J.G. Isebrands, T.R. Crow, Introduction to Uses and Interpretation of Principal Component Analysis in Forest Biology, USDA Forest Service General Technical Report, NC-17, 1975.
[59] C.H. Huang, K.T. Chao, K.L. Ng, Physicochemical features study of protein complexes, in: Proceedings of the 4th IEEE International Conference on Computer Science and Information Technology (ICCSIT 2011), 2011.
[60] J. Qiu, W.S. Noble, Predicting co-complexed protein pairs from heterogeneous data, PLoS Comput. Biol. 4 (2008) 1–10.
[61] J. Krumsiek, C.C. Friedel, R. Zimmer, ProCope: Protein Complex prediction and evaluation, Bioinformatics 24 (2008) 2115–2116.
[62] P.J. Kundrotas, E. Alexov, PROTCOM: searchable database of Protein Complexes enhanced with domain–domain structures, Nucleic Acids Res. 35 (2007) D575–579.
[63] E. Sprinzak, Y. Altuvia, H. Margalit, Characterization and prediction of protein–protein interactions within and between complexes, Proc. Natl. Acad. Sci. 103 (2006) 14718–14723.
Dr. Chien-Hung Huang received the B.S. degree in computer science from the Tatung Institute of Technology, Taipei, Taiwan in 1991, and the Ph.D. degree in Computer and Information Engineering from National Tsing Hua University, Hsinchu, Taiwan in 1999. From 1999 to 2004, he was on the faculty of Ling Tung University. Currently, he is an associate professor at the Department of Computer and Information Engineering, National Formosa University. His research interests include bioinformatics, data hiding, algorithms and open source distributions.
Mr. Szu-Yu Chou received the M.S. degree from the Graduate Institute of Electro-Optical and Materials Science, National Formosa University, Taiwan in 2012. Since March 2013, he has been a Research Assistant at the Academia Sinica Research Center for Information Technology Innovation. Currently, he is a Ph.D. student at the Department of Electrical Engineering, National Tsing Hua University. His research interests include data mining and machine learning.
Dr. Ka-Lok Ng received the Honours diploma in physics from Hong Kong Baptist College in 1983, and the Ph.D. degree in theoretical physics from Vanderbilt University, USA in 1990. He has been a professor at the Department of Biomedical Informatics, Asia University, Taiwan, since August 2008. Since December 2009, he has served on the editorial boards of several international journals. He is the Editor-in-Chief, Associate Editor, Reviewer Editor and Guest Editor of the WSEAS Transactions of Biology and Biomedicine, IST Transactions of Biomedical Sciences and Engineering, Frontiers in Genomic Assay Technology and Current Bioinformatics, respectively. Furthermore, he is also actively involved in reviewing manuscripts for international journals. He has been the PI or co-PI of more than 15 nationally funded research grants in the last 10 years in the area of bioinformatics research. Dr. Ng has published articles in highly ranked journals, in the areas of PPI networks, robustness study of biological networks, domain–domain interactions, non-coding RNA, protein function prediction and DNA data hiding methods. His research interests include PPI networks, mRNA-microRNA expression profile study, cancer-related microRNAs, physicochemical properties of protein complexes, time series microarray data analysis and host–pathogen PPI studies.