
Prediction of Subcellular Localization of Apoptosis Proteins by Dipeptide Composition

Chaohong Song, Feng Shi*
*Corresponding author
College of Science, Huazhong Agricultural University, Wuhan, 430070, China
[email protected] [email protected]
doi: 10.4156/jdcta.vol4.issue1.4

Abstract

By cluster analysis, all dipeptides are classified into 16 categories according to their hydrophobicity. Based on the composition of these dipeptide categories, a novel representation of protein sequences is proposed here to predict the subcellular location of apoptosis proteins. Using the K-Nearest Neighbor classifier and testing on a known dataset of 317 apoptosis proteins, higher predictive success rates are obtained: the total prediction accuracy of our method is 88.3%. To further validate the feasibility of the method, we do the same work on a new, expanded dataset of 1551 apoptosis proteins, where the total prediction accuracy is 78.3%. These results indicate that the composition of dipeptide categories combined with the K-Nearest Neighbor classifier is very useful for predicting the subcellular location of apoptosis proteins.

Keywords

Subcellular location; dipeptide; hydrophobicity; cluster analysis; K-Nearest Neighbor classifier

1. Introduction

Apoptosis, also known as programmed cell death, is characterized by specific morphologic and biochemical properties [1]. Apoptosis proteins play a central role in the development and homeostasis of an organism [2]. The function of an apoptosis protein is closely correlated with its subcellular location, so it is very important to obtain information about the subcellular location of apoptosis proteins [3]. Although the subcellular location of unknown proteins can be determined by experimental methods, these are both time-consuming and expensive. Therefore, there is an urgent need for an accurate and reliable prediction method for apoptosis protein subcellular location.

Many efforts have been made to predict protein subcellular location. Emanuelsson et al. [4] used the N-terminal sequence as input to two layers of artificial neural networks. Zhou and Doctor attempted to identify four kinds of subcellular locations of 98 apoptosis proteins based on amino acid composition by means of the covariant discriminant function [5]. Because amino acid composition carries no sequence-order information, new protein features were proposed to incorporate the sequence-order effects of proteins. Chou proposed the concept of “pseudo amino acid composition” [6]. Chou and Cai [7,8] developed an accurate method integrating pseudo amino acid composition, functional domain composition and gene ontology information. Feng proposed a new representation, the unified attribute vector, in which all proteins have their representative points on the surface of a 20-D globe [9]. Xiao et al. used a complexity measure factor to predict protein subcellular location [10]. Zhang et al. predicted protein homo-oligomer types by pseudo amino acid composition, using an improved feature extraction and Naive Bayes feature fusion [11]. Bulashevska and Eils predicted the four kinds of subcellular locations on the same dataset using a hierarchical ensemble of Bayesian classifiers based on Markov chains [12]. Recently, Chen and Li used the measure of diversity and the increment of diversity to predict the subcellular location of apoptosis proteins [13,14]; the total prediction accuracies were 82.7% and 84.2%, respectively. These results were much better than those of other methods.

In this paper, we use the amino acid and dipeptide composition of a protein sequence to construct its feature vector, and use the K-Nearest Neighbor classifier to predict the subcellular location of apoptosis proteins. To allow comparison with other approaches, the dataset we use is the same as that in [13]. With the jackknife test, the total prediction accuracy is 88.3%, which is higher than the results of [13] and [14]. To further validate the feasibility of our method, we do the same work on a new, expanded dataset of 1551 apoptosis proteins, where the total prediction accuracy is 78.3%. These results show that the protein sequence features constructed in our method, combined with the K-Nearest Neighbor classifier, are very useful for predicting the subcellular location of apoptosis proteins.


Our approach can play a complementary role in determining the subcellular location of apoptosis proteins.

International Journal of Digital Content Technology and its Applications Volume 4, Number 1, February 2010

2. Material and Method

2.1. Data sets

Dataset 1 (denoted CL317) is provided by Chen and Li [13]. It was generated from 846 apoptosis proteins in SWISS-PROT (version 49.0) (www.ebi.ac.uk/swissprot) [15] by selecting sequences longer than 80 amino acids that have a single, definite subcellular location. The dataset contains 317 apoptosis proteins classified into six subcellular locations (see Table 1).

Table 1. Number of proteins within each subcellular localization category for the 317 apoptosis proteins

Subcellular localization    Number of proteins
Cytoplasmic                 112
Membrane                    55
Mitochondrial               34
Secreted                    17
Nuclear                     52
Endoplasmic reticulum       47
Sum                         317

Dataset 2 is generated from apoptosis proteins in SWISS-PROT (version 56.9) (www.ebi.ac.uk/swissprot) [15]. First, only entries annotated with “apoptosis” in the ID (identification) field are collected. Then sequences annotated with ambiguous or uncertain words, such as “potential”, “probable”, “probably”, “maybe”, and “by similarity”, are excluded, and proteins tagged with two or more different subcellular locations are also removed. Sequences with fewer than 80 amino acid residues and sequences annotated with “fragment” are removed because they might be just fragments. To avoid homology bias, we reduce sequence redundancy to 90%. To keep the data balanced, only subcellular locations with more than 60 proteins are retained. Finally, we obtain 1551 apoptosis proteins classified into four subcellular locations: (1) 305 cytoplasmic proteins, (2) 550 membrane proteins, (3) 228 secreted proteins, and (4) 468 nuclear proteins (see Table 2).

Table 2. Number of proteins within each subcellular localization category for the 1551 apoptosis proteins

Subcellular localization    Number of proteins
Cytoplasmic                 305
Membrane                    550
Secreted                    228
Nucleus                     468
Sum                         1551

2.2. Protein feature

In our research, every protein is represented as a point (a vector) in a 36-D space. The first 20 components of this vector are the occurrence frequencies of the 20 amino acids in the protein sequence, and the last 16 components are the occurrence frequencies of the 16 dipeptide categories in the sequence. These dipeptide categories are generated by cluster analysis according to the hydrophobicity of dipeptides; the classification is given in Table 3. Hence all proteins have their representative points in the 36-dimensional Hilbert space.

Table 3. The classification of dipeptides

tt(1)   VV-VL-VI-VF-VM-VC-VY-LV-LL-LI-LF-LM-LC-LY-IV-IL-II-IF-IM-IC-IY-FV-FL-FI-FF-FM-FC-FY-MV-ML-MI-MF-MM-MC-MY-CV-CL-CI-CF-CM-CC-CY-YV-YL-YI-YF-YM-YC-YY
tt(2)   AV-AL-AI-AF-AM-AC-AY-PV-PL-PI-PF-PM-PC-PY-SV-SL-SI-SF-SM-SC-SY-TV-TL-TI-TF-TM-TC-TY-QV-QL-QI-QF-QM-QC-QY-GV-GL-GI-GF-GM-GC-GY-HV-HL-HI-HF-HM-HC-HY-NV-NL-NI-NF-NM-NC-NY
tt(3)   AA-AP-AS-AT-AQ-AG-AH-AN-PA-PP-PS-PT-PQ-PG-PH-PN-SA-SP-SS-ST-SQ-SG-SH-SN-TA-TP-TS-TT-TQ-TG-TH-TN-QA-QP-QS-QT-QQ-QG-QH-QN-GA-GP-GS-GT-GQ-GG-GH-GN-HA-HP-HS-HT-HQ-HG-HH-HN-NA-NP-NS-NT-NQ-NG-NH-NN
tt(4)   VA-VP-VS-VT-VQ-VG-VH-VN-LA-LP-LS-LT-LQ-LG-LH-LN-IA-IP-IS-IT-IQ-IG-IH-IN-FA-FP-FS-FT-FQ-FG-FH-FN-MA-MP-MS-MT-MQ-MG-MH-MN-CA-CP-CS-CT-CQ-CG-CH-CN-YA-YP-YS-YT-YQ-YG-YH-YN
tt(5)   AW-PW-SW-TW-QW-GW-HW-NW
tt(6)   VW-LW-IW-FW-MW-CW-YW
tt(7)   RA-RP-RS-RT-RQ-RG-RH-RN-KA-KP-KS-KT-KQ-KG-KH-KN-DA-DP-DS-DT-DQ-DG-DH-DN-EA-EP-ES-ET-EQ-EG-EH-EN
tt(8)   RV-RL-RI-RF-RM-RC-RY-KV-KL-KI-KF-KM-KC-KY-DV-DL-DI-DF-DM-DC-DY-EV-EL-EI-EF-EM-EC-EY
tt(9)   WA-WP-WS-WT-WQ-WG-WH-WN
tt(10)  WV-WL-WI-WF-WM-WC-WY
tt(11)  AR-AK-AD-AE-PR-PK-PD-PE-SR-SK-SD-SE-TR-TK-TD-TE-QR-QK-QD-QE-GR-GK-GD-GE-HR-HK-HD-HE-NR-NK-ND-NE
tt(12)  VR-VK-VD-VE-LR-LK-LD-LE-IR-IK-ID-IE-FR-FK-FD-FE-MR-MK-MD-ME-CR-CK-CD-CE-YR-YK-YD-YE
tt(13)  WR-WK-WD-WE
tt(14)  WW
tt(15)  RW-KW-DW-EW
tt(16)  RR-RK-RD-RE-KR-KK-KD-KE-DR-DK-DD-DE-ER-EK-ED-EE

Here we define the distance between two feature vectors of protein sequences, $u = (u_1, u_2, \ldots, u_n)$ and $v = (v_1, v_2, \ldots, v_n)$, as in Eqs. (2)-(3).
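As a rough illustration of the 36-D representation in Section 2.2, the following Python sketch computes the 20 amino acid frequencies followed by the 16 dipeptide-category frequencies. The four residue groups are read off Table 3, where each category tt(1) to tt(16) is the product of two groups. The group labels H, N, C and W, the alphabetical ordering of the first 20 components, the normalization by sequence length and by the number of dipeptides, and the skipping of non-standard residues are all assumptions made here, since the paper does not specify them.

```python
# Assumed ordering of the first 20 components (alphabetical one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Residue groups read off Table 3; the group letters are labels chosen here.
GROUPS = {"H": "VLIFMCY", "N": "APSTQGHN", "C": "RKDE", "W": "W"}
GROUP_OF = {aa: g for g, members in GROUPS.items() for aa in members}

# (group of first residue) + (group of second residue), in the order tt(1)..tt(16).
CATEGORY_ORDER = ["HH", "NH", "NN", "HN", "NW", "HW", "CN", "CH",
                  "WN", "WH", "NC", "HC", "WC", "WW", "CW", "CC"]
CATEGORY_INDEX = {pair: i for i, pair in enumerate(CATEGORY_ORDER)}


def feature_vector(sequence):
    """Return 20 amino acid frequencies followed by 16 dipeptide-category frequencies."""
    # Assumption: residues outside the 20 standard amino acids are dropped.
    seq = [aa for aa in sequence.upper() if aa in GROUP_OF]
    if len(seq) < 2:
        raise ValueError("need at least two standard residues")

    # Amino acid composition, normalized by sequence length (assumed).
    aa_freq = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

    # Count each adjacent residue pair under its Table 3 category.
    cat_counts = [0] * 16
    for first, second in zip(seq, seq[1:]):
        cat_counts[CATEGORY_INDEX[GROUP_OF[first] + GROUP_OF[second]]] += 1

    # Category composition, normalized by the number of dipeptides (assumed).
    n_dipeptides = len(seq) - 1
    return aa_freq + [count / n_dipeptides for count in cat_counts]
```

For example, `feature_vector("VLAW")` has length 36, and its three dipeptides VL, LA and AW fall into tt(1), tt(4) and tt(5), respectively.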

2.3. K-Nearest Neighbor classifier

The K-Nearest Neighbor (K-NN) classifier is quite popular in pattern recognition due to its good performance and ease of use [16,17,18,19]. The K-NN rule, also called the “voting K-NN rule”, can be simply expressed as follows. Suppose $x = (x_1, x_2, \ldots, x_n)$ is the feature vector of a query protein. The distances from it to all stored vectors (known protein sequences) are computed, and the $k$ closest protein sequences are selected; $k_i$ is the number of these $k$ closest protein sequences that belong to the class $\omega_i$, $i = 1, 2, \ldots, p$. Define the discriminant function as Eq. (1):

$$g_i(x) = k_i, \quad i = 1, 2, \ldots, p \qquad (1)$$

If $g_j(x) = \max_i k_i$, then $x \in \omega_j$.

Usually, $k$ is the number of nearest proteins counted for the query protein. The best value of $k$ depends on the dataset: a larger $k$ reduces the effect of noise on the classification but makes the boundaries between classes less distinct. The value of $k$ can be selected by various heuristic techniques such as cross-validation.

The distance between two feature vectors of protein sequences, $u = (u_1, u_2, \ldots, u_n)$ and $v = (v_1, v_2, \ldots, v_n)$, is defined by Eqs. (2)-(3):

$$\mathrm{Distance}(u, v) = \sum_{i=1}^{n} f(u_i, v_i) \qquad (2)$$

where

$$f(u_i, v_i) = \begin{cases} (u_i - v_i)^2 & \text{if } u_i \neq v_i \\ 1 & \text{otherwise} \end{cases} \qquad (3)$$

2.4. Evaluation of the performance

Here we calculate the total prediction accuracy (Ac), sensitivity (Sn), and Matthews correlation coefficient (MCC) to evaluate the final performance of the predictive method. Sensitivity shows the correct prediction rate for each location, and the Matthews correlation coefficient reflects the overall performance of the prediction algorithm. They are calculated by Eqs. (4)-(6):

$$Ac = \sum_{i=1}^{k} p(i) / N \qquad (4)$$

$$Sn(i) = p(i) / obs(i) \qquad (5)$$

$$MCC(i) = \frac{p(i)\, n(i) - u(i)\, o(i)}{\sqrt{(p(i) + u(i))\,(p(i) + o(i))\,(n(i) + u(i))\,(n(i) + o(i))}} \qquad (6)$$

where $N$ is the total number of sequences, $k$ is the number of classes, $obs(i)$ is the number of sequences observed in localization $i$, $p(i)$ is the number of correctly predicted sequences of localization $i$, $n(i)$ is the number of correctly predicted sequences not of localization $i$, $u(i)$ is the number of under-predicted sequences of localization $i$, and $o(i)$ is the number of over-predicted sequences of localization $i$.

2.5. Performance test

The jackknife test is one of the cross-validation tests. During the test, the subcellular location of each protein sequence is identified using all the other proteins except the one being identified. Among the cross-validation tests, the jackknife test is thought to be the most rigorous and objective [13]. In this paper, we use the jackknife test to validate the effectiveness of the predictive method.
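The classifier, distance, jackknife test and evaluation measures of Sections 2.3 to 2.5 can be sketched together in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `equal_term` parameter (its default of 1 follows Eq. (3) as printed, while 0 would give the plain squared Euclidean distance), the tie-breaking in the vote, and the choice to return an MCC of 0 when the denominator vanishes are all introduced here for illustration.

```python
from collections import Counter
from math import sqrt


def distance(u, v, equal_term=1.0):
    """Eqs. (2)-(3): squared difference where components differ, `equal_term` otherwise."""
    return sum((a - b) ** 2 if a != b else equal_term for a, b in zip(u, v))


def knn_predict(x, train_X, train_y, k=1):
    """Voting K-NN rule of Eq. (1): the class holding most of the k nearest neighbours wins."""
    nearest = sorted(range(len(train_X)), key=lambda j: distance(x, train_X[j]))[:k]
    votes = Counter(train_y[j] for j in nearest)
    # Assumption: ties go to the class encountered first among the nearest neighbours.
    return votes.most_common(1)[0][0]


def jackknife(X, y, k=1):
    """Jackknife (leave-one-out) test: predict each sample from all the others."""
    return [knn_predict(X[i], X[:i] + X[i + 1:], y[:i] + y[i + 1:], k)
            for i in range(len(X))]


def evaluate(y_true, y_pred):
    """Return (Ac, {class: (Sn, MCC)}) following Eqs. (4)-(6)."""
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == q for t, q in pairs) / total
    per_class = {}
    for c in sorted(set(y_true)):
        p = sum(t == c and q == c for t, q in pairs)  # p(i): correctly predicted as c
        u = sum(t == c and q != c for t, q in pairs)  # u(i): under-predicted
        o = sum(t != c and q == c for t, q in pairs)  # o(i): over-predicted
        n = total - p - u - o                         # n(i): correctly predicted as not c
        denom = sqrt((p + u) * (p + o) * (n + u) * (n + o))
        mcc = (p * n - u * o) / denom if denom else 0.0  # assumed fallback value
        per_class[c] = (p / (p + u), mcc)             # obs(i) = p(i) + u(i)
    return accuracy, per_class
```

Here `X` is a list of feature vectors (lists of floats) and `y` the matching list of location labels; `evaluate(y, jackknife(X, y, k=1))` then gives the jackknife accuracy together with the per-location sensitivity and MCC.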

3. Results and discussion

The results of our method are listed in Table 4 and Table 5.

Table 4. Comparison of prediction performance for different methods on the 317 apoptosis proteins dataset with jackknife test

Location   Diversity increment (a)   Diversity increment (b)   Our method (k = 1)
           Sn (%)   MCC              Sn (%)   MCC              Sn (%)   MCC
Cy         81.3     0.80             91.1     0.8              92.9     0.84
Me         81.8     0.77             89.1     0.83             87.3     0.87
Mi         85.3     0.74             79.4     0.77             79.4     0.83
Se         88.2     0.68             58.8     0.65             76.5     0.87
Nu         82.7     0.73             73.1     0.69             86.5     0.84
En         83.0     0.90             87.2     0.91             91.5     0.87
Ac (%)     82.7                      84.2                      88.3

(a) From [13], using diversity increment.
(b) From [14], using diversity increment.

Table 5. Performance of the method on the 1551 apoptosis proteins dataset with jackknife test

Location      Sn (%)   MCC
Cytoplasmic   70.8     0.60
Membrane      83.8     0.76
Secreted      65.8     0.71
Nucleus       82.7     0.71
Ac (%)        78.3

From Table 4, we can see that for dataset CL317 the sensitivity and Matthews correlation coefficient are 92.9% and 0.84 for cytoplasmic proteins, 87.3% and 0.87 for membrane proteins, 79.4% and 0.83 for mitochondrial proteins, 76.5% and 0.87 for secreted proteins, 86.5% and 0.84 for nuclear proteins, and 91.5% and 0.87 for endoplasmic reticulum proteins, respectively. The overall prediction accuracy is 88.3%. Compared with the result of [14], our method obtains better Matthews correlation coefficients for all locations except the endoplasmic reticulum, and its total prediction accuracy is more than 4% higher.

Table 5 shows that the correct prediction rates are 70.8% for cytoplasmic proteins, 83.8% for membrane proteins, 65.8% for secreted proteins, and 82.7% for nuclear proteins. The total prediction accuracy is 78.3%, which means that our method correctly identifies 216 cytoplasmic proteins, 461 membrane proteins, 150 secreted proteins, and 387 nuclear proteins. The result is satisfactory, although the prediction accuracy on Dataset 2 is nearly 10% lower than on Dataset 1; we attribute this to the much larger size of Dataset 2, which reduces the similarity of the protein sequences.

For predicting the subcellular location of an apoptosis protein, it is very important to select a reasonable set of informative features from the protein sequence, since different feature extraction methods yield different prediction accuracies. Our prediction results show that the local compositions of amino acids and dipeptide compositions are very useful for predicting the subcellular location of apoptosis proteins. They also show that, with these sequence features, the K-Nearest Neighbor classifier is well suited to this task. These encouraging results indicate that our approach is effective and might be used to predict the subcellular location of other proteins.

4. Acknowledgments

The work was partly supported by the Youth Fund of the College of Science, Huazhong Agricultural University Research Launching Funds (06033) and the Discipline Crossing Research Foundation of Huazhong Agricultural University (2008XKJC006).

5. References

[1] Wyllie, A.H., Kerr, J.F., Currie, A.R. “Cell death: the significance of apoptosis”. Int. Rev. Cytol., Vol. 68, 1980, pp. 251–306.
[2] Raff, M. “Cell suicide for beginners”. Nature, Vol. 396, No. 3707, 1998, pp. 119–122.
[3] Suzuki, M., Youle, R.J., Tjandra, N. “Structure of Bax: coregulation of dimer formation and intracellular location”. Cell, Vol. 103, No. 4, 2000, pp. 645–654.
[4] Emanuelsson, O., Nielsen, H., Brunak, S., Heijne, G. “Predicting subcellular localization of proteins based on their N-terminal amino acid sequence”. J. Mol. Biol., Vol. 300, No. 4, 2000, pp. 1005–1016.
[5] Zhou, G.P., Doctor, K. “Subcellular location prediction of apoptosis proteins”. Proteins: Struct. Funct. Genet., Vol. 50, No. 2, 2003, pp. 44–48.
[6] Chou, K.C. “Prediction of protein cellular attributes using pseudo-amino acid composition”. Proteins: Struct. Funct. Genet., Vol. 43, No. 3, 2001, pp. 246–255.
[7] Chou, K.C., Cai, Y.D. “Using functional domain composition and support vector machines for prediction of protein subcellular location”. J. Biol. Chem., Vol. 227, No. 48, 2002, pp. 45765–45769.
[8] Chou, K.C., Cai, Y.D. “Predicting subcellular localization of proteins by hybridizing functional domain composition and pseudo-amino acid composition”. J. Cell. Biochem., Vol. 91, No. 3, 2004, pp. 1197–1203.
[9] Feng, Z.P. “Prediction of the subcellular location of prokaryotic proteins based on a new representation of the amino acid composition”. Biopolymers, Vol. 58, No. 4, 2001, pp. 491–499.
[10] Xiao, X., Shao, S., Ding, Y., Huang, Z., Huang, Y., Chou, K.C. “Using complexity measure factor to predict protein subcellular location”. Amino Acids, Vol. 28, No. 1, 2005, pp. 57–61.
[11] Zhang, S.W., Pan, Q., Zhang, H.C., Shao, Z.C., Shi, J.Y. “Prediction of protein homo-oligomer types by pseudo amino acid composition: approached with an improved feature extraction and Naive Bayes feature fusion”. Amino Acids, Vol. 30, No. 4, 2006, pp. 461–468.
[12] Bulashevska, A., Eils, R. “Predicting protein subcellular locations using hierarchical ensemble of Bayesian classifiers based on Markov chains”. BMC Bioinformatics, Vol. 7, No. 1, 2006, pp. 298.
[13] Chen, Y.L., Li, Q.Z. “Prediction of the subcellular location of apoptosis proteins”. Journal of Theoretical Biology, Vol. 245, No. 4, 2007, pp. 775–783.
[14] Chen, Y.L., Li, Q.Z., Yang, K.L., Fan, G.L. “Predicting of the subcellular location of apoptosis proteins using the algorithm of the increment of diversity combined with support vector machine”. Acta Biophysica Sinica, Vol. 23, No. 3, 2007, pp. 192–197.
[15] Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O’Donovan, C., Phan, I., Pilbout, S., Schneider, M. “The SWISS-PROT protein knowledgebase and its supplement TrEMBL”. Nucleic Acids Res., Vol. 31, No. 1, 2003, pp. 365–370.
[16] Cover, T.M., Hart, P.E. “Nearest neighbour pattern classification”. IEEE Trans. Inform. Theory, Vol. IT-13, No. 1, 1967, pp. 21–27.
[17] Denoeux, T. “A k-nearest neighbor classification rule based on Dempster-Shafer theory”. IEEE Trans. Syst. Man. Cybernet., Vol. 25, No. 5, 1995, pp. 804–813.
[18] Keller, J.M., Gray, M.R., Givens, J.A. “A fuzzy k-nearest neighbours algorithm”. IEEE Trans. Syst. Man. Cybernet., Vol. 15, No. 4, 1985, pp. 580–585.
[19] Chou, K.C., Shen, H.B. “Predicting Eukaryotic Protein Subcellular Location by Fusing Optimized Evidence-Theoretic K-Nearest Neighbor Classifiers”. Journal of Proteome Research, Vol. 5, No. 8, 2006, pp. 1888–1897.
