Prediction of Secondary Structures of Proteins Using a Two-Stage ...

16th European Symposium on Computer Aided Process Engineering and 9th International Symposium on Process Systems Engineering W. Marquardt, C. Pantelides (Editors) © 2006 Published by Elsevier B.V.


Prediction of Secondary Structures of Proteins Using a Two-Stage Method
Metin Turkay, Ozlem Yilmaz and Fadime Uney Yuksektepe
College of Engineering, Koç University, Rumelifeneri Yolu, Sarıyer, 34450 İstanbul, TURKEY

Abstract
Protein structure determination and prediction have been a focal research subject in the life sciences because protein structure is central to understanding the biological and chemical activities in any organism. The experimental methods used to determine protein structures demand sophisticated equipment and time. To overcome these shortcomings, a host of algorithms that predict the location of secondary structure elements using statistical and computational methods have been developed. However, the prediction accuracies of these methods have rarely exceeded 70%. This paper presents a novel two-stage method that predicts the location of secondary structure elements in a protein using only primary structure data. In the first stage, the folding type of the protein is determined using a novel classification model for multi-class problems. The second stage uses data available in the Protein Data Bank and determines the likely location of secondary structure elements with a probabilistic search algorithm. The average accuracy of the predictions is shown to increase to 74.1%.
Keywords: Protein Structure, Data Classification, Mixed-Integer Linear Programming

1. Introduction
Proteins are large molecules indispensable for the existence and proper functioning of biological organisms. They form the structure of cells, which are the main constituents of larger formations such as tissues and organs. The bones, muscles, skin and hair of organisms are made up largely of proteins. Besides their structural role, proteins such as enzymes, hormones and antibodies are required for the proper functioning and regulation of organisms. Understanding the functions of proteins is crucial for the discovery of drugs to treat various diseases and disorders.
A protein molecule consists of one or more chains of amino acids, also called residues. A typical protein contains 200 to 300 amino acids, but a single chain may contain up to approximately 30,000. Proteins have four basic structural levels: primary, secondary, tertiary and quaternary structure. The primary structure is the sequence of amino acids that make up the protein. The secondary structure of a segment of a polypeptide chain is the local spatial arrangement of its main-chain atoms, without regard to the conformation of its side chains or to its relationship with other segments. It is the shape that amino acid sequences adopt because of interactions between different parts of the molecule. There are three main types of secondary structure: α-helices, β-sheets, and other structures that connect these, such as loops, turns or coils. α-helices are spiral strings formed by hydrogen bonds between the CO and NH groups of residues. β-sheets are flat strands formed by a stretched polypeptide backbone. Connecting structures do not have regular shapes; they connect α-helices and β-sheets to each other. Turns enable parts of the polypeptide chain to fold onto itself, reversing the direction of the chain to form its three-dimensional shape.
Proteins are classified according to their secondary structure content, considering α-helices and β-sheets. Levitt and Chothia[1] were the first to propose such a classification, with four basic types. “All-α” proteins consist almost entirely (at least 90%) of α-helices. “All-β” proteins are composed mostly (at least 90%) of β-sheets. Two intermediate classes contain both α-helices and β-sheets: “α/β” proteins have approximately alternating, mainly parallel segments of α-helices and β-sheets, while “α+β” proteins have a mixture of all-α and all-β regions, mostly in an antiparallel fashion.[2]
Because experimental methods for determining protein structures are a bottleneck, computational approaches to predict protein structures have been developed. All structure prediction methods rely on the idea that residue sequence and structure are correlated. Most methods that predict protein structure from residue sequence use information on known protein structures: databases are built and examined for relationships between amino acid sequence and protein structure. The first predictions were made in the 1970s, when only a few dozen structures were available. The structures of about 33,500 proteins have now been identified (as of November 2005), so a vast amount of data is available to support more reliable and more accurate predictions. Protein structures are stored in, and accessible from, the Protein Data Bank.[3]
Among the computational methods developed to predict protein structures, the most successful include neural network models, database search tools, multiple sequence alignment, local sequence alignment, threading, hidden Markov model-based prediction, nearest-neighbor methods, molecular dynamics simulation, and approaches that combine different prediction methods.
Neural networks are parallel, distributed information-processing structures; these methods address the problem by training the network.[4-6] The most successful ones are Copenhagen[7], PSI-BLAST[8], PHD[6,9] and SSpro[10]. The multiple sequence alignment method aligns the sequences so that each base in one sequence corresponds to bases in the other sequences, revealing similarity of genetic code, evolutionary history, and common biological functions. Consensus is one of the latest approaches using this method, with significant performance.[11] The local sequence alignment approach uses local pair-wise alignment of the sequence to be predicted; the most significant method developed with this approach is PREDATOR.[12] Threading maps the unknown structure to the most similar known sequence.[13] Hidden Markov Model-Based Prediction of Secondary Structure (HMMSTR) considers the similarity of the unknown protein to segments of known structures.[14] Nearest-neighbor methods match segments of the sequence with segments in a database of known structures and make a prediction based on the observed secondary structures of the best matches.[15-18]
Two combined approaches are significant: the combination of the GOR algorithm with multiple sequence alignment, and the combination of nearest-neighbor algorithms with multiple sequence alignment. The first[19] starts by selecting a set of 12 proteins with well-determined structures, none of which belongs to, or has sequence identity with, any protein in the databank of the GOR program. Multiple sequence alignment of these proteins is then carried out, and the results are the inputs to the GOR algorithm.
In the combination of nearest-neighbor algorithms and multiple sequence alignment, a scoring system that considers a sequence-similarity matrix, a local structural environment scoring scheme, and the N- and C-terminal positions of secondary structure types is used with a restricted database consisting of a small subset of similar proteins. The nearest-neighbor part is followed by multiple sequence alignments.[20]
Various methods for predicting the secondary structure of globular proteins were compared on 195 proteins, and their three-state per-residue accuracies (Q3: helix, sheet or other) were measured.[11] The prediction accuracies of these methods are given in Table 1.
Table 1: Average three-state accuracy indices calculated for six prediction algorithms based on 396 proteins[11].

Method           Q3 (%)
PHD[6,9]         71.950
NNSSP[18]        71.400
DSC[22]          68.413
PREDATOR[12]     68.602
ZPRED[23]        59.637
Consensus[11]    72.707
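The Q3 index reported in Table 1 is the percentage of residues whose predicted state (helix, sheet or other) agrees with the observed state. A minimal Python sketch of that calculation is given here; the function name `q3_accuracy`, the one-letter state codes H, E and C, and the example strings are illustrative assumptions, not notation or data from the cited evaluation.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the three-state per-residue accuracy (Q3) in percent.

    Both arguments are strings with one state code per residue.
    The codes H (helix), E (sheet) and C (other) are an assumed convention.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("at least one residue is required")
    # Count the residues whose predicted state equals the observed state.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 10-residue example: 8 of the 10 states agree.
observed = "HHHHCCEEEC"
predicted = "HHHCCCEEEE"
print(q3_accuracy(predicted, observed))
```

An average Q3 over a test set can be taken per protein or per residue; which of the two a given table uses has to be checked against its source.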

In this paper, a two-stage algorithm for the secondary structure prediction of proteins is presented. The algorithm takes a probabilistic approach that uses data on all structurally identified proteins having the same folding type as the new, unknown protein. The first stage determines the class of the unknown protein. This is accomplished by solving a mixed-integer linear programming (MILP) problem, with 100% accuracy. The objective of the first stage is to resolve some of the uncertainty in the protein structure by determining the folding type accurately. The second stage decomposes the amino acid sequence into overlapping sequential groups of 3 to 7 residues. A local database is formed for each folding type by extracting structural data from PDB files. After the matrix of frequencies of occurrence of each sequential group of the new residue chain is generated, the probabilities of being in an α-helix, a β-sheet or a connecting structure are calculated for each residue. The structure with the maximum probability is accepted as the structure of that residue.

2. The Two-Stage Method
The two-stage algorithm decomposes the secondary structure prediction problem into two steps: first the overall folding type of the protein is predicted, and then the secondary structure is predicted using the refined statistical data from the first step.
2.1. Prediction of Folding Type
The overall folding type of a protein depends on its amino acid composition.[24] Several methods have been developed that exploit this relationship to predict the folding type of proteins.[25-27] These methods use statistical analysis to separate multi-dimensional amino acid composition data into several folding types. The prediction of protein folding type is a typical multi-class data classification problem. Classification of multi-dimensional data plays an important role in determining the main characteristics of a set. Support vector machines are a data mining method for classifying data into different groups.[27] Although this method can be efficient in classifying data into two groups, it is inaccurate and inefficient when the data must be classified into more than two sets. Mixed-integer programming allows the use of hyper-boxes to define the boundaries of the sets, each box including all or some of the points in that set. The efficiency and accuracy of multi-class data classification can therefore be improved significantly compared to traditional methods.[28-29] Another approach is to define piecewise-linear functions that separate the data belonging to different classes from each other.[30] The main differences between these three approaches are illustrated in Figure 1.
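To make the hyper-box idea concrete, the sketch below computes an amino acid composition vector and tests whether a point falls inside axis-aligned boxes assigned to each class. It is only an illustration under stated assumptions: the function names, the two-dimensional toy boxes and their bounds are hypothetical, the MILP that would determine the box bounds from training data[28-29] is not shown, and returning `None` for a point outside every box is a placeholder, since the handling of such points is not described here.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

Box = Tuple[Sequence[float], Sequence[float]]  # (lower bounds, upper bounds)


def composition(sequence: str) -> List[float]:
    """Fraction of each of the 20 standard amino acids in a sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def in_box(point: Sequence[float], box: Box) -> bool:
    """True if every coordinate lies between the box's lower and upper bound."""
    lower, upper = box
    return all(lo <= x <= hi for x, lo, hi in zip(point, lower, upper))


def classify(point: Sequence[float], boxes: Dict[str, List[Box]]) -> Optional[str]:
    """Return the first class that has a box containing the point, else None."""
    for label, class_boxes in boxes.items():
        if any(in_box(point, box) for box in class_boxes):
            return label
    return None  # placeholder: the point lies outside every box


# Hypothetical two-dimensional boxes (a real model would use 20 dimensions).
toy_boxes = {
    "all-alpha": [([0.0, 0.0], [0.4, 0.4])],
    "all-beta": [([0.6, 0.6], [1.0, 1.0])],
}
print(classify([0.2, 0.3], toy_boxes))
print(classify([0.5, 0.5], toy_boxes))
```

A class may own several boxes, which is why each label maps to a list; this mirrors the statement that a box may include all or only some of the points of a set.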

Figure 1. Three approaches to multi-class data classification problems: (a) support vector machines, (b) MILP using hyper-boxes, (c) piecewise-linear functions.
The protein folding type prediction problem is considered in two parts: training and testing. The objective of the training part is to determine the characteristics of the proteins that belong to a certain class and to differentiate them from the data points that belong to other classes. Once the distinguishing characteristics of the classes are determined, the effectiveness of the prediction must be tested. The prediction accuracies of different methods for the data set given in Chou[25] are summarized in Table 2.
Table 2: Average prediction accuracies with different methods for the folding type problem.

Method                            all-α    all-β    α+β      α/β      Overall
SVD[26]                           66.7%    90.1%    81%      66.7%    81%
NN[31]                            68.6%    85.2%    86.4%    56.9%    74.7%
SVM[27]                           74.3%    82%      87.7%    72.3%    79.4%
CC[32]                            84.3%    82%      81.5%    67.7%    79.1%
Hyper-boxes[29]                   87.5%    85.7%    91.3%    50%      83.3%
Piecewise-Linear Functions[30]    100%     100%     100%     100%     100%

2.2. Prediction of the Secondary Structure
The algorithm searches for segments of the protein's residue sequence in a pool of known protein structures and predicts the structure of each residue on the basis of frequency of occurrence. In choosing the number of residues in each segment to be searched, the segment should be long enough for the search to be meaningful, given the interactions and bonds that form between amino acids and shape their structures. Every segment is searched in the relevant database, whose dimensions were stated in the previous section. The algorithm considers segments of this chain that are 3 to 7 residues long, as illustrated on a sample primary sequence in Figure 2.
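The decomposition into 3- to 7-residue segments can be sketched in a few lines of Python. The function name `decompose` and its `sliding` switch are assumptions made for illustration: Figure 2 appears to show consecutive segments for each length (a new segment starts where the previous one ends), while a sliding window that advances one residue at a time is another plausible reading, so both are offered.

```python
from typing import Dict, List, Tuple


def decompose(sequence: str, k_min: int = 3, k_max: int = 7,
              sliding: bool = True) -> Dict[int, List[Tuple[int, str]]]:
    """Split a residue sequence into k-residue segments for k = k_min..k_max.

    Returns {k: [(start_index, segment), ...]}.
    sliding=True  advances one residue at a time (assumed reading).
    sliding=False places consecutive segments end to end, as Figure 2
    appears to show; trailing residues that do not fill a segment are dropped.
    """
    segments: Dict[int, List[Tuple[int, str]]] = {}
    for k in range(k_min, k_max + 1):
        step = 1 if sliding else k
        segments[k] = [
            (start, sequence[start:start + k])
            for start in range(0, len(sequence) - k + 1, step)
        ]
    return segments


# Sample primary sequence from Figure 2 (24 residues).
sample = "akvraqhsyaftqklmsrfhnaty"

tiled = decompose(sample, sliding=False)
print([segment for _, segment in tiled[3]])

slid = decompose(sample)
print(len(slid[3]), slid[3][:2])
```

Keeping the start index with each segment makes it straightforward to map database hits for a segment back to the individual residues it covers.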


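[Figure 2: the sample sequence a k v r a q h s y a f t q k l m s r f h n a t y ... split into segments of each length: 3 residues (akv, raq, hsy, aft, qkl, msr, fhn, aty), 4 residues (akvr, aqhs, yaft, qklm, srfh, naty), 5 residues (akvra, qhsya, ftqkl, msrfh, ...), 6 residues (akvraq, hsyaft, qklmsr, fhnaty), 7 residues (akvraqh, syaftqk, lmsrfhn, ...)]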

Figure 2. Representation of overlapping 3-, 4-, 5-, 6-, and 7-residue segments.
Then, for each segment length, the probability that a particular residue is in a given secondary structure element is calculated according to its folding type:

$$P_{ijk} = \frac{r_{ijk}}{t_{ik}} \qquad (1)$$

where $P_{ijk}$ is the probability of residue $i$ being of structure type $j$ in $k$-residue segments, $r_{ijk}$ is the total number of occurrences of residue $i$ in structure type $j$ in $k$-residue segments, and $t_{ik}$ is the total number of occurrences of residue $i$ in $k$-residue segments. The ultimate probability, $Q_{ij}$, for residue $i$ to be in structure type $j$ is then calculated as

$$Q_{ij} = \sum_{k=3}^{7} w_k P_{ijk} \qquad (2)$$
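Eqs. (1) and (2), followed by the choice of the structure type with the highest ultimate probability, can be sketched as below for a single residue. The counts and the uniform weights are made-up numbers for illustration only; the function names are hypothetical, and treating a segment length with no database hits as contributing zero is an assumption that the text does not state.

```python
STRUCTURES = ("helix", "sheet", "other")  # structure types j
SEGMENT_LENGTHS = (3, 4, 5, 6, 7)         # segment lengths k


def segment_probabilities(r, t):
    """Eq. (1) for one residue i: P[j][k] = r[j][k] / t[k].

    r[j][k] is the number of occurrences in structure type j for k-residue
    segments; t[k] is the total number of occurrences for k-residue segments.
    A length with t[k] == 0 contributes 0.0 (assumed convention).
    """
    return {
        j: {k: (r[j][k] / t[k] if t[k] else 0.0) for k in SEGMENT_LENGTHS}
        for j in STRUCTURES
    }


def ultimate_probabilities(p, w):
    """Eq. (2): Q[j] = sum over k = 3..7 of w[k] * P[j][k]."""
    return {j: sum(w[k] * p[j][k] for k in SEGMENT_LENGTHS) for j in STRUCTURES}


def predict(q):
    """Select the structure type with the highest ultimate probability."""
    return max(q, key=q.get)


# Hypothetical counts for one residue.
t = {3: 10, 4: 4, 5: 2, 6: 0, 7: 0}
r = {
    "helix": {3: 6, 4: 3, 5: 2, 6: 0, 7: 0},
    "sheet": {3: 3, 4: 1, 5: 0, 6: 0, 7: 0},
    "other": {3: 1, 4: 0, 5: 0, 6: 0, 7: 0},
}
w = {k: 0.2 for k in SEGMENT_LENGTHS}  # placeholder weights, not fitted values

p = segment_probabilities(r, t)
q = ultimate_probabilities(p, w)
print(q, predict(q))
```

With these numbers the helix terms are 0.6, 0.75 and 1.0 for k = 3, 4 and 5, so the helix score is 0.2 × 2.35 = 0.47 and the residue is labeled helix. Fitted weights would replace the uniform placeholder.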

The weights, $w_k$, are determined for each folding type by least-squares regression using the data available in the SCOP[21] database.[33] The three ultimate probabilities calculated with Eq. (2) are compared, and the secondary structure type with the highest ultimate probability is selected as the secondary structure of the residue. The results of the algorithm are given in Table 3.
Table 3: Results of secondary structure prediction.

Folding type    # f      # f (i)    # f     # f (i, id)    Q3 (%)
All-α           2000     41139      419     203302         80.5
All-β           2766     67373      579     313844         72.4
α/β             3375     100255     707     505693         71.9
α+β             2866     62935      601     279355         75.5
Total           11007    271702     2306    1302194        74.1

3. Conclusions
A novel two-stage method to predict the secondary structure of proteins is presented in this paper. The objective of the first stage is to resolve some of the uncertainty in the protein structure by determining the folding type accurately. The second stage decomposes the amino acid sequence into overlapping sequential groups of 3 to 7 residues and calculates the probability that a particular residue is in each secondary structure type. The two-stage method is shown to perform better than the state-of-the-art general methods for globular proteins, with an average Q3 of 74.1% compared with at most 72.7% for the methods in Table 1.

References
[1] Levitt, M., and Chothia, C. (1976), Structural patterns in globular proteins, Nature 261, 552-558.
[2] Mount, D. W. (2001), Bioinformatics: Sequence & Genome Analysis, Cold Spring Harbor Laboratory Press, Woodbury, New York.
[3] Protein Data Bank, http://www.pdb.org/.
[4] Bohr, H., Bohr, J., Brunak, S., Cotterill, R. M. J., Fredholm, H., Lautrup, B., and Petersen, S. B. (1990), A novel approach to prediction of the 3-dimensional structures of protein backbones by neural networks, FEBS Letters 261, 43-46.
[5] Cai, Y. D., Liu, X. J., Xu, X. B., and Chou, K. C. (2002), Artificial neural network method for predicting protein secondary structure content, Computers and Chemistry 26, 347-350.
[6] Rost, B. (2001), Review: Protein secondary structure prediction continues to rise, Journal of Structural Biology 134, 204-218.
[7] Petersen, T. N., Lundegaard, C., Nielsen, M., Bohr, H., Bohr, J., Brunak, S., Gippert, G. P., and Lund, O. (2000), Prediction of protein secondary structure at 80% accuracy, Proteins 41, 17-20.
[8] Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. (1997), Gapped BLAST and PSI-BLAST: A new generation of protein database search programs, Nucleic Acids Res. 25, 3389-3402.
[9] Rost, B., and Sander, C. (1993), Prediction of protein secondary structure at better than 70% accuracy, J. Mol. Biol. 232, 584-599.
[10] Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999), Exploiting the past and the future in protein secondary structure prediction, Bioinformatics 15, 937-946.
[11] Cuff, J. A., and Barton, G. J. (1999), Evaluation and improvement of multiple sequence methods for protein secondary structure prediction, Proteins 34, 508-519.
[12] Frishman, D., and Argos, P. (1997), Seventy-five percent accuracy in secondary structure prediction, Proteins: Structure, Function, and Genetics 27, 329-335.
[13] Thiele, R., Zimmer, R., and Lengauer, T. (1999), Protein threading by recursive dynamic programming, Journal of Molecular Biology 290, 757-779.
[14] Bystroff, C., Thorsson, V., and Baker, D. (2000), HMMSTR: A hidden Markov model for local sequence-structure correlations in proteins, J. Mol. Biol. 301, 173-190.
[15] Sen, S. (2003), Statistical analysis of pair-wise compatibility of spatially nearest neighbor and adjacent residues in α-helix and β-strands: Application to a minimal model for secondary structure prediction, Biophysical Chemistry 103, 35-49.
[16] Westhead, D. R., and Thornton, J. M. (1998), Protein structure prediction, Current Opinion in Biotechnology 9, 383-389.
[17] Yi, T. M., and Lander, E. S. (1993), Protein Secondary Structure Prediction Using Nearest-neighbor Methods, Journal of Molecular Biology 232, 1117-1129.
[18] Salamov, A. A., and Solovyev, V. V. (1997), Protein secondary structure prediction using local alignments, J. Mol. Biol. 268, 31-36.
[19] Kloczkowski, A., Ting, K.-L., Jernigan, R. L., and Garnier, J. (2002), Protein secondary structure prediction based on the GOR algorithm incorporating multiple sequence alignment information, Polymer 43, 441-449.
[20] Salamov, A. A., and Solovyev, V. V. (1995), Prediction of Protein Secondary Structure by Combining Nearest-neighbor Algorithms and Multiple Sequence Alignments, J. Mol. Biol. 247, 11-15.
[21] SCOP, http://scop.mrc-lmb.cam.ac.uk/
[22] King, R. D., and Sternberg, M. J. E. (1996), Identification and application of the concepts important for accurate and reliable protein secondary structure prediction, Protein Science 5, 2298-2310.
[23] Zvelebil, M. J. J. M., Barton, G. J., Taylor, W. R., and Sternberg, M. J. E. (1987), Prediction of protein secondary structure and active sites using the alignment of homologous sequences, Journal of Molecular Biology 195, 957-961.
[24] Nakashima, H., Nishikawa, K., and Ooi, T. (1986), J. Biochem. 99, 152-162.
[25] Chou, K. C. (1995), Does the folding type of a protein depend on its amino acid composition?, FEBS Letters 363, 127-131.
[26] Bahar, I., Atilgan, A. R., Jernigan, R. L., and Erman, B. (1997), Understanding the Recognition of Protein Structural Classes by Amino Acid Composition, Proteins: Structure, Function, and Genetics 29, 172-185.
[27] Cai, Y. D., Liu, X. J., Xu, X. B., and Zhou, G. P. (2001), Support Vector Machines for predicting protein structural class, BMC Bioinformatics 2, 3.
[28] Uney, F., and Turkay, M. (2005), A Mixed-Integer Programming Approach to Multi-Class Data Classification Problem, European Journal of Operational Research, in print.
[29] Turkay, M., Uney, F., and Yilmaz, O. (2005), Prediction of Folding Type of Proteins Using Mixed-Integer Linear Programming, Computer-Aided Chem. Eng., vol 20A: ESCAPE-15, L. Puigjaner and A. Espuna (Eds.), 523-528, Elsevier, Amsterdam.
[30] Turkay, M., Bagirov, A., and Uney, F. (2005), Prediction of Folding Type of Proteins Using Piecewise-Linear Functions, manuscript under preparation.
[31] Cai, Y. D., and Zhou, G. P. (2000), Prediction of protein structural classes by neural network, Biochimie 82, 783-785.
[32] Chou, K. C., Liu, W. M., Maggiora, G. M., and Zhang, C. T. (1998), Prediction and classification of domain structural classes, Proteins: Structure, Function, and Genetics 31, 97-103.
[33] Yilmaz, O. (2003), A Two-stage mathematical programming algorithm for predicting secondary structures of proteins, MS Thesis, Koç University, Istanbul, Turkey.