Prediction of protein secondary structure from PDB structure information based on sequence segments homology searching
Shouji Tatsumoto1, Kenji Satou2,3 and Akihiko Konagaya1
1 Genomic Sciences Center (GSC), RIKEN (The Institute of Physical and Chemical Research), 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama City, Kanagawa 230-0045, JAPAN
2 School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Tatsunokuchi, Ishikawa 923-1292, JAPAN
3 Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Corporation (JST), 5-3 Yonban-cho, Chiyoda-ku, Tokyo 102-8666, JAPAN
Abstract
In this paper, a novel method to predict protein secondary structure (e.g., helix, beta-sheet and coil) is described. Our method predicts the secondary structure for a query sequence using a segment-wise similarity search, which finds the most probable secondary structure based on similarities between a set of sequence segments of a query sequence and our segment databases: the segment sequence DB and the segment structure DB. The important points concerning our system are: (i) the capability of visualizing the evidence for the prediction of a query sequence, and (ii) higher prediction accuracy for beta-sheets than that of existing methods. Since the existing test sets (e.g., the RS126 set) are not applicable to our system for performance evaluation, we used an original blind test set (similar to CASP) which included 355 non-homologous protein chains. Our system achieves a secondary structure prediction accuracy of 76.9%, which is up to 20% greater than that of other prediction methods.
Keywords: protein structure prediction / protein secondary structure / segment homology / similarity search
1 Introduction
One of the greatest challenges in sequence analysis is to predict accurately the secondary structure of a protein (helix, strand, and coil) from its amino acid chain. In recent years, feed-forward neural networks (NN) have often been used to predict protein secondary structures. The work by Qian and Sejnowski (1988) was one of the first applications of this machine learning technique. Their approach is based on a fully connected NN with a local input window of 13 amino acids with orthogonal encoding and a single hidden layer. This simple neural network can achieve a prediction accuracy of up to 65%. Increasing the size of the window, however, does not lead to improvements because of the overfitting problem associated with large networks. PHD (Rost and Sander 1993, 1994) uses a number of different machine learning techniques. It achieved a performance of Q3=70.8% (see “Materials and Methods”), while prediction of strands achieved an accuracy of 65.4%, on a non-homologous data set of 126 protein chains (the RS126 set), using cross validation and a prediction reliability index.
Several methods have been proposed for secondary structure prediction using machine learning techniques fed with profiles and alignments from PSI-BLAST (Altschul et al. 1997). Currently the best method achieves nearly 80% accuracy by using multiple NNs (Petersen et al. 2000). It is known that existing structure prediction methods tend to achieve high scores if sequences that are ‘globally’ similar to a query sequence are contained in the data set used for learning. For this reason, results in the Asilomar blind prediction competition CASP (Critical Assessment of Protein Structure Prediction) do not reach the 80% performance level (Arthur et al. 2001). Prediction accuracy can be improved by combining more than one prediction method. Cuff and Barton (1999) have combined several widely used prediction methods including PHD, NNSSP (Salamov and Solovyev 1995), DSC (King and Sternberg 1996), and PREDATOR (Frishman and Argos 1996).
In this paper, we describe a novel method to predict protein secondary structure. We focus on a local similarity search method using short segments of sequences. The proposed method divides protein sequences in the Protein Data Bank (PDB) into two types of short segments. A query sequence is divided into overlapping segments and is compared with the segment database (DB) using a standard homology search program, BLAST (Altschul et al. 1990). The usual assumption is that a given short segment of sequence is more likely to form one kind of secondary structure than another, with the three-dimensional structure having little influence on the formation of secondary structure. If this is true for some residues of a protein, then we should be able to predict their secondary structures with higher accuracy than those of other residues. Our method predicts the secondary structure for a query sequence using a segment-wise similarity search, which finds the most probable secondary structure based on similarities between a set of sequence segments of a query sequence and our segment DB. Our system achieves a secondary structure prediction accuracy of 76.9%, which is up to 20% greater than that of other prediction methods.
2 Materials and Methods
Assignment of secondary structure to particular amino acids is sometimes included in the PDB entry by the investigator who solved the three-dimensional structure. Since this is not always the case, three methods are currently used to define the secondary structures of a protein based on its atomic coordinates: DSSP (Kabsch and Sander 1983), STRIDE (Frishman and Argos 1995) and DEFINE (Richards and Kundrot 1988). DSSP defines the secondary structure, geometrical features, and solvent exposure of a query protein from atomic coordinates in the PDB format. It is currently the most widely used secondary structure definition method, and it can be used to correct erroneous and/or incomplete secondary structure information written in PDB entries. This was the method we used to refine the secondary structure of each protein chain in the PDB entries. DSSP provides an eight-state assignment of secondary structure, which we reduced to three states (corresponding to helix, strand, and coil). The mapping of states is as follows: (i) ‘H’, ‘G’ and ‘I’ are assigned to helix; (ii) ‘E’ and ‘B’ to strand; and (iii) all others to coil.
Among the several measures used for prediction accuracy, Q3 is still the most commonly used. Q3 is the fraction of residues predicted correctly over all three conformational states. The definition of Q3 is Q3 = (Pα+Pβ+Pcoil)/T, where Pα, Pβ and Pcoil are the numbers of residues predicted correctly as helix, strand and coil, respectively, and T is the total number of residues.
The whole data set of PDB entries (27723 chains; Dec. 31, 2001) was used to construct the segment DB of our method. We used all available information known about protein structures. PDB entries with incomplete structure information (e.g., chain breaks) were filtered out beforehand. The EMBL non-redundant PDB entries subset (Mar. 25, 2002) was used as the testing set in order to evaluate the accuracy of our method. The test set contains 1713 non-redundant protein chains (Hobohm et al. 1992). Entries were excluded when:
(i) the chain has fewer than 30 amino acids;
(ii) the chain has more than 1500 amino acids (such a chain almost always contains breaks).
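The three-state reduction and the Q3 measure defined in this section can be illustrated with a short Python sketch. The function names `reduce_dssp` and `q3`, and the use of the lowercase letters h, s and c for the three states, are assumptions made for illustration only; they are not taken from DSSP or from the implementation of our system.

```python
from collections import Counter

# Three-state alphabet (the letters are an illustrative choice).
HELIX, STRAND, COIL = "h", "s", "c"

# Mapping stated in the text: H, G, I -> helix; E, B -> strand; others -> coil.
DSSP_TO_3 = {"H": HELIX, "G": HELIX, "I": HELIX, "E": STRAND, "B": STRAND}


def reduce_dssp(dssp_states: str) -> str:
    """Reduce a string of DSSP eight-state codes to three states."""
    return "".join(DSSP_TO_3.get(code, COIL) for code in dssp_states)


def q3(observed: str, predicted: str) -> float:
    """Q3 = (P_helix + P_strand + P_coil) / T for two three-state strings."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    # Count correctly predicted residues separately for each state.
    correct = Counter(o for o, p in zip(observed, predicted) if o == p)
    return (correct[HELIX] + correct[STRAND] + correct[COIL]) / len(observed)


if __name__ == "__main__":
    observed = reduce_dssp("HHHGEEBTS ")
    print(observed, q3(observed, "hhhhsssccs"))
```

Any DSSP code outside the five listed letters, including a blank, falls through to coil, which matches rule (iii) above.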
Segment sequence DB
Searching for all possible segment patterns in the structure database would require an enormous number of proteins. Hence, our approach simplifies the problem by using a segment sequence DB in which all protein chains in the PDB are divided into segments of fixed size. The segments are non-redundant with respect to each other. The number of segments contained in the segment sequence DB is between 3,000 and 28,000.
1) Protein chains in the PDB contain many redundancies and repetitions, for instance multiply solved protein structures and an abundance of some folds (such as globins). To avoid these problems, we used the following approach:
if (chain < 400 residues) { width=60; step=30; # step 30 means: 30(->90), 60(->120), 90(->150), 120(->180) }
elsif (chain >= 400 residues) { width=100; step=50; # step 50 means: 400(->500), 450(->550), 500(->600), 550(->650) }
2) Representative chains with no similarity were selected from all chains by clustering with the BLAST homology search.
3) The representative chains were divided using a sliding window of 9 residues with a shift length of 3. For instance, the chain ‘MNPLSRPFARTPSLRTRV’ would be divided into ‘MNPLSRPFA’, ‘LSRPFARTP’, ‘PFARTPSLR’ and ‘RTPSLRTRV’.
4) Whenever two or more segments had sequence-structure similarity, they were merged as redundant.
5) Finally, the segment sequence DB was organized as text files in FASTA format and converted into binary databases for the BLAST homology search program.
Segment structure DB
Previous methods used to predict the secondary structure of an amino acid residue do not achieve high accuracy when the relevant amino acids are farther apart than the fixed sequence window. Since the segment sequence DB does not make use of any segmentation rule that is meaningful in biochemical terms, we made three types of secondary structure segments based on a completely different segmentation policy. With this approach, we expect to improve on the prediction accuracy obtained using the segment sequence DB. The number of segments it contains is between 200 and 3,500.
1) Same step as in the segment sequence DB.
2) All chains are divided into segments of three states (helix, strand and coil) using the secondary structure information from DSSP. Since segments of length 5 or shorter are not suitable for the data set, we filter them out at this point. Following the previous example, the chain ‘MNPLSRPFARTPSLRTRV:chhhhhhccsssssssss’ would be divided into ‘M:c’, ‘NPLSRP:hhhhhh’, ‘FA:cc’ and ‘RTPSLRTRV:sssssssss’, where the small letters show the secondary structure.
3) Representative segments were selected from all segments by clustering with the BLAST homology search.
4) Finally, the segment structure DB was organized as text files in FASTA format and converted into binary databases for the BLAST homology search program.
Prediction of a query sequence
When a query sequence is input, the system decomposes it into segments according to the specified BLAST parameter. After the decomposition, all the segments are compared with those in the segment DB described above by using the BLAST homology search. All the results of the BLAST search are processed as follows:
1) The query sequence is directly compared with the segments of the same length in the segment structure DB, without changing the default E value of the BLAST homology search.
2) The query sequence is also decomposed into short segments using a sliding window of fixed size with a shift length of 3. Each segment is then compared with those of similar length in the segment sequence DB.
3) The top 3 homologous segments in the segment sequence DB are chosen.
4) These 3 segments are aligned on the query sequence, and the secondary structure type of the homologous segments is associated with each aligned position.
5) As the result of the previous steps, a tendency to belong to a secondary structure type is reported for each amino acid position in the query. If no BLAST homology search result was found for a position, the tendency for the position is left as ‘no segment’. The position is not determined unless indicated by a majority of the data (Figure 1).
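The two segmentation rules used in this section, the fixed sliding window and the division by secondary structure state, can be sketched in Python as follows. The function names, the default parameters, and the choice to drop trailing residues that do not fill a complete window are assumptions for illustration; the sketch reproduces only the worked examples given in the text.

```python
from itertools import groupby


def sliding_segments(chain: str, width: int = 9, step: int = 3) -> list[str]:
    """Cut a chain into overlapping windows of `width`, shifted by `step`.

    Assumption: residues at the end that do not fill a whole window are dropped.
    """
    return [chain[i:i + width] for i in range(0, len(chain) - width + 1, step)]


def structure_segments(chain: str, states: str, min_len: int = 1) -> list[tuple[str, str]]:
    """Cut a chain at every change of secondary structure state.

    Returns (sequence, state) pairs; runs shorter than `min_len` are skipped.
    """
    if len(chain) != len(states):
        raise ValueError("chain and states must have the same length")
    segments = []
    pos = 0
    for state, run in groupby(states):
        run_len = len(list(run))
        if run_len >= min_len:
            segments.append((chain[pos:pos + run_len], state))
        pos += run_len
    return segments


if __name__ == "__main__":
    chain = "MNPLSRPFARTPSLRTRV"
    states = "chhhhhhccsssssssss"
    print(sliding_segments(chain))
    # Unfiltered division, as in the example in the text.
    print(structure_segments(chain, states))
    # Segments of length 5 or shorter removed (min_len=6).
    print(structure_segments(chain, states, min_len=6))
```

With `min_len=6` the filter described for the segment structure DB keeps only ‘NPLSRP’ (helix) and ‘RTPSLRTRV’ (strand) from the example chain.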
Figure 1. A result of the secondary structure prediction system
It can be seen that the likelihood of each secondary structure type changes continuously in a sinusoidal form, and that the prediction result is given by a majority decision. Note that the waveform can vary when the parameters in the input form are changed. The influence of the E value is especially important. Small values of E make BLAST report fewer, more highly homologous segments, so that a waveform with smaller amplitude and many positions with ‘no segment’ is shown. When a position cannot be decided by a majority of votes, it is indicated as ‘not voted’. If a position is both ‘no segment’ and ‘not voted’, it is marked as ‘unknown’. In addition to this waveform and the prediction result, the system can show the basis of a prediction at each position by listing the homologous segments used to build the waveform. Also, each of the listed segments can be visualized via PDB highlight, which is a web-based protein visualization system (Tsukamoto et al. 1997).
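The per-position majority decision can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the implementation of our system: each hit is represented as a hypothetical pair (start offset on the query, three-state string of the homologous segment), alignments are taken to be gapless, and a tie between the two most frequent states is treated as ‘not voted’. How ‘no segment’ and ‘not voted’ are combined into ‘unknown’ is left out of the sketch.

```python
from collections import Counter


def vote(query_len: int, hits: list[tuple[int, str]]) -> list[str]:
    """Return one label per query position by majority decision.

    hits: (start, states) pairs, where `start` is the 0-based query offset of
    a homologous segment and `states` is its three-state string (assumed
    gapless alignment).
    """
    tallies = [Counter() for _ in range(query_len)]
    for start, states in hits:
        for offset, state in enumerate(states):
            pos = start + offset
            if 0 <= pos < query_len:
                tallies[pos][state] += 1

    labels = []
    for tally in tallies:
        if not tally:
            # No homologous segment covers this position.
            labels.append("no segment")
            continue
        ranked = tally.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            # The two leading states are tied, so no decision is made.
            labels.append("not voted")
        else:
            labels.append(ranked[0][0])
    return labels


if __name__ == "__main__":
    hits = [(0, "hhh"), (1, "hhs"), (2, "ss")]
    print(vote(6, hits))
```

The per-position tallies correspond to the waveform in Figure 1: the count for each state at each position is the height of that state's curve before the majority decision is taken.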
3 Learning
We performed a set of experiments based on the 1713 non-redundant chains contained in the EMBL non-redundant PDB subset. The prediction accuracies of the three state classifiers and other results regarding the data set are shown in Table I. The Q3 accuracy is 78.6% per protein and 79.3% per residue, and 2.0% with respect to chain breaks. The results of our experiment clearly show that, with the segment structure DB, the prediction score for coil is lower than those for helix and strand. Moreover, with the segment sequence DB, the prediction score for helix was much lower than those for the other secondary structures, by between 5% and 15%. This suggests that coils are not a common secondary structural element in the segment sequence DB. In other words, coil prediction using the short segments of the segment sequence DB has a negative influence on the prediction accuracy for helix. The low helix accuracy obtained with the segment sequence DB is related to the choice of the representative length in our original segment DB.

Table I.
Accuracy of three state classifiers on the 25% non-redundant data set
                                     segment structure DB            segment sequence DB
Length     Number  Q3(%)  MEAN(%)    Q3(%) [helix, strand, coil]     Q3(%) [helix, strand, coil]
31~100        623   77.7     76.7    81.1  [85.0, 75.7, 78.5]        77.0  [65.5, 80.7, 81.2]
101~200       582   80.4     80.3    81.9  [84.6, 85.5, 78.5]        83.3  [77.9, 87.7, 83.6]
201~300       255   79.7     79.5    79.6  [81.8, 79.0, 77.2]        84.0  [80.2, 89.0, 83.7]
301~1207      253   78.8     78.7    76.2  [81.2, 71.2, 72.2]        84.7  [83.7, 88.0, 83.8]
31~1207      1713   79.3     78.6    79.6  [83.0, 79.5, 76.3]        83.4  [80.0, 87.5, 83.4]
Length: protein chain length. Number: number of proteins. Q3: accuracy per residue. MEAN: average accuracy per protein. helix: accuracy of helix prediction. strand: accuracy of strand prediction. coil: accuracy of coil prediction.

The segmentation rule was changed so that representative chains were selected from all chains using clustering with the BLAST homology search. A complementary set was created, taking into consideration representative chains with respect to secondary structure. Non-redundant (sequence-to-structure) chains were selected from the complementary set. The new segmentation rule did not use coil information for the segment sequence DB if the sequence length was over 250 residues. All chains use a new segment sequence DB which takes the secondary structure of the protein chain into consideration. These two types of new segment DB were used for each prediction. Q3 accuracies of 80.5% per protein and 80.3% per residue were obtained for the 1713 chains using the new segment (sequence and structure) DBs (Table II). A comparison of the results of Table I and Table II shows an overall increase in prediction accuracy of about 1~2%.

Table II.
Accuracy of three state classifiers on the 25% non-redundant data set
                                     segment structure DB            segment sequence DB
Length     Number  Q3(%)  MEAN(%)    Q3(%) [helix, strand, coil]     Q3(%) [helix, strand, coil]
31~100        623   80.9     80.0    79.6  [75.7, 78.5, 81.1]        85.0  [87.3, 85.1, 84.4]
101~200       582   81.7     81.6    81.0  [85.5, 78.5, 81.9]        84.6  [89.7, 85.1, 85.4]
201~300       255   81.2     81.0    82.5  [77.3, 79.5, 79.9]        80.8  [88.3, 86.8, 86.0]
301~1207      253   78.8     78.9    81.9  [59.0, 72.2, n/a ]        76.3  [86.1, 86.1, 84.9]
31~1207      1713   80.4     80.5    81.7  [73.5, 78.7, 79.1]        81.1  [87.7, 85.9, 85.2]
Length: protein chain length. Number: number of proteins. Q3: accuracy per residue. MEAN: average accuracy per protein. helix: accuracy of helix prediction. strand: accuracy of strand prediction. coil: accuracy of coil prediction.
4 Results
Since existing test sets (e.g., the RS126 set) overlap with the training set (PDB; Dec. 31, 2001), we cannot use them for the reasons explained above (see the section “Introduction”). Another factor to consider in prediction accuracy is that some protein structures are more easily predictable than others, so the set of proteins chosen for testing will influence the final accuracy results. Therefore, we decided to generate a new test set from the proteins published in the PDB since the original training set of proteins was created. We found 1771 protein chains released by the PDB since January 2002. Homology was assessed using the BLAST homology search. As a result, we created a new test set containing 355 non-homologous chains. The segment-based secondary structure prediction method described in the Learning section was applied to the test set selected from these 1771 protein chains. On the test set of 355 chains, our method obtained a Q3 structure prediction accuracy of 76.7% per protein and 76.1% per residue, with 4% for ‘unknown’.
In order to compare our accuracy results with those of other approaches on a fair basis, we measured performance using the same set of proteins and ensured that none of the proteins used for training was used for testing. We compared eight different methods and a consensus prediction of secondary structure prediction methods (Cuff and Barton 1999). The methods available for secondary structure prediction on the NPS@ Web server (Combet et al. 2000) are SOPM (Geourjon and Deleage 1994), HNN (Guermeur 1997), DPM (Deleage and Roux 1987), DSC, GORIV (Garnier et al. 1996), PHD, PREDATOR, SIMPA96 (Levin 1997) and a consensus prediction using all of the previous methods. These prediction methods achieved accuracies between 56% and 71% (Table III). The performance of our prediction was 10% greater than that of the consensus prediction.
Our method classifies each protein chain into one of two categories, ‘predict’ or ‘not predict’. The system considers any chain with ‘unknown’ regions exceeding 10% of the total length as ‘not predict’, which resulted in 49 chains in the test set being classified as ‘not predict’. Our method predicted the remaining 306 chains with accuracies of 80.3% per protein and 80.8% per residue.

Table III. Prediction accuracy achieved by different methods on 355 new non-homologous proteins
Method             Q3(%)  MEAN(%)
DPM                 56.4     56.5
DSC                 67.6     68.1
GOR4                59.4     60.0
HNNC                64.8     65.2
PHD                 71.1     71.2
Predator            65.4     66.3
SIMPA96             65.4     66.1
SOPM                64.6     64.7
Cons. Prediction    66.0     66.5
Our approach        76.4     76.9
Q3: accuracy per residue. MEAN: average accuracy per protein. DPM, DSC, GOR4, HNNC, PHD, Predator, SIMPA96, SOPM, and Cons. Prediction: predictions obtained from the NPS@ Web server.
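The ‘predict’ / ‘not predict’ rule and the two accuracy columns reported in the tables (Q3 per residue and MEAN per protein) can be sketched in Python as follows. The function names, the representation of a prediction as a string, and the letter u for an ‘unknown’ position are assumptions made for this illustration.

```python
UNKNOWN = "u"  # hypothetical marker for an 'unknown' position


def classify_chain(predicted: str, max_unknown: float = 0.10) -> str:
    """Label a chain 'not predict' if 'unknown' exceeds 10% of its length."""
    if not predicted:
        raise ValueError("empty prediction")
    unknown_fraction = predicted.count(UNKNOWN) / len(predicted)
    return "not predict" if unknown_fraction > max_unknown else "predict"


def accuracies(chains: list[tuple[str, str]]) -> tuple[float, float]:
    """Return (per-residue Q3, per-protein MEAN) for (observed, predicted) pairs."""
    if not chains:
        raise ValueError("no chains given")
    correct_total = 0
    residue_total = 0
    per_protein = []
    for observed, predicted in chains:
        if len(observed) != len(predicted) or not observed:
            raise ValueError("each pair must be non-empty and of equal length")
        correct = sum(o == p for o, p in zip(observed, predicted))
        correct_total += correct
        residue_total += len(observed)
        per_protein.append(correct / len(observed))
    # Per residue pools all residues; per protein averages chain accuracies.
    return correct_total / residue_total, sum(per_protein) / len(per_protein)


if __name__ == "__main__":
    chains = [("hhhh", "hhhh"), ("ssssssss", "sssscccc")]
    print(accuracies(chains))
    print(classify_chain("hhhhhhhhuu"))
```

The two measures differ whenever chain lengths differ: per-residue accuracy weights long chains more heavily, whereas the per-protein mean gives every chain equal weight.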
5 Discussion
In this paper, we presented a new approach for secondary structure prediction based on segment homology, which resulted in an improved overall accuracy. By combining two different segmentation rules on protein chains in PDB entries, our system achieved a high level of accuracy comparable with those of existing prediction methods. Techniques such as multiple alignment tend to predict strands by using known amino acid sequences to survey global information. Even though our method does not use multiple alignments, the strand tendency (global information) can be expressed using only protein chains in the PDB. An advantage of this technique is that it can resolve strands, which are difficult to predict with a high level of accuracy using existing methods. This result suggests that adequate segment databases and homology searches against them might play an important role as the PDB continues to grow.
The method is not a black box like a machine learning technique. Every step can be easily rationalized. Our system does not just predict the secondary structure, but also provides valuable information by giving the degree of reliability of the prediction at each position. This is done by displaying a visual score for each of the three structural states at each residue position.
Although the sequence length has been set experimentally by using the value obtained from BLAST, accuracy may be increased by adjusting the length of each sequence using a machine learning technique. This means that the accuracy of our prediction system could be improved by 3 to 10% (i.e., the amino acids predicted as ‘unknown’). By applying other prediction methods to the ‘unknown’ regions, our system could predict more true positives.
The prediction for every segment can be traced back to the three-dimensional (3D) structure. We believe this makes it possible to use segment 3D information to predict a protein’s three-dimensional structure.
Acknowledgement
This work was supported by Grant-in-Aid for Scientific Research on Priority Areas (C) “Genome Information Science” from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
References
[1] Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. "Basic local alignment search tool." (1990) J. Mol. Biol., 215, 403-410.
[2] Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." (1997) Nucleic Acids Res., 25, 3389-3402.
[3] Arthur, M. L., Loredana, L. C. and Tim, J. P. H. "Assessment of novel fold targets in CASP4: predictions of three-dimensional structures, secondary structures, and interresidue contacts." (2001) Proteins, 5, 98-118.
[4] Bernstein, F. C., et al. (1977) J. Mol. Biol., 112, 535-542.
[5] Combet, C., Blanchet, C., Geourjon, C. and Deléage, G. "NPS@: Network Protein Sequence Analysis." (2000) TIBS, 291, 147-150.
[6] Cuff, J. A. and Barton, G. J. "Evaluation and improvement of multiple sequence methods for protein secondary structure prediction." (1999) Proteins, 34, 508-519.
[7] Deleage, G. and Roux, B. "An algorithm for protein secondary structure prediction based on class prediction." (1987) Protein Eng., 4, 289-294.
[8] Frishman, D. and Argos, P. "Knowledge-based protein secondary structure assignment." (1995) Proteins, 23, 566-579.
[9] Frishman, D. and Argos, P. "Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence." (1996) Protein Eng., 9, 133-142.
[10] Garnier, J., Gibrat, J. F. and Robson, B. "GOR secondary structure prediction method version IV." (1996) R. F. Doolittle Ed., 266, 540-553.
[11] Garnier, J., Osguthorpe, D. J. and Robson, B. "Analysis and implications of simple methods for predicting the secondary structure of globular proteins." (1978) J. Mol. Biol., 120, 97-120.
[12] Geourjon, C. and Deleage, G. "SOPM: a self-optimized method for protein secondary structure prediction." (1994) Protein Eng., 7, 157-164.
[13] Hobohm, U., Scharf, M., Schneider, M. and Sander, C. "Selection of a representative set of structures from the Brookhaven Protein Data Bank." (1992) Protein Sci., 1, 409-417.
[14] Kabsch, W. and Sander, C. "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features." (1983) Biopolymers, 22, 2577-2637.
[15] King, R. D. and Sternberg, M. J. E. "Identification and application of the concepts important for accurate and reliable protein secondary structure prediction." (1996) Protein Sci., 5, 2298-2310.
[16] Levin, J. M. "Exploring the limits of nearest neighbour secondary structure prediction." (1997) Protein Eng., 7, 771-776.
[17] Petersen, T. N. et al. "Prediction of protein secondary structure at 80% accuracy." (2000) Proteins, 41, 17-20.
[18] Qian, N. and Sejnowski, T. J. "Predicting the secondary structure of globular proteins using neural network models." (1988) J. Mol. Biol., 202, 865-884.
[19] Richards, F. M. and Kundrot, C. E. "Identification of Structural Motifs from Protein Coordinate Data: Secondary Structure and First-Level Supersecondary Structure." (1988) Proteins, 3, 71-84.
[20] Rost, B. and Sander, C. "Prediction of protein secondary structure at better than 70% accuracy." (1993) J. Mol. Biol., 232, 584-599.
[21] Rost, B., Sander, C. and Schneider, R. "Redefining the goals of protein secondary structure prediction." (1994) J. Mol. Biol., 235, 13-26.
[22] Salamov, A. A. and Solovyev, V. V. "Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments." (1995) J. Mol. Biol., 247, 11-15.
[23] Tsukamoto, Y., Takiguchi, K., Satou, K., Furuichi, E., Takagi, T. and Kuhara, S. "Application of a deductive database system for topological and similar three dimensional structures in protein." (1997) CABIOS, 3, 183-190.
[24] Zhang, C. T. and Zhang, R. "JPred: a consensus secondary structure prediction server." (1998) Bioinformatics, 14, 857-65.