Capturing Long-Term Dependencies for Protein Secondary Structure Prediction

Jinmiao Chen and Narendra S. Chaudhari
School of Computer Engineering, Nanyang Technological University, Singapore, 639798
[email protected], [email protected]
Abstract. A bidirectional recurrent neural network (BRNN) is a noncausal system that captures both upstream and downstream information for protein secondary structure prediction. Owing to the problem of vanishing gradients, a BRNN cannot learn remote information efficiently. To mitigate this problem, we propose the segmented-memory recurrent neural network (SMRNN) and use SMRNNs to replace the standard RNNs in the BRNN. The resulting architecture is called the bidirectional segmented-memory recurrent neural network (BSMRNN). Our experiment with the BSMRNN on protein secondary structure prediction on the RS126 set indicates an improvement in prediction accuracy.
1 Introduction
One of the most important open problems in computational biology is the prediction of the secondary structure of a protein given only the underlying amino acid sequence. During the last few decades, much effort has been devoted to this problem, with approaches including GOR (Garnier, Osguthorpe, and Robson) techniques [4], the PHD (Profile network from HeiDelberg) method [11], nearest-neighbor methods [13], and support vector machines (SVMs) [7]. These methods are all based on a fixed-width window around the amino acid of interest. Local window approaches do not take into account interactions between distant amino acids, which commonly occur in protein sequences, especially in beta-sheet regions.

The limitations of fixed-size window approaches can be mitigated by using a recurrent neural network (RNN), a powerful connectionist model for learning in sequential domains. Recently, recurrent neural networks have been applied to protein secondary structure prediction. Pollastri et al. proposed a bidirectional recurrent neural network (BRNN), which provides a noncausal generalization of RNNs [9]. The BRNN uses a pair of chained hidden state variables to store the contextual information contained in the upstream and downstream portions of the input sequence, respectively. The output is then obtained by combining the two hidden representations of context.

However, learning long-term dependencies is difficult when gradient descent algorithms are used to train recurrent neural networks [1]. The generalized back-propagation algorithm for training the BRNN is essentially gradient descent, so error propagation in both the forward and backward chains is also subject to exponential decay. In practice, for protein secondary structure prediction the BRNN can utilize information located within about ±15 amino acids around the residue of interest; it fails to discover relevant information contained in more distant portions of the sequence [9]. In order to improve prediction performance, especially the recognition accuracy of beta-sheet regions, we propose an alternative recurrent architecture called the Bidirectional Segmented-Memory Recurrent Neural Network (BSMRNN), which is capable of capturing longer-range dependencies in protein sequences.
2 Segmented-memory recurrent neural networks

2.1 Architecture
As we observe, during the process of memorizing a long sequence, people tend to break it into a few segments: they memorize each segment first and then cascade the segments to form the final sequence. This process is illustrated in Figure 1, where gray arrows indicate updates of the contextual information associated with symbols, black arrows indicate updates of the contextual information associated with segments, and the numbers under the arrows indicate the order of memorization.

Fig. 1. Segmented memory with interval = d

Based on this observation on human memorization, we believe that RNNs are more capable of capturing long-term dependencies if they have segmented memory and imitate the way humans memorize. Following this intuitive idea, we propose the Segmented-Memory Recurrent Neural Network (SMRNN), illustrated in Figure 2.
Fig. 2. Segmented-memory recurrent neural network with interval=d
Fig. 3. Dynamics of segmented-memory recurrent neural network
The SMRNN has hidden layers H1 and H2, representing the symbol-level state and the segment-level state respectively. Both H1 and H2 have recurrent connections among their own units. The states of H1 and H2 at the previous cycle are copied back and stored in context layers S1 and S2 respectively. Most importantly, we introduce into the network a new attribute, the interval, which denotes the length of each segment.
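For concreteness, the following is a minimal NumPy sketch of how the SMRNN parameters and state could be organized; the class name SMRNNParams, the layer-size arguments and the uniform initialization are illustrative assumptions, not part of the original description.

```python
import numpy as np

class SMRNNParams:
    """Illustrative container for SMRNN weights and state.

    n_u: input size, n_x: symbol-level hidden size (H1),
    n_y: segment-level hidden size (H2), n_z: output size,
    d:   the 'interval' attribute, i.e. the length of each segment.
    """
    def __init__(self, n_u, n_x, n_y, n_z, d, scale=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.d = d
        # symbol-level recurrence (H1 <- S1) and input-to-H1 weights
        self.W_xx = rng.uniform(-scale, scale, (n_x, n_x))
        self.W_xu = rng.uniform(-scale, scale, (n_x, n_u))
        # segment-level recurrence (H2 <- S2) and H1-to-H2 weights
        self.W_yy = rng.uniform(-scale, scale, (n_y, n_y))
        self.W_yx = rng.uniform(-scale, scale, (n_y, n_x))
        # H2-to-output weights
        self.W_zy = rng.uniform(-scale, scale, (n_z, n_y))
        # context layers S1 and S2 hold the previous states of H1 and H2
        self.x = np.zeros(n_x)   # S1
        self.y = np.zeros(n_y)   # S2
```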
2.2 Dynamics
In order to implement the segmented memory illustrated in Figure 1, we formulate the dynamics of the SMRNN with interval = d as follows:

$$x_i^t = g\left(\sum_{j=1}^{n_X} W_{ij}^{xx}\, x_j^{t-1} + \sum_{j=1}^{n_U} W_{ij}^{xu}\, u_j^t\right) \qquad (1)$$

$$y_i^t = g\left(\sum_{j=1}^{n_Y} W_{ij}^{yy}\, y_j^{t-d} + \sum_{j=1}^{n_X} W_{ij}^{yx}\, x_j^t\right) \qquad (2)$$

$$z_i^t = g\left(\sum_{j=1}^{n_Y} W_{ij}^{zy}\, y_j^t\right) \qquad (3)$$
We now explain the dynamics of the SMRNN with an example (Figure 3). In this example, the input sequence is divided into segments of equal length 3. The symbols in each segment are fed to hidden layer H1 to update the symbol-level context. Upon completion of each segment, the symbol-level context is forwarded to the next layer H2 to update the segment-level context. This process continues until the end of the input sequence, at which point the segment-level context is forwarded to the output layer to generate the final output. In other words, the network reads in one symbol per cycle; the state of H1 is updated at the arrival of each symbol, while the state of H2 is updated only after an entire segment has been read and at the end of the sequence. The segment-level state layer behaves as if it cascades the segments sequentially to obtain the final sequence, much as people do: every time people finish a segment, they go over the sequence from the beginning to the end of the newly memorized segment, so as to make sure that they have remembered all previous segments in the correct order (see Figure 1).
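To make this update schedule concrete, below is a minimal NumPy sketch of one SMRNN forward pass, assuming a logistic activation g and using W_xx, W_xu, W_yy, W_yx, W_zy for the weight matrices of eqs. (1)-(3); it illustrates the dynamics only and is not the authors' implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def smrnn_forward(seq, W_xx, W_xu, W_yy, W_yx, W_zy, d):
    """Run the SMRNN over `seq` (a list of input vectors) with interval d.

    H1 (x) is updated at every symbol, eq. (1); H2 (y) is updated only at
    the end of each segment and at the end of the sequence, eq. (2);
    the output is computed from the final segment-level state, eq. (3).
    """
    n_x, n_y = W_xx.shape[0], W_yy.shape[0]
    x = np.zeros(n_x)   # symbol-level state (context layer S1)
    y = np.zeros(n_y)   # segment-level state (context layer S2)
    for t, u in enumerate(seq, start=1):
        x = sigmoid(W_xx @ x + W_xu @ u)          # eq. (1)
        if t % d == 0 or t == len(seq):           # segment boundary or end
            y = sigmoid(W_yy @ y + W_yx @ x)      # eq. (2)
    return sigmoid(W_zy @ y)                      # eq. (3)

# Tiny usage example with arbitrary sizes (n_u=4, n_x=3, n_y=3, n_z=2).
rng = np.random.default_rng(0)
W = [rng.uniform(-0.1, 0.1, s) for s in [(3, 3), (3, 4), (3, 3), (3, 3), (2, 3)]]
seq = [rng.random(4) for _ in range(7)]
print(smrnn_forward(seq, *W, d=3))
```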
2.3 Learning strategy
The SMRNN is trained using an extension of the real-time recurrent learning (RTRL) algorithm. Every parameter P is initialized with small random values and then updated according to gradient descent:

$$\Delta P = -\alpha \frac{\partial E^t}{\partial P} = -\alpha \frac{\partial E^t}{\partial y^t}\, \frac{\partial y^t}{\partial P} \qquad (4)$$

where α is the learning rate and E^t is the error function at time t. Derivatives associated with recurrent connections are calculated recurrently. Derivatives of the segment-level state at time t depend on the derivatives at time t − d, where d is the length of each segment:

$$\frac{\partial y_i^t}{\partial W_{kl}^{yy}} = y_i^t (1 - y_i^t)\left(\delta_{ik}\, y_l^{t-d} + \sum_{j=1}^{n_Y} W_{ij}^{yy}\, \frac{\partial y_j^{t-d}}{\partial W_{kl}^{yy}}\right) \qquad (5)$$

$$\frac{\partial y_i^t}{\partial W_{kl}^{yx}} = y_i^t (1 - y_i^t)\left(\sum_{j=1}^{n_Y} W_{ij}^{yy}\, \frac{\partial y_j^{t-d}}{\partial W_{kl}^{yx}} + \delta_{ik}\, x_l^t\right) \qquad (6)$$

where δ_{ik} denotes the Kronecker delta (δ_{ik} is 1 if i = k and 0 otherwise). Derivatives of the symbol-level state at time t depend on the derivatives at time t − 1:

$$\frac{\partial x_i^t}{\partial W_{kl}^{xx}} = x_i^t (1 - x_i^t)\left(\delta_{ik}\, x_l^{t-1} + \sum_{j=1}^{n_X} W_{ij}^{xx}\, \frac{\partial x_j^{t-1}}{\partial W_{kl}^{xx}}\right) \qquad (7)$$

$$\frac{\partial x_i^t}{\partial W_{kl}^{xu}} = x_i^t (1 - x_i^t)\left(\sum_{j=1}^{n_X} W_{ij}^{xx}\, \frac{\partial x_j^{t-1}}{\partial W_{kl}^{xu}} + \delta_{ik}\, u_l^t\right) \qquad (8)$$
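As an illustration of how these sensitivities can be carried along with the forward pass, the sketch below maintains the tensor of derivatives ∂x_i^t/∂W_kl^{xx} from eq. (7) for the symbol-level weights; the dense storage and the variable names are our own choices, and the analogous recursions of eqs. (5), (6) and (8) follow the same pattern.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def step_with_sensitivities(x_prev, u, W_xx, W_xu, dx_prev):
    """One symbol-level update plus the RTRL recursion of eq. (7).

    dx_prev[i, k, l] holds d x_i^{t-1} / d W_kl^{xx}; the function returns
    the new state x^t and the updated sensitivity tensor dx[i, k, l].
    """
    n_x = x_prev.shape[0]
    x = sigmoid(W_xx @ x_prev + W_xu @ u)                       # eq. (1)
    # delta_{ik} * x_l^{t-1}: nonzero only where i == k
    direct = np.zeros((n_x, n_x, n_x))
    for i in range(n_x):
        direct[i, i, :] = x_prev
    # sum_j W_ij^{xx} * d x_j^{t-1} / d W_kl^{xx}
    recurrent = np.einsum('ij,jkl->ikl', W_xx, dx_prev)
    dx = (x * (1.0 - x))[:, None, None] * (direct + recurrent)  # eq. (7)
    return x, dx
```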
3 Bidirectional Segmented-Memory Recurrent Neural Network

3.1 Architecture
In Pollastri's bidirectional recurrent neural network, the forward subnetwork Nϕ (left subnetwork in Figure 4) and the backward subnetwork Nβ (right subnetwork in Figure 4) are conventional recurrent neural networks. We replace the conventional recurrent subnetworks with segmented-memory recurrent neural networks and obtain a novel architecture called the Bidirectional Segmented-Memory Recurrent Neural Network (BSMRNN), which is capable of capturing farther upstream and downstream context. The BSMRNN architecture is illustrated in Figure 5.
Fig. 4. A BRNN architecture for PSS prediction
Fig. 5. Bidirectional segmented-memory recurrent neural network architecture
The upstream context and downstream context are contained in the vectors F_t and B_t respectively. Let the vector I_t encode the input at time t; the vectors F_t and B_t are defined by the following recurrent bidirectional equations:

$$F_t = \phi(F_{t-d}, I_t) \qquad (9)$$

$$B_t = \psi(B_{t+d}, I_t) \qquad (10)$$

where φ(·) and ψ(·) are nonlinear transition functions. They are implemented by the forward SMRNN N_φ and the backward SMRNN N_ψ respectively (left and right subnetworks in Figure 5). The final output is obtained by combining the two hidden representations of context and the current input:

$$O_t = \zeta(F_t, B_t, I_t) \qquad (11)$$
where ζ(·) is realized by the MLP N_ζ (top subnetwork in Figure 5). The number of input neurons equals the size of the input alphabet, i.e. |Σ_i|. Symbols in the sequences are presented to the input layer with one-hot coding: when a_l (the l-th symbol of the input alphabet) is read, the k-th element of the input vector is I_k = δ(k, l), where δ(k, l) is 1 if k = l and 0 otherwise. Normally the synaptic input equals the sum of inputs multiplied by weights; hence if an input is zero, the weights associated with that input unit are not updated. We therefore apply a contractive mapping f(x) = ε + (1 − 2ε)x, with ε a small positive number, to the input units. This mapping does not affect the essentials of the formalism presented here. The number of output units equals the size of the output alphabet, i.e. |Σ_o|.
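The input coding and the contractive mapping can be sketched as follows; the 20-letter amino acid alphabet and ε = 0.01 are illustrative choices.

```python
import numpy as np

# Illustrative 20-letter amino acid alphabet; |Sigma_i| = 20 input symbols.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_residue(aa, eps=0.01):
    """One-hot code a residue, then apply the contractive mapping
    f(x) = eps + (1 - 2*eps)*x so that no input component is exactly zero
    and every input weight receives nonzero updates."""
    one_hot = np.array([1.0 if a == aa else 0.0 for a in AMINO_ACIDS])
    return eps + (1.0 - 2.0 * eps) * one_hot

print(encode_residue("M"))   # nineteen entries equal to eps, one equal to 1 - eps
```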
3.2 Inference and Learning
Given an amino acid sequence X_1, ..., X_t, ..., X_T, the BSMRNN estimates the posterior probabilities of the secondary structure classes at each sequence position t. Starting from F_0 = 0, the forward SMRNN reads the preceding substring X_1, ..., X_{t−1} from left to right and updates its state F_t following eq. (9). Similarly, starting from B_{T+1} = 0, the backward SMRNN scans the succeeding substring X_{t+1}, ..., X_T from right to left and updates its state B_t following eq. (10). After the forward and backward propagations have taken place, the output at position t is calculated with eq. (11).

Learning in the BSMRNN is also gradient based. The weights of subnetwork N_ζ are adjusted in the same way as in a standard MLP. The derivatives of the error function with respect to the states F_t and B_t are calculated and injected into N_φ and N_ψ respectively. The error signal is then propagated over time in both directions, and the weights of N_φ and N_ψ are adjusted using the same formulas as for the causal SMRNN (see Section 2.3).
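A minimal sketch of this inference procedure is given below; forward_net, backward_net and output_net stand for N_φ, N_ψ and N_ζ and are assumed to be given, the softmax is an assumed way of turning ζ into class probabilities, and for clarity the sketch rescans the two substrings at every position rather than reusing states incrementally as an efficient implementation would.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def bsmrnn_predict(seq, forward_net, backward_net, output_net, d):
    """Estimate secondary-structure class probabilities at every position t.

    forward_net(sub, d)  -> F_t, state after scanning X_1..X_{t-1} left to right
    backward_net(sub, d) -> B_t, state after scanning X_{t+1}..X_T right to left
    output_net(F, B, I)  -> scores zeta(F_t, B_t, I_t)
    All three callables are assumed given; this only wires them together.
    """
    T = len(seq)
    outputs = []
    for t in range(T):
        F_t = forward_net(seq[:t], d)              # eq. (9), F_0 = 0 when t = 0
        B_t = backward_net(seq[t + 1:][::-1], d)   # eq. (10), B_{T+1} = 0 at the end
        outputs.append(softmax(output_net(F_t, B_t, seq[t])))   # eq. (11)
    return outputs
```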
4 Experimental evaluation
For the problem of PSS prediction, we use sevenfold cross-validation on the RS126 set (126 protein chains, 23,348 amino acids), on which many PSS prediction methods have been developed and tested. With sevenfold cross-validation, approximately six-sevenths of the dataset is used for training and the remaining one-seventh for testing. In order to avoid extremely biased partitions that would give a misleading prediction accuracy, the RS126 set is divided into seven subsets of similar size and similar content of each type of secondary structure.

For each subnetwork of the BSMRNN we use 10 hidden units. According to Pollastri's study [9], the BRNN can reliably utilize information contained within about ±15 amino acids around the predicted residue, so we set the interval of the SMRNN to 15. Recurrent neural networks are somewhat less stable than feed-forward ones, so we shuffle the training set at each pass in order to present it in a different random order each time. The error may oscillate (it occasionally grows), but this is not difficult to handle: we allow the network to remain above its minimal error for a fixed number of passes of the training set (20-30 is sufficient). If the minimal error so far lies within the last 20-30 epochs, training continues; if it lies before that, the training is considered stuck and we halve the learning rate, and so on.

The results are given in Table 1. Q3 is the overall three-state prediction percentage, defined as the ratio of correctly predicted residues to the total number of residues. QE, QH and QC are the percentages of correctly predicted residues observed in classes E, H and C respectively. The results show that the BSMRNN performs better on the problem of PSS prediction. In particular, the higher prediction accuracy on beta-sheets indicates that the BSMRNN captures longer-range dependencies in protein sequences than the BRNN.

Table 1. Comparison between BSMRNN and BRNN

          Q3      QE      QH      QC
BSMRNN    66.7%   61.8%   52.1%   78.3%
BRNN      65.3%   57.1%   52.3%   77.3%
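For reference, the measures reported in Table 1 can be computed as in the following sketch; the three-state labels and the toy sequences are illustrative.

```python
import numpy as np

def q_scores(observed, predicted, classes=("H", "E", "C")):
    """Q3 and the per-class accuracies Q_H, Q_E, Q_C.

    Q3  = correctly predicted residues / total residues.
    Q_c = correctly predicted residues observed in class c
          / residues observed in class c.
    """
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    correct = observed == predicted
    scores = {"Q3": correct.mean()}
    for c in classes:
        mask = observed == c
        scores["Q_" + c] = correct[mask].mean() if mask.any() else float("nan")
    return scores

# Toy example: 10 residues, 8 predicted correctly (Q3 = 0.8).
print(q_scores(list("HHHEECCCEE"), list("HHEEECCCCE")))
```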
5 Concluding remarks
Segmented-memory recurrent neural networks are more capable of capturing long-term dependencies than conventional RNNs. From biology, we know that protein secondary structure prediction is a long-term dependency problem. Therefore, the BSMRNN can improve the prediction performance, especially the recognition accuracy of beta-sheets. However, there is a trade-off between the efficiency of gradient-descent training and long-range information latching. The training algorithm for the BSMRNN is also essentially gradient descent, hence the BSMRNN does not circumvent the problem of long-term dependencies.

In practice, we found that the best prediction on the testing data is obtained when the training error is not yet very small; after that point the prediction accuracy begins to drop even though the error keeps converging. One reason may be that the RS126 set is too small for the BSMRNN to learn the complex mapping from amino acid sequence to secondary structure sequence, so a larger training set is required to achieve a satisfactory level of generalization.

Much research has shown that prediction quality can be improved by incorporating evolutionary information in the form of multiple sequence alignments [3, 10, 12]. In a multiple sequence alignment, different sequences are compared to each other to find out which parts of the sequences are alike and which are dissimilar. The similarity of sequences is usually related to their evolutionary dependencies. The sequence alignment of homologous proteins accords with their structural alignment, and aligned residues usually have similar secondary structures. In our experiments we performed prediction on single protein sequences only; the prediction accuracy can be further improved by using multiple alignments of homologous protein sequences.
References

1. Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks 5 (1994), no. 2, 157-166.
2. B. Šter, Latched recurrent neural network, Elektrotehniški vestnik 70 (2003), no. 1-2, 46-51.
3. V. Di Francesco, J. Garnier, and P. J. Munson, Improving protein secondary structure prediction with aligned homologous sequences, Prot. Sci. 5 (1996), 106-113.
4. J. F. Gibrat, J. Garnier, and B. Robson, Further developments of protein secondary structure prediction using information theory, J. Mol. Biol. 198 (1987), 425-443.
5. S. El Hihi and Y. Bengio, Hierarchical recurrent neural networks for long-term dependencies, Advances in Neural Information Processing Systems (M. Perrone, M. Mozer, and D. D. Touretzky, eds.), MIT Press, 1996, pp. 493-499.
6. S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997), no. 8, 1735-1780.
7. S. Hua and Z. Sun, A novel method of protein secondary structure prediction with high segment overlap measure: Support vector machine approach, J. Mol. Biol. 308 (2001), 397-407.
8. T. Lin, B. G. Horne, P. Tino, and C. L. Giles, Learning long-term dependencies in NARX recurrent neural networks, IEEE Trans. on Neural Networks 7 (1996), 1329-1337.
9. G. Pollastri, D. Przybylski, B. Rost, and P. Baldi, Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles, Proteins 47 (2002), 228-235.
10. S. K. Riis and A. Krogh, Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments, J. Comp. Biol. 3 (1996), 163-183.
11. B. Rost and C. Sander, Prediction of protein secondary structure at better than 70% accuracy, J. Mol. Biol. 232 (1993), 584-599.
12. A. A. Salamov and V. V. Solovyev, Protein secondary structure prediction using local alignments, J. Mol. Biol. 268 (1997), 31-36.
13. T.-M. Yi and E. S. Lander, Protein secondary structure prediction using nearest-neighbor methods, J. Mol. Biol. 232 (1993), 1117-1129.