
Soft Computing manuscript No. (will be inserted by the editor)

Bidirectional Segmented-Memory Recurrent Neural Network for Protein Secondary Structure Prediction

Jinmiao Chen (1), Narendra S. Chaudhari (2)

(1) e-mail: [email protected], School of Computer Engineering, Nanyang Technological University, Singapore, 639798

(2) e-mail: [email protected], School of Computer Engineering, Nanyang Technological University, Singapore, 639798

Abstract

The formation of protein secondary structure, especially in the regions of β-sheets, involves long-range interactions between amino acids. We propose a novel recurrent neural network architecture called the Segmented-Memory Recurrent Neural Network (SMRNN) and present experimental results showing that the SMRNN outperforms conventional recurrent neural networks on long-term dependency problems. In order to capture long-term dependencies in protein sequences for secondary structure prediction, we develop a predictor based on the Bidirectional Segmented-Memory Recurrent Neural Network (BSMRNN), a noncausal generalization of the SMRNN. In comparison with the existing predictor based on the bidirectional recurrent neural network (BRNN), the BSMRNN predictor can improve prediction performance, especially the recognition accuracy of β-sheets.

1 Introduction

One of the most important open problems in computational biology concerns the computational prediction of the secondary structure of a protein given only the underlying amino acid sequence.


During the last few decades, much effort has been made toward solving this problem, with various approaches including GOR (Garnier, Osguthorpe, and Robson) techniques (Gibrat, Garnier & Robson 1987), the PHD (Profile network from HeiDelberg) method (Rost & Sander 1993), nearest-neighbor methods (Yi & Lander 1993) and support vector machines (SVMs) (Hua & Sun 2001). These methods are all based on a fixed-width window around a particular amino acid of interest. Local window approaches have several well-known drawbacks. First, a fair choice of the window size may be difficult, especially in the absence of relevant prior knowledge. Second, the number of parameters grows with the window size, so permitting far-away inputs to exert an effect on the current prediction comes at the cost of increased parametric complexity. Last but most importantly, interactions between distant amino acids, which are common in protein sequences and especially in the regions of β-sheets, are not taken into account. To overcome these limitations, we need a predictor that makes use of the whole sequence instead of short substrings and considers long-term dependencies in protein sequences. The limitations of fixed-size window approaches can be mitigated by using recurrent neural networks (RNNs), a powerful connectionist model for learning in sequential domains. Unlike feedforward neural networks, RNNs accept sequential data as inputs and employ state dynamics to store contextual information, so they can adapt to temporal dependencies of variable width. Recently, recurrent neural networks have been applied to predict protein secondary structure (PSS). Pollastri et al. proposed a PSS predictor based on a bidirectional recurrent neural network (abbreviated BRNN), which provides a noncausal generalization of RNNs (Pollastri, Przybylski, Rost & Baldi 2002). The BRNN uses a pair of chained hidden state variables to store contextual information contained in the upstream and downstream portions of the input sequence respectively. The output is then obtained by combining the two hidden representations of context. The BRNN is trained with a generalized back-propagation algorithm.


Unfortunately, learning long-term dependencies is difficult when gradient descent algorithms are employed to train recurrent neural networks, because the necessary conditions for robust information latching give rise to the problem of vanishing gradients (Bengio, Simard & Frasconi 1994). Similarly, in the case of the BRNN, error propagation in both the forward and backward chains is subject to exponential decay, so the BRNN cannot learn remote information efficiently. In the practice of protein secondary structure prediction, the BRNN can utilize information located within about ±15 amino acids around the residue of interest; it is unable to discover relevant information contained in more distant portions of the sequence (Pollastri et al. 2002). From biology, we know that long-range interactions between amino acids are important in the formation of protein secondary structures, especially β-sheets. Thus the capability of capturing long-term dependencies has a great impact on the performance of secondary structure prediction, and failing to capture long-term dependencies is the main reason for the low prediction accuracy of β-sheets with existing methods. In order to improve prediction performance, especially the recognition accuracy of β-sheet regions, we propose an alternative recurrent architecture called the Bidirectional Segmented-Memory Recurrent Neural Network (BSMRNN), which is capable of capturing more distant upstream and downstream context than the BRNN.

2 Segmented-Memory Recurrent Neural Network

2.1 Motivation

Earlier research indicates that the memory of recurrent neural networks is limited (Bengio et al. 1994). If an RNN reads in a long sequence at a time, it tends to forget the head of the sequence. Human memory is similar in that people also have difficulty memorizing very long sequences. During the process of memorizing a long sequence, people tend to break it into a few segments, memorize each segment first, and then cascade the segments to form the final sequence (Severin & Rigby 1963, Wickelgren 1967, Ryan 1969a, Ryan 1969b, Frick 1989, Hitch, Burgess, Towse & Culpin 1996). The process of memorizing a sequence in segments is illustrated in Figure 1. In Figure 1, the substrings in parentheses represent segments of equal length d; gray arrows indicate the update of contextual information associated with symbols, black arrows indicate the update of contextual information associated with segments, and the numbers under the arrows indicate the order of memorization.

Fig. 1 Segmented memory with interval=d

2.2 Architecture

Based on this observation of human memorization, we believe that RNNs are more capable of capturing long-term dependencies if they have a segmented memory and imitate the way humans memorize sequences. Following this intuitive idea, we propose the Segmented-Memory Recurrent Neural Network (SMRNN) illustrated in Figure 2. The SMRNN has two hidden layers, namely layer H1 and layer H2. Layer H1 represents the symbol-level state and layer H2 represents the segment-level state. Both H1 and H2 have recurrent connections among themselves. The states of H1 and H2 at the previous cycle are fed back and stored in context layer S1 and context layer S2 respectively. Most importantly, we introduce into the network a new attribute, the interval d, which denotes the length of each segment.


Fig. 2 Segmented-memory recurrent neural network with interval=d

2.3 Dynamics

In this section, we formulate the dynamics of SMRNN with interval $d$ to implement the segmented memory illustrated in Figure 1. The symbol-level state is initialized at the beginning of each segment by $x_k^0 = g(\sigma_k)$. Let $u_i^t$ be the input at time $t$; the symbol-level state at time $t$ is calculated by:

$$
x_k^t =
\begin{cases}
g\left(\sum_{j=1}^{n_X} W_{kj}^{xx} x_j^0 + \sum_{i=1}^{n_U} W_{ki}^{xu} u_i^t\right) & \text{if } t = nd + 1 \\
g\left(\sum_{j=1}^{n_X} W_{kj}^{xx} x_j^{t-1} + \sum_{i=1}^{n_U} W_{ki}^{xu} u_i^t\right) & \text{otherwise}
\end{cases}
\tag{1}
$$

where $k = 1, \ldots, n_X$. The segment-level state is initialized at the beginning of the sequence by $y_k^0 = g(\upsilon_k)$. Due to the insertion of intervals into the memory of context, the segment-level state is updated only after reading an entire segment or at the end of the sequence. Let $T$ denote the length of the input sequence and $d$ the length of the interval; the segment-level state at time $t$ is calculated by:

$$
y_k^t =
\begin{cases}
g\left(\sum_{j=1}^{n_Y} W_{kj}^{yy} y_j^{t-1} + \sum_{i=1}^{n_X} W_{ki}^{yx} x_i^t\right) & \text{if } t = nd \text{ or } t = T \\
y_k^{t-1} & \text{otherwise}
\end{cases}
\tag{2}
$$

where $k = 1, \ldots, n_Y$. Contextual information contained in the vector $y^t$ is forwarded to the output layer to produce an output:

$$
z_k^t = g\left(\sum_{j=1}^{n_Y} W_{kj}^{zy} y_j^t\right), \qquad k = 1, \ldots, n_Z
\tag{3}
$$


For the sake of completeness, we again summarize the significance of the symbols in the above equations.

– $y_j^{t-1}$ denotes the previous state of hidden layer H2, which is stored in context layer S2.
– $x_j^{t-1}$ denotes the previous state of hidden layer H1, which is stored in context layer S1.
– $n_Z$, $n_Y$, $n_X$ and $n_U$ denote the numbers of neurons at the output layer, hidden layer H2 (context layer S2), hidden layer H1 (context layer S1) and the input layer respectively.
– $W_{ij}^{zy}$ denotes the connection from the $j$th neuron at hidden layer H2 to the $i$th neuron at the output layer.
– $W_{ij}^{yy}$ denotes the connection from the $j$th neuron at context layer S2 to the $i$th neuron at hidden layer H2.
– $W_{ij}^{yx}$ denotes the connection from the $j$th neuron at hidden layer H1 to the $i$th neuron at hidden layer H2.
– $W_{ij}^{xx}$ denotes the connection from the $j$th neuron at context layer S1 to the $i$th neuron at hidden layer H1.
– $W_{ij}^{xu}$ denotes the connection from the $j$th neuron at the input layer to the $i$th neuron at hidden layer H1.
– $g(x) = 1/(1 + \exp(-x))$.

We now explain the dynamics of SMRNN with an example (Figure 3).

Fig. 3 Dynamics of a segmented-memory recurrent neural network

In this example, the input sequence is divided into segments of equal length 3. The symbols in each segment are fed to hidden layer H1 to update the symbol-level state. Upon the completion of each segment, the symbol-level state is forwarded to layer H2 to update the segment-level state. This process continues until the end of the input sequence is reached, at which point the segment-level state is forwarded to the output layer to generate the final output. In other words, the network reads one symbol per cycle as conventional RNNs do; the state of H1 is updated on the arrival of each symbol, while the state of H2 is updated only after reading an entire segment or at the end of the sequence. The segment-level state layer behaves as if it cascades segments sequentially to obtain the final sequence, much as people do: every time they finish memorizing a new segment, they go over the sequence from the beginning to the end of that segment, so as to make sure that all previous segments are remembered in the correct order (see Figure 1).
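To make the dynamics concrete, the following minimal Python sketch implements equations (1)-(3) for a single input sequence. The paper gives no code, so the function name, array shapes and the use of NumPy are our own illustrative choices rather than the authors' implementation.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def smrnn_forward(U, W_xu, W_xx, W_yx, W_yy, W_zy, sigma0, upsilon0, d):
    """Forward pass of an SMRNN with interval d (a sketch of eqs. (1)-(3)).

    U is a (T, n_U) input sequence; the weight matrices follow the paper's
    naming (W^{xu}, W^{xx}, W^{yx}, W^{yy}, W^{zy}); sigma0 and upsilon0 are
    the (learned) initial synaptic inputs of layers H1 and H2.
    """
    T = U.shape[0]
    x = sigmoid(sigma0)              # symbol-level state x^0 = g(sigma)
    y = sigmoid(upsilon0)            # segment-level state y^0 = g(upsilon)
    for t in range(1, T + 1):
        u = U[t - 1]
        if (t - 1) % d == 0:         # t = nd + 1: a new segment starts
            x = sigmoid(W_xx @ sigmoid(sigma0) + W_xu @ u)   # eq. (1), first case
        else:
            x = sigmoid(W_xx @ x + W_xu @ u)                 # eq. (1), second case
        if t % d == 0 or t == T:     # segment boundary or end of sequence
            y = sigmoid(W_yy @ y + W_yx @ x)                 # eq. (2), first case
        # otherwise y keeps its previous value (eq. (2), second case)
    return sigmoid(W_zy @ y)          # output z^T, eq. (3)

Note that, between segment boundaries, the segment-level state y is simply held constant, which is exactly the "frozen" cascading behavior described above.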

2.4 SMRNN learning strategy

The segmented-memory recurrent neural network is trained using an extension of the Real Time Recurrent Learning algorithm (abbreviated as ERTRL). Usually, the weights are initialized with random values and then learned, but the initial states of the hidden neurons $x_k^0$ and $y_k^0$ do not change during training. As remarked in (Forcada & Carrasco 1995), it is not reasonable to make the initial states fixed, and the behavior of the network improves if they are also learned. In order to keep $x_k^t$ and $y_k^t$ within the range $[0, 1]$ during gradient descent, we define the synaptic inputs $\sigma_k^t$ and $\upsilon_k^t$ such that $x_k^t = g(\sigma_k^t)$ and $y_k^t = g(\upsilon_k^t)$, and take $\sigma_k^0$ and $\upsilon_k^0$ as the parameters to be optimized. Let $E^t$ represent the cost function at time $t$, which is given by:

$$E^t = \frac{1}{2} \sum_{k=1}^{n_Z} \left(z_k^t - \bar{z}_k^t\right)^2, \tag{4}$$

where $\bar{z}_k^t$ is the desired output and $z_k^t$ is the actual output at time $t$. Every parameter $P$, including $W_{kj}^{zy}$, $W_{kj}^{yy}$, $W_{ki}^{yx}$, $W_{kj}^{xx}$, $W_{ki}^{xu}$, $\sigma_k^0$ and $\upsilon_k^0$, is initialized with small random values.


Each parameter is then updated according to gradient descent:

$$\Delta P = -\alpha \frac{\partial E^t}{\partial P} + \eta \Delta' P, \tag{5}$$

with a learning rate $\alpha$ and a momentum term $\eta$. The value $\Delta' P$ represents the variation of $P$ in the previous iteration. The derivative of $E^t$ with respect to $W_{kj}^{zy}$ can be calculated in a single step:

$$\frac{\partial E^t}{\partial W_{kj}^{zy}} = (z_k^t - \bar{z}_k^t)\, z_k^t (1 - z_k^t)\, y_j^t. \tag{6}$$

However, the derivatives with respect to the other parameters require much more computation:

$$\frac{\partial E^t}{\partial P} = \sum_{a=1}^{n_Z} (z_a^t - \bar{z}_a^t)\, z_a^t (1 - z_a^t) \sum_{b=1}^{n_Y} W_{ab}^{zy}\, y_b^t (1 - y_b^t)\, \frac{\partial \upsilon_b^t}{\partial P}. \tag{7}$$

The derivative of the synaptic input $\upsilon_b^t$ with respect to a parameter $P$, i.e. $\partial \upsilon_b^t / \partial P$, is computed in a recurrent way. At time 0, the initial derivative with respect to $\upsilon_k^0$ is given by $\partial \upsilon_b^0 / \partial \upsilon_k^0 = \delta(b, k)$, where $\delta(b, k)$ denotes the Kronecker delta function. For the other parameters, the initial derivatives are set to zero, that is, $\partial \upsilon_b^0 / \partial P = 0$. At time $t$, the derivatives of $\upsilon_b^t$ with respect to $W_{kj}^{yy}$, $W_{ki}^{yx}$ and $\upsilon_k^0$ are calculated respectively using the following equations:

$$\frac{\partial \upsilon_b^t}{\partial W_{kj}^{yy}} = \delta(b, k)\, y_j^{t-1} + \sum_{a=1}^{n_Y} W_{ba}^{yy}\, y_a^{t-1} (1 - y_a^{t-1})\, \frac{\partial \upsilon_a^{t-1}}{\partial W_{kj}^{yy}}, \tag{8}$$

$$\frac{\partial \upsilon_b^t}{\partial W_{ki}^{yx}} = \delta(b, k)\, x_i^t + \sum_{a=1}^{n_Y} W_{ba}^{yy}\, y_a^{t-1} (1 - y_a^{t-1})\, \frac{\partial \upsilon_a^{t-1}}{\partial W_{ki}^{yx}}, \tag{9}$$

$$\frac{\partial \upsilon_b^t}{\partial \upsilon_k^0} = \sum_{a=1}^{n_Y} W_{ba}^{yy}\, y_a^{t-1} (1 - y_a^{t-1})\, \frac{\partial \upsilon_a^{t-1}}{\partial \upsilon_k^0}. \tag{10}$$

The iteration for calculating the derivatives of $\upsilon_b^t$ with respect to $\sigma_k^0$, $W_{kj}^{xx}$ and $W_{ki}^{xu}$, all denoted by $P$, is given as follows:

$$\frac{\partial \upsilon_b^t}{\partial P} = \sum_{a=1}^{n_Y} W_{ba}^{yy}\, y_a^{t-1} (1 - y_a^{t-1})\, \frac{\partial \upsilon_a^{t-1}}{\partial P} + \sum_{a=1}^{n_X} W_{ba}^{yx}\, x_a^t (1 - x_a^t)\, \frac{\partial \sigma_a^t}{\partial P}. \tag{11}$$

The derivatives of $\sigma_a^t$ with respect to $\sigma_k^0$, $W_{kj}^{xx}$ and $W_{ki}^{xu}$ are also calculated by iteration:


$$\frac{\partial \sigma_a^0}{\partial \sigma_k^0} = \delta(a, k), \tag{12}$$

$$\frac{\partial \sigma_a^t}{\partial \sigma_k^0} = \sum_{b=1}^{n_X} W_{ab}^{xx}\, x_b^{t-1} (1 - x_b^{t-1})\, \frac{\partial \sigma_b^{t-1}}{\partial \sigma_k^0}, \tag{13}$$

$$\frac{\partial \sigma_a^0}{\partial W_{kj}^{xx}} = 0, \tag{14}$$

$$\frac{\partial \sigma_a^t}{\partial W_{kj}^{xx}} = \sum_{b=1}^{n_X} W_{ab}^{xx}\, x_b^{t-1} (1 - x_b^{t-1})\, \frac{\partial \sigma_b^{t-1}}{\partial W_{kj}^{xx}} + \delta(a, k)\, x_j^{t-1}, \tag{15}$$

$$\frac{\partial \sigma_a^0}{\partial W_{ki}^{xu}} = 0, \tag{16}$$

$$\frac{\partial \sigma_a^t}{\partial W_{ki}^{xu}} = \sum_{b=1}^{n_X} W_{ab}^{xx}\, x_b^{t-1} (1 - x_b^{t-1})\, \frac{\partial \sigma_b^{t-1}}{\partial W_{ki}^{xu}} + \delta(a, k)\, u_i^t. \tag{17}$$
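The update rule of equation (5) is ordinary gradient descent with momentum. The following minimal sketch shows one such update for a single parameter array, with the gradient assumed to have already been accumulated by the ERTRL recursions above; the function name and the default values of alpha and eta are illustrative assumptions.

import numpy as np

def momentum_update(P, grad_E, prev_delta, alpha=0.1, eta=0.5):
    """One step of eq. (5): Delta P = -alpha * dE/dP + eta * Delta'P.

    P and grad_E are arrays of the same shape; prev_delta is the variation
    of P in the previous iteration. Returns the new parameters and delta.
    """
    delta = -alpha * grad_E + eta * prev_delta
    return P + delta, delta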

2.5 Information latching

We test the performance of SMRNN on the information latching problem. This is a minimal task designed by Bengio et al. as a test that a dynamic system must pass in order to latch information robustly (Bengio et al. 1994). In this task, the SMRNN is trained to classify two different sets of sequences. For each sequence $X_1, X_2, \ldots, X_T$, the class $C(X_1, X_2, \ldots, X_T) \in \{0, 1\}$ depends only on the first $L$ values of the sequence:

$$C(X_1, X_2, \ldots, X_T) = C(X_1, X_2, \ldots, X_L) \tag{18}$$

where $T$ is the length of the sequence. The values $X_{L+1}, \ldots, X_T$ are irrelevant for determining the class of the sequence; however, they may affect the evolution of the dynamic system and eventually erase the internally stored information about the initial values of the input. We suppose $L$ fixed and allow sequences of arbitrary length $T \gg L$. By increasing $T$, we can create a long-term dependency problem. The network should provide an answer at the end of each sequence, so the problem can be solved only if the network is able to store information about the initial input values for an arbitrary duration. This is the simplest form of long-term computation that one may ask a recurrent network to carry out. We carried out the first experiment on an SMRNN with interval=15 and obtained the results shown in Table 1. In this experiment, we kept $L$ fixed and varied $T$ in increments of ten.
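To make the task concrete, a minimal sketch of how such a two-class latching dataset could be generated is given below. The binary alphabet, the two fixed class-defining prefixes and the uniform noise in the tail are our own illustrative assumptions; the paper does not specify these details.

import random

def make_latching_dataset(num_sequences, L, T, seed=0):
    """Generate (sequence, class) pairs whose class depends only on the first L symbols.

    Two fixed random prefixes of length L define class 0 and class 1; the remaining
    T - L symbols are random noise that is irrelevant to the class (eq. (18)).
    """
    rng = random.Random(seed)
    prefixes = [[rng.randint(0, 1) for _ in range(L)] for _ in range(2)]
    data = []
    for _ in range(num_sequences):
        label = rng.randint(0, 1)
        tail = [rng.randint(0, 1) for _ in range(T - L)]
        data.append((prefixes[label] + tail, label))
    return data

# Example: 100 training sequences with L = 50 and T = 80
train_set = make_latching_dataset(100, L=50, T=80)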

Table 1 Information latching in SMRNN with interval=15

L    T     train set size   test set size   accuracy
50   60    30               30              83.3%
50   60    50               50              96%
50   70    50               50              54%
50   70    80               80              92.5%
50   80    50               50              58%
50   80    80               80              88.8%
50   80    100              100             97%
50   90    50               50              66%
50   90    100              100             92%
50   90    150              150             100%
50   100   100              100             53%
50   100   150              150             98%
50   110   200              200             45.5%
50   110   300              300             96.3%
50   120   400              400             99%
50   130   500              500             99.6%
50   140   500              500             100%
50   150   600              600             100%
50   160   800              800             99.75%
50   170   1000             1000            99.9%
50   180   1200             1200            100%
50   190   1300             1300            99.9%
50   200   1400             1400            100%


As the sequences become longer, more training samples are required to achieve a satisfactory level of generalization. For sequences of length 60-200, the SMRNN can learn to classify the testing sequences with high accuracy. For comparison, an Elman network (Elman 1991) was also trained on the same task; the results are shown in Table 2.

Table 2 Information latching in Elman's network with 10 hidden units

L    T    train set size   test set size   accuracy
50   55   30               30              30%
50   55   100              100             72.5%
50   55   200              200             92.5%
50   60   100              100             40%
50   60   200              200             85%
50   65   200              200             75%
50   65   300              300             60%
50   65   400              400             47.5%

From this table, we observe that Elman's network has difficulty learning to classify sequences of length 65, and that its accuracy declines as the size of the training set increases. This means that the Elman network's low accuracy is not due to insufficient training data but to insufficient representational power. A comparison between Table 1 and Table 2 indicates that the SMRNN is able to capture much longer ranges of dependencies than Elman's network.

3 Bidirectional Segmented-Memory Recurrent Neural Network

As discussed in the previous section, the segmented-memory recurrent neural network can improve performance on long-term dependency problems. This advantage allows it to capture long-term dependencies in protein sequences for secondary structure prediction. However, the SMRNN is a causal system in the sense that the output at time t does not depend on future inputs.


Obviously, PSS prediction is a noncausal problem, in which the secondary structure pattern of an amino acid depends on both upstream and downstream information. To remove the causality assumption, we develop a bidirectional segmented-memory recurrent neural network (BSMRNN), which is the noncausal generalization of the SMRNN.

3.1 Architecture

In Gianluca Pollastri's bidirectional recurrent neural network, the forward subnetwork (left subnetwork in Figure 4) and the backward subnetwork (right subnetwork in Figure 4) are conventional recurrent neural networks.

Fig. 4 A BRNN architecture for PSS prediction

We replace the conventional recurrent subnetworks with segmented-memory recurrent neural networks and obtain a novel architecture called the Bidirectional Segmented-Memory Recurrent Neural Network (abbreviated as BSMRNN), illustrated in Figure 5. The upstream context and downstream context are contained in the vectors $F_t$ and $B_t$ respectively. Let the vector $I_t$ encode the input at time $t$; the vectors $F_t$ and $B_t$ are defined by the following recurrent bidirectional equations:

$$F_t = \phi(F_{t-d}, I_t) \tag{19}$$

$$B_t = \psi(B_{t+d}, I_t) \tag{20}$$


Fig. 5 Bidirectional segmented-memory recurrent neural network architecture

where $\phi(\cdot)$ and $\psi(\cdot)$ are nonlinear transition functions. They are implemented by the forward SMRNN $N_\phi$ and the backward SMRNN $N_\psi$ respectively (left and right subnetworks in Figure 5). The final output is obtained by combining the two hidden representations of context and the current input:

$$O_t = \zeta(F_t, B_t, I_t) \tag{21}$$

where $\zeta(\cdot)$ is realized by the MLP $N_\zeta$ (top subnetwork in Figure 5).

3.2 Inference and Learning

Given an amino acid sequence $X_1, \ldots, X_t, \ldots, X_T$, the BSMRNN can estimate the posterior probabilities of the secondary structure classes at each sequence position $t$. Starting from $F_0 = 0$, the forward SMRNN reads the preceding substring $X_1, \ldots, X_{t-1}$ from left to right and updates its state $F_t$ following eq. (19). Similarly, starting from $B_{T+1} = 0$, the backward SMRNN scans the succeeding substring $X_{t+1}, \ldots, X_T$ from right to left and updates its state $B_t$ following eq. (20). After the forward and backward propagations have taken place, the output at position $t$ is calculated with eq. (21). The BSMRNN is trained to minimize the mean squared error. Optimization is based on gradient descent, where the gradients are computed by a noncausal version of the ERTRL algorithm. The weights of the top subnetwork $N_\zeta$ are adjusted in the same way as for a standard MLP.


At position $t$, the derivatives of the error with respect to the weights in the forward subnetwork $N_\phi$ and those in the backward subnetwork $N_\psi$ are given by:

$$\frac{\partial E_t}{\partial W_\phi} = \frac{\partial E_t}{\partial F_t} \frac{\partial F_t}{\partial W_\phi} \tag{22}$$

$$\frac{\partial E_t}{\partial W_\psi} = \frac{\partial E_t}{\partial B_t} \frac{\partial B_t}{\partial W_\psi} \tag{23}$$

The derivatives of the error function with respect to the states $F_t$ and $B_t$, i.e. $\partial E_t / \partial F_t$ and $\partial E_t / \partial B_t$, are calculated and injected into the forward SMRNN $N_\phi$ and the backward SMRNN $N_\psi$ respectively. The error signal is then propagated over time in both directions, and the weights of $N_\phi$ and $N_\psi$ are adjusted using the same formulas as for the causal SMRNN (refer to Section 2.4).
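The inference procedure just described can be summarized in the following sketch. Here forward_smrnn, backward_smrnn and output_mlp stand in for the trained subnetworks $N_\phi$, $N_\psi$ and $N_\zeta$; the scan() method and the callable output network are assumptions for illustration, not the authors' implementation.

def bsmrnn_predict(sequence, forward_smrnn, backward_smrnn, output_mlp):
    """Per-position prediction with a BSMRNN (a sketch of eqs. (19)-(21)).

    forward_smrnn and backward_smrnn are assumed to expose a scan() method that
    returns the hidden context vector after reading a (sub)sequence (with F_0 = 0
    and B_{T+1} = 0 for empty input); output_mlp combines the two contexts with
    the current input to produce class posteriors.
    """
    T = len(sequence)
    outputs = []
    for t in range(T):
        F_t = forward_smrnn.scan(sequence[:t])               # left-to-right over X_1 .. X_{t-1}
        B_t = backward_smrnn.scan(sequence[t + 1:][::-1])    # right-to-left over X_{t+1} .. X_T
        outputs.append(output_mlp(F_t, B_t, sequence[t]))    # O_t = zeta(F_t, B_t, I_t)
    return outputs

In practice the forward and backward states would be computed once per sequence and cached rather than re-scanned at every position; the quadratic form above is only for clarity.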

4 Experimental evaluation

4.1 Data preparation

The assignment of the PSS categories can be performed by three programs, namely DSSP (Kabsch & Sander 1989), STRIDE (Frishman & Argos 1995) and DEFINE (Richards & Kundrot 1988). Assignment schemes impact prediction performance to some extent (Cuff & Barton 1999). In this project, we concentrate on the widely used DSSP assignment. The DSSP program classifies each residue into 8 classes: H (α-helix), B (isolated β-bridge), E (extended β-strand), G ($3_{10}$-helix), I (π-helix), T (hydrogen-bonded turn), S (bend) and C (none of H, B, E, G, I, T or S). However, prediction methods are normally trained and assessed on only 3 standard classes, associated with helices (H), β-strands (E) and coils (C), so the 8 classes must be reduced to 3. There are four main methods to perform this reduction (a minimal sketch of Method A appears after this list).
– Method A: H {H, G}, E {E, B}, C {S, T, I, C}
– Method B: H {H}, E {E}, C {G, S, T, B, I, C}
– Method C: H {H, G, I}, E {E, B}, C {S, T, C}
– Method D: H {H, G}, E {E}, C {S, T, B, I, C}
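As referenced in the list above, the Method A (CASP) reduction can be expressed as a simple mapping. The function name and the assumption that the input is a DSSP 8-state string are illustrative choices of ours.

# Method A (CASP reduction): 8 DSSP states -> 3 states H/E/C
CASP_REDUCTION = {
    "H": "H", "G": "H",                       # helices
    "E": "E", "B": "E",                       # strands
    "S": "C", "T": "C", "I": "C", "C": "C",   # everything else becomes coil
}

def reduce_to_three_states(dssp_string):
    """Map an 8-state DSSP assignment string to the 3-state H/E/C alphabet."""
    return "".join(CASP_REDUCTION[s] for s in dssp_string)

# Example: reduce_to_three_states("HGEETSBC") -> "HHEECCEC"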


A study of the effect of various assignments on prediction performance can be found in (Cuff & Barton 1999). In this article, Method A, also called the CASP reduction, was adopted because it is considered the strictest definition and usually results in lower prediction accuracy than the other definitions (Moult, Hubbard, Bryant, Fidelis & Pedersen 1997, Orengo, Bray, Hubbard, LoConte & Sillitoe 1999). Three main data sets are used to develop and test the BSMRNN predictor: RS126, CB396 and the PSIPRED set.

RS126: Many PSS prediction methods have been developed and tested on this set. It is a non-homologous dataset according to the definition given by Rost and Sander (Rost & Sander 1993), who used percentage identity to measure homology and defined non-homologous to mean that no two proteins in the dataset share more than 25% sequence identity over a length of more than 80% of the residues. The protein chains and multiple sequence alignments of the RS126 set are available at http://www.compbio.dundee.ac.uk/~www-jpred/data/pred_res/. With sevenfold cross validation, approximately six-sevenths of the RS126 set is used for training and the remaining one-seventh for testing. In order to avoid selecting extremely biased partitions that would give unrepresentative prediction accuracy, the RS126 set is divided into seven subsets, each of similar size and similar content of each type of secondary structure (Hua & Sun 2001, Casbon 2002). In practice, we tried several (> 10) different random partitions of the RS126 set. For each partition, we calculated the number of residues and the content of each secondary structure type (H, E and C) in each subset. The partition finally selected has the minimal bias, i.e. we kept the partition which distributed the three SS classes most evenly. The seven subsets of the RS126 set are given in the Appendix.

CB396: Cuff, J. and Barton, G. developed a new non-redundant dataset of 396 protein domains for the evaluation of PSS prediction algorithms (Cuff & Barton 1999). The protein chains and multiple sequence alignments of the CB396 set are available at http://www.compbio.dundee.ac.uk/~www-jpred/data/pred_res/.


The CB396 set does not include any of the 126 proteins in the RS126 set, nor does it contain homologs of those 126 proteins as measured by a stringent test of sequence similarity. Therefore, the RS126 and CB396 sets form a suitable pair of training and testing sets for the evaluation of PSS prediction methods.

PSIPRED set : David T. Jones selected three independent training and testing set pairs with which to develop and evaluate the PSIPRED method by using PSI-BLAST profiles. The PSIPRED set is available at ftp://bioinf.cs.ucl.ac.uk/pub/psipred/old/data/. It was used for three-way cross validation in our experiments.

The distribution of the three classes in the above data sets is summarized in Table 3.

Table 3 Three-class assignment statistics

           C(%)    H(%)    E(%)    number of protein chains
RS126      44.95   31.85   23.20   126
subset A   44.72   27.63   27.65   17
subset B   42.30   40.19   17.51   22
subset C   47.36   24.90   27.74   19
subset D   40.78   39.58   19.64   18
subset E   49.32   23.72   26.96   16
subset F   52.90   22.36   24.74   15
subset G   43.03   37.23   19.73   19
CB396      41.85   35.45   22.71   396
train1     44.50   33.08   22.42   1156
test1      43.52   34.37   22.10   63
train2     44.13   34.73   21.13   1096
test2      43.01   31.76   25.23   62
train3     44.34   33.08   22.58   1092
test3      41.65   36.98   21.37   62


Table 4 Parameters and total weights of BSMRNN and BRNN. Ct = size of semi-window of context states; NHC = number of hidden units in the forward/backward context subnetworks; NFB = number of output units in the forward/backward context subnetworks; NHO = number of hidden units in the output subnetwork.

          Ct   NHC   NFB   NHO   Weights
BSMRNN    0    10    10    20    1860
BRNN      0    10    10    20    1680

4.2 Profiles

Prediction from a multiple alignment profile of protein sequences rather than from a single sequence has long been recognized as a way to improve prediction accuracy (Rost & Sander 1993, Rost & Sander 1994, Riis & Krogh 1996, Francesco, Garnier & Munson 1996, Rost 1996). During evolution, residues with similar physico-chemical properties are conserved if they are important to the fold or function of the protein. The sequence alignment of homologous proteins accords with their structural alignment, and aligned residues usually have similar secondary structures. Multiple-sequence alignment profiles can be generated with the BLAST or PSI-BLAST programs. In this article, we used BLAST-generated profiles on the RS126 and CB396 sets, and PSI-BLAST-generated profiles on the PSIPRED set.

4.3 Architecture details and training

We carried out many experiments to tune the prediction system. The sizes of the BSMRNN and the BRNN, selected qualitatively as optimal, are given in Table 4. According to G. Pollastri's study, the BRNN can reliably utilize information contained within about ±15 amino acids around the predicted residue; we therefore set the interval of the SMRNN to 15. We shuffle the training set at each pass, presenting it in a different random order each time, so as to prevent undesirable effects such as oscillation and convergence to local minima. Training is stopped using a fixed threshold on the reduction in error.


For both the BSMRNN and the BRNN, no bias weights are added to the neurons, and all activation functions are the logistic sigmoid with range [0, 1].
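The shuffling and stopping rule just described can be summarized in a short sketch. The epoch limit, the threshold value, and the train_one_pass/compute_error helpers are illustrative assumptions rather than details given in the paper.

import random

def train(model, training_set, threshold=1e-4, max_epochs=1000):
    """Shuffle the training set each pass and stop when the error reduction
    falls below a fixed threshold (a sketch; the model interface is assumed)."""
    previous_error = float("inf")
    for epoch in range(max_epochs):
        random.shuffle(training_set)                    # new random presentation order
        for sequence, targets in training_set:
            model.train_one_pass(sequence, targets)     # hypothetical gradient update
        error = model.compute_error(training_set)       # hypothetical error measure
        if previous_error - error < threshold:          # fixed threshold on error reduction
            break
        previous_error = error
    return model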

4.4 Results

For the purpose of evaluating the methods, we firstly used sevenfold cross validation on the RS126 set and obtained the results illustrated in Table 5.

Table 5 Sevenfold cross validation on the RS126 set

          Q3(%)   SD     Q_E(%)   Q_H(%)   Q_C(%)   C_E    C_H    C_C    SOV
BSMRNN    72.3    0.08   67.5     70.9     76.0     0.53   0.67   0.53   62.0
BRNN      71.4    0.08   59.7     70.5     77.5     0.50   0.62   0.52   61.3

Q3 is the overall three-state prediction percentage, defined as the ratio of correctly predicted residues to the total number of residues. SD is the standard deviation of the per-protein accuracy. Q_E, Q_H and Q_C are the percentages of correctly predicted residues observed in classes E, H and C respectively. C_E, C_H and C_C are the Matthews correlation coefficients (Matthews 1975, Rost & Sander 1993). SOV (Segment OVerlap) is a segment-based measure for PSS prediction assessment (Zemla, Venclovas, Fidelis & Rost 1999, Cuff & Barton 1999). Table 6 provides the confusion matrices of the BSMRNN measured on the RS126 set. The number in row X_obs and column Y_pred represents the percentage of times structure Y is predicted given that structure X has been observed; the number in row X_pred and column Y_obs represents the percentage of times structure Y is observed given that structure X has been predicted. Each row sums to 100%. To further compare our system with Pollastri's BRNN, we also used the RS126 set for training and the CB396 set for testing. The performance of the BSMRNN and the BRNN is shown in Table 7. Finally, we used PSI-BLAST profiles as input and achieved an even higher accuracy of 75.8%, which is comparable to the 76% accuracy of D. T. Jones' PSIPRED method (Jones 1999).


Table 6 Confusion matrices for BSMRNN on the RS126 set. Xpred = structure X is predicted; Yobs = structure Y is observed

        Hpred     Epred     Cpred
Hobs    70.78%    9.25%     19.97%
Eobs    4.52%     67.49%    27.99%
Cobs    8.44%     15.52%    76.04%

        Hobs      Eobs      Cobs
Hpred   83.90%    10.97%    23.67%
Epred   4.02%     60.02%    24.89%
Cpred   8.44%     15.52%    76.04%

Table 7 Comparison of performance on the CB396 set

          Q3(%)   SD     Q_E(%)   Q_H(%)   Q_C(%)   C_E    C_H    C_C    SOV
BSMRNN    73.1    0.08   64.8     74.0     76.6     0.53   0.67   0.53   63.0
BRNN      72.0    0.07   56.3     74.7     76.9     0.50   0.63   0.53   60.7
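For reference, the per-residue measures reported in Tables 5 and 7 can be computed from observed and predicted three-state strings as in the following sketch (the segment-based SOV measure is omitted because its definition is more involved; the function names are our own, not part of the original evaluation code).

import math

STATES = "HEC"

def q3_and_per_class(observed, predicted):
    """Overall Q3 and per-class accuracies Q_H, Q_E, Q_C from two 3-state strings."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    q3 = 100.0 * correct / len(observed)
    per_class = {}
    for s in STATES:
        n_s = observed.count(s)
        hits = sum(o == p == s for o, p in zip(observed, predicted))
        per_class[s] = 100.0 * hits / n_s if n_s else float("nan")
    return q3, per_class

def matthews(observed, predicted, s):
    """Matthews correlation coefficient for one state s (one-vs-rest)."""
    tp = sum(o == s and p == s for o, p in zip(observed, predicted))
    tn = sum(o != s and p != s for o, p in zip(observed, predicted))
    fp = sum(o != s and p == s for o, p in zip(observed, predicted))
    fn = sum(o == s and p != s for o, p in zip(observed, predicted))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0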

5 Discussion

Protein secondary structure prediction is a problem of long-term dependencies. In order to capture the long-term dependencies in protein sequences, we propose a segmented-memory RNN (SMRNN) and replace the conventional RNNs in the BRNN with SMRNNs, resulting in a bidirectional segmented-memory RNN (BSMRNN). The test on the information latching problem confirms that the SMRNN is more capable of capturing long-term dependencies than conventional RNNs. For the problem of PSS prediction, we provide experimental results showing that the BSMRNN achieves higher prediction accuracy than the BRNN. Given the large number of protein sequences available through genome and other sequencing projects, even a small percentage improvement in SS prediction can be significant. Most importantly, the higher prediction accuracy on β-sheets indicates that the BSMRNN captures longer ranges of dependencies in protein sequences than the BRNN.


However, there is a trade-off between the efficient training of gradient descent and long-range information latching. For the BSMRNN, the training algorithm is still essentially gradient descent, so the BSMRNN does not circumvent the problem of long-term dependencies. Nevertheless, the BSMRNN is more efficient at learning long-term dependencies for PSS prediction, especially the prediction of β-sheets, whose conformation involves interactions between distant amino acids. The known protein structures have been classified into four structural classes, namely all-α, all-β, α/β and α+β (Levitt & Chothia 1976, Richardson & Richardson 1989). Several approaches have been proposed for protein structural class prediction (Kumarevel, Gromiha & Ponnuswamy 2000, Luo, Feng & Liu 2002). Information about the structural class may help to improve the accuracy of secondary structure prediction schemes. An evaluation (Gromiha & Selvaraj 1998) of several structure prediction methods on different structural classes revealed that, in spite of the differences in approach, they all predict the secondary structures of proteins belonging to the all-α class better than those of other classes. A plausible reason for this tendency is that short- and medium-range interactions predominate in all-α proteins, and most secondary structure prediction algorithms take into account only the effect of neighboring residues, which covers short- and medium-range interactions. The authors suggested that developing secondary structure prediction techniques that are specific to each structural class and incorporate the influence of short-, medium- and long-range interactions may improve prediction accuracy. Since the BSMRNN predictor is designed specifically for long-range interactions, it may be advantageous to train the BSMRNN specifically for all-β or α/β proteins, which involve long-range interactions, or to use different intervals for different classes of proteins, say short intervals for all-α proteins and long intervals for all-β proteins. Recently, the accuracy of protein secondary structure prediction has kept rising owing to various factors. Several researchers have analyzed the effect of alignment quality and found that more divergent profiles yield better predictions.


Przybylski and Rost's study (Przybylski & Rost 2001) of the influence of various alignment strategies on the PHD method indicates that more than 60% of the improvement originated from the growth of current sequence databases; about 20% resulted from detailed changes in the alignment procedure (substitution matrix, threshold, gap penalties); another 20% resulted from the careful use of iterated PSI-BLAST searches. Cuff and Barton also showed that different types of multiple sequence alignment profiles yield accuracies ranging from 70.5% to 76.4% (Cuff & Barton 2000). Hence, we can further improve the performance of the BSMRNN predictor by selecting a larger database and more appropriate search methods, alignment algorithms and scoring schemes. Our preliminary experiments encourage further investigation. The BSMRNN predictor can be extended in additional directions. In addition to the use of a larger input window for $I_t$, we may consider embedded memory in the forward and backward subnetworks, and the use of priors on the parameters, especially on the value of the interval. It may also be advantageous to combine several BSMRNNs of different sizes and architectural details into an ensemble, using a simple averaging scheme.

References

Bengio, Y., Simard, P. & Frasconi, P. (1994), 'Learning long-term dependencies with gradient descent is difficult', IEEE Transactions on Neural Networks 5(2), 157–166.
Casbon, J. (2002), Protein secondary structure prediction with support vector machines, Master's thesis, University of Sussex.
Cuff, J. & Barton, G. (1999), 'Evaluation and improvement of multiple sequence methods for protein secondary structure prediction', Proteins 34, 508–519.
Cuff, J. & Barton, G. (2000), 'Application of multiple sequence alignment profiles to improve protein secondary structure prediction', Proteins 40, 502–511.
Elman, J. (1991), 'Distributed representations, simple recurrent networks, and grammatical structure', Machine Learning 7(2/3), 195–226.
Francesco, V., Garnier, J. & Munson, P. (1996), 'Improving protein secondary structure prediction with aligned homologous sequences', Prot. Sci. 5, 106–113.


Frick, R. (1989), 'Explanations of grouping effects in immediate ordered recall', Cognition 17, 551–562.
Frishman, D. & Argos, P. (1995), 'Knowledge-based protein secondary structure assignment', Proteins 23(4), 566–579.
Gibrat, J., Garnier, J. & Robson, B. (1987), 'Further developments of protein secondary structure prediction using information theory', J. Mol. Biol. 198, 425–443.
Gromiha, M. & Selvaraj, S. (1998), 'Protein secondary structure prediction in different structural classes', Protein Eng. 11, 249–251.
Hitch, G., Burgess, N., Towse, J. & Culpin, V. (1996), 'Temporal grouping effects in immediate recall: A working memory analysis', Quarterly Journal of Experimental Psychology 49A, 116–139.
Hua, S. & Sun, Z. (2001), 'A novel method of protein secondary structure prediction with high segment overlap measure: Support vector machine approach', J. Mol. Biol. 308, 397–407.
Jones, D. (1999), 'Protein secondary structure prediction based on position-specific scoring matrices', J. Mol. Biol. 292, 195–202.
Kabsch, W. & Sander, C. (1989), 'Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features', Biopolymers 22, 2577–2637.
Kumarevel, T., Gromiha, M. & Ponnuswamy, M. (2000), 'Structural class prediction: an application of residue distribution along the sequence', Biophysical Chemistry 88, 81–101.
Levitt, M. & Chothia, C. (1976), 'Structural patterns in globular proteins', Nature 261, 552–558.
Luo, R., Feng, Z. & Liu, J. (2002), 'Prediction of protein structural class by amino acid and polypeptide composition', Eur. J. Biochem. 269, 4219–4225.
Matthews, B. (1975), 'Comparison of the predicted and observed secondary structure of T4 phage lysozyme', Biochim. Biophys. Acta 405, 442–451.
Moult, J., Hubbard, T., Bryant, S., Fidelis, K. & Pedersen, J. (1997), 'Critical assessment of methods of protein structure prediction (CASP): Round II', Proteins: Structure, Function and Genetics 29, 2–6.
Orengo, C., Bray, J., Hubbard, T., LoConte, L. & Sillitoe, I. (1999), 'Analysis and assessment of ab initio three-dimensional prediction, secondary structure, and contacts prediction', Proteins: Structure, Function and Genetics 37, 149–170.


Pollastri, G., Przybylski, D., Rost, B. & Baldi, P. (2002), 'Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles', Proteins 47, 228–235.
Przybylski, D. & Rost, B. (2001), 'Alignments grow, secondary structure prediction improves', Proteins: Structure, Function, and Genetics 46, 197–205.
Richards, F. & Kundrot, C. (1988), 'Identification of structural motifs from protein coordinate data: secondary structure and first-level supersecondary structure', Proteins 3, 71–84.
Richardson, J. & Richardson, D. (1989), Principles and patterns of protein conformation, Plenum Press, New York, pp. 1–98.
Riis, S. & Krogh, A. (1996), 'Improving prediction of protein secondary structure using structural neural networks and multiple sequence alignment', J. Comp. Biol. 3, 163–183.
Rost, B. (1996), 'PHD: predicting one-dimensional protein structure by profile-based neural networks', Meth. Enzymol. 266, 525–539.
Rost, B. & Sander, C. (1993), 'Prediction of protein secondary structure at better than 70% accuracy', J. Mol. Biol. 232, 584–599.
Rost, B. & Sander, C. (1994), 'Combining evolutionary information and neural networks to predict protein secondary structure', Proteins 19, 55–72.
Ryan, J. (1969a), 'Grouping and short-term memory: different means and patterns of grouping', The Quarterly Journal of Experimental Psychology 21, 137–147.
Ryan, J. (1969b), 'Temporal grouping, rehearsal and short-term memory', Quarterly Journal of Experimental Psychology 21, 148–155.
Severin, F. & Rigby, M. (1963), 'Influence of digit grouping on memory for telephone numbers', Journal of Applied Psychology 47, 117–119.
Wickelgren, W. (1967), 'Rehearsal grouping and hierarchical organization of serial position cues in short-term memory', Quarterly Journal of Experimental Psychology 19, 97–102.
Yi, T. & Lander, E. (1993), 'Protein secondary structure prediction using nearest-neighbor methods', J. Mol. Biol. 232, 1117–1129.
Zemla, A., Venclovas, C., Fidelis, K. & Rost, B. (1999), 'A modified definition of SOV, a segment-based measure for protein secondary structure prediction assessment', Proteins: Structure, Function, and Genetics 34, 220–223.


APPENDIX

Table 8 shows the division of the RS126 dataset into seven subsets (Set A to Set G).

Table 8 Subsets of the RS126 set used for sevenfold cross validation

Set A: 4cpai 2or1l 256ba 9wgaa 3tima 4tsla 8adh 2tgpi 2pcy 1fc2c 2gn5 7rsa 1gp1a 3blm 4cms 1lap 2sns 1bmv1
Set B: 9insb 3ebx 1fkf 2i1b 1mcpl 2gbp 6cpp 2mev4 5hvpa 1crn 3icb 1eca 2gcr 1pyp 5ldh 5gr1 3cln 9pap
Set C: 1cbh 1il8a 2hmza 6dfr 1r092 6tmne 1bmv2 4sgbi 2wrpr 4rhv4 1ubq 1azu 3hmgb 1s01 1gd10 2glsa 1sdha 3cla
Set D: 2mhu 3ait 4cpv 1bbpa 1wsya 2cyp 4xiaa 1fdx 5cytr 6hir 3b5c 2ccya 1rbp 2tsca 5er2e 7icd 1tnfa 2stv
Set E: 1ppt 2utga 2paba 2lh4 4rhv3 2fnr 2aat 4rxn 3rnt 1bds 1hip 5lyz 2ltna 2cab 9apia 6acn 4fxn 1ak3a
Set F: 1mrt 1csei 2rspa 2tmvp 3pgm 8abp 2phh 1ovoa 1lrd3 2ltnb 1cc5 4bp2 1cd4 1rhd 3hmga 6cts 2sodb 2alp
Set G: 9apib 1cdta 1acx 1l58 1fdlh 6cpa 1wsyb 1tgsi 1fxia 1sh1 2fxb 1paz 1etu 4rhv1 4pfk 7cata 2lhb 3gapa
