Computers in Biology and Medicine 39 (2009) 206 -- 214


Predicting RNA secondary structure based on the class information and Hopfield network

Quan Zou a,*, Tuo Zhao b, Yang Liu a, Maozu Guo a,*

a Department of Computer Science and Technology, Harbin Institute of Technology, 150001 Harbin, China
b Department of Mathematics and Statistics, University of Minnesota Duluth, Solon Campus Center 140, 1117 University Drive, Duluth, MN 55812-3000, USA

ARTICLE INFO

Article history: Received 19 January 2008; Accepted 16 December 2008
Keywords: RNA secondary structure prediction; Maximum independent set problem; Hopfield network; Atomic force microscopy (AFM)

ABSTRACT

One model for RNA secondary structure prediction views it as a maximum independent set problem, which can be approximately solved by a Hopfield network. However, when predicting native molecules, the model is not always accurate and the heuristic Hopfield method is not always stable. This is because the class information is lost and the accuracy is not determined by the number of base pairs alone. Secondary structures of non-coding RNAs are believed to be conserved within the same class, yet current software and web servers for RNA secondary structure prediction do not consider this class information. In this paper, we incorporate class information into the initialization of the Hopfield network. With the improved initialization, promising experimental results show the efficacy and superiority of our proposed methods. Crown Copyright © 2008 Published by Elsevier Ltd. All rights reserved.

1. Introduction

RNA secondary structure prediction aims to find all the base pairs of a primary sequence. Although it has been viewed as an open problem in computational biology for a long time, more and more researchers have taken an interest with the development of non-coding RNA (ncRNA) [1], because secondary structure is now often used when identifying ncRNAs in a genome [2–4]. Much of the literature considers that the secondary structures of ncRNAs or their precursors are conserved within the same class. If the class is given, biological knowledge can tell the approximate shape of the predicted RNA secondary structure. The class of a molecule is usually known in biological research, and even when it is unknown, a technique named atomic force microscopy (AFM) [5–7] can reveal the approximate shape. However, to our knowledge, no software programs or web servers utilize the class or AFM shape information for predicting RNA secondary structure; all previous algorithms ignore the class information, which is incorporated in our method.

The current algorithms employed in the prediction of RNA structure include comparative methods [8], kinetic folding [9], dynamic programming [10] and evolutionary algorithms [11–13].

∗ Corresponding authors. Tel.: +86 451 86402407. E-mail addresses: [email protected] (Q. Zou), [email protected] (M. Guo).

Comparative methods work by analyzing multiple sequences to determine points of base covariance within the sequences; these points indicate probable base pair positions. Kinetic folding attempts to simulate the folding dynamics of RNA secondary structure through the gradual modification of base pair positions. Structure prediction can also be viewed as an energy minimization problem under thermodynamic models [14], so dynamic programming and heuristic methods can be applied to this optimization. Zuker succeeded in finding optimal and suboptimal RNA secondary structures with dynamic programming [15]. When the sizes of bulges and internal loops are limited, the time complexity can be reduced to O(kn²) [16] (k is the maximum size of internal loops). However, the optimal structure under the minimum free energy model sometimes deviates from the native secondary structure, so evolutionary algorithms are also employed, including genetic algorithms [17–19] and Hopfield networks [20,21]. Genetic algorithms can be parallelized, and Hopfield networks run faster than dynamic programming. We improve the Hopfield network by combining it with the class or approximate-shape information.

Under the minimum free energy model, RNA secondary structure prediction can be viewed as a maximum independent set problem. An independent set in a graph is a set of vertices, no two of which are adjacent; a maximum independent set is an independent set of largest cardinality among all independent sets of the graph. RNA secondary structure prediction derives the base pairs from the primary sequence, where bases match into pairs by the Watson–Crick rules. Since the alphabet of an RNA sequence is so small (only four letters, {A, G, C, U}), many pairs besides the native ones are possible. The minimum free energy model considers that base

0010-4825/$ - see front matter Crown Copyright © 2008 Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiomed.2008.12.010


pairs reduce the energy. However, some base pairs are incompatible, because one nucleotide cannot pair with two or more nucleotides at the same time, and pseudoknots are not considered either. So we should search for a set of mutually compatible base pairs that gives the minimum free energy. The energy is sometimes assumed to be proportional to the number of base pairs, so minimum energy means maximum base pairs. We view every possible base pair as a vertex, and if two base pairs are incompatible, there is an edge between the corresponding vertices. Therefore, finding the maximum set of compatible base pairs is equivalent to finding the maximum independent set in the corresponding graph. Finding a maximum independent set is NP-hard for arbitrary graphs; however, a near-maximum independent set of a circle graph can be found efficiently by a Hopfield network.

In this paper, we briefly introduce a Hopfield network for the maximum independent set in Section 2; in Section 3, we modify the Hopfield network to predict more accurately according to the characteristics of homologous RNA structures. Comparative experiments are presented in Section 4. Finally, we conclude the paper and discuss future research in Section 5.
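To make the graph construction above concrete, here is a minimal sketch (ours, not the authors' code): it enumerates candidate Watson–Crick pairs of a toy sequence and connects incompatible ones, i.e. pairs sharing a nucleotide or crossing (a pseudoknot), with an edge. The helper names and the minimum-loop constant are illustrative assumptions.

```python
from itertools import combinations

def candidate_pairs(seq, min_loop=3):
    """Enumerate Watson-Crick pairs (i, j) separated by a minimal hairpin loop."""
    wc = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    return [(i, j)
            for i in range(len(seq))
            for j in range(i + 1 + min_loop, len(seq))
            if (seq[i], seq[j]) in wc]

def incompatible(p, q):
    """Two pairs conflict if they share a base or cross (a pseudoknot)."""
    (i, j), (k, l) = p, q
    share = len({i, j} & {k, l}) > 0
    cross = (i < k < j < l) or (k < i < l < j)
    return share or cross

def conflict_graph(pairs):
    """Edges between indices of incompatible candidate pairs."""
    return [(a, b) for a, b in combinations(range(len(pairs)), 2)
            if incompatible(pairs[a], pairs[b])]

pairs = candidate_pairs("GGGAAACCC")
edges = conflict_graph(pairs)
```

A maximum independent set of this conflict graph then corresponds to a largest set of mutually compatible base pairs.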


[Figure 1: drawing of a tRNA secondary structure; the scattered base letters of the figure are omitted.]

Fig. 1. Conflicts between acceptor-arm and D-loop in tRNA.

2. Hopfield network for maximum independent set

A Hopfield network is an artificial neural network whose outputs feed back into its inputs. It can search for a maximum independent set as follows. Create as many neurons as there are vertices, and let the ith neuron stand for the ith vertex. Each neuron i carries two values, U(i) and V(i): U(i) is a real-valued variable and V(i) is Boolean. If U(i) > 0 then V(i) = 1, else V(i) = 0, and V(i) = 1 means that the ith vertex is not selected into the maximum independent set. The Hopfield network changes the values of all the U(i) repeatedly until they no longer change; we then say it has reached a stable state.

How to compute the change of U(i) is the key point. Takefuji et al. [20] used:

dU(i) = A (Σ_{j=1..n} d_ij (1 − V_j)) (dis(i))^(−1) (1 − V_i) − B h(Σ_{j=1..n} d_ij (1 − V_j)) V_i    (1)

Here dU(i) is the amount by which U(i) changes at each step; A and B are two weight parameters; d_ij is a function of compatibility: if there is an edge between the ith vertex and the jth vertex, d_ij = 1, else d_ij = 0; dis(i) can be viewed as a weight of the ith vertex; and h(x) is a step function with h(x) = 0 if x ≥ 1 and h(x) = 1 if x = 0.

Eq. (1) means that if the ith vertex is selected into the maximum independent set (V_i = 0), the latter term of (1) is 0, so dU(i) = A(Σ_{j=1..n} d_ij (1 − V_j))(dis(i))^(−1). When d_ij(1 − V_j) ≠ 0, the jth vertex is also selected into the maximum independent set (V_j ≠ 1) and is incompatible with the ith vertex (d_ij ≠ 0). Then dU(i) equals the number of selected but incompatible vertices multiplied by the weight A and the ith vertex's weight dis(i)^(−1). If the ith vertex is not selected (1 − V_i = 0), the former term of Eq. (1) is 0, so dU(i) = −B h(Σ_{j=1..n} d_ij (1 − V_j)) V_i. When h(Σ_{j=1..n} d_ij (1 − V_j)) ≠ 0, we have Σ_{j=1..n} d_ij (1 − V_j) = 0, i.e., d_ij(1 − V_j) = 0 for every vertex j. So if the jth vertex is selected (1 − V_j ≠ 0), d_ij must be 0, which means the ith and jth vertices are compatible. Therefore, when the ith vertex is not selected and all the selected vertices are compatible with it, dU(i) is decreased by B; otherwise dU(i) does not change. When all the neurons reach the stable state, the vertices with V(i) = 0 are selected, and the Hopfield network finds an approximate maximum independent set. Hence, Takefuji et al. [20] and Liu et al. [22] aimed to predict RNA secondary structure with

it. Liu combined the thermodynamic characteristics of different base pairs to obtain a more energetically stable structure, and selected stems instead of base pairs. This performs better, on the condition that the native stems cannot be extended any further. Fig. 1 shows the secondary structure of a tRNA molecule. Two stems in this figure are the acceptor-arm (five base pairs) and the D-loop (four base pairs). But the base pairs drawn with dots are also legal as long as the native secondary structure is unknown, so the U–A and C–G pairs are found in both the acceptor-arm and the D-loop. When we predict the secondary structure, the two helices conflict, and Liu et al.'s method [22] would exclude one of the stems. Therefore, we should unfold the stems and deal with bases directly.

However, the real secondary structure often deviates from the most energetically stable one, while homologous molecules are usually structurally similar. The comparative sequence analysis method was developed from this observation, and here we combine it with the Hopfield network for more accurate RNA secondary structure prediction.

3. Improved Hopfield network for RNA secondary structure prediction

3.1. Analysis of convergence

A standard Hopfield network revises the values of its neurons with a linear formula:

net_j = Σ_{i=1, i≠j}^{n} (ω_ij o_i + x_i)    (2)

Here o_j is a threshold function:

o_j = 1 if net_j > θ_j; o_j = 0 if net_j ≤ θ_j

where θ_j is a threshold value. We denote the matrix {ω_ij} by Ω. In 1983, Cohen and Grossberg proved that the simplest Hopfield network is stable if Ω is a symmetric matrix and all its diagonal elements are 0 [23]. However, the Hopfield network used here cannot be formalized as Eq. (2), because Eq. (1) is too complex to simplify into a linear formula.
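As one concrete reading of the nonlinear update in Eq. (1), here is a hedged sketch; the function names, data layout and parameter values are our illustrative assumptions, and the synchronous update and stopping test are simplified.

```python
def takefuji_step(U, edges, dis, A=0.01, B=0.05):
    """One synchronous update of Eq. (1).
    U: neuron potentials; edges[i]: set of vertices incompatible with i;
    dis[i]: weight of vertex i. V[i] = 1 means vertex i is NOT selected."""
    V = [1 if u > 0 else 0 for u in U]
    dU = []
    for i in range(len(U)):
        # number of selected, incompatible neighbours: sum of d_ij * (1 - V_j)
        conflicts = sum(1 - V[j] for j in edges[i])
        if V[i] == 0:
            # selected vertex: push U(i) up while conflicts remain
            dU.append(A * conflicts / dis[i])
        else:
            # unselected vertex: pull down by B only if no selected conflicts
            dU.append(-B if conflicts == 0 else 0.0)
    return [u + d for u, d in zip(U, dU)], dU

def run(U, edges, dis, max_iter=1000):
    """Iterate until a stable state (all dU == 0); return selected vertices."""
    for _ in range(max_iter):
        U, dU = takefuji_step(U, edges, dis)
        if all(d == 0 for d in dU):
            break
    return [i for i, u in enumerate(U) if u <= 0]   # V(i) = 0 -> selected
```

On a toy triangle of mutually incompatible vertices, the vertex with the smallest initial U(i) stays selected, matching the observation below that the initial values decide the output.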


So we cannot analyze the stability according to Cohen and Grossberg's theory. Here we analyze the stability of Takefuji's Hopfield network directly from Eq. (1). As it is more complex, we cannot draw as general a conclusion as Cohen and Grossberg, but the analysis is still useful.

From Eq. (1) we can see that dU(i) is computed in two different ways according to V_i. If V_i = 1, dU(i) = −B h(Σ_{j=1..n} d_ij (1 − V_j)). dU(i) is nonzero only when every d_ij(1 − V_j) equals 0. That is, if the ith neuron is not selected, its value changes only when all the neurons incompatible with it are also unselected. Hence, for every unselected neuron, if any incompatible neuron is selected, its value U(i) is not changed.

When V_i = 0, the situation is somewhat more complex. Here dU(i) = A(Σ_{j=1..n} d_ij (1 − V_j))(dis(i))^(−1). This means that for every selected neuron i, as long as some incompatible neurons are also selected, U(i) is increased by a small positive amount until it becomes positive or the other incompatible neurons are unselected. If the selected neurons start with different negative values, the incompatible ones are each increased by a small value per cycle; after several cycles the values of some neurons exceed 0, and the process stops when the remaining selected neurons are mutually compatible. So when the initial values of the neurons are proper and A is small enough, the Hopfield network is convergent and stable.

Obviously, of two incompatible neurons, the one with the smaller initial value is more likely to be selected, and different initial values may influence the final output. Giving a proper initialization is therefore rather important. Takefuji and Liu initialized every U(i) with an equal small negative number, or randomized them, when predicting RNA secondary structure. This destroys the stability of the Hopfield network and the result often deviates from the real structure. In this paper, we initialize according to the similarity of homologous molecules. This does not solve the general maximum independent set problem more effectively, but it improves the performance when predicting RNA secondary structure, as detailed below.

3.2. Hopfield network combined with the comparative analysis method

The output of a Hopfield network is strongly influenced by its initialization. Refs. [20] and [22] ignore the structural characteristics of particular RNA classes. For instance, the secondary structure of tRNA resembles a cloverleaf, while microRNA and siRNA precursors look like hairpins. Since the output is influenced by the initialization, we can initialize U(i) with this structural information if the class of the predicted RNA is known. How is the class information given to the computer? For a short RNA sequence, we can align the predicted structure with a cloverleaf or a hairpin.
But when dealing with more complex RNA, for example RNase P RNA, the structure is too complex for such a template. In this case, we call for a homologous structure together with its sequence, which is easily obtained from existing knowledge or from RNA secondary structure databases such as Rfam [24,25], NONCODE [26] and RNAWorld [27]. Sometimes the class information is not available, and the RNA molecule must be predicted ab initio as before. However, AFM, a recently developed technique, can give the approximate shape of a molecule. Although it cannot tell exactly which bases pair, researchers can see the approximate positions of the stems. Furthermore, this technique costs much less than X-ray crystallography and nuclear magnetic resonance, which were previously used for detecting molecular structure. We therefore also develop a variant of our method that utilizes the AFM information.

3.2.1. Initialization with the AFM approximate shape only

Using AFM, we can obtain the approximate shape of the predicted RNA molecule. When the reference sequence is unknown, this shape of the secondary structure is used only to supervise the prediction: we position all the stems, and base pairs

near those positions are considered more likely to be real. The position of a stem in a sequence can be described with three parameters: a beginning site, an ending site and the length of the stem. We use a triple {i, j, k} to denote these three parameters. Every stem, whether or not it is a real stem, has a triple describing its position. So from the triples of two stems that come from different homologous molecules, we can compute the "distance between the stems", and with this distance information we can initialize the neurons properly. We use the "distance between stems" instead of the "distance between base pairs" because the conservation between stems and their counterparts is stronger than that between base pairs and their counterparts.

Given the predicted sequence, we should first find all the possible stems. The set of possible stems is called the "stems pool". All the base pairs in these stems will constitute the neurons of the Hopfield network, so they should be initialized differently according to their similarity with the counterpart stems in the known structure. How do we measure this similarity? Assume that the length of the sequence with known structure is L^kn and its ith stem is denoted by the triple {b_i^kn, e_i^kn, l_i^kn}; the length of the predicted sequence is L^pr and its jth stem is denoted by the triple {b_j^pr, e_j^pr, l_j^pr}. Then the distance between these two stems can be computed as

dis(i, j) = |b_i^kn / L^kn − b_j^pr / L^pr| + |e_i^kn / L^kn − e_j^pr / L^pr| + λ |l_i^kn − l_j^pr|    (3)
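A minimal sketch of Eq. (3) (variable names are ours; the stem triples and the weight `lam` follow the notation above), also computing the least distance per candidate stem, which the text below denotes D(j) and uses for initialization:

```python
def stem_distance(kn, pr, L_kn, L_pr, lam):
    """Eq. (3): distance between a known stem triple kn = (b, e, l) and a
    predicted stem triple pr = (b, e, l); lam weights the length difference."""
    return (abs(kn[0] / L_kn - pr[0] / L_pr)
            + abs(kn[1] / L_kn - pr[1] / L_pr)
            + lam * abs(kn[2] - pr[2]))

def initial_values(known_stems, pool, L_kn, L_pr, lam):
    """For each candidate stem in the pool, the least Eq.-(3) distance
    to any stem of the known structure (the D(j) of the text)."""
    return [min(stem_distance(kn, pr, L_kn, L_pr, lam) for kn in known_stems)
            for pr in pool]
```

A candidate stem sitting close to a known stem thus receives a small (more favourable) initialization value.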

The parameter λ is given as a weight; it usually falls around 1/L^pr. For every jth stem in the possible stems pool, there is an ith stem in the known structure for which dis(i, j) is least; we record this least value with a function D(j). From homologous conservation we can conclude that the smaller D(j) is, the more likely the base pairs in the jth stem are real. Therefore, we initialize every base pair in the Hopfield network with the value D(j) of the stem j it belongs to.

3.2.2. Initialization with class information

When the reference sequence is known, D(j) can be acquired more accurately. If dis(i, j) is computed with Eq. (3), counterpart stems may sometimes drift because of mutations during homologous evolution. As Fig. 2 shows, the red region is inserted during evolution, so dis(i, j) for counterpart stems may be larger than for others: even though the two stems in Fig. 2 correspond, (i/L^pr, j/L^pr) deviates from (m/L^kn, n/L^kn) because of the insertion mutation. When the reference sequence is unknown, we cannot find where the insertion or deletion takes place, and can only reduce its influence by dividing positions by the sequence length. Here, however, the reference sequence is provided, so sequence alignment can be used to avoid the position drift.

Dynamic programming is a typical method for pairwise sequence alignment. When mutations are involved, the occurrence of one gap of k spaces is more probable than the occurrence of k isolated spaces; if two homologous sequences differ in length, there are usually a few gaps rather than many scattered spaces. So a gap penalty function in the alignment helps to find the corresponding stems in homologous molecules. When two sequences are globally aligned, some spaces "-" are added to the sequences, so the position of every character of the original sequence may drift in the aligned sequence.
Here we describe the changed position with a function φ(x). For the ith character of the reference sequence, its position after alignment with the predicted sequence is denoted φ(i); for the jth character of the predicted sequence, its position after alignment with the reference sequence is denoted φ′(j).

[Figure 2: in the known structure the stem pairs positions m to m+k with n−k to n (5′→3′); in the predicted structure the counterpart stem pairs i to i+k with j−k to j, shifted by an inserted region.]

Fig. 2. Positions drift for two counterpart stems.

As the globally aligned sequences have the same length, Eq. (3) can be modified as follows:

dis(i, j) = |φ(b_i^kn) − φ′(b_j^pr)| + |φ(e_i^kn) − φ′(e_j^pr)| + λ |l_i^kn − l_j^pr|    (4)
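Eq. (4) can be sketched in the same spirit; here the maps φ and φ′ are built from already-aligned (gapped) strings rather than by running the alignment itself, and all names are our illustrative assumptions:

```python
def position_map(aligned):
    """phi: original index -> aligned index, for a gapped string like 'AC-GU'."""
    return [k for k, c in enumerate(aligned) if c != "-"]

def stem_distance_aligned(kn, pr, phi, phi2, lam):
    """Eq. (4): compare begin/end sites in aligned coordinates;
    kn, pr are stem triples (b, e, l); lam weights the length difference."""
    return (abs(phi[kn[0]] - phi2[pr[0]])
            + abs(phi[kn[1]] - phi2[pr[1]])
            + lam * abs(kn[2] - pr[2]))

phi = position_map("AC-GU")   # reference after global alignment (toy example)
phi2 = position_map("ACAGU")  # predicted after global alignment
```

Because both sequences live in the same aligned coordinate system, an insertion shifts φ and φ′ together, so counterpart stems no longer drift apart.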

Then the neurons of the Hopfield network are initialized as introduced in Section 3.2.1 above.

3.2.3. Implementation

Besides the initialization, some techniques are also important in the implementation. In order to make the secondary structure more stable, long stems should be selected with higher priority than shorter ones. For example, suppose there is a 4 nt stem in the homologous molecule, and we find a three-length and a five-length stem at the corresponding position. The length difference in Eqs. (3) and (4) gives both the same penalty, yet the latter is clearly more stable than the former. So we use a new function Ld(l_i^kn, l_j^pr) instead of |l_i^kn − l_j^pr|, defined as

Ld(l_i^kn, l_j^pr) = l_i^kn − l_j^pr if l_i^kn > l_j^pr, and 0 otherwise    (5)
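The length penalty of Eq. (5) and the single-base-pair post-processing described next in this subsection can be sketched together (naming is ours; reading stems as runs of stacked pairs (i, j), (i+1, j−1), … is our interpretation of the text):

```python
def length_penalty(l_kn, l_pr):
    """Eq. (5): penalize only candidates shorter than the homologous stem."""
    return l_kn - l_pr if l_kn > l_pr else 0

def prune_short_stems(pairs, min_len=2):
    """Drop stems shorter than min_len base pairs; a stem is a run of
    stacked pairs (i, j), (i+1, j-1), ... in the predicted pair list."""
    pairs = sorted(pairs)
    pair_set = set(pairs)
    kept, seen = [], set()
    for p in pairs:
        if p in seen:
            continue
        stem = []
        i, j = p
        while (i, j) in pair_set:   # walk down the stacked run
            stem.append((i, j))
            i, j = i + 1, j - 1
        seen.update(stem)
        if len(stem) >= min_len:
            kept.extend(stem)
    return kept
```

With `min_len=2` this removes the unstable single base pairs; for classes like tRNA, `min_len=3` would enforce the longer minimum stem length mentioned above.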

Another technique comes from experiments. In the experiments we find that the Hopfield network often outputs single base pairs which cannot make up stems. Mathews et al. [28] have shown that a single base pair is not stable and that a stem must contain at least two base pairs. Hence we check the output of the Hopfield network and remove the single base pairs. Stems in some classes are even longer; for instance, stems in tRNA contain at least three base pairs, so we can exclude short stems according to the predicted class.

The algorithm presented in this paper can be divided into two phases, initialization and training. In the initialization phase there are two computation paths, depending on whether a reference sequence is available. In the training phase we follow Takefuji's work. The whole process is shown in Fig. 3. Note that the Hopfield network here is used for optimization rather than classification, although it is a machine learning method. The training phase in our flow therefore differs from training in machine learning classification, and there is no "testing phase" in our method, because our training phase, which repeatedly modifies the network toward a stable state, aims at optimization rather than at learning rules.

4. Experiments

4.1. Measurement for prediction

Sensitivity (X) and specificity (Y) are common measures for determining the accuracy of prediction methods [29]. Specificity is also known as the "selectivity" [30] and "positive predictive value"

[31,32]. We use the standard definitions of X and Y for examining RNA secondary structure prediction, although some researchers prefer to modify specificity with an additional parameter [8]. X and Y are defined as follows:

X = TP / (TP + FN),  Y = TP / (TP + FP)    (6)

where TP is the number of "true positives" (correctly predicted base pairs), FN is the number of "false negatives" (base pairs in the real structure that are not predicted) and FP is the number of "false positives" (incorrectly predicted base pairs).

The Matthews correlation coefficient (MCC) [8] is often used to combine specificity and sensitivity when ranking algorithms. It is defined as follows:

MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (7)

where TN is the number of "true negatives" (base pairs that are neither in the real structure nor predicted). Usually TN is much larger than TP, FP and FN, so Eq. (7) can be approximated by the geometric mean of X and Y [8,30]:

MCC ≈ TP · TN / √((TP + FP)(TP + FN) · TN · TN) = TP / √((TP + FP)(TP + FN)) = √(XY)    (8)
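Eqs. (6)–(8) in code (a hedged sketch; names are ours):

```python
from math import sqrt

def scores(tp, fn, fp, tn):
    """Sensitivity X and specificity Y (Eq. 6), the exact MCC (Eq. 7),
    and its geometric-mean approximation sqrt(X * Y) (Eq. 8)."""
    x = tp / (tp + fn)
    y = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return x, y, mcc, sqrt(x * y)
```

For large TN the exact MCC and the √(XY) approximation agree closely, as Eq. (8) assumes.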

4.2. Comparison with Takefuji's Hopfield network

Our first experiment checks the improvement over Takefuji's Hopfield network. We modify Takefuji's network mainly in two ways, which can be viewed as a preprocessing and a postprocessing step. The preprocessing is the initialization phase introduced in Sections 3.2.1 and 3.2.2; the postprocessing consists of the techniques introduced in Section 3.2.3. The experiment shows the improvement from both.

Here we choose tRNAs from the Anaplasma marginale St. Maries genome for the experiments. All the sequences and secondary structures come from the Genomic tRNA Database [33,34]. tRNA molecules are structurally conserved and take on a cloverleaf shape. Common tRNAs contain four stems, called the acceptor-arm, TψC-loop, anticodon-loop and D-loop. The length of a tRNA is between 71 and 78 nt, while a few are longer (more than 80 nt); the longer ones usually contain five stems.

4.2.1. Performance of Takefuji's Hopfield network

The Anaplasma marginale St. Maries genome contains 37 different tRNAs, on which we check our improvement over Takefuji's Hopfield network. Takefuji initializes the neurons with one of two strategies: equal small negative numbers, or random values. We test Takefuji's prediction quality with the two initialization strategies in turn.


[Fig. 3 flow chart. Initialization phase: take the predicted sequence and find all the possible stems; if the reference structure is unknown, compute all the dis(i, j) with Eq. (3); if it is known, align the sequences and compute all the dis(i, j) with Eq. (4); then compute D(j) for every possible stem and initialize all the possible base pairs. Training phase: compute every V_i from U(i), compute every dU(i) with Eq. (1), and update every U(i); repeat until every dU(i) = 0, then output all the selected base pairs.]

Fig. 3. The flow chart for the improved Hopfield network.

Table 1
The average prediction quality and training times of the fixed-initialization Hopfield network.

Initialization value | A      | B      | Training times | TP | FN | FP | X     | Y     | MCC
−0.1                 | 0.0001 | 0.0005 | 1520           | 3  | 18 | 17 | 0.152 | 0.152 | 0.152
−0.1                 | 0.001  | 0.005  | 207            | 3  | 18 | 17 | 0.148 | 0.146 | 0.144
−0.1                 | 0.01   | 0.05   | 107            | 3  | 18 | 16 | 0.153 | 0.162 | 0.155
−0.2                 | 0.01   | 0.05   | 121            | 3  | 18 | 16 | 0.166 | 0.159 | 0.156
−1                   | 0.01   | 0.05   | 206            | 3  | 18 | 17 | 0.144 | 0.147 | 0.145
−2                   | 0.01   | 0.01   | 361            | 3  | 18 | 16 | 0.157 | 0.166 | 0.159

When the Hopfield network is initialized with small negative numbers, there are three independent parameters in the network, so we run the prediction six times with different parameter settings. In the experiments we find that the parameter setting seriously influences the outputs of Takefuji's Hopfield network: for the same sequence, if the initialization values of the neurons and A, B in Eq. (1) are changed, the predicted secondary structures differ. The parameters are set six times in the order of Table 1, and the six different outputs are shown in Fig. 4, where the secondary structures are drawn with PseudoViewer [35,36].

Table 1 shows the average training times and prediction quality over the 37 tRNAs. Although different parameter settings change the outputs for the same sequence, the average prediction qualities are similar; the training times, however, are strongly influenced by the parameters. In this experiment, TP, FN, FP, X, Y and MCC are each averaged over the 37 tRNAs, so the values in Table 1 do not exactly satisfy Eqs. (6) and (7), and the same average TP, FP and FN may correspond to different X, Y and MCC.

When the neuron values are initialized randomly, the Hopfield network is unstable even with fixed parameters. For the same sequence as in Fig. 4, we set A and B to 0.01 and 0.05, respectively; each run produces a different output, as shown in Fig. 5. The MCCs of the three outputs are only 0, 0.06 and 0.05, respectively.

4.2.2. Performance of our Hopfield network

The experiments above show that Takefuji's Hopfield network cannot predict tRNA secondary structure accurately. However, when the Hopfield network is modified as introduced above, the performance improves greatly. For our method demonstrated in Fig. 3, we test the tRNAs from the Anaplasma marginale St. Maries genome twice. The secondary structure of the first tRNA in this class is taken as the reference structure of tRNA. The first time, we use Eq. (3) to predict the remaining 36 tRNAs without the reference sequence; the second time, the reference sequence and Eq. (4) are used for predicting.


Fig. 4. The different outputs of Takefuji's Hopfield network with different parameter settings.

Fig. 5. Three different outputs of the randomly initialized Hopfield network.

4.2.3. Prediction with the AFM approximate shape

When only the AFM approximate shape is provided and Eq. (3) is used, there are three independent parameters: λ (Eq. (3)), A and B (Eq. (1)). Here we predict the group of tRNAs three times with different parameter settings. The experiments show that our Hopfield network is much more stable than Takefuji's. In Takefuji's network, almost every sequence is predicted to a different structure when the parameter values change; in ours, of the 36 sequences folded, only four are. Half of these four unstable cases arise because the sequences are longer than 80 nt and contain five stems, so their structural conservation with the reference structure is destroyed, which causes the instability. Such long sequences cannot be predicted completely correctly, but our Hopfield network still recovers nearly half of the real structure. Fig. 6 shows the distribution of our prediction experiments, and the average prediction indices are shown in Table 2. In Table 2 we choose the best outputs from Takefuji's prediction, yet all its indices are far below ours. Although different parameter settings hardly influence the outputs for the same sequence, the training times change considerably. As a three-dimensional display is difficult, we fix λ to −0.1 and change A and B at the same rate. Three groups of experiments on the same 36 tRNAs have been done; the average training times with different A and B are shown in Table 3.

[Fig. 6 (pie chart): of the 36 tRNAs, 17 are predicted completely right, 9 are long sequences, and 10 are others.]

Fig. 6. The distribution of the prediction with our Hopfield network.

4.2.4. Prediction with class information

The experiments above show that the outputs are hardly influenced by the parameter setting, so here we fix λ, A and B to 5, 0.1 and 0.2, respectively, and predict the 36 tRNAs only once. The average prediction indices are shown in Table 2: the performance with a reference sequence is slightly better than the performance with only the AFM approximate shape. Comparing all 36 outputs with their counterparts predicted without the reference sequence, more than half are identical, and most of the differing outputs are predicted better than their counterparts. The comparison is shown in Fig. 7. Among the ten sequences folded better with the reference sequence, most are much longer than the reference sequence. The average length of these 10 sequences is 81.5 nt while


Table 2
Contrast of prediction quality of Takefuji's and our Hopfield network.

Method                      | TP | FN | FP | X    | Y    | MCC
Takefuji's Hopfield network | 3  | 18 | 16 | 0.16 | 0.17 | 0.16
Without reference sequence  | 18 | 4  | 3  | 0.85 | 0.85 | 0.85
With reference sequence     | 19 | 3  | 2  | 0.88 | 0.91 | 0.90

Table 5
Quality of predictions of RNase P RNA sequences with four different kinds of software.

Sequence          | Software     | TP  | FN  | FP  | X    | Y    | MCC
SM-A46(74)        | Our method   | 70  | 9   | 9   | 0.89 | 0.89 | 0.89
                  | Pfold        | 38  | 41  | 14  | 0.48 | 0.73 | 0.59
                  | MARNA        | 52  | 27  | 47  | 0.66 | 0.55 | 0.60
                  | RNAStructure | 52  | 27  | 43  | 0.66 | 0.55 | 0.60
Lake griffy B#41  | Our method   | 69  | 16  | 6   | 0.82 | 0.92 | 0.87
                  | Pfold        | 49  | 36  | 9   | 0.58 | 0.84 | 0.70
                  | MARNA        | 36  | 49  | 61  | 0.42 | 0.37 | 0.40
                  | RNAStructure | 36  | 49  | 70  | 0.42 | 0.34 | 0.38
Volunteer ESH212C | Our method   | 62  | 34  | 17  | 0.65 | 0.78 | 0.71
                  | Pfold        | 52  | 44  | 15  | 0.54 | 0.78 | 0.65
                  | MARNA        | 33  | 63  | 76  | 0.34 | 0.30 | 0.32
                  | RNAStructure | 60  | 36  | 46  | 0.63 | 0.57 | 0.59
Pond scum#26      | Our method   | 53  | 44  | 28  | 0.55 | 0.65 | 0.60
                  | Pfold        | 54  | 43  | 25  | 0.56 | 0.68 | 0.62
                  | MARNA        | 59  | 38  | 43  | 0.61 | 0.58 | 0.59
                  | RNAStructure | 53  | 44  | 55  | 0.55 | 0.49 | 0.46
Total             | Our method   | 254 | 106 | 60  | 0.71 | 0.84 | 0.77
                  | Pfold        | 193 | 167 | 63  | 0.54 | 0.75 | 0.64
                  | MARNA        | 180 | 177 | 227 | 0.50 | 0.44 | 0.47
                  | RNAStructure | 201 | 156 | 214 | 0.56 | 0.48 | 0.52

Table 3
Average training times of our Hopfield network on different parameter settings.

A      | B      | Average training times
0.01   | 0.05   | 22
0.001  | 0.005  | 60
0.0001 | 0.0005 | 578

[Fig. 7 (chart): compared with the predictions without the reference sequence, 24 outputs are the same, 10 are better and 2 are worse.]

Fig. 7. Prediction with reference sequence compared with prediction without reference sequence.

Table 4
Quality of predictions of tRNA sequences with five different kinds of software.

Software     | Completely right | TP | FN | FP | X    | Y    | MCC
Our method   | 17               | 19 | 3  | 2  | 0.88 | 0.91 | 0.90
Pfold        | 0                | 20 | 2  | 2  | 0.92 | 0.90 | 0.91
CARNAC       | 0                | 7  | 14 | 0  | 0.32 | 1    | 0.57
MARNA        | 1                | 14 | 8  | 10 | 0.64 | 0.59 | 0.61
RNAStructure | 8                | 19 | 3  | 3  | 0.87 | 0.86 | 0.87

the reference sequence is 73 nt. This means that Eq. (4) is superior to Eq. (3), especially when the predicted sequence is somewhat longer or shorter than the reference sequence.

Here we have shown that our improved Hopfield network performs much better than Takefuji's. However, Takefuji's method is more than ten years old; how does recent software perform? Next we compare our method with four other widely used programs for predicting RNA secondary structure.

4.3. Comparison with other software

The second experiment demonstrates the superiority of our Hopfield network over other software for predicting RNA secondary structure: Pfold [37–39], CARNAC [40–42], MARNA [43,44] and RNAStructure [45,46]. First we test each program with the tRNAs from the Anaplasma marginale St. Maries genome used above. Table 4 shows the comparison of the five programs. The greatest advantage of our Hopfield network is that its rate of completely right predictions is the highest. This is mainly because we provide a reference structure in our

Hopfield network, and most of the tRNAs in the Anaplasma marginale St Maries genome are structurally conserved. Then we turn to more complex molecules for testing. RNase P RNA is a complex ncRNA molecule whose size is more than 300 nt. There are usually one or two pseudoknots in RNase P. They are conserved in structure, but less conserved than tRNA. In our experiments, the reference secondary structures and sequences come from the Department of Microbiology at North Carolina State University [47]. Here we use four molecules, all belonging to Alpha Purple Bacteria RNase P RNA. Each is more than 300 nt long, and its structure cannot be described by any simple polygon. Each of the four molecules contains a pseudoknot. The predictions of our method are reported in detail in Table 5. We use four software programs instead of the five used for tRNA, because CARNAC can only deal with sequences no longer than 80 nt. Pfold uses a stochastic context-free grammar (SCFG) to produce a prior probability distribution of RNA structures. Given an alignment and a phylogenetic tree relating the sequences, posterior probabilities of the structures can be calculated using the inside–outside algorithm. The posterior probability is based on individual probabilities for alignment columns, or pairs of columns in the case of a base pair. Pfold runs fast and gives accurate outputs, but it cannot find pseudoknots. From Table 4, we can see that Pfold performs as well as our Hopfield network. However, it strongly depends on the consensus structure found by the SCFG, so when the consensus structure contains errors, the prediction of every sequence is affected. In the tRNA experiment, there is a redundant base pair predicted by the SCFG in the consensus structure, so none of the 37 sequences is folded completely correctly. CARNAC derives from the famous Sankoff algorithm, which folds and aligns sequences simultaneously. It is a dynamic programming approach that obtains a common base-pair list with the maximal sum of base-pair weights. Basically, this is a merger of sequence alignment and Nussinov's dynamic programming method [48]. The limitation of CARNAC is that it cannot deal with long sequences because of the time complexity of the Sankoff algorithm. In our


experiment, CARNAC only finds one stem in every molecule, so TP and X are low. However, it hardly outputs any wrong base pairs, so FP and Y are better. MARNA is a program that performs alignment after folding. It may exclude the sequence alignment step, predict secondary structures for each sequence (or sub-group of sequences) separately, and directly align the structures. Like CARNAC, it cannot handle longer sequences: the maximal sequence length should not exceed 500 nt in the online prediction system, and the number of sequences depends on the sequence lengths, with the total sum of all sequence lengths restricted to 10 000 nt. MARNA prefers free energy stability over structural conservation. Hence, it always gives the secondary structure with more base pairs, so the indices FP and Y are often worse than those of other software. When folding a single sequence, RNAStructure uses dynamic programming to search for the minimum free energy structure. However, it is time consuming, and sometimes the real structure is not the one with the least free energy. In our experiment, the tRNA sequences are short and the real structures are usually the minimum free energy ones. Hence, RNAStructure performs well here and runs as well as our method.
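Nussinov's method, the folding component mentioned above, maximizes the number of base pairs by dynamic programming. A self-contained sketch of that idea (our simplification, not CARNAC's actual implementation; `min_loop` is our parameter enforcing a minimal hairpin-loop size):

```python
def nussinov(seq, min_loop=3):
    """Maximum base-pair count by Nussinov-style dynamic programming.

    dp[i][j] holds the best pair count for subsequence seq[i..j];
    each position j is either unpaired or paired with some k < j.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                    # j left unpaired
            for k in range(i, j - min_loop):       # j paired with k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1] if n else 0

print(nussinov("GGGAAAUCC"))  # → 3 (stem of three pairs around an AAA loop)
```

The Sankoff algorithm effectively runs this recursion jointly over two sequences and an alignment, which is what makes it too expensive for long molecules.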


5. Conclusion

In this paper, we improve the Hopfield network for RNA secondary structure prediction. The main idea comes from the structural conservation of homologous molecules. Preprocessing is added to make the Hopfield network more stable and to combine the initialization with structure information. Postprocessing is used to improve the outputs. The Hopfield network deals with base pairs instead of stems, so it sometimes outputs redundant short stems. According to the characteristics of the class of the RNA molecules, we can limit the minimal length of stems. These techniques make the outputs more accurate. Eqs. (3) and (4) are the two cores of our approach; they compute the similarity of stems from different homologous molecules automatically instead of by manual measurement. Experiments have proved the superiority of our method over Takefuji's Hopfield network and four other main software programs, especially when dealing with complex molecules such as RNase P RNA and 5S rRNA. Although it requires more information (AFM around shape or class information), it adapts to the development of this new technique, which was ignored by other software programs. All the programs and data can be downloaded at Tuo's homepage (http://www.d.umn.edu/∼zhaox240/hopfield/).
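The underlying model treats stem selection as a maximum independent set problem on a conflict graph of candidate stems, approximately solved by a discrete Hopfield-style network whose initial state can encode the class information. A toy sketch under our own naming (the `init` argument stands in for the class-based initialization; this is an illustration, not the authors' code):

```python
import random

def hopfield_mis(conflicts, n, init=None, steps=200, seed=0):
    """Discrete Hopfield-style search for a (near-)maximum
    independent set of stems.

    conflicts[i] lists the stems incompatible with stem i;
    init is an optional 0/1 start state (class-based guess).
    """
    rng = random.Random(seed)
    state = list(init) if init else [rng.randint(0, 1) for _ in range(n)]
    for _ in range(steps):
        i = rng.randrange(n)
        # a stem stays on only if no conflicting stem is active
        state[i] = 0 if any(state[j] for j in conflicts[i]) else 1
    return state

# 3 candidate stems: 0 and 1 overlap; stem 2 conflicts with nothing
conflicts = {0: [1], 1: [0], 2: []}
s = hopfield_mis(conflicts, 3, init=[1, 0, 0])
print(s)  # converges to the independent set {0, 2}, i.e. [1, 0, 1]
```

Starting from a class-informed state such as `[1, 0, 0]` biases the network toward the conserved stems, which is the intuition behind the improved initialization.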


Acknowledgments

Thanks to Dr. Xiujie Wang for her helpful discussions and suggestions. We also thank the anonymous reviewers for their valuable reviews. The work was supported by the Natural Science Foundation of China under Grant nos. 60671011, 60741001 and 60871092, the Science Fund for Distinguished Young Scholars of Heilongjiang Province in China under Grant no. JC200611, the Natural Science Foundation of Heilongjiang Province in China under Grant no. ZJG0705, and the Foundation of Harbin Institute of Technology under Grant no. HIT.2003.53.


References

[1] C. Alkan, E. Karakoc, S.C. Sahinalp, P. Unrau, H.A. Ebhardt, K. Zhang, J. Buhler, RNA secondary structure prediction via energy density minimization, in: ACM RECOMB, 2006.
[2] C. Xue, F. Li, T. He, G.-P. Liu, Y. Li, X. Zhang, Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine, BMC Bioinformatics 6 (2005) 310.
[3] P. Jiang, H. Wu, W. Wang, W. Ma, X. Sun, Z. Lu, MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features, Nucleic Acids Research 35 (Web Server issue) (2007) W339–W344.
[4] J. Hertel, I.L. Hofacker, P.F. Stadler, SnoReport: computational identification of snoRNAs with unknown targets, Bioinformatics 24 (2) (2008) 158–164.
[5] A. Engel, D.J. Müller, Observing single biomolecules at work with the atomic force microscope, Nature Structural Biology 7 (9) (2000) 715–718.
[6] R. Rounsevell, J.R. Forman, J. Clarke, Atomic force microscopy: mechanical unfolding of proteins, Methods 34 (1) (2004) 100–111.
[7] J. Zlatanova, S.M. Lindsay, S.H. Leuba, Single molecule force spectroscopy in biology using the atomic force microscope, Progress in Biophysics and Molecular Biology 74 (2000) 37–61.
[8] P.P. Gardner, R. Giegerich, A comprehensive comparison of comparative RNA structure prediction approaches, BMC Bioinformatics (2004) 1–32.
[9] C. Flamm, RNA in silico: the computational biology of RNA secondary structures, Advances in Complex Systems 1 (1999) 65–90.
[10] M. Zuker, P. Stiegler, Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information, Nucleic Acids Research 9 (1) (1981) 133–148.
[11] Y. Zhan, M. Guo, A permutation-based genetic algorithm for predicting RNA secondary structure—a practicable approach, in: The Second International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Lecture Notes in Artificial Intelligence (LNAI), vol. 3614, 2005, pp. 861–864.
[12] J.-H. Chen, S.-Y. Le, J.V. Maizel, Prediction of common secondary structures of RNAs: a genetic algorithm approach, Nucleic Acids Research 28 (4) (2000) 991–999.
[13] K.C. Wiese, E. Glen, A permutation-based genetic algorithm for the RNA folding problem: a critical look at selection strategies, crossover operators and representation issues, BioSystems 72 (2003) 29–41.
[14] K.C. Wiese, A. Hendriks, Comparison of P-RnaPredict and mfold—algorithms for RNA secondary structure prediction, Bioinformatics 22 (8) (2006) 934–942.
[15] M. Zuker, On finding all suboptimal foldings of an RNA molecule, Science 244 (1989) 48–52.
[16] R.B. Lyngsø, M. Zuker, C.N.S. Pedersen, Fast evaluation of internal loops in RNA secondary structure prediction, Bioinformatics 15 (6) (1999) 440–445.
[17] Y.-J. Hu, Prediction of consensus structural motifs in a family of coregulated RNA sequences, Nucleic Acids Research 30 (17) (2002) 3886–3893.
[18] B.A. Shapiro, J.C. Wu, D. Bengali, M.J. Potts, The massively parallel genetic algorithm for RNA folding: MIMD implementation and population variation, Bioinformatics 17 (2) (2001) 137–148.
[19] K.C. Wiese, A.A. Deschenes, A.G. Hendriks, RnaPredict—an evolutionary algorithm for RNA secondary structure prediction, IEEE/ACM Transactions on Computational Biology and Bioinformatics 5 (1) (2008) 25–41.
[20] Y. Takefuji, L. Chen, K. Lee, J. Huffman, Parallel algorithms for finding a near-maximum independent set of a circle graph, IEEE Transactions on Neural Networks 1 (1990) 263–267.
[21] Y. Takefuji, K.C. Lee, Authors' reply to comment on parallel algorithms for finding a near-maximum independent set of a circle graph, IEEE Transactions on Neural Networks 2 (1991) 329.
[22] L. Qi, Z. Yin, Y. Xiuzi, Y. Rongdong, A discrete Hopfield neural network based MIS finding algorithm for stems selecting and its application in RNA secondary structure prediction, Chinese Journal of Computers 31 (1) (2008) 51–58.
[23] M.A. Cohen, S. Grossberg, Absolute stability of global pattern formation and parallel memory storage by competitive neural networks, IEEE Transactions on Systems, Man and Cybernetics 13 (1983) 815–821.
[24] S. Griffiths-Jones, A. Bateman, M. Marshall, A. Khanna, S.R. Eddy, Rfam: an RNA family database, Nucleic Acids Research 31 (2003) 439–441.
[25] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S.R. Eddy, A. Bateman, Rfam: annotating non-coding RNAs in complete genomes, Nucleic Acids Research 33 (2005) D121–D124.
[26] C. Liu, B. Bai, G. Skogerbø, L. Cai, W. Deng, Y. Zhang, D. Bu, Y. Zhao, R. Chen, NONCODE: an integrated knowledge database of non-coding RNAs, Nucleic Acids Research 33 (Database issue) (2005) D112–D115.
[27] C. Hammann, RNA for everyone! The RNA World Website at IMB Jena, ChemBioChem 4 (2003) 1103.
[28] D.H. Mathews, J. Sabina, M. Zuker, D.H. Turner, Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure, Journal of Molecular Biology 288 (1999) 911–940.
[29] P. Baldi, S. Brunak, Y. Chauvin, C. Andersen, H. Nielsen, Assessing the accuracy of prediction algorithms for classification: an overview, Bioinformatics 16 (2000) 412–424.
[30] J. Gorodkin, S.L. Stricklin, G.D. Stormo, Discovering common stem-loop motifs in unaligned RNA sequences, Nucleic Acids Research 29 (10) (2001) 2135–2144.
[31] R. Dowell, S. Eddy, Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction, BMC Bioinformatics 5 (2004) 71.
[32] D. Mathews, Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization, RNA 10 (8) (2004) 1178–1190.
[33] T.M. Lowe, S.R. Eddy, tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence, Nucleic Acids Research 25 (5) (1997) 955–964.
[34] http://www.daimi.au.dk/∼compbio/pfold/.
[35] K. Han, Y. Lee, W. Kim, PseudoViewer: automatic visualization of RNA pseudoknots, Bioinformatics 18 (2002) S321–S328.
[36] Y. Lee, K. Han, PseudoViewer: web application and web service for visualizing RNA pseudoknots and secondary structures, Nucleic Acids Research 34 (2006) W416–W422.


[37] B. Knudsen, J. Hein, Pfold: RNA secondary structure prediction using stochastic context-free grammars, Nucleic Acids Research 31 (13) (2003) 3423–3424.
[38] B. Knudsen, J.J. Hein, Using stochastic context free grammars and molecular evolution to predict RNA secondary structure, Bioinformatics 15 (6) (1999) 446–454.
[39] http://www.daimi.au.dk/∼compbio/pfold/.
[40] H. Touzet, O. Perriquet, CARNAC: folding families of non coding RNAs, Nucleic Acids Research 32 (Web Server issue) (2004) W142–W145.
[41] O. Perriquet, H. Touzet, M. Dauchet, Finding the common structure shared by two homologous RNAs, Bioinformatics 19 (1) (2003) 108–116.
[42] http://bioinfo.lifl.fr/carnac/carnac.php.
[43] S. Siebert, R. Backofen, MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons, Bioinformatics 21 (16) (2005) 3352–3359.
[44] http://biwww2.informatik.uni-freiburg.de/Software/MARNA/index.html.
[45] D.H. Mathews, M.D. Disney, J.L. Childs, S.J. Schroeder, M. Zuker, D.H. Turner, Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure, Proceedings of the National Academy of Sciences USA 101 (2004) 7287–7292.
[46] http://rna.urmc.rochester.edu.
[47] J.W. Brown, The ribonuclease P database, Nucleic Acids Research 27 (1) (1999) 314.

[48] I.L. Hofacker, S. Bernhart, P. Stadler, Alignment of RNA base pairing probability matrices, Bioinformatics 20 (2004) 2222–2227.

Quan Zou, born in 1982, is a Ph.D. candidate. He received the bachelor's degree in Computer Science from Harbin Institute of Technology in 2004. His research interests include RNA secondary structure prediction and microRNA gene finding.

Tuo Zhao, born in 1983, received the master's degree in Computer Science from Harbin Institute of Technology in 2007. He is now pursuing a second master's degree in Applied Mathematics at the University of Minnesota Duluth. His research interests include machine learning, computer vision and bioinformatics.

Yang Liu, born in 1976, Ph.D., is a Lecturer at Harbin Institute of Technology. His research interests include machine learning and image processing.

Maozu Guo, born in 1966, Ph.D., is a Professor at Harbin Institute of Technology. His research interests include computational biology and machine learning.