Knowledge-Based Prediction of DNA Atomic Structure from Nucleic

Report 0 Downloads 38 Views
12

Genome Informatics 16(2): 12–21 (2005)

Knowledge-Based Prediction of DNA Atomic Structure from Nucleic Sequence Marcos J. Ara´ uzo-Bravo

Akinori Sarai

[email protected]

[email protected]

Department of Biosciences and Bioinformatics, Kyushu Institute of Technology, 1-1 Kawazu, Iizuka, Fukuoka 820-8502, Japan

Abstract A simple knowledge-based method for DNA atomic structure prediction from nucleic sequence is presented. We used free B-DNA crystal structures to estimate the distribution of trinucleotide base pairs and tetranucleotide base-pair steps conformational coordinates. We used these distributions as a basis to predict the 3D position of the non-hydrogen atoms of the nucleic bases of any arbitrary DNA sequence of any length. The only constraint imposed was that the structure is a B-DNA one with Watson-Crick complementary base pairs. The method was tested on not seen DNA structures with sequence lengths varying from 6bp to 12bp. The obtained predictions have RMSE around 0.5 ˚ A for the translational conformational coordinates, and around 5◦ for the rotational. For the estimation of the nucleic base non-hydrogen atom coordinates the RMSE is around 1.1 ˚ A. The knowledge-based method outperformed a technique based on genetic algorithms in the prediction of B-DNA structures.

Keywords: DNA atomic structure prediction, conformational coordinates

1

Introduction

Since Anfinsen established in 1957 the principle that the sequence defines the macromolecular structure [1], efforts have been made to discover such a relation in proteins [2, 16, 20] and DNA structures [10, 11, 18, 19]. In the case of the protein structure prediction, the “protein-folding problem” makes the estimation of an exact relationship between sequence and conformation very difficult. But the DNA 3D structure is simpler than the protein structure, and the number of the different DNA building blocks is smaller than the protein ones. Therefore, we can expect that the DNA 3D structure prediction is easier than the protein one. Here, we present a method to predict the 3D position of all nucleic base non-hydrogen atoms of B-DNA structures. Though we tested the technique over oligonucleotide structures, the simplicity of the method allows us to simulate structures of any length that can be used as a comparison basis for very long DNA length studies. Instead to predict directly the atom positions, we reduced the dimensionality of the problem, considering each of the nucleic bases in the DNA structure as a rigid body. This is a common practice in DNA structural analysis [9, 17], where the DNA molecule is often approximated as an elastic object with several degrees of freedom between adjacent rigid bases. From here on we define a base pair as the pair of bases, where one of the bases belongs to one DNA strand, and the other base belongs to the complementary strand. We define the base-pair step as a group of two consecutive base pairs in a double stranded helical DNA structure. The local conformation of the DNA is identified at each location of a base pair (from complementary strands) in terms of known deformations, such as base-pair translations: Shear (Sx ), Stretch (Sy ), Stagger (Sz ), and basepair rotations: Buckle (κ), Propeller (π) and Opening (σ). Whereas the base-pair steps are interpreted in terms of translations: Shift (Dx ), Slide (Dy ), Rise (Dx ), and base-pair step rotations Tilt (τ ), Roll

DNA Structure Prediction from Sequence Base-pair parameters

13 Base-pair step parameters

Shear (Sx )

Buckle (κ)

Shift (Dx )

Tilt (τ )

Stretch (Sy )

Propeller (π)

Slide (Dy )

Roll (ρ)

Stagger (Sz )

Opening (σ)

Rise (Dz )

Twist (ω)

Figure 1: The twelve rigid body conformational parameters describing the geometry of sequential DNA base pairs and base-pair steps [9].

(ρ) and Twist (ω), describing the stacking geometry of a dinucleotide step from a local perspective [9, 14, 17], see Figure 1. This approach allows reducing the number of estimation parameters, e.g. the 123 3D atom coordinates associated to the 41 non-hydrogen atoms of the Adenine-Guanine base-pair step can be approximated by 18 independent conformational coordinates, two groups (AT and GC) of six base pair coordinates and a group (AG/CT) of six base-pair step coordinates. But the base pair and base-pair step conformational parameters are context-dependent. Therefore, their values depend on the conformational state of the neighbor elements in the structure. Actually, the definition of the base-pair step coordinates is already a context-dependent definition. It is believed that the minimum motif that discriminates among the double-helical forms of B-DNA and A-DNA is the trinucleotide base pair [5]. In the case of the base-pair steps conformational coordinates, there exist context effects at least at the level of the tetranucleotide base-pair steps [3, 7, 19]. Once the six conformational parameters are known for the 64 unique trinucleotide base pairs and the 136 unique tetranucleotide base-pair steps, it is possible to recover the 3D coordinates of all the non-hydrogen atoms of all the DNA Watson-Crick complementary base pairs at a context level that considers the influence of the first proximal neighbors. Here, we adopt a knowledge-based method using the analysis of known free DNA structures to derive the statistical distribution (averages and covariance matrices) of the DNA conformational states. The method is based on extracting the statistical distribution of the base pair coordinates and base-pair step coordinates from free B-DNA crystal structures.

14

2 2.1

Ara´ uzo-Bravo and Sarai

Method Data Selection

The annotated list of experimental free B-DNA crystal structures was taken (March 2005) from the Nucleic Acid Database (NDB) http://ndbserver.rutgers.edu/ [6]. For each file in the list, we took the first biological unit structure from the Biological Units repository of Protein Data Bank (PDB) (http://www.rcsb.org/pdb/) [8]. To build a test dataset different from the training dataset, we took into account that there were not enough data in the whole dataset to represent all the 136 non-symmetric tetranucleotides. Therefore, we tried to preserve as many different tetranucleotide base-pair steps as possible in the training set. From the whole dataset we grouped all the first strand DNA sequences with identical sequence in the same cluster. Finally, we built the test data by picking one structure from each cluster with more than one member. We rejected from the test dataset the structures with overhanging bases, or with bases different from A, C, G or T. But we preserved the structures with base mismatches, to analyze the possible structural effect of the mismatch in comparison with the standard structure without mismatches, generated by our technique. The remaining structures that have not been used for the construction of the test dataset built the training dataset.

2.2

Acquisition of Knowledge from the Conformational Coordinates

The 3DNA software [14] was used to obtain the conformational coordinates of each mononucleotide base pair and each dinucleotide base-pair step, The data was prefiltered by applying the following criteria: Eliminate the non-B-DNA base-pair steps, these were discriminated following [14] (using the projection of the phosphorus atom onto the z-axis of the dimer middle frame [15]); delete anomalous configuration with outlier values in their conformational coordinates, to eliminate kinks and other defective structural states; eliminate the non-standard bases; delete the sequence symmetries: repeated and palindromic sequences to reduce possible biases towards the more abundant sequences, preserving only a half of the symmetric structure; eliminate the 5’ and 3’ DNA terminals to reduce the boundary effects; delete the base-pair steps that are not in blocks of 4 consecutive steps, to use in the subsequent calculations the local DNA regions without intercalated distortions. To consider the context-dependence effect, we calculated the conformational coordinates using a sliding window in a neighborhood of length one, that for base pairs involves local regions that extend to odd ranges R, and for base-pair steps extend to even ranges. The number #R−nucleotide of different types of R−nucleotides belonging to a range R is #R−nucleotide=22·R . Thus, for the base pair case, instead of a range R=1 of four different mononucleotides, corresponding to non-neighbor bases, we used a range R=3 of 64 different trinucleotides. Thus, the data were classified into 64 trinucleotide categories. Additionally we calculated mononucleotide base-pair coordinates to predict the structure of the terminals, using conformational states of the 5’ and 3’ DNA terminals. In the base-pair step case, for a range R of analysis (R=2 for dinucleotides, R=4 for tetranucleotides) initially 22·R groups of six-dimensional conformational parameters are necessary. However, considering the symmetries, the number of different R-nucleotides (dinucleotides or tetranucleotides) is equal to the number of different elements (one half plus the diagonal) of a symmetric matrix of dimension 2R × 2R , #R-nucleotides = 22·R−1 + 2R−1

(1)

Using this simplification, special care should be taken in the case of the Shift and the Tilt conformational coordinates when dealing with symmetries, since the conformational coordinates are calculated using one of the DNA strands [14]. The Shift and Tilt coordinates of the other DNA strand are inverted for the symmetric steps. In the statistical calculation of the conformational base-pair steps conformational parameters, when a symmetric base-pair step appears, the signs of the Shift and the

DNA Structure Prediction from Sequence

15

Tilt coordinates are reverted. When reconstructing the 3D atomic structure, we have to take into account that the conformational coordinate statistics were calculated in the direction of one of the DNA strands. Then, for the case of a symmetric base-pair step, the recovery of the Shift and the Tilt coordinates should follow the same convention used in the statistical calculations. With the symmetric reduction, for the theoretical 16 possible dinucleotides, applying Eq. (1) with (R = 2) results in 10 unique dinucleotide classes. For the theoretical 256 possible tetranucleotides, the application of Eq. (1) with (R = 4) provides 136 unique tetranucleotide classes (we calculated dinucleotide base-pair step coordinates to predict the structure of steps for which there were not tetranucleotide members in the final training dataset). The step dataset was classified into 10 dinucleotide and 136 tetranucleotide categories. In this classification, the Shift and the Tilt coordinates of the symmetric base-pair steps were considered with opposite signs. The distribution of each class of R−nucleotides was culled to reduce the influence of outlier conformational states. An iterative procedure was implemented, so that for each DNA sequence of length L the six conformational coordinates θi of each base-pair step s = 2, . . . , L − 1 of base-pair type p(s) p(s) p(s) = 1, . . . , #R-nucleotide we calculated the average θi and the standard deviation σi of the associated R−nucleotide. Then, the conformational instantaneous fluctuation from its equilibrium p(s) p(s) p(s) as ∆θi = θi − θ i was calculated, and applying a culling method based on filtering the step p(s) p(s) s with high fluctuation: if |∆θi | ≤ 3σi , the base-pair step s was rejected. With the remaining p non-rejected base-pair steps, the statistics averages θi (k + 1) and the standard deviations σip (k + 1) were recalculated. The rejection and statistics calculation processes k = k + 1 were repeated till no more rejection events happened. A similar culling procedure was applied to the six conformational base-pair coordinates. This filtering techniques are to provide an automatic method that without a human curator can produce a feasible learning dataset of nucleic conformational states. That dataset can be periodically updated with the growth of the structural nucleic public databases. To alleviate the boundary effects in the terminals prediction, two types of specific databases with the conformational coordinates of the of 5’ and 3’ tails were built, using a monocleotide range for the case of the base pair, and a dinucleotide range for the case of the base-pair steps terminal conformational coordinates.

2.3

Prediction of the Conformational Coordinates

Once the conformational coordinates databases for the central and 5’ and 3’ tails are built, the system is ready to perform predictions of conformational coordinates using a simple algorithm. For each DNA input sequence, we used a sliding window of one neighbor around each mononucleotide base pair and around each dinucleotide base-pair step. Thus, the DNA sequence was split into trinucleotide base pairs and tetranucleotide base-pair steps. We assigned to the respective conformational coordinates the values for each trinucleotide and tetranucleotide obtained in the training dataset processing. For the tetranucleotide group, for the steps for which reliable examples in the BDNA crystal database were not available, we calculated the base-pair step conformational coordinates of the dinucleotide base-pair steps, placed in the tretranucleotide center. In the terminals, to reduce the boundary effects, for the two each terminal types (5’ and 3’) we used mononucleotide base pair and dinucleotide base-pair step conformational coordinates. Once a sequence of length L is used as an input, i.e. X1 , . . . , Xi , . . . , XL where Xi ∈ N = {A,C,G,T}, i = 1 . . . , L, the algorithm works as follows: • Processing the 5’ terminal. The first character X1 of the input sequence is considered the 5’ terminal nucleic base of the first strand. The base-pair six-dimensional conformational coordinates vector, corresponding to this base, is taken from the database of 5’ terminal base pairs. In the database of 5’ terminal base-pair steps, search for the first dinucleotide X1 X2 , and retrieve the corresponding base-pair six-dimensional conformational coordinate vector.

16

Ara´ uzo-Bravo and Sarai • Processing the central part of the sequence. For each non-terminal base, e.g. Xi 1 < i < n: build the trinucleotide neighbor Xi−1 Xi Xi+1 , and retrieve the corresponding base-pair six-dimensional conformational coordinate vector from the database of base-pairs. Build the tetranucleotide sequence Xi−1 Xi Xi+1 Xi+2 , look for its entry in the database of tetranucleotide base-pair steps, or the entry of its reverse complement X i+1 X i+1 X i X i−1 (where X i is the complementary base of Xi ). If neither of them exists in the tetranucleotide database, build the dinucleotide Xi Xi+1 sequence and look for its entry, or the entry of its reverse complement X i+1 X i in the database of dinucleotide base-pair steps, and retrieve the corresponding base-pair step six-dimensional conformational coordinate vector from the database of base-pair steps. If the entry in the databases is through a symmetric tetranuceotide or dinucleotide step, change the sign of the retrieved Shift and Tilt coordinates. • Processing the 3’ terminal. The last character XL of the input sequence is considered the 3’ terminal of the first strand. The base-pair six-dimensional conformational coordinate vector corresponding to this base is taken from the database of 3’ terminal base pairs. In the database of 3’ terminal base-pair steps, search for the last dinucleotide XL−1 XL , and retrieve the corresponding base-pair step six-dimensional conformational coordinate vector.

2.4

Three-Dimensional Coordinate Estimation

Once the base pair and base-pair step conformational coordinates are estimated, to produce the 3D atomic structure we use our in home program, that achieves the same results as the program Rebuild of Lu and Olson [14]. Thus we integrated the conformational coordinate prediction and the atomic structure generation in the same tool, allowing to provide a stand alone application. This procedure also allows us to test different backbone configurations. As structural building blocks we use the standard nucleic bases coordinates from Olson et al. [17].

2.5

Calculation of the Performance Criteria

Since our technique is based primarily on the estimation of the conformational coordinates we, evaluated the performance of the method for the conformational coordinate predictions. For the structures in the test subset, not used during the training phase, we calculated the conformational coordinates using the 3DNA software [14] and we estimated them with our method. We calculated the root mean square error (RMSE) for each of the twelve conformational parameters over those structures. To uncover the strongest local structural anomalies we calculated the maximum absolute error (MXAE) and the indices of the nucleic bases in which such MXAE occurs (IMXAE). For a comparison of the predicted and real atom positions, we calculated the RMSE between the cartesian coordinates of the non-hydrogen nucleic bases atom positions of the predicted structures and the real ones. Whereas the conformational coordinates are axes independent, the atom cartesian coordinates are axes dependent. The atom positions of the real structures are in relation to different coordinate axes from the fixed axes in respect to which our algorithm predicts the structures. Therefore, to calculate the Cartesian RMSE it is necessary first to perform a structural alignment between the predicted and the real structures. We implemented the structural alignment using a least-square procedure to fit the real structure with the predicted one. As in Babcock et al. [4] we implemented the closed-form solution of absolute orientation using unit quaternions, introduced by Horn [12]. The data management, filtering and culling, the prediction of the conformational coordinates, building of the 3D coordinates and structural alignment where programmed in home using Matlab 7.0.

DNA Structure Prediction from Sequence

17

Table 1: Errors for the clustered dataset for the base pair, base-pair step conformational parameters, and atom positions. The translations are given in Angstroms ˚ A, and the rotations in degrees ◦ . Code1 Sequence

Error2 Sx ˚ A

Sy ˚ A

Base pairs Sz κ ◦ ˚ A

423d ACCGACGTCGGT RMSE 0.1 0.2 0.3 bd0001 Form: A MXAE 0.3 0.3 0.8 Resolution: 1.60 ˚ A IMXAE [10] [2] [7] 5dnb CCAACGTTGG RMSE 0.1 0.1 0.2 bdj019 Form: A MXAE 0.4 0.2 0.5 Resolution: 1.40 ˚ A IMXAE [7] [6] [6] 183d CCAGCGCTGG RMSE 0.1 0.1 0.2 bdjb57 Form: A MXAE 0.2 0.1 0.4 Resolution: 1.60 ˚ A IMXAE [2] [6] [6] 2d25 CCAGGCCTGG RMSE 0.2 0.1 0.3 bdjb27 Form: A MXAE 0.3 0.2 0.5 Resolution: 1.75 ˚ A IMXAE [1] [6] [5] 1d8g CCAGTACTGG RMSE 0.3 0.1 0.1 bd0023 Form: A MXAE 0.7 0.1 0.2 Resolution: 0.74 ˚ A IMXAE [4] [4] [6] 1d56 CGATATATCG RMSE 0.2 0.1 0.3 bdj036 Form: A MXAE 0.2 0.2 0.6 Resolution: 1.70 ˚ A IMXAE [7] [10] [4] 1da3 CGATCGATCG RMSE 0.2 0.2 0.3 bdjb48 Form: A MXAE 0.4 0.4 0.5 Resolution: 2.00 ˚ A IMXAE [8] [6] [1] 1d99 CGCAAATTCGCG RMSE 0.9 0.2 0.4 bdl011 Form: A MXAE 2.0 0.3 0.9 Resolution: 2.50 ˚ A IMXAE [9] [4] [3] 1s2r CGCAAATTTGCG RMSE 0.1 0.2 0.1 bd0067 Form: A MXAE 0.3 0.4 0.3 Resolution: 1.53 ˚ A IMXAE [10] [10] [12] 428d CGCGAATTCGCG RMSE 0.2 0.2 0.2 0.5 0.4 0.4 bd0005 Form: A MXAE ˚ Resolution: 1.50 A IMXAE [4] [7] [2] 1d27 CGCGAATTTGCG RMSE 0.4 0.1 0.2 0.8 0.2 0.3 bdlb26 Form: A MXAE Resolution: 2.00 ˚ A IMXAE [11] [11] [11] 1d29 CGTGAATTCACG RMSE 0.3 0.2 0.3 0.6 0.4 0.6 bdl029 Form: A MXAE Resolution: 2.50 ˚ A IMXAE [5] [3] [9] 1ih1 GGCGCC RMSE 0.2 0.1 0.2 0.4 0.2 0.3 bd0050 Form: A MXAE Resolution: 2.00 ˚ A IMXAE [5] [5] [2] Total RMSE 0.3 0.2 0.3 MXAE 2.0 0.4 0.9 1 pdb (top) and ndb (bottom) file names of the structures. 2 Error type, RMSE (root mean square error), MXAE (maximum

3

8.2 21.6 [7] 9.7 18.9 [6] 4.5 8.3 [6] 2.8 4.8 [5] 4.5 7.8 [5] 3.8 5.7 [6] 8.2 17.7 [9] 6.6 14.2 [3] 5.8 8.3 [2] 3.1 5.8 [2] 5.1 8.3 [8] 6.2 15.4 [2] 4.4 8.6 [5] 6.0 21.6

π

σ

3.9 7.4 [9] 2.7 5.2 [6] 2.2 4.0 [10] 4.3 7.2 [2] 6.9 13.6 [7] 6.9 11.7 [10] 3.7 6.0 [4] 6.9 17.4 [11] 7.2 14.9 [11] 3.7 8.5 [11] 4.9 10.4 [11] 4.0 8.3 [9] 9.7 13.8 [5] 5.4 17.4

3.9 9.3 [8] 3.3 8.5 [7] 1.7 3.1 [6] 2.4 5.1 [6] 2.8 4.8 [4] 3.3 8.2 [7] 3.0 5.6 [8] 4.1 10.4 [9] 3.0 4.7 [10] 3.0 7.1 [4] 3.4 4.8 [2] 5.2 9.1 [3] 1.9 3.1 [4] 3.4 10.4





Dx ˚ A

0.8 1.6 [3] 0.4 0.5 [3] 0.5 0.7 [2] 0.5 0.8 [8] 0.6 0.8 [2] 0.4 0.8 [9] 0.7 1.3 [5] 0.6 1.7 [10] 1.0 1.8 [10] 0.7 2.0 [10] 0.6 1.5 [10] 0.8 1.6 [10] 0.7 0.9 [3] 0.6 2.0

Dy ˚ A

0.4 1.2 [4] 1.5 3.1 [2] 0.9 1.8 [2] 0.8 1.2 [2] 1.0 1.8 [2] 0.4 0.6 [4] 0.3 0.5 [6] 0.5 0.8 [10] 0.3 0.6 [1] 0.2 0.6 [2] 0.2 0.4 [9] 0.5 1.2 [9] 0.4 0.8 [5] 0.7 3.1

Base-pair steps Dz τ ◦ ˚ A

0.2 0.4 [6] 0.3 0.6 [5] 0.1 0.2 [2] 0.1 0.3 [6] 0.1 0.3 [5] 0.2 0.3 [2] 0.1 0.1 [8] 0.3 0.8 [3] 0.1 0.3 [9] 0.3 0.7 [2] 0.2 0.4 [2] 0.1 0.2 [3] 0.2 0.2 [2] 0.2 0.8

3.5 8.2 [3] 1.6 2.3 [4] 2.9 5.2 [4] 1.6 4.0 [4] 1.7 2.4 [4] 2.9 4.1 [5] 2.0 4.4 [1] 3.5 7.2 [11] 2.2 4.7 [11] 3.6 8.8 [10] 2.2 3.8 [10] 3.0 5.5 [10] 1.0 1.4 [3] 2.7 8.8

ρ

ω

3.2 6.6 [9] 8.1 14.5 [2] 5.0 7.8 [2] 3.1 5.8 [9] 5.6 8.2 [2] 4.7 8.7 [5] 4.8 7.2 [6] 7.8 17.0 [10] 5.5 12.1 [1] 6.9 15.5 [10] 4.6 7.8 [11] 4.5 9.4 [10] 3.1 5.5 [2] 5.5 17.0

3.5 6.5 [6] 9.4 14.6 [2] 5.3 7.7 [5] 2.0 4.3 [9] 6.2 7.6 [3] 5.5 8.8 [3] 2.5 4.1 [5] 4.7 9.8 [3] 4.5 9.8 [3] 4.0 9.4 [2] 4.6 7.7 [9] 4.2 7.0 [1] 1.5 2.3 [3] 4.9 14.6





Atom xyz ˚ A

1.0

1.9

0.9

0.8

1.4

0.6

0.6

1.4

1.2

1.2

0.8

1.1

0.6

1.1

absolute error), IMXAE (index of the base that produces the MXAE).

Results

The different performance errors for our B-DNA clustered test dataset are shown in Table 1. The RMSE of the conformational parameters are within the average standard deviations of the corresponding conformational parameters. The conformational coordinates with high RMSE correspond to rotational coordinates (Buckle κ, Propeller π, Roll ρ, Twist ω). The most difficult conformational coordinates for reproduction in this dataset appeared to be the Buckle κ and the Slide Dy . The RMSE values of the atomic coordinates are (except for 1d8g) within the resolution of the crystal structures. Figure 2 shows the 3D representation of the non-hydrogen base atoms of the 1d27.pdb crystal structure (left), and the corresponding predicted structure (right). This is an interesting example, since in spite of some apparent local divergence in the atom positions, the predicted structure recovers the global bending shape of the original one. Some divergence is due to the fact that the 1d27.pdb structure contains a mutagenic lesion which produces a mismatch [13]. In Figure 2 the coordinates of the real crystal structure (left) are the coordinates given in the 1d27 pdb file. Thus, they are based on the crystal coordinate center, whereas for the coordinates of the simulated structure (right) the coordinates center is in the center of one the base-pair terminals, (the 5’ terminal of the input DNA

18

Ara´ uzo-Bravo and Sarai

Table 2: Errors for the Farwer et al [10] dataset for the base pair, base-pair step conformational parameters, and atom positions. The last column shows the RMSE of the atom positions, for the nucleic bases calculated with our algorithm (top), and for all the atoms calculated by the best model of Farwer et al [10] (bottom in parenthesis). The translations are given in Angstroms ˚ A, and the ◦ rotations in degrees . Code1 Sequence

Error2 Sx ˚ A

Sy ˚ A

Base pairs Sz κ ◦ ˚ A

π ◦

σ ◦

Dx ˚ A

Dy ˚ A

Base-pair steps Dz τ ◦ ˚ A

ρ



ω ◦

425d ACCGGTACCGGT RMSE 0.3 0.2 0.3 6.8 5.0 3.2 0.3 0.7 0.2 2.0 3.9 4.6 bd0003 Form: B MXAE 0.6 0.4 0.5 10.1 10.6 6.4 0.7 1.4 0.3 4.2 8.6 8.6 Resolution: 2.80 ˚ A IMXAE [11] [12] [4] [10] [12] [7] [4] [6] [9] [6] [3] [9] 167d CCATTAATGG RMSE 0.4 0.2 0.2 5.7 3.8 4.5 0.3 0.4 0.2 3.4 3.1 4.3 bdj055 Form: B MXAE 0.8 0.3 0.4 12.2 6.3 8.0 0.4 0.8 0.4 7.1 7.2 6.4 Resolution: 2.30 ˚ A IMXAE [9] [5] [2] [2] [10] [3] [2] [5] [7] [7] [3] [1] CCACTAGTGG RMSE 0.2 0.3 0.2 10.2 6.2 3.1 0.3 1.0 0.2 2.4 3.0 5.4 bdj061 Form: B MXAE 0.5 0.4 0.4 24.6 12.5 5.7 0.6 2.4 0.4 5.5 5.5 8.3 Resolution: 1.95 ˚ A IMXAE [4] [7] [10] [7] [1] [7] [9] [5] [8] [9] [7] [6] 252d CGCAATTGCG RMSE 0.2 0.1 0.4 8.9 2.2 3.4 0.4 0.5 0.4 2.9 3.8 5.7 bdj069 Form: B MXAE 0.4 0.1 0.7 23.1 3.8 5.5 0.7 1.4 0.8 4.5 6.3 13.7 Resolution: 2.30 ˚ A IMXAE [8] [10] [3] [3] [8] [4] [5] [1] [2] [1] [3] [2] 1hq7 GCAAACGTTTGC RMSE 0.2 0.1 0.2 7.4 1.7 2.4 0.5 0.4 0.2 2.9 4.0 5.1 bd0047 Form: B MXAE 0.4 0.3 0.5 16.4 3.1 5.9 0.8 0.7 0.3 6.5 8.2 10.2 Resolution: 2.10 ˚ A IMXAE [3] [9] [1] [7] [10] [7] [6] [2] [6] [10] [10] [6] 1dcv CCGCTAGCGG RMSE 0.5 0.2 0.4 7.3 6.4 5.7 0.4 1.0 0.2 4.3 3.5 4.9 bd0028 Form: B MXAE 1.0 0.5 0.9 12.5 13.3 11.1 0.8 2.0 0.4 10.3 6.8 9.8 [8] [9] [9] [9] [3] [6] [8] [1] [9] [9] [9] [6] Resolution: 2.50 ˚ A IMXAE 1d28 CGTGAATTCACG RMSE 0.3 0.2 0.3 6.6 5.3 6.2 0.7 0.6 0.1 2.9 4.2 3.8 0.5 0.4 0.5 10.4 10.8 13.4 1.3 1.0 0.2 5.8 10.3 7.4 bdl028 Form: B MXAE Resolution: 2.70 ˚ A IMXAE [3] [8] [10] [9] [10] [3] [10] [8] [8] [10] [10] [10] 287d CGCGATATCGCG RMSE 0.4 0.1 0.3 3.3 5.1 2.8 0.8 0.4 0.3 5.4 7.0 5.6 0.9 0.3 0.6 6.2 8.5 5.4 2.0 0.8 0.6 13.1 12.8 9.8 bdl078 Form: B MXAE Resolution: 2.20 ˚ A IMXAE [9] [9] [2] [8] [6] [2] [10] [6] [2] [10] [10] [8] 194d CGCGTTAACGCG RMSE 0.5 0.2 0.4 9.7 7.2 5.8 0.6 0.4 0.3 5.6 6.2 5.3 0.7 0.4 0.9 18.3 13.8 11.7 1.1 0.7 0.5 10.4 13.1 11.0 bdl059 Form: B MXAE [1] [12] [12] [7] [11] [6] [10] [10] [11] [11] [10] [7] Resolution: 2.30 ˚ A IMXAE GGGTACCC RMSE 0.2 0.1 0.3 12.9 4.1 2.8 0.4 2.1 0.2 2.8 5.6 7.1 0.3 0.2 0.4 20.5 6.7 5.7 0.7 2.9 0.4 5.1 11.1 16.6 adh031 Form: A MXAE Resolution: 2.50 ˚ A IMXAE [7] [5] [1] [5] [7] [7] [6] [6] [4] [1] [6] [4] GGGCGCCC RMSE 0.2 0.3 0.5 10.1 3.6 2.3 0.3 1.7 0.4 3.1 5.6 4.4 0.3 0.3 0.8 18.4 6.3 5.0 0.4 2.2 0.7 4.3 9.2 7.7 adh029 Form: A MXAE Resolution: 1.90 ˚ A IMXAE [7] [3] [3] [3] [6] [7] [4] [2] [4] [6] [4] [4] 257d GCCGGC RMSE 0.5 0.2 0.5 10.4 6.5 4.9 0.6 2.1 0.2 1.7 10.1 5.7 adf073 Form: A MXAE 0.9 0.4 1.0 14.9 12.4 8.0 1.1 2.4 0.3 2.4 14.3 11.0 Resolution: 2.30 ˚ A IMXAE [3] [4] [4] [5] [3] [4] [1] [2] [3] [2] [4] [4] 281d GGCATGCC RMSE 0.3 0.2 0.4 6.3 14.1 3.2 0.3 1.9 0.2 2.7 5.9 5.8 adh076 Form: A MXAE 0.5 0.3 0.6 12.0 24.2 6.1 0.5 2.4 0.3 5.1 9.8 10.4 Resolution: 2.38 ˚ A IMXAE [4] [4] [7] [5] [7] [1] [2] [1] [1] [1] [7] [1] 440d AGGGGCCCCT RMSE 0.2 0.1 0.4 13.7 9.2 2.6 0.7 2.1 0.2 3.3 8.6 5.1 ad0003 Form: A MXAE 0.4 0.2 0.7 21.4 20.2 4.6 1.2 2.6 0.4 6.2 15.3 8.9 Resolution: 1.10 ˚ A IMXAE [1] [6] [5] [8] [10] [2] [4] [7] [5] [4] [5] [4] 137d GCGGGCCCGC RMSE 0.3 0.1 0.4 10.2 5.7 1.8 0.7 2.4 0.2 1.2 6.1 5.4 adj050 Form: A MXAE 0.6 0.1 0.7 17.1 10.5 3.1 1.2 3.3 0.4 2.5 12.2 8.1 Resolution: 1.70 ˚ A IMXAE [3] [5] [3] [4] [10] [5] [4] [2] [4] [6] [5] [4] 1dc0 CATGGGCCCATG RMSE 0.3 0.1 0.2 10.4 7.5 2.0 0.7 2.1 0.2 2.5 2.4 5.2 0.6 0.3 0.4 21.4 13.2 4.2 1.3 2.7 0.5 5.4 6.9 10.5 bd0026 Form: A/B MXAE Resolution: 1.30 ˚ A IMXAE [10] [2] [6] [8] [12] [8] [11] [3] [6] [11] [6] [11] 1dn6 GGATGGGAG RMSE 1.1 0.3 0.7 10.6 20.6 11.4 0.7 2.0 0.8 6.2 12.7 15.9 adi009 Form: A MXAE 2.2 0.5 1.2 15.5 31.5 27.9 1.4 3.3 1.3 11.1 24.9 27.6 Resolution: 3.00 ˚ A IMXAE [9] [2] [9] [4] [8] [8] [2] [6] [6] [5] [2] [8] 160d CCCGGCCGGG RMSE 0.3 0.1 0.6 12.9 5.8 2.1 0.8 2.3 0.4 3.4 7.4 5.4 0.5 0.2 0.8 21.8 11.3 3.5 1.6 2.8 0.7 6.8 15.7 11.0 adj049 Form: A MXAE Resolution: 1.65 ˚ A IMXAE [4] [7] [8] [9] [6] [4] [4] [7] [3] [3] [4] [4] 240d CCGGGCCCGG RMSE 0.4 0.2 0.6 12.0 8.0 1.4 0.7 2.3 0.4 2.6 7.6 4.4 adj069 Form: A MXAE 0.9 0.4 1.0 18.9 13.8 2.4 1.1 2.9 0.8 5.6 15.4 7.4 Resolution: 2.00 ˚ A IMXAE [3] [8] [9] [7] [10] [5] [5] [2] [8] [2] [4] [4] 399d CGCCCGCGGGCG RMSE 0.2 0.1 0.4 14.9 7.3 2.4 0.6 2.1 0.3 3.3 7.3 8.4 adl047 Form: A MXAE 0.4 0.3 0.8 22.5 18.4 4.6 0.9 3.0 0.7 6.2 14.0 18.7 Resolution: 1.90 ˚ A IMXAE [5] [8] [8] [4] [1] [7] [8] [7] [1] [9] [6] [1] 1 pdb (top) and ndb (bottom) file names of the structures. The hyphen - stands for the structures that do no exist in pdb database. 2 Error type, RMSE (root mean square error), MXAE (maximum absolute error), IMXAE (index of the base that produces the MXAE).

Atom xyz ˚ A

0.9 (0.8) 0.7 (1.1) 0.9 (1.2) 0.7 (1.2) 0.9 (1.4) 0.8 (1.7) 1.0 (1.6) 1.6 (1.8) 1.4 (1.9) 2.7 (0.9) 2.6 (1.0) 2.4 (1.1) 2.6 (1.3) 4.3 (1.6) 4.7 (2.1) 3.7 (2.6) 2.8 (2.5) 5.0 (2.3) 4.8 (3.6) 5.9 (5.0)

DNA Structure Prediction from Sequence

19

sequence). Therefore, the coordinates of the real and the simulated structure are different, but their range is the same. After performing the structural alignment, the RMSE of the nucleic base atom positions of this structure is 0.8 ˚ A and the RMSE of the whole dataset is 1.1 ˚ A.

Figure 2: Three-dimensional representation of the non-hydrogen base atoms of the 1d27.pdb crystal structure (left), and the corresponding predicted structure (right). We compared the performance of our algorithm with the recently published algorithm of Farwer et al. [10] based on genetic algorithms and empirical methods. The comparison cannot be perfect since the Farwer’s method estimates all the atoms (nucleic base and sugar backbone), whereas ours in the present implementation estimates only the nucleic base atoms. We applied our algorithm to estimate the conformational states and the atomic structures of the dataset proposed by Farwer et al. [10]. To overcome biases we recalculated our training dataset eliminating the structures from the Farwer’s dataset. The different performance errors are presented in Table 2, where the last column shows the RMSE of the atomic coordinates. For each structure, the RMSE of the position prediction of the nucleic base atoms given by our method is shown in the top, the RMSE of the position prediction of all the atoms given by the best model of Farwer et al. [10] is shown in the bottom (in parenthesis). Over half of the DNA structures of the Farwer’s dataset are A-DNA forms. For such structures the Farwer’s method outperforms ours. This is natural, since our method is trained only with the more common DNA form in the nature (and in the public available databases), the B form. For the B-DNA structures, our method generally performs better than the Farwer’s one. Anyway, the comparison is not completely just, since we still do not estimate the position of the backbone atoms.

4

Discussion

The DNA molecular dynamic (MD) simulations can produce the 3D atomic structure of oligonucleotides [3, 7]. There exist also structure prediction techniques based on minimization of theoretical energy functions, using genetic algorithms [10]. The presented here method plays a complementary

20

Ara´ uzo-Bravo and Sarai

role to that of MD and genetic algorithms in the generation of DNA atomic 3D structures, since it can provide a first and fast insight in the 3D structure of arbitrary long DNA sequences which are computationally very demanding to simulate with other techniques. We have also tried a backpropagation neural network to learn longer range R base pair and base-pair steps conformational coordinates, but the results did not improve the presented here ones. The strength of our method is in the simplicity of the knowledge-based approach that uses a very simple and robust stastistical analysis based only in the calculation of average and covariance matrices. Our implementation aims at the simulation of the nucleic bases for long sequences, for which the actual approaches can be computationally very demanding. Some applications of the atomic 3D structures, generated with this technique are: to supply an initial state for molecular mechanics and MD simulations; to provide a free of binding DNA structure for reference, to compare with DNA binding states induced by proteins, drugs of other ligands; to compare anomalous DNA structures (with base mistmaches, kinks, etc) with a standard structure; to help in the search of possible binding sites for transcription factors where the specific conformation states of some DNA regions can play the role of a recognition signal for a target transcription factor; The implementation is prepared for estimations considering longer ranges of neighbors, incrementing the range R of the R-nucleotides used in the statistical analysis during the learning phase. But the trials done with the current available data did not improve significantly the results. The presented technique relies on the availability of DNA crystal structures. The increase in the number of DNA crystal structures in the public databases can improve the training dataset, which in turn can improve the predictions. Until now we have limited our work to the prediction of B-DNA structures, and to their nucleic base atoms. In the future, we plan to extend our technique to A-DNA and Holliday junctions, to predict not only the base atoms but also the sugar backbone atoms. We are also generalizing our knowledge-based method to the prediction of double stranded RNA structures.

Acknowledgments M.J. Ara´ uzo-Bravo would like to acknowledge the Japanese Society for the Promotion of Science (JSPS) for supporting him for this research. This work is supported in part by Grants-in-Aid for Scientific Research 16014219 and 16041235 (A. Sarai) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

References [1] Anfinsen, C.M., Principles that govern the folding of protein chains, Science, 181(96):223–220, 1973. [2] Ara´ uzo-Bravo, M.J. and Ahmad, S., Protein sequence and structure databases: A review, Current Analytical Chemistry, 1(3):355–371, 2005. [3] Ara´ uzo-Bravo, M.J., Fujii, S., Kono, H., Ahmad, S., and Sarai, A., Sequence-dependent conformational energy of DNA derived from molecular dynamics simulations: Toward understanding the indirect readout mechanism in protein-DNA recognition, Journal of the American Chemical Society, 127(46):16074–16089, 2005. [4] Babcock, M.S., Pednault, E.P.D., and Olson, W.K., Nucleic acid structure analysis: Mathematics for local cartesian and helical structure parameters that are truly comparable between structures, J. Mol. Biol., 237:125–156, 1994. [5] Basham, B., Schroth, G.P., and Ho, P.S., An A-DNA triplet code: Thermodynamic rules for predicting A- and B-DNA, Proc. Natl. Acad. Sci. USA, 92:6464–6468, 1995.

DNA Structure Prediction from Sequence

21

[6] Berman, H.M., Olson, W.K., Beveridge, D.L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh, S.H., Srinivasan, A.R., and Schneider, B., The nucleic acid database: A comprehensive relational database of three- dimensional structures of nucleic acids, Biophys. J., 63:751–759, 1992. [7] Beveridge, D.L., Barreiro, G., Byun, K.S., Case, D.A., Dixit, S.B., Cheatham III, T.E., Giudice, E., Lankaˇs, F., Lavery, R., Maddocks, J.H., Osman, R., Seibert, E., Sklenar, H., Stoll, G., Thayer, K.M., Varnai, P., and Young, M.A., Molecular dynamics simulations of the 136 unique tetranucleotide sequences of DNA oligonucleotides. I. Research design and results on d(Cp G) steps, Biophys. J., 87:3799–3813, 2004. [8] Deshpande, N., Addess, K.J., Bluhm, W.F., Merino-Ott, J.C., Townsend-Merino, W., Zhang, Q., Knezevich, C., Lie, L., Chen, L., Feng, Z., Kramer-Green, R., Flippen-Anderson, J.L., Westbrook, J., Berman, H.M., and Bourne, P.E., The RCSB protein data bank: A redesigned query system and relational database based on the mmCIF schema, Nucleic Acids Res., 33(1):D233–237, 2005. [9] Dickerson, R.E., Bansal, M., Calladine, C.R., Diekmann, S., Hunter, W.N., Kennard, O., Kitzing, E., Lavery, R., Nelson, H.C.M., Olson, W.K., and Saenger, W., Definitions and nomenclature of nucleic acid structure parameters, Nucleic Acids Res., 17(5):1797–1803, 1989. [10] Farwer, J., Packer, M.J., and Hunter, C.A., Prediction of atomic structure from sequence for double helical DNA oligomers, Biopolymers, 2005. In press. [11] Hays, F.A., Teegarden, A., Jones, Z.J.R., Harms, M., Raup, D., Watson, J., Cavaliere, E., and Shing-Ho, P., How sequence defines structure: A crystallographic map of DNA structure and conformation, Proc. Natl. Acad. Sci. USA, 102(20):7157–7162, 2005. [12] Horn, B.K.P., Closed-form solution of absolute orientation using unit quaternions, J. Opt. Soc. Am. A, 4:629–642, 1987. [13] Leonard, G.A., Thomson, J., Watson, W.P., and Brown, T., High-resolution structure of a mutagenic lesion in DNA, Proc. Natl. Acad. Sci. USA, 87(24):9573–9576, 1990. [14] Lu, X.J. and Olson, W.K., 3DNA: A software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures, Nucleic Acids Res., 31(17):5108–5121, 2003. [15] Lu, X.J., Shakked, Z., and Olson, W.K., A-form conformational motifs in ligand-bound DNA structures, J. Mol. Biol., 300(4):819–840, 2000. [16] Matthews, B.W., Structural and genetic analysis of the folding and function of T4 lysozyme, FASEB, 10:35–41, 1996. [17] Olson, W.K., Bansal, M., Burley, S.K., Dickerson, R.E., Gerstein, M., Harvey, E.C., Heinemann, U., Lu, X.J., Neidle, S., Shakked, Z., Sklenar, H., Suzuki, M., Tung, C.S., Westhof, E., Wolberger, C., and Berman, H.M., A standard reference frame for the description of nucleic acid base pair geometry, J. Mol. Biol., 313(1):229–237, 2001. [18] Packer, M.J., Dauncey, M.P., and Hunter, C.A., Sequence-dependent DNA structure: Dinucleotide conformational maps, J. Mol. Biol., 295:71–83, 2000. [19] Packer, M.J., Dauncey, M.P., and Hunter C.A., Sequence-dependent DNA structure: Tetranuleotide conformational maps, J. Mol. Biol., 295:85–103, 2000. [20] Persikov, A.V., Ramshaw, J.A., Kirkpatrick, A., and Brodsky, B., Amino acid propensities for the collagen triple-helix, Biochemistry, 39(48):14960–14967, 2000.