Picking Alignments from (Steiner) Trees - CiteSeerX

Report 1 Downloads 68 Views
Picking Alignments from (Steiner) Trees Lior Pachter

Fumei Lam

Department of Mathematics, U.C. Berkeley

Department of Mathematics, M.I.T.

[email protected]

Marina Alexandersson Department of Statistics, U.C. Berkeley

[email protected]

The

ABSTRACT

The application of Needleman-Wunsch alignment techniques t,o biological sequences is complicated by two serious problems when t,he sequences are long: the running time, which scales as the product of t,he lc~~gt,hs of sequeucvs, and the tlifficult,,y in obtaiuing suitable parameters that produce meanillgflll aligllluellt~h. Tl 1e Iturning time problem is often corrected by reducing the search space, using techniques such as banding, or chaining of high scoring pairs. The parameter problem is more difficult to fix, partly because the probabilistic model, which Needleman-Wunsch is equivalent to, does not capture a key feature of biological sequence alignments, namely the alternation of conserved blocks and seemingly unrelated non-conserved segments. We present a solut,ion t,o t,he problem of designing efficient search spaces for pair hidden Markov models that align biological sequences by taking advantage of their associated features. Our approach leads to an optimization problem, for which we obtain a 2-approximation algorithm, and that is based on the construction of Manhattan networks, which are close relat.ives of Steiner trees. We describe the underlying theory and show how our methods can be applied to alignment of DNA sequences in practice, succesfully reducing the Viterbi algorit,hm starch space of alignment PHMMs by three orders of magnitude.

[email protected]

problems

l

l

.

are:

The alignment problem for biological sequences. Development of efficient, hidden Markov models. Construction

of

rectilinear

Viterbi Steiner

algorit,hms

for pair

networks.

The firs1 of those problems, the alignment of biological sequences, is arguably the most successful application of computational biology to date. The classical example is the BLAST program [3], which is the most cited paper in computational biology, and in fact one of the most cited papers in science altogether. BLAST has been applied successfully to a range of important problems, and is very useful in rapidly querying large databases to identify similarity with query sequences, but it remains a challenge to develop accurate alignment algorithms that are able to correctly align exons and other biologically interesting sequence features in large sequences. There have been many attempts designed to address these problems, and it is impossible to give a comprehensive review here. Most approaches fall int,o one of t,wo categories: improvement of parameter selection and “carrectness” of alignments, and secondly methods for speeding up the alignment methods. By way of example we mention a few recent research efforts aimed at addressing these difficulties. For the first problem, there have been numerous investigations on how to normalize alignments taking lengths into account (e.g. [a]) or how to select parameters, the latter issue being a very difficult one as is illustrated by work on parametric alignment [lo, 221 which shows that “optimal alignmen&” are very sensit,ive to the choice of paThere is also a vast literature on the topic of rameters. speeding up alignments. Some of the popular approaches are the banding method for global alignments [16, 91, the method of chaining together regions of strong local alignment or high scoring pairs (HSPs) [13, 91, and more recently, suffix tree anchoring approaches [4]. Finding accurate and fast methods for aligning long sequences, or whole genomes, remains a difficult and important problem.

1. INTRODUCTION

In this paper we address the problem of designing efficient search spaces for pair hidden Markov model alignment algorithms that take advantage of the conservation patterns of biological secluences. This leads naturally to the consideratiou of three computational problems, each of which has been studied individually in considerable detail, but whose co11~i(&ou has not. been well explored.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. RECOMB ‘02, April 18-21, 2002 Washington, D.C., USA Copyright 2002 ACM ISBN l-58113-498-3-02/04 . ..$5.00

One of t,he approaches to improving alignment accuracy has been to use probabilistic, rather than heuristic, models for alignment. In particular the hidden Marlcov model (HMM) approach, which has had a lot of success in gene finding and protein family identification, has been employed to provide a probabilistic basis and interpretation for many popular

246

alignment st,rategies. Alignment. problems are connected t.o pnir hidden Markov models (PHMMs). In fact there is a dircct. correspondence het,ween the Viterbi algorit,hm for certain PHMMs and the classic Needleman-Wunsch [18] and Smith-Waterman alignment algorithms [5, 121. One of the advantages of aligning sequences with PHMMs is the ability (by appropriately selecting the state space) to model a key feature of biological sequences, namely the presence of highly conserved regions with seemingly unrelated non-conserved rrgions in between (for an example, scse [14]). As we show in the following section, the speed up of the Viterbi algorithm for PHMMs that model the conservation pattern seen in biological sequences, leads nataurally to an interesting opLimizaiion problem which can be solved efficiently. Stalldard heuristic met,hods such as chaining of HSPs (the idea is to select a “maximal” collection of non-overlapping HSPs that, can form an alignment) are not, suitable in t,he probabilist,ic setting because they do not take the regions in between the HSPs into account in determining the global alignment, and thus compromise the ability to t,ake advant,age of features of both conserved and non-conserved segments. Other heurist.ic- n~et.l~ods, such as banding, have tllc drawback that, they are inefficient in cases where there have been large insertions in the sequences, and furthermore fail to take full advantage of some of the properties of the probabilistic models.

Figure 1: A minimum spanning tree and Steiner for a configuration of four points in the plane.

2. MATHEMATICAL PRELIMINARIES 2.1 Alignments and PHMMs

In what follows we assume that t.he reader is familiar wit,h the dynamic programming (DP) approach to alignment, the associated alignment graphs [9] (see Figure 2b), and the basic definitions of hidden Markov models [20]. W e b e g i n by reviewing the important connection between pair hidden Markov models and the Needleman-Wunsch alignment algorit,hm [5, 121.

(a)

UJ)

AC-GCAT-G AC .syr i.e. one of s or t is up and to the left of the other. The dynamic programming algorithm for finding the minimum rectilinear Steiner arborescence for a slide has running time O(n3) where n is the number of points in the slide.

Not,e that. t,his problem is an inst,ance of t.he point,-t,o-point pairs connect,ion p r o b l e m [6], where t,he sollrcr-clrst.ination are those pairs (s, t) with t up and to the right of 3. Since we can restrict t,he grid t,o the Hanan grid of t,he input, (.crminals and since there are O(n’) up-right, pairs for an input t,erminal set N of size n, the algorithm in [6] gives an O(n2”) algorithm for finding the optimal minimum network. We now modify the algorithm from [8] to give an O(n”) Z-approximation algorithm.

3. RESULTS

Our main contribution is the application of Steiner network ideas, in particular a variant of Manhattan networks, for reducing t,he search space for dynamic programming (the Viterbi algorithm) applied to PHMMs that contain a mixt,ure of conserved region states and independent region states. The general strategy is as follows:

1.

3 A.

3.

A 2-approximation algorithm for Manhattan networks on alignment graphs

For any node v E N, let Q1(~),Q2(2)),Q3((‘),Q4(2)) denote t,he four quadran& obtained by drawing horizont,al and vertical lines though v. Let f(u) be the topmost node of N in Q~(w) and let g(o) be t,he leftmost, node of N in Ql (21). For any node U, consider the sets A(o) = (11 : f(u) = II} and B(v) = {U : g(u) = u}. Note that for any v, v u A(v) is a slide and u U B(w) is a slide rotated by 180 degrees.

Given two sequences, find all the high scoring pairs using some local alignment algorithm. Build a Manhattan type network in the alignment graph so that, any subset, of HSPs can be included in the opt,imal alignment of the subgraph. Run the Vit,erbi algorithm, rest,ricted to t,he subgraph of t.he alignment graph obtained from step 2.

Step 1 can be performed using appropriate local alignment tools such as BLAST, [3], BLASTZ [21] or MUMmer [4]. The implicit assumption is that the “true” biological alignment of the sequences will have the property that conserved regions in the sequences will lie in HSPs (high scoring pairs), and therefore the corresponding conserved states will only IX used in HSPs. Since the paths used t,o connect the HSPs will corrrspontl t,o st,ate paths restricted to t,he intlel)ct~tlcnt states. t,he exact. paths selected for the network are unimportant, only their endpoints (Lemma 2) a n d t,heir t,otal lengt,h matt,er, affecting the running time and memory requirement,s of the Viterbi algorithm. Therefore, we connect

F i g u r e 6 : T h e f o u r q u a d r a n t s f o r a v e r t e x 21. The algorithm constructs edges in two stages:

Algorithnl

Corlllect,UI,R,ight

1 . S w e e p t,he point,s from left, t,o r i g h t . F o r rach node 0 E N, run the minimum slide arborescencr algorit.hni on u and the slide formed by .4(t)).

249

2.

Sweep t,he points front t,op to hott,om. For each node 11 E ,V, rot,at,e t,he set. B(t)) by 180 degrees and run the tttittitttutti slide arborescence algorithm on t,he resulting slide.

In this algorithm, we run only two of the four stages of the [8] algorithm, since we need only to connect up-right pairs. The proof of correctness is given by the following lemma.

LEMMA 3. [a] For each pair of ljoznts s, t E N with t ubove and to the right of s, there is a Munhattan path in G C’O,1,1( cling .s at1d I. Proof: If s a n d t s h a r e t h e s a m e x or y coordinate, the algorit,ltttt cotutects J t,o t with a line segment. If g(s) = 1 or f(t) = 3, t,hett the algorithnt connects s and t during the call (0 the minimum slide arborescence algorit,hnt. Ot,herwise, s is tlirectJy cottncct,c:cl to a point s’ = y( 3) E Qt (s) to the left of I a11r1 t is tlir,ct.ly connected t,o a point, t’ = f(s) E Qs(t) above s. If either of s’ or t’ lies in the rectangle formed l)v s and !, t.hrn recurse on the point inside the rectangle. OAerwise, bot,h s’ and t’ lie outside t,he rect,angle, so the Manhattan paths connect,ing J to s’ a n d t to t’ tttust. int,erscct~ inside the rectanglr. This gives a Manltattan path connecting s and t.

0 Figure 7: A slide showing the origin 0, a corner and a slide of points ~1,. . ,p5. (i) f continues to the north or east of its intersection with e. (ii) f continues to the south or west of its intersection with e.

In the first case, we can replace e with either a vertical OI horizontal edge of the same cost without affecting any of the other edges (since by construction f was the first edge intersecting e).

lising t,he minimutn slide arborescence algorithm for a slide that. runs itt Ointe O(n3) [15], the above algorit,hm gives a Z-approximation for our problem that runs in time O(n3).

Diagonal

Edges

The alignment graphs we are interested in are different front rect,angular grids in that they also cont,ain diagonal edges. Fortuttately, witlt sottte tttodifications it is possible to convert the O(n”) algorithm just described to work for alignment graphs as well.

Figure 8: Case (i) in the backtrack proof.

Let R dettote the rectilinear grid and D denote the rectilinear grid with additional up-right edges. If the horizontal distance between two points s and t is x and the vertical dist,ance is y, the shortest path betwern s and t in D has length tttax{.r, y} and there will be (:z,::c;:,‘) such paths.

rjEMMA

4. l,d P =

(111 ,j&,. .pn} be cl Skk O

f

pOintS

increasing x-coordinate. Then the minimal nrboresence of P (with origin 0) on grid D is the minimal rectilinear nrhorescence of P wit/z origin 0’ = ((~t)~, (P~)~) on grid R together with the minimal path from 0 to 0’ in D. in

Proof: To show t,his, it, is enough to show that tttttttt arhorescence of P wit.lt origin 0’ as defined t tt(, SiltttC’ tctt#l tt Ott gritls II atlcl I?. Since 0’ ttcWls to c011tic~G~cl t,o poittbs ~,t and pn, t,tte vert.ical atid cdgc~s

O’,

0’ --f ,I, ,

0 ’

-+ pr, ttltls~~ be presetlt~. N

o

w

the miniabove has b e ttorizottlal ,

In the second case, recurse on f, following the edge back towards the origin. There will be a unique path to follow since the rectilinear Steiner arborescence defines a tree with root node 0’ and leaves pl ,pz, . .pn. The claim is that eventually this path towards the origin must contain an edge e’ intersecting an edge f’ as in case (i). This is true since P is a slide and since the extreme edges 0’ + ~1, 0’ -+ pn must be present in the arboresence. Then, since there are no other intersections along the travelled path, by construction no other edges are modified by replacing the path with a single vertical or horizontal edge of the same length as the path. The path will be replaced by a vertical (horizontal) edge if the edge f’ is horizontal (vertical). This proves that we need the diagonal edges in constructing the minimum arborescence on D only for connecting the origin 0 with the node 0’. Now, to approximate the minimum up-right connected network on D, we can use the same algorithm we used for R, modifying only the calls we make to the minimum slide arborescence algorithm to allow the diagonal edge connections from 0 to 0’.

SlqJpose

here is sottte tliagottal edge f’ in the arboresc-encr. Follow e back (towards t,he origin) u&l it intersects an edge f. Then either I

250

MMs such as in Figure 4, depends on the sequences between the HSPs, and it is not sufficient to simply look for a maximal collection of HSPs using a sequence independent gap penalty as in [17]. This is particularly important in the gene finding application. In that case, HSPs will correspond t,o exons and CNSs (conserved non-coding sequences), and the intervening sequences to introns, and the length of the introns (which is typically geometrically distributed) plays a significant role in the overall probability of the alignment. Thus, from the probabilistic perspective, the St,einer method for reducing t,he search space for alignments is in some sense the natural approach for reducing the search space, both speeding up t,he alignments and reducing the memory requirements.

Figure 9: Case (ii) in the backtrack proof.

3.2 Application to alignment

The algorithm described above was implemented for the purposes of reducing alignment search spaces in comparat,ive gene finding PHMMs such as described in [19]. In order to illustrate the effectiveness and performance of t,he algorithm in practice, we analyzed the first -5Okb from a pail of regions consisting of 250kb of syntenic human and mouse DNA. The region (GENBANK accession numbers U47924 and AC002397) was selected from an alignment example used in the MUMmer paper [4] w h’1~ :I I was first described in [l], and is the region that contains (he C!Xl gene.

4. CONCLUDING REMARKS

The connect,ion between alignments, PHMMs, and St,einer trees raises a number of interesting quest,ions that go beyond the immediat,e applications we have highlight,ed in t.his paper. Optimal networks for more complicated PHMMs, such as the GPHMMs discussed in [19] lead t,o more complicated variants of the Manhattan network problem presented here. Even the Manhattan network problem has not been “solved”, in the sense that it is still unknown whether it is NP-complete [8].

We used BLASTZ to find the HSPs, although in practice any local alignment algorithm with low cutdoffs on the parameters is suitable (our goal in the preprocessing step is to capture all possibly relevant HSPs so we prefer false positives over false negatives). The resulting BLASTZ alignment is depicted as black and white boxes in Figure 10 with HSPs overlapping coding regions being black and non-coding regions being shown as white boxes.

The running time of our algorithm is O(n?) (worst case) where n is the number of HSPs, and the resultant PHMM algorithm for producing an alignment will run in t.ime proportional to the size of the network, which in the worst case will be O(n*). It is possible to reduce the O(n3) running time for obtaining the network to O(nlogn) at the expense of increasing the bound for the size of the network from twice optimal to four times optimal. We have found that for most biological applications, the HSPs will lie along a diagonal and therefore it is beneficial to use the O(n3) algorithm since it will be much fast,er in practice. The worst case O(n*) time for running the Viterbi algorithm is, in a sense, best possible, since a dynamic programming chaining algorithm would have the same time bound if gap scores are to depend on the sequences. An advantage of our algorithm is that the size of the network can be much smaller depending on the geometry of the HSPs (and this is often the case for biological sequences where the HSPs lie along a diagonal), whereas a dynamic programming chaining algorithm that does not take advantage of the independence of the scores in the noncoding regions will always have running time R(n’) regardless of the geometry of the HSPs.

The 2-approximation algorithm for rectilinear grids including diagonal edges that we have described was implemented in a program called SLIM, and was used to construct a Manhattan network which is shown in Figure 10. Our algorithm produced a network of total length 859,358. This should be compared with the size of the entire alignment graph which contains 5.824 . 10’ edges (3 orders of magnitude larger). It is clear in this example that a banded alignment algorithm consisting of restricting the search space to a band along the diagonal would not work well because of the shift of the exons of the main diagonal. The smallest band containing all the exons has a width of approximately 10,000 bases resulting in an alignment subgraph with about 10’ edges. In fact, it is not difficult to construct (or find) examples that fail for any type of banding algorithm; difficulties will always arise when there have been significant insertions or deletions in the sequences.

Another nice feature of the algorithm, is the ability to tune the number of HSPs considered so as to build a network of an appropriate size for the computational resources available (in the worst case the network size will grow ql~adratically with the number of HSPs, but again in practice it should be closer to linear).

It should be pointed out that the traditional method of obt,aining global alignments from local alignments, namely by chaining together HSPs, is not appropriate (or relevant) in the probabilistic setting we have described for alignment. The reason is that in the mixed PHMM models the alignment, scores depend on the sequences used bet,wecn the HSPs; in particular, for the non-conserved regions, the product of the output probabilities in each of the sequences (independently) will contribute to the final alignment score. Thus the optimal alignment (in a probabilistic sense), using PH-

Alt,hough we have not, provided details hrrr, t,llc, t.rchniques we have outlined generalize for multiple alignments, and can be applied to more complicated PHMMs, or even generalized PHMMs such as the comparative gene finding generalized PHMM described in [19].

251

AC002397 A

50000

40000

30000

20000

10000

0 6

10000

20600

30000

. 5oiJoo u47924

40600

Figure 10: The Manhattan network for 52 x 56 kb of the CD4 region in human and mouse (the first parts of GENBANK accession numbers U47924, AC00239’7). The boxes are the BLASTZ local alignments (not drawn to scale), i.e. the HSPs. Black boxes overlap the annotated coding exons and white boxes are HSPs in non-coding regions. program is available for download at http://bio.math.berkeley.edu/slim/. The SLIM

Whole Genomes. 2369-2376. [5]

Acknowledgments M. A. was partially supported by STINT, the Swedish Foundation for International Cooperation in Research and Higher Education. Thanks to Nick Bray for helping with some of the implementation of SLIM and to Simon Cawley for helpful discussions and suggestions.

5.

REFERENCES

[7]

[8]

.4lt~scl1ul, S. F., Gish, W., Miller. W., Myers, E. W. Lipman, D. J. (1990). Basic local alignment search tool Journal of Molecular Biology, 215, 403-410.

[4]

Delcher. A. L. Kasif, S. Fleischmann, R. D. Peterson, 3. White, 0. Salzberg, S. L. (1999). Alignment of

Research,

2’7(11),

Durbin, R. Eddy, S. Krogh, A. Mitchison, G. (1998). Biological sequence analysis. Cambridge University press.

Garey, M.R. Johnson, D.S. (1977). The Rectilinear Steiner Tree Problem is NP-Complete. SIAM J. Appl. Math., 32(4), 826-834. Gudmundsson, J. Levcopoulos, C. Narasimhan, G.

(2001).

Approximating

Minimum

Networks. Nordic 1. Comp., 8(2),

Arslan, A. Egecioglu, 6. Pevzner, P. A. (2001). A new approach to sequence comparison: normalized sequence alignment. RECOMB 2001: Proceedings of the Fifth International Conference on Computational Molecular Biology.

[3]

Acids

[6] Feldman, J. Ruhl, M. (1999). The Directed Steiner Network problem is tractable for a constant number of terminals. FOCS’99: IEEE Sympos. Found. Comp. Sci, 299-308.

[I] Ansari-Lari, M. A. Oeltjen, J. C. Schwartz, S. Zhang, Z. Muzny, D. M. Lu, J. Gorrell, J. H. Chinault, A. C. Belmont, J. W. Miller, W. Gibbs, R. A. (1998). Comparative sequence analysis of a gene-rich cluster at human chromosome 12~13 and its syntenic region in mouse chromosome 6. Genome Research, 8, 29-40. [2]

Nucleic

[9]

Manhattan 219.

Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences.

Cambridge

University

Press.

[lo] Gusfield, D. Balasubramanian, K. Naor, D. (1994). Parametric optimization of sequence alignment. Algorithmica, 12, 312-326. [ll] Hanan, M. (1966). On Steiner’s Problem with Rectilinear Distance. SIAM J. Appl, Math., 14(2), 2555265.

252

[la] Holmes, I. (1998). Studies in Probabilistic Sequence Alignment and Evolution. Ph.D. Thesis, University of Cambridge and Sanger Centre, UK. [13]

Joseph, D. Meidanis, J. Tiwari, P. (1992). Determining DNA sequence similarity using maximum independent set algorithms for interval graphs. In Proc. of the This &and. Workshop on Algorithm Theory, Springer LNCS 621, 326-337.

[14]

Kent, W. J. Zahler, A. M. (2000). Conservation, Regulation, Synteny, and Introns in a Large-scale C.briggsae-C.elegans Genomic Alignment Genome Research, 10, 1115-1125.

[15]

Lingas, A. Pinter, R. Rivest, R. Shamir, A. (1982). Minimum edge length rectangular partitions of rectilinear figures. In Proc. 6th GI Conf. Theoret. Comput. Sci., volume 145 of Lecture Notes Comput. Sci., Springer-Verlag, 199-210.

[16]

Myers, E. (1986). An O(nd) difference algorithm and its variations. Algorifhmica, 1, 251-266.

[17]

Myers, E. Miller, W. (1995). Chaining Multiple-Alignment Fragments in Sub-Quadratic Time. Proceedings of the Symposium on Discrete Algorithms, 38-47.

[18]

Needleman, S. B. Wunsch, C. D. (1970). A general method applicable to the search for similarities in t,he amino acid sequences of two proteins. Journal of Molecular Biology, 48, 443-453.

[19]

Pachter, L., Alexandersson, M. Cawley, S. (2001) RECOMB 2001: Proceedings of the Fifth International Conference on Computational Molecular Biology.

[20]

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257-286.

[21]

Schwartz, S. Zhang, Z. Frazer, K. A. Smit, A. Riemer, C. Bouck, J. Gibbs, R. Hardison, R. Miller, W. (2000) PipMaker-A Web Server for Aligning Two Genomic DNA Sequences. Genome Research, 10(4), 577-586.

[22]

Zimmer, R. Lengauer, T. (1997). Fast and Numerically Stable Parametric Alignment of Biosequences. RECOMB 2001: Proceedings of the First International Conference on Computational Molecular Biology.

253