Evaluation Measures of Multiple Sequence ... - Semantic Scholar

Report 9 Downloads 158 Views
Evaluation Measures of Multiple Sequence Alignments Gaston H. Gonnet, *Chantal Korostensky and Steve Benner Institute for Scienti c Computing ETH Zurich, 8092 Zuerich, Switzerland phone: ++41 1 632 74 79 e-mail: fgonnet,[email protected]

Abstract Multiple sequence alignments (msas) are frequently used in the study of families of protein sequences or DNA/RNA sequences. They are a fundamental tool for the understanding of structure, functionality and ultimately evolution of proteins. A new algorithm, the Circular Sum (CS) method, is presented for formally evaluating the quality of a multiple sequence alignment (msa). It is based on the use of a heuristic that approximates a solution to the Travelling Salesman Problem, which identi es a Eulerian tour through an evolutionary tree connecting the sequences in a protein family. The algorithm gives an upper bound, the best score that can possibly be achieved by any msa for a given set of protein sequences. Alternatively, if presented with a speci c msa, the algorithm provides a formal score for the msa, which serves as an absolute measure of the quality of the msa. The CS measure can be calculated without the need to construct a corresponding evolutionary tree. It does, however, yield a direct connection between a msa and the associated evolutionary tree, which allows us to use the measure to construct and evaluate evolutionary trees and as a tool for evaluating di erent methods for producing msas. A brief example of the last application is provided. Because it weights all evolutionary events on a tree identically, the CS algorithm has advantages over the frequently used sum-of-pairs measures for scoring msas, which weight some evolutionary events more strongly than others.

Keywords: multiple sequence alignment, phylogenetic tree, scoring function, TSP, evolution

2

1 Introduction One of the most important applications of the data being generated from genome sequencing projects is to reconstruct models for the evolutionary histories of proteins, nucleic acids, and the organisms that carry them. A model for an evolutionary history typically consists of three parts: (a) an evolutionary tree, which shows the familial relationships of evolutionary objects (proteins, for example), (b) a multiple sequence alignment, which shows the evolutionary relationships between parts of these objects (in this case, individual amino acids in the protein sequence), and (c) reconstructed ancestral sequences, models for objects that were intermediates in the evolution of the family. Evolutionary histories are important in deducing biological function in biomolecules, predicting the folded conformation of protein sequences, and reconstructing the history of life on Earth. Constructing trees, multiple sequence alignments, and ancestral sequences encounters three sorts of problems. First, all are dependent on a model for evolutionary processes. At the level of the protein (which will be our exclusive concern here), modelling the evolutionary history of a family must begin with assumptions about the frequencies of amino acid mutations, insertions, and deletions. These frequencies are now available from empirical studies of protein sequences (Gonnet et al., 1992). In this work, a simple model, derived originally from work of Margaret Dayho (Dayho et al., 1978) and subsequently ampli ed, will be used. Second, constructing models for evolutionary histories encounters problems of mathematical complexity. The multiple sequence alignment (msa) problem and the construction of evolutionary trees is NP complete, and becomes computationally expensive for many 3

real protein families encountered in a contemporary database. This requires that virtually all msas and evolutionary trees that will be used in the post-genomic era will be constructed using approximate heuristics. The use of heuristics in constructing multiple sequence alignments creates a third problem, one centering on evaluation. Before using a msa or tree built by a heuristic, one would like to know approximately how closely the heuristic has approximated an optimum msa or tree. Even today, it is common for biochemists to evaluate by eye (and adjust by hand) the output of multiple sequence alignment tools. This is a clearly inadequate approach for any systematic reconstruction of natural history using genomic sequence data. A formal method for judging the quality of a multiple sequence alignment is needed. Accordingly, a variety of groups have proposed or used scoring functions (Altschul, 1989; Thompson et al., 1994; Higgins and Sharp, 1989; Lawrence et al., 1993) that assess the quality of a msa, following a simple approach that examines every pair of proteins in the family, generates a score for each pairwise alignment using a Dayho matrix (Dayho et al., 1978), and creates a score for the msa by summing each of the scores of the pairwise alignments. We shall call these \sum of pairs" (SP) methods. A

B

1

C

2

3

4

Figure 1: A sample evolutionary tree (phylogenetic tree)

4

Although simple to implement, SP methods are obviously de cient from an evolutionary perspective. Consider a simple tree (Figure 1) constructed for a family containing four proteins. The score of a pairwise alignment of sequence 1 and 2 (as generated by a standard SP scoring method) evaluates the likelihood of evolutionary events on edges B!1 and B!2 of the tree, the edges that represent the evolutionary distance from sequence 1 to sequence 2. Likewise, the score of a pairwise alignment of sequences 3 and 4 evaluates the likelihood of evolutionary events on edges C!3 and C!4 of the tree. By the same logic, scoring a pairwise alignment of sequences 1 and 3 evaluates the likelihood of evolutionary events on edges A!B, B!1, A!C, and C!3 of the tree. A

B

1

C

2

3

4

Figure 2: Traversal of a phylogenetic tree using the SP measure (1-2, 1-3, 1-4, 2-3, 2-4, 3-4) By adding \ticks" to the evolutionary tree (Figure 2), it is readily seen that if the SP methods are completely implemented, di erent episodes of the evolutionary history of the protein family are counted di erent numbers of times. In this example, episodes A!B and A!C are each counted four times by the SP method, while episodes B!1, B!2, C!3, and C!4 are each counted only three times. There is no theoretical justi cation to suggest that these earlier evolutionary episodes are more important than later ones. Thus, SP methods are intrinsically problematic from an evolutionarily perspective for scoring 5

msas. Further, it is clear that as the number of sequences in the protein families in general increases as genome projects are completed, the over-weighting of more ancient events in preference to more recent events increases as well. What is needed is a scoring tool that counts each evolutionary episode equally in evaluating a multiple sequence alignment. We report here a \circular sum" (or CS) method for evaluating the quality of a multiple sequence alignment. The method exploits a heuristic that approximates a solution to the Travelling Salesman Problem, which identi es a Eulerian tour through an evolutionary tree connecting the sequences in a protein family. The algorithm gives an upper bound, the best score that can possibly be achieved by any msa for a given set of protein sequences. The bound can be derived without an explicit knowledge of the correct evolutionary tree; thus the method can be applied without need to address the mathematical issues involved in tree construction. Last, it gives us an absolute score of multiple sequence alignments which is important in designing and verifying multiple sequence alignment heuristics.

6

2 Methods The task of generating an evolutionary unbiased score for a multiple sequence alignment can be rephrased as a simple problem: How can a tree be traversed, going from leaf to leaf, such that all branches (each representing its own episode of evolutionary divergence) are counted the same number of times? Assume that a tree contains 4 leaves, numbered from 1 to 4. The problem can be solved by walking through the tree in a circular order (known as Eulerian tour in a undirected graph), that is, from leaf 1 to 2, from 2 to 3, from 3 to 4, and then back from 4 to leaf 1 (Figure 3). A

B

1

C

2

3

4

Figure 3: Traversal of a phylogenetic tree in circular order (1-2, 1-3, 1-4, 2-3, 2-4, 3-4) The Eulerian tour traverses all branches and visits each leaf exactly twice. Thus, all branches of the evolutionary tree are weighted equally. It is easily veri ed that many good circular routes exist, e.g. 1-2-4-3, but not 1-4-2-3 (Figure 4). In the second example, branches A!B and A!C are counted four times, while branches B!1, B!2, C!3, and C!4 at the leaves are counted only twice.

7

Circular order

Non-circular order

A

A

B

1

C

2

3

B

4

1

C

2

3

4

Figure 4: Traversal of a phylogenetic tree in a circular (1-2-4-3) and non-circular order (1-4-2-3)

2.1 Probabilities and Scores Scores of Pairwise alignments To obtain a score from a Eulerian path traversing an evolutionary tree, we must evaluate a probability of the evolutionary events that are represented by each segment of the tree (or graph) traversed. For example, in Figure 3), the branch B!1 is associated with the probability that the common ancestor of 1 evolved from ancestor B, while the branch B!2 is associated with the probability that the same ancestor evolved to give protein 2. The probability that protein 1 and protein 2 are related is therefore the product of the two probabilities associated with segments B!1 and B!2. To actually calculate these probabilities, one applies a Markovian model for sequence evolution. This begins with an alignment of the two sequences, e.g. VNRLQQNIVSL____________EVDHKVANYKPQVEPFGHGPIFMATALVPGLYLLPL VNRLQQSIVSLRDAFNDGTKLLEELDHRVLNYKPQANPFGNGPIFMVTAIVPGLHLLPI

The gaps arise from insertions (or their counterpart deletions) during divergent evolution. The alignment is normally done by a dynamic programming (DP) algorithm using Dayho 8

matrices (Needleman and Wunsch, 1970), which nds the alignment that maximizes the probability that the two sequences evolved from an ancestral sequence as opposed to being random sequences. More precisely, we are comparing two possibilities a) that the two sequences arose independently of each other (implying that the alignment is entirely arbitrary, with amino acid i in one protein being aligned to amino acid j in the other is occurring no more frequently than expected by chance, which is equal to the product of the individual frequencies with which amino acids i and j occur in the database)

Prfi and j are independentg = fifj

(1)

b) that the two sequences have evolved from some common ancestral sequence after t units of evolution where t is measured in PAM units, the number of point accepted mutations per 100 amino acids (Gonnet, 1994b). x

i

(unknown) ancestor sequence

j

present day sequences

Prfi and j descended from xg = = =

X x

X x

X x

fxPrfx ! igPrfx ! j g

(2)

fx(M t)ix(M t)jx

(3)

fj (M t )ix(M t )xj

(4)

= fj (M 2t)ij = fi(M 2t)ji 9

(5)

The entries of the Dayho matrix are the logarithm of the quotient of these two probabilities.

Dij = 10 log10

Prfi and j descended from some xg Prfi and j are independentg

!

(6)

Scores are de ned as follows: Score = 10 log10

Prfeventg Prfnull eventg

!

(7)

The constant 10 and the logarithm base 10 are used for historical/practical purposes, so that scores are comparable with standard sequence alignments. Scores are preferred over probabilities because they can be added and have a more reasonable range.

Probability of an evolutionary tree con guration To obtain the probability for an entire evolutionary tree con guration, one must sum the logarithms of the probabilities of each event represented by the tree, which yields an overall score that is equivalent to the logarithm of the product of the probabilities of each event. For the example tree in Figure 3, the total evolutionary history has six associated probabilities: A!B, A!C, B!1, B!2, C!3 and C!4. If we knew the sequences of the ancestral proteins at the internal nodes of the tree, the probability of the entire tree is:

Prftree g 2g = Prf A!B gPrf A!C gPrf B!1 gPrf B!2 gPrf C!3 gPrf C!4 g (8) 10

where Prf X!Y g is interpreted as the probability of mutating from the amino acid in the node X to the amino acid in node Y according to the distance of the branch X-Y. The individual probabilities would be obtained by aligning the protein sequences at the beginning and end of each episode of evolution and scoring them using a Dayho matrix (for example) as described above. Normally, of course, the sequences for the ancestor proteins A, B and C are not known, as the organisms that contained them have long since died. In (Gonnet and Benner, 1996), a formula for the probability of an entire evolutionary con guration (the ensemble of the phylogenetic tree and the corresponding msa) is derived. It corresponds exactly to the notion of computing the probability of traversing each edge of the tree. We can compute the probability of the tree con guration by adding the probabilities of all episodes represented by each of the lines.

Prftree g 2g =

X A

fA

XX B C

Prf A!B g

(9)

where PX denotes a sum where X takes all 20 possible amino acid values. The probability associated with the entire evoutionary con guration is:

Prfevol.confg =

m X Y k=1 I1 [k]

fI1[k]

X I2 [k]

:::

X Y In?1 [k] a

da MDe [k] ;An[k] a a

(10)

Where k runs from 1 to m (each amino acid), Ij is the unknown sequence at the internal node j , and hence Ij [k] runs through the 20 amino acids, Qa runs through all the branches

of the phylogenetic tree, and De[ak] , An[ak] are the descendent/ancestor amino acids at position k of the branch a, and da is the length in PAM units of the branch a. 11

2.2 Travelling Salesman Problem Application The probability (exponential of the score) derived from pairwise alignments are now the key to identifying an Eulerian Tour. For a set of protein sequences it is computationally simple to obtain a set of (n)(n +1)=2 pairwise scores by aligning each sequence with every other sequence using a DP algorithm to obtain the Optimal Pairwise Alignment. We shall refer to these as OPA scores, to distinguish (see below) these from a pairwise alignment inferred from a msa. These scores can be obtained without either a tree or a msa. To compute probabilities of composed paths, we have to multiply the probabilities. By using logarithms of probabilities (scores) we need to add them. It is readily seen that an Eulerian tour is the shortest tour through a tree. Any other route would evaluate at least one branch or evolutionary event more than twice which would result in a lower probability (and in a longer path). Hence, using such an ordering also implies maximizing the probability of the evolution for a particular phylogenetic tree. This means that we would be using an underlying phylogenetic tree which is the one with the highest probability, or a maximum likelihood tree. To nd such an Eulerian tour using OPA scores, one applies a heuristic for addressing the Travelling Salesman Problem, [ref] a well known task that seeks the shortest route between a set of points that visits each point once1. Since the CS score using OPA scores is the maximum possible score for a given set of sequences, we refer to it as CSmax. 1

We understand that the TSP may be very dicult to solve, in the sense that it may have super-

polynomial complexity. For real computations, this is not a problem. First, we only need an approximate solution, as all our distances are estimates and have an inherent random noise.

12

To calculate the score, let  be the output permutation of the TSP, then the maximum score is n X max(CS ) = 21 OPAScore(Si ; Si+1 ) i=1

(11)

with n+1 = 1 and OPAScore being the score of aligning two sequences as de ned above. The maximal score CSmax serves as an upper bound. This is an immediate conclusion from the observation that the score for each pair of sequences obtained using dynamic programming is the maximum score of any possible alignment (OPA score). Whenever we score a msa, regardless of the algorithm used to generate it, the score of this alignment must be less than or equal to the upper bound. The upper bound can be used to evaluate a given msa. The closer the score of a calculated msa is to the upper bound, the better (or more likely) the msa is. We have now a way to determine a circular order and thus an upper bound on the score of a msa and its associated evolutionary tree for a set of sequences. This score has been obtained without requiring the calculation of an evolutionary tree. Only the sequences at the leaves and their OPA scores with each other are needed. Nevertheless the CSmax score corresponds to the probability of an evolutionary tree con guration.

Circular vs non-circular paths To gain an intuition for the di erence between the scores of circular versus non-circular ordering, a simulation was run. Evolution was simulated starting with 100 random sequences, each of which was evolved using a Markovian model of evolution. At the end of the simulation, each ancestor had generated 16 descendants. Each of the 100 trees was scored in three di erent ways. First, the maximum score 13

was calculated using the TSP ordering that is calculated from the OPA scores ( rst row). Second, real scores have inherent error, implying that CS scores need not be identical if calculated by di erent circular paths, even though all are Eulerian 2. Therefore, the average score of eight other circular orders were calculated (second row). The variance of the circular path lengths is an estimate on the quality of the data 3 . Last, the average was calculated for the scores of eight non-circular orderings (third row)(Figure 5). Order

Score

Max score ? Score

TSP

11012  272

0

circular

10897  263

116  68

non-circular 5989  159

4908  145

Figure 5: Comparing di erent orders against the TSP order. The second column is the absolute score, the third column is the di erence of the score from the TSP ordering minus the score from the circular or non-circular orders

2 3

we expect, of course, these scores to be similar Note that the di erence between any circular and any non-circular path is at least twice the length

of the shortest branch in the tree, because any non-circular order counts at least one branch more than twice

14

3 Scoring of Multiple Sequence Alignments without evolutionary trees The CS algorithm provides a method to evaluate a speci c msa or to compare di erent algorithms for consructing msas. When a msa is scored using the CS measure, the score is the probability of such an alignment happening due to common acestry compared to random sequences. To do so, we abandon scores obtained pairwise from OPAs. Instead, the pairwise relationship between two sequences in the protein family are extracted from the multiple sequence alignment itself. These are scored using a Dayho matrix without dynamic programming re-alignment to give an optimal alignment. These are msa-derived pairwise alignments (MPA). MPA scores must be equal to or less than the scores obtained from the OPA pairwise alignments. The CS method can be applied now again without the need for an explitly de ned evolutionary tree by repeating the procedure outlined above, but using MPA scores instead of OPA scores as input for a TSP algorithm. The CS score of the msa is then the sum of MPA scores in the circular order  divided by two: n m X X Score = 21 (Si [k]; Si+1 [k]) k=1 i=1 (a; b) = Dab

(12) (13)

( ; b) = (a; ) = Deletion penalty (the whole gap is scored)

(14)

( ; ) = 0

(15)

(6 b; ?) = (?; 6 b) = 0

(16) (17) 15

where k runs through each amino acid in a sequence, m is the length of the msa, n is the number of sequences, a and b are amino acids, ' ' is an internal gap (deletion), '6 b' is a gap at the beginning or end of the sequence, and '?' represents any symbol. Note that if there is a deletion in both sequences, there is no penalty (score is zero) because that deletion happened in some ancestor, and its penalty has been counted already.

3.1 Connection between Trees and Msas The formula for deriving the CS score for a msa and for its associated evolutionary tree are identical. When we score a msa with the CS method, we immediately also derive the score of the probability of the entire evolutionary con guration represented by the tree that corresponds to the msa. Since the score was derived from an Eulerian tour, it is maximal and hence it is the score of the maximum likelihood tree for that particular multiple sequence alignment. Similarly, the maximal score CSmax based on the OPA scores is both an upper bound on the probability for the maximum likelihood tree and it is also an upper bound on the probability of an msa for the underlying sequences.

16

4 Simulation of Evolution To illustrate how the scoring function can be used, a variety of tools for generating msas were challenged with a set of protein families simulated following a Markovian model of evolution, and the outputs of each evaluated using the CS tool. This provides, of course, only an approximate assessment of the msa tools themselves. A better assessment must come with actual experimental sequence data. Random trees with a given structure and branch lengths and a random sequence at the root were generated. From this, sequence mutations, insertions and deletions were introduced according to the length of the branches of the tree. At each internal node a new sequence ws thus generated. At the end of the simulation, only the sequences at the leaves are retained. Since both the places of insertions and deletions, as well as the \real" tree are known, the correct msa is known as well. The retained sequences at the leaves can be given to di erent algorithms (MSA(Gupta et al., 1996; Lipman et al., 1989), MAP(Huang, 1994), ClustalW(Higgins and Sharp, 1989; Thompson et al., 1994) and the Probabilistic model (PAS) (Gonnet and Benner, 1996; Gonnet, 1994a)), and the score of the calculated msas can be compared to the score of the \real" (generated) msa using the CS measure. The results for 3 combinations of trees and sizes are shown. These are representative of all other results (Figures 6-8). For every tree the upper bound (CSmax) was calculated and compared to each of the scores (third column). Higher scores or smaller di erences mean better alignments. The rows are sorted into ascending order in this column. Figure 6 shows the result for balanced binary trees with 16 leaves. The length of the 17

Method

CS Score

Upper bound ? CS score

Upper bound: 22451  516

0

MSA:

21767  513

684  76

PAS

21414  481

740  74

Real:

21367  439

781  74

ClustalW:

21330  473

822  76

16

30

23

26 6

8

4 5

MAP:

18633  608

16

2

3521 583

32

5948  911

16200  826

39

28 40

4

8

18

16 8

44 24

13

8 13

5

35

9 7 33

15

1

16

14

4

6

4

4

Dummy:

22

16

28

44

3

15

10

11

12

Figure 6: Comparison of di erent msa methods: the CS score (second column) is calculated using a TSP ordering. The upper bound is the CS score based on the optimal pairwise alignment. The smaller the di erence between the upper bound and the CS score is (third column) the better the alignment sequences is 300 amino acids and the average branch distance is 30 PAM (so the maximum PAM distance between two sequences is about 240 PAM). For this tree, MSA scored the best followed by PAS, ClustalW and MAP. The alignments of MSA and PAS score even slightly better than the \real" alignment. The reason for this is that the real msas are not necessarily optimal. As a comparison the score of a bad alignment (all the sequences aligned without deletions) was calculated in the last row (Dummy). The tree on the right is an example of a generated phylogenetic tree. The next trees (Figure 7) are unbalanced with 30 sequences of length 300 and the average branch distance is 30 PAM (so the maximum PAM distance is about 300). Only 18

CS Score

Upper bound ? CS score

Upper bound:

40282 581

0

Real:

38796 573

1486  138

PAS

38638 594

1643  151

ClustalW:

38465 597

1818  153

Method

7 25

32

36

17

17 26

28 3

43 43

21238 1565

18957 1507

8

30

7

9

20

3

12

15

10 11

18

4

32

21 13

3

33 35 22 36

42

23

26

33

4

13 7

14

5

32

16

14 30

19

8

21

33

27

15 37

18

12

30

6

29

23

26

97

2

28

31

22

16 15

32

17

17

21

14

37

16

37

41

5

MAP

19

11

6

26

30

25 24

109

30

42

Dummy:

10609 1321

28641 1308

1 20

3

Figure 7: Comparison of di erent msa methods: the CS score (second column) is calculated using a TSP ordering. The upper bound is the CS score based on the optimal pairwise alignment. The smaller the di erence between the upper bound and the CS score is (third column) the better the alignment PAS, MAP and ClustalW were able to compute the alignments (in a reasonable time). In this case PAS did slightly better than ClustalW. The trees of the last table shown here (Figure 8) had 50 sequences with length 300 and an average branch length of 15 PAM. In the simulation the PAS method was the best followed by ClustalW. No other methods were able to calculate a msa in a reasonable time. These experiments show how the CS measure can be used. Obviously the ultimate evaluation of tools must be done with real rather than simulated data.

19

CS Score

Upper bound ? CS score

Upper bound:

91232 924

0

Real:

88960  913

2272  167

PAS:

88494  941

2704  175

ClustalW:

87313  1654

3884  1363

Method

32 43

31 28

35

37

1 25

31

31 33

31

39

41 44

44

2 23

34062  4558

574563954

1 43

27

3

13 1

3 10 11

6

12

8

15

13

5

22

37

12 28

1

23 19 99 44

18

23

26

85

33

113

23

33

19 41

15

5 21

33 23

44 102

26

43

46 50

38

31

25

41 42

55

35

29 30 16

22

18 18

37

33

32

22 24

32

148

17

47

12

48

38 49

34

28 45

17

40 24

9

40

26

37

31

19

7 37

38 5

21

16 31

20

17

10 18

8

2

6

7

14 15

3 13

24

2 3 46

38

39

17

24

24 23

30

3

28

10 30

22

Dummy:

54

17 4

41

30

37

29 44

20

27

36

Figure 8: Comparison of di erent msa methods: the CS score (second column) is calculated using a TSP ordering. The upper bound is the CS score based on the optimal pairwise alignment. The smaller the di erence between the upper bound and the CS score is (third column) the better the alignment

5 Discussion We have de ned a new scoring function, called CS measure, for the evaluation of msas that is based on a Markovian model of evolution. The frequently used SP measure, which calculates the sum of the scores of all pairwise alignments, ignores the structure of any associated phylogenetic tree and thus weights some evolutionary events more than others. This can be avoided by traversing the tree in a circular order, where all branches are traversed exactly twice (Eulerian tour). Any other non-circular tour results in a longer path, because some branches are traversed more than twice, and hence some evolutionary events are counted more often, which corresponds to a lower probability. Therefore the shortest path through the tree, wich is an Eulerian tour, yields the highest CS score. Such a tour can be calculated with any TSP algorithm. 20

The TSP application allows us to avoid the calculation of an evolutionary tree altogether, and only the sequences at the leaves are needed with their OPA or MPA scores. In addition, the CS measure yields a direct connection between a msa and the associated evolutionary tree, because the formula for the calculation of the score is exactly the same. This allows us to use the CS measure to construct and evaluate evolutionary trees. The CS measure based on the OPA scores gives an upper bound on the score of the optimal msa and evolutionary tree. To illustrate the new scoring tool we simulated evolution by generating random trees according to a Markovian model with the corresponding msa. The CS score of the generated alignment was then compared to the score of alignments calculated by di erent methods (MSA, ClustalW, PAS, and MAP). For less than twenty sequences MSA and PAS gave the best score, whereas for larger samples PAS and ClustalW calculated the best alignments. A better assessment must come with actual experimental sequence data though. With the new CS measure we have now the possibility of improving current algorithms and nding new heuristics by maximizing the scoring function. The description of the new methodologies exceeds the scope of this paper and will be available in (Gonnet and Korostensky, 1998). Our scoring function and msa algorithms can be used freely through the CBRG server at http://cbrg.inf.ethz.ch.

21

References Altschul, S. F. (1989). Gap costs for multiple sequence alignment. J. theoretical Biology, 138:297{309. Dayho , M. O., Schwartz, R. M., and Orcutt, B. C. (1978). A model for evolutionary change in proteins. In Dayho , M. O., editor, Atlas of Protein Sequence and Structure, volume 5, pages 345{352. National Biochemical Research Foundation, Washington DC. Gonnet, G. H. (1994a). New algorithms for the computation of evolutionary phylogenetic trees. In Suhai, S., editor, Computational Methods In Genome Research, pages 153{ 161. Plenum Press, New York. Gonnet, G. H. (1994b). A tutorial introduction to computational biochemistry using Darwin. Gonnet, G. H. and Benner, S. A. (1996). Probabilistic ancestral sequences and multiple alignments. Fifth Scandinavian Workshop on Algorithm Theory, Reykjevik July 1996. Gonnet, G. H., Cohen, M. A., and Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science, 256:1443{1445. Gonnet, G. H. and Korostensky, C. (1998). New heuristics for multiple sequence alignments. In preparation.

22

Gupta, S. K., Kececioglu, J., and Scha er, A. A. (1996). Improving the practical space and time eciency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. J. Computational Biology. To appear. Higgins, D. and Sharp, P. (1989). Fast abd sensitive multiple sequence alignments on a microcomputer. CABIOS, 5:151{153. Huang, X. (1994). On global sequence alignment. CABIOS, 10(3):227{235. Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton, J. C. (1993). Detecting subtle sequence signals: A gibbs sampling strategy for multiple alignment. Science, 262:208{214. Lipman, D. J., Altschul, S. F., and Kececioglu, J. D. (1989). A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA, 86:4412{4415. Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443{453. Thompson, J., Higgins, D., and Gibson, T. (1994). Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positionsspeci c gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673{ 4680.

23