Ant Algorithm for Construction of Evolutionary Tree Shin Ando School of Engineering, University of Tokyo 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan Abstract - This paper proposes an Ant algorithm implemented with stack count for the purpose of designing suffix representations of tree structures. The algorithm is applied to the problem of reconstructing an evolutionary tree from nucleotide sequences. The algorithm shows compatible results in simulated experiment and alignment of protein sequences from 15 species against the existing algorithms.
I. Introduction In this paper, we present a new ant algorithm to construct similarity tree structures from given set of objects and their mutual distance matrix. The method is based on the extension of ant colony optimization for traveling salesman problem [2]. The construction of evolutionary tree has become one of the major problems in computational biology since DNA and protein sequences have become easier to obtain. The objective in evolutionary tree construction is optimizing the tree score, which is calculated from the sequences at the leaves and the structure of the tree. Better score means higher probability of the tree to be the indication of actual evolutionary history for the nucleotide or DNA sequence. Since the number of tree topologies grows exponentially with the number of nodes, finding the optimal topology is generally an intractable problem. The exact construction of evolutionary trees are known to be NP-complete. Our algorithm searches for a tree structure, which minimizes the score for a given set of DNA sequences. It uses the mutual distance matrix of the leaves as the input. While the ant colony optimization for traveling salesman problem visits the respective cities once to construct a round trip, ants in tree constructing algorithm visit leaves and vertices of the tree to construct a suffix representation of the bifurcating tree. When the ants connect two leaves by visiting a vertex, it is indicated by the +, a suffix to connect the leaf. Our method competes with conventional method of the
0-7803-7282-4/02/$10.00 ©2002 IEEE
Hitoshi Iba Grad. School of Frontier Science, University of Tokyo 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan exhaustive search or the sequential insertion method, taken by the most popular methods. In the simulated experiment, the algorithms showed compatible results against existing software. Our method can also be used for constructing cluster trees for any object where the mutual distance matrix of the trees are given and the objective is to minimize the size of the tree edges. Bioinformatics tasks such as clustering DNA microarray expression patterns and tissue classifications also require the construction of bifurcating trees. II. CONSTRUCTING EVOLUTIONARY TREE Reconstructing the evolutionary history of the proteins and nucleic acids is one of the very important analyses of genome sequence data. It is a fundamental aspect for understanding the structure, functionality, and the evolution of the proteins and DNA. There has been many previous works on the subject[6][5]. An evolutionary tree is an unrooted binary tree such as Figure1, which is composed of leaves (T x1 − 4) of degree 1, or vertex (u, v, w) of degree 3, and the edges which connect them. An evolutionary tree with n leaves has n − 2 vertex and 2n − 3 edges. The number of trees n Q with n leaves is 2i − 5 [13]. Exploration of tree space l=3
by exhaustive search is computationally very expensive. Figure1 shows the example of DNA sequences and evolutionary tree structure. In evolutionary tree, each leaf indicates an evolutionary object, a sequence of a protein or DNA. The tree structure represents the familial relationships or the evolutionary history of the objects. The lengths of the edges correspond to how similar or far the sequences are. The objective is to find a tree structure that minimizes the score function. The total length of the edges can signify the score for the evolutionary tree. In the maximum likelihood method, the shorter edge lengths indicate higher likeliness for the transition of ancestral
of the edges that maximize the likelihood. This method is used to calculate the distance matrix between the sequences in later simulation. IV. SCORING THE TREE The total length of the edges indicate the evaluation of the tree, since the shorter edge lengths indicate higher likeliness for the transition of ancestral sequence to present sequence in the maximum likelihood method. Total edge length of a tree can be obtained from circular tour length, C(S), along the tree edges. The detailed description of C(S) as the tree scoring is given in [8].
Tx0: Tx1: Tx2: Tx3: Tx4:
TAAGTTTACATGCGATTTCT CTGGTCCACACTTCTAAATA GAGCAAGGACTCGACTTCGT AATATAGTTGCCTGACGTCC TACAAATTGATTGGGCCGCG Fig. 1. An evolutionary tree
sequence to present sequence. The construction of optimal evolutionary trees is a very challenging problem, since most versions of the problem are NP-complete [7][9]. Even though the problem has been studied extensively, evolutionary tree construction still remains an open problem. We consider only the unrooted trees in this paper, though the method is completely compatible with rooted tree as well. III. Definition of the Matrix There have been several proposed methodologies for determining the tree structure. The details of these methods can be found in [4]. The parsimony method usually counts the number of amino acid or nucleotide substitutions in a weighted or unweighted manner. They take a multiple sequence alignment (MSA) as an input and take the number of replacement, which could explain the corresponding evolutionary tree as the score of that tree. The maximum likelihood approach scores the tree by the probability that the observed data would have occurred from the tree. The conventional strategy is to search over all trees and for each topology T, find the length
0-7803-7282-4/02/$10.00 ©2002 IEEE
Fig. 2. The circular tour along the tree edges
Equation(1) gives the C(S) for Figure2, which is the sum of the trail indicated by broken arrows. C(S) = d0w + d1u + d2u + d3v + d4w + duv + dvw
(1)
Since the lengths of the edges represent the distances between the leaves, C(S) can be calculated from distance matrix [δij ] (equation 2). C (S) = (δ01 + δ12 + δ23 + δ34 + δ40 ) × 2
(2)
A correct circular tour establishes the upper bound for the best score that can possibly be achieved by a given set of protein or DNA sequences. Both the bound and the circular tour can be derived without an explicit knowledge of the correct evolutionary tree. The circular tour length is equal for all isomorphic trees. In our method, C(S) is used for scoring the tree, thus
a distance matrix that shows the distance between every pair of sequence is required to calculate the score. Further, our method can be used for any scoring function that correlates to the amount of changes along the branches of an evolutionary tree. V. The Extension of Ant Algorithm for Tree Design In this section, we will describe the implementation of our extended Ant algorithm. A. Basic Ant Algorithms The ant colony optimization metaheuristic was inspired by studied behaviors of the ants. Ants communicate among themselves through pheromone, a substance they deposit on the ground as they move about. It has been observed that the ants are attracted to pheromones deposited by the other ants, creating a mechanism similar to reinforcement learning system by concentration of stream of ants to a particular path. [1][3][2] offer detailed information on the works of the algorithm and the choice of the various parameters. In the standard algorithm of ant algorithm for traveling salesman problem, an ant at city i chooses the next city j on probability pkij (t) for pheromone τij (t) and inversed distance ηij (3). The pheromone tables are updated based on equation4 and equation5 for evaporation constant ρ.
pkij
α
(t) = [τij (t)] [ηij ]
X
α
h∈Jik
[τij (t)] [ηij ]
τij (t + 1) = (1 − ρ) · τij (t) + ∆τij (t)
∆τij (t) =
n X
Q (k)
T x0 T x1 T x2 + T x3 + +T x4+
(6)
To make the suffix representation to actually form a tree, certain requirements have to be met. This constraint can be easily explained with the idea of the Stack Count [10]. We assign leaves, the terminals, with the Stack Value of +1 and assign non-terminal vertices with -1. The Stack Count of a tree is the sum of the Stack Value of all its vertices. If the suffix representation of the tree structure is valid, Stack Count should be 1. Further, any subtree of that tree would have +1 as its Stack Count. Thus, while adding up the vertices of a valid representation in sequential order, Stack Count should never subceed 1. (Figure3)
Fig. 3. Stack Count of the Tree from Figure1
While the ants are making the round trip, they hold the history of leaves it has visited, so as not to visit a leaf more than once. At any point during the trip, the Stack Count for the visited leaves and non-terminal vertices cannot subceed 1. This Stack Count constraint prohibits the ant from visiting the non-terminal vertices, until the Stack Count has accumulated up to 2 or more.
(3) C. Choice of the Vertex (4)
(5)
k=1
Each of the N ants will make the round trip times before the iteration is complete. B. Suffix Representation The rules in previous section are inherited in our new algorithm as well. In addition, to represent the tree in suffix representation, an ant must visit N leaves once and the vertices N-1 times in between the cities while making the round trip. The vertices are not specified (as u, v, w), when the ants are making the rounds, and all indicated as + in the representation, since the tree structure is indecisive until the rounds are complete. An evolutionary
0-7803-7282-4/02/$10.00 ©2002 IEEE
tree shown in Figure1 is represented as equation?? in suffix representation.
The Stack Count constrains the ant to not to go to vertices at inappropriate timing. The following rule determines the probability that the ant will choose to go to the vertex, thus connecting the branches to form larger ones. For sequential order of leaves S = {T x0, , , T x3} , which appears in the order of the index, we define ∆(S) by eq. 7. ∆ (S) = δ01 + δ23 − δ02 − δ13 (7) In the evolutionary tree, the edges indicate the distances between the leaves. Thus, as in Figure4, if Tx1 and Tx2 are connected by a vertex, then ∆(S) = 0. When inner two leave are not connected by a vertex, for example ∆ (S 0 ) = (S 0 = {T x1, , , T x4}), ∆ (S) > 0 as in Figure5. When the ant, currently at leaf i, chose j as the next visiting leaf according to the equation, and has the latest history of visiting g, h (Figure6), calculate ∆(S) for {S = g, h, i, j} .
Fig. 4. ∆ (S) = δ01 + δ23 − δ02 − δ13 = 0
Fig. 7. ∆ (S) = δ01 + δ14 − δ03 − δ14 {S = T x0, subtree(T x1, T x2), T x3, T x4}
=
0
If the ∆(S) is small, the ant is highly probable to go to the vertex before going to next leaf j. The probability P is determined using the largest distance D of in the distance matrix and k, an empirical parameter. P = k∆(S)/2D(2D > ∆(S))
Fig. 5. ∆ (S) = δ12 + δ34 − δ13 − δ24 = 2 × δuv > 0
(8)
When the history of the ant includes a vertex, the calculation of ∆(S) is slightly modified. As in Figure7, when a vertex connects a leaf (T x3) and a subtreeSt(T x1, T x2), ∆(S) for S = T x0, St, T x3, T x4 is equal to 0. ∆(S) is calculated by representing the subtree by one of the consisting leaf. This enables the calculation of ∆(S) for all S. Note that a leaf is a form of a subtree. The revised algorithm for deciding whether to go to a vertex, would be as follows: An ant at a vertex in the current subtree I, which chose j as the next leaf to visit, and whose two previously visited subtree, G and H, will go to a vertex before going to j by probability of P = k∆(S)/2D(2D > ∆(S)) for {S = G, H, I, j} . The visited vertex can be identified by subsumption of the Stack Count (Figure 8). VI. Simulated Experiment
Fig. 6. History of visited leaves and choosing the vertex
0-7803-7282-4/02/$10.00 ©2002 IEEE
The simulation is done as follows: Based on a specified evolutionary tree structure, random sequences of nucleotide are generated. We used Seq-Gen, a sequence generator software. It simulates the evolution of nucleotide sequences along a phylogeny, using common models of the substitution process. Thus we know the correct tree for the set of sequences in each simulation.
Fig. 8. Subsumption of the Stack Count
ρ = 0.1
K=0.2
α = 1.2
N=16
M = 1000
TABLE I Ant Algorithm Parameters Fig. 10. 16-leaf target tree
Then, distance matrix between all the sequences is computed by dnadist, of the Phylip program. It computes a distance measure for DNA sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix. The input of sequence and distance matrix were also fed to FITCH and Neighbor algorithm of the PHYLIP programs[11][12] for comparison. The simulation was executed under different evolutionary tree parameter settings, which include maximum distance in the matrix , sequence length, and number of leaves. The structures of the trees for 8-leaf and 16-leaf simulation are shown in Figure9 and 10. The parameters for ant algorithms are shown in TableI, but they have not been proven to be the best parameters. Results of the simulation are summarized in Table II and Table III. In the first experiment for a Figure9 tree, the ant algorithm gave the best results in general. For 16 leaves, FITCH program gave the best result when the
D -3 3-6 6-
FITCH 85% 60% 35%
Neighbor 78% 33% 12%
Ant 82% 78% 64%
TABLE II 8-leaf simulation result
maximum distance was small, and otherwise the ant algorithm gave the best. More simulation should be done on larger trees to see which particular method is suited for specific parameter settings. We draw conclusion that this ant algorithms is useful and produces good results compatible to popularly used tree construction algorithms. VII. Multiple Alignment Examples We have also tested the method on an example protein family of the Cytochrome P450 CYP050A. The sequences of 15 species (Table IV) were taken from Cytochrome P450 Database (http://cpd.ibmh.msk.su/). The sequence lengths differ from 414 to 561, as indicated in Table IV. The multiple sequence alignment is obtained by ClustalW software[14]. The distance matrix is calculated by protdist of Phylip programs. Again, the comD -3 3-6 6-
Fig. 9. 8-leaf target tree
0-7803-7282-4/02/$10.00 ©2002 IEEE
FITCH 85% 60% 35%
Neighbor 78% 33% 12%
Ant 82% 78% 64%
TABLE III 16-leaf simulation result
arabidopsis thaliana penicillium italicum candida albicans rat candida glabrata baker yeast tomato schizosaccharomyces pombe human wheat issatchenkia orientalis smut fungus mycobacterium tuberculosis grape powdery mildew fungus rice
489 515 528 503 533 530 485 495 509 453 414 561 451 524 427
TABLE IV 15 species and the length of their CYP sequences
Fig. 11. Cytochrome P450 tree constructed with new Ant Algorithms
parison is made to FITCH and Neighbor programs, and also tree drawing algorithms of ClustalW. All the algorithms agreed on the tree topology, shown in Fig. 11. This result represents a very feasible structure, and the Ant algorithms gave the best score of 4023.8. VIII. Discussion We have introduced a new tree construction method that constructs the tree with the minimum score for a given set of sequences. By using the ants, we have introduced a metaheuristic search to a NP problem. By suffix representation and vertex choosing scheme, we have extended the possibilities of the ant colony
0-7803-7282-4/02/$10.00 ©2002 IEEE
optimization application. The method was compatible with existing software in the presented experiments. We will further test the algorithm for the scalability and accuracy by applying it to real MSAs. We should also discuss the noise factor in our further work. References [1] Colorni, A., M. Dorigo, et al. (). Distributed optimization by ant-colonies. European Conference on Artificial Life (ECAL91). [2] Dorigo, M. and G. D. Caro (1999). ”The Ant Colony Optimization Meta-Heuristic.” New Ideas in Optimization. [3] Dorigo, M. and L. M. Gambardella (1997). ”Ant colonies for the travelingsalesman problem.” Biosystems 43: 73-81. [4] Durbin, R., S. Eddy, et al. (1998). Biological Sequence Analysis: Probabilistic Models of Protein and Nucleic Acids, Cambridge University Press. [5] Felsenstein, J. (1996). ”Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods.” Methods Enzymol 266: 418-27. [6] Fitch, W. M. and E. Margoliash (1967). ”Construction of phylogenetic trees.” Science 155(760): 279-84. [7] Foulds, L. R. and R. L. Graham (1982). ”The steiner problem in phylogen is np-complete.” Proc. Natl. Academ Science 3: 43-49. [8] Gonnet, G., C. Korostensky, et al. (1999). ”Evaluation measures of multiple sequence alignments.” Journal of Computational Biochemistry. [9] Jiang, T. and L. Wang (1994). ”On the complexity of multiple sequence alignment.” J. Comp. Biol. 1: 337. [10] Keith, M. J. and M. C. Martin (1997). ”Genetic Programming in C++: Implementation Issues.” Advances in Genetic Programming. [11] Lim, A. and L. Zhang (1999). ”WebPHYLIP: a web interface to PHYLIP.” Bioinformatics 15(12): 1068-9. A web interface to PHYLIP (version 3.57 C) is implemented using CGI/Perl programming. It enables users to do phylogenetic analysis through the Internet. [12] Retief, J. D. (2000). ”Phylogenetic analysis using PHYLIP.” Methods Mol Biol 132: 243-58. [13] Rohlf, F. J. (1983). ”Numbering binary trees with labeled terminal vertices.” Bull Math Biol 45(1): 33-40. [14] Thompson, J. D., D. G. Higgins, et al. (1994). ”CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.” Nucleic Acids Res 22(22): 4673-80.