Computational Biology and Chemistry 29 (2005) 175–181
Software note
MSAID: multiple sequence alignment based on a measure of information discrepancy Min Zhang a,c,∗ , Weiwu Fang b , Junhua Zhang b , Zhongxian Chi a a
Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China b Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100080, China c College of Information Engineering, Dalian University, Dalian 116622, China Received 22 June 2004; received in revised form 10 December 2004; accepted 10 December 2004
Abstract We propose an algorithm of global multiple sequence alignment that is based on a measure of what we call information discrepancy. The algorithm follows a progressive alignment iteration strategy that makes use of what we call a function of degree of disagreement (FDOD). MSAID begins with distance calculation of pairwise sequences, based on FDOD as a numerical scoring measure. In the next step, the resulting distance matrix is used to construct a guide tree via the neighbor-joining method. The tree is then used to produce a multiple alignment. Current alignment is next used to produce a new matrix and a new tree (with FDOD scoring measure again). This iterative process continues until convergence criteria (or a stopping rule) are satisfied. MSAID was tested and compared with other prior methods by using reference alignments from BAliBASE 2.01. For the alignments with no large N/C-terminal extensions or internal insertions MSAID received the top overall average in the tests. Moreover, the results of testing indicate that MSAID performs as well as other alignment methods with an occasional tendency to perform better than these prior techniques. We, therefore, believe that MSAID is a solid and reliable method of choice, which is often (if not always) superior to other global alignment techniques. © 2004 Published by Elsevier Ltd. Keywords: Multiple sequence alignment; Progressive alignment; Iterative strategy; FDOD measure
1. Introduction Multiple sequence alignment is a common task in bioinformatics. It plays an essential role as an efficient tool for detecting regions of significant sequence similarity in collections of primary structures (i.e., sequences) of nucleic acids and proteins. Multiple alignment is also a useful prerequisite for reconstruction of most probable evolutionary history of a set of homologous sequences of nucleotides or amino acids. For most practical purposes, multiple alignment is superior to the conventional (local) pairwise alignment because simultaneous comparison of many sequences can balance out the ‘noise’ caused by the excessive sequence information that is unrelated to the studied problem or phenomenon. In fact, multiple alignment can be viewed as an extension ∗
Corresponding author. Tel.: +86 411 86663342; fax: +86 411 86663342. E-mail address: zhangmin
[email protected] (M. Zhang).
1476-9271/$ – see front matter © 2004 Published by Elsevier Ltd. doi:10.1016/j.compbiolchem.2004.12.005
of the human-friendly “by eye” alignment done by laboratory biologists in the past. A multitude of alignment algorithms have been devised since the principles of dynamic programming were first applied to biomolecular sequence comparison by Needleman and Wunsch (1970). In fact, today it is a standard practice to use dynamic programming as an optimization technique embedded in pairwise alignment methods. This guarantees a mathematically optimal alignment, given a table of scores for matches and mismatches between all amino acids or nucleotides (e.g., the PAM250 matrix or BLOSUM62 matrix) and penalties for insertions or deletions of different lengths. Quite unfortunately, attempts at generalizing dynamic programming to multiple sequence alignment are limited to small numbers of short sequences (Lipman et al., 1989). The computational complexity in time and space of multi-dimensional dynamic programming are O(Ln 2n ) and O(Ln ), respectively, where L is the maximum length of sequences and n is the number of sequences. In
176
M. Zhang et al. / Computational Biology and Chemistry 29 (2005) 175–181
fact, multiple sequence alignment based on SP scores is an NP problem (Wang and Jiang, 1994). Therefore, for all practical purposes, heuristic approaches are widely adopted in all alignment methods capable of handling large problems (in terms of number and length of sequences.). The most commonly used heuristic methods are based on the progressive alignment strategy (Feng and Doolittle, 1987; Higgins and Sharp, 1988; Thompson et al., 1994; Notredame et al., 2000). They use the idea that homologous sequences are evolutionarily related. Therefore, one can build up a multiple alignment progressively by a series of pairwise alignments, following the branching order in a guide tree (phylogenetic tree). The closest two sequences on the tree are aligned first using normal dynamic programming; this pair of sequences is then fixed and any gaps that have been inserted cannot be shifted anymore. Then, the next closest two sequences are aligned or a sequence is aligned to the existing alignment of the first two sequences, depending on what is suggested by the guide tree. This continues until all sequences have been aligned. The main virtues of this approach are simple and fast. There are two major problems with the progressive approach: (1) the choice of alignment parameters (Thompson et al., 1994) problem, and (2) the local minimum problem. The alignment parameter choice problem is traditionally addressed by the choice of one weight matrix and two gap penalties (one for opening a new gap and another for extending an existing gap) and by the hope that these will work well over all parts of all the sequences in the data set. When the sequences are all closely related, it works. For very divergent sequences, however, those scores given to non-identical residues and gaps will become critical. ClustalW (Thompson et al., 1994), the most widely used progressive multiple alignment program, dynamically varies the gap penalties and amino acid substitution matrices in a specific position and different alignment stages. It improves alignment parameter choices and sensitivity of progressive alignments. The local minimum problem stems from the ‘greedy’ nature of progressive alignment strategy. The algorithm greedily adds sequences together, following the initial tree. It is not guaranteed that the global optimal solution could be found. The cause of failure is probably due to an incorrect branching order in the initial tree. The initial tree is derived from a matrix of distances between separate pairs of sequences and is less reliable than the tree from an entire multiple alignments. One way of improvement is to use an iterative procedure based on the last multiple alignment, recalculate the distance matrix by pairwise alignments included in the entire multiple alignment to create a new guide tree and to produce a new multiple alignment following the new tree. This procedure can be done repeatedly until the guide tree is unchanged. It should be noted that the distance measure, which is calculated
based on the percent of identity residues in pairwise alignments, is not sensitive to the gap insertion. Therefore, the results within an iterative procedure might converge quickly to local minimum solutions. In this paper, we use the concept of complete information set (CIS) to describe the sequences in consideration, and then, the information discrepancy among CIS’s of different sequences can be measured by function of degree of disagreement (FDOD), which uses the word frequencies of the sequences (Fang, 2000; Fang et al., 2001). The coupling effect of closest residues is taken into account by decomposing a sequence into subsequences. Moreover, the effect of residue order along the sequence is involved through the overlaps of subsequences; the longer the subsequence is, the more information of the original sequence it includes. The FDOD measure of information discrepancy is a distance measure and it is also sensitive to the gap insertion. So, we develop a new multiple alignment algorithm MSAID based on the measure of information discrepancy FDOD and attempts to improve the progressive alignment method. MSAID begins with the calculation of the distances between any two of the sequences using FDOD; the resulting distance matrix is used to construct a tree using the neighbor-joining method; the tree is used to produce a multiple alignment using progressive alignment method dynamically varying parameters similar to that used in ClustalW. The result of alignment is then used to produce a new distance matrix by the recalculation of FDOD, which is now based on the frequencies of the words with gaps inserted into the alignment. The process is repeated until convergence or until some stopping rule is satisfied. Although this method cannot guarantee to find the global optimal solution, its sensitivity to the gap insertion may decrease in tendency towards some local minimum solutions. In fact, the accuracy of the multiple alignment can be improved effectively; MSAID is tested and compared with other prior and favorable methods on BAliBASE(Bahr et al., 2001; Thompson et al., 1999a), and it receives, especially for the alignments with no large N/C-terminal extensions or internal insertions, the top overall average in the tests.
2. MSAID algorithm MSAID has two main features. First, it applies FDOD to calculate the distances of each pair of sequences. This method is based on the comparison of contiguous subsequences and involves the residue order along the sequence; therefore, it contains more information of sequences than traditional methods do. Moreover, the value of FDOD depends on the sequences themselves completely; it is free from any additional parameters, such as substitution matrices and gap penalties. The second main feature of MSAID is to use iterative strategy to improve the accuracy of progressive methods since the traditional progressive alignment has the local minimum problem.
M. Zhang et al. / Computational Biology and Chemistry 29 (2005) 175–181
2.1. Complete information set and FDOD Let us introduce two concepts: CIS and FDOD measure. Let Σ = {a1 , a2 , . . ., am } be an alphabet of m symbols, and suppose S = {S1 , S2 , . . ., Ss } is a set of sequences formed from Σ. For protein sequences, the 20 amino acids form the alphabet Σ = {a1 , a2 , . . ., a20 }. As to protein sequence alignments, there are 21 symbols to form the alphabet Σ after a gap is added. We denote the set of all different sequences formed from Σ with length l by Θl ; then, the number m(l) of all sequences of Θl equals ml . For a sequence Sk ∈ S, let Lk be its length and nlik denote the number of contiguous subsequences in Sk , which match the ith sequence of Θl , l ≤ Lk . It is easy to see that m(l) i=1
nlik = Lk − l + 1
for each l ≤ Lk and k.
(1)
Letting plik = nllk /(Lk − l + 1), we obtain a distribution: T
Ukl = (pl1k , pl2k , . . . , plm(l)k ) , where
m(l) i=1
plik = 1
(2)
for each k and l ≤ Lk . Let Γ l denote the set of all distributions m(l) satisfying i=1 plik = 1, i.e., m(l) T Γ l = (pl1k , pl2k , . . . , plm(l)k ) | plik = 1, pl1k ≥ 0 (3) i=1
Thus, for each sequence Sk , we can get a unique set of distributions (Uk1 , Uk2 , . . . , UkLk )
(4)
where Uk1 ∈ Γ 1 , Uk2 ∈ Γ 2 , . . . , UkLk ∈ Γ Lk . This set contains all primary information of a sequence. In particular, UkLk uniquely determines the original sequence; so, we call this set (Uk1 , Uk2 , . . . , UkLk ) a complete information set (CIS) of the sequence Sk . Any two different sequences have different complete information sets (and vice versa). Discrepancies among these sets mean discrepancies of all primary information among sequences; if a measure defined on CIS satisfies the augment, heredity property and other criteria (Fang et al., 2001), it can efficiently discriminate different sequences just by comparing the beginning information subsets with some window size l. A measure of statistical distance between sequences should ideally satisfy the following conditions: non-negativity, boundedness, symmetry, continuity, convexity, triangle inequality, identity and so on. We defined the FDOD measure, which does satisfy these conditions, as: s m(l) l s l l k=1 i=1 pik log(pik /( k=1 pik /s)) l l R(U1 , · · · , Us ) = s logs ≤1 (5)
177
where 0 · log 0/0 is defined as 0 as in the Kullback–Leiber entropy (Kullback, 1959); s denotes the number of the sequences; l denotes the window size. 2.2. Multiple sequence alignment algorithm MSAID The basic progressive alignment algorithm consists of three main stages: (1) a distance matrix giving the divergence of each pair of sequences is calculated; (2) a guide tree is constructed from the distance matrix; (3) the sequences are progressively aligned according to the branching order in the guide tree. MSAID algorithm begins with the calculation of the distance matrix between any two of the sequences using FDOD (where s = 2); the matrix is used to construct a guide tree using the neighbor-joining method (Saitou and Nei, 1987); a multiple alignment is produced by the progressive alignment, in which the parameters are dynamically chosen, such as in ClustalW method. When the sequences are aligned, some gaps may be inserted into the sequences by some criteria; and the FDOD values need to be recalculated based on the frequencies of the words with gaps, hence, a new distance matrix is recalculated, and the guide tree is improved, a new alignment is produced. The complete procedure of MSAID algorithm is: (1) The distances between any two of the sequences are calculated using FDOD. In this way, an N × N distance matrix is produced (N is the number of sequences to be aligned). (2) A guide tree is constructed using the distance matrix and the neighbor-joining method. (3) A progressive alignment is implemented by following the branching order of the guide tree. The closest two sequences on the tree are aligned first using normal dynamic programming. This pair of sequences is then fixed and any gap that has been introduced cannot be shifted later. Then, the next closest two sequences are aligned or a sequence is added to the existing alignment of the first two sequences, depending on what is suggested by the guide tree. This continues until all sequences have been aligned. (4) The pairwise alignments included in the multiple alignment are used to recalculate the FDOD in the next round of the iterative procedure. Speaking in details, because of the gap insertion in the multiple alignment, now the alphabet Σ contains a new element (i.e., the gap). Therefore, the recalculation of the measure FDOD is based on the calculation of the frequencies of the words with gaps, what is worthy to be emphasized is that, the words and their frequencies may be variational along with the proceeding of the iteration of the multiple alignments. (5) The process is repeated from step (2), until convergence or until some stopping rule is satisfied.
178
M. Zhang et al. / Computational Biology and Chemistry 29 (2005) 175–181
The time complexity of the progressive alignment procedure in MSAID is given by: O(NL) + O(N 2 L) + O(N 3 ) + O(NL2 ) where O(NL) is associated with the computation of the word frequencies of the sequences, O(N2 L) the computation of the distance matrix, O(N3 ) the computation of the neighborjoining tree and O(NL2 ) the computation of the progressive alignment (assuming that N sequences of length L can be aligned in a multiple alignment of length L) (Notredame et al., 2000).While the time complexity of ClustalW is given by: O(N 2 L2 ) + O(N 3 ) + O(NL2 ) where O(N2 L2 ) is associated with the computation of distance matrix, and the time complexity of the computation of the neighbor-joining tree and progressive alignment is same as that of MSAID. Therefore, with alignment size as that stated above, L N, the apparent complexities of MSAID and ClustalW program are O(NL2 ) and O(N2 L2 ), respectively.
3. Materials, compared methods and statistical analysis In order to test the accuracy of program MSAID, we used the previous reference alignments of the BAliBASE 2.01; this collection contains 142 protein reference alignments that belong to five categories. For most members, the three-dimensional (3D) structure is available. The BAliBASE multiple alignments were constructed by manual comparison of structure and are validated by structure-superposition algorithms. Thus, the alignments are unlikely to be biased toward any specific multiple alignment method. For analysis purposes, the authors have annotated these alignments by marking core blocks of columns deemed correctly alignments. Each of the five basic categories contains at lease 12 representative alignments. They encompass most of the situations that arise when making multiple sequence alignment. In Reference 1, the alignments consist of a small number of equidistant sequences with similar length to each other, i.e., the percent of residue identity (% ID) between any two sequences is within a specified range, and no large extensions or insertions have been introduced. Reference 2 contains the alignments of a family (closely related sequences with >25% ID), plus up to three ‘orphan’ sequences (distant members of the family with