Fast and Numerically Stable Parametric Alignment of Biosequences Ralf Zimmer and Thomas Lengauer GMD - German National Research Center for Information Technology SCAI - Institute for Algorithms and Scientific Computing Schlol3 Birlinghoven, 53754 Sankt Augustin, Germany {Ralf.Zimmer,Thomas.Lengauer}@gmd.de
by sets of carefully chosen numerical parameters. The selection of these parameters is one of the central difficulties in sequence alignment. It is probably fair to say that, presently, more work goes into optimizing the parameters than into developing new algorithms for sequence alignment. Yet, the cost functions for alignment are not accurate enough to ensure that an alignment which is optimal w.r.t. the cost function is also necessarily the correct one or at least useful for biological applications. There are two approaches to dealing with this situation. One approach is to extend the analysis beyond the set of optimal alignments by including also suboptimal (but near optimal) alignments in order to find biologically meaningful alignments with higher certainty. The analysis of this set of alignments can then also yield insights into the deficiencies of the cost function and the parameter settings. The second approach is to keep variable those parameters in the cost function for which we cannot offer convincing and reliable values. Here, the favored parameters are gapinsertion and gap-extension cost values. We will also use this pair of parameters as our running example in the paper. Consider the following alignment problem. Let A be an alignment of two sequences A and B. We assume that we want to maximize alignment costs. The cost of A is defined as (1) hA(% B) = 7x4 i- wd + bd
Abstract
Parametric alignment is an approach towards dealing with the uncertainties that are inherent in the cost parameters involved in the alignment of biosequences. In parametric alignment, a selected set of cost parameters is kept variable. The resulting space of optimal alignments is then analyzed with respect to the dependencies on these cost parameters. The number n of variable parameters is called the dimension of the parametric aligment problem. If the cost function is linear in the variable parameters, then the alignment space has the structure of a subdivision of the n-dimensional euclidean space into a collection of convex polyhedra. Each polyhedron represents a region of parameter settings that lead to the same set of optimal alignments. Efficient algorithms have been put forth for computing the subdivisions arising from one- and two-dimensional parametric alignment. General algorithmic schemes have been developed for n > 2, but these algorithms are inefficient, in practice. To our knowledge, all algorithms devised so far are plagued by numerical instabilities. We propose an algorithm for two-dimensional parametric alignment which is both faster than existing algorithms and numerically stable. We report on experimental results with this algorithm on biological data and comment on the potential role of parametric alignment in the analysis of biosequences. 1
If the alignment A is clear from the context we omit the subscript. Here m is the total cost of all matches and mismatches in the alignment, g is the number of gaps in the alignment, and 5 is the number of indels in the alignment. An indel is an insertion or deletion of a single letter. A gap is a maximal consecutive stretch of insertions or deletions. The cost m is based on a substitution matrix such as the Dayhoff matrix [S, 8, 121, in the case of proteins. For any given alignment d and scoring system the cost m and, likewise, the numbers g and x are uniquely determined by the alignment. We call alignments A and 23with the same cost function hd = ha equivalent or of the same type. The variables Q and ,0 are the two parameters which weigh the gapinsertion and gap-extension cost against the cost of matches
Introduction
The alignment of biological sequences is one of the central analysis tools in computational molecular biology. This form of analysis is essential for finding structural and functional similarities in DNA, RNA, and proteins. Most alignment methods are based on cost functions that are defined I’wmission
to make digitnllhnrd
copies
personal or ckwroom USC is granted we not made or distributed for profit right notice, the title ofthr publication given
that
to rqwhlish, permission
copyright
is hy permission
to post on swvws and/or fee.
RECOMB
97. Santa
Cqy’ight
1997ACM
Fe New
of all or part
of this
material
fi,r
without fee provided that the copies or commercinl advantage. the copyand its date appear, and notice is of the ACM,
or to redistribute Mexico
O-89791-882-8/97/01
Inc.
To copy
otherwise,
to lists,
requires
specific
LISA ..$3.50
344
have proposed two different methods of computing D. Both of these methods have been implemented and tested on biological data. Femandez-Baca and Srinivasan [7] have presented another method for the 2-dimensional case, which is an elegant extension of cutting-plane algorithms for integer linear-programming problems. It turns out that their approach, while looking at the problem from a quite different perspective, is related to our dual approach outlined below. The performance of the existing algorithms for twodimensional parametric alignment is sufficient for the practiced solution of single instances of the alignment problem. However, if we want to scan large sets of sequences in order to compute parametric alignments for the systematic analysis of cost functions [21, 221, then these algorithms are still too slow. Furthermore, to our knowledge, the algorithms are plagued by numerical instabilities. The reason for this problem is that the algorithms compute the subdivision D by navigating along the polygonal lines in the division represented by the primal graph G = (V, E). Therefore, we call these algorithms ‘primal’. Unpredictable numerical inaccuracies determine whether the algorithm navigates directly on such a line or inside either of the two regions to its side. As a consequence the algorithms yield u-reproducible results or do not terminate at all. Therefore, it is preferable to navigate well inside the polygonal regions, where limited numerical instabilities cannot lead to a change in the set of optimal alignments. In this paper, we develop this idea of a ‘dual’ algorithm and achieve greater numerical stability and a significant increase in performance. In Section 2, we give basic deli&ions and discuss the problem of numerical instability. In Section 3 we describe the algorithm, and comment on its correctness and runtime. Section 4 reports on experiments with our algorithm that are aimed at obtaining practical performance estimates. Section 5 describes results that we have achieved by applying the algorithm to biological data. Section 6 gives conclusions and comments on the potential role of parametric alignment in sequence analysis. The parametric alignment software is part of the ToPLign system [15] that also contains a whole range of alignment and threading methods [ 1, 201. ToPLign is available via a uniform graphical user interface over the WWW (Section 7).
Figure 1: An alignment-cost surface and mismatches, respectively. These parameters usually are hard to determine and are therefore kept variable. Figure 1 shows a two-dimensional alignment-cost surface above the ((Y, @-plane determined by alignment hyperplanes HA = {(q&p) [ C = /LA