Edit Distance with Duplications and Contractions Revisited

Report 2 Downloads 182 Views
Edit Distance with Duplications and Contractions Revisited Tamar Pinhas, Dekel Tsur, Shay Zakov, and Michal Ziv-Ukelson Department of Computer Science Ben-Gurion University of the Negev, Israel {matuskat, dekelts, zakovs, michaluz}@cs.bgu.ac.il

Abstract. In this paper, we propose three algorithms for the problem of string edit distance with duplication and contraction operations, which improve the time complexity of previous algorithms for this problem. These include a faster algorithm for the general case of the problem, and two improvements which apply under certain assumptions on the cost function. The general algorithm is based on fast min-plus  multiplication  3 log3 log n of square matrices, and obtains the running time of O |Σ|n log , 2n where n is the length of the input strings and |Σ| is the alphabet size. This algorithmis further accelerated, under  some assumption on the cost  02 log3 log n0 function, to O |Σ| n2 + nn log time, where n0 is the length of 2 n0 the run-length encoding of the input. Another improvement is based on a new fast matrix-vector min-plus multiplication   under a certain discreten3 ness assumption, and yields an O |Σ| log2 n time algorithm. Furthermore, this algorithm is online, in the sense that one of the strings may be given letter by letter. As part of this algorithm we present the currently fastest online algorithm for weighted CFG parsing for discrete weighted grammars. This result is useful on its own.

1

Introduction

Comparing strings is a well-studied problem in computer science as well as in bioinformatics. In this paper we address the problem of string edit distance, with the additional operations of duplication and contraction. Such algorithms are motivated by the study of minisatellites and their comparisons in the context of population genetics [11]. Traditionally, string similarity is measured in terms of edit distance, which reflects the minimum-cost edit of one string to the other based on the edit operations of substitutions (including matches) and deletions/insertions (indels). However, in comparing minisatellite maps, one has to also consider that regions of the map have arisen as a result of duplication events from the neighboring units. A minisatellite is a section of DNA that consists of tandem repetitions of short (6–100 bp) sequence motifs spanning 0.5 kilobases to several kilobases. The repeated motifs also vary in sequence through base substitutions. A minisatellite map represents a minisatellite region, where each motif (denoted unit) is encoded by a character and handled as one entity. For one minisatellite locus,

2

Tamar Pinhas, Dekel Tsur, Shay Zakov, and Michal Ziv-Ukelson

both the type and the number of units in the map vary between individuals in a population. Therefore, pairwise comparisons of minisatellite maps are typically applied in studying the evolution of populations. The single copy duplication model, where only one unit can duplicate at a time, is the most popular, and its biological validation was asserted for the MSY1 minisatellites [8, 11]. According to this model, one unit can mutate into another unit via a mutation of a single nucleotide within it. Also, a unit can be duplicated, that is, an additional copy of the unit may appear next to the original one in one of the strings (tandem repeat). Thus, when comparing minisatellite maps, two additional operations are considered: duplication and contraction. As we assume the single copy duplication model, duplications (and contractions) add (and subtract, respectively) a single letter at a time. The problem of comparing two minisatellite maps under the single copy duplication model was first defined and studied by B´erard and Rivals [8]. B´erard and Rivals suggested an O(n4 ) time and O(n3 ) space algorithm, where n is the length of the two input strings (for the sake of simplicity, we assume that both strings are of the same length). This was followed by the work of Behzadi and Steyaert [5], who gave an O(|Σ|n3 ) time and O(|Σ|n2 ) space algorithm for the problem, where |Σ| is the alphabet size (typically a few tens of unique units). Behzadi and Steyaert [5] improved the running time of their algorithm based on run length encoding, to O(n2 + |Σ|nn02 ), where n0 is the length of the runlength encoding of the input strings [4]. Run length encoding was also used by B´erard et al. [7], who presented an O(n3 + |Σ|n03 ) time algorithm. Recently, Abouelhoda et al. [1] presented an alphabet-independent algorithm, with time complexity O(n2 + nn02 ). The algorithms mentioned above need some restrictions on the edit scripts they consider in order to guarantee the optimality of the solution. For example, the algorithm of Behzadi and Steyaert [5] cannot find an optimal edit script from s = ab to t = cd if the unique optimal script is ab → bb → b → d → dd → cd. The algorithm of Abouelhoda et al. [1] works correctly on the instance above, but it cannot find an optimal edit script from s = a to t = bc if the unique optimal script is a → d → dd → bd → bc. It is easy to design cost functions in which the example scripts are optimal. Thus, the two algorithms above are correct only when the cost function forbids such scripts from being optimal. Therefore, the algorithms presented in this paper are currently the fastest algorithms for the problem under general cost functions, even though their time complexity contains a |Σ| factor. 1.1

Our Contribution and Roadmap

We propose three algorithms for the problem of edit distance with duplications and contractions, based on fast matrix multiplication. We start, in Section 2, with the problem definition and its basic solution. In Section 3, we present an algorithm for general cost functions that is based on min-plus square matrix multiplication. Using the matrix  multiplication algo |Σ|n3 log3 log n time. rithm of Chan [9], this algorithm runs in O log2 n

Edit Distance with Duplications and Contractions Revisited

3

In Section 4, we extend our approach to the algorithm given in [4] which exploits run length encodings of the input  strings,02assuming some restrictions |Σ|nn log3 log n0 2 on the cost functions. This yields an O n + time algorithm. log2 n0 In Section 5, we describe another improvement to the general algorithm for the case of discrete cost functions. This algorithm is based on fast min-plus matrix-vector multiplication and is online, in the sense that one of the strings may be given letter by letter. For this purpose, we adapt Williams’ matrix-vector multiplication algorithm over a finite semiring [14], and modify it to efficiently handle the min-plus matrix-vector multiplication, for the case where the differences between adjacent cells are taken from  a finite integer interval. This yields  |Σ|n3 an overall time complexity of O log2 n . An additional result obtained along the way is the adaptation of the algorithm for weighted context-free grammar (CFG) parsing under discrete cost functions to be online as well. Due to space restrictions, additional materials, including proofs, figures and algorithm details, are on the Internet at http://www.cs.bgu.ac.il/∼negevcb/publications.php.

2

Edit Distance with Duplications and Contractions

In this section, we define the Edit Distance with Duplications and Contractions(EDDC) problem, show some of its properties and give a basic solution for it. 2.1

Problem Definition

In the edit distance model, one is given a source string s and a target string t, and it is assumed that t was obtained by applying a sequence of local edit operations, called an edit script, over s. We denote such a script by s = u0 → u1 → u2 → . . . → ul = t, where each intermediate string uk is obtained by applying a single edit operation on uk−1 . Each edit operation has a cost, and the cost of an edit script is defined as the sum of costs of its operations. The edit distance from s to t, denoted by ed (s, t), is the minimum cost of an edit script from s into t. The goal of the problem is to compute the edit distance between the two given strings. An edit script whose cost is equal to the edit distance is called an optimal script from s into t. In the standard problem definition [8], the allowed edit operations include insertion (inserting a letter in some position in the string), deletion (deleting a letter from some position in the string), and mutation (replacing one letter in the string by another). The EDDC variant adds two additional operations: duplication and contraction [8]. The duplication operation replaces a single-letter substring α with the substring αα, where contraction is its symmetric operation. Throughout the rest of this paper, fix the following entities. Let Σ be a finite alphabet. For a letter α ∈ Σ, denote by ins(α), dup(α), and del(α) the costs of the insertion, duplication, and deletion operations applied on α, respectively, by

4

Tamar Pinhas, Dekel Tsur, Shay Zakov, and Michal Ziv-Ukelson

cont(α) the cost of contracting a string of the form αα into α, and by mut(α, β) the cost of mutating α into some letter β ∈ Σ. We assume that all operation costs are nonnegative. Denote by si the ith letter in the string s, starting at 0. We denote si,j the substring si si+1 . . . sj−1 of s. 2.2

The Basic Recursion

We next show how to solve the EDDC problem recursively. It is possible to show that there is an optimal edit script from s into t of the following form: s is first transformed into a string w = w0 w1 . . . wr (the common ancestor of s and t), where each letter wi is obtained by reducing some substring si of s, and s = s0 s1 . . . sr . Then, w is transformed into t, where every letter wi generates some substring ti of t through the application of some edit script, and t = t0 t1 . . . tr . This is expressed in the following recursive formula: ed (s0,i , t0,j ) = min {ed (s0,k , t0,l ) + ed (sk,i , α) + ed (α, tl,j )} , 0≤k 0, and ed (s0,0 , t0,0 ) = 0. In previous works [5, 1], the EDDC between a letter and a string, ed (sk,i , α) and ed (α, tl,j ), was computed separately preceding the computation of the recursive formula above. We keep this approach and show, in the next section, how to compute these values by applying a CFG parsing algorithm. While being slightly different, our recursive formula resembles previous formulations [5, 1]. Thus, we defer the assertion of its correctness to the supplementary materials. 2.3

Context-free Grammar Representation

A generating edit script is an edit script that contains only insertion, duplication and mutation operations. Consider the application of a generating edit script to a single-letter string s = γ that results in some target string t. Such a script may be described in terms of weighted CFG parsing, as follows. The set of terminals in the grammar is Σ. The set of non-terminals is N = {˜ α : α ∈ Σ} ∪ {I}, where I is a unique non-terminal, representing a letter insertion placeholder. The start non-terminal is γ˜ , corresponding to the source letter γ. The parse rules are: ˜ The cost of such rules is mut(α, β). 1. Mutations: ∀˜ α, β˜ ∈ N , α ˜ → β. 2. Duplications: ∀˜ α ∈ N, α ˜→α ˜α ˜ . The cost of such rules is dup(α). 3. Elongations (a preliminary step to allow insertions): ∀˜ α ∈ N, α ˜→α ˜ I and α ˜ → Iα ˜ . The cost of such rules is 0 (note that an application of such rules must be followed by an application of an insertion rule of type 4). 4. Insertions: ∀˜ α ∈ N, I → α ˜ . The cost of such rules is ins(α). 5. Terminal achievement: ∀˜ α ∈ N, α ˜ → α, where α is the terminal corresponding to non-terminal α ˜ . The cost of such rules is 0.

Edit Distance with Duplications and Contractions Revisited

5

A generating edit script from a letter γ into t corresponds to a minimal cost parse tree of t via the grammar above, which can we achieved by applying the CKY algorithm [12, 10, 2]. A similar grammar can be used to generate a single letter string from a non-empty string using contraction and deletion operations. 2.4

A Basic Dynamic Programming Algorithm for EDDC

In this section we describe a dynamic programming (DP) algorithm that implements the recursion of Eq. 2.1. The algorithm consists of two stages. Given a pair of strings s and t, the first stage of the algorithm computes all letter-to-substring edit distances ed (si,j , α) and ed (α, ti,j ). In the second stage, all prefix-to-prefix edit distances ed (s0,i , t0,j ), and in particular the edit distance between the two complete strings, are computed. The algorithm maintains the following matrices: α – Matrices S α , for every α ∈ Σ. An entry Si,j holds the value ed (si,j , α). α α holds the value ed (α, ti,j ). – Matrices T , for every α ∈ Σ. An entry Ti,j α – Matrices T Dα , for every α ∈ Σ. An entry T Di,j holds the minimum of ed (s0,i , t0,l ) + ed (α, tl,j ), for 0 ≤ l < j. – A matrix T D, where an entry T Di,j holds the value ed (s0,i , t0,j ).

Stage 1. In this stage the matrices S α and T α , for all α ∈ Σ, are computed using a weighted CFG parsing algorithm [12, 10, 2], as explained in Section 2.3. Stage 2. This stage takes as input the matrices S α and T α which were computed in the first stage, in addition to the strings s and t, and computes the matrix T D according to Eq. 2.1. We point out that a naive implementation of Eq. 2.1 would yield an O(|Σ| n4 ) running time. However, as done in [1], it is possible to split the computation into two independent parts by replacing Eq. 2.1 with the two interleaved recursions (Eq. 2.2 and Eq. 2.3 below), in order to obtain a more efficient, O(|Σ| n3 ) algorithm. This is achieved by utilizing the auxiliary matrices T Dα : α α T Dk,j = min {T Dk,l + Tl,j },

(2.2)

α α T Di,j = min {T Dk,j + Sk,i }.

(2.3)

0≤l<j

0≤k