Computing Alignment Seed Sensitivity with ... - Springer Link

Report 3 Downloads 200 Views
Computing Alignment Seed Sensitivity with Probabilistic Arithmetic Automata Inke Herms1 and Sven Rahmann2 1

Genome Informatics, Faculty of Technology, Bielefeld University, D-33501 Bielefeld, Germany 2 Bioinformatics for High-Throughput Technologies, Computer Science 11, TU Dortmund, D-44221 Dortmund, Germany [email protected], [email protected]

Abstract. Heuristic sequence alignment and database search algorithms, such as PatternHunter and BLAST, are based on the initial discovery of so-called alignment seeds of well-conserved alignment patterns, which are subsequently extended to full local alignments. In recent years, the theory of classical seeds (matching contiguous q-grams) has been extended to spaced seeds, which allow mismatches within a seed, and subsequently to indel seeds, which allow gaps in the underlying alignment. Different seeds within a given class of seeds are usually compared by their sensitivity, that is, the probability to match an alignment generated from a particular probabilistic alignment model. We present a flexible, exact, unifying framework called probabilistic arithmetic automaton for seed sensitivity computation that includes all previous results on spaced and indel seeds. In addition, we can easily incorporate sets of arbitrary seeds. Instead of only computing the probability of at least one hit (the standard definition of sensitivity), we can optionally provide the entire distribution of overlapping or non-overlapping seed hits, which yields a different characterization of a seed. A symbolic representation allows fast computation for any set of parameters. Keywords: Homology search, Seed sensitivity, Spaced seed, Indel seed, Multiple seeds, Probabilistic arithmetic automaton.

1

Introduction

Most heuristic homology search algorithms [1, 2, 3, 4] and some repeat detection algorithms [5] are based on a filtration technique. In the filtration step, one selects candidate sequences that share a common pattern of matching characters (the seed ) with the query. These candidates are then further investigated by exact local alignment methods, such as the Smith-Waterman [6] algorithm. Initially, only contiguous perfect matches (e.g., DNA 11-mers in the initial BLAST implementation) were used as seeds. The PatternHunter (PH) tool by Ma et al. [7] was the first system to systematically advocate and investigate spaced seeds: PH looks for 18-mers with at least 11 matching positions distributed as 111∗1∗∗1∗1∗∗11∗111, K.A. Crandall and J. Lagergren (Eds.): WABI 2008, LNBI 5251, pp. 318–329, 2008. c Springer-Verlag Berlin Heidelberg 2008 

Alignment Seed Sensitivity with Probabilistic Arithmetic Automata

319

where 1 denotes a necessary match and ∗ denotes a don’t care position (match or mismatch). Over time, various seed models have been proposed in the literature, including consecutive seeds [1, 2], spaced seeds [7, 8, 9, 10], subset seeds [11], vector seeds [12], and indel seeds [13]. Given a probabilistic model for alignments (a so-called homology model ), different seeds in a class (e.g., all seeds with 11 match positions and length 18) can be compared according to their sensitivity, i.e., to match an alignment of given length. A good seed exhibits high sensitivity for alignment models that model evolutionarily related biosequences, and low sensitivity values for models that represent unrelated sequences. The latter property ensures that the seed does not detect too many random hits. Random hits decrease the efficiency of the filtration phase, since they are checked in vain for a significant alignment. An interesting finding was that the PatternHunter approach led to an increase in both sensitivity and filtration efficiency, compared to seeds of contiguous matches. Based on the observations in [7], the advantages of spaced seeds over consecutive seeds have been subsequently evaluated by many authors [8, 14, 15]. An extension to single seed models is the design of a multiple seed. This is a set of spaced seeds to be used simultaneously, such that a similarity is detected when it is found by (at least) one of the seeds. The idea to use a family of spaced seeds for BLAST-type DNA local alignment has been suggested by Ma et al. [7] and was implemented in PatternHunter II [16]. It has also been applied to local protein alignment in [17]. Since finding optimal multiple seeds (seed sets) is challenging, most authors concentrate on the design of efficient sets of seeds, leading to higher sensitivity than optimal single seeds [16, 18, 19]. Kucherov et al. [18] characterize a set of seeds only by its selectivity. Recent approaches [20, 21] approximate the sensitivity of multiple spaced seeds by means of correlation functions. Moreover, Kong [20] discussed that sensitivity should rather be measured via an averaging criterion. When searching optimal seeds, one faces the following problems to evaluate candidate seeds (the second more general one has not yet been extensively considered in the literature). Problem 1 (Sensitivity computation). Given a homology model (Sect. 2.2), a target length t of alignments, and a set of seeds (see Sect. 2.3 for a formal description of different seed models), what is the probability that an alignment of length t is matched by the seed (at least once)? Problem 2 (Hit distribution). Given a homology model, a target length t of alignments, a set of seeds, and a maximal match number K, what is the probability that an alignment of length t is matched by the seed exactly (at least) k times, for k = 0, . . . , K, when counting (a) overlapping matches, (b) non-overlapping matches only? Related Work and Our Contributions. We present an exact method, called Probabilistic Arithmetic Automaton (PAA), to compute the sensitivity as well as the entire distribution of the number of overlapping or non-overlapping hits for a

320

I. Herms and S. Rahmann

given seed or set of seeds. The model is very flexible as it comprises the majority of recent models. We generalize the Markov chain approach of Choi and Zhang [14], following the idea to characterize a seed by an averaging criterion rather than its sensitivity only. The obtained recurrences also allow a symbolic calculation as performed by Mak and Benson [22]. Moreover, the PAA can be designed for indel seeds Mak et al. [13]. Similar to [11], our model provides a unifying framework. However, we characterize multiple seeds by their sensitivity, whereas Kucherov et al. [11] design seed families with perfect sensitivity minimizing the number of false positives. Outline. The rest of the paper is organized as follows: in Sect. 2 we first describe different homology and seed models before establishing the Probabilistic Arithmetic Automaton in Sect. 3. We derive recurrence relations to compute hit distributions and in particular the sensitivity of a given seed. Sect. 4 compares our results to recent approaches. We divide our findings regarding Problem 1 in three categories, namely for i) a single spaced seed, ii) a single indel seed and iii) a set of spaced seeds. Considerations about Problem 2 complete the paper.

2 2.1

Background Notation

Let Σ denote an alphabet and let the set Σ ∗ include all finite words over Σ. A string s = s[0]s[1] . . . s[l − 1] ∈ Σ l has length |s| = l, s[i] denotes the character at position i (where indexing starts with 0) and s[i, j] refers to the substring of s ranging from position i to j. As usual,  is the empty string. The concatenation of r and s is written as rs. Moreover, for two words r, s ∈ Σ ∗ , r  s indicates that r is a prefix of s, while r  s says that r is a suffix of s. Finally, L(X) denotes the probability distribution (Law) of a random variable X with corresponding generic probability measure P. 2.2

Homology Model

To be able to compute seed sensitivity, we need to describe random alignments with known degree of similarity (e.g., a certain per cent identity value). In the context of seed sensitivity computations, it is customary not to model random sequences and derive alignment properties from those, but to directly model alignments representative strings A over an alphabet Σ indicating the status of the alignment columns. In the simplest and most frequently studied case [7, 8, 9, 10, 14, 19] only substitution mutations are considered; that is Σ = {0, 1}, referring to matches (1) and mismatches (0). Indel seeds use the alignment alphabet Σ = {0, 1, 2, 3}, where additionally 2 denotes an insertion in the database sequence, and 3 indicates an insertion in the query sequence. There are various other alignment alphabets, e.g. the ternary alphabet representing match / transition / transversion in DNA [23], or even larger alphabets to distinguish

Alignment Seed Sensitivity with Probabilistic Arithmetic Automata

321

different pairs of amino acids in the case of proteins [17]. The sequence states of alignment columns is modeled as a Markov chain (Σ, P, p0 ) with homology parameters P , a transition matrix on the alphabet Σ, and p0 , an initial distribution on Σ. If we consider Σ = {0, 1} and an i.i.d. homology model, where   the transition 1−p p for a probability Pij does not depend on i, then P takes the form 1−p p match probability p ∈ [0, 1], which is the only homology parameter in this case and quantifies the average per cent identity of such alignments. For indel seeds, a first-order Markov chain is appropriate, since a substitution is more plausible than two consecutive indels (one in each sequence). Hence, the homology model should prohibit the pairs ‘23’ and ‘32’ in a representative string. With the intention to compare our results later on, we work with the transition matrix P proposed in [13]: 0 1 2 3 ⎛ ⎞ 0 p0 p1 pg pg ⎟ 1⎜ ⎜ p∗0 p∗1 pg pg ⎟, (1) 2 ⎝ p0 p1 pg 0 ⎠ ∗ ∗ 3 p0 p1 0 pg where p0 is the probability of a mismatch, p1 is the probability of a match, and pg refers to the probability of a gap in the alignment. In order to ensure a stochastic transition matrix, p∗i = pi + pg pi /(p0 + p1 ) for i = 0, 1 redistributes pg to match and mismatch characters. The initial distribution is given by p0 = (p0 , p1 , pg , pg ). Other transition probabilities are possible, e.g. if alignments with affine gap costs should be modeled. 2.3

Seed Models

A seed π = π[0]π[1] . . . π[L − 1] is a string over an alphabet of “care” and “don’t care” characters. It represents alignment regions that indicate matches at the “care” positions. A seed is classified (L, ω) by its length L = |π| and its weight ω, which refers to the number of “care” positions. A spaced seed is a string over the alphabet Ξ = {1, ∗}. The “care” positions are indicated by ‘1’, while ‘∗’ An indel seed according to Mak et al. [13] is a string over the alphabet Ξ = {1, ∗, ?}, where ‘1’ and ‘∗’ are as above and ‘?’ stands for zero or one character from the alignment alphabet Σ = {0, 1, 2, 3}. Consecutive ‘?’ symbols may represent any character pair except ‘23’ or ‘32’. By means of this interpretation, the model explicitly allows for indels of variable size. For example, 1??1 may detect indels of size 0, 1, or 2. It hence tolerates 2, 1, or 0 match/mismatch positions. To understand how indel seeds are used in the filtering step of homology search, we refer the reader to the original article. In order to obtain the set of patterns defined by π, let Ψ = {[1], [01], [0123]} contain the character sets induced by Ξ, where we write [xy] as shorthand for {x, y}. The generalized string that refers to π is G(π) = g(π[0]) . . . g(π[L − 1]), where g : Ξ → Ψ is a mapping such that ‘1’→ [1], ‘∗’→ [01], and ‘?’→ [0123].

322

I. Herms and S. Rahmann

Definition 1 (Pattern set). Let π = π[0]π[1] . . . π[L − 1] be a seed of length L over the alphabet Ξ. The pattern set

PS(π) = s = s[0] . . . s[L − 1] | s[i] ∈ g(π[i]) contains all words satisfying the seed, this is all strings that match G(π). Example 1. For the spaced seed π = 1 ∗ 1 ∗ 1 of length 5 and weight 3, G(π) = [1][01][1][01][1] and PS(π) = {10101, 10111, 11101, 11111}. The patterns of the indel seed π = 1 ∗ 1?1 with G(π) = [1][01][1][0123][1] are given by PS(π) = {1011, 1111, 10101, 10111, 10121, 10131, 11101, 11111, 11121, 11131}. Definition 2 (Hit position). A seed π hits the representative string A ending at position n iff ∃ M ∈ PS(π) s.t. A[n − |M | + 1, n] = M . Example 2. For the representative string 1011011110 for instance, the seed π = 1 ∗ 11 ∗ 1 hits at positions 6 and 9, the indel seed π = 11 ∗ 1?1 hits at positions 7 and 8, respectively. The idea to use a finite, non-empty set Π = {π1 , . . . , πm } of spaced seeds, also called multiple spaced seed , turned out to further improve the quality of filtering [8, 9, 16, 19]. The patterns are collected in PS(Π) = ∪m i=1 PS(πi ). A multiple seed hits A, if (at least) one of its components does (in contrast to Pevzner and Waterman [24] where all seeds are required to match).

3

Probabilistic Arithmetic Automata

We show that the framework of Probabilistic Arithmetic Automata (PAA), recently introduced by Marschall and Rahmann [25], provides a unified approach to seed sensitivity computation. While including previous related approaches [8, 11, 14], the PAA framework can handle both ungapped and gapped alignments. Moreover, it allows the investigation of overlapping and non-overlapping hits of a single or a multiple seed. It yields recurrence relations to compute the entire hit distribution and in particular seed sensitivity. From these recurrences we derive a polynomial which allows fast computation for any parameter set. A PAA consists of three components, namely i) a Markov chain, ii) emissions associated with the states of this chain, and iii) a series of arithmetic operations performed on the emissions. In the context of seed sensitivity the Markov chain generates a random sequence alignment, emissions correspond to hit counts, and the accumulation of such counts yields the distribution of seed hits. The following formal definition is adopted from Marschall and Rahmann [25]: Definition 3 (Probabilistic Arithmetic Automaton). A Probabilistic

Arithmetic Automaton is a tuple Q, T, q0 , E, μ = (μq )q∈Q , N, n0 , θ = (θq )q∈Q , where

Alignment Seed Sensitivity with Probabilistic Arithmetic Automata

– – – – – – – –

323

Q is a finite

set of states, T = Tuv u,v∈Q is a stochastic state transition matrix, q0 ∈ Q is called start state, E is a finite set called emission set, each μq : E → [0, 1] is a weight distribution associated with state q, N is a finite set called value set, n0 ∈ N is called start value, each θq : N × E → N is an operation associated with state q.

chain on state space Q starting in q0 with The tuple (Q, T, δq0 ) 1 defines a Markov transitions according to T . Let Yi i∈N0 denote its associated state process. Ac cordingly, Zi i∈N0 represents the sequence of emissions and Vi i∈N0 denotes the sequence of values Vl = θYl (Vl−1 , Zl ), with V0 = n0 , resulting from the performed operations. In order to compute seed sensitivity, the PAA should count the number of seed hits to a randomly generated alignment. Therefore, we add the emitted counts, so the operation is θq = + in each state. Thus, Vl = Vl−1 + Zl with

V0 = 0, reducing the free parameters of the PAA to Q, T, q0 , E, μ = (μq )q∈Q . 3.1

PAA Construction

Our approach is related to [14]. Given the homology model (Σ, P, p0 ), we aim to construct a Markov chain (Q, T, δq0 ) that generates a proper random sequence alignment. Similar to the structure of an Aho-Corasick tree used in (approximate) string matching, the states of the underlying Markov chain have to incorporate all prefixes of the patterns of the considered seed π or multiple seed Π, respectively. Supplementary states are represented by the characters from the alignment alphabet, and an additional start state ensures the initial character distribution. Thus, for a given seed π of length L, the state space is Q(π) = {start} ∪ {σ ∈ Σ} ∪ {x ∈ Σ ∗ | 1 ≤ |x| ≤ L, ∃ M ∈ PS(π) : x  M } . (2) We mark a finite set F of final states in order to distinguish states that contribute to the number of seed hits. In our case, these states are those hit by π, that is F = {q ∈ Q | ∃ M ∈ PS(π) : M  q}. There is a non-zero outgoing transition from state u to state v if v is the maximal suffix of uσ, that is if there exists a σ ∈ Σ s.t. v = argmaxx∈{y∈Q | yuσ} |x|. The corresponding transition probability is Pu[|u|−1]σ = P(σ | u[|u| − 1]), where the conditional probabilities are given by the homology model. Thus, in the i.i.d. case it is just the character frequency pσ . The stochastic state transition matrix of the PAA with u, v ∈ Q is given by ⎧ 0 ⎪ if u = start, v = σ ∈ Σ, ⎨pσ Tuv = Pu[|u|−1]v[|v|−1] if u = start, ∃ σ ∈ Σ : v = argmax{x∈Q | xuσ} |x|, ⎪ ⎩ 0 otherwise. (3) 1

The initial state distribution given by δq0 is the Dirac distribution assigning probability 1 to {q0 }.

324

I. Herms and S. Rahmann

(a) PAA counting overlapping hits of π = 1 ∗ 1.

(b) PAA counting non-overlapping hits of π = 1 ∗ 1.

Fig. 1. Markov chain underlying the PAA for π = 1 ∗ 1. 1(a) shows the transitions in order to count all occurrences of π in a random alignment over Σ = {0, 1} (either i.i.d. with match probability p or Markov with P1 = p and P0 = q). 1(b) is designed to count non-overlapping hits.

If we are interested in counting non-overlapping hits, the transitions outgoing a final state become  P1σ if u ∈ F, v = σ ∈ Σ, Tuv = 0 if u ∈ F, v ∈ / Σ, since the last character read was a 1 (by definition). Therewith, the Markov chain becomes reducible, and we can remove states from Q, that are not reachable from start. The so constructed Markov chain (Q, T , δstart ) generates a random sequence alignment whose similarity level is specified by the homology parameters. It is adapted to a given seed through the choice of states. In order to count the number of seed hits, we still need to assign the family μ = (μq )q∈Q of weight output distributions. Let C(q) denote the number of hit counts in state q ∈ Q. Then, C(q) = |{M ∈ PS(π) : M  q}| . (4) In the case of non-overlapping hits, |{M ∈ PS(π) : M  q}| = 1 for all q ∈ Q, because of the reduced state space. Consequently, we deal with the emission set E = {0, 1, . . . , cmax }, cmax = maxq∈Q C(q), and |E|-dimensional Dirac measures μq assigning probability 1 to C(q). Conclusion. To sum up, the PAA Q, T, q0 = start, E = {0, 1, . . . , cmax }, μ =

(μq )q∈Q with Q and T given in (2), (3), and μq = δC(q) with C(q) given by (4), generates a random sequence alignment according to the specified homology model in order to count the number of hits of a given seed π or Π, respectively. The inherent Markov Additive Chain (Vl )l∈N0 with Vl = Vl−1 + C(Yl ),

V0 ≡ 0,

yields the number Vt of accumulated hits in a random alignment of length t.

Alignment Seed Sensitivity with Probabilistic Arithmetic Automata

325

Remark 1. We use an O(|Σ| |Q| log |Q|) algorithm by Hopcroft [26] to minimize the size of the state space. Here, the initial partition is induced by grouping states with the same emitted value C(q), the same ingoing and the same outgoing transition probability. 3.2

Hit Distribution and Seed Sensitivity

In order to compute the sensitivity of a given seed π or Π (Problem 1) and its entire hit distribution (Problem 2) for a target alignment length t, we seek the by distribution L(Vt ) of accumulated seed hits. This is obtained

marginalization  from the joint state-value distribution L(Yt , Vt ) via P Vt = k = q∈Q P Yt =

q, Vt = k . For the sake of readability, we define hlq (k) := P(Yl = q, Vl = k) and  thus, P(Vl = k) =: hl (k) = q∈Q hlq (k). Now, we can reformulate the mentioned problems. Seed sensitivity is commonly defined as S(π, t) = P(Vt ≥ 1) = 1 − P(Vt = 0) = 1 − ht (0) . It is related to the hit distribution

P {A | |A| = t, Vt = k} = P(Vt = k) = ht (k) for k ≥ 0 .

(5)

(6)

The PAA approach proposed in 3.1 yields the following system of recurrence relations:  hl−1 if q ∈ /F  (k)Tq  q  hlq (k) = q ∈Q ql−1 for q ∈ Q, l ≥ 1, (7)  h (k − C(q))T if q ∈ F q q q ∈Q q with initial condition h0start (0) = 1. In order to efficiently implement

the computation of seed sensitivity, we make use of the vector H l (0) = hlq (0) q∈Q and the update formula H l (0) = H l−1 (0)T  . By T  , we denote the transition matrix projected to the columns representing q with C(q) = 0. This means, all entries within a column corresponding to a state hit by π are set to 0 in order to ensure hlq (0) = 0 for q with C(q) > 0. The computation of S(π, t) requires O(|Q|2 ) space and O(t |Q|2 ) time. Parameter Free Calculation. Another advantage of our approach is the possibility of a parameter-free calculation as presented in Mak and Benson [22]. Without setting the homology parameters in advance, a polynomial (in these parameters) can be computed from (7) by means of a computer algebra system like Mathematica. Hence, one can quickly assess the sensitivity of a seed under different parameter values.

4

Results

We have computed the sensitivity of a given seed or set of seeds by means of the PAA approach. The analysis includes different seed and homology models.

326

4.1

I. Herms and S. Rahmann

Spaced Seeds

We considered amongst others the PH seed class (18, 11) and the class (11, 7) with a target alignment length of t = 64 as has been done in Choi and Zhang [14]. The computed sensitivities (data not shown) agree with those determined in the original article, except for class (11, 7) for low similarity levels. Instead of sensitivity 0.05398 for p = 0.1 we compute 5.39821 × 10−6 and instead of sensitivity 0.1140 for p = 0.3 our result is 0.01140. However, our results are consistent with the polynomial obtained from the recurrences. They further agree with the following plausibility argument: Let hi be the probability of the event that the seed from (11, 7) hits the random alignment at position i. If we

assume independence of positions, hi = p7 (1−p)4 +4p8 (1−p)3 + 42 p9 (1−p)2 + 43 p10 (1− p) + p11 for all 11 ≤ i ≤ 64. For p = 0.1 this rough calculation yields a sensitivity of 5.40021 × 10−6 . 4.2

Indel Seeds

To show that our approach also works for indel seeds, we compare our results to sensitivities provided by Mak et al. [13]. Therefore, we investigate different seeds under a Markov homology model with Σ = {0, 1, 2, 3} and transitions according to 1. Note that in the original article the authors work with normalized positions. This means, only positions in the representative string that refer to a character in the query are counted. Thus, any 2 in A does not contribute to the target alignment length t. We use the expected number t¯ of characters to read up to the t’th normalized position and compute S(π, t¯) as well as S(π, t) accordant to (5). The results in Table 1 show that sensitivity is overestimated when using normalized positions, since some alignments are ignored, although sensitivity is defined as fraction of sequence alignments that are matched by a seed. Compare spaced and indel seeds of equivalent random hit rates, Mak et al. [13] observe that with increasing indel to mismatch ratio, indel seeds outperform spaced seeds. Our method yields the same results (compare Table 1), even if the “winning seed” changes with target length. Table 1. Sensitivity of pairs of spaced and indel seeds with equivalent random hit rates for different homology parameters (t,p1 ,p0 ,pg ). Winning seeds are shown in bold. seed 1111 ∗ 111111 1111 ∗ 111?1111 1111 ∗ 111111 1111 ∗ 111?1111 111 ∗ 11 ∗ 1111 111 ∗ 11?11 ∗ 111 111 ∗ 11 ∗ 1111 111 ∗ 11?11 ∗ 111

homology parameters (64, 0.7, 0.2, 0.05) (64, 0.7, 0.2, 0.05) (100, 0.7, 0.2, 0.05) (100, 0.7, 0.2, 0.05) (64, 0.8, 0.15, 0.025) (64, 0.8, 0.15, 0.025) (100, 0.8, 0.15, 0.025) (100, 0.8, 0.15, 0.025)

sensitivity by [13] S(π, t¯) S(π, t) 0.488697 0.494082 0.470369 0.487703 0.492503 0.468351 0.669123 0.668821 0.649304 0.670198 0.669884 0.650131 0.943899 0.943609 0.937750 0.943214 0.942985 0.936985 0.991157 0.990945 0.989497 0.991239 0.991044 0.989594

Alignment Seed Sensitivity with Probabilistic Arithmetic Automata

4.3

327

Multiple Seeds

It is already NP-hard [16] to find a single optimal seed by testing all seeds of a given class and selecting the best. Additionally, for a multiple seed there are exponentially many sets of seeds of a given weight. Hence, there is no exact algorithm known that computes optimal multiple seeds. However, “good” multiple seeds are computed by heuristic algorithms presented in [20] and [21]. These approaches use different quality measures correlated with sensitivity, while we compute exact sensitivity values even for non-i.i.d. homology models. Our results agree with the approximations. The PAA framework can also be applied to design efficient sets of seeds. Similar to [19] we can successively find seeds that locally maximize the conditional probability to hit a random alignment, given that the seeds in the set do not match. This can be achieved by introducing absorbing states. 4.4

Alternative Criteria

In addition to recent work, our approach yields the entire distribution of seed hits. Therefore, we can rank seeds e.g. according to their probability to produce few overlapping hits or subject to the probability of at least two non-overlapping hits. The latter corresponds to the FASTA approach. There, candidates are required to contain at least two non-overlapping seed hits, which afterwards can probably be combined to a similarity. We have calculated the distributions (6) of overlapping and non-overlapping hits for seed classes (11, 7) and (18, 11) (data not shown). Figure 2 shows the parameter ranges for best performing seeds from the PH seed class according to i) sensitivity and ii) the probability of at least two non-overlapping hits at target alignment length 64. For some parameters, the optimal seed has only a slightly higher sensitivity than its competitors. In such cases, another criterion might be of interest. For instance, for p = 0.4, where the four top-ranking seeds have almost the same

(a) Parameter ranges according to maximal sensitivity.

(b) Parameter ranges according to maximal probability of at least two nonoverlapping hits. Fig. 2. Parameter ranges for best performing seeds from PH seed class (18, 11) at target length 64. 2(a) shows seeds with highest sensitivity. 2(b) displays winning seeds regarding the probability of at least two non-overlapping hits. A (PH):111 ∗ 1 ∗ ∗1 ∗ 1 ∗ ∗11∗111, B:111∗∗1∗11∗∗1∗1∗111, C:11∗∗111∗1∗∗1∗111∗1, D:111∗1∗∗11∗1∗1∗∗111.

328

I. Herms and S. Rahmann

sensitivity, we observe the PH seed A to maximize the non-overlapping hits criterion clearly.

5

Conclusions

We have presented a Probabilistic Arithmetic Automaton to compute seed sensitivity. It also provides the entire distribution of the number of overlapping or non-overlapping seed hits. The flexibility of the PAA framework allows various homology and seed models. In especially, we can handle sequence alignments with or without gaps and build the automaton for an appropriate seed. For that reason we have presented general definitions of the pattern set and the hit position. Further, symbolic representation of (7) yields a polynomial in the homology parameters, which gives the solution to Problem 1 or Problem 2 for any set of parameters. This enables us to differentiate between seeds that have very similar sensitivities. It remains to prove to what extent alternative criteria are reasonable. One idea is to investigate the hit distribution under two background models: one alignment model representing unrelated sequences, one model describing homologous sequences. Then, we would call a seed optimal that has a given high sensitivity (e.g. 95%) and maximizes selectivity. Further, we show that one and the same approach is suitable for single and multiple seeds. In particular, we can compute the sensitivity of a multiple seed under non-i.i.d. homology models (in contrast to the DP algorithm by [16]). Clearly, despite minimization, the size of the automaton is exponential in the number of wildcards. Thus, the method should rather serve as verification for heuristic algorithms. Acknowledgments. We thank T. Wittkop and E. Fritzilas for helpful comments on the manuscript. I. Herms is supported by the NRW Graduate School in Bioinformatics and Genome Research.

References [1] Pearson, W., Lipman, D.: Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444–2448 (1988) [2] Altschul, S.F., Gish, W., Miller, W., Myers, E., Lipman, D.: Basic local alignment search tool. J. Mol. Biol 215, 403–410 (1990) [3] Altschul, S.F., Madden, T.L., Sch¨ affer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997) [4] Kent, W.J.: BLAT–the blast-like alignment tool. Genome Res. 12(4), 656– 664 (2002) [5] Gelfand, Y., Rodriguez, A., Benson, G.: TRDB–the tandem repeats database. Nucleic Acids Res. 35 (2007)

Alignment Seed Sensitivity with Probabilistic Arithmetic Automata

329

[6] Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981) [7] Ma, B., Tromp, J., Li, M.: Patternhunter - faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002) [8] Buhler, J., Keich, U., Sun, Y.: Designing seeds for similarity search in genomic DNA. In: Proceedings of the 7th annual international conference on Research in computational molecular biology, pp. 67–75 (2003) [9] Brejov´ a, B., Brown, D.G., Vinar, T.: Optimal spaced seeds for homologous coding regions. J. Bioinform. Comput. Biol. 1(4), 595–610 (2004) [10] Choi, K.P., Zeng, F., Zhang, L.: Good spaced seeds for homology search. Bioinformatics 20(7), 1053–1059 (2004) [11] Kucherov, G., No´e, L., Roytberg, M.: A unifying framework for seed sensitivity and its application to subset seeds. J. Bioinform. Comput. Biol. 4(2), 553–569 (2006) [12] Brejov´a, B., Brown, D.G., Vinar, T.: Vector seeds: an extension to spaced seeds. J. Computer System Sci. 70(3), 364–380 (2005) [13] Mak, D., Gelfand, Y., Benson, G.: Indel seeds for homology search. Bioinformatics 22(14), e341–e349 (2006) [14] Choi, K.P., Zhang, L.: Sensitivity analysis and efficient method for identifying optimal spaced seeds. J. Computer System Sci. 68, 22–40 (2004) [15] Li, M., Ma, B., Zhang, L.: Superiority and complexity of the spaced seeds. In: Proceedings of SODA 2006, pp. 444–453. SIAM, Philadelphia (2006) [16] Li, M., Ma, B., Kisman, D., Tromp, J.: Patternhunter II: Highly sensitive and fast homology search. J. Bioinform. Comput. Biol. 2(3), 417–439 (2004) [17] Brown, D.G.: Optimizing multiple seeds for protein homology search. IEEE/ACM Trans. Comput. Biol. Bioinform. 2(1), 29–38 (2005) [18] Kucherov, G., No´e, L., Roytberg, M.: Multiseed lossless filtration. IEEE/ACM Trans. Comput. Biol. Bioinform. 2(1), 51–61 (2005) [19] Sun, Y., Buhler, J.: Designing multiple simultaneous seeds for DNA similarity search. J. Comput. Biol. 12(6), 847–861 (2005) [20] Kong, Y.: Generalized correlation functions and their applications in selection of optimal multiple spaced seeds for homology search. J. Comput. Biol. 14(2), 238–254 (2007) [21] Ilie, L., Ilie, S.: Multiple spaced seeds for homology search. Bioinformatics 23(22), 2969–2977 (2007) [22] Mak, D.Y.F., Benson, G.: All hits all the time: Parameter free calculation of seed sensitivity. In: APBC. Advances in Bioinformatics and Computational Biology, vol. 5, pp. 327–340. Imperial College Press (2007) [23] No´e, L., Kucherov, G.: Improved hit criteria for DNA local alignment. BMC Bioinformatics 5, 149 (2004) [24] Pevzner, P.A., Waterman, M.S.: Multiple filtration and approximate pattern matching. Algorithmica 13(1/2), 135–154 (1995) [25] Marschall, T., Rahmann, S.: Probabilistic arithmetic automata and their application to pattern matching statistics. In: 19th Annual Symposium on Combinatorial Pattern Matching (accepted for publication, 2008) [26] Hopcroft, J.E.: An n log n algorithm for minimizing states in a finite automaton. Technical report, Stanford, CA, USA (1971)