Approximate String Matching by Finite Automata

Bořivoj Melichar
Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University, Karlovo nám. 13, 121 35 Prague 2, Czech Republic
e-mail: [email protected] fax: 298098 phone: 2435 7470

Abstract. Approximate string matching is a sequential problem, and it can therefore be solved using finite automata. A nondeterministic finite automaton is constructed for string matching with $k$ mismatches. It is shown how the "dynamic programming" and "shift-and" based algorithms simulate this nondeterministic finite automaton. The corresponding deterministic finite automaton has $O(m^{k+1})$ states, where $m$ is the length of the pattern and $k$ is the number of mismatches. The time complexity of algorithms based on such a deterministic finite automaton is $O(n)$, where $n$ is the length of the text.

1 Introduction

Approximate string matching can be described in the following way: Given a text string $T = t_1 t_2 \cdots t_n$, a pattern $P = p_1 p_2 \cdots p_m$, and an integer $k$, $k < m < n$, we are interested in finding all occurrences of a substring $X$ in the text string $T$ such that the distance $D(P, X)$ between the pattern $P$ and the string $X$ is less than or equal to $k$. In this paper we will consider the Hamming distance.

The Hamming distance, denoted by $D_H$, between two strings $P$ and $X$ of equal length is the number of positions with mismatching symbols in the two strings. We will refer to approximate string matching as string matching with $k$ mismatches whenever $D$ is the Hamming distance. The Levenshtein distance, denoted by $D_L$, or edit distance, between two strings $P$ and $X$, not necessarily of equal length, is the minimal number of editing operations insert, delete, and replace needed to convert $P$ into $X$. We will refer to approximate string matching as string matching with $k$ differences whenever $D$ is the Levenshtein distance.

A nondeterministic finite automaton (NFA) is a 5-tuple $M = (Q, A, \delta, q_0, F)$, where $Q$ is a finite set of states, $A$ is a finite set of input symbols, $\delta$ is a state transition function from $Q \times A$ to the power set of $Q$, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is the set of final states.
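The two distances defined above can be illustrated with a minimal Python sketch. The function names `hamming_distance` and `levenshtein_distance` are assumptions chosen for this illustration, and the Levenshtein routine uses the standard row-by-row dynamic programming scheme rather than anything specific to this paper.

```python
def hamming_distance(p: str, x: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(p) != len(x):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(p, x))


def levenshtein_distance(p: str, x: str) -> int:
    """Minimal number of insert, delete and replace operations turning p into x."""
    # prev[j] is the distance between the previous prefix of p and x[:j].
    prev = list(range(len(x) + 1))
    for i, a in enumerate(p, start=1):
        curr = [i]  # turning p[:i] into the empty string takes i deletions
        for j, b in enumerate(x, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from p
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # replace a by b, or keep a match
            ))
        prev = curr
    return prev[-1]
```

For instance, `hamming_distance("aba", "abb")` counts one mismatching position, while `levenshtein_distance("aba", "ba")` needs a single delete operation even though the strings have different lengths.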

Hlaváč, Šára (Eds.): CAIP '95 Proceedings, LNCS 970 © Springer-Verlag Berlin Heidelberg 1995


A finite automaton is deterministic (DFA) if $\delta(q, a)$ has exactly one element for any $q \in Q$ and $a \in A$. In the following, we will use the alphabet $A = \{s_1, s_2, \ldots, s_{|A|}\}$. In our case, if $p \in A$, then $\bar{p}$ is the complement set $A - \{p\}$.

2 String Matching with k Mismatches

First, we construct a nondeterministic finite automaton $M_H$ for a given pattern $P = p_1 p_2 \cdots p_m$, alphabet $A = \{s_1, s_2, \ldots, s_{|A|}\}$, and $k < m$. This automaton is depicted in Fig. 1. Each state $q \in Q$ has a label $(i, j)$, where $i$, $0 \le i \le k$, is the level of $q$, and $j$, $0 \le j \le m$, is the depth of $q$.

Fig. 1. [Figure: the automaton $M_H$, with its levels of states labelled up to Level k.]

In the automaton $M_H$ there are $k + 1$ levels of state sequences. Every level ends in one of the final states $(0, m), (1, m), \ldots, (k, m)$. These final states accept strings with $0, 1, 2, \ldots, k$ mismatching symbols, respectively. The sequence of states of level 0 corresponds to the given pattern without any mismatch. Levels $1, 2, \ldots, k$ correspond to the strings with $1, 2, \ldots, k$ mismatching symbols, respectively. From each nonfinal state of level $i$, $0 \le i < k$, there exists a transition to a state of level $i + 1$, which means that a mismatch occurs. Moreover, there is a self-loop in the initial state $(0, 0)$ for every symbol of the alphabet $A$. This automaton accepts all strings having a postfix $X$ such that $D_H(P, X) \le k$.
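The construction just described can be written down as a short Python sketch. It is an illustration under stated assumptions, not code from the paper: the name `build_mismatch_nfa`, the dictionary representation of the transition function, and the convention that final states have no outgoing transitions are all choices made here. States are the labels (level, depth), and level $i$ only holds depths $i, \ldots, m$, because $i$ mismatches consume at least $i$ symbols.

```python
def build_mismatch_nfa(pattern: str, k: int, alphabet: str):
    """Return (states, delta, initial, finals) of the k-mismatch NFA for pattern."""
    m = len(pattern)
    if not 0 <= k < m:
        raise ValueError("expected 0 <= k < m")
    # A state is labelled (level, depth); level i starts at depth i.
    states = {(i, j) for i in range(k + 1) for j in range(i, m + 1)}
    delta = {}  # maps (state, symbol) to the set of successor states
    for (i, j) in states:
        if j == m:
            continue  # final states: no outgoing transitions in this sketch
        for a in alphabet:
            targets = set()
            if a == pattern[j]:
                targets.add((i, j + 1))      # matching symbol, same level
            elif i < k:
                targets.add((i + 1, j + 1))  # mismatch, one level down
            if (i, j) == (0, 0):
                targets.add((0, 0))          # self-loop of the initial state
            if targets:
                delta[((i, j), a)] = targets
    finals = {(i, m) for i in range(k + 1)}
    return states, delta, (0, 0), finals
```

For the pattern `aba` with `k = 1` over the alphabet `ab`, the sketch yields seven states: four on level 0 and three on level 1. The initial state is the only source of nondeterminism, since it is the only state with two successors for one symbol.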

The number of states of the automaton $M_H$ is

$$(m + 1) + m + (m - 1) + \cdots + (m - k + 1) = (k + 1)\left(m + 1 - \frac{k}{2}\right).$$

The only problem is that this finite automaton is nondeterministic. There are two ways to use it as the basis of a matching algorithm:

1. To simulate the nondeterministic automaton in a deterministic way. Some known matching algorithms use this approach; below we show two examples. The drawback of this approach is its high time complexity, which may be $O(n \cdot m)$ in the worst case, while the space complexity is $O(k \cdot m)$.
2. To construct an equivalent deterministic finite automaton. The drawback of this approach is its high space complexity, which may be $O(m^{k+1})$ in the worst case, while the time complexity is $O(n)$.

2.1 Simulation of the Nondeterministic Automaton $M_H$
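Before turning to the concrete algorithms, the two options listed above can be contrasted in a small Python sketch. It is only an illustration under assumptions: the helper names (`step`, `accepted_level`, `search_by_simulation`, `determinize`, `search_by_dfa`) are hypothetical, the simulation keeps an explicit set of active (level, depth) states, and the determinization is the standard subset construction restricted to reachable state sets. The text is assumed to contain only symbols of the given alphabet.

```python
def step(active, symbol, pattern, k):
    """One move of the k-mismatch NFA from a set of active (level, depth) states."""
    m = len(pattern)
    nxt = {(0, 0)}  # the self-loop keeps the initial state active
    for (i, j) in active:
        if j == m:
            continue                 # final states have no successors here
        if symbol == pattern[j]:
            nxt.add((i, j + 1))      # match: stay on the same level
        elif i < k:
            nxt.add((i + 1, j + 1))  # mismatch: move one level down
    return frozenset(nxt)


def accepted_level(state_set, m):
    """Smallest level of a final state in the set, or None if there is none."""
    levels = [i for (i, j) in state_set if j == m]
    return min(levels) if levels else None


def search_by_simulation(text, pattern, k):
    """Option 1: recompute the set of active NFA states for every text symbol."""
    active, hits = frozenset({(0, 0)}), []
    for pos, symbol in enumerate(text, start=1):
        active = step(active, symbol, pattern, k)
        level = accepted_level(active, len(pattern))
        if level is not None:
            hits.append((pos, level))  # (end position, number of mismatches)
    return hits


def determinize(pattern, k, alphabet):
    """Option 2: subset construction over the reachable sets of NFA states."""
    start = frozenset({(0, 0)})
    table, todo = {}, [start]
    while todo:
        s = todo.pop()
        if s in table:
            continue
        table[s] = {a: step(s, a, pattern, k) for a in alphabet}
        todo.extend(table[s].values())
    return start, table


def search_by_dfa(text, pattern, k, alphabet):
    """Run the precomputed DFA: one table lookup per text symbol."""
    start, table = determinize(pattern, k, alphabet)
    accept = {s: accepted_level(s, len(pattern)) for s in table}
    state, hits = start, []
    for pos, symbol in enumerate(text, start=1):
        state = table[state][symbol]
        if accept[state] is not None:
            hits.append((pos, accept[state]))
    return hits
```

Both searches report the same pairs of end position and mismatch count. The first pays for a set update at every text symbol, while the second pays once for the transition table and then does constant work per symbol.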

Dynamic Programming Algorithm. We show how the well-known "dynamic programming" algorithm DP [WF74], [Uk85] simulates the nondeterministic automaton $M_H$. For the case of string matching with $k$ mismatches, we modify this algorithm in the following way: Evaluate the $(m + 1) \times (n + 1)$ matrix $D$ defined by the following recurrence relation:

$$D[0, j] = 0, \quad 0 \le j \le n,$$
$$D[i, 0] = i, \quad 0 \le i \le m,$$
$$D[i, j] = \begin{cases} D[i-1, j-1] & \text{if } p_i = t_j, \\ D[i-1, j-1] + 1 & \text{otherwise}, \end{cases} \quad 1 \le i \le m,\ 1 \le j \le n.$$

Fig. 2. [Figure: the nondeterministic automaton $H_1(aba)$.]


Example 1. The nondeterministic automaton $H_1(aba)$ for the pattern $aba$ and $k = 1$ is shown in Fig. 2. The dynamic programming algorithm computes the following matrix $D$ for the string $aabbabaabbb$:

D        a  a  b  b  a  b  a  a  b  b  b
      0  0  0  0  0  0  0  0  0  0  0  0
a     1  0  0  1  1  0  1  0  0  1  1  1
b     2  2  1  0  1  2  0  2  1  0  1  1
a     3  2  2  2  1  1  3  0  2  2  1  2
                  ↑  ↑     ↑        ↑
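The recurrence for $D$ and the matrix of Example 1 can be reproduced with a minimal Python sketch. The function names `mismatch_matrix` and `occurrences` are assumptions of this illustration. As a further assumption, `occurrences` only inspects columns $j \ge m$, because a shorter text prefix cannot contain a substring of the pattern's length.

```python
def mismatch_matrix(pattern: str, text: str):
    """Fill the (m+1) x (n+1) matrix D of the k-mismatch recurrence."""
    m, n = len(pattern), len(text)
    D = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 stays all zeros
    for i in range(1, m + 1):
        D[i][0] = i
        for j in range(1, n + 1):
            # Only the diagonal neighbour is used: the Hamming distance
            # allows replacements but no insertions or deletions.
            D[i][j] = D[i - 1][j - 1] + (pattern[i - 1] != text[j - 1])
    return D


def occurrences(pattern: str, text: str, k: int):
    """End positions j (1-based) with D[m][j] <= k, paired with D[m][j]."""
    m = len(pattern)
    D = mismatch_matrix(pattern, text)
    # Columns j < m are skipped (assumption of this sketch, see above).
    return [(j, D[m][j]) for j in range(m, len(text) + 1) if D[m][j] <= k]
```

Calling `occurrences("aba", "aabbabaabbb", 1)` reports the end positions 4, 5, 7 and 10, with one, one, zero and one mismatches respectively.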

The value of $D[i, j]$ is the number of mismatches between the pattern prefix $p_1 p_2 \cdots p_i$ and the text substring $t_{j-i+1} t_{j-i+2} \cdots t_j$. If $D[m, j] \le k$, then the pattern was found with $D[m, j]$ mismatches. This is marked by arrows in the matrix $D$. Therefore the value of $D[i, j]$ corresponds to the level in the NFA. There are only levels 0 and 1 in our automaton. The value of the index $i$ corresponds to the depth of the state in the NFA, which is the number of transitions from the initial state. Thus, the dynamic programming algorithm simulates the NFA by following in parallel all sequences of transitions for all prefixes of the pattern. It is clear from the example that it is useless to compute elements of the matrix $D$ that have values greater than $k$. This can be used to optimize the algorithm.

Shift-And Algorithm. Another way to simulate the nondeterministic automaton $M_H$ is the Shift-And algorithm [WM92]. This algorithm is a modification of the Shift-Or algorithm [BG92]. For the case of string matching with $k$ differences, the Shift-And algorithm is as follows: Evaluate the sequence of $(m + 1) \times (n + 1)$ matrices $R^d$, $0 \le d \le k$, defined by the following recurrence relations:

1. 2. 3. 4.

Rd[o,j]= Rd[i, O] = R~ = Rd[i, j] = l