Approximate String Matching by Finite Automata

Bořivoj Melichar

Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University, Karlovo nám. 13, 121 35 Prague 2, Czech Republic
e-mail: [email protected] fax: 298098 phone: 2435 7470

Abstract. Approximate string matching is a sequential problem and can therefore be solved using finite automata. A nondeterministic finite automaton is constructed for string matching with $k$ mismatches. It is shown how "dynamic programming" and "shift-and" based algorithms simulate this nondeterministic finite automaton. The corresponding deterministic finite automaton has $O(m^{k+1})$ states, where $m$ is the length of the pattern and $k$ is the number of mismatches. The time complexity of algorithms based on such a deterministic finite automaton is $O(n)$, where $n$ is the length of the text.
1 Introduction

Approximate string matching can be described in the following way: Given a text string $T = t_1 t_2 \cdots t_n$, a pattern $P = p_1 p_2 \cdots p_m$, and an integer $k$, $k < m < n$, we are interested in finding all occurrences of a substring $X$ in the text string $T$ such that the distance $D(P, X)$ between the pattern $P$ and the string $X$ is less than or equal to $k$. In this paper we will consider the Hamming distance.

The Hamming distance, denoted by $D_H$, between two strings $P$ and $X$ of equal length is the number of positions with mismatching symbols in the two strings. We will refer to approximate string matching as string matching with $k$ mismatches whenever $D$ is the Hamming distance.

The Levenshtein distance, denoted by $D_L$, or edit distance, between two strings $P$ and $X$, not necessarily of equal length, is the minimal number of editing operations insert, delete and replace needed to convert $P$ into $X$. We will refer to approximate string matching as string matching with $k$ differences whenever $D$ is the Levenshtein distance.

A nondeterministic finite automaton (NFA) is a 5-tuple $M = (Q, A, \delta, q_0, F)$, where $Q$ is a finite set of states, $A$ is a finite set of input symbols, $\delta$ is a state transition function from $Q \times A$ to the power set of $Q$, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is the set of final states.
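The problem statement for the Hamming distance can be illustrated by a brute-force reference, given here only as a minimal sketch. The function names `hamming_distance` and `naive_k_mismatches`, the choice to report 1-based end positions, and the sample strings are assumptions made for illustration; they are not part of the paper's method.

```python
def hamming_distance(p: str, x: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(p) != len(x):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(p, x))


def naive_k_mismatches(text: str, pattern: str, k: int) -> list:
    """Return the 1-based end positions j such that the substring of
    length m ending at j is within Hamming distance k of the pattern."""
    m, n = len(pattern), len(text)
    return [
        j
        for j in range(m, n + 1)
        if hamming_distance(pattern, text[j - m:j]) <= k
    ]


# Illustrative input (an assumption): pattern "aba", k = 1.
print(naive_k_mismatches("aabbabaabbb", "aba", 1))  # [4, 5, 7, 10]
```

This direct check inspects every window of length $m$ and is useful mainly as a correctness baseline for the automaton-based approaches.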
Hlaváč, Šára (Eds.): CAIP '95 Proceedings, LNCS 970 © Springer-Verlag Berlin Heidelberg 1995
A finite automaton is deterministic (DFA) if $\delta(q, a)$ has exactly one element for any $q \in Q$ and $a \in A$. In the following, we will use the alphabet $A = \{s_1, s_2, \ldots, s_{|A|}\}$. In our case, if $p \in A$ then $\bar{p}$ is the complement set $A - \{p\}$.

2 String Matching with k Mismatches
First, we construct a nondeterministic finite automaton $M_H$ for a given pattern $P = p_1 p_2 \cdots p_m$, alphabet $A = \{s_1, s_2, \ldots, s_{|A|}\}$, and $k < m$. This automaton is depicted in Fig. 1. Each state $q \in Q$ has a label $(i, j)$, where $i$, $0 \le i \le k$, is the level of $q$, and $j$, $0 \le j \le m$, is the depth of $q$.

Fig. 1. (Figure not reproduced: the automaton $M_H$, with its states arranged in rows from level 0 to level $k$.)

In the automaton $M_H$ there are $k + 1$ levels of state sequences. Every level ends in one of the final states $(0, m), (1, m), \ldots, (k, m)$. These final states are accepting states of strings with $0, 1, 2, \ldots, k$ mismatching symbols, respectively. The sequence of states of level 0 corresponds to the given pattern without any mismatch. Levels $1, 2, \ldots, k$ correspond to strings with $1, 2, \ldots, k$ mismatching symbols, respectively. From each nonfinal state of level $i$, $0 \le i < k$, there exists a transition to a state of level $i + 1$, which means that a mismatch occurs. Moreover, there is a self loop in the initial state $(0, 0)$ for every symbol of the alphabet $A$. This automaton accepts all strings having a postfix $X$ such that $D_H(P, X) \le k$.
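The construction of $M_H$ described above can be sketched in code. This is a minimal illustration, not the paper's implementation: the names `build_nfa` and `simulate`, the representation of the transition function as a Python closure, and the assumption that level $i$ contains only the depths $i, \ldots, m$ are choices made here for the example.

```python
def build_nfa(pattern, k):
    """Return (states, delta, initial, finals) for the k-mismatch NFA.

    States are labelled (level, depth). delta is a function returning a
    set of states, so the mismatch edges (labelled with the complement
    of the expected pattern symbol) need no explicit alphabet.
    """
    m = len(pattern)
    # Assumption: level i holds depths i..m, since i mismatches need i symbols.
    states = {(i, j) for i in range(k + 1) for j in range(i, m + 1)}
    finals = {(i, m) for i in range(k + 1)}

    def delta(state, symbol):
        i, j = state
        targets = set()
        if state == (0, 0):
            targets.add((0, 0))              # self loop on every symbol
        if j < m:
            if symbol == pattern[j]:         # pattern[j] is p_(j+1)
                targets.add((i, j + 1))      # match: stay on the same level
            elif i < k:
                targets.add((i + 1, j + 1))  # mismatch: move to the next level
        return targets

    return states, delta, (0, 0), finals


def simulate(text, pattern, k):
    """Return (end position, mismatches) for every text position at which
    a final state is active; positions are 1-based."""
    _, delta, initial, finals = build_nfa(pattern, k)
    active = {initial}
    hits = []
    for pos, symbol in enumerate(text, start=1):
        active = set().union(*(delta(q, symbol) for q in active))
        reached = active & finals
        if reached:
            hits.append((pos, min(i for i, _ in reached)))
    return hits


print(simulate("aabbabaabbb", "aba", 1))
```

For the illustrative pattern aba and $k = 1$, the sketch builds 7 states: four on level 0 and three on level 1. Keeping the whole set of active states is the most direct way to follow all nondeterministic choices at once.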
The number of states of the automaton $M_H$ is

$$(m + 1) + m + (m - 1) + \cdots + (m - k + 1) = (k + 1)\left(m + 1 - \frac{k}{2}\right).$$

The only problem is that this finite automaton is nondeterministic. There are two ways to use it as the basis of a matching algorithm:

1. To simulate the nondeterministic automaton in a deterministic way. Some known matching algorithms use this approach; two examples are shown below. The drawback of this approach is its high time complexity, which may be $O(n * m)$ in the worst case, while the space complexity is $O(k * m)$.
2. To construct an equivalent deterministic finite automaton. The drawback of this approach is its high space complexity, which may be $O(m^{k+1})$ in the worst case, while the time complexity is $O(n)$.

2.1 Simulation of the Nondeterministic Automaton $M_H$
Dynamic Programming Algorithm. We show how the well-known "dynamic programming" algorithm DP [WF74], [Uk85] simulates the nondeterministic automaton $M_H$. For the case of string matching with $k$ mismatches, we modify this algorithm in the following way: Evaluate the $(m + 1) \times (n + 1)$ matrix $D$ defined by the following recurrence relation:

$$D[0, j] = 0, \quad 0 \le j \le n,$$
$$D[i, 0] = i, \quad 0 < i \le m,$$
$$D[i, j] = \begin{cases} D[i-1, j-1] & \text{if } p_i = t_j, \\ D[i-1, j-1] + 1 & \text{otherwise,} \end{cases} \quad 0 < i \le m,\ 0 < j \le n.$$
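The recurrence can be evaluated directly, as in the following minimal sketch. The function name `dp_k_mismatches` and the list-of-lists layout are assumptions made for illustration. The sketch also reports only end positions $j \ge m$, on the assumption that a match must lie entirely inside the text.

```python
def dp_k_mismatches(text, pattern, k):
    """Fill the (m + 1) x (n + 1) matrix D and list the matches.

    Returns (D, matches), where matches holds (j, D[m][j]) for every
    1-based end position j >= m with D[m][j] <= k.
    """
    m, n = len(pattern), len(text)
    D = [[0] * (n + 1) for _ in range(m + 1)]   # D[0][j] = 0 for all j
    for i in range(1, m + 1):
        D[i][0] = i                             # D[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Only the diagonal neighbour is used: no insert or delete.
            mismatch = 0 if pattern[i - 1] == text[j - 1] else 1
            D[i][j] = D[i - 1][j - 1] + mismatch
    # Assumption: skip j < m, where no full-length substring ends.
    matches = [(j, D[m][j]) for j in range(m, n + 1) if D[m][j] <= k]
    return D, matches


D, matches = dp_k_mismatches("aabbabaabbb", "aba", 1)
for row in D:
    print(row)
print(matches)
```

Each entry depends only on its upper-left neighbour, so a single diagonal could be kept instead of the full matrix. The full matrix is retained here to make the correspondence with the recurrence explicit.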
Fig. 2. (Figure not reproduced: the nondeterministic automaton H1(aba).)
Example 1. The nondeterministic automaton H1(aba) for the pattern aba and $k = 1$ is shown in Fig. 2. The dynamic programming algorithm computes the following matrix $D$ for the string aabbabaabbb (rows are labelled by pattern symbols, columns by text symbols):

         a  a  b  b  a  b  a  a  b  b  b
      0  0  0  0  0  0  0  0  0  0  0  0
a     1  0  0  1  1  0  1  0  0  1  1  1
b     2  2  1  0  1  2  0  2  1  0  1  1
a     3  2  2  2  1  1  3  0  2  2  1  2
                  ↑  ↑     ↑        ↑
The value of $D[i, j]$ is the number of mismatches between the pattern prefix $p_1 p_2 \cdots p_i$ and the text substring $t_{j-i+1} t_{j-i+2} \cdots t_j$. If $D[m, j] \le k$, then the pattern was found with $D[m, j]$ mismatches. This is marked by arrows in the matrix $D$. Therefore the value of $D[i, j]$ corresponds to the level in the NFA. There are only levels 0 and 1 in our automaton. The value of the index $i$ corresponds to the depth of the state in the NFA, which is the number of transitions from the initial state. Thus, the dynamic programming algorithm simulates the NFA by following in parallel all sequences of transitions for all prefixes of the pattern. It is clear from the example that it is useless to compute elements of the matrix $D$ that have values greater than $k$. This can be used to optimize the algorithm.

Shift-And Algorithm. Another way to simulate the nondeterministic automaton $M_H$ is the Shift-And algorithm [WM92]. This algorithm is a modification of the Shift-Or algorithm [BG92]. For the case of string matching with $k$ differences the Shift-And algorithm is as follows: Evaluate the sequence of $(m + 1) \times (n + 1)$ matrices $R^d$, $0 \le d \le k$, defined by the following recurrence relations:
1. 2. 3. 4.
Rd[o,j]= Rd[i, O] = R~ = Rd[i, j] = l