Adaptive Pattern Matching

R.C. Sekar†

Bellcore, Rm 2A-274 445 South Street Morristown, NJ 07962. [email protected].

R. Ramesh

Dept. of Computer Science University of Texas at Dallas Richardson, TX 77083 [email protected]

I.V. Ramakrishnan

Dept. of Computer Science SUNY at Stony Brook Stony Brook, NY 11794. [email protected]

Abstract. Pattern matching is an important operation used in many applications such as functional programming, rewriting and rule-based expert systems. By preprocessing the patterns into a DFA-like automaton, we can rapidly select the matching pattern(s) in a single scan of the relevant portions of the input term. This automaton is typically based on left-to-right traversal (of the patterns) or its variants. By adapting the traversal order to suit the set of input patterns, it is possible to considerably reduce the space and matching time requirements of the automaton. The design of such adaptive automata is the focus of this paper. In this context we study several important problems that have remained open even for automata based on left-to-right traversals. Such problems include upper and lower bounds on space complexity, construction of optimal dag automata and the impact of typing in pattern matching. An interesting consequence of our results is that lazy pattern matching in typed systems (such as ML) is computationally hard, whereas it can be done efficiently in untyped systems.

1 Introduction

Pattern matching is a fundamental operation in a number of important applications such as functional and equational programming, term rewriting and theorem proving. In most of these applications, patterns are partially ordered by assigning priorities. For instance, in languages such as ML [5] and Haskell [6], a pattern occurring earlier in the text has higher priority than those following it. Applications that do not impose priorities can also be handled as a special case of matching with priorities. The typical approach to pattern matching is to preprocess the patterns into a DFA-like automaton that can rapidly select the patterns that match the input term¹. The main advantage of such a matching automaton is that all pattern matches can be identified in a single scan (i.e., no backtracking) of the portions of the input term relevant for matching, in time that is independent of the number of patterns. Fig. 1 shows such a matching automaton constructed on the basis of a left-to-right traversal of the patterns. This automaton can be represented by tables or compiled into case statements. Each state of the automaton corresponds to the prefix of the input term seen in reaching that state and is annotated with the set of patterns that can possibly match. For instance, state s4 corresponds to having inspected the prefix f(b, a, x), where x denotes the subterm that has not yet been examined. This state is annotated with the pattern set {1, 2, 3} since we cannot rule out a match for any of the three patterns on the basis of the prefix f(b, a, x).

† Research partially supported by NSF grants CCR-8805734, 9102159, 9110055 and NYS S&T grant RDG 90173. This research was completed at SUNY, Stony Brook.
¹ Only matches at the root of the input term are identified. Nonlinearity is not considered since even in applications that allow such patterns, most failures are associated with symbol mismatches, as observed in [3].

Figure 1: Left-to-right (left) and adaptive (right) automata for f(x, a, b), f(b, a, a), f(x, a, y) with textual-order priority. Here x and y denote variables.

Pattern matching automata have been extensively studied for well over a decade. Augustsson [1] and Wadler [14] describe pattern matching techniques based on left-to-right traversal. A drawback of these methods is that they may reexamine symbols, potentially several times; in the worst case they may end up testing each pattern separately against the input term. This drawback is overcome by the methods of Christian, Graf and Schnoebelen, which are again based on left-to-right traversals or variants thereof. However, their space requirements can become exponential in the size of the patterns. One way to improve space and matching time is to engineer a traversal order to suit the set of patterns or the application domain. We refer to such traversals as adaptive traversals, and to automata based on such traversals as adaptive automata. As traversal orders are no longer fixed a priori, an adaptive automaton must specify the traversal order. For instance, in the adaptive automaton shown in fig. 1, each state is labelled with the next argument position to be inspected from that state. Adaptive traversal has two main advantages over a fixed order of traversal such as left-to-right.

- The resulting automaton can be smaller, e.g., the adaptive automaton in fig. 1 has 8 states compared to 11 in the left-to-right automaton. The reduction factor can even become exponential.

- Pattern matching requires less time with adaptive traversals than with fixed-order traversals. Furthermore, fixed-order traversals cannot use the priority information to avoid inspection of symbols irrelevant for determining highest-priority matches. For instance, the adaptive automaton announces a match for pattern 1 after examining only the last two arguments of f, whereas the left-to-right automaton needs to inspect all three arguments.
Observe that examining unnecessary symbols runs counter to the goals of lazy evaluation in the context of functional programs. The design of adaptive automata is the focus of this paper. In the process, we study several important problems that have remained open even in the context of automata based on fixed-order traversals. These problems include lower and upper bounds on space complexity, construction of optimal dag automata and the impact of typing in pattern matching.

1.1 Summary of Results

We first formalize the concept of adaptive traversal and its special case, namely fixed traversal orders, and then present an algorithm for constructing adaptive automata.

- In section 3 we examine space and matching time complexity. We show that the space requirements of the automata can be quite large by establishing the first known tight exponential lower bounds on size. These results are quite difficult to obtain since the proofs must be independent of traversal order.

- In section 4 we present several techniques to synthesize traversal orders that can improve space and matching time. We first develop the important concept of a representative set, which forms the basis of several optimization techniques aimed at avoiding inspection of unnecessary symbols. We present a quadratic-time algorithm for computing representative sets in untyped systems, whereas we show that computing these sets is NP-complete for typed systems.

- We then present several powerful strategies for synthesizing traversal orders. Through an intricate example, we show that all of them can sometimes increase both space and matching time. We therefore present another strategy, based on selecting indices, that overcomes this drawback. Huet and Levy [7] established the importance of indices in the design of optimal automata for strongly-sequential patterns. Our results extend the applicability of indices even to patterns that are not strongly-sequential.

- In section 4 we also synthesize a traversal S(T) from a given traversal T (such as left-to-right) by inspecting index positions whenever possible. Using S(T) in place of T does not affect the termination properties of a functional program. So the programmer can continue to assume T, whereas an implementation can benefit from significant improvements in space and time using S(T).

- In section 5 we describe an orthogonal approach to space minimization based on dag representation. By tightly characterizing equivalence of states we directly build an optimal dag automaton. This important problem had remained open [4] even for left-to-right traversals.

- In section 6 we focus on the important problem of index computation in prioritized systems. Laville [9], and Puel and Suarez [11], have extended Huet and Levy's [7] index computation algorithm to deal with priorities. However, these algorithms require exponential time in the worst case. In contrast, we present the first polynomial-time algorithm for index computation in untyped systems. Furthermore, we show that this problem is co-NP-complete for typed systems. We therefore present powerful heuristics that can significantly speed up this process in typed systems.
Our work clearly brings forth the impact of typing in pattern matching. We have shown that several important problems in the context of pattern matching are unlikely to have efficient algorithms in typed systems, whereas we have given polynomial-time algorithms for them in untyped systems. This demonstrates that typing makes the algorithms inherently more complex, rather than serving as an optimization that aids in pruning the search space for index computation, as suggested in [9]. Finally, implications of our results are discussed in section 7. Some proof details are omitted in the main paper, but can be found in the appendix.

2 Preliminaries

In this section we introduce notation that will be used in the rest of this paper. We also present a generic algorithm for adaptive automaton construction that forms the basis for the results presented in later sections. We assume familiarity with the basic concept of a term. We use root(t) to denote the symbol appearing at the root of a term t. The notion of a position is used to refer to subterms in a term as follows. A position is either the empty string ε, which reaches the root, or p.i (p a position and i an integer), which reaches the ith argument of the root of the subterm reached by p. We use t/p to refer to the subterm of t reached by p and t[p ← s] to denote the term obtained by replacing the subterm t/p by s. A substitution σ maps variables to terms. An instance tσ of a term t is obtained by substituting σ(x) for every variable x in t. If t is an instance of u then we say u ≤ t and call u a prefix of t. The fringe of u is the set of all variable positions in u. We will use x, y and z to denote variables, '_' to denote any unnamed variable, and a, b, c, d, f to denote nonvariables. We say that two terms t and s unify iff they possess a common instance; t ⊔ s denotes the least such instance in the ordering given by ≤. A top-down traversal inspects nodes in a term successively and is characterized by a selection function that, having inspected a prefix u, chooses the next position to visit. The match set Lu of a prefix u is the set of patterns that unify with u. Let P denote the set of fringe positions of u at which at least one pattern in Lu has a nonvariable.

Adaptive and Fixed Traversals: An adaptive traversal is a top-down traversal wherein the position p ∈ P visited next is a function of the prefix u and the given set of patterns L. In a fixed traversal order, p is simply a function of P.

Pattern Match: We say that a pattern l ∈ L matches t iff t ≥ l and t is not an instance of any l′ with priority higher than that of l. If there is any such l in L then we say there is a pattern match for t. Among patterns of equal priority, we may announce a match for a term t if it is an instance of any one of these patterns.

Ambiguous and Unambiguous Systems: A set of patterns L is said to be ambiguous whenever more than one pattern can match some given term. Otherwise L is unambiguous.

Typed Systems: In typed systems, the set of allowable input terms is constrained by a type discipline. For instance, this constraint may take the form that the arguments of a function f must be drawn from the set {T, F}. In this case, the terms f(T, F), f(F, T), f(T, T) and f(F, F) are allowable whereas f(2, T) is not.

Untyped Systems: In untyped systems, there is no such type discipline. Hence the allowable input terms include any term over any alphabet Σ′ that is a superset of the alphabet Σ used in the patterns.

Index: A fringe position p of a prefix u is said to be an index iff for every term t ≥ u for which there is a pattern match, t/p is a nonvariable. Intuitively, an index is a position that must be examined to determine a pattern match; e.g., any fringe position of u at which every pattern in Lu has a nonvariable is an index.
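The term notation above can be made concrete in code. The following is a minimal sketch under an encoding of my own (not from the paper): a term is either a variable ('var', name) or a pair (symbol, [args]); positions are tuples of 1-based argument indices.

```python
# Encoding (mine, for illustration): a term is ('var', name) for a
# variable, or (symbol, [args]) for a nonvariable term.
def is_var(t):
    return t[0] == 'var'

def subterm(t, p):
    """t/p: follow position p, a tuple of 1-based argument indices."""
    for i in p:
        t = t[1][i - 1]
    return t

def replace(t, p, s):
    """t[p <- s]: rebuild t with the subterm at p replaced by s."""
    if not p:
        return s
    sym, args = t
    i = p[0] - 1
    return (sym, args[:i] + [replace(args[i], p[1:], s)] + args[i + 1:])

def instance_of(t, l):
    """u <= t for a linear pattern l: is t an instance of l?"""
    if is_var(l):
        return True
    return (not is_var(t) and l[0] == t[0] and len(l[1]) == len(t[1])
            and all(instance_of(s, a) for s, a in zip(t[1], l[1])))
```

For example, with t = f(b, a, a) and the pattern f(x, a, y), instance_of reports a match, and subterm(t, (2,)) retrieves the second argument a.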

2.1 Generic Algorithm to build Adaptive Automata

A state v of the automaton remembers the prefix u inspected in reaching v from the start state. Suppose that p is the next position inspected from v. From v there are transitions on all symbols c that are present at p in any l ∈ Lu. There will also be a transition from v on ≠, which is taken on seeing a symbol different from those on the other edges. A generic algorithm to construct an adaptive automaton is given below. Observe that priorities are implicitly handled by the definition of match.

Procedure Build(v, u)
1.  Let M denote the patterns that match u.
2.  If M ≠ ∅ then matchset[v] := M  /* state v announces a match for patterns in M */
3.  else
4.    p := select(u)  /* select is a function to choose the next position to inspect */
5.    pos(v) := p     /* the next position to inspect is recorded in the pos field */
6.    for each symbol c such that ∃l ∈ Lu with root(l/p) = c (for some nonvariable c) do
7.      create a new node vc and an edge from v to vc labelled c
8.      Build(vc, u[p ← c(y1, ..., yrank(c))])
9.    if ∃l ∈ Lu with a variable at p or at an ancestor of p then
10.     create a new node v≠ and an edge from v to v≠ labelled ≠
11.     Build(v≠, u[p ← ≠])

Build is similar to previously known algorithms such as those of Christian [3] and Huet and Levy [7]. The main difference is that Build is generic, whereas the others prespecified the select function. Build is invoked by the call Build(s0, x), where s0 is the start state of the automaton. It takes two parameters: v specifies a state of the automaton and u specifies the prefix examined in reaching state v.

[Figure 2 diagram omitted.] The patterns of figure 2 are:

l1: f(_, a(b, _, b), b)
l2: f(_, a(b, c, c), b)
l3: f(_, a(c, _, _), b)
l4: f(d(b, c), _, b)

Figure 2: Adaptive automaton constructed by Build for the patterns shown at top-right, with the priorities given at bottom-right (diagram not reproduced). The pos field and Lu are shown with each state. States labelled s denote identical subtrees.

This invocation of Build constructs the subautomaton rooted at v. Lines 6, 7 and 8 create transitions based on each symbol that could appear at p in any pattern in Lu. In line 8, Build is recursively invoked with the prefix extended to include the symbols seen on the transitions created in line 7. If there is a pattern in Lu with a variable at or above p, then a transition on ≠ is created at line 10 and Build is recursively invoked at line 11. Fig. 2 shows a set of patterns along with their priorities and the corresponding automaton constructed by Build. The automaton has 25 states, since the states labelled s all have identical subtrees, each with 4 states. Observe that there can be at most S positions of interest (recall that S is the sum of the sizes of all the patterns). Each position, which is an integer sequence, can be encoded as an integer. We omit the details of such an encoding, which can be found in implementations such as equals [10]. Using such an encoding, the symbol specified by any position can be accessed without any extra overhead when compared to fixed traversal orders. Finally, observe that the automata generated by Build have a tree structure, which is also the representation used by all previous pattern matching methods. The popularity of trees stems from their simplicity and the fact that they readily support a case-statement representation.
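Procedure Build can be rendered as runnable code. The sketch below is a simplified reading under assumptions of my own: terms are ('var', name) or (symbol, [args]) pairs, patterns are linear and listed in decreasing textual priority, and a match is announced as soon as the highest-priority live pattern is fully confirmed (a simplification of the match test in line 1 that is adequate for textual-order priorities).

```python
ANY = ('var', '_')        # an as-yet-unexamined subterm in the prefix
NEQ = ('#neq', [])        # marker: "differs from all listed symbols"

def is_var(t): return t[0] == 'var'

def pat_at(l, p):
    """Subterm of pattern l at position p (stops at a variable)."""
    for i in p:
        if is_var(l): return l
        l = l[1][i - 1]
    return l

def replace(t, p, s):
    if not p: return s
    sym, args = t
    i = p[0] - 1
    return (sym, args[:i] + [replace(args[i], p[1:], s)] + args[i + 1:])

def unifies(l, u):
    if is_var(l) or is_var(u): return True
    return l[0] == u[0] and all(unifies(a, b) for a, b in zip(l[1], u[1]))

def confirmed(l, u):
    """Every nonvariable of pattern l is already present in prefix u."""
    if is_var(l): return True
    if is_var(u): return False
    return l[0] == u[0] and all(confirmed(a, b) for a, b in zip(l[1], u[1]))

def fringe(u, p=()):
    if is_var(u): return [p]
    return [q for i, a in enumerate(u[1]) for q in fringe(a, p + (i + 1,))]

def leftmost(u, live):    # a select function: first useful fringe position
    for p in fringe(u):
        if any(not is_var(pat_at(l, p)) for l in live):
            return p

def build(u, patterns, select=leftmost):
    live = [i for i, l in enumerate(patterns) if unifies(l, u)]
    if confirmed(patterns[live[0]], u):      # top-priority live pattern done
        return ('match', live[0])
    p = select(u, [patterns[i] for i in live])
    edges = {}
    for i in live:                           # lines 6-8: one edge per symbol
        t = pat_at(patterns[i], p)
        if not is_var(t) and t[0] not in edges:
            ext = (t[0], [ANY] * len(t[1]))
            edges[t[0]] = build(replace(u, p, ext), patterns, select)
    if any(is_var(pat_at(patterns[i], p)) for i in live):
        edges['!='] = build(replace(u, p, NEQ), patterns, select)  # lines 9-11
    return ('test', p, edges)
```

With the left-to-right select shown, this reproduces the left-to-right automaton of fig. 1 for the patterns f(x, a, b), f(b, a, a), f(x, a, y); swapping in a select function that prefers indices yields the adaptive variant.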

3 Space and Matching Time Complexity

We now examine upper and lower bounds (independent of traversal order) on the space and matching time complexity of adaptive tree automata for several classes of patterns. Fig. 3 summarizes our results. Observe that the contribution of the depth of the automata (whose upper bound is S) to space is insignificant compared to the exponential lower bound; the major contribution comes from the breadth of the automata. Furthermore, even for patterns without variables (i.e., ground terms) it is hard to reduce the number of states [2], whereas reducing breadth is not. Therefore, in the rest of the paper we will use breadth as the measure of the size of the automaton. We now present the details of the lower bound proofs. The proofs for the upper bounds can be established along the lines of [4, 12]. The space bounds given in this section are all independent of the

² The size of a pattern is the number of nonvariable symbols in it.

Class of Patterns             Lower bound   Upper bound           Lower bound   Upper bound
                              on space      on space              on time       on time
Unambiguous, no priority      Ω(2^n)        O(∏_{i=1}^{n} |l_i|)  Ω(a)          S
Unambiguous, with priority    Ω(a^n)        O(∏_{i=1}^{n} |l_i|)  Ω(S)          S
Ambiguous                     Ω(a^n)        O(∏_{i=1}^{n} |l_i|)  Ω(S)          S

Figure 3: Space and matching time complexity of adaptive automata. Here n, S, a and |l_i| denote respectively the number of patterns, the sum of their sizes², the average size, and the size of the ith pattern.

    a a a _ _ _ a
    b _ _ a a _ _
    _ b _ b _ a _
    _ _ b _ b b _

Figure 4: Example matrix for n = 4

    a a _ _ _ a        b _ a a _ _
    b _ a a _ _        _ _ b _ a _
    _ b _ b b _        _ b _ b b _

Figure 5: Matrices representing Lu for the states reached by transitions on a and b (column 2 deleted; rows for patterns {1, 2, 4} and {2, 3, 4} respectively).

traversal order and are established using flat patterns that all have a root symbol f with arbitrarily large arity. It can be shown that, for purposes of building either the smallest automaton or one that does matching in the shortest possible time, flat patterns are equivalent to a set of patterns having a common prefix u whose fringe size equals the arity of f. (This is because every symbol in the common prefix will have to be inspected on any path to any final state. It then follows by theorem 4 (section 4.3) that the automaton that inspects all these symbols first is no larger than any other automaton.)

3.1 Unambiguous, Unprioritized Patterns

Consider a set of n nonoverlapping patterns over the alphabet {f, a, b} and variables. Since all flat patterns have the same root symbol f, we need only specify the arguments. Therefore n patterns can be represented by a matrix of n rows, where the ith row lists the arguments of f in the ith pattern. Each column has at most one occurrence of a, at most one occurrence of b, and the rest are all '_'s. For each pair of patterns l and l′, there is at least one column wherein l and l′ have different nonvariables, and so the system is unambiguous. Fig. 4 shows such a matrix, representing the four patterns f(a, a, a, _, _, _, a), f(b, _, _, a, a, _, _), f(_, b, _, b, _, a, _), f(_, _, b, _, b, b, _). Denote by S(n) the size of the smallest automaton for n such patterns. Now pick any position (i.e., column) to discriminate on. If this column contains only one nonvariable, then on a positive transition (i.e., a transition on a or b) we are once again left with the same Lu without any reduction in the problem size. For instance, in fig. 4, if we choose column 7 for inspection then we are left with the problem of building an automaton to match on the basis of the first six positions of each pattern. In other words, we are left with a matrix obtained from that of fig. 4 by deleting the last column, and hence the problem is still an instance of S(4). On the other hand, if the column contains two nonvariables then, based on the symbol seen in the column, we can partition the n patterns into two sets, each consisting of n − 1 patterns. These sets must be further discriminated, and hence we have two instances of S(n − 1); e.g., in the above example, inspecting position 2 results in the pattern sets {1, 2, 4} and {2, 3, 4} as shown in fig. 5. Hence:

S(n) ≥ 2 · S(n − 1), whose solution is Ω(2^n).
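One way to realize the matrix construction of this section in code is the following sketch (my own instantiation, not necessarily the exact family of fig. 4): take one column per pair of patterns, with a in the earlier row and b in the later one, so every pair is discriminated in exactly one column.

```python
from itertools import combinations

def make_matrix(n):
    """n flat patterns over {a, b, _}: column (j, k) has 'a' in row j
    and 'b' in row k, so rows j and k clash exactly there."""
    pairs = list(combinations(range(n), 2))
    return [['a' if i == j else 'b' if i == k else '_'
             for (j, k) in pairs]
            for i in range(n)]

def unambiguous(rows):
    """Every two rows have differing nonvariables in some column."""
    return all(any(x != y and '_' not in (x, y) for x, y in zip(r, s))
               for r, s in combinations(rows, 2))
```

For n = 4 this yields a 4-row, 6-column matrix satisfying the column discipline stated above (at most one a and one b per column) and pairwise discrimination, i.e. unambiguity.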

3.2 Unambiguous Prioritized Patterns

To derive a lower bound on the size of the automaton for an unambiguous prioritized set of patterns, consider the following set of flat patterns with textual-order priority:

f(c^m(a), x_2, x_3, ..., x_n), f(x_1, c^m(a), x_3, ..., x_n), ..., f(x_1, ..., x_{n−1}, c^m(a))

Using this set of patterns, we obtain a lower bound on space of Ω(a^n) as follows. Denote by S(n) the size of the automaton for patterns of the above form. It can be shown that the smallest automaton is obtained by first inspecting all the c's and then the a in the first column, then those in the second column, and so on. (This is because it can be shown that each position examined in the automaton thus obtained is an index. It then follows by lemma 4 (see section 4.3) that this automaton is no larger than any other automaton.) Now consider the first m states of this automaton. Each state has a ≠ branch that leads to a subautomaton for matching n − 1 patterns. Therefore S(n) ≥ m · S(n − 1), whose solution is m^n. Noting that m is the average size of the patterns, we have the lower bound Ω(a^n).

3.3 Ambiguous Patterns

Consider again the set of patterns used in section 3.2, but now assume that there is no priority among them. Observe that an automaton A for this set of patterns (which must report all matches, since there is no priority relationship among the patterns) can be used as an automaton for the prioritized system considered in the previous section: on reaching a final state v, we simply announce a match for the pattern with the highest priority among those in matchset[v]. Hence this automaton cannot be any smaller than the one for patterns with priority. Thus we arrive at a lower bound of Ω(a^n) for ambiguous unprioritized patterns. The same lower bound holds even when priorities are allowed, since the automaton for unprioritized patterns can be used for matching prioritized patterns in the manner just described.

3.4 Matching Time

One possible measure of matching time is the average root-to-leaf path length of the automaton. This measure can exhibit the following anomaly. Suppose an automaton A has the property that, for each path in automaton A′, the corresponding path(s) in A is shorter or of the same length. Then clearly A has better matching time, but this may not be reflected in the average root-to-leaf path lengths, because several paths in A′ may correspond to a single path in A. Suppose that there is one path P of length 2 in A and another path Q of length 7. Also assume that there are two paths in A′ corresponding to P, each of length 3, and one corresponding to Q, of length 7. Note that although each path in A is shorter than (or equal in length to) the corresponding path in A′, the average path length of A is 4.5 whereas the average path length of A′ is only 4.3! A good measure that does not suffer from this anomaly is the average over the mean matching times of all patterns, where the mean matching time of a pattern l is the average length of those paths that lead to a matching state for l. This quantity is 4.5 for A whereas it is 5 for A′. Based on this measure, upper and lower bounds on matching time are given in fig. 3. These bounds are established with the same constructions used to obtain the space bounds.
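The path-length anomaly above can be checked numerically. The sketch below encodes the two automata simply as maps from pattern to the lengths of its matching paths (path sets as in the text: A has paths of length 2 and 7; A′ has two paths of length 3 for the first pattern and one of length 7 for the second).

```python
A  = {'P': [2], 'Q': [7]}      # pattern -> lengths of its matching paths
A2 = {'P': [3, 3], 'Q': [7]}   # the automaton A' from the text

def avg_path(aut):
    """Naive measure: average over all root-to-leaf path lengths."""
    lens = [x for ps in aut.values() for x in ps]
    return sum(lens) / len(lens)

def per_pattern(aut):
    """Anomaly-free measure: average of per-pattern mean matching times."""
    means = [sum(ps) / len(ps) for ps in aut.values()]
    return sum(means) / len(means)
```

The naive measure ranks A worse (4.5 versus roughly 4.3) even though every path of A is at most as long as its counterpart in A′; the per-pattern measure restores the expected ordering (4.5 versus 5).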

4 Synthesizing Traversal Orders

Observe that Build does not specify the selection function. We now present several techniques that can be used, in any combination, to derive selection functions for improving space and matching time.

4.1 Representative Sets

Suppose v is the state reached on seeing prefix u. Observe that there may be patterns in Lu for which no match is announced at any descendant of v. For instance, consider the patterns in fig. 2 and the prefix u = f(_, a(b, _, b), _). Although Lu = {l1, l4}, observe that a match for l4 can be declared only if the 3rd argument of f is b; in that case we declare a match for the higher-priority pattern l1. Inspecting any position only on behalf of a pattern such as l4 is wasteful; e.g., inspection of position 1 of u is useless, since it is irrelevant for declaring a match for l1. To avoid inspecting such positions, we identify a representative set L̄u: a minimal set consisting only of those patterns for which a match is announced at some descendant of v. In the above example L̄u = {l1}. Representative sets form the core of several optimizations discussed later. Laville's notion of accessible patterns [9] is the same as our representative set. Our contribution here is that our definition yields a simple algorithm for computing this set. The definition of accessible patterns does not yield such an algorithm, and so [9] uses the notion of compatible patterns (which corresponds to our match set Lu) in place of accessible patterns. Observe that we could also use Lu in place of L̄u in all the following optimizations, but doing so may make the optimizations less effective. For instance, our algorithm for directly building optimal dag automata (see section 5) will fail to identify some equivalent states if Lu is used in place of L̄u. To compute L̄u, note that any pattern l with the following property need not be present in L̄u: whenever a match for l is identified for a term t at a descendant of v, we can also identify a match for another pattern (of equal or higher priority) for the same term. A precise statement of the property is:

∀t ≥ u [t ≥ l ⇒ ∃l′ ∈ Lu [(t ≥ l′) ∧ (priority(l′) ≥ priority(l))]]

We can arrive at L̄u by repeatedly deleting patterns such as l. However, the above definition does not immediately lend itself to a decision procedure, since it refers to an infinite number of instances of u. Although the set of terms t to be considered can be restricted to a finite set, this still does not lead to an efficient procedure, since:

Theorem 1 Computing L̄u is NP-complete for typed systems.

The proof of this theorem uses a reduction similar to that used in the proof of theorem 9 (see section 6.2). We omit the proof details here. For untyped systems a more efficient method can be developed by considering only l ⊔ u and not its instances. Specifically,

Theorem 2 L̄u can be computed in O(nS) time in untyped systems by deleting each l such that

∃l′ ≠ l [(l ⊔ u) ≥ l′] ∧ [priority(l′) ≥ priority(l)]

Proof: First we need to show that the two characterizations of the l to be deleted are equivalent. It is easy to see that any l satisfying the second characterization also satisfies the first. To show that any l satisfying the first characterization also satisfies the second, let l1, ..., lk be all those patterns such that any common instance v of l and u (and hence of l ⊔ u) is also an instance of one of l1, ..., lk. Now consider the term t obtained from l ⊔ u by instantiating each variable with ≠. This term cannot be an instance of some li unless li ≤ (l ⊔ u). Observe that this li can serve the role of l′ in the second characterization. Thus the two characterizations are equivalent. It can easily be seen that computing L̄u using the simpler characterization takes O(nS) time.

Note that in the above definition, if l and l′ have equal priority and l ⊔ u = l′ ⊔ u, then either one can be retained in L̄u and the other discarded. (This choice does not affect the structure of the automaton below the current state, but only determines whether a match is announced for l or l′ in a final state below.) Except for the choice among such patterns, the second definition specifies L̄u uniquely. Finally, note that in unambiguous systems L̄u = Lu.
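The theorem-2 test for untyped systems can be sketched as follows, under assumptions of my own: terms are ('var', name) or (symbol, [args]) pairs, patterns are linear and listed in decreasing textual priority (so "priority(l′) ≥ priority(l), l′ ≠ l" becomes "l′ appears earlier in the list").

```python
def is_var(t): return t[0] == 'var'

def join(t, s):
    """t ⊔ s: least common instance of two unifiable linear terms."""
    if is_var(t): return s
    if is_var(s): return t
    assert t[0] == s[0]
    return (t[0], [join(a, b) for a, b in zip(t[1], s[1])])

def instance_of(t, l):
    """t >= l: t is an instance of pattern l."""
    if is_var(l): return True
    if is_var(t): return False
    return t[0] == l[0] and all(instance_of(a, b) for a, b in zip(t[1], l[1]))

def unifies(t, s):
    if is_var(t) or is_var(s): return True
    return t[0] == s[0] and all(unifies(a, b) for a, b in zip(t[1], s[1]))

def representative(patterns, u):
    """Keep l unless some earlier (higher-priority) live pattern l'
    satisfies (l ⊔ u) >= l', i.e. l' also matches whenever l does."""
    live = [l for l in patterns if unifies(l, u)]
    keep = []
    for i, l in enumerate(live):
        j = join(l, u)
        if not any(instance_of(j, l2) for l2 in live[:i]):
            keep.append(l)
    return keep
```

On the example above (patterns of fig. 2 with u = f(_, a(b, _, b), _)), this deletes l4, since l4 ⊔ u is an instance of the higher-priority l1, and returns L̄u = {l1}.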

4.2 Greedy Strategies

We resort to greedy strategies for implementing the function select, since a dynamic programming approach would take exponential time, as the subautomata themselves can have exponential size. We begin by listing several strategies for choosing a p such that:

1. the number of distinct nonvariables at p in the patterns in Lu is minimized.
2. the number of distinct nonvariables at p in the patterns in Lu is maximized.
3. the number of patterns having nonvariables at p is maximized.
4. max(|L1|, ..., |Lr|) is minimized, where L1, ..., Lr are the match sets of the children of v.
5. ∑_{i=1}^{r} |Li| is minimized.

For improving space, the rationale for these strategies is as follows. Strategy 1 locally minimizes the breadth. Strategy 2 attempts to distinguish quickly among patterns so as to contain the (potentially exponential) blow-up. Since only patterns with a variable at p are duplicated among the children of v, strategy 3 minimizes their number. Strategy 5 carries this one step further by minimizing the number of such duplications. Strategy 4 attempts to minimize the size of the largest subautomaton rooted at any child of v, since that determines the size of the automaton in case of blow-ups. For improving time, strategy 3 (and also 5) minimizes the number of patterns for which inspection of the symbol at p is unnecessary. For strategies 2 and 4, observe that in the worst case the time spent in matching below v is proportional to the sum of the sizes of the patterns in Lu; by quickly discriminating among patterns, this quantity can be rapidly reduced. All of the above greedy strategies suffer from the following drawback:

Theorem 3 For each of the above greedy strategies there exist pattern sets for which automata of smaller size and matching time can be obtained by making a choice different from that given by the greedy strategy.

Proof: For strategies 2, 3, 4 and 5, consider the following set of patterns with equal priorities:

f(a, a, _, _)  f(b, _, a, _)  f(c, _, _, a)  f(_, b, b, b)  f(_, c, c, c)  f(_, d, d, d)

After inspecting the root, all these strategies will choose one of positions 2, 3 or 4. It can be shown (by enumerating all possible matching automata for these patterns) that the smallest breadth, number of states and matching time obtainable by this choice are 20, 47 and 4.25 respectively. These figures can be reduced to 15, 45 and 4 respectively by choosing position 1. The construction of this example is quite intricate. The key idea is to make each of the pattern sets {1, 4, 5, 6}, {2, 4, 5, 6} and {3, 4, 5, 6} strongly-sequential³, whereas any set containing two of the first three patterns and one of the last three is not.

³ In strongly-sequential systems, any prefix u with |Lu| ≥ 1 must have an index.
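The quantities the five greedy strategies optimize can be sketched as a scoring function. This assumes flat patterns written as tuples of argument symbols with '_' for a variable (an encoding of my own); a select function would pick the position with the best score for the chosen strategy.

```python
def scores(patterns, p):
    """Return (distinct, covered, largest, total) for column p:
    strategies 1/2 optimize distinct, 3 optimizes covered,
    4 optimizes largest, 5 optimizes total."""
    col = [l[p] for l in patterns]
    nonvars = [c for c in col if c != '_']
    distinct = len(set(nonvars))          # strategies 1 (min) and 2 (max)
    covered = len(nonvars)                # strategy 3 (max)
    # match sets of the children: one per distinct symbol at p, plus
    # the not-equal child when some pattern has a variable at p
    children = [[l for l in patterns if l[p] in ('_', c)]
                for c in set(nonvars)]
    if '_' in col:
        children.append([l for l in patterns if l[p] == '_'])
    largest = max(len(ch) for ch in children)   # strategy 4 (min)
    total = sum(len(ch) for ch in children)     # strategy 5 (min)
    return distinct, covered, largest, total
```

On the six patterns of theorem 3, position 1 scores (3, 3, 4, 15) while position 2 scores (4, 4, 3, 14), so strategies 2, 3, 4 and 5 all prefer position 2 (and, by symmetry, positions 3 and 4) over position 1, as claimed in the proof.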

4.3 Selecting Indices

We now propose an important strategy that does not suffer from the drawbacks of the above strategies. The key idea is to select a p that is an index for u. We show that this strategy yields an automaton of smaller (or equal) size and smaller (or equal) matching time than is obtainable by any other choice. The importance of indices was known only in the context of strongly-sequential systems; our result demonstrates their applicability to patterns that are not strongly-sequential. Specifically, given a subautomaton A rooted at v that selects some position p, we construct another subautomaton A′ that selects an index q, and show that the size and matching time of A′ are no greater than those of A. This construction proceeds through a series of steps, each of which interchanges the order of q and another position inspected immediately before it. The following theorem (stated without proof) makes the construction precise.

Theorem 4 With p, q and v as above, suppose that pos(w) = q for every child w of v. Then the order of inspection of p and q can be interchanged without increasing the size or matching time of the automaton.

By repeating this interchange we can obtain A′ from A. It is quite interesting to note that repetition of interchange steps, which is permitted only under restrictive conditions, is sufficient to globally rearrange the traversal order in A to obtain A′. Although the theorem only asserts that the size and matching time of A′ are no larger than those of A, fig. 6 shows that they can be strictly smaller for A′. We remark that by performing interchange steps as above, size can sometimes be reduced by as much as an exponential factor, and time by O(n). Using arbitrary traversal orders (determined at compile time) is not appropriate in functional programming, since the termination properties of the program depend on the traversal order. Therefore the programmer must be made aware of the traversal order a priori.
Given this constraint, we now show that it is possible to internally change the traversal order in such a way that the termination properties are unaffected while the advantages of adaptive traversal are still realized.4

Definition 5 (Monotonic Traversals) Given a prefix u, suppose a traversal order T selects p as the next position to be visited. T is said to be monotonic iff for any prefix u' ⊒ u in which u'/p is still a variable, the traversal once again selects p, unless p corresponds to a variable position of every l ∈ Lu'.

Traversals that possess the monotonicity property include most known traversal orders, such as depth-first and breadth-first (as well as variations of these without left-to-right bias). In particular, this includes all the fixed traversal orders mentioned earlier, such as left-to-right and right-to-left, as well as the traversals used in implementations of strongly-sequential systems. Given a monotonic traversal T, we can obtain another traversal S(T) from it by inspecting indices whenever possible without affecting termination properties. Specifically, let S(T) denote the traversal that uses the following strategy to pick the position to inspect for a prefix u:

• If u has indices, then arbitrarily select one of them.
• Otherwise, select the position in the fringe of u that would be visited first by T.

Theorem 6 Size and matching time can never become worse if S(T) is used in place of T. Furthermore, each path in the automaton using S(T) examines a subset of the positions examined on the corresponding path in T.
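As a concrete illustration, the selection strategy of S(T) can be sketched as follows. This is a minimal sketch; the string encoding of positions and the ranking function are our own illustrative assumptions, not the paper's.

```python
def select_position(indices, fringe, traversal_order):
    """Pick the next position to inspect under S(T).

    indices: positions of the current prefix that are indices.
    fringe: positions of the prefix not yet inspected.
    traversal_order: callable ranking positions as T would visit them.
    """
    if indices:
        # An index exists: inspect it first (any index will do).
        return min(indices)
    # Otherwise fall back to the position T would visit first.
    return min(fringe, key=traversal_order)

# A left-to-right traversal: rank positions by their path components.
def left_to_right(pos):
    return tuple(int(c) for c in pos.split("."))

# No index available: S(T) behaves exactly like T.
print(select_position(set(), {"2", "1"}, left_to_right))   # 1
# Position 2 is an index: it is inspected ahead of T's choice.
print(select_position({"2"}, {"2", "1"}, left_to_right))   # 2
```

Since S(T) falls back to T at every prefix without an index, a programmer reasoning about termination under T sees no observable difference.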

4 We remark that this transformation is applicable only to functional languages with no side-effects.

[Figure 6: Example to illustrate the size and matching time reduction due to the transformation. The figure shows the automata for f(_, b), f(a, a) and f(_, a), with textual-order priority. Figure not reproduced.]

5 Minimizing Space using DAGs

The previous section discussed synthesizing traversal orders in order to improve space and time. We now discuss an orthogonal approach that minimizes space by sharing equivalent states. An obvious way to achieve sharing is to use standard FSA minimization techniques to optimize the tree automaton constructed by Build. A more efficient method is to identify the equivalence of two states without even generating the subautomata rooted at those states. Observe that the tree automaton may be exponentially larger than the dag automaton, in which case the naive approach uses exponential time and space, whereas direct construction can potentially use only polynomial space and time. This important problem of directly building an optimal automaton has remained open even when restricted to left-to-right traversals [4]. We now propose a solution to this problem.

To identify equivalent states, suppose that two prefixes u1 and u2 have the same representative set Lu and differ only in positions p such that every pattern in Lu has a variable at or above p. Since such positions are irrelevant for determining a match, the two prefixes are equivalent. On the other hand, it can also be shown that if they have different representative sets, or differ in any other position, then they are not equivalent. Based on this observation, we define the relevant prefix of u as the term obtained by replacing the subterms at positions such as p above by the symbol ≠. For instance, the prefixes corresponding to the different states marked `s' in Fig. 2 are different, but they all have the same relevant prefix f(x, ≠, b). By showing that two states are equivalent iff the corresponding relevant prefixes are identical, we establish:

Theorem 7 The automaton obtained by merging states with identical relevant prefixes is optimal.

Merging equivalent states as described above can substantially reduce the space required by the automaton; e.g., the tree automaton in Fig. 2 has 25 states, which sharing reduces to 16. Also recall that for the patterns in Fig. 4, the part of the automaton reached by positive transitions alone is exponential. We can show that by sharing states this part of the automaton becomes polynomial!
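The relevant prefix can serve directly as a canonical key for sharing. The sketch below is ours: terms are nested tuples with '_' for pattern variables and '?' for uninspected subterms of a prefix, and '!=' stands in for the paper's ≠ symbol; it assumes, as the construction guarantees, that patterns in the representative set agree with the prefix on every inspected root symbol.

```python
NEQ = '!='  # stands for the paper's ≠ symbol

def relevant_prefix(t, pats):
    """Replace subterms of prefix t at positions where every pattern in
    the representative set `pats` has a variable at or above by NEQ."""
    if all(p == '_' for p in pats):
        # Every remaining pattern has a variable at or above this
        # position: whatever symbol was inspected here is irrelevant.
        return t if t == '?' else NEQ
    if t == '?':
        return t                 # uninspected fringe position
    if not isinstance(t, tuple):
        return t                 # constant symbol, still relevant
    # Recurse into arguments; a pattern variable acts as a variable at
    # every sub-position.
    return (t[0],) + tuple(
        relevant_prefix(arg, ['_' if p == '_' else p[i + 1] for p in pats])
        for i, arg in enumerate(t[1:]))

# Both prefixes differ only at the second argument, where the single
# representative pattern has a variable: they share one relevant prefix
# and can be merged under the same dictionary key.
pats = [('f', '_', '_', 'b')]
print(relevant_prefix(('f', '?', 'a', 'b'), pats))   # ('f', '?', '!=', 'b')
print(relevant_prefix(('f', '?', 'b', 'b'), pats))   # ('f', '?', '!=', 'b')
```

A dag builder would keep a dictionary mapping relevant prefixes to already-built states and reuse a state whenever its key is present.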

5.1 Impact of DAGs on Space and Matching Time Complexity

Since sharing affects space requirements alone, all the results established so far that do not relate to space (such as Theorem 6 for time) continue to hold for dags as well. In what follows we discuss the impact of dags on some of the results established earlier regarding space. We can show that the upper bound on the size of dag automata is O(2^n · S), which is much smaller than the corresponding bound O(∏_{i=1}^{n} |l_i|) for tree automata. We can also establish a lower bound of Ω(2^n) for ambiguous patterns. For unambiguous patterns, it is not clear whether the lower bound on size is exponential; for instance, the patterns used in the lower-bound proof on the size of tree automata for unambiguous patterns appear to possess a polynomial-size dag automaton. Reasoning about lower bounds becomes extremely complicated for dags, since it is difficult to capture the behavior of sharing formally. Finally, all the greedy strategies, as well as the strategy of selecting indices, can in some cases increase the space of dag automata. The failure of the index selection strategy further demonstrates the complexity of sharing.

6 Index Computation

Recall that an index for a prefix u is a position on its fringe that must be inspected in order to announce a match for any pattern in Lu. In the absence of priorities, the positions that must be inspected to announce a match for a pattern l are exactly those fringe positions at which l has a nonvariable. With priorities, however, we may have to inspect positions at which l has a variable in order to rule out a match for higher-priority patterns. But it is not obvious which variable position of l must be inspected, and so it is not clear how to compute indices in prioritized systems. Laville [8] therefore proposed an indirect method for index computation that first transforms the prioritized patterns into an equivalent set of unprioritized patterns, which is then used for index computation. For each pattern l, the transformation generates the set Ml of its instances that are not instances of any higher-priority pattern. (For typed systems, only those instances that observe the type discipline are generated.) The transformed system is ∪_{l ∈ L} Ml.

Puel and Suarez [11] developed a compact representation for the sets Ml based on the notion of constrained terms. These are terms with constraints placed on the substitutions taken by the variables in them. For instance, the constrained term {eq(x, y) | (x ≠ a) ∨ (y ≠ b)} denotes the set of terms eq(t1, t2) with either t1 ≠ a or t2 ≠ b. A precise semantics of constrained terms is given by first regarding a term t with variables as denoting the set I(t) of its instances. The constrained term {t | φ} then denotes the set of instances of t that also satisfy the constraint formula φ. To define a constraint formula, first define an atomic constraint to be of the form t ≠ s or of the form x ≠ s, where x is a variable in t and all the variables in s are unnamed.5 In the former case the terms that satisfy the constraint must belong to the set I(t) − I(s); in the latter case, the substitution taken by x must not be an instance of s. An arbitrary constraint is obtained by combining atomic constraints using ∨ and ∧. A constraint φ ∨ ψ is satisfied by all (and only) terms that satisfy either φ or ψ; similarly, φ ∧ ψ is satisfied by all (and only) terms that satisfy both φ and ψ. Finally, note that in typed systems there is always an implicit constraint, imposed by the type discipline, that a variable be substituted only by a term of the same type as the variable. In what follows we use several methods to simplify constrained terms; these methods are quite intuitive, and their correctness readily follows from the above semantics. Using constrained terms, the set Ml can be represented compactly as

{l | (l ≠ l1) ∧ ... ∧ (l ≠ lk)}

where l1, ..., lk are all the patterns with priority greater than l. The constraints l ≠ li can be simplified to constraints on the variables in l. To simplify a constraint l ≠ l', we first check whether the two patterns unify. If not, the constraint simplifies to true. Otherwise, let x1, ..., xn be all the variables in l that are substituted by nonvariable terms (say t1, ..., tn) when unifying l and l'. Then the simplified constraint is

(x1 ≠ t1) ∨ ... ∨ (xn ≠ tn)

Using this procedure, simplification of all the constraints on a pattern l can be done in O(S) time. For illustration, consider the following example, which defines equality on an enumerated data type.

5 This is to prevent having constraints such as x ≠ a(y) ∧ y ≠ b, which would complicate the development of the material in the rest of the section.

t = a1 | a2 | ... | an.

    eq(a1, a1) = true
        ...
    eq(an, an) = true
    eq(x, y)   = false

Note that no two among the first n patterns unify, so the condition eq(ai, ai) ≠ eq(aj, aj) holds vacuously, and hence the first n patterns appear unchanged in the transformed system. The last rule, however, unifies with every other rule and gets translated into

{eq(x, y) | (x ≠ a1 ∨ y ≠ a1) ∧ (x ≠ a2 ∨ y ≠ a2) ∧ ... ∧ (x ≠ an ∨ y ≠ an)}    (1)

where the condition in the ith conjunct prevents instances of the ith pattern from being recognized as instances of the last rule. As seen above, the constraints are in CNF, and it is not clear how indices can be obtained from them. Puel and Suarez therefore convert the constraint into DNF, which makes apparent the structure of the terms in the set denoted by a constrained term. For instance, when n = 2 the above constrained term simplifies to:

{eq(x, y) | (x ≠ a1 ∧ x ≠ a2) ∨ (x ≠ a1 ∧ y ≠ a2) ∨ (x ≠ a2 ∧ y ≠ a1) ∨ (y ≠ a1 ∧ y ≠ a2)}

Note that in typed systems, constraints such as x ≠ a1 ∧ x ≠ a2 simplify to false, and so the above constraint simplifies to [x ≠ a1 ∧ y ≠ a2] ∨ [x ≠ a2 ∧ y ≠ a1].

Once a constrained term tc = {t | φ1 ∨ ... ∨ φm} is in DNF, the indices of a prefix u w.r.t. tc can easily be picked as follows. Firstly, any variable position of u at which t has a nonvariable symbol is an index for u. Secondly, consider any variable position p at which t has a variable x, but a constraint on the substitution for x appears in every disjunct φi. For any term to be an instance of this constrained term, it must satisfy one of the constraints φ1 through φm; since x appears in every one of these disjuncts, its substitution must be examined in order to determine whether a term satisfies the constraint. Therefore all such positions are also indices of u.

The above conversion of the constraint on a pattern l from CNF to DNF is very expensive and can take O(|l|^n) time. Therefore Puel and Suarez's algorithm, which is based on such a conversion, has exponential time complexity for both typed and untyped systems. Laville's algorithm is also exponential for typed and untyped systems. In contrast, we now present the first polynomial-time algorithm for untyped systems, one that operates directly on the original patterns.
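The second of the two rules above can be sketched as follows (the first rule, for nonvariable positions of t, is direct). The encoding of each disjunct as the set of variables it constrains is our own simplification.

```python
def indices_from_dnf(term_vars, disjuncts):
    """Indices of a prefix w.r.t. a DNF constrained term {t | φ1 ∨ ... ∨ φm}.

    term_vars: variables of t at the prefix's fringe.
    disjuncts: each φi given as the set of variables it constrains.
    A variable constrained in every disjunct must be examined to decide
    whether any φi holds, hence it is an index.
    """
    if not disjuncts:
        return set()
    constrained_everywhere = set.intersection(*disjuncts)
    return set(term_vars) & constrained_everywhere

# Typed case for n = 2: [x≠a1 ∧ y≠a2] ∨ [x≠a2 ∧ y≠a1].  Both x and y
# are constrained in every disjunct, so both are indices.
print(sorted(indices_from_dnf({'x', 'y'}, [{'x', 'y'}, {'x', 'y'}])))

# Untyped case for n = 2: the four-disjunct DNF above.  No variable is
# constrained in every disjunct, so this rule yields no index.
print(sorted(indices_from_dnf({'x', 'y'},
                              [{'x'}, {'x', 'y'}, {'x', 'y'}, {'y'}])))
```

The contrast between the two calls mirrors the typed/untyped gap discussed in the rest of this section.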

6.1 Algorithm for Index Computation in Untyped Systems

The indices of a prefix are computed in two steps. First we compute the set of indices of the prefix w.r.t. each of the constrained patterns in Lu; the intersection of the sets thus computed yields the indices of the prefix w.r.t. Lu. We compute the indices of a prefix u w.r.t. a single constrained pattern l as follows. Let l1, l2, ..., lk be the patterns in Lu that have priority over l and also unify with l. The following two steps specify the indices of u w.r.t. l:
1. Each variable x in u such that l has a nonvariable at the position corresponding to x.
2. Each variable x that must be the only position to be instantiated in l ⊔ u to determine a match for some higher-priority pattern lj.

We remark that a similar algorithm was suggested by Laville as a heuristic for fast index computation; however, the question of the power (or completeness) of the heuristic was not addressed. We now illustrate the algorithm on l1 = f(a, b, c), l2 = f(a, _, _) and l3 = f(_, _, c) (with textual-order priority) and the prefix u = f(x, y, z). Observe that x, y and z are all indices w.r.t. l1 by step 1. The only index w.r.t. l2 is x, by step 1; step 2 does not yield any additional indices for l2. For l3, z is an index by step 1 and x is an index by step 2 (with lj = l2). The intersection of all these sets is {x}, and x is therefore an index for u. Index computation takes O(nS) time using this algorithm. We show:

Theorem 8 The algorithm for computing indices in untyped systems is sound and complete.

Proof: We need only establish that the algorithm computes all and only the indices of u w.r.t. the constrained pattern lc = {l | (l ≠ l1) ∧ ... ∧ (l ≠ lk)}.

Clearly, any position computed using step 1 is an index w.r.t. lc. For any position selected using step 2, observe that the constraint l ≠ lj simplifies to x ≠ t for some term t; the constraint x ≠ t therefore appears in every disjunct in the DNF of the constraint of lc. All the positions selected in step 2 are thus indices w.r.t. lc.

For completeness, we need to show that if a position p in u is not selected by step 1 or step 2 then it is not an index w.r.t. lc. This is accomplished by giving an instance of lc that has a variable at p. Suppose that p is not selected by step 1 or step 2. Then for each li, either li/p is a variable, or more than one position in the fringe of l ⊔ u needs to be instantiated to determine a match for li. In either case, the disjunct obtained by simplifying l ≠ li contains at least one literal (xi ≠ t) for some variable xi different from l/p. Consider the term s obtained from l ⊔ u by substituting ≠ for each of the xi's mentioned above. The substitutions for the variables in l ⊔ u satisfy the constraints in lc; hence s is an instance of l and of u, but it still has a variable at p. This proves that p is not an index of u.
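The two-step computation and the worked example above can be sketched over a flattened representation: only the argument positions of the shared root f are kept, '_' marks a variable, and step 2 is rendered (our reading) as "exactly one uninstantiated position of l ⊔ u is fixed by the unifying higher-priority pattern lj".

```python
# Patterns are argument tuples under the common root symbol f; '_' is a
# variable. Priorities: l1 > l2 > l3 (textual order).
l1 = ('a', 'b', 'c')
l2 = ('a', '_', '_')
l3 = ('_', '_', 'c')
u  = ('_', '_', '_')   # the prefix f(x, y, z): nothing inspected yet

def unifies(s, t):
    return all(a == b or '_' in (a, b) for a, b in zip(s, t))

def indices_wrt(l, higher):
    n = len(l)
    # Step 1: fringe variables of u where l has a nonvariable symbol.
    step1 = {i for i in range(n) if u[i] == '_' and l[i] != '_'}
    # Step 2: the single remaining position fixed by some unifying
    # higher-priority pattern lj must be inspected to rule lj in or out.
    step2 = set()
    for lj in higher:
        if unifies(l, lj):
            need = {i for i in range(n)
                    if l[i] == '_' and u[i] == '_' and lj[i] != '_'}
            if len(need) == 1:
                step2 |= need
    return step1 | step2

all_indices = indices_wrt(l1, []) & indices_wrt(l2, [l1]) & indices_wrt(l3, [l1, l2])
print(all_indices)   # {0}: only the first argument (x) is an index of u
```

The printed result agrees with the worked example: x is the sole index of f(x, y, z).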

6.2 Index Computation in Typed Systems

For typed systems it is very unlikely that we can escape exponential complexity, since we show:

Theorem 9 The index computation problem for typed systems is co-NP-complete.

For intuition into the complexity gap between typed and untyped systems, we draw an analogy between index computation with constrained terms and the satisfiability problem. The literals in the constraints generated by Puel and Suarez's algorithm are all of the form x ≠ t, where x is a variable and t an arbitrary term. Such constrained terms are analogous to boolean formulas with only negative literals, and such formulas are trivially satisfiable (by a truth assignment that assigns false to every variable); similarly, index computation is simple in untyped systems. In typed systems, however, there are implicit positive constraints introduced by the type discipline. For instance, there is an implicit constraint (x = a1) ∨ ... ∨ (x = an) on the constrained term (1). Thus we have a constraint that contains both positive and negative literals; such a constrained term is analogous to a boolean formula with both positive and negative literals, and hence the complexity gap.

Proof: The index selection problem, when posed as a decision problem, takes the form "Does u possess an index w.r.t. the pattern set L?" To show that this problem is in co-NP, we show that the problem of deciding whether u has no index is in NP. To do this, let p1, ..., pr be the fringe positions of u. We first guess r instances t1, ..., tr of u such that ti/pi is a variable for 1 ≤ i ≤ r. Then we verify that each ti is an instance of (at least) one of the patterns in L, and if

so we declare that u has no index. All this can clearly be accomplished in polynomial time, and hence the problem of determining whether u does not possess an index is in NP; the index selection problem is therefore in co-NP.

To show that the problem is co-NP-complete, we reduce the complement of satisfiability to it. Let φ1 ∧ ... ∧ φn be an instance of the satisfiability problem, where each φi is a disjunction of literals of the form x or ¬x, x ∈ {x1, ..., xm}. We transform this into an index computation problem in the following system of n + 1 patterns, with textual-order priority. The root of each pattern is the symbol f (which once again stands for the common prefix shared by all the patterns), with m + 1 arguments. The last m arguments of f are of a type that consists of the nonvariables a and b. The (n + 1)th pattern is f(x0, x1, ..., xm). To specify the first n patterns, let t1, ..., tn be terms that do not unify with each other, and assume also that there is a term t of the same type as t1, ..., tn that does not unify with any of them. The ith pattern (for 1 ≤ i ≤ n) is f(ti, s1, ..., sm), where sj is a or b depending on whether xj or ¬xj occurs in φi; if neither occurs in φi, then sj is a variable. Observe that the size of this pattern set is polynomial in the size of φ1, ..., φn. With this construction, we now show that determining whether f(x0, ..., xm) possesses an index is equivalent to determining whether φ1 ∧ ... ∧ φn is unsatisfiable. First we transform the above pattern set into a set of constrained patterns; the (n + 1)th pattern becomes

{f(x0, ..., xm) | (x0 ≠ t1 ∨ φ1) ∧ ... ∧ (x0 ≠ tn ∨ φn)}

Here we have slightly abused the notation, writing xj for xj ≠ a, and ¬xj for xj ≠ b (i.e., xj = a, by the type discipline). By heuristic 1 (Rule 1 of Section 6.3), this constraint can be simplified to

(x0 ≠ t1) ∨ (φ1 ∧ ... ∧ φn)

which shows that f(x0, ..., xm) has an index (namely x0) iff the input to the SAT problem is not satisfiable.
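The pattern construction in this reduction can be sketched as follows. The terms t1, ..., tn are rendered as distinct constants (which satisfies the pairwise non-unifiability requirement), and the clause encoding is our own.

```python
def patterns_for(clauses, m):
    """Build the n+1 patterns of the reduction (textual-order priority).

    clauses: each clause is a dict {var_index: polarity}, True for a
    positive literal, False for a negated one.
    Argument 0 carries the pairwise non-unifiable terms t1..tn (here
    distinct constants); arguments 1..m range over the type {a, b}.
    """
    pats = []
    for i, clause in enumerate(clauses):
        args = ['t%d' % (i + 1)]
        for j in range(1, m + 1):
            if j in clause:
                args.append('a' if clause[j] else 'b')
            else:
                args.append('_')          # variable position
        pats.append(tuple(args))
    # The (n+1)th, lowest-priority pattern f(x0, x1, ..., xm).
    pats.append(tuple('_' for _ in range(m + 1)))
    return pats

# (x1 ∨ ¬x2) ∧ x2 over variables x1, x2:
for p in patterns_for([{1: True, 2: False}, {2: True}], 2):
    print(p)
```

The pattern set is clearly polynomial in the size of the formula, as the proof requires.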

6.3 Heuristics for Fast Index Computation in Typed Systems

Suppose the constrained term is of the form

{l | [(x ≠ t1 ∨ φ1) ∧ ... ∧ (x ≠ tn ∨ φn)] ∧ [ψ1 ∧ ... ∧ ψk]}

where ψ1, ψ2, ..., ψk and φ1, φ2, ..., φn are disjuncts that do not contain occurrences of x; write Ψ for ψ1 ∧ ... ∧ ψk. Suppose that there exists a term t (valid for the type of x) such that t does not equal any of t1, t2, ..., tn. Then:

Rule 1: The indices obtainable from the above constraint are exactly those obtainable from the constrained term {l | [(x ≠ t1) ∨ (φ1 ∧ φ2 ∧ ... ∧ φn)] ∧ Ψ}.

If t1, t2, ..., tn are the only terms valid for the type of x then:

Rule 2: If each φi consists only of a constraint on a single variable y, then y is an index.

The power of Rule 1 is evident from the fact that in order to apply it we need only identify some variable x such that the constraints on x do not simplify to false.6 Once such an x is identified, we can avoid expanding into 2^n terms and instead concern ourselves with just two of the terms in the expansion, namely (x ≠ t1) and Φ' = φ1 ∧ ... ∧ φn. To illustrate the power of Rule 2, consider the constrained term in (1). Note that x ≠ a1 ∧ x ≠ a2 ∧ ... ∧ x ≠ an simplifies to false, and the φi's (which are of the form y ≠ ai) also contain only

6 By saying "φ reduces to false" we mean that φ is not satisfiable, i.e., it is a contradiction.

one variable, y, and hence y is an index. Interchanging the roles of x and y, we note that x is also an index. Computing these indices takes time linear in the sum of the sizes of the rules; in contrast, Puel and Suarez's algorithm spends exponential time reducing the constraint to DNF.

Proof of Correctness of Rule 1: Expanding and taking the conjunction with Ψ (= ψ1 ∧ ... ∧ ψk) we get

[(x ≠ t1 ∧ ... ∧ x ≠ tn) ∧ Ψ] ∨ [(φ1 ∧ ... ∧ φn) ∧ Ψ] ∨ [Δ ∧ Ψ]

where Φ' denotes φ1 ∧ ... ∧ φn and Δ is a disjunction of terms, each containing at least one constraint on x and some φ's. The first term in this expansion constrains only x. Therefore no variable other than x that appears in any φi can be an index unless it appears in every conjunct of the DNF of Ψ. Furthermore, since x appears in all the terms of the above expansion except Φ', it is clear that x will be an index iff Φ' reduces to false. This is exactly captured by considering only [(x ≠ t1) ∧ Ψ] ∨ [Φ' ∧ Ψ].

Proof of Rule 2: Once again we expand as above and eliminate the term (x ≠ t1 ∧ ... ∧ x ≠ tn), which reduces to false. Every other conjunct in the expansion contains at least one φi. If each φi is of the form y ≠ ti, then each conjunct in the DNF constrains y, and so y is an index.
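Rule 2 on the eq example can be sketched as follows. The encoding of each conjunct as a pair (ti, variables of φi) is our own simplification.

```python
def rule2_index(conjuncts, type_terms):
    """Rule 2 on a constraint [(x≠t1 ∨ φ1) ∧ ... ∧ (x≠tn ∨ φn)]:
    if t1..tn exhaust the type of x and every φi constrains the same
    single variable y, then y is an index.

    conjuncts: list of (ti, set of variables constrained by φi).
    type_terms: all terms valid for the type of x.
    """
    if {t for t, _ in conjuncts} != set(type_terms):
        return None    # x ≠ t1 ∧ ... ∧ x ≠ tn is still satisfiable
    phi_vars = [vs for _, vs in conjuncts]
    common = set.intersection(*phi_vars)
    if len(common) == 1 and all(len(vs) == 1 for vs in phi_vars):
        return next(iter(common))
    return None

# Constraint (1) for n = 2: (x≠a1 ∨ y≠a1) ∧ (x≠a2 ∨ y≠a2), with x of
# the enumerated type {a1, a2}: Rule 2 reports y as an index.
print(rule2_index([('a1', {'y'}), ('a2', {'y'})], ['a1', 'a2']))   # y
```

Running the same check with x and y interchanged reports x as an index too, matching the discussion above, and the whole computation is linear in the size of the constraint.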

7 Concluding Remarks

In this paper we studied pattern matching with adaptive automata. We presented lower and upper bounds on the worst-case size of the automata for several classes of patterns. We then discussed how to improve space and matching time by synthesizing traversal orders. We showed that a good traversal order selects indices whenever possible and uses one of the greedy strategies (especially strategy 4 or 5) otherwise. Although the greedy strategies may sometimes fail, the complexity of the counterexamples suggests that such failures are rare. For functional programming, we synthesized a traversal S(T) from a monotonic traversal T. Since using S(T) does not affect termination properties, the programmer can assume T while an implementation benefits from significant improvements in space and matching time.

We also discussed an orthogonal approach to space minimization based on sharing equivalent states. Recall that even the index selection strategy may fail to improve the space of dag automata; this occurs because index selection may adversely affect the way in which the descendants of a state can be shared. Since it is difficult to predict sharing among descendant states, improving space without using indices does not appear practical. The best approach is therefore to use all the strategies of Section 4 and treat sharing as an additional space optimization over tree automata.

Our work clearly brings forth the impact of typing in prioritized pattern matching. We have shown that several important problems in the context of pattern matching are unlikely to have polynomial-time algorithms for typed systems, whereas we have given polynomial-time algorithms for them in untyped systems. This raises the question of whether it is worthwhile to consider typing for pattern matching: it is not clear how often typing information can be used to find an index (or to determine that a pattern does not belong to Lu) that cannot be found otherwise, whereas there is a significant computational penalty for both these problems when typing information is used.

References

[1] L. Augustsson, Compiling Pattern Matching, FPCA '85, LNCS 201.
[2] D. Comer and R. Sethi, Complexity of Trie Index Construction, FOCS '76.
[3] J. Christian, Fast Knuth-Bendix Completion: Summary, RTA '89, LNCS 355.
[4] A. Gräf, Left-to-Right Tree Pattern Matching, RTA '91, LNCS 488.
[5] R. Harper, R. Milner and M. Tofte, The Definition of Standard ML, Report ECS-LFCS-88-62, LFCS, University of Edinburgh, 1988.
[6] P. Hudak et al., Report on Haskell, draft proposed standard circulated by IFIP WG2.8, 1988.
[7] G. Huet and J.-J. Lévy, Computations in Nonambiguous Linear Term Rewriting Systems, Tech. Rep. No. 359, INRIA, Le Chesnay, France, 1979.
[8] A. Laville, Lazy Pattern Matching in the ML Language, FST&TCS '87.
[9] A. Laville, Implementation of Lazy Pattern Matching Algorithms, ESOP '88, LNCS 300.
[10] K. Owen, S. Pawagi, C. Ramakrishnan, I.V. Ramakrishnan and R.C. Sekar, Fast Parallel Implementation of Functional Languages: The EQUALS Experience, to appear in LFP '92.
[11] L. Puel and A. Suarez, Compiling Pattern Matching by Term Decomposition, LFP '90.
[12] Ph. Schnoebelen, Refined Compilation of Pattern Matching for Functional Languages, Science of Computer Programming, 11, pp. 133-159, 1988.
[13] R.C. Sekar, R. Ramesh and I.V. Ramakrishnan, Adaptive Pattern Matching, TR 14/91, SUNY, Stony Brook.
[14] P. Wadler, Efficient Compilation of Pattern Matching, in S.L. Peyton Jones, ed., The Implementation of Functional Programming Languages, Prentice Hall International, 1987.

Appendix A Representative Sets

Recall that a pattern l that can be deleted from Lu was characterized by:

∀t. (t is an instance of both u and l) ⇒ ∃l' ∈ Lu. (t is an instance of l' ∧ priority(l') > priority(l))

We now show:

Theorem 4 Computing Lu is NP-complete for typed systems.

Proof: The problem of computing Lu, when posed as a decision problem, takes the form "Is l ∈ Lu?" Suppose that a pattern l gets transformed into a constrained pattern {l | φ}. For a prefix u, the pattern l ∈ Lu iff there exists an instance t of u that is also an instance of {l | φ}. Therefore, to check whether l ∈ Lu, we first transform the input set of patterns into constrained patterns as outlined earlier; the pattern l becomes a constrained pattern {l | φ}, where φ is a constraint on the variables in l. To check whether l belongs to the representative set, we need only guess a term t and verify that it is an instance of the constrained pattern and of u. The whole process takes polynomial time, and hence the problem is in NP.

To show that the problem is NP-complete, we reduce the satisfiability problem to the problem of determining whether a pattern belongs to the representative set of a given prefix. Let φ1 ∧ ... ∧ φn be an instance of the satisfiability problem, where each φi is a disjunction of literals of the form x or ¬x, x ∈ {x1, ..., xm}. We transform this into an instance of determining whether a pattern belongs to Lu. Consider a function f taking m + 1 arguments of a type that consists of the two nonvariables a and b. This function is defined using the following n + 1 patterns, with textual-order priority. The (n + 1)th pattern is f(a, x1, ..., xm). Among the first n patterns, the ith pattern is f(x0, s1, ..., sm), where sj is a or b depending on whether xj or ¬xj appears in φi; if neither appears, sj is a variable. Now the last pattern, when transformed into a constrained pattern, becomes

{f(a, x1, ..., xm) | φ1 ∧ ... ∧ φn}

(There is a slight abuse of notation here in using xi to represent xi ≠ a, and ¬xi to represent xi ≠ b, in other words xi = a.) Now consider the prefix u = f(x0, x1, ..., xm) and observe that there can be a pattern match for the (n + 1)th pattern with an instance t of u iff t is an instance of the above constrained pattern. Clearly, such a t exists iff φ1 ∧ ... ∧ φn is satisfiable. Therefore, by the definition of Lu, the (n + 1)th pattern is in Lu iff the boolean formula is satisfiable.
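The connection in this proof can be checked by brute force on a small instance. The flattened representation and the exhaustive search below are illustrative only; the actual decision procedure is the NP guess described above.

```python
from itertools import product

def matches(pat, term):
    """A term matches a pattern if they agree at every nonvariable position."""
    return all(p == '_' or p == t for p, t in zip(pat, term))

def last_in_rep_set(pats, domains):
    """Brute force: the lowest-priority pattern is in L_u iff some
    instance of u matches it but matches no higher-priority pattern."""
    *higher, last = pats
    return any(matches(last, t) and not any(matches(p, t) for p in higher)
               for t in product(*domains))

# Reduction for (x1 ∨ ¬x2) ∧ x2: clause i becomes f(x0, s1, s2) with
# s_j = 'a' for a positive literal and 'b' for a negated one; the last,
# lowest-priority pattern is f(a, x1, x2).
pats = [('_', 'a', 'b'), ('_', '_', 'a'), ('a', '_', '_')]
domains = [['a', 't'], ['a', 'b'], ['a', 'b']]
print(last_in_rep_set(pats, domains))   # True: the formula is satisfiable
```

An unsatisfiable formula such as x1 ∧ ¬x1 yields patterns for which the same check returns False, matching the iff in the proof.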

B Selecting Indices

Theorem 7 With p, q and v as above, suppose that pos(w) = q for every child w of v. Then the order of inspection of p and q can be interchanged without increasing the size or matching time of the automaton. By repeating this interchange we can obtain A' from A.

Proof: Fig. 7 shows the subautomaton before and after the interchange. In the figure, a1, ..., am are all the symbols that appear at q in any pattern in Lu; similarly, b1, ..., bk are all the symbols that appear at p in any pattern in Lu. (Note that bk = ≠ if some pattern in Lu has a variable at p. Also, some of the states rij may not be present, because no pattern in Lu has a bi at p and an aj at q; such rij denote empty subautomata.) Note that the interchange as above is possible since the

[Figure 7: Automaton before and after an interchange step. Figure not reproduced.]

prefix inspected at an rij is the same in A as in A'. However, some of the states s'1, s'2, ..., s'm may be such that l/p is a variable for every l ∈ L_{prefix(s'i)}. In such a case all the subautomata r1i, r2i, ..., rmi below s'i are identical, and the position p need not be inspected at all; s'i can therefore be replaced by one of r1i, r2i, ..., rmi, thereby decreasing the size of A'. (For matching time, note that if an s'i does not have the above property then the positions inspected on all paths through s'i in A' are identical to those inspected on the corresponding paths in A; otherwise the positions inspected in A' are a strict subset of those inspected in A.) In the worst case no s'i has this property, and then the breadth of A' (as also its matching time) can equal that of A, but is never larger.

The proof that A' can be obtained from A by repetition of the interchange step is by induction on the height of A. In the base case the height of A is 1, in which case p = q and A = A'. Now assume inductively that the lemma holds for all the subautomata A1, ..., Am rooted at the children of v. We can therefore construct new subautomata A'1, ..., A'm, each rooted at a state that inspects q, such that |A'i| ≤ |Ai| and such that each path in A'i examines a subset of the positions seen on the corresponding path in Ai. Now use the above interchange step to get A'.

Theorem 9 Size and matching time can never become worse if S(T) is used in place of T. Furthermore, each path in the automaton using S(T) examines a subset of the positions examined on the corresponding path in T.

Proof: Start with an automaton A based on T. Use the construction of the above theorem to inspect indices as early as possible, obtaining another automaton A'. Clearly A' has the property required in the theorem, so we need only show that A' uses the traversal S(T). To do so we first make the following observation about A and A'.

Observation: Let v be a state in A' that examines a position that is not an index, and let v1, v2, ..., vn be the corresponding states in A. Then prefix(vi) ⊒ prefix(v) for 1 ≤ i ≤ n.

The observation is readily established by induction on the number of interchange steps performed to obtain A' from A. The theorem then follows. Consider any state v with prefix u such that pos(v) = p is not an index of u. By the observation there exists at least one state v' in A corresponding to v (also with pos(v') = p) such that prefix(v') ⊒ u. Furthermore, p is a position at which at least one pattern in Lu has a nonvariable, for otherwise v would not be in A'. Therefore, by monotonicity and the fact that T selects p for prefix(v'), T must also select p for u. Hence the select function of A' agrees with T at every non-index position, and so A' uses S(T).

C Minimizing Space Using DAGs

Theorem 10 The automaton obtained by merging states with identical relevant prefixes is optimal.

Proof: First we show that the states merged as above are indeed equivalent. Only the relevant prefix takes part in determining Lu and in choosing the position p. This implies that the subtrees created by two distinct invocations build(v1, t1) and build(v2, t2) are identical whenever t1 and t2 have the same relevant prefix, and so v1 and v2 are equivalent.

Next we show that no two states v1 and v2 with distinct relevant prefixes u1 and u2 are equivalent. If Lu1 ≠ Lu2, let l ∈ Lu1 and l ∉ Lu2. Then, by the properties of the representative set (and the correctness of the matching automaton), there is a path from v1 to a matching state for l; there is no such path from v2, and hence v1 and v2 are not equivalent. If Lu1 = Lu2 then, since u1 ≠ u2, there exists a position p with root(u1/p) ≠ root(u2/p). Since the representative sets are the same, u1 and u2 cannot have different nonvariable symbols at p; hence one of the relevant prefixes, say u1, has a nonvariable symbol at p and the other has a variable at p. If this nonvariable is ≠, every lhs in the representative set must contain a variable at p, and the definition of relevant prefix then implies that root(u2/p) must also be ≠; since we assumed that u1 and u2 differ at p, this is impossible. Therefore root(u1/p) must be a nonvariable symbol other than ≠. Let l be an lhs in the representative set such that root(l/p) is a nonvariable. Now, there must be a path from v1 to a matching state for l on which the symbol at p is not examined; in contrast, the symbol at p is examined on every path from v2 to a matching state for l. Therefore v1 and v2 are not equivalent.