
Theoretical Computer Science 370 (2007) 110–120 www.elsevier.com/locate/tcs

Obtaining shorter regular expressions from finite-state automata

Yo-Sub Han (a,*), Derick Wood (b)
(a) System Technology Division, Korea Institute of Science and Technology, P.O. Box 131, Cheongnyang, Seoul, Republic of Korea
(b) Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong

Received 18 April 2006; received in revised form 5 July 2006; accepted 14 September 2006 Communicated by Arto Kustaa Salomaa

Abstract
We consider the use of state elimination to construct shorter regular expressions from finite-state automata (FAs). Although state elimination is an intuitive method for computing regular expressions from FAs, the resulting regular expressions are often very long and complicated. We first examine the minimization of FAs to obtain shorter expressions. Then, we introduce vertical chopping based on bridge states and horizontal chopping based on the structural properties of given FAs. We prove that, to obtain shorter regular expressions, we should not eliminate bridge states until we have eliminated all non-bridge states. In addition, we suggest heuristics for state elimination that lead to shorter regular expressions based on vertical chopping and horizontal chopping.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Regular languages; Finite-state automata; State elimination; Bridge states; Vertical chopping; Horizontal chopping

1. Introduction
It is well known that the family of languages accepted by finite-state automata (FAs) is the same as the family of languages defined by regular expressions [16]. It can be proved by showing that we can construct FAs from regular expressions and that we can construct regular expressions from FAs. There are a number of FA constructions; for example, the Thompson construction [19], the position construction [9, 18] and the follow construction [14]. These constructions are inductive and, therefore, preserve the structural properties of the corresponding regular expressions. For instance, the size of a Thompson automaton is bounded by the size of a given regular expression [8] and the number of states in a position automaton is the number of character appearances in the corresponding regular expression plus one [3].

When converting FAs into regular expressions, we can use either linear equations [6] or state elimination [2]. We consider state elimination. State elimination was already in use in the 1960s, in particular by Brzozowski and McCluskey Jr. [2] and was carefully formulated by Wood [20]. The idea behind state elimination is simple. We keep

Part of this research was carried out while Han was at HKUST.
* Corresponding author. Tel.: +82 2 958 5608; fax: +82 2 958 5649.

E-mail addresses: [email protected] (Y.-S. Han), [email protected] (D. Wood).
0304-3975/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.tcs.2006.09.025

Y.-S. Han, D. Wood / Theoretical Computer Science 370 (2007) 110–120


Fig. 1. An example of state elimination. The dotted states are being removed.

removing states, except for the start and the final states of a given FA, while maintaining the transition information of the automaton, until there are no more states to eliminate. We illustrate state elimination in Fig. 1.

Apart from computing regular expressions for FAs, state elimination also has other applications: Giammarresi and Montalbano [7] proposed a method of obtaining a generalized automaton [6], which has strings as transition labels rather than characters, from a given FA using state elimination. They restricted state elimination to a subset of states that does not induce a cycle or a self-loop. Brüggemann-Klein and Wood [1] showed that Thompson automata can be transformed into position automata by eliminating states that have null transitions.

We investigate the structural properties of FAs that help to obtain shorter regular expressions in state elimination. We observe that if we decompose a given FA in certain ways, then we can construct a shorter regular expression compared with random selection of states in state elimination. Based on this observation, we propose two decomposition methods, vertical chopping and horizontal chopping.

In Section 2, we define some basic notions. In Section 3, we describe state elimination and suggest two ways to reduce the size of FAs. Then, we introduce vertical chopping and horizontal chopping of an FA in Sections 4 and 5. Furthermore, we prove that, to obtain a shorter regular expression, we should not eliminate bridge states, which are defined in Section 4, until we have eliminated all non-bridge states. Note that the size of a minimal regular expression for a given FA cannot even be efficiently approximated (if P ≠ PSPACE) [10]. Therefore, we believe that it is not easy to compute an optimal removal sequence for state elimination in polynomial time. On the other hand, we can compute bridge states in linear time in the size of a given minimal deterministic finite-state automaton (DFA).
Finally, we suggest some heuristics for state elimination that lead to shorter regular expressions.

2. Preliminaries
Let Σ denote a finite alphabet of characters and Σ* denote the set of all strings over Σ. A language over Σ is any subset of Σ*. The character ∅ denotes the empty language and the character λ denotes the null string. An FA A is specified by a tuple (Q, Σ, δ, s, F), where Q is a finite set of states, Σ is an input alphabet, δ ⊆ Q × Σ × Q is a (finite) set of transitions, s ∈ Q is the start state and F ⊆ Q is a set of final states. Let |Q| be the number of states in Q and |δ| be the number of transitions in δ. Then, the size of A is |A| = |Q| + |δ|. Given a transition (p, a, q) in δ, where p, q ∈ Q and a ∈ Σ, we say that p has an out-transition and q has an in-transition. Furthermore, p is a source state of q and q is a target state of p. A string x in Σ* is accepted by A if there is a labelled path from s to a final state in F that spells out x. Thus, the language L(A) of an FA A is the set of all strings spelled out by paths from s to a final state in F. We define A to be non-returning if the start state of A does not have any in-transitions and A to be non-exiting if a final state of A does not have any out-transitions. An FA A may have useless states, that is, states that can never be reached whatever the input string. We remove all such useless states using the reachability test of the


underlying digraph of A. Therefore, we can assume that A has only useful states: that is, each state appears on some path from the start state to some final state.

3. State elimination
Given an FA A = (Q, Σ, δ, s, F), the state elimination of a state q ∈ Q \ ({s} ∪ F) is the bypassing of q, q's in-transitions, q's out-transitions and q's self-looping transition with equivalent expression transition sequences. Namely, for each in-transition (pi, αi, q), for 1 ≤ i ≤ m and m ≥ 1, for each out-transition (q, γj, rj), for 1 ≤ j ≤ n and n ≥ 1, and for the self-looping transition (q, β, q) in δ, we construct a new transition (pi, αi · β* · γj, rj). If there is already a transition (p, ν, r) in δ with p = pi and r = rj for some expression ν, then we merge the two transitions into a single transition (p, (αi · β* · γj) + ν, r). We then remove q and all in-transitions and out-transitions of q, including the self-looping transition, from δ. We denote the resulting automaton by Aq = (Q \ {q}, Σ, δq, s, F).

State elimination is a very primitive operation on FAs. It was introduced by Brzozowski and McCluskey Jr. [2] to compute regular expressions from FAs. State elimination maintains the language accepted by a given FA while removing states. Note that we have regular expressions instead of single characters on a transition of Aq. We say that an FA with regular expressions on transitions is an expression automaton (EA) [2,11]. EAs are a generalization of FAs. Given an EA A, if we only allow characters on transitions, then A is a traditional FA and if we only allow strings, then A is a generalized automaton [6,7]. Therefore, EAs have more expressive power on each transition. On the other hand, they have the same expressive power as FAs overall [20]. It is easier to formulate state elimination for computing a regular expression if a given FA A is non-returning and non-exiting.
If an FA A = (Q, Σ, δ, s, F) is not non-returning or not non-exiting, then we transform A into a new FA A′ such that L(A′) = L(A) and A′ is non-returning and non-exiting as follows:
• non-returning: we introduce a new start state s′ and connect s′ to s by a null transition, (s′, λ, s).
• non-exiting: we introduce a new final state f′ and connect each fi ∈ F to f′ by a null transition, (fi, λ, f′) for fi ∈ F.
Now the resulting automaton A′ = (Q ∪ {s′, f′}, Σ, δ ∪ {(s′, λ, s)} ∪ {(fi, λ, f′) | fi ∈ F}, s′, f′) is non-returning and non-exiting. In addition, A′ has a single final state.

Proposition 1 (Han and Wood [11]). Let A = (Q, Σ, δ, s, f) be a non-returning and non-exiting FA with at least three states and q be a state in Q \ {s, f}. Then, L(Aq) = L(A) and Aq is also non-returning and non-exiting.

Once we eliminate all states in Q \ {s, f} for A that is non-returning and non-exiting, we obtain an EA A_{Q\{s,f}} = ({s, f}, Σ, (s, E, f), s, f), where E is the corresponding regular expression for A by Proposition 1. Therefore, state elimination is guaranteed to construct a regular expression from an FA.
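The bypassing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the dictionary representation `{(source, target): expression}`, the helper names (`cat`, `star`, `eliminate`, `to_regex`) and the example automaton are all assumptions made for this sketch, and expressions are plain strings that are not simplified further.

```python
LAMBDA = "λ"  # label of a null transition


def top_level_union(e):
    """True if expression e contains a '+' outside all parentheses."""
    depth = 0
    for ch in e:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "+" and depth == 0:
            return True
    return False


def cat(*parts):
    """Catenate expressions, dropping λ and parenthesising unions."""
    kept = [f"({p})" if top_level_union(p) else p for p in parts if p != LAMBDA]
    return "".join(kept) or LAMBDA


def star(e):
    """Kleene star; a missing self-loop (None) contributes λ."""
    if e is None or e == LAMBDA:
        return LAMBDA
    return f"{e}*" if len(e) == 1 else f"({e})*"


def eliminate(delta, q):
    """Bypass state q in an EA given as {(source, target): expression}."""
    loop = star(delta.get((q, q)))
    ins = [(p, e) for (p, r), e in delta.items() if r == q and p != q]
    outs = [(r, e) for (p, r), e in delta.items() if p == q and r != q]
    new = {k: e for k, e in delta.items() if q not in k}
    for p, alpha in ins:
        for r, gamma in outs:
            bypass = cat(alpha, loop, gamma)      # alpha . beta* . gamma
            old = new.get((p, r))                 # existing label nu, if any
            new[(p, r)] = bypass if old is None else f"{bypass}+{old}"
    return new


def to_regex(delta, order, s="s", f="f"):
    """Eliminate the states in `order`; return the label left on (s, f)."""
    for q in order:
        delta = eliminate(delta, q)
    return delta.get((s, f))


# Hypothetical automaton: s -a-> q, q -b-> q, q -c-> f, s -d-> f
example = {("s", "q"): "a", ("q", "q"): "b", ("q", "f"): "c", ("s", "f"): "d"}
print(to_regex(example, ["q"]))  # ab*c+d
```

The sketch assumes the input is already non-returning and non-exiting with a single final state, as in Proposition 1; the removal order is passed in explicitly so that different sequences can be compared.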

Fig. 2. An example of state elimination that produces many duplicate strings.

One problem with state elimination is that it may increase the size of labels on transitions exponentially while removing states for a given automaton. For example in Fig. 2, if we eliminate q from the automaton, then we have to introduce O(mn) duplicate strings as new transition labels. On the other hand, if a given automaton has a simpler structure, for instance see Fig. 3, then state elimination yields new transitions that are smaller.


Fig. 3. We require n−1 new copies of string x after the state elimination of q.

Another problem with state elimination is that different removal sequences give different regular expressions for the same language. Although we cannot avoid the exponential blow-up in state elimination, we can obtain a shorter regular expression by choosing a better removal sequence. Fig. 4 illustrates this idea. Therefore, it is natural to ask how to compute a better removal sequence from a given FA.

Fig. 4. An example of different regular expressions obtained by different removal sequences for a given FA. E1 = (aa + b)(a + cb)*(cd + d) is the output of state elimination in p → r → q order and E2 = (aa + b)a*c(ba*c)*(ba*d + d) + (aa + b)a*d is the output of state elimination in p → q → r order, where L(E1) = L(E2).

Recently, Delgado and Morais [5] proposed heuristics for computing a smaller regular expression from a given FA A. They introduced the weight of a state q in A. Given a transition t = (p, α, q), the weight of t is the total number of character appearances in α. Then, the weight of q, which we call state weight, is defined as the sum of in-transition weights + the sum of out-transition weights + the loop weight. Then, they eliminate a state that has the lightest weight. Although this heuristic is better than random selection, it is straightforward to give examples in which the greedy choice does not lead to shorter regular expressions.

We can always find the best removal sequence for state elimination by trying all possible removal sequences and choosing a sequence that gives the shortest regular expression. If we have a smaller FA A′ such that L(A) = L(A′), then we can compute the optimal removal sequence more quickly and the removal sequence will lead to a shorter regular expression. Thus, if there is an algorithm for computing a removal sequence, then the algorithm takes FAs as inputs and, therefore, the size of FAs must be closely related to the runtime of the algorithm. This observation motivates us to examine NFA minimization methods that reduce the number of duplicate strings in state elimination. Note that since the nondeterministic finite-state automaton (NFA) minimization problem is known to be PSPACE-complete [15], we cannot expect to efficiently compute a substantially smaller FA for the same language from a given FA.

We define two states p and q in an FA A = (Q, Σ, δ, s, F) to be equivalent if the following conditions hold:
(1) p ∈ F if and only if q ∈ F.
(2) (p, a, t) ∈ δ if and only if (q, a, t) ∈ δ for any a ∈ Σ.
If we have two equivalent states, then we remove one of them, say p, and redirect all in-transitions of p into q. This does not change the language of A while it reduces the size of A.
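The state weight described earlier in this section (sum of in-transition weights, out-transition weights and loop weight) can be sketched as follows. This follows only the description given in the text; the dictionary representation, the set of symbols treated as operators and the function names are assumptions made for illustration.

```python
# Symbols that are NOT counted as character appearances (assumed notation).
NON_CHARACTERS = set("+*()·λ∅ ")


def char_count(expr):
    """Number of character appearances in a regular expression string."""
    return sum(ch not in NON_CHARACTERS for ch in expr)


def state_weight(delta, q):
    """In-transition weights + out-transition weights + loop weight of q."""
    w_in = sum(char_count(e) for (p, r), e in delta.items() if r == q and p != q)
    w_out = sum(char_count(e) for (p, r), e in delta.items() if p == q and r != q)
    w_loop = char_count(delta.get((q, q), ""))
    return w_in + w_out + w_loop


def lightest_state(delta, s, f):
    """Greedy choice: a non-start, non-final state of minimum weight."""
    states = {x for key in delta for x in key} - {s, f}
    return min(sorted(states), key=lambda q: state_weight(delta, q))


# Hypothetical expression automaton {(source, target): expression}
ea = {("s", "p"): "a", ("p", "p"): "b", ("p", "q"): "c+d",
      ("q", "f"): "e", ("s", "q"): "ab"}
print(state_weight(ea, "p"), state_weight(ea, "q"), lightest_state(ea, "s", "f"))
```

Ties are broken here by sorted state name, which is an arbitrary choice; the text does not specify a tie-breaking rule.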
State equivalence already plays an important role in the literature. If a given FA A is deterministic, then we obtain a minimal DFA after eliminating all equivalent states [12]. If A is a position automaton, then we have a follow automaton [14] after the elimination of all equivalent states.

Lemma 2. If two source states of a current state q are equivalent, then we need fewer new transitions when eliminating q after merging the two states.
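The equivalence test and the merge step defined above can be sketched as follows. The set-of-triples representation of δ and the function names are assumptions for this illustration; the sketch also assumes that the state being removed is not the start state.

```python
def out_set(delta, p):
    """All (label, target) pairs leaving state p."""
    return frozenset((a, t) for (x, a, t) in delta if x == p)


def equivalent(delta, finals, p, q):
    """Conditions (1) and (2): same finality and identical out-transitions."""
    return ((p in finals) == (q in finals)) and out_set(delta, p) == out_set(delta, q)


def merge(delta, finals, p, q):
    """Remove p and redirect every in-transition of p into q."""
    new_delta = set()
    for (x, a, t) in delta:
        if x == p:
            continue                      # out-transitions of p are dropped
        new_delta.add((x, a, q if t == p else t))
    return new_delta, finals - {p}


# Hypothetical FA: states 1 and 2 both go to f on b.
delta = {("s", "a", "1"), ("s", "c", "2"), ("1", "b", "f"), ("2", "b", "f")}
finals = {"f"}
if equivalent(delta, finals, "1", "2"):
    delta, finals = merge(delta, finals, "1", "2")
print(sorted(delta))
```

In this small example the merge reduces the automaton from four transitions to three, so a later elimination of the merged state has fewer in-transitions to combine.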


Fig. 5. State t has two out-transitions with the same label to two distinct target states p and q. We make all out-transitions of p leave from q and remove p.

Now we consider target states of the current state t ∈ Q of an FA A = (Q, Σ, δ, s, F). Assume that t has two target states p and q and that the out-transitions of t to p and to q carry the same characters; namely, (t, a, p) ∈ δ if and only if (t, a, q) ∈ δ, for a ∈ Σ, and p and q have no other in-transitions except from t, as shown in Fig. 5. Then, we delete p and attach all out-transitions of p to q so that all out-transitions are from q.

Lemma 3. If the current state t, in an FA A = (Q, Σ, δ, s, F), has two target states that are reachable only from t via the same transition label, then we need fewer new transitions when removing q after merging the two states.

Ilie et al. [13] adopted these ideas for reducing the size of NFAs and designed an O(m log n) time algorithm using O(m + n) space that discovers equivalent states for a given FA A, where n is the number of states and m is the number of transitions of A.

4. Vertical chopping
Assume that we have an FA A that cannot be minimized any further by using equivalent states. Now we have to compute a removal sequence for A. One question arising from Fig. 4 is why removing the middle state at the last step leads to a shorter regular expression than removing it at the second-to-last step. We observe that the middle state in Fig. 4 has some helpful properties.

Definition 4. We define a state qb in an FA A to be a bridge state if it satisfies the following conditions:
(1) State qb is neither a start nor a final state.
(2) For each string w ∈ L(A), its path in A must pass through qb at least once.
(3) Once w's path passes through qb for the first time, the path can never pass through any states that have been visited before apart from state qb.

Note that we can decompose A into two subautomata A1 and A2 such that L(A) = L(A1) · L(A2) from the first and the second requirements.
On the other hand, we may have several duplicated states and transitions in both A1 and A2 without the third requirement and it does not give a smaller subautomaton in the worst-case. Fig. 6 illustrates this phenomenon.

Fig. 6. Since state 3 satisfies both the first and the second conditions in Definition 4, we can partition A into two subautomata A1 and A2 such that L(A) = L(A1 ) · L(A2 ). Note that A2 has the same size as A, where state 3 is now the start state of A2 .

The third requirement guarantees that if we partition A at a bridge state qb into A1 and A2 , then all out-transitions of qb appear only in A2 . Therefore, A1 and A2 have only qb as a common state between them. Fig. 7 gives an example of bridge states.
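To make the three requirements of Definition 4 concrete, the following brute-force check tests a single state with two reachability searches. It is only a sketch under one reading of the definition (assumed here: no state reachable from s while avoiding qb may be reachable again from qb), it assumes all states are useful and that there is a single final state, and it is not the linear-time algorithm of this section. The adjacency-list representation and the function names are hypothetical.

```python
def reachable(adj, start, blocked=frozenset()):
    """States reachable from `start` without entering any blocked state."""
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        if u in seen or u in blocked:
            continue
        seen.add(u)
        stack.extend(adj.get(u, ()))
    return seen


def is_bridge(adj, s, f, q):
    """Brute-force test of requirements (1)-(3) for state q."""
    if q in (s, f):                       # requirement (1)
        return False
    before = reachable(adj, s, blocked={q})
    if f in before:                       # requirement (2): a path avoids q
        return False
    after = reachable(adj, q) - {q}
    return not (before & after)           # requirement (3): no way back


# Hypothetical FA (labels omitted): s -> 1 -> {2, 3} -> 4 -> f
adj = {"s": ["1"], "1": ["2", "3"], "2": ["4"], "3": ["4"], "4": ["f"], "f": []}
print([q for q in adj if is_bridge(adj, "s", "f", q)])  # ['1', '4']
```

Each call costs one pass over the automaton, so testing every state this way takes quadratic time in the worst case.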


Fig. 7. States 1 and 7 are bridge states.

Fig. 8. An example of vertical chopping of the automaton in Fig. 7 at a bridge state 7.

Assume that there is only one final state in A. If there is more than one final state, then we introduce a new final state f′ and connect all final states to f′ by null transitions. Given an FA A = (Q, Σ, δ, s, f) and a bridge state qb ∈ Q, we partition A into two subautomata A1 and A2 as follows: A1 = (Q1, Σ, δ1, s, qb) and A2 = (Q2, Σ, δ2, qb, f), where Q1 is a subset of states of A that appear on some path from s to qb in A, Q2 = (Q \ Q1) ∪ {qb}, δ2 is a subset of transitions of A that appear on some path from qb to f in A and δ1 = δ \ δ2. Fig. 8 illustrates partitioning at a bridge state.

Lemma 5. Given an FA A, let A1 and A2 be subautomata of A that are partitioned at a bridge state of A. Then, L(A) = L(A1) · L(A2).

Proof. Let qb be a bridge state, w1 ∈ L(A1) and w2 ∈ L(A2). We now process w1w2 with respect to A. Since δ1 ⊆ δ, we reach qb after reading w1. Again, we can read out w2 from qb and reach a final state of A since δ2 ⊆ δ. □

Note that if states p and q are bridge states in A, then q is still a bridge state in one of the resulting subautomata after the partitioning of A at p. For example, state 1 is a bridge state of the FA A in Fig. 7 and is a bridge state of the FA A1 in Fig. 8 after chopping at state 7. Let B = {b1, b2, ..., bk} be a set of bridge states in A, where k is the total number of bridge states in A. Then, B \ {bi} is the set of bridge states of A1 and A2 after chopping A at state bi.

Now we present an algorithm that computes all bridge states for a given FA A = (Q, Σ, δ, s, f). We say that a path in A is simple if it does not have any cycles. Then, from the second requirement of bridge states in Definition 4, we establish the following statement.

Lemma 6. Let P be a simple path from s to f in A. Then, only the states in P can be bridge states of A.

Proof. Assume that a state q is a bridge state and it is not in P. Then, it immediately contradicts the second requirement of bridge states. □
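The partition of δ into δ1 and δ2 used for vertical chopping (defined before Lemma 5) can be sketched as follows. The sketch assumes that qb is already known to be a bridge state and that every state is useful, so a transition lies on some path from qb to f exactly when its source is reachable from qb. The set-of-triples representation and the name `chop_at` are hypothetical.

```python
def chop_at(delta, qb):
    """Split delta into (delta1, delta2) at a bridge state qb."""
    succ = {}
    for p, _, r in delta:
        succ.setdefault(p, set()).add(r)
    # States reachable from qb (including qb itself).
    seen, stack = set(), [qb]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(succ.get(u, ()))
    delta2 = {(p, a, r) for (p, a, r) in delta if p in seen}
    return delta - delta2, delta2


# Hypothetical FA with bridge state 3:
# L(A1) is ac + bd and L(A2) is g*e, so L(A) is (ac + bd)g*e.
delta = {("s", "a", "1"), ("s", "b", "2"), ("1", "c", "3"),
         ("2", "d", "3"), ("3", "e", "f"), ("3", "g", "3")}
delta1, delta2 = chop_at(delta, "3")
print(sorted(delta1))
print(sorted(delta2))
```

Note that the self-loop on the bridge state ends up in δ2, in line with the remark that all out-transitions of qb appear only in A2.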
Since A is essentially a directed graph, our algorithm is based on Depth-First Search (DFS). For details on DFS, refer to the textbook [4]. First, we compute a simple path P from s to f of A using DFS. Let CB = {s, b1, b2, ..., bk, f} be the set of states in P. Our approach is to take all states in CB as bridge state candidates, identify all states that violate any requirement in Definition 4 and remove them from CB. The remaining states are the bridge states. We use DFS to explore A from s. We maintain the following three values for each state q ∈ Q:

anc: The index i of a state bi ∈ CB such that there is a path from bi to q and there is no path from bj ∈ CB to q for j > i. The anc of bi is i.

min: The index i of a state bi ∈ CB such that there is a path from q to bi and there is no path from q to bh for h < i in A without visiting any states in CB.


Fig. 9. An example of DFS that computes (min, anc, max), for each state of A, for a given CB = {s, b1, b2, b3, b4, b5, b6, f}, where # denotes the null index.

max: The index i of a state bi ∈ CB such that there is a path from q to bi and there is no path from q to bj for i < j in A without visiting any states in CB.

The min value of a state q means that there is a path from q to bmin. If a state bi ∈ CB has a min value, then it implies that bi is in a cycle. Similarly, if bi has a max value and max ≠ i + 1, then it means that there is another simple path from bi to bmax without passing through bi+1. For computing the three values for all states in A efficiently, we first visit all states of CB in s → b1 → b2 → · · · → bk → f order before visiting any other states in DFS. When a state q ∈ Q \ CB is discovered in DFS, q inherits anc of its preceding state. A state q has two types of child state: one type is a subset T1 of CB and the other is a subset T2 of Q \ CB; namely, some child states of q are bridge state candidates and the other child states of q are not. Once we have finished exploring all descendants of q, we update min and max of q as follows, where r ranges over the child states of q:

\[ q.\mathit{min} = \min\Bigl( \min_{r \in T_1} r.\mathit{anc},\ \min_{r \in T_2} r.\mathit{min} \Bigr) \]

and

\[ q.\mathit{max} = \max\Bigl( \max_{r \in T_1} r.\mathit{anc},\ \max_{r \in T_2} r.\mathit{max} \Bigr). \]
Fig. 9 provides an example of the DFS result after updating (min, anc, max) for all states in a given FA. If a state bi ∈ CB does not have any out-transitions except for a transition to bi+1 ∈ CB (for example, b6 in Fig. 9), then bi has (#, i, i + 1) when DFS is complete, where # denotes the null index. Once we complete DFS and have (min, anc, max) for all states in A, we remove from CB the states that violate any requirement for bridge states. Assume bi ∈ CB has (h, i, j), where h < i and i < j. First, we remove bh+1, bh+2, ..., bi from CB since there is a path from bi to bh and, therefore, all these states can be revisited after visiting bi; this violates the third requirement in Definition 4. For example, b2 in Fig. 9 is removed from CB since its min is not #. If h is #, then we do not remove any states. Second, we remove bi+1, bi+2, ..., bj−1 from CB since there is a path from bi to bj; that is, there is another simple path from bi to f without passing through these states. Finally, we remove s and f from CB. In Fig. 9, we obtain two bridge states b1 and b6 after removing from CB = {s, b1, ..., b6, f} all states that violate any requirement for bridge states. Since it takes constant time to compute the three values for each state and DFS runs in linear time in the size of a given FA, we establish the following result.

Theorem 7. We can compute a set of bridge states for a given FA A = (Q, Σ, δ, s, f) in O(|Q| + |δ|) time using DFS.

Now we demonstrate how bridge states help to compute a shorter regular expression from a given automaton A. Note that we use state elimination for computing regular expressions. As we have mentioned previously, the removal sequence for state elimination is crucial when we wish to compute a shorter regular expression.

Lemma 8.
If all states in a given automaton A = (Q, Σ, δ, s, f) are bridge states, then state elimination gives the same regular expression whatever removal sequence of states of A we use.

Proof. Since all states are bridge states, they are on the same simple path from the start state to the final state of A. Assume that we assign a unique number to each state from s to f as q1, q2, ..., qm, where m is the number of states


Fig. 10. An example of an FA whose states are all bridge states. Note that state elimination always gives ac*bbb*cb*a whatever removal sequence we use.

in A. Then, the state elimination of state qi makes a new transition δ(qi−1, qi+1) = δ(qi−1, qi) · δ(qi, qi)* · δ(qi, qi+1) from qi−1 to qi+1. Note that it does not introduce any new duplicated strings except the Kleene star for a self-loop. Thus, the state elimination of A gives a simple catenation. Therefore, it always gives the same regular expression independently of the removal sequence for the states in A (see Fig. 10). □

Proposition 9. Given an EA A = (Q, Σ, δ, s, f), all and only regular expressions in transitions will appear in the corresponding regular expression when we compute the expression using state elimination.

Proof. Since state elimination is based on catenations of in-transitions and out-transitions of the current state to be eliminated, the statement is true. □

Now we answer the question arising in Fig. 4. We assume that there are no three consecutive bridge states in A. If there are, then we delete the middle bridge state by state elimination. Given an EA A = (Q, Σ, δ, s, f), let C(A) be the total number of character appearances in transitions of A; that is,

\[ C(A) = \sum_{i,j} |e_{ij}|, \quad \text{for each } (q_i, e_{ij}, q_j) \in \delta, \text{ where } q_i, q_j \in Q. \]

For example, if A is ({s, f}, Σ, (s, E, f), s, f), which is the final expression automaton of state elimination for computing a corresponding regular expression, then C(A) = |E|.

Theorem 10. Given an EA A = (Q, Σ, δ, s, f) without three consecutive bridge states and a set B of bridge states of A, the optimal removal sequence must eliminate all states in Q \ B before eliminating any bridge states.

Proof. Without loss of generality, we assume that we have an optimal removal sequence OPT of state elimination for A that eliminates a bridge state qb first. We prove that there is a shorter regular expression using a different removal sequence and, therefore, OPT is not an optimal sequence. Since there are no three consecutive bridge states in A, either a target state or a source state of qb must not be a bridge state. Let us assume that a target state is not a bridge state. Fig. 11 illustrates an example of the state elimination of qb in A.

Fig. 11. (a) is a part of a given EA A. (b) is the resulting EA Aqb after eliminating qb from A. (c) is the resulting EA Ap after eliminating p from A. (d) is the resulting EA A_{qb,p} after eliminating p from Aqb. Note that C(A) < C(Aqb) and C(Ap) < C(A_{qb,p}).


Fig. 12. An example of state elimination of a bridge state. Note that the numbers of character appearances are the same for both automata.

Let Aqb be the resulting EA after the state elimination of qb. Then, C(A) < C(Aqb), as Fig. 2 illustrates. Let p be the next state to be eliminated after qb by OPT. We consider two cases: case 1 is when p is a target or a source state of qb and case 2 is when p is not.
(1) If p is a target or a source state of qb: Assume that p is a target state of qb. In Aqb, p has at least the same number of in-transitions compared to p in A and each in-transition has a longer expression. Therefore, C(Ap) < C(A_{qb,p}). Moreover, a target state of p in A_{qb,p} has longer expressions on its in-transitions than the corresponding expressions on the in-transitions in Ap, as shown in Fig. 11(b) and (d).
(2) If p is neither a target nor a source state of qb: The state elimination of p produces the same new expressions in both A and Aqb. Then, since C(A) < C(Aqb), C(Ap) < C(A_{qb,p}).

Let AOPT be the EA computed by OPT and A′ be the corresponding EA that we make by eliminating the same states as OPT does, except for qb. Then, by the same argument, it is always true that C(A′) < C(AOPT). Once OPT completes state elimination, C(A′) < C(AOPT) and A′ has three states, s, f and qb. Note that C(AOPT) is the size of the regular expression computed by OPT. Now we eliminate qb from A′ and denote the resulting EA by A′qb, as illustrated in Fig. 12. Note that C(A′qb) = C(A′) is the size of the corresponding regular expression that we have computed. Since C(A′qb) = C(A′) < C(AOPT), we have computed a regular expression that is shorter than the regular expression computed by OPT, which is a contradiction. Therefore, the optimal removal sequence must eliminate all states in Q \ B before eliminating any bridge states. □

Theorem 10 suggests that, given an automaton A, we identify all bridge states of A, chop A into several subautomata using bridge states, compute corresponding regular expressions for each subautomaton and catenate the resulting regular expressions to obtain a regular expression for A.
Note that each subautomaton is disjoint from every other subautomaton except for bridge states. Thus, vertical chopping is a divide-and-conquer approach based on the structural properties of A.

5. Horizontal chopping
Now we have an FA A without any bridge states after we have applied vertical chopping as suggested in Section 4. It implies that there is one start state and one final state in A. Although we cannot avoid computing a removal sequence for A, we can sometimes avoid examining all of A to compute such a sequence. For example, we can partition A, shown in Fig. 13, into two subautomata Au and Al. We can compute corresponding regular expressions eu and el for Au and Al, respectively. Then, a regular expression for A is eu + el, which does not increase the number of character appearances.

Fig. 13. An example of horizontal chopping for a given FA without bridge states.


Fig. 14. An example of DFS that identifies groups. The label outside a state is its group index. Note that group 1 and group 2 belong to the same group because of q. Therefore, there are two disjoint subautomata to which we can apply horizontal chopping.

Another interesting observation is as follows. Assume that an optimal removal sequence is 5 → 3 → 4 → 6 → 2 for the FA in Fig. 13. Then, the removal sequence 3 → 4 → 6 → 5 → 2 gives the same regular expression as before since state elimination of a state in the upper subautomaton does not affect expressions in the lower subautomaton. It implies that sometimes we can compute an optimal removal sequence for a given FA A by computing optimal removal sequences for subautomata and combining them. This approach is also a divide-and-conquer approach. Since we partition A horizontally, we call it horizontal chopping.

For horizontal chopping of a given FA A = (Q, Σ, δ, s, f), we have to identify subautomata of A that are pairwise disjoint except for s and f. Our algorithm is based on DFS. When exploring A, we maintain a group index for each state of A. First, we assign a different group index to each child of s in A. Assume p is the current state with group index i and q is the next state to visit in DFS. If q does not have a group index (that means q is discovered just now), then q inherits the group index i from p. Otherwise, q already has a group index j and we combine the two group indices i and j and regard them as the same group. We continue to explore until we visit all states in A. Fig. 14 illustrates DFS running to identify groups from a given automaton. Note that when we visit state q from state p, we unite group 1 and group 2 into a single group. From this algorithm, we establish the following result:

Theorem 11. Given an FA A = (Q, Σ, δ, s, f), we can discover all subautomata that are disjoint from each other except for s and f in O(|Q| + |δ|) time using DFS.

Moreover, once we partition A horizontally, some states become bridge states of subautomata. For example, state 2 is a bridge state of Au and states 3, 4 and 6 are bridge states of Al in Fig. 13. Note that these states are not bridge states of A.
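The grouping step behind horizontal chopping can be sketched as follows. Note that this sketch swaps in a union-find structure for the DFS-with-group-indices procedure described above: it joins the endpoints of every transition that touches neither s nor f, which yields the same groups but is not the authors' algorithm as stated. The set-of-triples representation, the function name and the example automaton are hypothetical.

```python
def horizontal_groups(delta, s, f):
    """Group the states other than s and f into pairwise disjoint pieces."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for p, _, r in delta:
        for x in (p, r):
            if x not in (s, f):
                find(x)                     # register the state
        if p not in (s, f) and r not in (s, f):
            parent[find(p)] = find(r)       # unite the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical FA with an upper part (u1, u2) and a lower part (l1, l2)
delta = {("s", "a", "u1"), ("u1", "b", "u2"), ("u2", "c", "f"),
         ("s", "d", "l1"), ("l1", "e", "l2"), ("l2", "g", "l1"),
         ("l2", "h", "f")}
print(horizontal_groups(delta, "s", "f"))  # [['l1', 'l2'], ['u1', 'u2']]
```

Each returned group, together with s and f and the transitions among them, forms one subautomaton whose regular expression can be computed separately and combined by union.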
Therefore, we can compute bridge states for each subautomaton and perform vertical chopping if there are bridge states; then, again we can repeat horizontal chopping. We continue chopping until no further chopping is possible and then compute a removal sequence.

Note that state elimination using horizontal chopping and vertical chopping works well for FAs that preserve the structural properties of corresponding regular expressions. For example, for each catenation operation of a given regular expression that is not enclosed by a Kleene star, there is a bridge state in the corresponding Thompson automaton and position automaton. Similarly, for each union operation that is not enclosed by a Kleene star, we can find a horizontal chopping in the corresponding Thompson automaton. On the other hand, we might not be able to perform any vertical chopping or horizontal chopping in the worst case. However, this implies that such an FA is already complex and barely preserves any structural properties of the possible regular expressions. In this case, we can only resort to brute force.

6. Conclusions
There are several FA constructions from regular expressions in the literature, and each construction has different properties [3,8,9,14,18,19]. On the other hand, there are only a few methods for obtaining a regular expression from a given FA, such as linear equations [6] and state elimination [2]. State elimination is an intuitive construction: we compute a regular expression by removing states in a given automaton while maintaining expressions in transitions.


The regular expression produced by state elimination depends on the removal sequence of states. Thus, if we choose a good removal sequence, we obtain a shorter regular expression. However, finding an optimal sequence by brute force requires trying all possible sequences, and there are m! sequences, where m is the number of states. Moreover, state elimination considerably increases the size of the regular expressions on transitions. These observations motivated us to investigate state elimination with the aim of reducing the size of regular expressions and computing a better removal sequence that leads to a shorter regular expression.

We have examined NFA minimization to reduce the number of character appearances based on state equivalence. Furthermore, we have investigated the properties of bridge states of an FA A and proved that bridge states must be eliminated after all non-bridge states of A in order to obtain a shorter regular expression. We can perform vertical chopping of A using bridge states. We have also shown that horizontal chopping helps to compute a state removal sequence of A quickly: once we partition A horizontally, we can repeat vertical chopping for each subautomaton. We have designed two DFS-based algorithms for identifying vertical chopping and horizontal chopping of A. Both algorithms run in time linear in the size of A. The combination of vertical chopping and horizontal chopping suggests a divide-and-conquer heuristic for computing a better removal sequence of states of A.

Our state elimination strategy relies on bridge states, which determine a (particular type of) decomposition of regular languages. More generally, any decomposition of a regular language L can be associated with a subset of states of a DFA for L, as shown by Mateescu et al. [17]; they called such sets decomposition sets.
Therefore, it would be interesting to see whether decomposition sets can provide useful information for choosing a good strategy for state elimination.

Acknowledgments

We wish to thank the referee for the careful reading of the paper and many valuable suggestions, including recent references. Han was supported by the Research Grants Council of Hong Kong Competitive Earmarked Research Grant HKUST6197/01E and the KIST Tangible Space Initiative Grant 2E19020. Wood was supported by the Research Grants Council of Hong Kong Competitive Earmarked Research Grant HKUST6197/01E.

References

[1] A. Brüggemann-Klein, D. Wood, The validation of SGML content models, Mathematical and Computer Modelling 25 (1997) 73–84.
[2] J. Brzozowski, E. McCluskey Jr., Signal flow graph techniques for sequential circuit state diagrams, IEEE Transactions on Electronic Computers EC-12 (1963) 67–76.
[3] P. Caron, D. Ziadi, Characterization of Glushkov automata, Theoretical Computer Science 233 (1–2) (2000) 75–90.
[4] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, McGraw-Hill Higher Education, 2001.
[5] M. Delgado, J. Morais, Approximation to the smallest regular expression for a given regular language, in: Proceedings of CIAA’04, in: Lecture Notes in Computer Science, vol. 3317, 2004, pp. 312–314.
[6] S. Eilenberg, Automata, Languages, and Machines, vol. A, Academic Press, New York, NY, 1974.
[7] D. Giammarresi, R. Montalbano, Deterministic generalized automata, Theoretical Computer Science 215 (1999) 191–208.
[8] D. Giammarresi, J.-L. Ponty, D. Wood, D. Ziadi, A characterization of Thompson digraphs, Discrete Applied Mathematics 134 (2004) 317–337.
[9] V. Glushkov, The abstract theory of automata, Russian Mathematical Surveys 16 (1961) 1–53.
[10] G. Gramlich, G. Schnitger, Minimizing nfa’s and regular expressions, in: Proceedings of STACS’05, in: Lecture Notes in Computer Science, vol. 3404, 2005, pp. 399–411.
[11] Y.-S. Han, D. Wood, The generalization of generalized automata: Expression automata, International Journal of Foundations of Computer Science 16 (3) (2005) 499–510.
[12] J. Hopcroft, An n log n algorithm for minimizing the states in a finite automaton, in: Z. Kohavi, A. Paz (Eds.), Theory of Machines and Computations, Academic Press, New York, NY, 1971, pp. 189–196.
[13] L. Ilie, G. Navarro, S. Yu, On NFA reductions, in: Theory is Forever (Salomaa Festschrift), in: Lecture Notes in Computer Science, vol. 3113, 2004, pp. 112–124.
[14] L. Ilie, S. Yu, Follow automata, Information and Computation 186 (1) (2003) 140–162.
[15] T. Jiang, B. Ravikumar, Minimal NFA problems are hard, SIAM Journal on Computing 22 (6) (1993) 1117–1141.
[16] S. Kleene, Representation of events in nerve nets and finite automata, in: C. Shannon, J. McCarthy (Eds.), Automata Studies, Princeton University Press, Princeton, NJ, 1956, pp. 3–42.
[17] A. Mateescu, A. Salomaa, S. Yu, Factorizations of languages and commutativity conditions, Acta Cybernetica 15 (3) (2002) 339–351.
[18] R. McNaughton, H. Yamada, Regular expressions and state graphs for automata, IEEE Transactions on Electronic Computers 9 (1960) 39–47.
[19] K. Thompson, Regular expression search algorithm, Communications of the ACM 11 (1968) 419–422.
[20] D. Wood, Theory of Computation, John Wiley & Sons, Inc., New York, NY, 1987.