{

Multi-Tilde-Bar Derivatives

Pascal Caron, Jean-Marc Champarnaud, and Ludovic Mignot

LITIS, Université de Rouen, 76801 Saint-Étienne du Rouvray Cedex, France
{pascal.caron,jean-marc.champarnaud,ludovic.mignot}@univ-rouen.fr

Abstract. Multi-tilde-bar operators allow us to extend regular expressions. The associated extended expressions are compatible with the structure of Glushkov automata and they provide a more succinct representation than standard expressions. The aim of this paper is to examine the derivation of multi-tilde-bar expressions. Two types of computation are investigated: Brzozowski derivation and Antimirov derivation, as well as the construction of the associated automata.

1

Introduction

Regular expression word derivatives were introduced by Brzozowski [5] in order to compute language quotients via expression derivatives: for any word w, the language denoted by the derivative of a regular expression E w.r.t. w is the left quotient of the language denoted by E w.r.t. w. Regular expression derivation plays a fundamental role in automata theory. In particular, under the assumption that the set D of all the derivatives of a regular expression E is finite, it is possible to construct an FA (finite automaton) with D as its set of states that recognizes the language denoted by E. Word derivatives handle unrestricted regular expressions; they are themselves expressions and they provide a DFA (deterministic finite automaton), provided that the ACI (associativity, commutativity and idempotence) properties of the sum of two expressions are used.

Alternative types of derivation have been designed since Brzozowski's seminal work. Partial derivatives, due to Antimirov [2], only address simple regular expressions; they are sets of expressions and they provide both a DFA and an NFA (non-deterministic finite automaton). Antimirov derivatives have recently been extended to unrestricted regular expressions [10]; extended partial derivatives are sets of sets of expressions and they provide a DFA, an NFA and an AFA (alternating finite automaton) [11]. Some derivations are based on the linearization of the (simple) input expression: let us cite the continuations of Berry and Sethi [4], the c-continuations of Champarnaud and Ziadi [14] and the derivatives of Ilie and Yu [18]. Let us mention that Antimirov derivation has been extended to the case of weighted rational expressions [21,13].

As reported in [2], the concept of derivation has been successfully used to investigate the properties of regular expressions [17,15,7,20,3,1]. More recently, Brzozowski introduced a new approach for studying the state complexity of regular languages, based on the counting of their quotients (or of their derivatives) [6].

N. Moreira and R. Reis (Eds.): CIAA 2012, LNCS 7381, pp. 321–328, 2012. © Springer-Verlag Berlin Heidelberg 2012

Moreover, derivatives provide a useful tool to implement regular matching algorithms [23,16] or scanner generators, as reported in [22].

A closely related topic is the derivation of new operators that extend regular expressions. For example, the computation of the derivatives of an approximate regular expression (which denotes a language at a bounded distance from a given language) has been presented in [12]. The aim of this paper is to investigate the derivation of the multi-tilde-bar expressions introduced in [8,9]. These expressions are built upon simple operators and multi-tilde-bar operators, and their main interest is that they are compatible with the structure of Glushkov automata and more succinct than standard expressions. We provide formulae for the computation of word and partial derivatives of multi-tilde-bar expressions and investigate the properties of these derivatives.

The next section gathers classical notions concerning regular languages, regular expressions and finite automata; it also recalls the definition and main properties of multi-tilde-bar operators. The quotient of the language of an expression extended to multi-tilde-bar operators is addressed in Section 3. Section 4 is devoted to the computation of the Brzozowski derivatives of an extended expression and Section 5 to the computation of the Antimirov derivatives. In both cases, the construction of the associated automaton is provided.

2

Preliminaries

We recall some definitions and notation concerning regular languages, regular expressions, finite automata and multi-tilde-bar expressions. For further details about these topics, we refer to classical books such as [24].

Languages, Regular Expressions and Automata

An alphabet is a finite set of symbols. Given an alphabet $\Sigma$, any subset of $\Sigma^*$ is a language over $\Sigma$. The set of regular languages over $\Sigma$ is denoted by $\mathrm{Reg}(\Sigma^*)$ and is defined as the smallest family of languages containing $\emptyset$ and $\{a\}$ for every symbol $a$ in $\Sigma$ and closed under union, catenation and Kleene star.

A regular expression $E$ over an alphabet $\Sigma$ is inductively defined by $E = 0$, $E = 1$, $E = a$, $E = (F + G)$, $E = (F \cdot G)$, $E = (F^*)$, with $a$ a symbol in $\Sigma$, and $F$ and $G$ two regular expressions over $\Sigma$. The language denoted by a regular expression is inductively defined by $L(0) = \emptyset$, $L(1) = \{\varepsilon\}$, $L(a) = \{a\}$, $L(F + G) = L(F) \cup L(G)$, $L(F \cdot G) = L(F) \cdot L(G)$ and $L(F^*) = L(F)^*$. By construction, the language denoted by a regular expression is regular. The alphabetic width $|E|$ of $E$ is the number of occurrences of symbols of $\Sigma$ appearing in $E$.

A finite automaton $A$ is a 5-tuple $(\Sigma, Q, I, F, \delta)$ where $\Sigma$ is an alphabet, $Q$ is a finite set of states, $I \subset Q$ a set of initial states, $F \subset Q$ a set of final states and $\delta \subset Q \times \Sigma \times Q$ a set of transitions. The set $\delta$ can be seen as a function from $Q \times \Sigma$ to $2^Q$ defined by $q' \in \delta(q, a) \Leftrightarrow (q, a, q') \in \delta$. The domain of the function $\delta$ can be extended to $2^Q \times \Sigma^*$ by setting, for all $Q' \subset Q$, $\delta(Q', \varepsilon) = Q'$, $\delta(Q', a) = \bigcup_{q \in Q'} \delta(q, a)$ and $\delta(Q', a \cdot w) = \delta(\delta(Q', a), w)$ for every word $w$ in $\Sigma^*$. The language recognized by the automaton $A$ is the set $L(A) = \{w \in \Sigma^* \mid \delta(I, w) \cap F \neq \emptyset\}$. A language

$L$ is recognizable if there exists an automaton that recognizes it. The set of recognizable languages over $\Sigma$ is denoted by $\mathrm{Rec}(\Sigma^*)$. Kleene's theorem [19] asserts that $\mathrm{Reg}(\Sigma^*) = \mathrm{Rec}(\Sigma^*)$. Consequently, for every regular language $L$, there exist an automaton $A$ and an expression $E$ such that $L = L(E) = L(A)$.

The Multi-Tilde-Bar Operators

The unary operators tilde, denoted by $\widetilde{\ \cdot\ }$, and bar, denoted by $\overline{\ \cdot\ }$ [8,9], are defined for every expression $E$ by $L(\widetilde{E}) = L(E) \cup \{\varepsilon\}$ and $L(\overline{E}) = L(E) \setminus \{\varepsilon\}$. They are extended to multi-tilde-bar operators, which are applied to a list of expressions, according to the following definitions.

Let $n$ be a positive integer. For convenience, the list $(E_1, \ldots, E_n)$ of expressions is denoted by $E_{1,n}$. Similarly, a catenation $E_1 \cdots E_n$ is denoted by $E_{1 \cdots n}$. The set of integers $\{1, \ldots, n\}$ is denoted by $[\![1, n]\!]$. The set of pairs $(i, j)$ such that $1 \leq i \leq j \leq n$ is denoted by $[\![1, n]\!]^2_{\leq}$. The set of finite lists of pairs in $[\![1, n]\!]^2_{\leq}$ is denoted by $S_n$. Let $S$ be a list in $S_n$ and let $k$ be in $[\![1, n]\!]$. The list $S_{\leq k}$ (resp. $S_{\geq k}$) is defined by $S_{\leq k} = ((i, f) \in S \mid f \leq k)$ (resp. $S_{\geq k} = ((i - k + 1, f - k + 1) \mid (i, f) \in S \wedge i \geq k)$). Let us notice that a renumbering is performed for the computation of $S_{\geq k}$. A list $S$ is said to be free if for all pairs $(i, f)$, $(i', f')$ in $S$ such that $(i, f) \neq (i', f')$, $[\![i, f]\!] \cap [\![i', f']\!] = \emptyset$.

Let $L_1, \ldots, L_n$ be $n$ nonempty regular languages over $\Sigma$ and $w$ be a word in $L_1 \cdots L_n$. A sequence $(w_1, \ldots, w_n)$ satisfying $w_1 \cdots w_n = w \wedge \forall k \in [\![1, n]\!],\ w_k \in L_k$ is said to be a split up of $w$ over $(L_1, \ldots, L_n)$.

Multi-tilde-bar operators are a natural combination of multi-tilde and multi-bar operators [9]. The respective roles of tildes and bars are made explicit in the two following definitions.

Definition 1. Let $(w_1, \ldots, w_n)$ be a split up of a word $w$ over a list of languages $(L_1 \cup \{\varepsilon\}, \ldots, L_n \cup \{\varepsilon\})$. Let $T$ be a free list in $S_n$. The sequence $(w_1, \ldots, w_n)$ is generated by the list $T$ if it holds: $w_k = \varepsilon$ if $k \in \bigcup_{(i,f) \in T} [\![i, f]\!]$ and $w_k \in L_k$ otherwise.

Bars are used to forbid some combinations of tildes. Consequently, the satisfaction of a bar by a sequence has to be defined with a list of tildes as a context.

Definition 2. Let $E_{1,n}$ be a list of $n$ expressions. Let $(w_1, \ldots, w_n)$ be a split up of a word $w$ over $(L(E_1) \cup \{\varepsilon\}, \ldots, L(E_n) \cup \{\varepsilon\})$ generated by a free list $T$ in $S_n$. Let $b = (i, f)$ be a pair in $[\![1, n]\!]^2_{\leq} \setminus T$. The bar $b$ is said to be satisfied by $(w_1, \ldots, w_n)$ w.r.t. $T$ if at least one of the three following conditions is satisfied: (1) there exists a pair $t$ in $T$ such that $t$ overlaps $b$, (2) there exists a pair $t$ in $T$ such that $b$ is included in $t$, (3) $w_i \cdots w_f \neq \varepsilon$.

According to the two previous definitions, the language denoted by a multi-tilde-bar expression can be expressed as follows:

Definition 3 ([8]). Let $E_{1,n}$ be a list of expressions over an alphabet $\Sigma$ and $\mathcal{L}$ be the list $(L(E_1) \cup \{\varepsilon\}, \ldots, L(E_n) \cup \{\varepsilon\})$ of languages. Let $B$ and $T$ be two lists

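in $S_n$ such that $B \cap T = \emptyset$. The multi-tilde-bar expression $E = \operatorname{MTB}_{T;B}(E_{1,n})$ denotes the language
\[
L(E) = \{\, w \in \Sigma^* \mid \text{there exists a split up of } w \text{ over } \mathcal{L} \text{ generated by a free sublist } T' \text{ of } T \text{ satisfying every bar in } B \text{ w.r.t. } T' \,\}.
\]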

Example 1. Let us consider the EMRE $E_1$ defined by
\[
E_1 = \operatorname{MTB}_{(1,1),(2,2);(1,2)}(a^*b,\ b^*a) \cdot a^* \quad \text{(i.e. } \overline{(\widetilde{a^*b})(\widetilde{b^*a})} \cdot a^* \text{)}.
\]
The language denoted by $E_1$ is the set $L(E_1) = (((L(a^*b) \cup \{\varepsilon\}) \cdot (L(b^*a) \cup \{\varepsilon\})) \setminus \{\varepsilon\}) \cdot L(a^*)$.

Definition 4. Let $\Sigma$ be an alphabet. An Extended to multi-tilde-bar Regular Expression (EMRE) over $\Sigma$ is inductively defined by: $E = 0$, $E = 1$, $E = a$, $E = E_1 + E_2$, $E = E_1 \cdot E_2$, $E = E_1^*$, $E = \operatorname{MTB}_{T;B}(E_{1,n})$, where $E_1, \ldots, E_n$ are any $n$ EMREs over $\Sigma$, $a$ is any symbol in $\Sigma$ and $T$ and $B$ are any two disjoint lists in $S_n$.

Definition 5. An EMRE is said to be total if and only if for any of its multi-tilde-bar subexpressions $\operatorname{MTB}_{T;B}(E_{1,n})$ it holds $T \cup B = [\![1, n]\!]^2_{\leq}$.

Lemma 1 ([8]). Any EMRE admits an equivalent total one.
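The list operations recalled in this section (the restrictions $S_{\leq k}$ and $S_{\geq k}$, and the freeness test) can be sketched in a few lines of Python. This is only an illustration under our own conventions: pairs are Python tuples, lists of pairs are Python lists, and the function names `interval`, `is_free`, `restrict_le` and `shift_ge` are hypothetical helpers that do not come from [8,9].

```python
from itertools import combinations

def interval(pair):
    """Integer interval [[i, f]] represented as a Python set."""
    i, f = pair
    return set(range(i, f + 1))

def is_free(S):
    """True when the intervals of any two distinct pairs of S are disjoint."""
    return all(not (interval(p) & interval(q))
               for p, q in combinations(set(S), 2))

def restrict_le(S, k):
    """S_{<=k}: the pairs (i, f) of S such that f <= k."""
    return [(i, f) for (i, f) in S if f <= k]

def shift_ge(S, k):
    """S_{>=k}: the pairs (i, f) of S such that i >= k, renumbered from 1."""
    return [(i - k + 1, f - k + 1) for (i, f) in S if i >= k]

# Tildes and bars of the multi-tilde-bar operator of Example 1.
T = [(1, 1), (2, 2)]
B = [(1, 2)]
print(is_free(T))        # True
print(shift_ge(T, 2))    # [(1, 1)]
print(shift_ge(B, 2))    # []
```

The last two lines show the renumbering: once the first operand is dropped, the tilde (2,2) becomes (1,1) and the bar (1,2) disappears.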

3

Quotient Formulae

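We now recall the inductive computation of the quotient $w^{-1}(L)$ of a language $L$ w.r.t. a word $w$ in $\Sigma^*$, that is the set $\{w' \in \Sigma^* \mid ww' \in L\}$.

Lemma 2. Let $L$ be a language in $\mathrm{Reg}(\Sigma^*)$ and $w$ be a word in $\Sigma^*$. The quotient $w^{-1}(L)$ of $L$ w.r.t. $w$ is inductively computed as follows:
\[
\begin{aligned}
\varepsilon^{-1}(L) &= L, \qquad (aw')^{-1}(L) = (w')^{-1}(a^{-1}(L)),\\
a^{-1}(\emptyset) &= a^{-1}(\{\varepsilon\}) = a^{-1}(\{b\}) = \emptyset, \qquad a^{-1}(\{a\}) = \{\varepsilon\},\\
a^{-1}(L_1 \cup L_2) &= a^{-1}(L_1) \cup a^{-1}(L_2), \qquad a^{-1}(L_1^*) = a^{-1}(L_1) \cdot L_1^*,\\
a^{-1}(L_1 \cdot L_2) &=
\begin{cases}
a^{-1}(L_1) \cdot L_2 \cup a^{-1}(L_2) & \text{if } \varepsilon \in L_1,\\
a^{-1}(L_1) \cdot L_2 & \text{otherwise,}
\end{cases}
\end{aligned}
\]
where $L_1$ and $L_2$ are any two languages in $\mathrm{Reg}(\Sigma^*)$, $a$ and $b$ are any two distinct symbols in $\Sigma$ and $w'$ is any word in $\Sigma^*$.

Lemma 3. Let $E = \operatorname{MTB}_{T;B}(E_{1,n})$ be a total EMRE over an alphabet $\Sigma$. Then:
\[
L(E) = \{\varepsilon \mid (1, n) \in T\}
\;\cup\; (L(E_1) \setminus \{\varepsilon\}) \cdot L\big(\operatorname{MTB}_{T_{\geq 2};B_{\geq 2}}(E_{2,n})\big)
\;\cup \bigcup_{(1,k-1) \in T} (L(E_k) \setminus \{\varepsilon\}) \cdot L\big(\operatorname{MTB}_{T_{\geq k+1};B_{\geq k+1}}(E_{k+1,n})\big).
\]

Corollary 1. Let $E = \operatorname{MTB}_{T;B}(E_{1,n})$ be a total EMRE over an alphabet $\Sigma$ and let $a$ be a symbol in $\Sigma$. Then:
\[
a^{-1}(L(E)) = a^{-1}(L(E_1)) \cdot L\big(\operatorname{MTB}_{T_{\geq 2};B_{\geq 2}}(E_{2,n})\big)
\;\cup \bigcup_{(1,k-1) \in T} a^{-1}(L(E_k)) \cdot L\big(\operatorname{MTB}_{T_{\geq k+1};B_{\geq k+1}}(E_{k+1,n})\big).
\]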
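As a sanity check, Lemma 3 and Corollary 1 can be instantiated on the multi-tilde-bar factor of Example 1, namely $E = \operatorname{MTB}_{(1,1),(2,2);(1,2)}(a^*b,\ b^*a)$ with $n = 2$. The computation below is a sketch that relies on two assumptions of ours, not stated in the text: the operator applied to an empty list of expressions denotes $\{\varepsilon\}$, and $\operatorname{MTB}_{(1,1);()}(b^*a)$ is read as $\widetilde{b^*a}$. Here $T_{\geq 2} = ((1,1))$, $B_{\geq 2} = ()$, $(1,2) \notin T$ and $(1,1) \in T$ (so $k = 2$ contributes).

```latex
\begin{align*}
L(E) &= \emptyset
  \;\cup\; (L(a^*b) \setminus \{\varepsilon\}) \cdot L(\widetilde{b^*a})
  \;\cup\; (L(b^*a) \setminus \{\varepsilon\}) \cdot \{\varepsilon\} \\
     &= L(a^*b) \cdot (L(b^*a) \cup \{\varepsilon\}) \;\cup\; L(b^*a), \\[4pt]
a^{-1}(L(E)) &= a^{-1}(L(a^*b)) \cdot L(\widetilde{b^*a})
  \;\cup\; a^{-1}(L(b^*a)) \cdot \{\varepsilon\} \\
     &= L(a^*b) \cdot (L(b^*a) \cup \{\varepsilon\}) \;\cup\; \{\varepsilon\}.
\end{align*}
```

The first result agrees with the language $((L(a^*b) \cup \{\varepsilon\}) \cdot (L(b^*a) \cup \{\varepsilon\})) \setminus \{\varepsilon\}$ given in Example 1 for this factor.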

4

Word Derivatives of an EMRE

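The set of all the word derivatives of a regular expression can be infinite. However, Brzozowski derivation yields a finite set of derivatives (called dissimilar derivatives) based on the use of the $+_{ACI}$ operator, which is associative, commutative and idempotent. We extend these results to the case of EMREs and give the construction of the dissimilar derivative DFA of an EMRE.

Definition 6. Let $E$ be a regular expression over the alphabet $\Sigma$ and $w$ be a word in $\Sigma^*$. The dissimilar derivative $\frac{d}{dw}(E)$ of $E$ w.r.t. $w$ is inductively computed as follows:
\[
\begin{aligned}
\tfrac{d}{d\varepsilon}(E) &= E, \qquad \tfrac{d}{d(aw')}(E) = \tfrac{d}{dw'}\big(\tfrac{d}{da}(E)\big),\\
\tfrac{d}{da}(0) &= \tfrac{d}{da}(1) = \tfrac{d}{da}(b) = 0, \qquad \tfrac{d}{da}(a) = 1,\\
\tfrac{d}{da}(F + G) &= \tfrac{d}{da}(F) + \tfrac{d}{da}(G), \qquad \tfrac{d}{da}(F^*) = \tfrac{d}{da}(F) \cdot F^*,\\
\tfrac{d}{da}(F \cdot G) &=
\begin{cases}
\tfrac{d}{da}(F) \cdot G +_{ACI} \tfrac{d}{da}(G) & \text{if } \varepsilon \in L(F),\\
\tfrac{d}{da}(F) \cdot G & \text{otherwise,}
\end{cases}
\end{aligned}
\]
where $F$ and $G$ are any two regular expressions over the alphabet $\Sigma$, $a$ and $b$ are any two distinct symbols of $\Sigma$ and $w'$ is any word in $\Sigma^*$.

Definition 7. Let $E = \operatorname{MTB}_{T;B}(E_{1,n})$ be a total EMRE over an alphabet $\Sigma$, let $a$ be a symbol in $\Sigma$ and $w$ be a word in $\Sigma^*$. Then:
\[
\tfrac{d}{da}(E) = \tfrac{d}{da}(E_1) \cdot \operatorname{MTB}_{T_{\geq 2};B_{\geq 2}}(E_{2,n})
\;+_{ACI} \sum^{ACI}_{(1,k-1) \in T} \tfrac{d}{da}(E_k) \cdot \operatorname{MTB}_{T_{\geq k+1};B_{\geq k+1}}(E_{k+1,n}),
\]
\[
\tfrac{d}{dw}(E) =
\begin{cases}
E & \text{if } w = \varepsilon,\\
\tfrac{d}{dw'}\big(\tfrac{d}{db}(E)\big) & \text{if } w = b \cdot w' \wedge b \in \Sigma \wedge w' \in \Sigma^*.
\end{cases}
\]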
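The rules of Definition 6 translate almost literally into code. The following Python sketch covers plain regular expressions only (not the multi-tilde-bar case of Definition 7). Its encoding is an assumption of ours: expressions are nested tuples, a sum is stored as a flattened `frozenset` of summands so that associativity, commutativity and idempotence hold by construction, and the simplifications $E + 0 = E$, $0 \cdot E = 0$ and $1 \cdot E = E$ are added for readability although Definition 6 does not require them.

```python
ZERO, ONE = ("0",), ("1",)

def sym(a):
    return ("sym", a)

def star(e):
    return ("*", e)

def plus(e, f):
    """Sum modulo ACI: summands are kept in one flattened frozenset."""
    parts = set()
    for g in (e, f):
        parts |= g[1] if g[0] == "+" else {g}
    parts.discard(ZERO)              # extra simplification: E + 0 = E
    if not parts:
        return ZERO
    if len(parts) == 1:
        return next(iter(parts))
    return ("+", frozenset(parts))

def cat(e, f):
    """Catenation with the extra simplifications 0.E = E.0 = 0, 1.E = E.1 = E."""
    if ZERO in (e, f):
        return ZERO
    if e == ONE:
        return f
    if f == ONE:
        return e
    return (".", e, f)

def nullable(e):
    """True when the empty word belongs to L(e)."""
    tag = e[0]
    if tag in ("1", "*"):
        return True
    if tag in ("0", "sym"):
        return False
    if tag == "+":
        return any(nullable(g) for g in e[1])
    return nullable(e[1]) and nullable(e[2])

def deriv(a, e):
    """Derivative of e w.r.t. the symbol a, following Definition 6."""
    tag = e[0]
    if tag in ("0", "1"):
        return ZERO
    if tag == "sym":
        return ONE if e[1] == a else ZERO
    if tag == "+":
        out = ZERO
        for g in e[1]:
            out = plus(out, deriv(a, g))
        return out
    if tag == "*":
        return cat(deriv(a, e[1]), e)
    left = cat(deriv(a, e[1]), e[2])
    return plus(left, deriv(a, e[2])) if nullable(e[1]) else left

def deriv_word(w, e):
    """Derivative w.r.t. a word, one symbol at a time from left to right."""
    for a in w:
        e = deriv(a, e)
    return e

def matches(e, w):
    return nullable(deriv_word(w, e))

a_star_b = cat(star(sym("a")), sym("b"))     # a*b
print(deriv("a", a_star_b) == a_star_b)      # True
print(matches(a_star_b, "aab"))              # True
```

Storing sums as sets is one possible way to obtain dissimilar derivatives; a canonical ordering of summands would serve the same purpose.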

Proposition 1. The derivative of an EMRE $E$ w.r.t. a word $w$ denotes the set $w^{-1}(L(E))$.

Proposition 2. The set of dissimilar derivatives of an EMRE is finite.

Definition 8. Let $E$ be an EMRE over an alphabet $\Sigma$ and $D_E$ be the set of the dissimilar derivatives of $E$. Let $A = (\Sigma, Q, I, F, \delta)$ be the automaton defined by $Q = D_E$, $I = \{E\}$, $F = \{E' \in Q \mid \varepsilon \in L(E')\}$ and, for all $E' \in Q$ and all $a \in \Sigma$, $\delta(E', a) = \{\frac{d}{da}(E')\}$. The automaton $A$ is the dissimilar derivative DFA of $E$.

Proposition 3. The dissimilar derivative DFA of an EMRE $E$ recognizes $L(E)$.

Example 2. Let us consider the total EMRE $E_1 = \overline{(\widetilde{a^*b})(\widetilde{b^*a})} \cdot a^*$ defined in Example 1. Successive dissimilar derivatives of $E_1$ are computed as follows:
\[
\begin{array}{ll}
\frac{d}{da}(E_1) = a^*b \cdot (\widetilde{b^*a}) \cdot a^* + a^* = E_2 & \frac{d}{da}(E_4) = a^* = E_5 \\
\frac{d}{db}(E_1) = (\widetilde{b^*a}) \cdot a^* + b^*a \cdot a^* = E_3 & \frac{d}{db}(E_4) = b^*a \cdot a^* = E_6 \\
\frac{d}{da}(E_2) = a^*b \cdot (\widetilde{b^*a}) \cdot a^* + a^* = E_2 & \frac{d}{da}(E_5) = a^* = E_5 \\
\frac{d}{db}(E_2) = (\widetilde{b^*a}) \cdot a^* = E_4 & \frac{d}{db}(E_5) = 0 \\
\frac{d}{da}(E_3) = a^* = E_5 & \frac{d}{da}(E_6) = a^* = E_5 \\
\frac{d}{db}(E_3) = b^*a \cdot a^* = E_6 & \frac{d}{db}(E_6) = b^*a \cdot a^* = E_6
\end{array}
\]
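To make Definition 8 concrete, the derivatives listed in Example 2 can be stored as a transition table and run on input words. The Python sketch below is an illustration only. Two elements are assumptions of ours: the sink state named `"0"` standing for the derivative 0, and the set `FINAL`, which is not listed in the text and was obtained by checking by hand which of $E_1, \ldots, E_6$ denote a language containing $\varepsilon$.

```python
# Transitions read off Example 2; "0" is a sink state for the derivative 0.
DELTA = {
    ("E1", "a"): "E2", ("E1", "b"): "E3",
    ("E2", "a"): "E2", ("E2", "b"): "E4",
    ("E3", "a"): "E5", ("E3", "b"): "E6",
    ("E4", "a"): "E5", ("E4", "b"): "E6",
    ("E5", "a"): "E5", ("E5", "b"): "0",
    ("E6", "a"): "E5", ("E6", "b"): "E6",
    ("0", "a"): "0", ("0", "b"): "0",
}

# Assumption: states whose expression is nullable (checked by hand).
FINAL = {"E2", "E3", "E4", "E5"}

def accepts(word, start="E1"):
    """Run the deterministic automaton on word, starting from E1."""
    state = start
    for letter in word:
        state = DELTA[(state, letter)]
    return state in FINAL

for w in ["", "a", "b", "ab", "abb", "abba", "abab"]:
    print(repr(w), accepts(w))
```

For instance, the word ab is accepted through $E_1 \to E_2 \to E_4$, while abb ends in the non-final state $E_6$.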

Fig. 1. The Dissimilar Derivative DFA of E1 (states E1 to E6, transitions labeled a and b)

5

Partial Derivatives of an EMRE

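Partial derivatives [2] of a regular expression are defined as follows.

Definition 9. The partial derivative of a regular expression $E$ w.r.t. a word $w$ is the set $\frac{\partial}{\partial w}(E)$ of expressions inductively computed as follows:
\[
\begin{aligned}
\tfrac{\partial}{\partial \varepsilon}(E) &= \{E\}, \qquad \tfrac{\partial}{\partial (aw')}(E) = \tfrac{\partial}{\partial w'}\big(\tfrac{\partial}{\partial a}(E)\big),\\
\tfrac{\partial}{\partial a}(0) &= \tfrac{\partial}{\partial a}(1) = \tfrac{\partial}{\partial a}(b) = \emptyset, \qquad \tfrac{\partial}{\partial a}(a) = \{1\},\\
\tfrac{\partial}{\partial a}(F + G) &= \tfrac{\partial}{\partial a}(F) \cup \tfrac{\partial}{\partial a}(G), \qquad \tfrac{\partial}{\partial a}(F^*) = \tfrac{\partial}{\partial a}(F) \cdot F^*,\\
\tfrac{\partial}{\partial a}(F \cdot G) &=
\begin{cases}
\tfrac{\partial}{\partial a}(F) \cdot G \cup \tfrac{\partial}{\partial a}(G) & \text{if } \varepsilon \in L(F),\\
\tfrac{\partial}{\partial a}(F) \cdot G & \text{otherwise,}
\end{cases}
\end{aligned}
\]
where $F$ and $G$ are any two regular expressions over the alphabet $\Sigma$, $a$ and $b$ are any two distinct symbols of $\Sigma$, $w'$ is any word in $\Sigma^*$ and, for any set $\mathcal{E}$ of expressions, $\frac{\partial}{\partial a}(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} \frac{\partial}{\partial a}(E)$ and $L(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} L(E)$.

We now define the partial derivatives of a total EMRE.

Definition 10. Let $E = \operatorname{MTB}_{T;B}(E_{1,n})$ be a total EMRE over an alphabet $\Sigma$, let $a$ be a symbol in $\Sigma$ and $w$ be a word in $\Sigma^*$. Then:
\[
\tfrac{\partial}{\partial a}(E) = \tfrac{\partial}{\partial a}(E_1) \cdot \operatorname{MTB}_{T_{\geq 2};B_{\geq 2}}(E_{2,n})
\;\cup \bigcup_{(1,k-1) \in T} \tfrac{\partial}{\partial a}(E_k) \cdot \operatorname{MTB}_{T_{\geq k+1};B_{\geq k+1}}(E_{k+1,n}),
\]
\[
\tfrac{\partial}{\partial w}(E) =
\begin{cases}
\{E\} & \text{if } w = \varepsilon,\\
\tfrac{\partial}{\partial w'}\big(\tfrac{\partial}{\partial b}(E)\big) & \text{if } w = b \cdot w' \wedge b \in \Sigma \wedge w' \in \Sigma^*.
\end{cases}
\]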
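Definitions 9 and 10 can be prototyped together. The Python sketch below is a tentative illustration, not the authors' implementation. It rests on assumptions of ours: expressions are nested tuples, a total multi-tilde-bar node is the tuple `("mtb", T, B, exprs)`, the operator applied to an empty list is taken to be 1, the simplifications $1 \cdot E = E \cdot 1 = E$ are applied, and the nullability of a total multi-tilde-bar expression is read off Lemma 3 (the empty word belongs to $L(E)$ exactly when $(1, n) \in T$). All function names are hypothetical.

```python
ZERO, ONE = ("0",), ("1",)

def sym(a):
    return ("sym", a)

def star(e):
    return ("*", e)

def cat(e, f):
    """Catenation with the extra simplifications 1.E = E.1 = E."""
    if e == ONE:
        return f
    if f == ONE:
        return e
    return (".", e, f)

def mtb(T, B, exprs):
    """Multi-tilde-bar node; on an empty list it is taken to be 1 (assumption)."""
    exprs = tuple(exprs)
    return ("mtb", frozenset(T), frozenset(B), exprs) if exprs else ONE

def shift(S, k):
    """S_{>=k} with renumbering."""
    return frozenset((i - k + 1, f - k + 1) for (i, f) in S if i >= k)

def nullable(e):
    tag = e[0]
    if tag in ("1", "*"):
        return True
    if tag in ("0", "sym"):
        return False
    if tag == "+":
        return nullable(e[1]) or nullable(e[2])
    if tag == ".":
        return nullable(e[1]) and nullable(e[2])
    _, T, _, exprs = e                  # total multi-tilde-bar node
    return (1, len(exprs)) in T         # by Lemma 3

def pderiv(a, e):
    """Partial derivative w.r.t. the symbol a (Definitions 9 and 10)."""
    tag = e[0]
    if tag in ("0", "1"):
        return set()
    if tag == "sym":
        return {ONE} if e[1] == a else set()
    if tag == "+":
        return pderiv(a, e[1]) | pderiv(a, e[2])
    if tag == "*":
        return {cat(d, e) for d in pderiv(a, e[1])}
    if tag == ".":
        left = {cat(d, e[2]) for d in pderiv(a, e[1])}
        return left | pderiv(a, e[2]) if nullable(e[1]) else left
    _, T, B, exprs = e
    n = len(exprs)
    # k = 1 always contributes; k >= 2 contributes when (1, k-1) is a tilde.
    starts = [1] + [k for k in range(2, n + 1) if (1, k - 1) in T]
    out = set()
    for k in starts:
        rest = mtb(shift(T, k + 1), shift(B, k + 1), exprs[k:])
        out |= {cat(d, rest) for d in pderiv(a, exprs[k - 1])}
    return out

def derived_terms(e, alphabet):
    """Closure of {e} under pderiv, computed with a worklist."""
    seen, todo = {e}, [e]
    while todo:
        cur = todo.pop()
        for a in alphabet:
            for d in pderiv(a, cur) - seen:
                seen.add(d)
                todo.append(d)
    return seen

# The expression E1 of Example 1.
a_star = star(sym("a"))
asb = cat(a_star, sym("b"))                  # a*b
bsa = cat(star(sym("b")), sym("a"))          # b*a
E1 = cat(mtb([(1, 1), (2, 2)], [(1, 2)], [asb, bsa]), a_star)
print(len(pderiv("a", E1)), len(derived_terms(E1, "ab")))   # 2 5
```

Under these assumptions, the partial derivative of E1 w.r.t. a contains two terms, one of which is a*, and the closure contains five terms including E1 itself.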

Proposition 4. Let $E = \operatorname{MTB}_{T;B}(E_{1,n})$ be a total EMRE over an alphabet $\Sigma$ and $w$ be a word in $\Sigma^*$. Then $L\big(\frac{\partial}{\partial w}(E)\big) = w^{-1}(L(E))$.

By definition, a partial derivative of an expression $E$ is a set of expressions and each of these expressions is called a derivated term of $E$. We show that the set $D_E$ of all the derivated terms of an EMRE $E$ is finite and we give the construction of the derivated term NFA.

Lemma 4. Let $E = \operatorname{MTB}_{T;B}(E_{1,n})$ be a total EMRE over an alphabet $\Sigma$ and let $w$ be a word in $\Sigma^+$. Then:
\[
\tfrac{\partial}{\partial w}(E) \subset \bigcup_{w = uv \,\wedge\, v \neq \varepsilon} \ \bigcup_{k=1}^{n} \tfrac{\partial}{\partial v}(E_k) \cdot \operatorname{MTB}_{T_{\geq k+1};B_{\geq k+1}}(E_{k+1,n}).
\]

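Proposition 5. Let $E$ be a total EMRE. Then: $\#D_E \leq |E| + 1$.

Definition 11. Let $E$ be an EMRE over an alphabet $\Sigma$. Let $A = (\Sigma, Q, I, F, \delta)$ be the automaton defined by $Q = D_E$, $I = \{E\}$, $F = \{E' \in Q \mid \varepsilon \in L(E')\}$ and, for any expression $E' \in Q$ and any symbol $a$ in $\Sigma$, $\delta(E', a) = \frac{\partial}{\partial a}(E')$. The automaton $A$ is the derivated term NFA of $E$.

Proposition 6. The derivated term automaton of an EMRE $E$ recognizes $L(E)$.

Example 3. Let us consider the total EMRE $E_1 = \overline{(\widetilde{a^*b})(\widetilde{b^*a})} \cdot a^*$ defined in Example 2. Successive derivated terms of $E_1$ are computed as follows:
\[
\begin{array}{ll}
\frac{\partial}{\partial a}(E_1) = \{a^*b(\widetilde{b^*a}) \cdot a^*,\ a^*\} = \{E_2, E_3\} & \frac{\partial}{\partial a}(E_3) = \{a^*\} = \{E_3\} \\
\frac{\partial}{\partial b}(E_1) = \{(\widetilde{b^*a}) \cdot a^*,\ b^*a \cdot a^*\} = \{E_4, E_5\} & \frac{\partial}{\partial b}(E_3) = \emptyset \\
\frac{\partial}{\partial a}(E_2) = \{a^*b(\widetilde{b^*a}) \cdot a^*\} = \{E_2\} & \frac{\partial}{\partial a}(E_4) = \{a^*\} = \{E_3\} \\
\frac{\partial}{\partial b}(E_2) = \{(\widetilde{b^*a}) \cdot a^*\} = \{E_4\} & \frac{\partial}{\partial b}(E_4) = \{b^*a \cdot a^*\} = \{E_5\} \\
 & \frac{\partial}{\partial a}(E_5) = \{a^*\} = \{E_3\} \\
 & \frac{\partial}{\partial b}(E_5) = \{b^*a \cdot a^*\} = \{E_5\}
\end{array}
\]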
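The derivated term NFA of Definition 11 can be simulated directly from the sets listed in Example 3. The Python sketch below is an illustration only: the table `DELTA` copies the example, while the set `FINAL` is an assumption of ours, obtained by checking by hand which derivated terms denote a language containing $\varepsilon$ (here $E_3 = a^*$ and $E_4 = (\widetilde{b^*a}) \cdot a^*$).

```python
# Transitions read off Example 3; a missing target set means no transition.
DELTA = {
    ("E1", "a"): {"E2", "E3"}, ("E1", "b"): {"E4", "E5"},
    ("E2", "a"): {"E2"},       ("E2", "b"): {"E4"},
    ("E3", "a"): {"E3"},       ("E3", "b"): set(),
    ("E4", "a"): {"E3"},       ("E4", "b"): {"E5"},
    ("E5", "a"): {"E3"},       ("E5", "b"): {"E5"},
}

# Assumption: derivated terms whose language contains the empty word.
FINAL = {"E3", "E4"}

def accepts(word, initial=("E1",)):
    """Subset simulation: track the set of states reachable after each letter."""
    current = set(initial)
    for letter in word:
        current = set().union(*(DELTA[(q, letter)] for q in current))
    return bool(current & FINAL)

for w in ["", "a", "b", "ab", "abb", "abba", "abab"]:
    print(repr(w), accepts(w))
```

On these sample words the simulation gives the same verdicts as membership in $L(E_1)$ checked by hand, for instance ab is accepted and abb is rejected.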

Fig. 2. The Derivated Term NFA of E1 (states E1 to E5, transitions labeled a and b)

6

Conclusion

We have shown how the Brzozowski derivation and the Antimirov one can be applied to the case of (simple) regular expressions extended to multi-tilde-bar operators. The computation of the c-continuations for such expressions has already been investigated, although it is not presented here. The main interest of c-continuations is that they allow us to efficiently implement Glushkov and Antimirov NFAs. We also intend to generalize these derivations to the case of unrestricted regular expressions extended to multi-tilde-bar operators.

References

1. Almeida, M., Moreira, N., Reis, R.: Antimirov and Mosses's rewrite system revisited. Int. J. Found. Comput. Sci. 20(4), 669–684 (2009)
2. Antimirov, V.: Partial derivatives of regular expressions and finite automaton constructions. Theoret. Comput. Sci. 155, 291–319 (1996)
3. Antimirov, V.M., Mosses, P.D.: Rewriting extended regular expressions. Theoret. Comput. Sci. 143(1), 51–72 (1995)
4. Berry, G., Sethi, R.: From regular expressions to deterministic automata. Theoret. Comput. Sci. 48(1), 117–126 (1986)
5. Brzozowski, J.A.: Derivatives of regular expressions. J. Assoc. Comput. Mach. 11(4), 481–494 (1964)
6. Brzozowski, J.A.: Quotient complexity of regular languages. Journal of Automata, Languages and Combinatorics 15(1/2), 71–89 (2010)
7. Brzozowski, J.A., Leiss, E.L.: On equations for regular languages, finite automata, and sequential networks. Theoret. Comput. Sci. 10, 19–35 (1980)
8. Caron, P., Champarnaud, J.-M., Mignot, L.: Erratum to "Acyclic automata and small expressions using multi-tilde-bar operators" [Theoret. Comput. Sci. 411(38-39), 3423–3435 (2010)]. Theoret. Comput. Sci. 412(29), 3795–3796 (2011)
9. Caron, P., Champarnaud, J.-M., Mignot, L.: Multi-bar and multi-tilde regular operators. Journal of Automata, Languages and Combinatorics 16(1), 11–26 (2011)
10. Caron, P., Champarnaud, J.-M., Mignot, L.: Partial Derivatives of an Extended Regular Expression. In: Dediu, A.-H., Inenaga, S., Martín-Vide, C. (eds.) LATA 2011. LNCS, vol. 6638, pp. 179–191. Springer, Heidelberg (2011)
11. Caron, P., Champarnaud, J.-M., Mignot, L.: A general frame for the derivation of regular expressions (submitted, 2012)
12. Champarnaud, J.-M., Jeanne, H., Mignot, L.: Approximate Regular Expressions and Their Derivatives. In: Dediu, A.-H., Martín-Vide, C. (eds.) LATA 2012. LNCS, vol. 7183, pp. 179–191. Springer, Heidelberg (2012)
13. Champarnaud, J.-M., Ouardi, F., Ziadi, D.: An efficient computation of the equation K-automaton of a regular K-expression. Fundam. Inform. 90(1-2), 1–16 (2009)
14. Champarnaud, J.-M., Ziadi, D.: Canonical derivatives, partial derivatives, and finite automaton constructions. Theoret. Comput. Sci. 239(1), 137–163 (2002)
15. Conway, J.H.: Regular algebra and finite machines. Chapman and Hall (1971)
16. Frishert, M.: FIRE Works & FIRE Station: A finite automata and regular expression playground. Ph.D. thesis, Eindhoven University, Netherlands (2005)
17. Ginzburg, A.: A procedure for checking equality of regular expressions. J. ACM 14(2), 355–362 (1967)
18. Ilie, L., Yu, S.: Follow automata. Inf. Comput. 186(1), 140–162 (2003)
19. Kleene, S.: Representation of events in nerve nets and finite automata. Automata Studies, Ann. Math. Studies 34, 3–41 (1956)
20. Krob, D.: Differentiation of K-rational expressions. Internat. J. Algebra Comput. 2(1), 57–87 (1992)
21. Lombardy, S., Sakarovitch, J.: Derivatives of rational expressions with multiplicity. Theoret. Comput. Sci. 332(1-3), 141–177 (2005)
22. Owens, S., Reppy, J.H., Turon, A.: Regular-expression derivatives re-examined. J. Funct. Program. 19(2), 173–190 (2009)
23. Sulzmann, M., Lu, K.: Partial derivative regular expression pattern matching (December 2007) (manuscript)
24. Yu, S.: Regular languages. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages. Word, Language, Grammar, vol. I, pp. 41–110. Springer, Berlin (1997)

Abstract. Multi-tilde-bar operators allow us to extend regular expressions. The associated extended expressions are compatible with the structure of Glushkov automata and they provide a more succinct representation than standard expressions. The aim of this paper is to examine the derivation of multi-tilde-bar expressions. Two types of computation are investigated: Brzozowski derivation and Antimirov derivation, as well as the construction of the associated automata.

1

Introduction

Regular expression word derivatives have been introduced in [5] by Brzozowski in order to compute language quotients via expression derivatives: for any word w, the language denoted by the derivative of a regular expression E w.r.t. w is the left quotient of the language denoted by E w.r.t. w. Regular expression derivation plays a fundamental role in theory of automata. In particular, under the assumption that the set D of all the derivatives of a regular expression E is ﬁnite, it is possible to construct a FA (ﬁnite automaton) with D as a set of states that recognizes the language denoted by E. Word derivatives handle unrestricted regular expressions; they are themselves expressions and they provide a DFA (deterministic ﬁnite automaton), as far as the ACI (associativity, commutativity and idempotence) properties of the sum of two expressions are used. Alternative types of derivation have been designed since Brzozowski’s seminal work. Partial derivatives, due to Antimirov [2], only address simple regular expressions; they are sets of expressions and they provide both a DFA and a NFA (non-deterministic ﬁnite automaton). Antimirov derivatives have been recently extended to unrestricted regular expressions [10]; extended partial derivatives are sets of sets of expressions and they provide a DFA, a NFA and an AFA (alternating ﬁnite automaton) [11]. Some derivations are based on the linearization of the (simple) input expression: let us cite the continuations of Berry and Sethi [4], the c-continuations of Champarnaud and Ziadi [14] and the derivatives of Ilie and Yu [18]. Let us mention that Antimirov derivation has been extended to the case of weighted rational expressions [21,13]. As reported in [2], the concept of derivation has been successfully used to investigate the properties of regular expressions [17,15,7,20,3,1]. 
More recently, Brzozowski introduced a new approach for studying the state complexity of regular languages, based on the counting of their quotients (or of their derivatives) [6]. N. Moreira and R. Reis (Eds.): CIAA 2012, LNCS 7381, pp. 321–328, 2012. c Springer-Verlag Berlin Heidelberg 2012

322

P. Caron, J.-M. Champarnaud, and L. Mignot

Moreover, derivatives provide a useful tool to implement regular matching algorithms [23,16], or scanner generators as reported in [22]. A close topic is the derivation of new operators that extend regular expressions. For example, the computation of the derivatives of an approximative regular expression (that denotes a languages at a bounded distance from a given language) has been presented in [12]. The aim of this paper is to investigate the derivation of the multi-tilde-bar expressions introduced in [8,9]. These expressions are built upon simple operators and multi-tilde-bar operators and their main interest is that they are compatible with the structure of Glushkov automata and more succinct than standard expressions. We provide formulae for the computation of word and partial derivatives of multi-tilde-bar expressions and investigate the properties of these derivatives. The next section gathers classical notions concerning regular languages, regular expressions and ﬁnite automata; it also recalls the deﬁnition and main properties of multi-tilde-bar operators. The deﬁnition of the quotient of the language of an extended to multi-tilde-bar expression is introduced in Section 3. Section 4 is devoted to the computation of the Brzozowski derivatives of an extended expression and Section 5 to the computation of the Antimirov derivatives. In both cases, the construction of the associated automaton is provided.

2

Preliminaries

We recall some deﬁnitions and notation concerning regular languages, regular expressions , ﬁnite automata and multi-tilde-bar expressions. For further details about these topics, we refer to classical books such as [24]. Languages, Regular Expressions and Automata An alphabet is a ﬁnite set of symbols. Given an alphabet Σ, any subset of Σ ∗ is a language over Σ. The set of regular languages over Σ is denoted by Reg(Σ ∗ ) and is deﬁned as the smallest family of languages containing ∅ and {a} for every symbol a in Σ and closed under union, catenation and Kleene star. A regular expression E over an alphabet Σ is inductively deﬁned by E = 0, E = 1, E = a, E = (F + G), E = (F · G), E = (F ∗ ) with a a symbol in Σ, and F and G two regular expressions over Σ. The language denoted by a regular expression is inductively deﬁned by L(0) = ∅, L(1) = {ε}, L(a) = {a}, L(F + G) = L(F ) ∪ L(G), L(F · G) = L(F ) · L(G) and L(F ∗ ) = L(F )∗ , with a a symbol in Σ, and F and G two regular expressions over Σ. By construction, the language denoted by a regular expression is regular. The alphabetic width |E| of E is the number of occurrences of symbols of Σ appearing in E. A finite automaton A is a 5-tuple (Σ, Q, I, F, δ) where Σ is an alphabet, Q is a ﬁnite set of states, I ⊂ Q a set of initial states, F ⊂ Q a set of final states and δ ⊂ Q × Σ × Q a set of transitions. The set δ can be seen as a function from Q × Σ to 2Q deﬁned by q ∈ δ(q, a) ⇔ (q, a, q ) ∈ δ. The domain of the function δ canbe extended to 2Q × Σ ∗ by setting, for all Q ⊂ Q, δ(Q , ε) = Q , δ(Q , a) = q∈Q δ(q, a), δ(Q , a · w) = δ(δ(Q , a), w) for all word w in Σ ∗ . The language recognized by the automaton A is the set L(A) = {w ∈ Σ ∗ | δ(I, w) ∩ F = ∅}. A language

Multi-Tilde-Bar Derivatives

323

L is recognizable if there exists an automaton that recognizes it. The set of recognizable languages over Σ is denoted by Rec(Σ ∗ ). Kleene theorem [19] asserts that Reg(Σ ∗ ) = Rec(Σ ∗ ). Consequently , for every regular language L, there exist an automaton A and an expression E such that L = L(E) = L(A). The Multi-tilde-Bar Operators [8,9] are deﬁned for The unary operators tilde, denoted by , and bar, denoted by every expression E by L( E ) = L(E) ∪ {ε} and L( E ) = L(E) \ {ε}. They are extended to multi-tilde-bar operators, which are applied to a list of expressions, according to the following deﬁnitions. Let n be a positive integer. For convenience, the list (E1 , . . . , En ) of expressions is denoted by E1,n . Similarly, a catenation E1 · · · En is denoted by E1···n . The set of integers {1, . . . , n} is denoted by 1, n. The subset of pairs (i, j) such that if 1 ≤ i ≤ j ≤ n is denoted by 1, n2≤ . The set of ﬁnite lists of pairs in 1, n2≤ is denoted by Sn . Let S be a list in Sn . Let k be in 1, n. The list S≤k (resp. S≥k ) is deﬁned by S≤k = ((i, f ) ∈ S | f ≤ k) (resp. S≥k = ((i − k + 1, f − k + 1) ∈ S | i ≥ k)). Let us notice that a renumbering is performed for the computation of S≥k . A list S is said to be free if for all pairs (i, f ), (i , f ) in S such that (i, f ) = (i , f ), i, f ∩ i , f = ∅. Let L1 , . . . , Ln be n nonempty regular languages over Σ and w be a word in L1 · · · Ln . A sequence (w1 , . . . , wn ) satisfying w1 · · · wn = w ∧ ∀k ∈ 1, n, wk ∈ Lk is said to be a split up of w over (L1 , . . . , Ln ). Multi-tilde-bar operators are a natural combination of multi-tilde and multibar operators [9]. The respective role of tildes and bars is explicited in the two following deﬁnitions. Definition 1. Let (w1 , . . . , wn ) be a split up of a word w over a list of languages The sequence (w1 , . . . , wn ) (L1 ∪ {ε}, . . . , Ln ∪ {ε}). Let T be a free list in Sn . 
is generated by the list T if it holds: wk = ε if k ∈ (i,f )∈T i, f and wk ∈ Lk otherwise. Bars are used to forbid some combinations of tildes. Consequently, the satisfaction of a bar by a sequence has to be deﬁned with a list of tildes as a context. Definition 2. Let E1,n be a list of n expressions. Let (w1 , . . . , wn ) be a split up of a word w over (L(E1 ) ∪ {ε}, . . . , L(En ) ∪ {ε}) generated by a free list T in Sn . Let b = (i, f ) be a pair in 1, n2≤ \ T . The bar b is said to be satisﬁed by (w1 , . . . , wn ) w.r.t. T if at least one of the three following conditions is satisfied: (1) there exists a pair t in T such that t overlaps b, (2) there exists a pair t in T such that b is included in t, (3) wi · · · wf = ε. According to the two previous deﬁnitions, the language denoted by a multi-tildebar can be expressed as follows: Definition 3 ([8]). Let E1,n be a list of expressions over an alphabet Σ and L be the list (L(E1 ) ∪ {ε}, . . . , L(En ) ∪ {ε}) of languages. Let B and T be two lists

324

P. Caron, J.-M. Champarnaud, and L. Mignot

in Sn such that B ∩ T = ∅. The multi-tilde-bar E = language L(E)=

T ;B

E1,n denotes the

w ∈ Σ ∗ |there exists a split up of w over L generated by a free sublist T of T satisfying every bar in B w.r.t. T .

Example 1. Let us consider the EMRE E1 deﬁned by ∗ (a b), (b∗ a) · a∗ (i.e. ( a∗ b )( b∗ a ) · a∗ ). E1 = (1,1),(2,2);(1,2) The language denoted by E1 is the set L(E1 ) = (((L(a∗ b) ∪ {ε}) · (L(b∗ a) ∪ {ε})) \ {ε}) · L(a∗ ). Definition 4. Let Σ be an alphabet. An Extended to multi-tilde-bar Regular Expression (EMRE) over Σ is inductively defined by: E = 0, E = 1, E = a, E1,n , E = E1 + E2 , E = E1 · E2 , E = E1∗ , E = T ;B where E1 , . . . , En are any n EMREs over an alphabet Σ, a is any symbol in Σ and T and B are any two disjoint lists in Sn . Definition 5. An EMRE is said to be total if and only if for any of its multitilde-bar subsexpressions E1,n it holds T ∪ B = 1, n2≤ . T ;B Lemma 1 ([8]). Any EMRE admits an equivalent total one.

3

Quotient Formulae

We now recall the inductive computation of the quotient w−1 (L) of a language L w.r.t. a word w in Σ ∗ , that is the set {w ∈ Σ ∗ | ww ∈ L}. Lemma 2. Let L be language in Reg(Σ ∗ ) and w be a word in Σ ∗ . The quotient w−1 (L) of L w.r.t. w is inductively computed as follows: ε−1 (L) = L, (aw )−1 (L) = w−1 (a−1 (L)), −1 a (∅) = a−1 ({ε}) = a−1 ({b}) = ∅, a−1 (a) = {ε}, −1 −1 −1 ∗ −1 ∗ a (L1 ∪ L2 ) = a−1 (L 1 )−1∪ a (L2 ), a−1 (L1 ) = a (L1 ) · L1 , a (L1 ) · L2 ∪ a (L2 ) if ε ∈ L1 , a−1 (L1 · L2 ) = otherwise. a−1 (L1 ) · L2 where L1 and L2 are any two languages in Reg(Σ ∗ ), a and b are any two distincts symbols in Σ and w is any word in Σ ∗ . Lemma 3. Let E = E1,n be a total EMRE over an alphabet Σ. Then: T ;B

L(E) =

{ε | (1, n) ∈ T } ∪ (L(E1 ) \ {ε}) · L( ∪ (1,k−1)∈T (L(Ek ) \ {ε}) · L( T

T≥2 ;B≥2

≥k+1 ;B≥k+1

E2,n ) . Ek+1,n )

E1,n be a total EMRE over an alphabet Σ and Corollary 1. Let E = T ;B let a be a symbol in Σ. Then: a−1 (L(E)) =

E2,n ) a−1 (L(E1 )) · L( T≥2 ;B≥2 ∪ (1,k−1)∈T a−1 (L(Ek )) · L( T

≥k+1 ;B≥k+1

Ek+1,n )

Multi-Tilde-Bar Derivatives

4

325

Word Derivatives of an EMRE

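The set of all the word derivatives of a regular expression can be infinite. However, Brzozowski derivation yields a finite set of derivatives (called dissimilar derivatives) based on the use of the $+_{ACI}$ operator, which is associative, commutative and idempotent. We extend these results to the case of EMREs and give the construction of the dissimilar derivative DFA of an EMRE.

Definition 6. Let $E$ be a regular expression over the alphabet $\Sigma$ and $w$ be a word in $\Sigma^*$. The dissimilar derivative $\frac{d}{d_w}(E)$ of $E$ w.r.t. $w$ is inductively computed as follows:
\[
\begin{aligned}
\frac{d}{d_\varepsilon}(E) &= E, \qquad \frac{d}{d_{aw'}}(E) = \frac{d}{d_{w'}}\Big(\frac{d}{d_a}(E)\Big),\\
\frac{d}{d_a}(0) &= \frac{d}{d_a}(1) = \frac{d}{d_a}(b) = 0, \qquad \frac{d}{d_a}(a) = 1,\\
\frac{d}{d_a}(F + G) &= \frac{d}{d_a}(F) + \frac{d}{d_a}(G), \qquad \frac{d}{d_a}(F^*) = \frac{d}{d_a}(F) \cdot F^*,\\
\frac{d}{d_a}(F \cdot G) &= \begin{cases} \frac{d}{d_a}(F) \cdot G +_{ACI} \frac{d}{d_a}(G) & \text{if } \varepsilon \in L(F),\\ \frac{d}{d_a}(F) \cdot G & \text{otherwise,} \end{cases}
\end{aligned}
\]
where $F$ and $G$ are any two regular expressions over the alphabet $\Sigma$, $a$ and $b$ are any two distinct symbols of $\Sigma$ and $w'$ is any word in $\Sigma^*$.

Definition 7. Let $E = (E_{1,n})_{T;B}$ be a total EMRE over an alphabet $\Sigma$, let $a$ be a symbol in $\Sigma$ and $w$ be a word in $\Sigma^*$. Then:
\[
\frac{d}{d_a}(E) = \frac{d}{d_a}(E_1) \cdot (E_{2,n})_{T_{\geq 2};B_{\geq 2}} \;+_{ACI} \mathop{{\sum}_{ACI}}\limits_{(1,k-1) \in T} \frac{d}{d_a}(E_k) \cdot (E_{k+1,n})_{T_{\geq k+1};B_{\geq k+1}},
\]
\[
\frac{d}{d_w}(E) = \begin{cases} E & \text{if } w = \varepsilon,\\ \frac{d}{d_{w'}}\Big(\frac{d}{d_b}(E)\Big) & \text{if } w = b \cdot w' \wedge b \in \Sigma \wedge w' \in \Sigma^*. \end{cases}
\]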

Proposition 1. The derivative of an EMRE $E$ w.r.t. a word $w$ denotes the set $w^{-1}(L(E))$.

Proposition 2. The set of dissimilar derivatives of an EMRE is finite.

Definition 8. Let $E$ be an EMRE over an alphabet $\Sigma$ and $D_E$ be the set of the dissimilar derivatives of $E$. Let $A = (\Sigma, Q, I, F, \delta)$ be the automaton defined by $Q = D_E$, $I = \{E\}$, $F = \{E' \in Q \mid \varepsilon \in L(E')\}$, and $\forall E' \in Q$, $\forall a \in \Sigma$, $\delta(E', a) = \{\frac{d}{d_a}(E')\}$. The automaton $A$ is the dissimilar derivative DFA of $E$.

Proposition 3. The dissimilar derivative DFA of an EMRE $E$ recognizes $L(E)$.

Example 2. Let us consider the total EMRE $E_1 = (a^* b)(b^* a) \cdot a^*$ defined in Example 1. Successive dissimilar derivatives of $E_1$ are computed as follows:
\[
\begin{array}{ll}
\frac{d}{d_a}(E_1) = a^* b \cdot (b^* a) \cdot a^* + a^* = E_2 & \frac{d}{d_a}(E_4) = a^* = E_5\\
\frac{d}{d_b}(E_1) = (b^* a) \cdot a^* + b^* a \cdot a^* = E_3 & \frac{d}{d_b}(E_4) = b^* a \cdot a^* = E_6\\
\frac{d}{d_a}(E_2) = a^* b \cdot (b^* a) \cdot a^* + a^* = E_2 & \frac{d}{d_a}(E_5) = a^* = E_5\\
\frac{d}{d_b}(E_2) = (b^* a) \cdot a^* = E_4 & \frac{d}{d_b}(E_5) = 0\\
\frac{d}{d_a}(E_3) = a^* = E_5 & \frac{d}{d_a}(E_6) = a^* = E_5\\
\frac{d}{d_b}(E_3) = b^* a \cdot a^* = E_6 & \frac{d}{d_b}(E_6) = b^* a \cdot a^* = E_6
\end{array}
\]

Fig. 1. The Dissimilar Derivative DFA of E1
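To make Definitions 6 and 8 concrete, here is a minimal Python sketch for standard regular expressions only; the multi-tilde-bar operator of Definition 7 is not covered. Expressions are encoded as nested tuples, and the ACI sum is obtained by storing the operands of a sum in a `frozenset`. The encoding, the function names and the extra simplifications 0 + F = F, 0 · F = F · 0 = 0 and 1 · F = F are assumptions of this sketch, chosen so that the computed states look like the simplified expressions of Example 2.

```python
from collections import deque

ZERO, ONE = ("0",), ("1",)


def sym(a):
    return ("sym", a)


def star(f):
    return ("*", f)


def cat(f, g):
    # Assumed simplifications: 0.G = F.0 = 0 and 1.G = G.
    if f == ZERO or g == ZERO:
        return ZERO
    return g if f == ONE else (".", f, g)


def plus(f, g):
    # ACI sum: operands are flattened into a frozenset; 0 is dropped.
    terms = set()
    for e in (f, g):
        terms |= set(e[1]) if e[0] == "+" else {e}
    terms.discard(ZERO)
    if not terms:
        return ZERO
    return next(iter(terms)) if len(terms) == 1 else ("+", frozenset(terms))


def nullable(e):
    """True when the empty word belongs to L(e)."""
    tag = e[0]
    if tag in ("0", "sym"):
        return False
    if tag in ("1", "*"):
        return True
    if tag == "+":
        return any(nullable(t) for t in e[1])
    return nullable(e[1]) and nullable(e[2])  # catenation


def derive(e, a):
    """Derivative of e w.r.t. the symbol a, following Definition 6."""
    tag = e[0]
    if tag in ("0", "1"):
        return ZERO
    if tag == "sym":
        return ONE if e[1] == a else ZERO
    if tag == "+":
        result = ZERO
        for t in e[1]:
            result = plus(result, derive(t, a))
        return result
    if tag == "*":
        return cat(derive(e[1], a), e)
    f, g = e[1], e[2]
    left = cat(derive(f, a), g)
    return plus(left, derive(g, a)) if nullable(f) else left


def derivative_dfa(e, alphabet):
    """Worklist construction of the automaton of Definition 8."""
    states, delta, todo = {e}, {}, deque([e])
    while todo:
        q = todo.popleft()
        for a in alphabet:
            r = derive(q, a)
            delta[q, a] = r
            if r not in states:
                states.add(r)
                todo.append(r)
    finals = {q for q in states if nullable(q)}
    return states, delta, finals


def accepts(dfa, start, word):
    _, delta, finals = dfa
    q = start
    for a in word:
        q = delta[q, a]
    return q in finals


E = cat(star(sym("a")), sym("b"))  # the expression a*.b
dfa = derivative_dfa(E, "ab")
```

On this input the construction stops with the three states a*·b, 1 and 0, and `accepts(dfa, E, "aab")` holds while `accepts(dfa, E, "aba")` does not.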

5

Partial Derivatives of an EMRE

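Partial derivatives [2] of a regular expression are defined as follows.

Definition 9. The partial derivative of a regular expression $E$ w.r.t. a word $w$ is the set $\frac{\partial}{\partial_w}(E)$ of expressions inductively computed as follows:
\[
\begin{aligned}
\frac{\partial}{\partial_\varepsilon}(E) &= \{E\}, \qquad \frac{\partial}{\partial_{aw'}}(E) = \frac{\partial}{\partial_{w'}}\Big(\frac{\partial}{\partial_a}(E)\Big),\\
\frac{\partial}{\partial_a}(0) &= \frac{\partial}{\partial_a}(1) = \frac{\partial}{\partial_a}(b) = \emptyset, \qquad \frac{\partial}{\partial_a}(a) = \{1\},\\
\frac{\partial}{\partial_a}(F + G) &= \frac{\partial}{\partial_a}(F) \cup \frac{\partial}{\partial_a}(G), \qquad \frac{\partial}{\partial_a}(F^*) = \frac{\partial}{\partial_a}(F) \cdot F^*,\\
\frac{\partial}{\partial_a}(F \cdot G) &= \begin{cases} \frac{\partial}{\partial_a}(F) \cdot G \cup \frac{\partial}{\partial_a}(G) & \text{if } \varepsilon \in L(F),\\ \frac{\partial}{\partial_a}(F) \cdot G & \text{otherwise,} \end{cases}
\end{aligned}
\]
where $F$ and $G$ are any two regular expressions over the alphabet $\Sigma$, $a$ and $b$ are any two distinct symbols of $\Sigma$, $w'$ is any word in $\Sigma^*$, and for any set of expressions $\mathcal{E}$, $\frac{\partial}{\partial_a}(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} \frac{\partial}{\partial_a}(E)$ and $L(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} L(E)$.

We now define the partial derivatives of a total EMRE.

Definition 10. Let $E = (E_{1,n})_{T;B}$ be a total EMRE over an alphabet $\Sigma$, let $a$ be a symbol in $\Sigma$ and $w$ be a word in $\Sigma^*$. Then:
\[
\frac{\partial}{\partial_a}(E) = \frac{\partial}{\partial_a}(E_1) \cdot (E_{2,n})_{T_{\geq 2};B_{\geq 2}} \;\cup \bigcup_{(1,k-1) \in T} \frac{\partial}{\partial_a}(E_k) \cdot (E_{k+1,n})_{T_{\geq k+1};B_{\geq k+1}},
\]
\[
\frac{\partial}{\partial_w}(E) = \begin{cases} \{E\} & \text{if } w = \varepsilon,\\ \frac{\partial}{\partial_{w'}}\Big(\frac{\partial}{\partial_b}(E)\Big) & \text{if } w = b \cdot w' \wedge b \in \Sigma \wedge w' \in \Sigma^*. \end{cases}
\]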

Proposition 4. Let $E = (E_{1,n})_{T;B}$ be a total EMRE over an alphabet $\Sigma$ and $w$ be a word in $\Sigma^*$. Then $L\big(\frac{\partial}{\partial_w}(E)\big) = w^{-1}(L(E))$.

By definition, a partial derivative of an expression $E$ is a set of expressions and each of these expressions is called a derivated term of $E$. We show that the set $D_E$ of all the derivated terms of an EMRE $E$ is finite and we give the construction of the derivated term NFA.

Lemma 4. Let $E = (E_{1,n})_{T;B}$ be a total EMRE over an alphabet $\Sigma$ and let $w$ be a word in $\Sigma^+$. Then:
\[
\frac{\partial}{\partial_w}(E) \subset \bigcup_{w = uv \,\wedge\, v \neq \varepsilon} \; \bigcup_{k=1}^{n} \frac{\partial}{\partial_v}(E_k) \cdot (E_{k+1,n})_{T_{\geq k+1};B_{\geq k+1}}.
\]

Proposition 5. Let $E$ be a total EMRE. Then: $\#D_E \leq |E| + 1$.

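Definition 11. Let $E$ be an EMRE over an alphabet $\Sigma$. Let $A = (\Sigma, Q, I, F, \delta)$ be the automaton defined by $Q = D_E$, $I = \{E\}$, $F = \{E' \in Q \mid \varepsilon \in L(E')\}$, and for any expression $E' \in Q$, for any symbol $a$ in $\Sigma$, $\delta(E', a) = \frac{\partial}{\partial_a}(E')$. The automaton $A$ is the derivated term NFA of $E$.

Proposition 6. The derivated term automaton of an EMRE $E$ recognizes $L(E)$.

Example 3. Let us consider the total EMRE $E_1 = (a^* b)(b^* a) \cdot a^*$ defined in Example 2. Successive derivated terms of $E_1$ are computed as follows:
\[
\begin{array}{ll}
\frac{\partial}{\partial_a}(E_1) = \{a^* b (b^* a) \cdot a^*,\; a^*\} = \{E_2, E_3\} & \frac{\partial}{\partial_a}(E_3) = \{a^*\} = \{E_3\}\\
\frac{\partial}{\partial_b}(E_1) = \{(b^* a) \cdot a^*,\; b^* a \cdot a^*\} = \{E_4, E_5\} & \frac{\partial}{\partial_b}(E_3) = \emptyset\\
\frac{\partial}{\partial_a}(E_2) = \{a^* b (b^* a) \cdot a^*\} = \{E_2\} & \frac{\partial}{\partial_a}(E_4) = \{a^*\} = \{E_3\}\\
\frac{\partial}{\partial_b}(E_2) = \{(b^* a) \cdot a^*\} = \{E_4\} & \frac{\partial}{\partial_b}(E_4) = \{b^* a \cdot a^*\} = \{E_5\}\\
 & \frac{\partial}{\partial_a}(E_5) = \{a^*\} = \{E_3\}\\
 & \frac{\partial}{\partial_b}(E_5) = \{b^* a \cdot a^*\} = \{E_5\}
\end{array}
\]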

Fig. 2. The Derivated Term NFA of E1
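Definitions 9 and 11 can be illustrated in the same spirit. The Python sketch below handles standard regular expressions only, not the multi-tilde-bar operator of Definition 10. The tuple encoding, the function names and the simplification 1 · G = G are assumptions of this sketch; a partial derivative is a Python `set` of expressions, so the union of Definition 9 is plain set union.

```python
ZERO, ONE = ("0",), ("1",)


def sym(a):
    return ("sym", a)


def plus(f, g):
    return ("+", f, g)


def star(f):
    return ("*", f)


def cat(f, g):
    # Assumed simplification: 1.G = G.
    return g if f == ONE else (".", f, g)


def nullable(e):
    """True when the empty word belongs to L(e)."""
    tag = e[0]
    if tag in ("0", "sym"):
        return False
    if tag in ("1", "*"):
        return True
    if tag == "+":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])  # catenation


def partial(e, a):
    """Set of derivated terms of e w.r.t. the symbol a (Definition 9)."""
    tag = e[0]
    if tag in ("0", "1"):
        return set()
    if tag == "sym":
        return {ONE} if e[1] == a else set()
    if tag == "+":
        return partial(e[1], a) | partial(e[2], a)
    if tag == "*":
        return {cat(t, e) for t in partial(e[1], a)}
    f, g = e[1], e[2]
    left = {cat(t, g) for t in partial(f, a)}
    return (left | partial(g, a)) if nullable(f) else left


def derivated_term_nfa(e, alphabet):
    """Worklist construction of the automaton of Definition 11."""
    states, delta, todo = {e}, {}, [e]
    while todo:
        q = todo.pop()
        for a in alphabet:
            targets = partial(q, a)
            delta[q, a] = targets
            for r in targets - states:
                states.add(r)
                todo.append(r)
    finals = {q for q in states if nullable(q)}
    return states, delta, finals


def accepts(nfa, start, word):
    _, delta, finals = nfa
    current = {start}
    for a in word:
        current = set().union(*(delta[q, a] for q in current))
    return bool(current & finals)


# The expression (a+b)*.(a.b), which contains four symbol occurrences.
E = cat(star(plus(sym("a"), sym("b"))), cat(sym("a"), sym("b")))
nfa = derivated_term_nfa(E, "ab")
```

For this expression the states are E itself, b and 1, which is consistent with a bound of the form given in Proposition 5 if |E| is read as the number of symbol occurrences (an assumption of this note).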

6

Conclusion

We have shown how Brzozowski derivation and Antimirov derivation can be applied to the case of (simple) regular expressions extended to multi-tilde-bar operators. The computation of the c-continuations for such expressions has already been investigated, although it is not presented here. The main interest of c-continuations is that they allow us to efficiently implement Glushkov and Antimirov NFAs. We also intend to generalize these derivations to the case of unrestricted regular expressions extended to multi-tilde-bar operators.

References

1. Almeida, M., Moreira, N., Reis, R.: Antimirov and Mosses’s rewrite system revisited. Int. J. Found. Comput. Sci. 20(4), 669–684 (2009)
2. Antimirov, V.: Partial derivatives of regular expressions and finite automaton constructions. Theoret. Comput. Sci. 155, 291–319 (1996)
3. Antimirov, V.M., Mosses, P.D.: Rewriting extended regular expressions. Theoret. Comput. Sci. 143(1), 51–72 (1995)
4. Berry, G., Sethi, R.: From regular expressions to deterministic automata. Theoret. Comput. Sci. 48(1), 117–126 (1986)
5. Brzozowski, J.A.: Derivatives of regular expressions. J. Assoc. Comput. Mach. 11(4), 481–494 (1964)
6. Brzozowski, J.A.: Quotient complexity of regular languages. Journal of Automata, Languages and Combinatorics 15(1/2), 71–89 (2010)
7. Brzozowski, J.A., Leiss, E.L.: On equations for regular languages, finite automata, and sequential networks. Theoret. Comput. Sci. 10, 19–35 (1980)
8. Caron, P., Champarnaud, J.-M., Mignot, L.: Erratum to “acyclic automata and small expressions using multi-tilde-bar operators” [Theoret. Comput. Sci. 411(38-39), 3423–3435 (2010)]. Theoret. Comput. Sci. 412(29), 3795–3796 (2011)
9. Caron, P., Champarnaud, J.-M., Mignot, L.: Multi-bar and multi-tilde regular operators. Journal of Automata, Languages and Combinatorics 16(1), 11–26 (2011)
10. Caron, P., Champarnaud, J.-M., Mignot, L.: Partial Derivatives of an Extended Regular Expression. In: Dediu, A.-H., Inenaga, S., Martín-Vide, C. (eds.) LATA 2011. LNCS, vol. 6638, pp. 179–191. Springer, Heidelberg (2011)
11. Caron, P., Champarnaud, J.-M., Mignot, L.: A general frame for the derivation of regular expressions (submitted, 2012)
12. Champarnaud, J.-M., Jeanne, H., Mignot, L.: Approximate Regular Expressions and Their Derivatives. In: Dediu, A.-H., Martín-Vide, C. (eds.) LATA 2012. LNCS, vol. 7183, pp. 179–191. Springer, Heidelberg (2012)
13. Champarnaud, J.-M., Ouardi, F., Ziadi, D.: An efficient computation of the equation K-automaton of a regular K-expression. Fundam. Inform. 90(1-2), 1–16 (2009)
14. Champarnaud, J.-M., Ziadi, D.: Canonical derivatives, partial derivatives, and finite automaton constructions. Theoret. Comput. Sci. 239(1), 137–163 (2002)
15. Conway, J.H.: Regular algebra and finite machines. Chapman and Hall (1971)
16. Frishert, M.: FIRE Works & FIRE Station: A finite automata and regular expression playground. Ph.D. thesis, Eindhoven University, Netherlands (2005)
17. Ginzburg, A.: A procedure for checking equality of regular expressions. J. ACM 14(2), 355–362 (1967)
18. Ilie, L., Yu, S.: Follow automata. Inf. Comput. 186(1), 140–162 (2003)
19. Kleene, S.: Representation of events in nerve nets and finite automata. Automata Studies, Ann. Math. Studies 34, 3–41 (1956)
20. Krob, D.: Differentiation of K-rational expressions. Internat. J. Algebra Comput. 2(1), 57–87 (1992)
21. Lombardy, S., Sakarovitch, J.: Derivatives of rational expressions with multiplicity. Theoret. Comput. Sci. 332(1-3), 141–177 (2005)
22. Owens, S., Reppy, J.H., Turon, A.: Regular-expression derivatives re-examined. J. Funct. Program. 19(2), 173–190 (2009)
23. Sulzmann, M., Lu, K.: Partial derivative regular expression pattern matching (December 2007) (manuscript)
24. Yu, S.: Regular languages. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages. Word, Language, Grammar, vol. I, pp. 41–110. Springer, Berlin (1997)