On Strongly Context-Free Languages
Gheorghe Paun
Institute of Mathematics of the Romanian Academy PO Box 1-764, 70700 Bucuresti, Romania
Grzegorz Rozenberg
Department of Computer Science Leiden University PO Box 9512, 2300 RA Leiden, The Netherlands
Arto Salomaa
Turku Centre for Computer Science
TUCS Technical Report No 205, October 1998, ISBN 952-12-0293-9, ISSN 1239-1891
Abstract
Starting from the question (inspired by the so-called computing by carving, from the DNA-based computing area) "which is easier to generate, a language or its complement?", we briefly investigate the context-free languages whose complements are also context-free. We call them strongly context-free languages. After examining the closure properties of the family of strongly context-free languages, we prove that there are such languages of bounded complexity, in terms of the number of nonterminals or productions necessary to generate them, whereas the complexity of their complements is arbitrarily large.
Keywords: Computing by carving, Chomsky hierarchy, Complementation, Descriptional Complexity
TUCS Research Group
Mathematical Structures of Computer Science
1 Introduction
Most of the experiments reported so far in the DNA-based computing area (see [1], [10], etc.) develop along two main phases: (1) generate a (large) set of candidate solutions to the problem, and (2) remove the non-solutions by an iterated filtering procedure. This suggests considering in a systematic way this strategy of computing, not very usual in (theoretical) computer science, called computing by carving in [13]; see also [9].

In formal language theory terms, this idea leads to several interesting questions. For instance, computing by carving can mean that when we need to produce the strings of a language $L$ we first produce a superlanguage $M \supseteq L$, and then "chop" from $M$ the strings which are not in $L$. The parts of $M - L$ to be removed can be identified, for instance, by finite means: start from an initial finite set of strings and iterate on them a finite state sequential transducer (a gsm). It is shown in [13] that in this way we can compute exactly the complements of recursively enumerable languages. Moreover, sequential transducers with three states are enough; see [11] for precise definitions and for the proof of this result.

We continue here this direction of research with a particular case: computing context-free languages by carving. More specifically, we consider the following problem: given a context-free language $L$ over an alphabet $V$ whose complement is also context-free (such a language is called strongly context-free), which is easier to generate, $L$ or $V^* - L$? ("Easier" here is understood in terms of descriptional complexity, [4].)
We solve the problem for the measures $Var$ (the number of variables) and $Prod$ (the number of productions) in a way which indicates that "computing by carving" can be quite useful in this framework: there are context-free languages which need arbitrarily many nonterminals or arbitrarily many rules to generate, but whose complements (also context-free languages) can be generated by grammars with a number of nonterminals, respectively rules, bounded by a constant given in advance. For other measures of descriptional complexity of context-free languages (for instance, $Symb$ = the total number of symbols appearing in the rules of a grammar) the problem remains open.

We also investigate the closure properties of the family of strongly context-free languages: it is closed under complementation, intersection with regular sets, inverse morphisms, left and right derivatives, and mirror image, but not under union, intersection, concatenation, morphisms, Kleene +, or left and right quotients by finite languages.
2 Strongly Context-free Languages
For an alphabet $V$, we denote by $V^*$ the free monoid generated by $V$; $\lambda$ is the empty string and $|x|$ is the length of $x \in V^*$. The mirror image of a string $x \in V^*$ is denoted by $mi(x)$. If $x \in V^*$ and $L_1, L_2 \subseteq V^*$, then the left derivative of $L_1$ with respect to $x$ is $\partial^l_x(L_1) = \{y \in V^* \mid xy \in L_1\}$, while the left quotient of $L_1$ with respect to $L_2$ is $L_2 \backslash L_1 = \{y \in V^* \mid xy \in L_1 \text{ for some } x \in L_2\}$. The right derivative and quotient are defined in the symmetric way. For a language $L \subseteq V^*$ we denote by $alph(L)$ the set of symbols appearing in the strings of $L$.

A language $L \subseteq V^*$ is said to be:
- thin if $card(L \cap V^n) \le 1$ for all $n \ge 1$,
- slender if there is a constant $k$ such that $card(L \cap V^n) \le k$ for all $n \ge 1$,
- $f$-slender, for a mapping $f : \mathbf{N} \to \mathbf{N}$, if $card(L \cap V^n) \le f(n)$ for all $n \ge 1$,
- co-thin, co-slender, co-$f$-slender if $V^* - L$ is thin, slender, $f$-slender, respectively.
(See [15], [6], [8], etc., for results about such languages.)

By $REG$, $LIN$, $CF$, $CS$, $REC$, $RE$ we denote the families of regular, linear, context-free, context-sensitive, recursive, and recursively enumerable languages, respectively. By $DCF$ we denote the family of deterministic context-free languages. For further elements of formal language theory we refer to [16].

For an arbitrary family of languages $FL$ we denote by $SFL$ the family $\{L \in FL \mid alph(L)^* - L \in FL\}$; we say that the languages in $SFL$ are of a strong $FL$ type. The following two assertions about families of strongly $FL$ languages are obvious:
- Each family $SFL$ is closed under complementation.
- If $FL$ is closed under complementation, then $SFL = FL$; this is the case for $REG$ and $CS$.

We now compare the families $SFL$, for $FL$ in the Chomsky hierarchy, with each other and with the families in the Chomsky hierarchy. We collect in the next theorem several facts about this topic.

Theorem 1.
(i) $REG = SREG \subset SLIN \subset LIN \subset CF \subset CS = SCS \subset REC = SRE \subset RE$.
(ii) $SLIN \subset SCF \subset CF$.
(iii) $LIN$ and $SCF$ are incomparable.
(iv) $DCF \subset SCF$.
(v) $SCF$ contains inherently ambiguous context-free languages, but not all unambiguous context-free languages are in $SCF$.
(vi) There are co-thin context-free languages and linear-slender context-free languages which are not in $SCF$.
Proof. All inclusions are obvious. The strictness and the incomparability follow from the following assertions.

The Dyck language (over any number of symbol pairs) is deterministic context-free, but not linear. Observe that, for the Dyck language $D$ over the alphabet $V = \{a, b\}$, we have $V^* - D = D\{b\}V^* \cup V^*\{a\}D$.

The language generated by the (minimal) linear grammar
$$G = (\{S\}, \{a, b, c, d\}, S, \{S \to aSa,\ S \to bSb,\ S \to dSa,\ S \to dSb,\ S \to dS,\ S \to c\})$$
is not in $SCF$ (its complement is not context-free: Problem 7, page 206, [5]).

The language $L_1 = \{a^n b^n \mid n \ge 1\} \cup \{a^n b^{2n} \mid n \ge 1\}$ is linear and strongly context-free, but not deterministic context-free. (It is also slender.)

Moreover, the language
$$L_2 = \{a^p b^q c^r d^s e^t \mid (p = q \text{ and } r = s) \text{ or } (q = r \text{ and } s = t),\ p, q, r, s, t \ge 1\}$$
is inherently ambiguous and strongly context-free (Problem 5, page 246, [5]), while there are unambiguous context-free languages whose complement is not context-free (Problem 6, page 246, [5]).

For $V = \{a, b, c\}$, consider the morphism $h : V^* \to V^*$ defined by $h(a) = abc$, $h(b) = b$, $h(c) = c^2$, and the infinite string generated by the D0L system $G = (V, a, h)$ (also used in [7]),
$$w = abcbc^2bc^4bc^8 \ldots bc^{2^i}b \ldots$$
Consider the language $Pref(w)$ of all prefixes of $w$. Clearly, this language is thin and not context-free. According to [2], the complement of this language is context-free.

In order to complete the proof of (vi), we consider the language
$$L_3 = \{a^n b^n c^m \mid n, m \ge 1\} \cup \{a^n b^m c^m \mid n, m \ge 1\}.$$
It is easy to see that
$$card(L_3 \cap \{a, b, c\}^n) = \begin{cases} 0, & \text{if } n = 1, 2, \\ n - 1, & \text{if } n \text{ is odd},\ n \ge 3, \\ n - 2, & \text{if } n \text{ is even},\ n \ge 4, \end{cases}$$
less one when $n$ is a multiple of 3, because the string $a^{n/3} b^{n/3} c^{n/3}$ belongs to both components of the union. Therefore, $L_3$ is $f$-slender, for $f(n) = n - 1$ (it is linear-slender).

The first assertion in point (v) can be formulated in a stronger form: there are inherently ambiguous languages in $SCF$ whose complement is also inherently ambiguous [12]. We do not know whether or not there are unambiguous strongly context-free languages whose complement is inherently ambiguous.
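The decomposition of the complement of the Dyck language used in the proof can be checked by brute force on short strings. The following is only a sketch under stated assumptions: `a` is read as the opening and `b` as the closing symbol, the empty string counts as a Dyck word, and the function names are illustrative, not taken from the paper.

```python
from itertools import product

def is_dyck(w):
    """True when w is balanced, reading 'a' as open and 'b' as close."""
    depth = 0
    for ch in w:
        depth += 1 if ch == "a" else -1
        if depth < 0:
            return False
    return depth == 0

def in_decomposition(w):
    """True when w lies in D{b}V* or in V*{a}D."""
    for i, ch in enumerate(w):
        if ch == "b" and is_dyck(w[:i]):       # w = u b v with u in D
            return True
        if ch == "a" and is_dyck(w[i + 1:]):   # w = u a v with v in D
            return True
    return False

def first_counterexample(max_len):
    """Return a string on which the identity fails, or None if there is none."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            w = "".join(letters)
            # The identity says: w is outside D exactly when it decomposes.
            if is_dyck(w) == in_decomposition(w):
                return w
    return None
```

Calling `first_counterexample(10)` scans all strings of length at most 10; a result of `None` means that the identity holds on that range, which is of course evidence for the identity and not a proof.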
Open problem 1: Are there slender context-free languages which are not strongly context-free? (Conjecture: no.)

Let us also mention that it is not decidable whether or not a context-free language is strongly context-free; see Theorem 2.4 in Chapter VIII of [18].
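The count of strings of each length in the linear-slender language from the proof of Theorem 1(vi), the union of $\{a^n b^n c^m\}$ and $\{a^n b^m c^m\}$, can be checked by enumeration. The sketch below is illustrative only (the function names are hypothetical). It compares the enumeration with the expression $2\lfloor (n-1)/2 \rfloor$, reduced by one when 3 divides $n$ because $a^k b^k c^k$ lies in both components, and confirms the bound $f(n) = n - 1$.

```python
def count_by_enumeration(n):
    """Number of strings of length n in {a^k b^k c^m} U {a^k b^m c^m}, k, m >= 1."""
    words = set()
    for k in range(1, n):
        for m in range(1, n):
            if 2 * k + m == n:
                words.add("a" * k + "b" * k + "c" * m)
            if k + 2 * m == n:
                words.add("a" * k + "b" * m + "c" * m)
    return len(words)

def count_by_formula(n):
    """Each component contributes floor((n-1)/2) strings; a^k b^k c^k is shared."""
    shared = 1 if n % 3 == 0 else 0
    return 2 * ((n - 1) // 2) - shared
```

For lengths that are not multiples of 3 the formula gives $n - 1$ for odd $n \ge 3$ and $n - 2$ for even $n \ge 4$.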
3 Closure Properties of SCF and SLIN
From the previous remarks, we see that the families of interest for further study are $SCF$ and $SLIN$. We start by investigating their closure properties.

Theorem 2. The families $SCF$, $SLIN$ are closed under intersection with regular languages, inverse morphisms, mirror image, and left and right derivatives, but they are not closed under union, intersection, concatenation, morphisms, Kleene +, or left or right quotients by finite languages.

Proof. The closure under intersection with regular languages holds for all families $SFL$ for which $FL$ is closed under union with regular languages and intersection with regular languages.

Let $h : V_1^* \to V_2^*$ be a morphism and $L \subseteq V_2^*$ a language from $SCF$. That is, $L \in CF$ and $alph(L)^* - L \in CF$; therefore, also $h^{-1}(L) \in CF$. Because
$$V_2^* - L = (V_2^* - alph(L)^*) \cup (alph(L)^* - L),$$
we also have $V_2^* - L \in CF$ (the language $V_2^* - alph(L)^*$ is regular). Consider the language $alph(h^{-1}(L))^* - h^{-1}(L)$. It is easy to see that we have the equation
$$alph(h^{-1}(L))^* - h^{-1}(L) = (V_1^* - h^{-1}(L)) - (V_1^* - alph(h^{-1}(L))^*).$$
We also have
$$V_1^* - h^{-1}(L) = h^{-1}(V_2^* - L).$$
Because $V_2^* - L \in CF$, we have $V_1^* - h^{-1}(L) \in CF$. As $V_1^* - alph(h^{-1}(L))^*$ is regular, it follows that $alph(h^{-1}(L))^* - h^{-1}(L)$ is context-free. Consequently, $h^{-1}(L) \in SCF$.

The mirror image and the complementation commute: for all $x \in V^*$ and $L \subseteq V^*$ we have $x \in mi(L) \Leftrightarrow mi(x) \in L$, or, equivalently, $x \notin mi(L) \Leftrightarrow mi(x) \notin L$, that is, $x \in alph(L)^* - mi(L) \Leftrightarrow mi(x) \in alph(L)^* - L$. Therefore,
$$x \in alph(L)^* - mi(L) \Leftrightarrow mi(x) \in alph(L)^* - L \Leftrightarrow x \in mi(alph(L)^* - L),$$
showing that $alph(L)^* - mi(L) = mi(alph(L)^* - L)$. If $L \in SCF$, then $mi(L) \in CF$. Also $alph(L)^* - L$ is context-free and, thus, $mi(alph(L)^* - L)$ is context-free, proving that $mi(L) \in SCF$. The argument is similar for $L$ a language in $SLIN$.
In view of the closure under mirror image, it is enough to discuss left derivatives (and quotients). For all $a \in V$, $w \in V^*$, $L \subseteq V^*$ we have $\partial^l_{aw}(L) = \partial^l_w(\partial^l_a(L))$; thus, it suffices to consider (left) derivatives with respect to letters.

Assume $L \in SCF$, $alph(L) = V$, and consider $\partial^l_a(L)$, for $a \in V$. Also $L' = L \cap \{a\}V^* = \{a\}L_1$ is in $SCF$, for $L_1 = \partial^l_a(L) = \partial^l_a(L')$. Now, for all $x \in V^*$, $x \in L_1 \Leftrightarrow ax \in L'$, which we can write in the form $x \in V^* - L_1 \Leftrightarrow ax \in V^* - L'$. On the other hand, by the definition of the derivative, $ax \in V^* - L' \Leftrightarrow x \in \partial^l_a(V^* - L')$. This shows that
$$V^* - L_1 = V^* - \partial^l_a(L') = \partial^l_a(V^* - L').$$
We know that $V^* - L' \in CF$, whence $\partial^l_a(V^* - L') \in CF$, showing that $\partial^l_a(L') = \partial^l_a(L)$ is in $SCF$.
The proof is the same for the linear case.

For union and intersection we consider the languages
$$L_1 = \{a^n b^n a^m \mid n, m \ge 1\}, \qquad L_2 = \{a^n b^m a^m \mid n, m \ge 1\}.$$
They are both strongly linear, but their intersection is not context-free, hence not strongly context-free. From the closure under complementation we also obtain the non-closure under union.

If we take $L'_i = L_i \cup \{\lambda\}$, $i = 1, 2$, then $L'_1, L'_2 \in SLIN$, but $L'_1 L'_2 \notin SCF$: we have
$$(\{a, b\}^* - L'_1 L'_2) \cap a^+ b^+ a^+ = \{a^n b^m a^p \mid n, m, p \ge 1,\ n \ne m,\ m \ne p\}$$
(a string in $a^+ b^+ a^+$ which is not in $L'_1$ or in $L'_2$ should be of the form $a^n b^m a^p$ with both $n \ne m$ and $m \ne p$). Denote the language $\{a^n b^m a^p \mid n, m, p \ge 1,\ n \ne m,\ m \ne p\}$ by $L$. $L$ is not a context-free language. In order to prove this assertion, we can use the Ogden lemma. Let $r$ be the constant in the lemma (every string $z \in L$ with at least $r$ marked positions can be written as $z = uvwxy$, where $vwx$ contains at most $r$ marked positions and $vx$ contains at least one, such that $z_i = u v^i w x^i y \in L$ for all $i \ge 0$). Consider a string of the form $z = a^n b^m a^n$ for any $m \ge r$ and $n = m! + m$. We mark the occurrences of $b$. In any decomposition $z = uvwxy$ we must have one of $v, x$ containing, hence composed of, symbols $b$ and the other one consisting of symbols $a$ or $b$. Assume that we have $v = a^s$, $x = b^t$. The other cases can be treated in a similar way. We have $1 \le t \le m$. For $i = m!/t + 1$ (note that this is an integer) we get
$$z_i = a^{m! + m - s + s(m!/t + 1)}\, b^{m - t + t(m!/t + 1)}\, a^{m! + m} = a^{m! + m + s \cdot m!/t}\, b^{m! + m}\, a^{m! + m},$$
which is not in the language $L$, a contradiction. Therefore, $L'_1 L'_2$ is not strongly context-free, which implies the non-closure under concatenation.

For the language
$$L_3 = \{a^n b^n c^m \mid n, m \ge 1\} \cup \{a^n b^m a^m \mid n, m \ge 1\},$$
which is strongly linear, and the morphism $h$ which maps $a$ and $c$ to $a$, and $b$ to $b$, we obtain $h(L_3) = L_1 \cup L_2$, which is not in $SCF$. Thus, we obtain the non-closure under morphisms (even codings).

For Kleene + we consider the language
$$L_4 = \{a^n b^n a \mid n \ge 1\} \cup \{a b^n a^n \mid n \ge 1\} \cup \{a\}.$$
It is strongly linear. The language $L_4^+$ is context-free, but not strongly context-free, because its complement is not context-free: one can see that
$$(\{a, b\}^* - L_4^+) \cap a^+ b^+ a^+ = \{a^n b^m a^p \mid m > n,\ m > p,\ n, p \ge 1\}.$$
(The strings in $\{a, b\}^*$ which are of the form $a^n b^m a^p$ but are not in $L_4^+$ are neither in $a^*\{a^n b^n a \mid n \ge 1\}a^*$ nor in $a^*\{a b^n a^n \mid n \ge 1\}a^*$, and this implies the relations between $n, m, p$ mentioned above; note that by concatenating strings in $\{a^n b^n a \mid n \ge 1\}$ and $\{a b^n a^n \mid n \ge 1\}$ we do not obtain strings of the form $a^n b^m a^p$.)

The language $L_5 = \{a^n b^m a^p \mid m > n,\ m > p,\ n, p \ge 1\}$ is not context-free. Assume the contrary and take a reduced (all nonterminals are used in a terminal derivation) context-free grammar $G$ for $L_5$. Because of the form of the strings in $L_5$, all recursive derivations in $G$ must be of the form $A \Rightarrow^* w_1 A w_2$, $B \Rightarrow^* z_1 B z_2$ with $w_1 \Rightarrow^* a^i$, $w_2 \Rightarrow^* b^j$ and $z_1 \Rightarrow^* b^k$, $z_2 \Rightarrow^* a^l$, with $i \le j$ and $k \ge l$. Thus, there is a constant $q$ (given by the nonrecursive derivations in $G$; their number is finite) such that all strings in $L(G)$ are of the form $a^n b^m a^p$ with $m + q \ge n + p$. However, all strings of the form $a^n b^{n+1} a^n$ are in $L_5$. For $n > q + 1$ we have $n + 1 + q < n + n$. Therefore, the strings of this type cannot be generated by $G$, a fact which contradicts the equality $L(G) = L_5$. Consequently, $L_4^+$ is not strongly context-free.

For the left quotient we note that we have
$$\{c, d\} \backslash (\{c\}L_1 \cup \{d\}L_2) = L_1 \cup L_2,$$
which is not strongly context-free although $\{c\}L_1 \cup \{d\}L_2$ is strongly linear.
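The arithmetic of the Ogden lemma step in the non-closure proof for concatenation can be replayed numerically. The sketch below assumes the case treated in the proof, where the pumped factors are $a^s$ (inside the first block of $a$'s) and $b^t$; the function names are hypothetical. It computes the three exponents of the pumped string for $i = m!/t + 1$ and shows that the number of $b$'s then equals the length of the last block.

```python
from math import factorial

def in_L(n, m, p):
    """Membership of a^n b^m a^p in {a^n b^m a^p | n, m, p >= 1, n != m, m != p}."""
    return n >= 1 and m >= 1 and p >= 1 and n != m and m != p

def pumped_exponents(m, s, t):
    """Exponents of z_i for z = a^n b^m a^n, n = m! + m, pumping a^s and b^t."""
    n = factorial(m) + m
    if not (1 <= t <= m and 1 <= s <= n):
        raise ValueError("need 1 <= t <= m and 1 <= s <= n")
    quotient, remainder = divmod(factorial(m), t)
    assert remainder == 0          # t <= m, so t divides m!
    i = quotient + 1
    first_block = n - s + s * i    # equals m! + m + s * m!/t
    b_block = m - t + t * i        # equals m! + m
    return first_block, b_block, n
```

For example, `pumped_exponents(3, 2, 3)` describes pumping $z = a^9 b^3 a^9$ with $i = 3$; the middle and last exponents coincide, so the pumped string leaves the language.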
It is instructive to compare the closure under derivative with the non-closure result concerning the quotient by $\{c, d\}$, from the point of view of pushdown automata. For the former result, we have to construct a pushdown automaton for the language $L_1$ when we know an automaton $A$ for $\{a\}L_1$. This we can do: the new automaton $A'$ first reads the empty word and makes a move possible for $A$ when reading $a$. After that, $A'$ just simulates $A$. Such a construction is not possible if we have to go from an automaton $A$ for $\{c\}L_1 \cup \{d\}L_2$ to an automaton $A'$ for $L_1 \cup L_2$. There is no way for $A'$ to distinguish the processing of $L_1$ from that of $L_2$, because $A'$ has to enter the two procedures by reading just the empty word.
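The description of the strings of the form $a^n b^m a^p$ that fall outside the Kleene + closure used in the proof of Theorem 2 (exactly those with $m > n$ and $m > p$) can be tested on small exponents. This is a sketch with hypothetical helper names: membership in the base language $\{a^n b^n a\} \cup \{a b^n a^n\} \cup \{a\}$ is decided with a regular expression plus length comparisons, and membership in its Kleene + closure by dynamic programming over prefixes.

```python
import re

def in_base(w):
    """Membership in {a^n b^n a | n >= 1} U {a b^n a^n | n >= 1} U {a}."""
    if w == "a":
        return True
    match = re.fullmatch(r"(a+)(b+)(a+)", w)
    if match is None:
        return False
    x, y, z = (len(group) for group in match.groups())
    return (x == y and z == 1) or (x == 1 and y == z)

def in_plus(w):
    """Membership in the Kleene + closure of the base language."""
    n = len(w)
    # reachable[i] is True when w[:i] is a concatenation of base strings.
    reachable = [False] * (n + 1)
    reachable[0] = True
    for i in range(1, n + 1):
        reachable[i] = any(reachable[j] and in_base(w[j:i]) for j in range(i))
    return n > 0 and reachable[n]

def outside_plus(n, m, p):
    """True when a^n b^m a^p is not in the Kleene + closure."""
    return not in_plus("a" * n + "b" * m + "a" * p)
```

For instance, `outside_plus(1, 2, 1)` is `True` (the string `abba`), while `outside_plus(2, 1, 1)` is `False`, since `aaba` factors as `a` followed by `aba`.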
4 Descriptional Complexity
We now address the main problem which motivates the interest in strongly context-free languages: the comparison of the complexity of a language and of its complement. Our main result, Theorem 3 below, was already described in the Introduction.

For a context-free grammar $G = (N, T, S, P)$ we denote by $Var(G)$, $Prod(G)$ the cardinalities of $N$ and $P$, respectively, and we also define
$$Symb(G) = \sum_{A \to x \in P} (|x| + 2).$$
For $L \in CF$ and $K \in \{Var, Prod, Symb\}$ we define
$$K(L) = \min\{K(G) \mid L = L(G),\ G \text{ a context-free grammar}\}.$$
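The three grammar measures are straightforward to compute for a concrete grammar. The sketch below uses a hypothetical `Grammar` record (not notation from the paper), with each production stored as a pair of a left-hand side and a tuple of right-hand-side symbols; the toy grammar $S \to aSb$, $S \to \lambda$ is only an illustration. Note that these functions measure a grammar; the measure of a language is the minimum over all grammars generating it, which the code does not attempt to compute.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    nonterminals: frozenset
    productions: tuple   # pairs (lhs, rhs), rhs being a tuple of symbols

def var(g):
    """Var(G): the number of nonterminals."""
    return len(g.nonterminals)

def prod(g):
    """Prod(G): the number of productions."""
    return len(g.productions)

def symb(g):
    """Symb(G): the sum of |x| + 2 over all productions A -> x."""
    return sum(len(rhs) + 2 for _, rhs in g.productions)

# Toy grammar for {a^n b^n | n >= 0}: S -> aSb, S -> lambda.
toy = Grammar(
    nonterminals=frozenset({"S"}),
    productions=(("S", ("a", "S", "b")), ("S", ())),
)
```

Here `var(toy)` is 1, `prod(toy)` is 2, and `symb(toy)` is (3 + 2) + (0 + 2) = 7.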
Theorem 3. For all $n \ge 1$, there is a strongly linear language $L_n$ (resp. $L'_n$) over $\{a, b\}$ with $Prod(L_n) \ge n$ and $Prod(\{a, b\}^* - L_n) \le 13$ (resp. $Var(L'_n) \ge n$ and $Var(\{a, b\}^* - L'_n) \le 3$).

Proof. For the measure $Prod$, we consider the language
$$L_m = \{a^i b^i \mid 0 \le i \le m\}.$$
The fact that $L_m \in SLIN$ is obvious. Let $G = (N, \{a, b\}, S, P)$ be a context-free grammar generating the language $L_m$. If there is a derivation $S \Rightarrow^* uAvBw \Rightarrow^* a^i b^i$ such that there are terminal derivations $A \Rightarrow^* x_1$, $A \Rightarrow^* x_2$, $B \Rightarrow^* y_1$, $B \Rightarrow^* y_2$, with $x_1 \ne x_2$ and $y_1 \ne y_2$, then at least one of the strings which can be generated by finishing the derivations
$$S \Rightarrow^* u x_1 v y_1 w, \quad S \Rightarrow^* u x_1 v y_2 w, \quad S \Rightarrow^* u x_2 v y_1 w, \quad S \Rightarrow^* u x_2 v y_2 w$$
is not in $L_m$. (An easy examination of cases can be done, depending on the form of $x_1, x_2, y_1, y_2$: they can be strings of the forms $a^k$, $b^k$, $a^k b^l$.) Therefore, in each sentential form appearing in a terminal derivation at most one nonterminal will derive two different terminal strings. Thus, all other nonterminal occurrences can be replaced, in all rules where they appear, by the unique terminal string they introduce. This procedure leads to a grammar $G'$ equivalent with $G$, with $Prod(G') \le Prod(G)$, and which is linear. Consequently, the context-free grammar which is minimal with respect to the number of productions and generates the language $L_m$ is linear.

Now, from Lemma 2.3 in [3] we know that if $G$ is a linear grammar with $k$ productions and $L(G)$ is a finite language, then $L(G)$ contains at most $2^{k-1}$ strings. Our language $L_m$ contains $m + 1$ strings, hence we must have
$$Prod(L_m) \ge \log_2(m + 1) + 1.$$
Therefore, there are languages $L_m$ with arbitrarily large $Prod(L_m)$ (it increases when $m$ increases). However, we can write
$$\{a, b\}^* - L_m = \{a, b\}^* ba \{a, b\}^* \cup \{a, b\}^* a^{m+1} \{a, b\}^* \cup \{a, b\}^* b^{m+1} \{a, b\}^* \cup \{a^i b^j \mid i, j \ge 0,\ i \ne j,\ i + j \ge 1\}.$$
This language can be generated by the grammar with the following rules:
$$\begin{array}{lll}
S \to XbaX, & X \to \lambda, & S \to aSb, \\
S \to Xa^{m+1}X, & X \to aX, & S \to A, \quad A \to aA, \quad B \to Bb, \\
S \to Xb^{m+1}X, & X \to bX, & S \to B, \quad A \to a, \quad B \to b.
\end{array}$$
(If, after using several times the rule $S \to aSb$, we introduce the nonterminal $A$, then we get a string of the form $a^i b^j$, $i > j$; if we introduce the nonterminal $B$, then we get a string of the form $a^i b^j$, $j > i$; if we introduce the nonterminal $X$, then we either introduce a substring $ba$ or a substring $a^{m+1}$, $b^{m+1}$; in all cases we get strings not in $L_m$.) Therefore, $Prod(\{a, b\}^* - L_m) \le 13$ and the assertion for $Prod$ is proved.
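The 13-rule grammar for the complement of $L_m$ can be checked on short strings by exhaustively expanding leftmost nonterminals. The sketch below is illustrative only: the helper names are hypothetical, uppercase letters stand for nonterminals, and the search terminates because no sentential form of this particular grammar holds more than two nonterminals (it is not a general-purpose procedure for arbitrary grammars). The same rule table also yields the rule count and the measure $Symb$ defined at the beginning of this section.

```python
from itertools import product

def complement_rules(m):
    """The 13 rules for {a,b}* - L_m; uppercase letters are nonterminals."""
    return {
        "S": ["XbaX", "aSb", "X" + "a" * (m + 1) + "X",
              "X" + "b" * (m + 1) + "X", "A", "B"],
        "X": ["", "aX", "bX"],
        "A": ["aA", "a"],
        "B": ["Bb", "b"],
    }

def language_up_to(rules, start, max_len):
    """All terminal strings of length <= max_len derivable from start."""
    seen, stack, words = {start}, [start], set()
    while stack:
        form = stack.pop()
        pos = next((i for i, ch in enumerate(form) if ch.isupper()), None)
        if pos is None:
            words.add(form)
            continue
        for rhs in rules[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            # Prune forms that already carry too many terminals.
            if sum(ch.islower() for ch in new) <= max_len and new not in seen:
                seen.add(new)
                stack.append(new)
    return words

def expected_complement(m, max_len):
    """Strings over {a,b} of length <= max_len outside {a^i b^i | 0 <= i <= m}."""
    inside = {"a" * i + "b" * i for i in range(m + 1)}
    every = {"".join(t) for n in range(max_len + 1) for t in product("ab", repeat=n)}
    return every - inside

def prod_and_symb(rules):
    """Number of rules and the sum of |x| + 2 over all rules A -> x."""
    rhs_list = [rhs for alternatives in rules.values() for rhs in alternatives]
    return len(rhs_list), sum(len(rhs) + 2 for rhs in rhs_list)
```

For $m = 2$, comparing `language_up_to(complement_rules(2), "S", 6)` with `expected_complement(2, 6)` checks the grammar on all strings of length at most 6, and `prod_and_symb(complement_rules(2))` returns the pair (13, 55).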
Note that $Symb(G)$, where $G$ is the previous grammar for $\{a, b\}^* - L_m$, is equal to $2m + 51$. However, the language $L_m$ can also be generated by a grammar of a comparable size (linear in $m$). Such a grammar is the one containing the following rules:
$$S \to \lambda, \quad S \to aA_1b, \quad A_i \to \lambda, \quad A_i \to aA_{i+1}b, \ \text{for } i = 1, 2, \ldots, m - 2, \quad A_{m-1} \to \lambda, \quad A_{m-1} \to ab,$$
which has $Symb(G) = 7m - 1$.

For the measure $Var$ we consider the language
$$L_n = \bigcup_{i=1}^{n} (a^i b)^+.$$
It is known (see, e.g., [4]) that $Var(L_n) = n + 1$. It is easy to see that the complement of this language can be generated by the grammar with the following productions:
$$\begin{array}{l}
S \to Xa^ibXa^jbX, \ \text{for } 1 \le i, j \le n,\ i \ne j, \\
X \to a^kbX, \ \text{for } 1 \le k \le n, \\
X \to \lambda, \\
S \to \lambda, \quad S \to Ya^{n+1}Y, \quad S \to bY, \quad S \to YbbY, \quad S \to Ya, \\
Y \to aY, \quad Y \to bY, \quad Y \to \lambda.
\end{array}$$
Consequently, $Var(\{a, b\}^* - L_n) \le 3$.

Open problem 2: Are there results such as that in Theorem 3 for $Symb$, or for other measures of descriptional complexity of context-free languages? Of particular interest is the index of context-free languages (the minimal number of nonterminals simultaneously present in a sentential form in a terminal derivation). It is known that there are context-free languages of infinite index [17]. Are there such languages whose complement is context-free and of finite index?
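The three-nonterminal grammar given above for the complement of $L_n = \bigcup_{i=1}^{n} (a^i b)^+$ can be checked on short strings in the same brute-force style. The sketch below is illustrative only: the helper names are hypothetical, uppercase letters are nonterminals, and the search relies on the fact that sentential forms of this grammar never hold more than three nonterminals.

```python
import re
from itertools import product

def complement_rules(n):
    """Rules over S, X, Y for {a,b}* minus the union of (a^i b)^+, 1 <= i <= n."""
    block = lambda i: "a" * i + "b"
    s_rules = ["X" + block(i) + "X" + block(j) + "X"
               for i in range(1, n + 1) for j in range(1, n + 1) if i != j]
    s_rules += ["", "Y" + "a" * (n + 1) + "Y", "bY", "YbbY", "Ya"]
    return {
        "S": s_rules,
        "X": [block(k) + "X" for k in range(1, n + 1)] + [""],
        "Y": ["aY", "bY", ""],
    }

def language_up_to(rules, start, max_len):
    """All terminal strings of length <= max_len derivable from start."""
    seen, stack, words = {start}, [start], set()
    while stack:
        form = stack.pop()
        pos = next((i for i, ch in enumerate(form) if ch.isupper()), None)
        if pos is None:
            words.add(form)
            continue
        for rhs in rules[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            if sum(ch.islower() for ch in new) <= max_len and new not in seen:
                seen.add(new)
                stack.append(new)
    return words

def in_Ln(w, n):
    """Membership in the union of (a^i b)^+ for 1 <= i <= n."""
    return any(re.fullmatch("(?:" + "a" * i + "b)+", w) for i in range(1, n + 1))

def expected_complement(n, max_len):
    every = ("".join(t) for k in range(max_len + 1) for t in product("ab", repeat=k))
    return {w for w in every if not in_Ln(w, n)}
```

For $n = 2$, comparing `language_up_to(complement_rules(2), "S", 7)` with `expected_complement(2, 7)` covers all strings of length at most 7; the rule table has exactly three keys, matching the bound on $Var$.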
Note: Work supported by the Academy of Finland, Project 137358.
References
[1] L. M. Adleman, Molecular computation of solutions to combinatorial problems, Science, 266 (1994), 1021-1024.
[2] J. Berstel, Every iterated morphism yields a co-CFL, Inform. Processing Letters, 22, 1 (1986), 7-9.
[3] W. Bucher, K. Culik II, H. A. Maurer, D. Wotschke, Concise description of finite languages, Theor. Computer Sci., 14 (1981), 227-246.
[4] J. Gruska, Descriptional complexity of context-free languages, Proc. MFCS Symp., High Tatras, 1973, 71-83.
[5] M. A. Harrison, Introduction to Formal Language Theory, Addison-Wesley, Reading, Mass., 1978.
[6] J. Honkala, On slender languages, Bulletin of the EATCS, 64 (1998), 145-152.
[7] J. Honkala, On a problem of G. Paun, Bulletin of the EATCS, 64 (1998), 341.
[8] L. Ilie, On a conjecture about slender context-free languages, Theoretical Computer Sci., 132 (1994), 427-434.
[9] V. Manca, C. Martin-Vide, Gh. Paun, New computing paradigms suggested by DNA computing: Computing by carving, Proc. of the 4th Intern. Meeting on DNA Based Computers (L. Kari, H. Rubin, D. H. Wood, eds), Pennsylvania Univ., June 1998, 41-56.
[10] Q. Ouyang, P. D. Kaplan, S. Liu, A. Libchaber, DNA solution of the maximal clique problem, Science, 278 (1997), 446-449.
[11] V. Manca, C. Martin-Vide, Gh. Paun, Iterated gsm mappings; a collapsing hierarchy, to appear.
[12] Gh. Paun, A positive answer to problem P11, Bulletin of the EATCS, 19 (1983), 9-10.
[13] Gh. Paun, (DNA) Computing by carving, Technical Report TCS-9717, Center for Theoretical Study at Charles Univ. and the Academy of Sciences of the Czech Republic, Prague, 1997.
[14] Gh. Paun, G. Rozenberg, A. Salomaa, DNA Computing. New Computing Paradigms, Springer-Verlag, Berlin, 1998.
[15] Gh. Paun, A. Salomaa, Thin and slender languages, Discrete Appl. Math., 61 (1995), 257-270.
[16] G. Rozenberg, A. Salomaa, eds., Handbook of Formal Languages, 3 volumes, Springer-Verlag, Heidelberg, 1997.
[17] A. Salomaa, On the index of a context-free grammar and language, Inform. Control, 14 (1969), 474-477.
[18] A. Salomaa, Formal Languages, Academic Press, New York, 1973.
Turku Centre for Computer Science Lemminkaisenkatu 14 FIN-20520 Turku Finland http://www.tucs.abo.
University of Turku Department of Mathematical Sciences
Abo Akademi University Department of Computer Science Institute for Advanced Management Systems Research
Turku School of Economics and Business Administration Institute of Information Systems Science