
Iterated GSM Mappings: A Collapsing Hierarchy

Vincenzo MANCA

Università degli Studi di Pisa, Dipartimento di Informatica, Corso Italia 40, 56125 Pisa, Italy

Carlos MARTIN-VIDE

Research Group in Mathematical Linguistics and Language Engineering, Rovira i Virgili University, Pl. Imperial Tarraco 1, 43005 Tarragona, Spain

Gheorghe Paun

Institute of Mathematics of the Romanian Academy, PO Box 1-764, 70700 Bucuresti, Romania

Turku Centre for Computer Science
TUCS Technical Report No 206
October 1998
ISBN 952-12-0294-7
ISSN 1239-1891

Abstract

With motivations from various areas (Lindenmayer systems, iterated reading of literary works, self-generated infinite sequences, "computing by carving" as suggested in the DNA computing area), mechanisms based on iterated (non-deterministic) finite state sequential transducers have been considered in several places. It is known that such mechanisms can characterize the family of recursively enumerable languages. We continue here the study of such devices, investigating the hierarchy on the number of states. We find that this hierarchy collapses: four states are enough in order to characterize the recursively enumerable languages, three states lead to non-recursive languages and cover the ET0L languages, while two states can cover the E0L (hence also the context-free) languages. The case of deterministic transducers remains open.

Keywords: Computing by carving, Chomsky hierarchy, GSM mapping

TUCS Research Group

Mathematical Structures of Computer Science

1 Introduction

Iterated gsm (generalized sequential machine) mappings were already investigated in the seventies (see, e.g., [22], [14], [18]), and the topic is still not exhausted (see, e.g., the recent paper [11]). Because, as is easy to see, such a device can simulate the work of a type-0 Chomsky grammar, in this way we obtain characterizations of the recursively enumerable languages. (Depending on the precise mode of defining the accepted language, we may or may not need an intersection with T*, where T is the alphabet of the desired language, as a final squeezing operation.)

A similar phenomenon, of repeated translations of a text, appears in many areas, especially in linguistics and in literary studies. It is enough to mention the need for iterated improvements of the translation of a text from one language to another (especially when the text is a piece of fiction and the phrase structure in the two languages is different), or the iterated reading of literary works, in the sense advocated in [6] (the reading process applied to poetry creates an infinite openness, leading to an infinite variety of possible interpretations, equivalent to infinitely many possible infinite extensions of the initial finite text). Further discussion about these points and about similar ones can be found in [5]. A similar process of repeated "translation" appears when constructing infinite self-reading sequences, in the sense of [17].

A recent area where the need for iterated gsm's has appeared is DNA computing, in relation with the so-called computing by carving, [15]. (A preliminary version of [15] has appeared as a Technical Report, CTS-9717, of the Center for Theoretical Study at Charles University and the Academy of Sciences of the Czech Republic, Prague, in December 1997.) We will briefly present this idea in the next section. In short, we "compute" a language by identifying strings in its complement and "carving" them from a given superlanguage; the removed strings are obtained by iterating a gsm on a starting regular language. A language can be computed in this manner if and only if it is the complement of a recursively enumerable language. Thus, in this way we also get non-recursively enumerable languages (but we do not cover all recursively enumerable ones).

In this framework it appears as a natural problem to consider simple gsm mappings, for instance, with a bounded number of states. The problem was formulated in [15] whether or not the number of states induces an infinite hierarchy of the languages computed by carving. We solve this problem here. Somewhat surprisingly, this hierarchy collapses at a rather low level: three states suffice.

We place this question in a more general framework, that of languages obtained by an iterated use of a sequential transducer. We find that (non-deterministic) iterated transducers with four states characterize the recursively enumerable languages (the extra state is needed in order to select strings which are terminal with respect to a given alphabet; in the

case of carving this is accomplished in a different way; see details in the sections below). Several connected results are also given (three states suffice in order to obtain non-recursive languages and to cover the ET0L languages, while two states suffice in order to obtain all E0L languages). These results deal with non-deterministic gsm's. The case of iterated deterministic gsm's remains almost completely open. We only know that by iterating such gsm's with two states we can get non-context-free non-0L languages. Whether or not the hierarchy on the number of states is infinite remains to be investigated in this case.

2 Computing by Carving

DNA computing is a new area of research, rapidly emerging and highly promising, which tries to make use of the huge parallelism and of the clear syntactic structure and behavior of DNA molecules. The first practical step in this respect was the experiment reported in [1], where a small instance of the Hamiltonian Path Problem in a graph was solved by biochemical means. Another successful experiment was reported in [13]: finding the size of the maximal clique in a graph. (Note that both these problems are NP-complete and that their biochemical solutions are linear in the number of biochemical steps.) Several other experiments are discussed in [2], [3], [9], [12].

The procedures in [1] and [13] follow the same phases (and similar strategies can be found in the other experiments mentioned above): (1) one first generates a large set of candidate solutions to the problem, (2) one identifies sets of non-solutions, and (3) one removes such non-solutions from the candidate set, repeatedly, until only the solutions remain. In [13] one explicitly speaks about the complete data pool from which, by specific filtering procedures, one removes the non-solutions.

Inspired by these observations and by the (theoretically) easy way of implementing the difference of languages in DNA terms, the general strategy of computing by carving was proposed in [15], proceeding along the three steps mentioned above. In short, in order to construct the set of solutions to a problem, we first construct a superset of it (one which is "easy" to construct) and then we filter it iteratively (by "carving" out sets of non-solutions, also "easy" to construct) until only the solutions of the problem remain. In some sense, the idea is not new: this is exactly the style of the sieve of Eratosthenes for constructing the prime numbers, where the "complete data pool" is the set of natural numbers (which is given "for free" in advance).

For formal language theory, computing by carving leads to a rather new way of identifying a language L: start from a superlanguage of L, say M, large and easy to obtain, then remove from M strings or sets of

strings, iteratively, until obtaining L. Contrast this strategy with the usual "positivistic" grammatical approach, where one produces the strings in L at the end of successful derivations, discarding the "wrong derivations" (derivations not ending with a terminal string), and constantly ignoring/avoiding the complement of L.

In particular, M above can be V*, the total language over the alphabet V of the language L (V* is the language of all strings of symbols in V; the empty string is denoted by λ and the set of all non-empty strings over V is denoted by V+; the length of a string x ∈ V* is denoted by |x|; for the formal language theory notions and results which we use here we refer to [20]). Therefore, as a particular case, we can try to produce V* − L, the complement of L, and to extract it from V*. As one knows, the family of recursively enumerable languages is not closed under complementation. Therefore, by carving we can "compute" non-recursively enumerable languages!

However, identifying a language L by first generating its complement, H = V* − L, and then computing the difference V* − H is quite a rough procedure. A more subtle one, present in fact also in the DNA computing experiments discussed above, is that where L is obtained at the end of several difference operations, possibly at the end of an infinite sequence of such operations: we construct V*, as well as certain languages L_1, L_2, ..., and we iteratively compute V* − L_1, (V* − L_1) − L_2, ..., such that L is obtained in the limit.

Still, this is not completely satisfactory: the languages L_i, i ≥ 1, can be individually simple and yet rather complex as a sequence. For instance, take L_i as consisting of the i-th string, in the lexicographic order, of a given language, which can be of any complexity in the Chomsky hierarchy. Each L_i is a singleton, yet their union could be non-recursively enumerable! Thus, the sequence of languages L_i, i ≥ 1, must be defined in a regular, finite, way. The proposal in [15] is to consider regular sequences of languages:

A sequence L_1, L_2, ... of languages over an alphabet V is called regular if L_1 is a regular language and there is a gsm g such that L_{i+1} = g(L_i), i ≥ 1.

(A gsm is a construct g = (K, V_1, V_2, s_0, F, P), where K is the finite set of states, V_1 is the input alphabet, V_2 is the output alphabet, s_0 ∈ K is the initial state, F ⊆ K is the set of final states, and P is the finite set of transition rules, of the form sa → xs', with s, s' ∈ K, a ∈ V_1, x ∈ V_2*. For u, x ∈ V_2*, a ∈ V_1, v ∈ V_1*, s, s' ∈ K we write usav ⊢ uxs'v if and only if sa → xs' ∈ P. For a string w ∈ V_1* we define g(w) = {z ∈ V_2* | s_0 w ⊢* zs, for some s ∈ F}; the mapping g is extended in the natural way to languages. Of course, in order to be able to iterate the gsm mapping associated with g = (K, V_1, V_2, s_0, F, P) we must have V_2 ⊆ V_1.)
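The gsm mapping just defined is easy to experiment with. The following Python sketch (ours, not from the paper; the rule representation and the helper names gsm_image, gsm_map are assumptions) computes g(w) and g(L) for finite languages, with states and symbols encoded as one-character strings and the rule set P as a dict mapping (state, input symbol) to a set of (output string, next state) pairs.

def gsm_image(word, s0, F, P):
    """g(word): the outputs of all complete runs that end in a final state."""
    results = set()
    configs = [(word, s0, "")]          # (remaining input, state, output so far)
    while configs:
        rest, state, out = configs.pop()
        if not rest:
            if state in F:
                results.add(out)
            continue
        a, tail = rest[0], rest[1:]
        for x, nxt in P.get((state, a), ()):
            configs.append((tail, nxt, out + x))
    return results

def gsm_map(language, s0, F, P):
    """Extend g to a finite language, as in the definition above."""
    out = set()
    for w in language:
        out |= gsm_image(w, s0, F, P)
    return out

# Hypothetical one-state gsm that doubles every 'a'; applied once:
P = {("s0", "a"): {("aa", "s0")}}
print(sorted(gsm_map({"a", "aa"}, "s0", {"s0"}, P)))   # ['aa', 'aaaa']

Since the output alphabet is contained in the input alphabet, this g can be iterated, which is exactly what a regular sequence of languages does.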

Thus, L_1 is a given regular language, and all the other languages are iteratively constructed by applying the gsm g to L_1. In this way, we can identify the sequence by its generating pair (L_1, g). Then, a language L ⊆ V* is said to be C-REG computable if there is a regular sequence of languages, L_1, L_2, ... (that is, a pair (L_1, g) as specified above), and a regular language M ⊆ V* such that L = M − (∪_{i≥1} L_i).

Remark 1. In the definition above of C-REG computable languages we have allowed M to be any regular language. In [15] one also asks that M is a superlanguage of ∪_{i≥1} L_i, which fits better the idea of carving, but raises some technical difficulties (we also need a given regular language L_0 to be removed from M together with the languages L_i, i ≥ 1, in a regular sequence).

We can denote by g^i the i-th iteration of g, i ≥ 0, and then L_{i+1} = g^i(L_1), i ≥ 0 (taking g^0(H) = H for all H ⊆ V*). Denoting by g*(L_1) the union ∪_{i≥0} g^i(L_1), we can write L = M − g*(L_1). We are again back to the complement with respect to M of a single language, g*(L_1), but this language is the union of a regular sequence of languages (the languages in the sequence are defined by finite tools: a regular grammar, or a finite automaton, for the language L_1, and the gsm g).

The following result is proved in [15]:

Theorem 1. A language is C-REG computable if and only if it can be written as the complement of a recursively enumerable language.

Corollary 1. (i) Every context-sensitive language is C-REG computable. (ii) The family of C-REG computable languages is incomparable with the family of recursively enumerable languages.

Proof. Assertion (i) follows from the fact that the family of context-sensitive languages is closed under complementation, see [10], [21]. Assertion (ii) is a consequence of the non-closure under complementation of the family of recursively enumerable languages.

In order to put the previous result in a better perspective, let us denote by Co-RE the family of complements of recursively enumerable languages and by UREG the family of languages of the form g*(L_1), where g is a gsm and L_1 is a regular language. The following necessary condition for a language to be in UREG is proved in [15].

Lemma 1. If (L_1, g) identifies an infinite language L = g*(L_1), then there is a constant k such that for each x ∈ L we can find y ∈ L such that

|x| < |y| ≤ k|x|.

For instance, the language

L = {a^(2^(2^n)) | n ≥ 1}

does not have the property in Lemma 1, hence L ∉ UREG. Therefore, UREG is a proper subfamily of RE. However, from Theorem 1 we have Co-RE = {L_1 − L_2 | L_1 ∈ REG, L_2 ∈ UREG}. Moreover, we have (proofs of this result can be found in [22], [14], [18]):

Theorem 2. Every recursively enumerable language L ⊆ T* can be written in the form L = g*({a_0}) ∩ T*, where g is a gsm (depending on L) and a_0 is a fixed symbol not in T.

Informally speaking, modulo an intersection with a star language, T*, the family UREG is equal to RE.

At the end of [15] one asks whether or not the number of states in a gsm induces an infinite hierarchy of C-REG computable languages. We solve this problem here, by proving that the answer is negative.
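As a toy, bounded illustration of carving (ours, not from the paper; it reuses the hypothetical gsm_map sketch above), take the doubling gsm and L_1 = {"a"}: the regular sequence is L_i = {a^(2^(i-1))}, so g*(L_1) = {a^(2^n) | n ≥ 0}, and carving it out of a finite slice of M = a* leaves the strings a^m with m not a power of two.

P = {("s0", "a"): {("aa", "s0")}}
union, L = set(), {"a"}
for _ in range(6):                        # collect g^0(L1), ..., g^5(L1)
    union |= L
    L = gsm_map(L, "s0", {"s0"}, P)

M = {"a" * m for m in range(1, 20)}       # a finite slice of M = a*
print(sorted(len(w) for w in M - union))  # the lengths that survive the carving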

3 Iterated Finite State Sequential Transducers

We first introduce a general set-up for our investigations, by defining the notion of an iterated finite state sequential transducer (in short, an IFT), as a language generating device. Among the reasons for considering this notion is the fact that, according to Theorem 2, in order to characterize RE by iterated gsm's we can start from a single symbol (not from a regular language, as in the definition of UREG), but we also need an intersection with a language T*. Such an intersection can easily be done by a gsm (for instance, using a special state for this purpose), so we introduce this feature "inside" the gsm work. Because we explicitly provide the starting symbol a_0 in the machinery, we prefer to use a new terminology, speaking of IFT's and not of gsm's of a special form and with a special functioning.

An IFT is a construct γ = (K, V, s_0, a_0, F, P), where K, V are disjoint alphabets (the set of states and the alphabet of γ), s_0 ∈ K (the initial state), a_0 ∈ V (the starting symbol), F ⊆ K (the set of final states), and P is a finite set of transition rules of the form sa → xs', for s, s' ∈ K, a ∈ V, x ∈ V* (in state s, the device reads the symbol a, passes to state s', and produces the string x). For s, s' ∈ K and u, v, x ∈ V*, a ∈ V we define

usav ⊢ uxs'v iff sa → xs' ∈ P.

This is a direct transition step with respect to γ. We denote by ⊢* the reflexive and transitive closure of the relation ⊢. Then, for w, w' ∈ V* we define

w ⇒ w' iff s_0 w ⊢* w's, for some s ∈ K.

We say that w derives w'; note that this means that w' is obtained by translating the string w, starting from the initial state of γ and ending in any state of γ, not necessarily a final one. We denote by ⇒* the reflexive and transitive closure of the relation ⇒. If in the writing above we have s ∈ F (we stop in a final state), then we write ⇒_f instead of ⇒; that is,

w ⇒_f w' iff s_0 w ⊢* w's, for some s ∈ F.

The language generated by γ is

L(γ) = {w ∈ V* | a_0 ⇒* w' ⇒_f w, for some w' ∈ V*}.

Therefore, we iteratively translate the strings obtained by starting from a_0, without caring about the states we reach at the end of each translation, but at the last step we necessarily stop in a final state. This makes it possible to avoid the intersection with a language of the form T*, as in the statement of Theorem 2.

The IFT's as defined above are non-deterministic. If for each pair (s, a) ∈ K × V there is at most one transition sa → xs' in P, then we say that γ is deterministic. We denote by IFT_n, n ≥ 1, the family of languages of the form L(γ), for γ non-deterministic with at most n states; when using deterministic IFT's we write DIFT_n instead of IFT_n. The union of all families IFT_n (respectively, DIFT_n), n ≥ 1, is denoted by IFT (respectively, DIFT).

In the next section we will compare these families with the families in the Chomsky hierarchy, REG, CF, CS, RE (of regular, context-free, context-sensitive, and recursively enumerable languages, respectively), and with families in the Lindenmayer hierarchy, 0L and E0L (of languages generated by interactionless L systems and by extended interactionless L systems, respectively); D0L, DE0L are the families corresponding to 0L, E0L, generated by deterministic L systems. We refer to [19] and [20] for details about these hierarchies.

Convention. When comparing the power of two generative devices, G_1, G_2, the empty string is ignored: G_1 is equivalent to G_2 if and only if L(G_1) − {λ} = L(G_2) − {λ}.
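A bounded simulator makes the definition concrete. The Python sketch below (ours, not from the paper; the helper names translations and ift_language, the bounds steps and max_len, and the use of one-character symbols are all assumptions) enumerates a finite approximation of L(γ) by repeatedly translating the reachable strings and keeping those produced by a translation that stops in a final state.

def translations(word, s0, P):
    """All pairs (output, end state) of complete runs of the transducer on word."""
    results = set()
    configs = [(word, s0, "")]          # (remaining input, state, output so far)
    while configs:
        rest, state, out = configs.pop()
        if not rest:
            results.add((out, state))
            continue
        a, tail = rest[0], rest[1:]
        for x, nxt in P.get((state, a), ()):
            configs.append((tail, nxt, out + x))
    return results

def ift_language(a0, s0, F, P, steps=6, max_len=20):
    """Bounded approximation of L(gamma): strings w with a0 =>* w' =>_f w,
    exploring at most `steps` rounds of translation and strings up to max_len."""
    reachable = {a0}                    # strings w' with a0 =>* w' (any end state)
    accepted = set()
    for _ in range(steps):
        new = set()
        for w in reachable:
            for out, state in translations(w, s0, P):
                if len(out) <= max_len:
                    new.add(out)
                    if state in F:
                        accepted.add(out)
        reachable |= new
    return accepted

# Example: a one-state IFT ('0' stands for a0) with the rules s0 a0 -> a s0 and
# s0 a -> aa s0, and F = {s0}; it generates {a^(2^n) | n >= 0}, a 0L language
# (compare Lemma 3 in the next section).
P0 = {("s0", "0"): {("a", "s0")}, ("s0", "a"): {("aa", "s0")}}
print(sorted(ift_language("0", "s0", {"s0"}, P0, steps=5)))
# -> ['a', 'aa', 'aaaa', 'aaaaaaaa', 'aaaaaaaaaaaaaaaa']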

4 The Hierarchy IFT_n Collapses

The following relations are direct consequences of the definitions:

Lemma 2. (i) IFT_n ⊆ IFT_{n+1}, DIFT_n ⊆ DIFT_{n+1}, n ≥ 1. (ii) DIFT_n ⊆ IFT_n, n ≥ 1, DIFT ⊆ IFT. (iii) IFT ⊆ RE.

Because IFT's with one state are equivalent to finite substitutions (the only state is both initial and final), while deterministic IFT's with only one state are equivalent to morphisms, we have:

Lemma 3. IFT_1 = 0L, DIFT_1 = D0L.

Moreover, we have:

Lemma 4. E0L ⊆ IFT_2.

Proof. Consider an E0L system G = (V, T, w, P) (the total alphabet, the terminal alphabet, the axiom, and the set of rewriting rules, of the form a → x, for a ∈ V, x ∈ V*; for each a ∈ V there is at least one rule a → x ∈ P; one says that P is complete). Let a_0, a_1 be two new symbols. We construct the IFT

γ = ({s_0, s_1}, V ∪ {a_0, a_1}, s_0, a_0, {s_1}, P'),

where

P' = {s_0 a_0 → a_1 w s_0, s_0 a_1 → a_1 s_0} ∪ {s_0 a → x s_0 | a → x ∈ P} ∪ {s_0 a_1 → s_1} ∪ {s_1 a → a s_1 | a ∈ T}.

The first step, s_0 a_0 ⊢ a_1 w s_0, introduces the axiom of G and the left marker a_1. Each subsequent translation s_0 a_1 z ⊢* a_1 z' s_0 precisely corresponds to a derivation step z ⇒ z' in G. The final state s_1 can be reached only by a transition s_0 a_1 z ⊢ s_1 z; the work of γ can be finished only when z ∈ T*. After removing the symbol a_1, the final state s_1 cannot be reached again, hence no further iteration can modify the string. Consequently, L(G) = L(γ).

Corollary 2. CF ⊆ IFT_2.

This corollary is also proved in [18] (Theorem 1.7) for iterated gsm's (using an intersection with T* as a squeezing mechanism), where a characterization of the context-free languages is also obtained: two-state iterated gsm's such that a copying cycle exists for each state generate only context-free languages (a copying cycle is a cycle using a rule of the form sa → as, that is, one leaving the symbol a unchanged). In view of the previous theorem, the copying cycle property is crucial in this characterization.

The IFT γ constructed above is not deterministic, even when starting from a deterministic E0L system G: we have the rules s_0 a_1 → a_1 s_0 and s_0 a_1 → s_1 with the same left hand member. We do not know whether or not DE0L ⊆ DIFT_2, but still we have:

Lemma 5. DIFT_2 − CF ≠ ∅, DIFT_2 − 0L ≠ ∅.

Proof. Consider the DIFT

γ = ({s_0, s_1}, {a_0, a, b}, s_0, a_0, {s_1}, P),

with

P = {s_0 a_0 → aba s_1, s_0 a → a s_0, s_0 b → b s_1, s_1 a → aa s_1}.

We obtain

L(γ) = {aba^(2^n) | n ≥ 0}.
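(As a quick bounded sanity check, ours and not from the paper, the hypothetical ift_language sketch from Section 3, with '0' standing for a_0, reproduces the first members of this language.)

P = {
    ("s0", "0"): {("aba", "s1")},   # s0 a0 -> aba s1
    ("s0", "a"): {("a", "s0")},     # s0 a  -> a s0
    ("s0", "b"): {("b", "s1")},     # s0 b  -> b s1
    ("s1", "a"): {("aa", "s1")},    # s1 a  -> aa s1
}
print(sorted(ift_language("0", "s0", {"s1"}, P, steps=4)))
# -> ['aba', 'abaa', 'abaaaa', 'abaaaaaaaa'], i.e. aba^(2^n) for n = 0, ..., 3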

This is obviously a non-context-free language, and it is also non-0L: in order to generate strings with arbitrarily large suffixes a^(2^n) we need a rule of the form a → a^i, i ≥ 2; using this rule for rewriting the leftmost occurrence of a in any string from L(γ), we get a string not in L(γ).

The hierarchy IFT_n, n ≥ 1, is not an infinite one. This fact is not unexpected: we know from [22], [14], [18] that iterated gsm's characterize the family RE, which can be written as RE ⊆ IFT. Start the proof of this inclusion from a universal type-0 grammar (such a grammar is constructed, for instance, in [4]); in this way we obtain an IFT with a fixed number of states; by slightly modifying it, we can obtain an IFT generating any given recursively enumerable language (at the first step of the work of our IFT we introduce the code of the particular grammar which is used by the universal one in order to simulate the particular grammar; then, our IFT works as the universal grammar, hence it generates the language of the particular grammar). This argument only shows that the hierarchy IFT_n, n ≥ 1, collapses. Using the construction in [4], we get a huge number of states in our "universal" IFT. However, this hierarchy collapses at a surprisingly low number of levels: four.

The hierarchy on the number of states collapses also in the case of iterated gsm's in the form considered in [14], [18], [22]: the previous argument holds for this case as well. In fact, the result is stated explicitly in [18] (Theorem 2.12), but the argument is much more complex: one first gives a normal form for iterated gsm's, one proves that the state-complexity families with respect to such normal form iterated gsm's are full AFL's, and then one uses the fact that the family RE is a principal AFL (see [8]). No estimation of the number of levels of the obtained hierarchy is given in [18]. (The question is not considered in [14], [22].)

Lemma 6. RE = IFT_4.

Proof. We only have to prove the inclusion RE ⊆ IFT_4. Consider a language L ∈ RE, L ⊆ T*, and take a grammar G = (N, T, S, P) in the Geffert normal form generating it, [7]. That is, we have N = {S, A, B, C} and the rules in P are of the following forms: S → x, for x ∈ (N ∪ T)+, and ABC → λ. (There is only one non-context-free rule, the erasing rule ABC → λ.) Consider three new symbols, a_0, a_1, a_2, and construct the IFT

γ = ({s_0, s_1, s_A, s_B}, N ∪ T ∪ {a_0, a_1, a_2}, s_0, a_0, {s_1}, P'),

where

P' = {s_0 a_0 → a_1 S a_2 s_0} ∪ {s_0 α → α s_0 | α ∈ N ∪ T ∪ {a_1, a_2}} ∪ {s_0 S → x s_0 | S → x ∈ P} ∪ {s_0 A → s_A, s_A B → s_B, s_B C → s_0} ∪ {s_0 a_1 → s_1} ∪ {s_1 a → a s_1 | a ∈ T} ∪ {s_1 a_2 → s_1}.

At the first step we introduce the axiom of G, marked to the left with the symbol a_1 and to the right with the symbol a_2. In the subsequent translations, the context-free rules S → x of P are simulated by using the state s_0. The deletion of a substring ABC is performed as follows:

s_0 x_1 ABC x_2 ⊢* x_1 s_0 ABC x_2 ⊢ x_1 s_A BC x_2 ⊢ x_1 s_B C x_2 ⊢ x_1 s_0 x_2 ⊢* x_1 x_2 s_0.

Of course, none or several applications of rules in P, irrespective of whether

they are context-free or non-context-free, can be simulated in the same translation, but this does not change the generated language. We can reach the final state s_1 only by erasing the symbol a_1 in a step s_0 a_1 w ⊢ s_1 w; after that, only terminal symbols can be parsed, hence the string w obtained in this way is a terminal one. After erasing the symbol a_1, the final state cannot be reached again, hence new rules of P cannot be simulated in a successful way. The presence of the symbol a_2 ensures that we do not reach the end of a string in a state s_A or s_B (note that a_2 cannot be scanned in the states s_A, s_B). Consequently, L(γ) = L(G), that is, RE ⊆ IFT_4.

As a particular case, the construction above gives a new proof of the inclusion CF ⊆ IFT_2 (if the rule ABC → λ is missing, then we simulate a context-free grammar and the states s_A, s_B are no longer necessary).

We do not know whether or not the result above is optimal. Anyway, IFT's with three states are very powerful:

Lemma 7. The family IFT_3 contains non-recursive languages.

Proof. Consider two n-tuples of non-empty strings, x = (x_1, ..., x_n), y = (y_1, ..., y_n), over the alphabet {a, b}. Let a', b' be two new symbols associated with a, b, respectively. Denote by h the coding defined by h(a) = a', h(b) = b', and by mi(x) the mirror image of x, for any string x. We construct the IFT

γ = ({s_0, s_1, s_2}, {a, b, a', b', a_0, c, d}, s_0, a_0, {s_0}, P),

with the following rules:

1. s_0 a_0 → c x_i d mi(h(y_i)) c s_0, for 1 ≤ i ≤ n;
2. s_0 α → α s_0, for α ∈ {a, b, a', b', c};
3. s_0 d → x_i d mi(h(y_i)) s_0, for 1 ≤ i ≤ n;
4. s_0 d → s_0;
5. s_0 a → s_1, s_1 a' → s_0;
6. s_0 b → s_2, s_2 b' → s_0.

It is easy to see that after replacing a_0, at the first step, with a string c x_i d mi(h(y_i)) c, for some 1 ≤ i ≤ n, at the next iterations one replaces the occurrence of d by a string of the form x_j d mi(h(y_j)), for 1 ≤ j ≤ n. In this way, we obtain all strings of the form

c x_{i_1} x_{i_2} ... x_{i_r} d mi(h(y_{i_r})) ... mi(h(y_{i_2})) mi(h(y_{i_1})) c,

for some r ≥ 1, 1 ≤ i_j ≤ n, 1 ≤ j ≤ r. Then, after removing the symbol d by the rule of type 4, the following iterations either leave the string unchanged (by using rules of type 2), or delete a pair aa' or bb', via the states s_1, s_2, respectively (by using rules of types 5 and 6). If we arrive in s_1 (after erasing an occurrence of the symbol a) and the next symbol is not a', then the work of γ is blocked (at least the rightmost occurrence of the symbol c remains non-scanned). The same happens if we arrive in s_2 (after deleting an occurrence of the symbol b) and the next symbol is not b'. Therefore, we can erase all symbols in between the two occurrences of the marker c if and only if

h^{-1}(mi(h(y_{i_r})) ... mi(h(y_{i_2})) mi(h(y_{i_1}))) = mi(x_{i_1} x_{i_2} ... x_{i_r}).

This means x_{i_1} x_{i_2} ... x_{i_r} = h^{-1}(mi(mi(h(y_{i_r})) ... mi(h(y_{i_2})) mi(h(y_{i_1})))). Because

mi(mi(h(y_{i_r})) ... mi(h(y_{i_2})) mi(h(y_{i_1}))) = h(y_{i_1}) h(y_{i_2}) ... h(y_{i_r}) = h(y_{i_1} y_{i_2} ... y_{i_r}),

and h is one-to-one, this is equivalent to x_{i_1} x_{i_2} ... x_{i_r} = y_{i_1} y_{i_2} ... y_{i_r}, that is, to the fact that the Post Correspondence Problem for the n-tuples x, y (abbreviated PCP(x, y)) has a solution. Consequently, the string cc belongs to the language L(γ) if and only if PCP(x, y) has a solution. This is undecidable, hence the language L(γ) is not recursive. (An executable sketch of this construction is given below, after Theorem 3.)

Synthesizing the results above, and using the known relations among the Chomsky and Lindenmayer families, we obtain:

Theorem 3. (i) IFT_1 = 0L ⊂ E0L ⊆ IFT_2 ⊆ IFT_3 ⊆ IFT_4 = RE. (ii) CF ⊂ IFT_2. (iii) IFT_3 − CS ≠ ∅.

We do not know which of the non-proper inclusions in this theorem are proper. Also the relationships between the family CS and the families IFT_2, IFT_3 are open. According to the following result, if CS ⊆ IFT_2, then RE ⊆ IFT_3 (which we do not believe to be very plausible).
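The construction in the proof of Lemma 7 can be made concrete with the following Python sketch (ours, not from the paper; it reuses the hypothetical ift_language simulator from Section 3, encodes the primed symbols a', b' as 'A', 'B', and lets '0' stand for a_0). For the solvable toy instance x = (ab, b), y = (a, bb), with the solution 1, 2 (since x_1 x_2 = abb = y_1 y_2), the string cc indeed shows up within the search bound.

def pcp_ift(x, y):
    """Rule table of the three-state IFT of Lemma 7 for the PCP instance (x, y)."""
    h = lambda w: w.upper()                       # the coding h(a) = a', h(b) = b'
    mi = lambda w: w[::-1]                        # mirror image
    P = {("s0", "0"): set(),                      # rules of type 1 (filled below)
         ("s0", "d"): {("", "s0")},               # rule 4 (erase d)
         ("s0", "a"): {("a", "s0"), ("", "s1")},  # rules 2 and 5
         ("s0", "b"): {("b", "s0"), ("", "s2")},  # rules 2 and 6
         ("s0", "A"): {("A", "s0")},              # rule 2
         ("s0", "B"): {("B", "s0")},              # rule 2
         ("s0", "c"): {("c", "s0")},              # rule 2
         ("s1", "A"): {("", "s0")},               # rule 5
         ("s2", "B"): {("", "s0")}}               # rule 6
    for xi, yi in zip(x, y):
        P[("s0", "0")].add(("c" + xi + "d" + mi(h(yi)) + "c", "s0"))   # rule 1
        P[("s0", "d")].add((xi + "d" + mi(h(yi)), "s0"))               # rule 3
    return P

P = pcp_ift(["ab", "b"], ["a", "bb"])
L = ift_language("0", "s0", {"s0"}, P, steps=8, max_len=14)
print("cc" in L)   # prints True: cc is generated, witnessing the solution 1, 2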

Theorem 4. If CS ⊆ IFT_n, for some n ≥ 1, then RE ⊆ IFT_{n+1}.

Proof. Consider a recursively enumerable language L ⊆ T*. Take a

type-0 grammar G = (N, T, S, P) generating L and a new symbol, c. We construct the length-increasing grammar

G' = (N, T ∪ {c}, S, P'),

with

P' = {u → v | u → v ∈ P, |u| ≤ |v|} ∪ {u → v c^s | u → v ∈ P, |u| = |v| + s, s ≥ 1} ∪ {cα → αc, αc → cα | α ∈ N ∪ T}.

It is easy to see that L(G') ⊆ L ⧢ {c}* and that for each string w ∈ L there is a string w' ∈ L(G') with w' ∈ {w} ⧢ {c}*. (We have denoted by ⧢ the shuffle operation, defined by x ⧢ y = {x_1 y_1 x_2 y_2 ... x_n y_n | x = x_1 x_2 ... x_n, y = y_1 y_2 ... y_n, n ≥ 1, x_i, y_i ∈ V*, 1 ≤ i ≤ n}, for x, y ∈ V*, and extended in the natural way to languages.) Indeed, the length-decreasing rules of P are replaced by monotonous rules which introduce occurrences of the symbol c; this symbol is moved freely to the right and to the left, making room for the left hand sides of rules in P.

Consider now an IFT γ = (K, V, s_0, a_0, F, P'') generating the context-sensitive language L(G') and having card(K) = n. Let a_0', c_1, c_2, c_3 be new symbols and s_1 a new state, and construct the IFT

γ' = (K ∪ {s_1}, V ∪ {a_0', c_1, c_2, c_3}, s_0, a_0', {s_1}, P'''),

where P''' contains all transition rules of P'' as well as the following transitions:

1. s_0 a_0' → c_1 x c_2 s, for all s_0 a_0 → x s ∈ P'' with x ∈ V*, s ∈ K;
2. s_0 c_1 → c_1 s_0, and s c_2 → c_2 s, for all s ∈ K;
3. s c_2 → c_3 s, for all s ∈ F;
4. s_0 c_1 → s_1,

s_1 c → s_1, s_1 c_3 → s_1, and s_1 a → a s_1, for all a ∈ T.

After replacing a_0' by a string introduced by γ when scanning a_0, bounded by the markers c_1, c_2, the IFT γ' simulates the work of γ, modulo passing over c_1 and c_2 and leaving them unchanged. The symbol c_3 cannot be scanned in a state of K. This symbol can be introduced only from a state

in F, by a rule s c_2 → c_3 s, s ∈ F. If such a rule is used, then the obtained string, of the form c_1 w c_3, cannot be further translated by the rules in P''. This means that a rule of the form s c_2 → c_3 s, s ∈ F, is used only once, at the end of simulating in γ' a sequence of translations in γ. On the other hand, the symbol c_2 cannot be scanned in the state s_1, hence the work of γ' cannot be finished as long as the symbol c_2 is present in the string. Thus, a rule s c_2 → c_3 s, s ∈ F, must be used. This ensures that the work of γ ends in a final state, as imposed by F (note that the states of F are no longer final states in γ'). A string of the form c_1 w c_3 can be processed on the path from s_0 to s_1; all symbols c_1, c_3, c are erased, and all symbols in T are left unchanged. This is the only way of reaching the end of a string in the final state s_1. Therefore, if w above is a string of L(G') such that w ∈ {z} ⧢ {c}*, for z ∈ L, then the string generated by γ' is exactly z. In conclusion, L(γ') = L, that is, L ∈ IFT_{n+1}.

Some indication of the size of the family IFT_3 is also given by the following result, which shows that this family includes the largest family of L languages generated without interaction, ET0L (the family of languages generated by extended tabled interactionless L systems). It is known that this family is strictly included in CS.

Theorem 5. ET0L ⊆ IFT_3.

Proof. For every language L ∈ ET0L there is an ET0L system which generates L and has only two tables, [19]. Let G = (V, T, w, P_1, P_2) be such a system (the components above are: the total alphabet, the terminal alphabet, the axiom, and the tables, that is, finite and complete sets of rules as in E0L systems). Let a_0, c be new symbols and construct the IFT

γ = ({s_0, s_1, s_2}, V ∪ {a_0, c}, s_0, a_0, {s_1}, P),

P = {s_0 a_0 → c w s_0} ∪ {s_0 c → c s_0} ∪ {s_0 a → x s_0 | a → x ∈ P_1} ∪ {s_0 c → c s_2} ∪ {s_2 a → x s_2 | a → x ∈ P_2} ∪ {s_0 c → s_1} ∪ {s_1 a → a s_1 | a ∈ T}.

It is easy to see that L(G) = L(γ). If one stays in state s_0 when scanning c, then one has to completely rewrite the string according to the rules of P_1 (the passing to s_1 or to s_2 is possible only when one scans c); if one passes to s_2 after scanning c, then one has to use the rules of P_2. The rules of the two tables cannot be mixed. Finally, if one passes to s_1 after scanning (and removing) c, then the work of γ should be finished: even if the rules of P_1 can be used again and again, the final state s_1 cannot be reached any more. The use of s_1 also ensures that the obtained string is terminal.
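As a concrete illustration (ours, not from the paper), this construction can be applied to a small two-table ET0L system and checked with the hypothetical ift_language simulator from Section 3; '0' stands for a_0, and the symbol encodings are ours.

# Two-table ET0L system G = ({a, b}, {b}, a, P1, P2) with P1 = {a -> aa, b -> b}
# and P2 = {a -> b, b -> b}; its language is {b^(2^n) | n >= 0}.
P = {
    ("s0", "0"): {("ca", "s0")},                          # s0 a0 -> c w s0 (axiom w = a)
    ("s0", "c"): {("c", "s0"), ("c", "s2"), ("", "s1")},  # stay (table P1), switch, or finish
    ("s0", "a"): {("aa", "s0")},                          # table P1
    ("s0", "b"): {("b", "s0")},
    ("s2", "a"): {("b", "s2")},                           # table P2
    ("s2", "b"): {("b", "s2")},
    ("s1", "b"): {("b", "s1")},                           # only terminals after removing c
}
print(sorted(ift_language("0", "s0", {"s1"}, P, steps=6, max_len=10)))
# -> ['b', 'bb', 'bbbb', 'bbbbbbbb']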

For easy reference, we sum up the results of Theorems 3 and 5 in the diagram in Figure 1 (which also contains the families of languages generated by deterministic IFT's).

We close this section with some results about the closure properties of the families in the hierarchy IFT_n, n ≥ 1.

Theorem 6. If L, L_1, L_2 ∈ IFT_n and h is a morphism, then h(L), L+, L_1 ∪ L_2, and L_1 L_2 are in IFT_{n+1}, n ≥ 1.

Proof. Morphisms. Let γ = (K, V, s_0, a_0, F, P) be an IFT with card(K) = n, let a_0', c_1, c_2, c_2' be new symbols, and s_1 a new state. We construct the IFT

γ' = (K ∪ {s_1}, V ∪ {a_0', c_1, c_2, c_2'}, s_0, a_0', {s_1}, P'),

where

P' = {s_0 a_0' → c_1 a_0 c_2 s_0, s_0 c_1 → c_1 s_0} ∪ P ∪ {s c_2 → c_2 s | s ∈ K} ∪ {s c_2 → c_2' s | s ∈ F} ∪ {s_0 c_1 → s_1, s_1 c_2' → s_1} ∪ {s_1 a → h(a) s_1 | a ∈ V}.

The final state s_1 can be reached only once, after erasing the symbol c_1, and only in this state can we erase the symbol c_2'; on the other hand, in the state s_1 we cannot scan the symbol c_2, hence it must be replaced by c_2' in a state of F. With these observations, it is easy to see that L(γ') = h(L(γ)). (A small executable sketch of this transformation is given below, after the Kleene + case.)

Kleene +. Let γ = (K, V, s_0, a_0, F, P) be an IFT with card(K) = n, let c_1, c_2, c_2' be new symbols, and s_1 a new state. For each a ∈ V, let a' be a new symbol; denote by V' their set and by g the coding defined by g(a) = a', a ∈ V. We construct the IFT

γ' = (K ∪ {s_1}, V ∪ V' ∪ {c_1, c_2, c_2'}, s_0, a_0, {s_1}, P'),

where

P' = {s_0 a_0 → c_1 a_0' c_2 s_0, s_0 c_1 → c_1 s_0} ∪ {s a' → g(x) s' | s a → x s' ∈ P} ∪ {s c_2 → c_2 s | s ∈ K} ∪ {s c_2 → c_2' s | s ∈ F} ∪ {s_0 c_1 → s_1, s_1 c_2' → s_1} ∪ {s_1 a' → a s_1 | a ∈ V} ∪ {s_1 c_2' → c_1 a_0' c_2 s_0} ∪ {s_0 a → a s_0 | a ∈ V}.

Initially one introduces the string c_1 a_0' c_2 (note that a_0' is the primed version of the initial symbol of γ and not a new symbol). As in the case of morphisms, one reaches s_1 and one continues successfully only by erasing the symbol c_1, after changing the symbol c_2 into c_2' in a final state of γ. In the state s_1 one either removes the symbol c_2' (it cannot be scanned in other states) and the process stops, or one replaces c_2' with c_1 a_0' c_2 and the process is iterated. (The symbols of V are left unchanged in the state s_0; the primed symbols are processed as in the IFT γ.) Therefore, L(γ') = L(γ)+.
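The morphism case of Theorem 6 is mechanical enough to be written down as a small rule-table transformation. The sketch below is ours, not from the paper: it follows the Morphisms part of the proof under the conventions of the Section 3 simulator, with the made-up symbols '@', '<', '>', ']' playing the roles of a_0', c_1, c_2, c_2', and with the added state called "h"; it assumes the initial state of γ is named "s0" and that these symbols and the state name are fresh.

def morphic_image_ift(P, states, finals, a0, V, h):
    """Rule table of the IFT with one extra state generating h(L(gamma))."""
    P2 = {k: set(v) for k, v in P.items()}            # keep all rules of gamma
    P2[("s0", "@")] = {("<" + a0 + ">", "s0")}        # s0 a0' -> c1 a0 c2 s0
    P2[("s0", "<")] = {("<", "s0"), ("", "h")}        # s0 c1 -> c1 s0,  s0 c1 -> s1
    for s in states:
        P2[(s, ">")] = {(">", s)}                     # s c2 -> c2 s, for s in K
        if s in finals:
            P2[(s, ">")].add(("]", s))                # s c2 -> c2' s, for s in F
    P2[("h", "]")] = {("", "h")}                      # s1 c2' -> s1
    for a in V:
        P2[("h", a)] = {(h[a], "h")}                  # s1 a -> h(a) s1
    return P2

# Applied to the one-state IFT P0 for {a^(2^n)} from Section 3, with h(a) = "ba":
P2 = morphic_image_ift(P0, ["s0"], {"s0"}, "0", "a", {"a": "ba"})
print(sorted(ift_language("@", "s0", {"h"}, P2, steps=8, max_len=24)))
# -> ba, (ba)^2, (ba)^4, (ba)^8 (the members of {(ba)^(2^n)} within the bound)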

[Figure 1: Hierarchies of IFT families. The diagram shows the inclusion relations among the families IFT_1 = 0L, IFT_2, IFT_3, IFT_4 = RE, DIFT_1 = D0L, DIFT_2, DIFT_3, DIFT_4, DIFT, and the families REG, CF, E0L, ET0L, CS, as established in Theorems 3 and 5.]

Union. Let γ_i = (K_i, V_i, s_{0,i}, a_{0,i}, F_i, P_i), i = 1, 2, be two IFT's with the same number of states (if necessary, we add dummy states to one of K_1, K_2); without loss of generality, we assume that the states are named in the same way, s_0, s_1, ..., s_{n-1}, in both sets K_1, K_2, which means that K_1 = K_2. We consider the new alphabets V_1', V_2'', of primed and double primed symbols associated with the symbols of V_1, V_2, respectively, and we denote by g_1, g_2 the corresponding codings. Consider also the new symbols a_0, c_1, c_2, c_3, c_2', c_3' and the new state s_1. We construct the IFT

γ = (K_1 ∪ {s_1}, V_1 ∪ V_2 ∪ V_1' ∪ V_2'' ∪ {a_0, c_1, c_2, c_3, c_2', c_3'}, s_0, a_0, {s_1}, P),

where

P = {s_0 a_0 → c_1 a_{0,1}' c_2 s_0, s_0 a_0 → c_1 a_{0,2}'' c_3 s_0} ∪ {s_0 c_1 → c_1 s_0} ∪ {s a' → g_1(x) s' | s a → x s' ∈ P_1} ∪ {s a'' → g_2(x) s' | s a → x s' ∈ P_2} ∪ {s c_2 → c_2 s | s ∈ K_1} ∪ {s c_3 → c_3 s | s ∈ K_2} ∪ {s c_2 → c_2' s | s ∈ F_1} ∪ {s c_3 → c_3' s | s ∈ F_2} ∪ {s_0 c_1 → s_1} ∪ {s_1 a' → a s_1 | a ∈ V_1} ∪ {s_1 a'' → a s_1 | a ∈ V_2} ∪ {s_1 c_2' → s_1, s_1 c_3' → s_1}.

This time we initially introduce both strings c_1 a_{0,1}' c_2 and c_1 a_{0,2}'' c_3. Because

of the primed and double primed symbols, and because of the different right markers c_2, c_3, the transitions of the two IFT's γ_1 and γ_2 cannot be mixed or otherwise used in a wrong way. The state s_1 removes the primes and ensures the correct termination of the process, in the same way as in the previous cases. Thus, L(γ) = L(γ_1) ∪ L(γ_2).

Concatenation. We start exactly as above and we construct the IFT

γ' = (K_1 ∪ {s_1}, V_1 ∪ V_2 ∪ V_1' ∪ V_2'' ∪ {a_0, c_1, c_2, c_3, c_2', c_3'}, s_0, a_0, {s_1}, P),

where

P = {s_0 a_0 → c_1 c_1 a_{0,1}' c_2 s_0}

∪ {s_0 c_1 → c_1 s_0} ∪ {s a' → g_1(x) s' | s a → x s' ∈ P_1} ∪ {s a'' → g_2(x) s' | s a → x s' ∈ P_2} ∪ {s c_2 → c_2 s | s ∈ K_1} ∪ {s c_3 → c_3 s | s ∈ K_2} ∪ {s c_2 → c_2' s | s ∈ F_1} ∪ {s c_3 → c_3' s | s ∈ F_2} ∪ {s_0 c_1 → s_1, s_1 c_1 → c_1 s_1} ∪ {s_1 a' → a s_1 | a ∈ V_1} ∪ {s_1 a'' → a s_1 | a ∈ V_2} ∪ {s_1 c_2' → a_{0,2}'' c_3 s_0} ∪ {s_0 a → a s_0 | a ∈ V_1} ∪ {s_1 c_3' → s_1}.

After producing a string of the form c_1 c_1 g_1(w) c_2, with w ∈ L(γ_1), one replaces c_2 by c_2', which imposes the use of the state s_1. By erasing one occurrence of c_1 one passes to s_1, where one removes the primes and the symbol c_2' is replaced with a_{0,2}'' c_3. One now generates a string of L(γ_2), with double primed symbols. The process is successfully finished only when a state of F_2 is reached. We obtain L(γ') = L(γ_1) L(γ_2).

The closure properties under other operations (especially the remaining two AFL operations, intersection with regular languages and inverse morphisms) remain open.

Because each recursively enumerable language is the morphic image of a context-sensitive language, the previous assertion with respect to morphisms provides a new proof for the statement of Theorem 4.

5 Final Remarks; Back to Computing by Carving

The main result of this paper is the fact that non-deterministic iterated finite state sequential transducers with four states characterize the recursively enumerable languages. This implies that the number of states does not induce an infinite hierarchy of languages which are C-REG computable: in view of Theorem 1, a language is C-REG computable if and only if it is the complement of a language in RE = IFT_4. Thus, if we take into account the number of states of an IFT generating the complement of a C-REG computable language, then we can have at most four levels in the hierarchy of C-REG computable languages.

If we take into account the number of states of the gsm g describing a regular sequence of languages, identified by a pair (L_1, g), then the hierarchy is still lower: there are at most three levels. To this aim, we repeat the proof from [15] of one implication of Theorem 1, making use of the basic idea used here in the proof of Lemma 6.

Let us denote by CREG_n, n ≥ 1, the family of languages of the form M − g*(L_1), for M, L_1 regular languages and g a gsm with at most n states; by CREG we denote the union of all these families, that is, the family of all C-REG computable languages.

Theorem 7. CREG_1 ⊂ CREG_2 ⊆ CREG_3 = CREG.

Proof. The inclusions ⊆ are obvious. According to Theorem 1, each language L ⊆ T*, L ∈ CREG, has its complement in RE. Take such a language L and consider a grammar G = (N, T, S, P) for the language T* − L; take G in the Geffert normal form, that is, with N = {S, A, B, C} and with P containing context-free rules S → x, x ∈ (N ∪ T)+, and the non-context-free rule ABC → λ. We construct the regular sequence of languages starting with L_1 = {S}

and using the gsm

g = (K, V, V, s_0, {s_0}, R),

with the following components:

K = {s_0, s_A, s_B},
V = N ∪ T,
R = {s_0 α → α s_0 | α ∈ N ∪ T}

∪ {s_0 S → x s_0 | S → x ∈ P} ∪ {s_0 A → s_A, s_A B → s_B, s_B C → s_0}.

It is easy to see that g*(L_1) ∩ T* = L(G) (at each iteration of g one can simulate the application of a rule of P; if several rules are simulated in the

same iteration, then this does not change the generated language). Therefore, for the regular language M = T* we obtain

M − g*(L_1) = T* − L(G) = T* − (T* − L) = L.

In conclusion, L ∈ CREG_3.

The inclusion CREG_1 ⊂ CREG_2 is clearly proper, because by iterating a gsm with only one state we get a language in IFT_1 = 0L.

The proof above does not imply that RE = IFT_3, because one state has been saved here by using the language M, of a specific form, which makes unnecessary the use of a special (final) state in order to check whether or not the string produced by iterating the gsm is terminal.

The previous theorem shows that in order to classify the C-REG computable languages we need other parameters describing the complexity of the regular sequences of languages; the number of states is not sufficiently sensitive. Such more sensitive parameters remain to be found. Also the case of deterministic iterated transducers remains to be investigated.
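As a bounded illustration of the construction in the proof of Theorem 7 (ours, not from the paper; it reuses the hypothetical gsm_map sketch from Section 2), consider a grammar that is already of the required context-free shape and needs no erasing rule, with the rules S → aSb and S → ab: then g*({S}) ∩ T* = {a^n b^n | n ≥ 1}, and carving g*({S}) out of a finite slice of M = T* leaves exactly the short strings over {a, b} that are not of that form.

from itertools import product

# gsm g of the Theorem 7 construction for the rules S -> aSb | ab (the states
# s_A, s_B are not needed here, since there is no erasing rule).
P = {
    ("s0", "S"): {("S", "s0"), ("aSb", "s0"), ("ab", "s0")},  # copy S or apply an S-rule
    ("s0", "a"): {("a", "s0")},
    ("s0", "b"): {("b", "s0")},
}
union, L = set(), {"S"}
for _ in range(5):                                   # a few levels of g*({S})
    union |= L
    L = {w for w in gsm_map(L, "s0", {"s0"}, P) if len(w) <= 10}

M = {"".join(t) for k in range(1, 5) for t in product("ab", repeat=k)}
print(sorted(M - union))   # all w over {a, b} with |w| <= 4 except "ab" and "aabb"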

Acknowledgement.

This work has been supported by CNR, Gruppo Nazionale per l'Informatica Matematica, Italy, by the Direccio General de Recerca, Generalitat de Catalunya (PIV), and by the Academy of Finland, Project 137358.

References

[1] L. M. Adleman, Molecular computation of solutions to combinatorial problems, Science, 266 (1994), 1021-1024.

[2] M. Amos, DNA Computation, PhD Thesis, Univ. of Warwick, Dept. of Computer Sci., 1997.

[3] M. Amos, A. Gibbons, D. Hodgson, Error-resistant implementation of DNA computations, in Proc. of the Second Annual Meeting on DNA Based Computers, Princeton, 1996.

[4] C. Calude, Gh. Paun, Global syntax and semantics for recursively enumerable languages, Fundamenta Informaticae, 4, 2 (1981), 245-254.

[5] J. Dassow, S. Marcus, Gh. Paun, Iterated reading of numbers and "black-holes", Periodica Mathematica Hungarica, 27, 2 (1993), 137-152.

[6] U. Eco, Opera aperta, Bompiani, Milano, 1962.

[7] V. Geffert, Normal forms for phrase-structure grammars, RAIRO Th. Inform. and Appl., 25 (1991), 473-496.

[8] S. Ginsburg, Algebraic and Automata-Theoretic Properties of Formal Languages, North-Holland, Amsterdam, 1975.

[9] F. Guarnieri, M. Fliss, C. Bancroft, Making DNA add, Science, 273 (1996), 220-223.

[10] N. Immerman, Nondeterministic space is closed under complementation, SIAM J. Comput., 17, 5 (1988), 935-938.

[11] M. Latteux, D. Simplot, A. Terlutte, Iterated length-preserving transducers, Proc. Math. Found. Computer Sci. Conf., Brno, 1998.

[12] R. J. Lipton, DNA solution of hard computational problems, Science, 268 (1995), 542-545.

[13] Q. Ouyang, P. D. Kaplan, S. Liu, A. Libchaber, DNA solution of the maximal clique problem, Science, 278 (1997), 446-449.

[14] Gh. Paun, On the iteration of gsm mappings, Revue Roum. Math. Pures Appl., 23, 4 (1978), 921-937.

[15] Gh. Paun, (DNA) Computing by carving, submitted, 1998.

[16] Gh. Paun, G. Rozenberg, A. Salomaa, DNA Computing. New Computing Paradigms, Springer-Verlag, Berlin, 1998.

[17] Gh. Paun, A. Salomaa, Self-reading sequences, Amer. Math. Monthly, 103 (Febr. 1996), 166-168.

[18] B. Rovan, A framework for studying grammars, Proc. MFCS 81, Lect. Notes in Computer Sci. 118, Springer-Verlag, 1981, 473-482.

[19] G. Rozenberg, A. Salomaa, The Mathematical Theory of L Systems, Academic Press, New York, 1980.

[20] G. Rozenberg, A. Salomaa, eds., Handbook of Formal Languages, 3 volumes, Springer-Verlag, Berlin, 1997.

[21] R. Szelepcsényi, The method of forced enumeration for nondeterministic automata, Acta Inform., 26, 3 (1988), 279-284.

[22] D. Wood, Iterated a-NGSM maps and Γ-systems, Inform. Control, 32 (1976), 1-26.


Turku Centre for Computer Science
Lemminkäisenkatu 14
FIN-20520 Turku
Finland
http://www.tucs.abo.

University of Turku
- Department of Mathematical Sciences

Åbo Akademi University
- Department of Computer Science
- Institute for Advanced Management Systems Research

Turku School of Economics and Business Administration
- Institute of Information Systems Science
