Some recent results on squarefree words


by Jean Berstel
Université Pierre et Marie Curie, Paris, and L.I.T.P.

"For the development of the logical sciences it will be important, without regard to possible applications, to find extensive fields for speculation about difficult problems." Axel Thue, 1912.

1. Introduction.

When Axel Thue wrote these lines in the introduction to his 1912 paper on squarefree words, he certainly did not think of himself as a theoretical computer scientist. During the past seventy years, there has been an increasing interest in squarefree words and more generally in repetitions in words. However, A. Thue's sentence still seems to hold: in some sense, he said that there is no reason to study squarefree words, except that it is a difficult question, and that it is of primary importance to investigate new domains. Seventy years later, these questions are no longer new, and one may ask whether squarefree words have already proved useful.

First, we observe that infinite squarefree, overlap-free or cube-free words have indeed served as examples or counter-examples in several, quite different domains. In symbolic dynamics, they were introduced by Morse in 1921 [36]. Another use is in group theory, where an infinite squarefree word is one of the (numerous) steps in disproving the Burnside conjecture (see Adjan [2]). Closer to computer science is Morse and Hedlund's interpretation in relation with chess [37]. We also mention applications to formal language theory: Brzozowski, K. Culik II and Gabrielian [7] use squarefree words in connection with noncounting languages, and J. Goldstine uses the Morse sequence to establish a property of some family of languages [22]. See also Shyr [52] and Reutenauer [43]. All these are cases where repetition-free words served as explicit examples. In other cases, questions about these words led to new insights in other domains, such as for DOL languages and for context-free languages.

At the present time, the set of results on repetitions constitutes a topic in combinatorics on words.


This paper gives a survey of some recent results concerning squarefree words and related topics. In the past years, the interest in this topic has indeed been growing, and a number of results are now available. An account of basic results may be found in Salomaa [45,46] and in Lothaire [30]. For earlier work, see also Hedlund's paper [25]. The more general concept of unavoidable pattern is introduced in Bean, Ehrenfeucht, McNulty [4]. Part 2 deals with powers and repetitions, part 3 with language-theoretic results, part 4 gives the estimates on growth, and part 5 describes results on morphisms.

2. Powers and repetitions.

A square is a word of the form xx, with x a nonempty word. Cubes and k-th powers are defined accordingly. A word is squarefree if none of its factors (in the sense of Lothaire [30], or subwords) is a square. A word is overlap-free if it contains no factor of the form xuxux with x nonempty. The concept of k-th power free words, where k implicitly is a positive integer, can be extended to rational numbers as follows: if r = n + s is a positive rational number with n a positive integer and 0 < s < 1, then an r-th power is a word of the form

$u^n u'$

with exactly n consecutive u's and one left factor u' of u satisfying $|u'|/|u| = s$.
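The definitions above translate directly into brute-force tests. The following is a minimal sketch, not taken from the paper: the function names are our own, and `exponent` assumes that the exponent of a nonempty word is its length divided by its smallest period, so that the word is an r-th power for r = `exponent(word)`.

```python
from fractions import Fraction


def has_square(word: str) -> bool:
    """Return True if some factor of `word` has the form xx with x nonempty."""
    n = len(word)
    for start in range(n):
        for half in range(1, (n - start) // 2 + 1):
            if word[start:start + half] == word[start + half:start + 2 * half]:
                return True
    return False


def has_overlap(word: str) -> bool:
    """Return True if some factor has the form xuxux with x nonempty.

    Such a factor exists exactly when some factor of length 2p + 1
    has period p (take p = |xu| and shorten x to its first letter).
    """
    n = len(word)
    for start in range(n):
        for p in range(1, (n - start - 1) // 2 + 1):
            if all(word[start + i] == word[start + i + p] for i in range(p + 1)):
                return True
    return False


def exponent(word: str) -> Fraction:
    """Length of a nonempty word divided by its smallest period (assumed definition)."""
    n = len(word)
    for p in range(1, n + 1):
        if all(word[i] == word[i + p] for i in range(n - p)):
            return Fraction(n, p)
```

For instance, `abcab` is written as u u' with u = abc and u' = ab, and `exponent("abcab")` returns 5/3 = 1 + 2/3.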

The Thue-Morse sequence

m = 011010011001011010010110...

contains squares and is overlap-free (Thue [54], Morse [36]); the word

t = abcacbabcba...

derived from m by the inverse morphism

a --> 011, b --> 01, c --> 0

is squarefree (Thue [55]). The Fibonacci word

f = abaababaabaababaabab...

contains cubes but is 4-th power free (see e.g. Karhumäki [27]). Many other special infinite words with some repetition property are known. Usually, they are constructed by iterating morphisms or by tag systems in the sense of Minsky [35]. (See also Pansiot's paper in the proceedings.)

Let us mention that some words may also be defined by an explicit description of the positions of the letters occurring in them. This holds for the Thue-Morse sequence, since the i-th letter can be shown to be 0 or 1 according to the number of "1" in the binary expansion of i being even or odd. A more systematic treatment of these descriptions is given in Christol, Kamae, Mendès-France, Rauzy [10]. One of the properties of these generalized sequences is given by Cerny [9]. He defines, for a given fixed word w over {0,1}, an infinite word by setting the i-th letter to 0 or to 1 when the number of occurrences of w in the binary expansion of i is even or is odd. Thus the original Thue-Morse sequence is the special case where w=1. Cerny shows that the infinite word that is obtained in this manner has no factors of the form $(xu)^k x$, where $k = 2^{|w|}$ and x is nonempty.
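The explicit descriptions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: positions are numbered from 0, binary expansions are taken without leading zeros, occurrences of w may overlap, and the Fibonacci word is generated by iterating a --> ab, b --> a, which is the usual definition but is not spelled out in the text. All function names are hypothetical.

```python
def thue_morse(length: int) -> str:
    """Letter i is the parity of the number of 1s in the binary expansion of i."""
    return "".join(str(bin(i).count("1") % 2) for i in range(length))


def cerny_word(w: str, length: int) -> str:
    """Letter i is the parity of the number of occurrences of w in binary(i)."""
    def occurrences(text: str) -> int:
        # Count occurrences starting at every position (overlaps allowed).
        return sum(text.startswith(w, k) for k in range(len(text)))
    return "".join(str(occurrences(bin(i)[2:]) % 2) for i in range(length))


def ternary_from_thue_morse(m: str) -> str:
    """Read m as a product of blocks 011, 01, 0 and map them to a, b, c.

    Every block starts with 0, so splitting on "0" yields the run of 1s
    following each 0. The last run may be cut off and is dropped.
    """
    runs = m.split("0")[1:-1]
    return "".join({"11": "a", "1": "b", "": "c"}[run] for run in runs)


def fibonacci_word(length: int) -> str:
    """Prefix of the Fibonacci word, obtained by iterating a -> ab, b -> a."""
    word = "a"
    while len(word) < length:
        word = "".join("ab" if letter == "a" else "a" for letter in word)
    return word[:length]
```

With these definitions, `thue_morse(24)` reproduces the prefix of m shown above, `ternary_from_thue_morse(thue_morse(24))` gives `abcacbabcba`, and `cerny_word("1", n)` coincides with `thue_morse(n)`.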

Squares are unavoidable over two letters, and they are avoidable over three letters. Here "unavoidable" means that every long enough word contains a square. By contrast, "avoidable" means that there are infinite squarefree words. So one may ask for the minimal avoidable repetition or, (almost) equivalently, for the maximal unavoidable repetition over a fixed k letter alphabet. Denote the maximal unavoidable repetition over k letters by s(k). If s(k)=r, then every long enough word contains an r-th power, and there is an infinite word with no factor of the form wa with w an r-th power and a the first letter of w. The Thue-Morse sequence shows that s(2)=2 (since squares are unavoidable over 2 letters). Over three letters, squares are avoidable. So s(3)+c(m), from which the conclusion follows by taking n=22. There still remains a gap between the upper and the lower bounds, but the very precise value is not so important. There is also a similar proof of the result by Brinkhuis [6]. An analogous proof shows that the number of cubefree words over a two letter alphabet also grows exponentially.

In contrast, there is a very interesting polynomial bound on the number of overlap-free words:

THEOREM (Restivo, Salemi [42]).- There is a constant C such that the number p(n) of overlap-free words of length n over a two letter alphabet satisfies

$p(n) \le C \cdot n^{\log 15}$
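For small lengths, the quantities discussed above can be computed by exhaustive extension. The sketch below is only an illustration with hypothetical function names: it extends words letter by letter and discards any word whose new suffix is a forbidden repetition, which is enough because a word avoids a repetition exactly when none of its prefixes ends with one.

```python
def ends_with_square(word: str) -> bool:
    """Return True if some suffix of `word` is a square xx with x nonempty."""
    n = len(word)
    return any(word[n - 2 * h:n - h] == word[n - h:] for h in range(1, n // 2 + 1))


def ends_with_overlap(word: str) -> bool:
    """Return True if some suffix of length 2p + 1 has period p (an overlap)."""
    n = len(word)
    return any(
        word[n - 2 * p - 1:n - p] == word[n - p - 1:]
        for p in range(1, (n - 1) // 2 + 1)
    )


def count_avoiding(alphabet: str, bad_suffix, max_len: int) -> list:
    """counts[n] = number of words of length n with no forbidden repetition."""
    level = [""]
    counts = [1]
    for _ in range(max_len):
        level = [w + a for w in level for a in alphabet if not bad_suffix(w + a)]
        counts.append(len(level))
    return counts
```

Over two letters, `count_avoiding("ab", ends_with_square, 5)` drops to 0 at length 4, which is the statement that squares are unavoidable there. Over three letters the counts of squarefree words keep growing, and `count_avoiding("ab", ends_with_overlap, n)` gives the first values of p(n).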


The proof of the Restivo-Salemi bound is based on a clever factorization of overlap-free words into factors which are the initial factors of length $2^k$ of the two letter Thue-Morse sequence and those obtained by exchanging a and b. Each overlap-free word is shown to have a unique factorization of this kind. A computation of all possible factorizations for words of length n then gives the upper bound.

It remains to investigate the tree of squarefree words in more detail. This tree is obtained by assigning a node to each squarefree word and by connecting the node of a word to the node of each extension by a letter added on the right. Since there are infinitely many squarefree words, this tree is infinite. Therefore, there are infinite paths in it (König's lemma). But there are also finite branches in it, as for example abacaba. These correspond to maximal squarefree words, which cannot be extended by any of the three letters. These right-maximal squarefree words were described by Li [29]: they have exactly the expected form, namely (over three letters)

wvuabuacvuabua

provided they are squarefree. They are derived from the simplest of them, abacaba, by inserting a word u before the a's, a word v before the uabua's, and w in front.

It was shown by Kakutani (see [21]) that there are uncountably many infinite squarefree words over three letters. So one may ask "where" these words are in the tree: more precisely, is the tree uniform in some sense? One could imagine indeed that there are infinite paths in the tree where all leaving paths are finite, yielding a "sparse" infinite branch. That this cannot happen was proved by Shelton and Soni.

THEOREM (Shelton, Soni [50,51]).- The set of infinite squarefree words over three letters is perfect.

The statement that this set is perfect means that if an infinite squarefree word goes through a node of the tree, then this infinite word will eventually split into two (and therefore into infinitely many) infinite squarefree words. There is a related result which says that one need not walk too far in the tree to find an infinite path.

THEOREM (Shelton, Soni [51]).- There is a constant K such that if u is a squarefree finite word of length n on a three letter alphabet and if u can be extended to a squarefree word uv of length $n + K n^{3/2}$, then u can be extended to an infinite squarefree word.
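The tree of squarefree words can be explored directly for small depths. The following sketch uses hypothetical function names and assumes the form wvuabuacvuabua described for right-maximal words. It tests right-maximality and performs a bounded lookahead in the tree. The lookahead depth is an arbitrary parameter: the constant K of the theorem is not given here, so the code does not decide extendability to an infinite word.

```python
def ends_with_square(word: str) -> bool:
    """Return True if some suffix of `word` is a square xx with x nonempty."""
    n = len(word)
    return any(word[n - 2 * h:n - h] == word[n - h:] for h in range(1, n // 2 + 1))


def is_squarefree(word: str) -> bool:
    """A word is squarefree iff none of its prefixes ends with a square."""
    return not any(ends_with_square(word[:i]) for i in range(2, len(word) + 1))


def is_right_maximal(word: str, alphabet: str = "abc") -> bool:
    """Squarefree, and every one-letter extension on the right creates a square."""
    return is_squarefree(word) and all(ends_with_square(word + a) for a in alphabet)


def li_form(u: str = "", v: str = "", w: str = "") -> str:
    """Build w v (uabua) c v (uabua); the result still has to be tested for squares."""
    block = u + "ab" + u + "a"
    return w + v + block + "c" + v + block


def can_extend(word: str, extra: int, alphabet: str = "abc") -> bool:
    """True if the squarefree `word` has a squarefree extension by `extra` letters."""
    if extra == 0:
        return True
    return any(
        not ends_with_square(word + a) and can_extend(word + a, extra - 1, alphabet)
        for a in alphabet
    )
```

Here `li_form()` with empty u, v, w gives abacaba, which `is_right_maximal` confirms is a leaf of the tree, whereas `can_extend("abc", 30)` succeeds because abc is a prefix of the infinite squarefree word t.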

5. Squarefree morphisms.

The first, and up to now the only, technique for constructing squarefree words that has been systematically investigated is the use of morphisms. The method goes as follows. First, an endomorphism is iterated, giving an infinite set of words (which can also be considered as an infinite word). Then a second morphism is applied to the set (infinite word). If everything is conveniently chosen, the result is squarefree.


This technique was already used by Axel Thue [55] to compute the first infinite squarefree word. Of course, there exist infinite squarefree words which cannot be constructed this way, since there are uncountably many of these words. However, the method still is very useful. The sets of words, squarefree or not, obtained by morphism, have interesting combinatorial properties. Among these is their "subword complexity". See Ehrenfeucht et al. [17,18,19].

One of the basic questions asked in this context is whether a given morphism

h : A* --> B*

is squarefree. By definition, h is a squarefree morphism if h preserves squarefree words, i.e. if the image h(w) is a squarefree word whenever w is squarefree.
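The two-step construction described at the start of this section (iterate an endomorphism, then apply a second morphism) can be sketched as follows. The choices made here are assumptions for illustration only: the endomorphism is taken to be a --> abc, b --> ac, c --> b, whose iterates from a begin with the word t of Part 2, and the second morphism is a simple renaming of letters. The names `G`, `CODING`, `apply_morphism` and `iterate` are ours.

```python
def apply_morphism(h: dict, word: str) -> str:
    """Replace every letter of `word` by its image under h."""
    return "".join(h[letter] for letter in word)


def iterate(h: dict, start: str, steps: int) -> str:
    """Apply the endomorphism h to `start` repeatedly, `steps` times."""
    word = start
    for _ in range(steps):
        word = apply_morphism(h, word)
    return word


def has_square(word: str) -> bool:
    """Return True if some factor of `word` has the form xx with x nonempty."""
    n = len(word)
    return any(
        word[i:i + h] == word[i + h:i + 2 * h]
        for i in range(n)
        for h in range(1, (n - i) // 2 + 1)
    )


# Illustrative choices, not prescribed by the text.
G = {"a": "abc", "b": "ac", "c": "b"}       # iterated endomorphism
CODING = {"a": "0", "b": "1", "c": "2"}     # second morphism (a renaming)

prefix = iterate(G, "a", 6)                 # a finite prefix of the infinite word
coded = apply_morphism(CODING, prefix)
```

A finite computation of this kind only checks prefixes. It illustrates the construction but proves nothing about the infinite word.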

Examples: The morphism of Thue [55]

h(a)=abcab, h(b)=acabcb, h(c)=acbcacb

is squarefree. The following morphism (see Hall [23], Istrail [26])

h(a)=abc, h(b)=ac, h(c)=b

is not squarefree since h(aba)=abcacabc contains the square caca.
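The definition of a squarefree morphism suggests a simple experiment: apply the morphism to all short squarefree words and look for a square in the images. The sketch below uses hypothetical names and takes the second morphism as a --> abc, b --> ac, c --> b, which is our reading of the Hall and Istrail example. A search bounded by `max_len` can exhibit a counterexample, but finding none is not a proof that a morphism is squarefree.

```python
def has_square(word: str) -> bool:
    """Return True if some factor of `word` has the form xx with x nonempty."""
    n = len(word)
    return any(
        word[i:i + h] == word[i + h:i + 2 * h]
        for i in range(n)
        for h in range(1, (n - i) // 2 + 1)
    )


def apply_morphism(h: dict, word: str) -> str:
    """Replace every letter of `word` by its image under h."""
    return "".join(h[letter] for letter in word)


def squarefree_words(alphabet, max_len: int) -> list:
    """All nonempty squarefree words of length <= max_len, shortest first."""
    level, result = [""], []
    for _ in range(max_len):
        level = [w + a for w in level for a in alphabet if not has_square(w + a)]
        result.extend(level)
    return result


def first_counterexample(h: dict, max_len: int = 5):
    """Shortest squarefree word found whose image contains a square, else None."""
    for w in squarefree_words(sorted(h), max_len):
        if has_square(apply_morphism(h, w)):
            return w
    return None


THUE = {"a": "abcab", "b": "acabcb", "c": "acbcacb"}
HALL = {"a": "abc", "b": "ac", "c": "b"}   # assumed reading of the second example
```

With these definitions, `first_counterexample(HALL)` returns `aba`, whose image abcacabc contains caca, while the bounded search finds no counterexample for `THUE`.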

The second morphism above is too "simple" to be squarefree. Indeed, A. Carpi [8] has shown that a squarefree morphism over three letters must have size at least 18. Here the size is the sum of the lengths of the images of the letters. Thue's morphism, given above, has size 18, so it is (already) optimal. The second morphism has only size 6. Several people have investigated squarefreeness of morphisms, and have derived conditions which ensure that a morphism is squarefree. The most precise description is that given by Crochemore:

THEOREM (Crochemore [12,13]).- Let h : A* --> B* be a morphism, with A having at least three letters. Then h is squarefree iff the two following conditions hold:
i) h(x) is squarefree for squarefree words x in A* of length 3;
ii) No h(a), for a in A, contains an internal presquare.

Roughly speaking, a presquare is a factor u of h(a) such that h