Journal of Computer and System Sciences 79 (2013) 658–668
Journal of Computer and System Sciences www.elsevier.com/locate/jcss
Several extensions of the Parikh matrix L-morphism

Hamed M.K. Alazemi a,*,1, Anton Černý b,1

a Department of Computer Engineering, Kuwait University, Kuwait
b Department of Information Science, Kuwait University, Kuwait

Article info

Article history: Received 24 January 2011; Received in revised form 24 November 2012; Accepted 14 January 2013; Available online 31 January 2013

Keywords: Parikh matrix mapping; q-Matrix; Partial word; Fuzzy word
Abstract

The Parikh matrix mapping is a morphism assigning to each word w over a k-letter alphabet a (k + 1) × (k + 1) upper triangular matrix with entries expressing the number of occurrences in w of some specific subwords. To tackle the ambiguity of this mapping, two new mappings have been proposed in the literature, assigning to words matrices with polynomial entries (q-matrices). One is a more subtle, but still ambiguous, morphism; the other is unambiguous but not a morphism. We show that the former mapping can be extended to match even a fairly general extension of the original Parikh matrix morphism. Then we introduce an unambiguous q-matrix morphism based on the same general Parikh matrix mapping. Finally, we consider the problem of incomplete information on word symbols and show that the general Parikh matrix mapping can be further extended to deal with counts of fuzzy subword occurrences. © 2013 Elsevier Inc. All rights reserved.
* Corresponding author. E-mail addresses: [email protected] (H.M.K. Alazemi), [email protected] (A. Černý).
1 This work was supported by Kuwait University, Research Grant No. EO 05/10.
0022-0000/$ – see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.jcss.2013.01.018

1. Introduction

The Parikh matrix mapping was introduced in [1] as an algebraic tool for finding scattered subword counts in words. This mapping is a monoid morphism, which assigns to each word an upper triangular matrix. Thus the internal structure of words can be investigated using the tools of standard matrix calculus. Unfortunately, the original Parikh matrix mapping deals only with subwords that are factors of a concatenation (in a fixed order) of all the symbols of the underlying alphabet. The mapping has been extended in [2] to consider all factors of a given fixed word. In [3], a further extension of the Parikh matrix mapping, the Parikh L-morphism, is presented, dealing with subwords given by any finite language.

In general, the Parikh matrix image of a word does not uniquely characterize the word. A great deal of attention has been paid to the ambiguity of the Parikh matrix mapping (see [4] for recent results and an overview of the topic). A more subtle approach has been taken in [6] by considering matrices with polynomial entries (q-matrices), while the resulting q-matrix morphism is still ambiguous. In the present paper, we continue this effort and present an extension of the q-matrix mapping from [6] based on the shape of the Parikh L-morphism from [3]. Our mapping is capable of dealing with subwords from any finite print language.

In [5], the q-matrix mapping has been modified to a different mapping (the so-called q-matrix encoding), matching words unambiguously with q-matrices. However, the resulting mapping is no longer a morphism. We close this gap here and provide, again as an extension of the Parikh L-morphism, an unambiguous (one-to-one) morphism encoding words to matrices with polynomial entries.

In various applications, e.g., in DNA sequencing, we often have to deal with words with holes. A hole is a position containing an unknown symbol. Such a partial word (see, e.g., [7]) matches any word where the holes are filled arbitrarily by
any symbol. We extend this model and consider fuzzy words, where each hole may contain only symbols from a specific subset of the alphabet. We show here how to adjust the general Parikh matrix mapping to deal with counts of scattered fuzzy subwords.

We will proceed as follows. After presenting basic terms and notation in Section 2, we will briefly introduce the Parikh matrix mapping and its extensions in Section 3. In Section 4 we provide the extension of the q-matrix mapping from [6] to the case of the general Parikh matrix mapping. In Section 5, we describe an unambiguous morphism encoding words to q-matrices. Finally, in Section 6 we present an extension of the Parikh L-morphism capable of counting subword occurrences of fuzzy words.

2. Basic notions

Let $\Sigma = \{s_1, s_2, \ldots, s_k\}$ denote a fixed alphabet of size $k \ge 1$. We denote the sets of all words and of all words of length $i \ge 0$ over $\Sigma$ by $\Sigma^*$ and $\Sigma^i$, respectively; the empty word is denoted by $\lambda$, the length of a word $w$ by $|w|$, the size of a language $L$ by $|L|$ and the mirror image of a word $w$ by $w^R$. Without further notice, any language under consideration is assumed to be a finite language over $\Sigma$. We identify the singleton language $\{w\}$ with the word $w$. For two words $x, y \in \Sigma^*$ we denote $\delta_{x,y} = (\text{if } x = y \text{ then } 1 \text{ else } 0)$.

The words $t$, $u$, $v$ are called prefix, factor, suffix of the word $tuv$, respectively (any of them may be empty). A language is a print language if none of its words contains the factor $aa$ for any symbol $a \in \Sigma$. A language is prefix-closed [factorial] if it contains with each word all its prefixes [all its factors]. The prefix closure [the factorial closure] of a language $L$, denoted as $P(L)$ [as $F(L)$], is the smallest prefix-closed [factorial] language containing $L$.

The symbol decomposition of a word $w$ is $w$ expressed as a concatenation of symbols from $\Sigma$: $w = a_0 a_1 \cdots a_{n-1}$, where $a_i \in \Sigma$, $i = 0, 1, \ldots, n-1$. The set $[n] = \{0, 1, 2, \ldots, n-1\}$ is the set of positions in $w$ (positions are numbered starting from 0). A set $\iota = \{i_0 < \cdots < i_{m-1}\} \subseteq [n]$, $m \ge 0$, is called a subword position in $w$. The word $\sigma_w(\iota) = a_{i_0} a_{i_1} \cdots a_{i_{m-1}}$ is the (scattered) subword occurring at position $\iota$ in $w$; the position $\iota$ is an occurrence of the word $\sigma_w(\iota)$. The number of distinct occurrences of the subword $u$ in $w$ is denoted as $|w|_u$. For example, $|babbaba|_{ab} = 4$, $|babbaba|_{bab} = 6$. The only occurrence of $\lambda$ in a word $w$ is $\emptyset$, thus $|w|_\lambda = 1$ for every $w$. The number of subword occurrences can be computed by a recurrent formula (see (6.3.3) in [8])

$$|wa|_{ub} = |w|_{ub} + \delta_{a,b}\,|w|_u, \tag{1}$$

where $w, u \in \Sigma^*$ and $a, b \in \Sigma$.
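The recurrence (1) translates directly into a small dynamic program. The following is a minimal sketch, not taken from the paper; the function name `count_subword` and the representation of words as Python strings are assumptions made for illustration.

```python
def count_subword(w: str, u: str) -> int:
    """Number of scattered-subword occurrences |w|_u, computed with recurrence (1)."""
    # counts[j] holds |w'|_{u[:j]} for the prefix w' of w processed so far.
    counts = [1] + [0] * len(u)  # |lambda|_lambda = 1, |lambda|_x = 0 for x != lambda
    for a in w:
        # Traverse right to left so counts[j - 1] still refers to the old prefix w'.
        for j in range(len(u), 0, -1):
            if u[j - 1] == a:
                counts[j] += counts[j - 1]
    return counts[len(u)]


if __name__ == "__main__":
    print(count_subword("babbaba", "ab"), count_subword("babbaba", "bab"))
```

The sketch runs in time proportional to $|w| \cdot |u|$ and reproduces the two counts quoted above for the word babbaba.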
3. The Parikh L-morphism

The Parikh mapping [9], well known from classical formal language theory, assigns to each word $w$ the vector

$$[\,|w|_{s_1}, |w|_{s_2}, \ldots, |w|_{s_k}\,].$$

This mapping has been extended in [1] to the Parikh matrix mapping assigning matrices to words. The non-zero entries in the Parikh matrix of a word $w$ are the occurrence counts of the factors of the word $s_1 s_2 \cdots s_k$ in $w$. A further extension of this mapping was considered in [2], where the matrix entries are occurrence counts of the factors of an arbitrary (but fixed) word $u$. We will consider here an even more general Parikh L-mapping $\Psi_L$ introduced in [3]. The mapping can be described as follows.

Consider a language $L$. Let $\mathcal{M}_L$ denote the set of all upper triangular $|P(L)| \times |P(L)|$ matrices with unit main diagonal, where the rows and columns of the matrix are labeled by words from $P(L)$ in alphabetical order (based on non-decreasing length, while the words of the same length are ordered lexicographically). The entry in $\Psi_L(w)$ at position $u, v$ may be non-zero only if $u$ is a prefix of $v$, i.e., $v = ux$ for some word $x$. In that case, the entry is equal to $|w|_x$. Formally, this is expressed as

$$[\Psi_L(w)]_{u,v} = \sum_{x \in F(L)} |w|_x\, \delta_{ux,v}.$$

$\Psi_L$ is a morphism mapping the free monoid $(\Sigma^*, \cdot)$ to the multiplicative monoid $(\mathcal{M}_L, \cdot)$ (with the usual matrix multiplication).

Example 1. Let $\Sigma = \{a, b\}$. Consider the language $L = \{\lambda, a, ab, abb, bab, baba\}$. Then $P(L) = \{\lambda, a, b, ab, ba, abb, bab, baba\}$ and
(rows and columns are labeled, in this order, by $\lambda, a, b, ab, ba, abb, bab, baba$)

$$\Psi_L(a) = \begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix},
\qquad
\Psi_L(b) = \begin{pmatrix}
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.$$

Then, for $w = babba$,

$$\Psi_L(w) = \begin{pmatrix}
1 & 2 & 3 & 2 & 4 & 1 & 2 & 2\\
0 & 1 & 0 & 3 & 0 & 3 & 0 & 0\\
0 & 0 & 1 & 0 & 2 & 0 & 2 & 2\\
0 & 0 & 0 & 1 & 0 & 3 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 3 & 4\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 2\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.$$
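The construction of Example 1 can be replayed mechanically. The sketch below is an illustration only: the helper names (`prefix_closure`, `generator`, `matmul`, `parikh_L_matrix`) are hypothetical, words are plain Python strings with the empty string standing for $\lambda$, and matrices are nested lists of integers.

```python
def prefix_closure(L):
    """P(L), ordered by length and then lexicographically."""
    words = {w[:i] for w in L for i in range(len(w) + 1)}
    return sorted(words, key=lambda w: (len(w), w))


def generator(P, a):
    """Psi_L(a): unit diagonal plus a 1 at (u, ua) whenever ua is in P(L)."""
    idx = {u: i for i, u in enumerate(P)}
    n = len(P)
    M = [[int(i == j) for j in range(n)] for i in range(n)]
    for u in P:
        if u + a in idx:
            M[idx[u]][idx[u + a]] = 1
    return M


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def parikh_L_matrix(L, w):
    """Return (P(L), Psi_L(w)), the latter as a product of generator matrices."""
    P = prefix_closure(L)
    n = len(P)
    M = [[int(i == j) for j in range(n)] for i in range(n)]
    for a in w:
        M = matmul(M, generator(P, a))
    return P, M


if __name__ == "__main__":
    P, M = parikh_L_matrix(["", "a", "ab", "abb", "bab", "baba"], "babba")
    for label, row in zip(P, M):
        print(repr(label), row)
```

Because $\Psi_L$ is a morphism, multiplying the generator matrices symbol by symbol is enough; the first row of the result lists $|w|_v$ for every $v \in P(L)$.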
4. An extension of the q-matrix morphism

It is well known that the Parikh matrix mapping is ambiguous; however, it encodes more information on words than the Parikh vector. The Parikh matrix contains the Parikh vector on its second diagonal. It is easy to find examples of pairs of words with different Parikh matrices but identical Parikh vectors. On the other hand, e.g., the words $acb$ and $cab$ over the alphabet $\{a, b, c\}$ have the same Parikh matrix.

In [6], the authors introduced an extension of the Parikh mapping called the Parikh q-matrix mapping. The mapping is a morphism representing a word over a k-letter alphabet as a $k \times k$ upper triangular matrix with entries that are non-negative integral polynomials in the variable $q$. By appropriately embedding the k-letter alphabet $\Sigma$ into a (k + 1)-letter alphabet and putting $q = 1$, they obtain, as a particular case, the Parikh matrix mapping for the k-symbol alphabet from [1]. The Parikh q-matrix mapping is still ambiguous: the two words $acb$ and $cab$ still have the same Parikh q-matrix. However, the words $abaaba$ and $baaaab$ over the two-letter alphabet have the same Parikh matrix but different Parikh q-matrices. Thus the q-matrix mapping produces matrices that carry more information about the argument $w$ than the numerical Parikh matrix.

We will extend this mapping here to any print language $L$. Our general q,L-matrix mapping will be a straightforward generalization of both the original Parikh matrix mapping and the Parikh q-matrix mapping.

Assume a print language $L$, a word $w = a_0 a_1 \cdots a_{n-1} \in \Sigma^*$, words $u \in P(L)$ and $t = t_0 t_1 \cdots t_{m-1} \in F(L)$, $t_i \in \Sigma$, $m \ge 0$, and an occurrence $\iota = \{i_0 < i_1 < \cdots < i_{m-1}\}$ of $t$ in $w$. (By default, if $m = 0$ then $t = \lambda$ and $\iota = \emptyset$.) Thus

$$w = x_0\, a_{i_0}\, x_1\, a_{i_1} \cdots x_{m-1}\, a_{i_{m-1}}\, x_m$$

where $t = t_0 t_1 \cdots t_{m-1} = a_{i_0} a_{i_1} \cdots a_{i_{m-1}}$. For $j = 0, \ldots, m$, let $B_j^{u,t} = \{b \in \Sigma \mid u t_0 t_1 \cdots t_{j-1} b \in P(L)\}$. (By default, $B_0^{u,t} = \{b \in \Sigma \mid ub \in P(L)\}$.) Denote

$$\varepsilon_{w,u,t}(\iota) = \sum_{j=0}^{|t|} \sum_{b \in B_j^{u,t}} |x_j|_b.$$

Consider now the polynomial

$$S^L_{w,u,t}(q) = \sum_{\iota \subseteq [|w|],\ \sigma_w(\iota) = t} q^{\varepsilon_{w,u,t}(\iota)}. \tag{2}$$
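To make definition (2) concrete, the polynomial can be evaluated by brute-force enumeration of all occurrences of $t$ in $w$. The sketch below is illustrative only: the function name `S` mirrors the notation, polynomials are returned as coefficient lists (index = exponent of $q$), and the language used in the usage lines is an assumed prefix-closed print language over $\{a, b\}$ chosen for demonstration.

```python
from itertools import combinations


def prefix_closure(L):
    return {w[:i] for w in L for i in range(len(w) + 1)}


def S(L, w, u, t):
    """Coefficient list of S^L_{w,u,t}(q), following definition (2) literally."""
    P = prefix_closure(L)
    alphabet = set("".join(L))
    m = len(t)
    # B[j] = symbols b such that u t_0 ... t_{j-1} b belongs to P(L), j = 0..m.
    B = [{b for b in alphabet if u + t[:j] + b in P} for j in range(m + 1)]
    coeffs = [0] * (len(w) + 1)
    for iota in combinations(range(len(w)), m):
        if "".join(w[i] for i in iota) != t:
            continue  # iota is not an occurrence of t
        # x_0, ..., x_m are the factors of w between the chosen positions.
        cuts = [-1, *iota, len(w)]
        gaps = [w[cuts[j] + 1:cuts[j + 1]] for j in range(m + 1)]
        exponent = sum(1 for j in range(m + 1) for c in gaps[j] if c in B[j])
        coeffs[exponent] += 1
    return coeffs


if __name__ == "__main__":
    L = ["", "a", "b", "ab", "ba", "abb", "bab", "baba"]
    print(S(L, "bab", "", "b"))   # coefficients of S for u = lambda, t = b
    print(S(L, "bab", "a", "b"))  # coefficients of S for u = a, t = b
```

Enumeration is exponential in general and is meant only as a reference implementation of the definition; summing the coefficients gives the plain subword count $|w|_t$.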
Lemma 2. Let $L$ be a print language, $w \in \Sigma^*$, $u, ut \in P(L)$.

1. If $|w|_t = 0$ then $S^L_{w,u,t}(q) = 0$.

2. Denote $|w|_{(u)} = \sum_{b;\, ub \in P(L)} |w|_b$. Then

$$S^L_{w,u,\lambda}(q) = q^{|w|_{(u)}}. \tag{3}$$

3.

$$S^L_{w,u,t}(1) = |w|_t. \tag{4}$$

4. Assume a symbol $a \in \Sigma$ such that $t$ does not end with $a$. Then

$$S^L_{wa,u,t}(q) = S^L_{w,u,t}(q)\, S^L_{a,ut,\lambda}(q). \tag{5}$$

5. If $uta \in P(L)$, then

$$S^L_{wa,u,ta}(q) = S^L_{w,u,t}(q) + S^L_{w,u,ta}(q). \tag{6}$$
Proof.
1. The sum in (2) consists of zero terms in this case.
2. (3) follows from the fact that $\varepsilon_{w,u,\lambda}(\emptyset) = \sum_{b;\, ub \in P(L)} |w|_b = |w|_{(u)}$.
3. $S^L_{w,u,t}(1) = \sum_{\iota \subseteq [n],\ \sigma_w(\iota) = t} 1^{\varepsilon_{w,u,t}(\iota)} = |\{\iota \subseteq [n] \mid \sigma_w(\iota) = t\}| = |w|_t$.
4. In the equality (5), $S^L_{a,ut,\lambda}(q) = q$ if $uta \in P(L)$, otherwise $S^L_{a,ut,\lambda}(q) = 1$. In the former case, by augmenting $w$ by $a$ each exponent from $S^L_{w,u,t}(q)$ will increase by 1; in the latter case it will not be changed.
5. Since $L$ is a print language, no word of $P(L)$ contains the factor $aa$; in particular $utaa \notin P(L)$. The first term on the right-hand side of (6) corresponds to all occurrences of $ta$ in $wa$ such that the last $a$ in the occurrence is the last $a$ in $wa$; the second term corresponds to the remaining occurrences of $ta$ in $wa$. □

We will now define the Parikh q,L-matrix morphism $\Psi_{q,L}$ on $\Sigma^*$, assigning to words $|P(L)| \times |P(L)|$ matrices with rows and columns labeled as in $\mathcal{M}_L$. Let $a \in \Sigma$. The matrix $\Psi_{q,L}(a)$ is given as

$$[\Psi_{q,L}(a)]_{u,v} = S^L_{a,u,\lambda}(q)\, \delta_{u,v} + \delta_{ua,v}, \qquad \text{for } u, v \in P(L).$$

Thus $[\Psi_{q,L}(a)]_{u,v} = q$ if $u = v$ and $ua \in P(L)$; $[\Psi_{q,L}(a)]_{u,v} = 1$ if $u = v$ and $ua \notin P(L)$, or if $ua = v$; otherwise $[\Psi_{q,L}(a)]_{u,v} = 0$. The matrix $\Psi_{q,L}(a)$ can be obtained from the matrix $\Psi_L(a)$ by replacing $[\Psi_L(a)]_{u,u}$ by $q$ whenever $ua \in P(L)$.

Theorem 3. Let $L$ be a print language and $w \in \Sigma^*$. Let $u, v \in P(L)$. Then

$$[\Psi_{q,L}(w)]_{u,v} = \sum_{t \in F(L)} S^L_{w,u,t}(q)\, \delta_{ut,v}. \tag{7}$$
Proof. Induction on $|w|$. If $w = \lambda$, then the equality (3) implies $S^L_{\lambda,u,\lambda}(q) = 1$ and

$$\sum_{t \in F(L)} S^L_{\lambda,u,t}(q)\, \delta_{ut,v} = S^L_{\lambda,u,\lambda}(q)\, \delta_{u\lambda,v} = \delta_{u,v}.$$

Hence (7) holds for $w = \lambda$, since $\Psi_{q,L}(\lambda)$ is the unit matrix. Let (7) be true for some $w \in \Sigma^*$ and let $a \in \Sigma$. Then, by the inductive hypothesis,

$$\begin{aligned}
[\Psi_{q,L}(wa)]_{u,v} &= \sum_{z \in P(L)} [\Psi_{q,L}(w)]_{u,z}\, [\Psi_{q,L}(a)]_{z,v} \\
&= \sum_{z \in P(L)} \sum_{t \in F(L)} S^L_{w,u,t}(q)\, \delta_{ut,z} \left( S^L_{a,z,\lambda}(q)\, \delta_{z,v} + \delta_{za,v} \right) \\
&= \sum_{z \in P(L)} \sum_{t \in F(L)} S^L_{w,u,t}(q)\, \delta_{ut,z}\, S^L_{a,z,\lambda}(q)\, \delta_{z,v} + \sum_{z \in P(L)} \sum_{t \in F(L)} S^L_{w,u,t}(q)\, \delta_{ut,z}\, \delta_{za,v} && (8) \\
&= \sum_{t \in F(L)} S^L_{w,u,t}(q)\, S^L_{a,ut,\lambda}(q)\, \delta_{ut,v} + \sum_{t \in F(L)} S^L_{w,u,t}(q)\, \delta_{uta,v}. && (9)
\end{aligned}$$

If $u$ is not a prefix of $v$, the equality (7) for $wa$ is satisfied, since both the expression (9) and the right-hand side of (7) evaluate to 0. If $u$ is a prefix of $v$, we consider two cases.

1. $v = ut$ for some $t \notin F(L)a$. In this case $\delta_{ut'a,v} = 0$ for every $t' \in F(L)$ and (9) yields

$$[\Psi_{q,L}(wa)]_{u,v} = S^L_{w,u,t}(q)\, S^L_{a,v,\lambda}(q).$$

Applying item 4 of Lemma 2, we obtain

$$[\Psi_{q,L}(wa)]_{u,v} = S^L_{wa,u,t}(q) = \sum_{t' \in F(L)} S^L_{wa,u,t'}(q)\, \delta_{ut',v}.$$

2. $v = ut'a$ for some $t' \in F(L) - F(L)a$ (since $L$ is a print language, $v$ cannot have the suffix $aa$). Then

$$[\Psi_{q,L}(wa)]_{u,v} = S^L_{w,u,t'a}(q)\, S^L_{a,ut'a,\lambda}(q) + S^L_{w,u,t'}(q). \tag{10}$$

According to item 2 of Lemma 2, $S^L_{a,ut'a,\lambda}(q) = q^{|a|_{(ut'a)}} = q^0 = 1$, since $L$ is a print language. Using item 5 of Lemma 2, Eq. (10) can be rewritten as

$$[\Psi_{q,L}(wa)]_{u,v} = S^L_{w,u,t'a}(q) + S^L_{w,u,t'}(q) = S^L_{wa,u,t'a}(q) = \sum_{t \in F(L)} S^L_{wa,u,t}(q)\, \delta_{ut,v}. \qquad \Box$$
Corollary 4. If $x$ is a factor of some word from $L$ and $u$, $ux$ are prefixes of words from $L$, then

$$S^L_{w,u,x}(q) = [\Psi_{q,L}(w)]_{u,ux}.$$

Observing item 3 of Lemma 2 we obtain

Corollary 5. At $q = 1$, for every $w \in \Sigma^*$, $\Psi_{q,L}(w)$ evaluates to $\Psi_L(w)$.

One can easily see that the Parikh q,L-matrix morphism $\Psi_{q,P(s_1 s_2 \cdots s_k)}$ is on $\{s_1, s_2, \ldots, s_k\}^*$ identical to the morphism $\Psi_q$ from [6] for the extended alphabet $\Sigma = \{s_1, s_2, \ldots, s_k, s_{k+1}\}$. Corollary 5 does not require any artificial extension of the alphabet $\Sigma$. The authors of [6] showed that the morphism $\Psi_q$ is not injective; however, it may provide more information on $w$ than the classical Parikh matrix. Obviously, the same is true for $\Psi_{q,L}$.

Example 6. Let $L = \{\lambda, a, b, ab, ba, abb, bab, baba\}$. Then
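(rows and columns are labeled, in this order, by $\lambda, a, b, ab, ba, abb, bab, baba$)

$$\Psi_{q,L}(a) = \begin{pmatrix}
q & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & q & 0 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & q & 1\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix},
\qquad
\Psi_{q,L}(b) = \begin{pmatrix}
q & 0 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & q & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & q & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & q & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.$$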
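By computing the product $\Psi_{q,L}(bab) = \Psi_{q,L}(b)\,\Psi_{q,L}(a)\,\Psi_{q,L}(b)$ we obtain (with the same labeling of rows and columns by $\lambda, a, b, ab, ba, abb, bab, baba$)

$$\Psi_{q,L}(bab) = \begin{pmatrix}
q^3 & q^2 & q^2+q & q & q & 0 & 1 & 0\\
0 & q^2 & 0 & 2q & 0 & 1 & 0 & 0\\
0 & 0 & q & 0 & q & 0 & 1 & 0\\
0 & 0 & 0 & q^2 & 0 & q+1 & 0 & 0\\
0 & 0 & 0 & 0 & q^2 & 0 & 2q & 1\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & q & 1\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.$$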
Let us check the value $[\Psi_{q,L}(bab)]_{\lambda,b}$ using Theorem 3. Observe that $B_0^{\lambda,b} = \{a, b\}$, $B_1^{\lambda,b} = \{a\}$, and

$$[\Psi_{q,L}(bab)]_{\lambda,b} = \sum_{t \in F(L)} S^L_{bab,\lambda,t}(q)\, \delta_{\lambda t,b} = S^L_{bab,\lambda,b}(q) = \sum_{\iota \subseteq [3],\ \sigma_{bab}(\iota) = b} q^{\varepsilon_{bab,\lambda,b}(\iota)} = q^{0+1} + q^{2+0} = q^2 + q$$

since, for the first occurrence of $b$ in $bab$, the preceding factor $\lambda$ contains 0 symbols from the set $\{a, b\}$ and the following factor $ab$ contains 1 symbol from $\{a\}$; for the second occurrence, the preceding factor $ba$ contains 2 symbols from the set $\{a, b\}$ and the following factor $\lambda$ contains 0 symbols from $\{a\}$. Similarly, $B_0^{a,b} = B_1^{a,b} = \{b\}$ and

$$[\Psi_{q,L}(bab)]_{a,ab} = \sum_{t \in F(L)} S^L_{bab,a,t}(q)\, \delta_{at,ab} = S^L_{bab,a,b}(q) = \sum_{\iota \subseteq [3],\ \sigma_{bab}(\iota) = b} q^{\varepsilon_{bab,a,b}(\iota)} = q^{0+1} + q^{1+0} = 2q.$$
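The two entries checked above can also be obtained by multiplying the generator matrices with polynomial entries. The following sketch is an illustration under stated assumptions: polynomials in $q$ are represented as coefficient lists (index = exponent, the empty list is the zero polynomial), and the helper names `padd`, `pmul`, `q_generator`, `matmul` and `q_matrix` are hypothetical.

```python
def padd(p, r):
    """Sum of two polynomials given as coefficient lists; trailing zeros trimmed."""
    n = max(len(p), len(r))
    s = [(p[i] if i < len(p) else 0) + (r[i] if i < len(r) else 0) for i in range(n)]
    while s and s[-1] == 0:
        s.pop()
    return s


def pmul(p, r):
    """Product of two polynomials given as coefficient lists."""
    if not p or not r:
        return []
    s = [0] * (len(p) + len(r) - 1)
    for i, x in enumerate(p):
        for j, y in enumerate(r):
            s[i + j] += x * y
    return s


def q_generator(P, a):
    """Psi_{q,L}(a): diagonal entry q if ua is in P(L), else 1; entry 1 at (u, ua)."""
    idx = {u: i for i, u in enumerate(P)}
    n = len(P)
    M = [[[] for _ in range(n)] for _ in range(n)]
    for u in P:
        extendable = (u + a) in idx
        M[idx[u]][idx[u]] = [0, 1] if extendable else [1]
        if extendable:
            M[idx[u]][idx[u + a]] = [1]
    return M


def matmul(A, B):
    n = len(A)
    C = [[[] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = []
            for k in range(n):
                acc = padd(acc, pmul(A[i][k], B[k][j]))
            C[i][j] = acc
    return C


def q_matrix(P, w):
    n = len(P)
    M = [[[1] if i == j else [] for j in range(n)] for i in range(n)]
    for a in w:
        M = matmul(M, q_generator(P, a))
    return M


if __name__ == "__main__":
    P = ["", "a", "b", "ab", "ba", "abb", "bab", "baba"]
    M = q_matrix(P, "bab")
    print(M[0][2], M[1][3])  # entries (lambda, b) and (a, ab) as coefficient lists
```

Summing the coefficients of each entry corresponds to setting $q = 1$, which by Corollary 5 should give back the integer matrix $\Psi_L(bab)$.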
One of the important properties of the Parikh matrix mapping is a characterization of the inverse of the Parikh matrix of a word $w$ in terms of the alternate Parikh matrix of the mirror image of $w$ [1]. This characterization was proved to be valid for an extension of the Parikh matrix mapping counting subword occurrences of factors of any fixed word $u$ [2], but only in the case when $u$ does not contain a factor $aa$, i.e., when $\{u\}$ is a print language. A modified version of this theorem was proved in [6] for the q-matrix mapping. We will show here that a similar characterization is still valid for matrices obtained by our mapping $\Psi_{q,L}$.

We will first define the alternate Parikh q,L-matrix morphism $\overline{\Psi}_{q,L}$ by setting, for $a \in \Sigma$ and $u, v \in P(L)$,

$$[\overline{\Psi}_{q,L}(a)]_{u,v} = (-1)^{|u|+|v|} \left( \frac{q}{S^L_{a,u,\lambda}(q)}\, \delta_{u,v} + \delta_{ua,v} \right).$$

The matrix $\overline{\Psi}_{q,L}(a)$ is indeed a q-matrix, since the only values the expression $S^L_{a,u,\lambda}(q)$ can take are 1 and $q$. Before we formulate the inverse matrix theorem for our case, we need to prove the following lemma.

Lemma 7. Let $a \in \Sigma$, let $L \subset \Sigma^*$ be a print language and $k = |P(L)|$. Then
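$$\Psi_{q,L}(a)\, \overline{\Psi}_{q,L}(a) = q I_k$$

where $I_k$ is the $k \times k$ identity matrix.

Proof. Let $u, v \in P(L)$. Then

$$\begin{aligned}
[\Psi_{q,L}(a)\, \overline{\Psi}_{q,L}(a)]_{u,v} &= \sum_{z \in P(L)} [\Psi_{q,L}(a)]_{u,z}\, [\overline{\Psi}_{q,L}(a)]_{z,v} \\
&= \sum_{z \in P(L)} \left( S^L_{a,u,\lambda}(q)\, \delta_{u,z} + \delta_{ua,z} \right) (-1)^{|z|+|v|} \left( \frac{q}{S^L_{a,z,\lambda}(q)}\, \delta_{z,v} + \delta_{za,v} \right) \\
&= S^L_{a,u,\lambda}(q)\, (-1)^{|u|+|v|} \left( \frac{q}{S^L_{a,u,\lambda}(q)}\, \delta_{u,v} + \delta_{ua,v} \right) + (-1)^{|ua|+|v|} \left( \frac{q}{S^L_{a,ua,\lambda}(q)}\, \delta_{ua,v} + \delta_{uaa,v} \right).
\end{aligned}$$

We will use the fact that $\delta_{uaa,v} = 0$ since $L$ is a print language. If $u = v$ then

$$[\Psi_{q,L}(a)\, \overline{\Psi}_{q,L}(a)]_{u,v} = S^L_{a,u,\lambda}(q)\, (-1)^{|u|+|u|}\, \frac{q}{S^L_{a,u,\lambda}(q)} = q.$$

If $ua = v$ then $S^L_{a,u,\lambda}(q) = q$, since $ua \in P(L)$, and $S^L_{a,ua,\lambda}(q) = 1$, since $uaa \notin P(L)$. Then

$$[\Psi_{q,L}(a)\, \overline{\Psi}_{q,L}(a)]_{u,v} = S^L_{a,u,\lambda}(q)\, (-1)^{|u|+|v|} + (-1)^{|ua|+|v|}\, q\, \frac{1}{S^L_{a,ua,\lambda}(q)} = -q + q = 0.$$

If $u \ne v$ and $ua \ne v$ then

$$[\Psi_{q,L}(a)\, \overline{\Psi}_{q,L}(a)]_{u,v} = 0. \qquad \Box$$

Now we can provide the generalization of the inverse matrix theorem.

Theorem 8. Let $L \subset \Sigma^*$ be a print language, $k = |P(L)|$, and $w \in \Sigma^*$. Then

$$\Psi_{q,L}(w)\, \overline{\Psi}_{q,L}(w^R) = q^{|w|} I_k.$$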
Proof. Induction on $|w|$. The assertion is trivial for $w = \lambda$. Let it be true for some word $w$ and let $a \in \Sigma$. Then, using Lemma 7 and the inductive hypothesis, we obtain

$$\begin{aligned}
\Psi_{q,L}(wa)\, \overline{\Psi}_{q,L}\big((wa)^R\big) &= \Psi_{q,L}(w)\, \Psi_{q,L}(a)\, \overline{\Psi}_{q,L}(a)\, \overline{\Psi}_{q,L}(w^R) \\
&= \Psi_{q,L}(w) \left( \Psi_{q,L}(a)\, \overline{\Psi}_{q,L}(a) \right) \overline{\Psi}_{q,L}(w^R) \\
&= \Psi_{q,L}(w)\, q I_k\, \overline{\Psi}_{q,L}(w^R) = q\, \Psi_{q,L}(w)\, \overline{\Psi}_{q,L}(w^R) = q\, q^{|w|} I_k = q^{|wa|} I_k. \qquad \Box
\end{aligned}$$
5. An unambiguous q-matrix morphism

In [5] a generalization of the Parikh mapping, called the Parikh q-matrix encoding, was provided. This is an injective mapping, again assigning to words matrices with polynomial entries. Though this encoding can be computed as a matrix product, it is not a morphism. In the remaining part of this section, we will provide a rather simple injective morphism assigning to words matrices whose entries are polynomials in $q$ with non-negative coefficients.

Let $L$ be a language. We consider the set of $(1 + |P(L)|) \times (1 + |P(L)|)$ matrices with entries being polynomials in the variable $q$. The rows and columns will be indexed starting from 0; the 0-th row and column will be labeled by the dummy word $\perp$ and the remaining rows and columns will be indexed and labeled in the same way as in $\mathcal{M}_L$. We denote this set of matrices as $\mathcal{M}_{\perp \cup L}[q]$. In fact, we will need just the matrices where the elements of the 0-th row are polynomials in the variable $q$ and all other entries are integers.

Let $\Sigma_L = \{b_1, b_2, \ldots, b_r\} \subseteq \Sigma$ be the set of all symbols occurring in $L$. Let us denote, for $1 \le s \le r$, as $j_s$ the first index in the matrices from $\mathcal{M}_{\perp \cup L}[q]$ labeled by a word from $\Sigma^* b_s$. Without loss of generality, we may assume $j_1 < j_2 < \cdots < j_r$. By default, we denote $j_0 = 1$. We define the morphism $\hat{\Psi}_{q,L} : \Sigma^* \to \mathcal{M}_{\perp \cup L}[q]$ as follows. Let $a \in \Sigma$. For $0 \le i, j \le |P(L)|$,

$$[\hat{\Psi}_{q,L}(a)]_{i,j} = \begin{cases}
q & \text{if } i = j = 0,\\
1 & \text{if } i = 0,\ j \ge 1 \text{ and the column } j \text{ is labeled by a word from } \Sigma^* a,\\
0 & \text{if } i = 0,\ j \ge 1 \text{ otherwise},\\
0 & \text{if } i \ge 1,\ j = 0,\\
[\Psi_L(a)]_{i,j} & \text{otherwise}.
\end{cases}$$
Assume a word $w \in \Sigma^*$. The last two lines of the definition of the morphism $\hat{\Psi}_{q,L}$ guarantee that the matrix obtained from $\hat{\Psi}_{q,L}(w)$ by omitting the 0-th row and the 0-th column is precisely the matrix $\Psi_L(w)$. We will prove that $\hat{\Psi}_{q,L}$ is an injective mapping if all symbols from $\Sigma$ appear in $L$. Let us denote, for $s \in \{1, 2, \ldots, r\}$, the polynomial $p_s(w) = [\hat{\Psi}_{q,L}(w)]_{0,j_s}$.

Proposition 9. Let $w \in \Sigma^*$, $a \in \Sigma$, $s \in \{1, 2, \ldots, r\}$. Then

1. $[\hat{\Psi}_{q,L}(w)]_{0,0} = q^{|w|}$, $[\hat{\Psi}_{q,L}(w)]_{0,1} = 0$.
2. All coefficients of all polynomials in the 0-th row of $\hat{\Psi}_{q,L}(wa)$ are non-negative.
3. The degree of each non-zero polynomial $[\hat{\Psi}_{q,L}(w)]_{0,j}$, $2 \le j \le |P(L)|$, is at most $|w| - 1$.
4. For $j_{s-1} < j < j_s$, $[\hat{\Psi}_{q,L}(wa)]_{0,j} = \sum_{i=1}^{j} c_i\, [\hat{\Psi}_{q,L}(w)]_{0,i}$, where $c_i \ge 0$ for $i = 1, 2, \ldots, j$.
5. $p_s(wa) = q^{|w|}\, \delta_{a,b_s} + \sum_{i=1}^{j_{s-1}} c_i\, [\hat{\Psi}_{q,L}(w)]_{0,i} + p_s(w)$, where $c_i \ge 0$ for $i = 1, 2, \ldots, j_{s-1}$.

Proof. All entries in $\Psi_L(w)$ are non-negative. The assertion 1 is straightforward; 2, 3 and 4 can easily be proved jointly by induction on $|w|$. □

Lemma 10. Assume a word $w = a_0 a_1 \cdots a_{n-1} \in \Sigma^*$, $a_i \in \Sigma$, $s \in \{1, 2, \ldots, r\}$, and $0 \le m < n$. Then $a_m = b_s$ iff $p_s(w)$ is the first polynomial in the sequence $p_1(w), p_2(w), \ldots, p_r(w)$ having a non-zero coefficient at $q^m$.

Proof. Induction on $|w|$ using Proposition 9. The assertion is trivially true for $w = \lambda$. Let it be true for a word $w$ and let us consider the word $wa$ for some $a \in \Sigma$. For $m < |w|$ the assertion follows from items 4 and 5 of Proposition 9 and the inductive hypothesis. For $m = |w|$ it follows from items 3 and 5 of Proposition 9 and the inductive hypothesis. □

Thus all occurrences of the symbol $b_s$ in $w$ can be decoded from the polynomials $p_1(w), p_2(w), \ldots, p_s(w)$. Consequently, if $\Sigma_L = \Sigma$, the sequence $p_1(w), p_2(w), \ldots, p_k(w)$ uniquely determines $w$.

Corollary 11. If $L$ is a language containing all symbols from $\Sigma$, the morphism $\hat{\Psi}_{q,L}$ is injective.

Example 12. Let $L = \{a, abc\}$. In this case $P(L) = P(abc)$ and $\Psi_L$ is the usual Parikh matrix mapping for the alphabet of size 3. Then
(rows and columns are labeled, in this order, by $\perp, \lambda, a, ab, abc$)

$$\hat{\Psi}_{q,L}(a) = \begin{pmatrix}
q & 0 & 1 & 0 & 0\\
0 & 1 & 1 & 0 & 0\\
0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 1
\end{pmatrix},
\quad
\hat{\Psi}_{q,L}(b) = \begin{pmatrix}
q & 0 & 0 & 1 & 0\\
0 & 1 & 0 & 0 & 0\\
0 & 0 & 1 & 1 & 0\\
0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 1
\end{pmatrix},
\quad
\hat{\Psi}_{q,L}(c) = \begin{pmatrix}
q & 0 & 0 & 0 & 1\\
0 & 1 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 1 & 1\\
0 & 0 & 0 & 0 & 1
\end{pmatrix},$$

$$\hat{\Psi}_{q,L}(abcb) = \begin{pmatrix}
q^4 & 0 & 1 & q^3+q+2 & q^2+q+1\\
0 & 1 & 1 & 2 & 1\\
0 & 0 & 1 & 2 & 1\\
0 & 0 & 0 & 1 & 1\\
0 & 0 & 0 & 0 & 1
\end{pmatrix}.$$
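The decoding argument of Lemma 10 can be tried out on Example 12. The sketch below is illustrative and makes several assumptions: only the 0-th row of the matrix is tracked (this suffices for decoding), polynomials are coefficient lists, the prefix-closed language is passed as an ordered list `P` of strings, and the names `encode_row0` and `decode` are hypothetical.

```python
def padd(p, r):
    n = max(len(p), len(r))
    s = [(p[i] if i < len(p) else 0) + (r[i] if i < len(r) else 0) for i in range(n)]
    while s and s[-1] == 0:
        s.pop()
    return s


def encode_row0(P, w):
    """0-th row of the extended matrix of w; index 0 is the column labeled by the dummy word."""
    idx = {v: i + 1 for i, v in enumerate(P)}
    row = [[1]] + [[] for _ in P]          # 0-th row of the identity matrix
    for a in w:
        new = [[0] + row[0]]               # corner entry is multiplied by q
        for v in P:
            p = row[idx[v]]                # diagonal 1 of Psi_L(a)
            if v.endswith(a):
                p = padd(p, row[0])            # the 1 in the 0-th row of the generator
                p = padd(p, row[idx[v[:-1]]])  # entry (u, ua) of Psi_L(a), with v = ua
            new.append(p)
        row = new
    return row


def decode(P, row):
    """Recover w from the 0-th row, following Lemma 10."""
    first_col = {}                         # symbol b_s -> j_s, inserted in increasing j_s
    for j, v in enumerate(P, start=1):
        if v and v[-1] not in first_col:
            first_col[v[-1]] = j
    n = len(row[0]) - 1                    # the corner entry is q^{|w|}
    w = []
    for m in range(n):
        for symbol, j in first_col.items():
            p = row[j]
            if m < len(p) and p[m] != 0:   # first polynomial with a q^m term
                w.append(symbol)
                break
    return "".join(w)


if __name__ == "__main__":
    P = ["", "a", "ab", "abc"]
    row = encode_row0(P, "abcb")
    print(row)
    print(decode(P, row))
```

The decoder relies on the assumption of Corollary 11 that every alphabet symbol occurs in the language; for other languages some symbols would leave no trace in the 0-th row.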
Let us consider the morphism $\hat{\Psi}_{q,P(s_1 s_2 \cdots s_k)}$ (as in Example 12). For a word $w$, let us denote as $p^w_j(q)$ the polynomial $[\hat{\Psi}_{q,P(s_1 s_2 \cdots s_k)}(w)]_{0,j+1}$, $0 \le j \le k$. These polynomials encode the complete information on the word $w$. They satisfy the following recurrence relations, easily provable from the fact that $\hat{\Psi}_{q,P(s_1 s_2 \cdots s_k)}$ is a morphism:

$$\begin{aligned}
p^\lambda_j(q) &= 0, && 0 \le j \le k,\\
p^w_0(q) &= 0,\\
p^{w s_i}_j(q) &= \left( q^{|w|} + p^w_{j-1}(q) \right) \delta_{s_i,s_j} + p^w_j(q), && 1 \le i, j \le k,\\
p^{s_i w}_j(q) &= q\, p^w_j(q) + \sum_{t \in F(s_1 \cdots s_k)} |w|_t\, \delta_{s_1 \cdots s_i t,\ s_1 \cdots s_j}, && 1 \le i, j \le k.
\end{aligned}$$
6. Parikh L-matrices, partial words and fuzzy words

A word $w$ of length $n \ge 0$ with symbol decomposition $w = a_0 a_1 \cdots a_{n-1}$ is actually a mapping $w : [n] \to \Sigma$ such that $w(i) = a_i$ for $i \in [n]$. In some applications, e.g. in DNA sequencing, we do not have complete information on the word $w$. Letters in some positions are missing; such positions are referred to as holes. In such a situation we consider $w$ to be just a partial mapping, with the domain $D(w) \subseteq [n]$. Such partial mappings are called partial words. More on partial words, including some definitions which we extend here, can be found in [7].

It is practical to consider default completions of partial words. Let us extend the alphabet $\Sigma$ by an additional special symbol $\diamond$. The companion of a partial word $w$ is the word $w_\diamond$ on the alphabet $\Sigma_\diamond = \Sigma \cup \{\diamond\}$ such that, for $i \in [n]$,

$$w_\diamond(i) = \begin{cases} w(i) & \text{if } i \in D(w),\\ \diamond & \text{otherwise}. \end{cases}$$

Partial words may be identified with their companions. A partial word is complete if it does not contain $\diamond$. Thus a partial word is any word on the alphabet $\Sigma_\diamond$, where the symbol $\diamond$ is to be treated in a special way: a hole denoted by $\diamond$ can match any symbol from $\Sigma$. Two partial words $u, v \in \Sigma_\diamond^*$ are compatible, denoted as $u \uparrow v$, if $|u| = |v|$ and $u(i) = v(i)$ for all $i \in D(u) \cap D(v)$. For complete words, compatibility means equality.

In [10] a generalization of the concept of partial word called fuzzy word has been proposed. A hole in a partial word over an alphabet $\Sigma$ denotes a place where any symbol from $\Sigma$ may appear. If more information about that particular position in the word is available, the choice of symbols which may appear at that position may be restricted to some subset of the alphabet $\Sigma$ only. Such a position may be denoted, instead of $\diamond$, by that subset of $\Sigma$. A position surely containing a symbol $a$ may then be denoted by the set $\{a\}$. Let us consider a new alphabet $\Gamma = 2^\Sigma - \{\emptyset\}$ consisting of all non-empty subsets of $\Sigma$ as (new) symbols. The words over the alphabet $\Gamma$ may be considered as representing words from $\Sigma^*$ with uncertainty at each position described by the corresponding subset of $\Sigma$. Partial words may now be viewed as those where each symbol is either a singleton set or $\Sigma$ (playing the role of $\diamond$). The alphabet $\Gamma$ is partially ordered by set inclusion; the minimal sets are the singleton sets corresponding to the symbols of the original alphabet $\Sigma$.

In a further generalization step, we consider any alphabet $\Gamma$ with a partial order relation $\le$. Words over $\Gamma$ will be called fuzzy words. Let $\Sigma$ denote the set of all minimal elements of $\Gamma$. The fuzzy words consisting entirely of symbols from $\Sigma$ will be called complete. Let $x, y$ be two words from $\Gamma^*$ with symbol decompositions $x = a_0 a_1 \cdots a_{m-1}$ and $y = b_0 b_1 \cdots b_{n-1}$. We will write $x \le y$ (or $y \ge x$) and say that $x$ is contained in $y$ if $m = n$ and, for $i \in [m]$, $a_i \le b_i$. The words $x, y$ are fully compatible (denoted as $x \Uparrow y$) if $m = n$ and, for each $0 \le i < m$, either $a_i \le b_i$ or $b_i \le a_i$. The words $x, y$ are compatible (denoted as $x \uparrow y$) if there is a word $z \in \Gamma^*$ such that $z \le x$ and $z \le y$. Two fully compatible words $x, y$ are compatible, since the word $z = c_0 c_1 \cdots c_{m-1}$ where $c_i = \min(a_i, b_i)$, $i \in [m]$, satisfies $z \le x$ and $z \le y$. A fuzzy word can be viewed as representing the set of all complete words compatible with it.

Within the rest of this section, "word" will mean "fuzzy word", while we will still sometimes use the full term "fuzzy word". We will consider occurrences of fuzzy subwords in a fuzzy word. We will base our considerations on the relation $\le$ only; one can proceed in a similar way using the relations $\uparrow$ and $\Uparrow$.

While we will still apply the notation $\delta$ to the words from $\Gamma^*$ in the original sense, we will use a modified version of our Kronecker symbol as well. For two words $x, y \in \Gamma^*$ we denote $\tilde{\delta}_{x,y} = (\text{if } x \le y \text{ then } 1 \text{ else } 0)$. Observe that, generally, it is not true that $\tilde{\delta}_{x,y} = \tilde{\delta}_{y,x}$. For complete words $x, y$, $\tilde{\delta}_{x,y} = \delta_{x,y}$.

Let $w, u \in \Gamma^*$ with symbol decompositions $w = a_0 a_1 \cdots a_{n-1}$, $u = b_0 b_1 \cdots b_{m-1}$, $m, n \ge 0$, and let $\iota$ be a subword position in $w$. The set $\iota$ is called a fuzzy (subword) occurrence of $u$ in $w$ if $u \le \sigma_w(\iota)$. In this case we equivalently say that $u$ fuzzy-occurs at $\iota$ in $w$. We introduce the notation $|w|^{\sim}_u$ for the number of fuzzy subword occurrences of the word $u$ in the word $w$.

For example, assume $\Gamma = 2^{\{a,b,c\}} - \{\emptyset\}$. Let us, for the rest of this section, use the convention that when using sets from $\Gamma$ as symbols, we replace the usual curly braces by angle brackets, and singleton sets are written without brackets. The word $\langle a,b\rangle a \langle a,b,c\rangle b a \langle b,c\rangle$ contains four fuzzy occurrences of the subword $a \langle a,c\rangle b$: $\{0, 2, 3\}$, $\{0, 2, 5\}$, $\{1, 2, 3\}$ and $\{1, 2, 5\}$. The same word contains six fuzzy occurrences of the word $bac$: $\{0, 1, 2\}$, $\{0, 1, 5\}$, $\{0, 2, 5\}$, $\{0, 4, 5\}$, $\{2, 4, 5\}$ and $\{3, 4, 5\}$. Thus $|\langle a,b\rangle a \langle a,b,c\rangle b a \langle b,c\rangle|^{\sim}_{a\langle a,c\rangle b} = 4$ and $|\langle a,b\rangle a \langle a,b,c\rangle b a \langle b,c\rangle|^{\sim}_{bac} = 6$.

For words $w, u \in \Sigma^*$, $|w|^{\sim}_u = |w|_u$. (We still apply the notation $|w|_u$ to the words from $\Gamma^*$ in the original sense.) It is not difficult to observe that the equality (1) can be extended to fuzzy words. Let $w, u \in \Gamma^*$, $a, b \in \Gamma$. Then

$$|wa|^{\sim}_{ub} = |w|^{\sim}_{ub} + \tilde{\delta}_{b,a}\, |w|^{\sim}_u. \tag{11}$$
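The fuzzy recurrence (11) can be evaluated in the same way as (1), with the equality test replaced by the order test. The sketch below is a minimal illustration for the subset alphabet $\Gamma = 2^\Sigma - \{\emptyset\}$ only; representing symbols as Python `frozenset` objects and the names `sym` and `fuzzy_count` are assumptions.

```python
def sym(letters: str) -> frozenset:
    """A symbol of the subset alphabet, e.g. sym('ac') stands for <a,c>."""
    return frozenset(letters)


def fuzzy_count(w, u) -> int:
    """Number of fuzzy occurrences of u in w, computed with recurrence (11)."""
    counts = [1] + [0] * len(u)
    for a in w:
        for j in range(len(u), 0, -1):
            # frozenset '<=' is the subset test: the pattern symbol is contained in a
            if u[j - 1] <= a:
                counts[j] += counts[j - 1]
    return counts[len(u)]


if __name__ == "__main__":
    w = [sym("ab"), sym("a"), sym("abc"), sym("b"), sym("a"), sym("bc")]
    u = [sym("a"), sym("ac"), sym("b")]
    print(fuzzy_count(w, u))
```

For a general partially ordered alphabet the subset test would have to be replaced by the corresponding order relation; on complete words (all symbols singletons) the function reduces to the ordinary subword count.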
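Lemma 13. Let $w \in \Gamma^*$. Then for each $u \in \Gamma^*$,

$$|w|^{\sim}_u = \sum_{x \in \Gamma^*} |w|_x\, \tilde{\delta}_{u,x}.$$

Proof.

$$|w|^{\sim}_u = \left| \{ \iota \subseteq [|w|] \mid u \le \sigma_w(\iota) \} \right| = \sum_{x \in \Gamma^*} |w|_x\, \tilde{\delta}_{u,x}. \qquad \Box$$

We will now modify the definition of the general Parikh matrix mapping to deal with fuzzy subword occurrences in fuzzy words. Let us assume a (finite) language $L \subseteq \Gamma^*$. We define the morphism $\tilde{\Psi}_L : \Gamma^* \to \mathcal{M}_L$ as follows. For $u, v \in P(L)$, $a \in \Gamma$,

$$[\tilde{\Psi}_L(a)]_{u,v} = \delta_{u,v} + \sum_{b \in \Gamma} \tilde{\delta}_{b,a}\, \delta_{ub,v}.$$

Since, for complete words, $\tilde{\delta}_{u,v} = \delta_{u,v}$, $\tilde{\Psi}_L$ is an extension of the morphism $\Psi_L : \Sigma^* \to \mathcal{M}_L$ defined in Section 3. Again a general formula exists expressing the entries of $\tilde{\Psi}_L(w)$ as counts of fuzzy subword occurrences of words from $L$ in $w$.

Theorem 14. Let $L \subseteq \Gamma^*$ be a language. For all $u, v \in P(L)$ and $w \in \Gamma^*$,

$$[\tilde{\Psi}_L(w)]_{u,v} = \sum_{x \in F(L)} |w|^{\sim}_x\, \delta_{ux,v} = \sum_{x \in F(L)} \delta_{ux,v} \sum_{y \in \Gamma^*} |w|_y\, \tilde{\delta}_{x,y}. \tag{12}$$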
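Proof. It is enough to prove the first equality; the second equality then follows from Lemma 13. We use induction on $|w|$. The assertion is true for $w = \lambda$. Let it be true for a word $w \in \Gamma^*$ and let $a \in \Gamma$. Then, by the inductive hypothesis,

$$\begin{aligned}
[\tilde{\Psi}_L(wa)]_{u,v} &= [\tilde{\Psi}_L(w)\, \tilde{\Psi}_L(a)]_{u,v} = \sum_{z \in P(L)} [\tilde{\Psi}_L(w)]_{u,z}\, [\tilde{\Psi}_L(a)]_{z,v} \\
&= \sum_{z \in P(L)} \sum_{x \in F(L)} |w|^{\sim}_x\, \delta_{ux,z} \left( \delta_{z,v} + \sum_{b \in \Gamma} \tilde{\delta}_{b,a}\, \delta_{zb,v} \right) \\
&= \sum_{x \in F(L)} |w|^{\sim}_x \left( \delta_{ux,v} + \sum_{b \in \Gamma} \tilde{\delta}_{b,a}\, \delta_{uxb,v} \right). && (13)
\end{aligned}$$

Let us consider the expression (13) in the following three cases.

1. If $u$ is not a prefix of $v$, then

$$\sum_{x \in F(L)} |w|^{\sim}_x \left( \delta_{ux,v} + \sum_{b \in \Gamma} \tilde{\delta}_{b,a}\, \delta_{uxb,v} \right) = 0 = \sum_{x \in F(L)} |wa|^{\sim}_x\, \delta_{ux,v}.$$

2. If $u = v$, then

$$\sum_{x \in F(L)} |w|^{\sim}_x \left( \delta_{ux,v} + \sum_{b \in \Gamma} \tilde{\delta}_{b,a}\, \delta_{uxb,v} \right) = |w|^{\sim}_\lambda = |wa|^{\sim}_\lambda = \sum_{x \in F(L)} |wa|^{\sim}_x\, \delta_{ux,v}.$$

3. If $v = uyc$ for some $y \in \Gamma^*$, $c \in \Gamma$, then

$$\sum_{x \in F(L)} |w|^{\sim}_x \left( \delta_{ux,v} + \sum_{b \in \Gamma} \tilde{\delta}_{b,a}\, \delta_{uxb,v} \right) = |w|^{\sim}_{yc} + |w|^{\sim}_y\, \tilde{\delta}_{c,a}$$

and (11) implies

$$\sum_{x \in F(L)} |w|^{\sim}_x \left( \delta_{ux,v} + \sum_{b \in \Gamma} \tilde{\delta}_{b,a}\, \delta_{uxb,v} \right) = |wa|^{\sim}_{yc} = \sum_{x \in F(L)} |wa|^{\sim}_x\, \delta_{ux,v}. \qquad \Box$$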
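Corollary 15. Let $L \subset \Gamma^*$, $w \in \Gamma^*$ and let $x$ be a factor of some word in $L$. Then $ux \in P(L)$ for some $u \in \Gamma^*$ and

$$|w|^{\sim}_x = [\tilde{\Psi}_L(w)]_{u,ux}.$$

Example 16. Let $\Gamma = 2^{\{a,b,c\}} - \{\emptyset\}$ and $L = \{a, b, a\langle a,c\rangle a\}$. In this case $P(L) = \{\lambda, a, b, a\langle a,c\rangle, a\langle a,c\rangle a\}$. Choose $w = a \langle a,b,c\rangle \langle a,c\rangle \langle a,b,c\rangle$. Then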
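(rows and columns are labeled, in this order, by $\lambda, a, b, a\langle a,c\rangle, a\langle a,c\rangle a$)

$$\tilde{\Psi}_L(a) = \begin{pmatrix}
1 & 1 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 1 & 1\\
0 & 0 & 0 & 0 & 1
\end{pmatrix},
\quad
\tilde{\Psi}_L(\langle a,c\rangle) = \begin{pmatrix}
1 & 1 & 0 & 0 & 0\\
0 & 1 & 0 & 1 & 0\\
0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 1 & 1\\
0 & 0 & 0 & 0 & 1
\end{pmatrix},
\quad
\tilde{\Psi}_L(\langle a,b,c\rangle) = \begin{pmatrix}
1 & 1 & 1 & 0 & 0\\
0 & 1 & 0 & 1 & 0\\
0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 1 & 1\\
0 & 0 & 0 & 0 & 1
\end{pmatrix},$$

$$\tilde{\Psi}_L(a \langle a,b,c\rangle \langle a,c\rangle \langle a,b,c\rangle) = \begin{pmatrix}
1 & 4 & 2 & 6 & 4\\
0 & 1 & 0 & 3 & 3\\
0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 1 & 4\\
0 & 0 & 0 & 0 & 1
\end{pmatrix}.$$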
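The matrices of this example can be recomputed from the definition of the fuzzy generator matrices. The following sketch is illustrative only: symbols of the subset alphabet are Python `frozenset` objects, words are tuples of such symbols, the prefix-closed language is passed explicitly as an ordered list, and the helper names are hypothetical.

```python
def fs(letters: str) -> frozenset:
    return frozenset(letters)


def fuzzy_generator(P, a):
    """Fuzzy Parikh L-matrix of the single symbol a (a frozenset)."""
    idx = {u: i for i, u in enumerate(P)}
    n = len(P)
    M = [[int(i == j) for j in range(n)] for i in range(n)]
    for v in P:
        # Entry (u, ub) is 1 when the pattern symbol b is contained in a.
        if v and v[-1] <= a:
            M[idx[v[:-1]]][idx[v]] = 1
    return M


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def fuzzy_parikh_matrix(P, w):
    n = len(P)
    M = [[int(i == j) for j in range(n)] for i in range(n)]
    for a in w:
        M = matmul(M, fuzzy_generator(P, a))
    return M


if __name__ == "__main__":
    A, B, AC, ABC = fs("a"), fs("b"), fs("ac"), fs("abc")
    P = [(), (A,), (B,), (A, AC), (A, AC, A)]   # prefix closure, in the order used above
    for row in fuzzy_parikh_matrix(P, (A, ABC, AC, ABC)):
        print(row)
```

Here the subset test `<=` plays the role of the order relation on $\Gamma$; for a different partially ordered alphabet only that comparison would need to change.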
Thus, e.g., $[\tilde{\Psi}_L(w)]_{\lambda,\, a\langle a,c\rangle a} = 4$, since there are 4 fuzzy occurrences of $a\langle a,c\rangle a$ in $w$: $\{0, 1, 2\}$, $\{0, 1, 3\}$, $\{0, 2, 3\}$, and $\{1, 2, 3\}$. Further, $[\tilde{\Psi}_L(w)]_{\lambda,a} = [\tilde{\Psi}_L(w)]_{a\langle a,c\rangle,\, a\langle a,c\rangle a} = 4$, since at each position in $w$ there is a fuzzy occurrence of $a$.

7. Conclusion

We presented here two q-matrix morphisms dealing with subwords from an arbitrary finite language. The first one shows that the q-matrix morphism from [6] can be extended to match the structure of the Parikh matrix L-morphism. The other is, to our knowledge, the first injective q-matrix morphism. This was achieved at the price of increasing the size of the Parikh L-matrix (while entries of positive degree appear in the first row only). One may ask whether
an injective q-matrix morphism can be constructed mapping words to matrices of the same size as the Parikh L-matrices. We showed, as well, that the Parikh matrix L-morphism can be adjusted to deal with uncertainty in words and to count fuzzy subword occurrences of fuzzy words specified by any finite language.

Acknowledgments

The authors would like to express their gratitude to Arto Salomaa and to an unknown reviewer for their valuable comments and recommendations.

References

[1] A. Mateescu, A. Salomaa, K. Salomaa, S. Yu, A sharpening of the Parikh mapping, RAIRO Theor. Inform. Appl. 35 (6) (2001) 551–564.
[2] T.-F. Şerbănuţă, Extending Parikh matrices, Theoret. Comput. Sci. 310 (1–3) (2004) 233–246.
[3] H.M.K. Alazemi, A. Černý, Counting subwords using a trie automaton, Internat. J. Found. Comput. Sci. 22 (6) (2011) 1457–1469.
[4] V.N. Şerbănuţă, On Parikh matrices and ambiguity, PhD thesis, University of Bucharest, April 2010.
[5] Ö. Eğecioğlu, A q-matrix encoding extending the Parikh matrix mapping, Tech. Rep. 14, Department of Computer Science at UC Santa Barbara, 2004.
[6] Ö. Eğecioğlu, O.H. Ibarra, A matrix q-analogue of the Parikh mapping, in: J.-J. Lévy, E.W. Mayr, J.C. Mitchell (Eds.), IFIP TCS, Kluwer, 2004, pp. 125–138.
[7] F. Blanchet-Sadri, Algorithmic Combinatorics on Partial Words, Discrete Math. Appl., Chapman & Hall/CRC, 2007.
[8] M. Lothaire, Combinatorics on Words, Cambridge University Press, 1997.
[9] R.J. Parikh, On context-free languages, J. ACM 13 (4) (1966) 570–581, http://doi.acm.org/10.1145/321356.321364.
[10] A. Černý, Fuzzy words, in: Italian Conference on Theoretical Computer Science, 2010, pp. 1–4, http://www.cs.unicam.it/ictcs2010/abstract/paper8.pdf.